Applied Sciences
  • Article
  • Open Access

24 October 2025

Tongan Speech Recognition Based on Layer-Wise Fine-Tuning Transfer Learning and Lexicon Parameter Enhancement

1 Beijing Research Institute of Automation for Machinery Industry Co., Ltd., No.1 Jiaochangkou Deshengmenwai, Xicheng District, Beijing 100120, China
2 School of Automation and Intelligence, Beijing Jiaotong University, No.3 Shangyuancun, Haidian District, Beijing 100044, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Techniques and Applications of Natural Language Processing

Abstract

Speech recognition, as a key driver of artificial intelligence and global communication, has advanced rapidly in major languages, while studies on low-resource languages remain limited. Tongan, a representative Polynesian language, carries significant cultural value. However, Tongan speech recognition faces three main challenges: data scarcity, limited adaptability of transfer learning, and weak dictionary modeling. This study proposes improvements in adaptive transfer learning and NBPE-based dictionary modeling to address these issues. An adaptive transfer learning strategy with layer-wise unfreezing and dynamic learning rate adjustment is introduced, enabling effective adaptation of pretrained models to the target language while improving accuracy and efficiency. In addition, the MEA-AGA is developed by combining the Mind Evolutionary Algorithm (MEA) with the Adaptive Genetic Algorithm (AGA) to optimize the number of byte-pair encoding units (NBPE), thereby enhancing recognition accuracy and speed. The collected Tongan speech data were expanded and preprocessed, after which the experiments were conducted on an NVIDIA RTX 4070 GPU (16 GB) using CUDA 11.8 under the Ubuntu 18.04 operating system. Experimental results show that the proposed method achieved a word error rate (WER) of 26.18% and a words-per-second (WPS) rate of 68, demonstrating clear advantages over baseline methods and confirming its effectiveness for low-resource language applications. Although the proposed approach demonstrates promising performance, this study is still limited by the relatively small corpus size and the early stage of research exploration. Future work will focus on expanding the dataset, refining adaptive transfer strategies, and enhancing cross-lingual generalization to further improve the robustness and scalability of the model.

1. Introduction

Speech recognition, as a core technology of artificial intelligence, is widely applied in security, education, healthcare, and defense, enhancing work efficiency and facilitating global communication. Within the global linguistic landscape, Tongan is a representative low-resource language. Although spoken by a relatively small population, it embodies rich cultural traditions and historical heritage, giving it significant value for cultural preservation and transmission. As vital channels for intercultural communication, low-resource languages have gained growing importance in global research and cultural exchange. Tongan, which integrates indigenous traditions with Western influences, represents a unique linguistic system that deserves greater attention in computational linguistics. Consequently, advancing research on Tongan speech recognition is not only academically significant but also contributes to the digital preservation and international dissemination of its linguistic resources.
In recent years, deep learning has greatly advanced speech recognition for resource-rich languages such as English and Chinese, thereby facilitating international linguistic communication [1]. In contrast, low-resource languages like Tongan still encounter major challenges, including data scarcity, complex dialectal variation, and significant cultural-contextual differences. Globally, more than 7000 languages exist, the majority of which are low-resource [2], and they generally lack sufficient annotated corpora as well as lexical and grammatical resources. For example, while high-resource languages such as English or Mandarin typically have access to over 10,000 h of labeled speech data, many low-resource languages possess fewer than 100 h of transcribed material, which severely limits model training and generalization. This deficiency severely constrains the effectiveness of conventional speech recognition systems. Furthermore, low-resource languages often encode unique cultural knowledge, while their dialectal diversity and expressive variability increase the difficulty of model training. Traditional methods based on Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), or Convolutional Neural Networks (CNNs) provide limited modeling capacity for such languages and struggle to capture complex acoustic features and fine-grained semantic patterns. In addition, extreme data sparsity makes it difficult to train deep models directly, as this frequently leads to overfitting and hinders further improvements in recognition performance.
Against this backdrop, transfer learning and end-to-end modeling emerge as promising strategies for low-resource language speech recognition. Transfer learning leverages pretrained models on high-resource languages and applies adaptive fine-tuning to capture phonetic features of the target language, thereby improving modeling efficiency and recognition performance [3]. This approach not only mitigates the shortage of annotated data but also enhances adaptability to linguistic variation. End-to-end systems, in turn, adopt unified neural architectures to directly map speech signals to text, simplifying the traditional pipeline of feature extraction, acoustic modeling, and language modeling [4]. Such frameworks achieve higher recognition accuracy and exhibit strong robustness to dialectal and pronunciation variations common in low-resource languages. Recent advances in neural architectures, such as Transformer and DeepSpeech, have further expanded opportunities in this field [5]. Despite these advances, models are prone to overfitting during transfer due to the scarcity of labeled data, while existing strategies often rely on fixed, coarse-grained parameter adjustments that lack fine-grained and dynamic adaptation across layers and linguistic features. Moreover, within end-to-end frameworks, the design of sub-word units and construction of efficient, expressive vocabularies remain critical obstacles—particularly for languages with complex structures and heterogeneous pronunciations. Therefore, developing targeted optimizations in transfer mechanisms and dictionary modeling is essential to further enhance the accuracy and robustness of speech recognition systems for Tongan and other low-resource languages.
To address these challenges, this study introduces three key innovations, focusing on data augmentation and sampling, cross-lingual transfer strategy, and dictionary parameter optimization for low-resource languages. First, to mitigate the issues of limited corpus size and imbalanced class distribution, a Supervised Random Augmentation with Deep Random Forest Filtering (SRA-DRF) algorithm is developed. This method integrates generative adversarial networks (GAN) with signal processing techniques to generate synthetic speech data and applies similarity-based filtering to retain high-quality samples. In addition, a weighted K-means stratified sampling strategy ensures balanced class representation, thereby enhancing feature learning and model generalization. Second, in the transfer learning stage, an adaptive layer-wise unfreezing strategy is proposed. This approach dynamically adjusts learning rates and performs top-down fine-tuning, preserving low-level general features while improving adaptability to target language characteristics. The effectiveness of this strategy is further validated through Centered Kernel Alignment (CKA) visualization. Finally, for dictionary modeling, a hybrid MEA-AGA—combining the Mind Evolutionary Algorithm with the Adaptive Genetic Algorithm—is employed to automatically search for the optimal number of byte-pair encoding (NBPE) units, thereby improving recognition accuracy and decoding efficiency.
This paper is organized into seven sections as follows: Section 2 presents related work, reviewing the current research status of low-resource speech recognition, with a focus on challenges in data scarcity, transfer learning optimization, and dictionary parameter tuning, and briefly outlines the proposed strategies in this study. Section 3 details the proposed methods, including the design of the layer-wise adaptive transfer learning strategy and the MEA-AGA for dictionary parameter optimization. Section 4 describes the experimental setup, covering dataset sources, model configurations, and evaluation metrics. Section 5 reports the experimental results, including data augmentation, transfer learning, and dictionary optimization experiments, along with comparative analysis against mainstream methods to validate the effectiveness of the proposed approaches. Section 6 summarizes the overall contributions of this work in light of the experimental findings. Section 7 outlines future directions, addressing the current limitations and proposing potential areas for further exploration.

3. Methodology

To provide a clear overview of the proposed framework, Figure 1 presents the overall model pipeline designed for low-resource speech recognition. The process begins with raw audio waveforms, which are subjected to data augmentation and preprocessing to expand the effective training corpus. The augmented data are then utilized within an adaptive transfer learning framework, where the pretrained Mixformer model is progressively fine-tuned to capture language-specific phonetic representations. Subsequently, the MEA-AGA is applied to optimize the NBPE parameters for dictionary modeling, enabling efficient subword decoding and accurate text generation.
Figure 1. The model pipeline of the proposed low-resource speech recognition system.

3.1. Data Augmentation and Partitioning

In low-resource speech recognition tasks, data scarcity and class imbalance remain major factors that hinder model performance [40]. To improve training effectiveness and recognition accuracy, this section focuses on preprocessing and augmenting the original Tongan speech corpus, with the goal of constructing a balanced, diverse, and high-quality training dataset.
Data augmentation techniques are generally divided into signal processing-based and deep learning-based approaches. The former, such as speed perturbation, pitch shifting, and noise addition [41], are easy to implement but often introduce pronunciation distortions and fail to preserve language-specific features in low-resource settings. The latter, represented by Generative Adversarial Networks [42,43,44], generate higher-quality data but demand large datasets and significant computational resources, limiting their use in low-resource languages. To overcome these limitations, this study proposes an SRA-DRF algorithm that integrates signal processing techniques with deep learning methods. This hybrid strategy not only increases data diversity but also reduces reliance on large datasets and computational resources, thereby enabling effective augmentation of Tongan speech data. The overall workflow is illustrated in Figure 2.
Figure 2. Framework of the SRA-DRF (Supervised Random Augmentation with Deep Random Forest Filtering) algorithm.
Let C denote the original Tongan audio feature matrix. Random segments are removed from C, and the erased time spans are stored in a separate matrix referred to as the audio segment pool. The residual features, after the erasure process, are preserved in a new matrix denoted as C_R.

Y = C \odot M_{erase} = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1N} \\ y_{21} & y_{22} & \cdots & y_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ y_{M1} & y_{M2} & \cdots & y_{MN} \end{bmatrix}, \quad Y \in \mathbb{R}^{M \times N} \quad (1)

y_{ij} = \begin{cases} x_{ij}, & m_{ij} = 1 \\ 0, & m_{ij} = 0 \end{cases} \quad (2)

C_R = C \odot (1 - M_{erase}), \quad C_R \in \mathbb{R}^{M \times N} \quad (3)

M_erase is a masking matrix composed of 0 and 1 with the same dimensionality as Y. A value of 0 indicates that the corresponding data point is masked, while 1 means it is retained. y_ij denotes the amplitude of the j-th frame of the i-th utterance after the original data has been masked.
In the audio segment pool, signal processing and deep learning methods are jointly employed. On the one hand, the extracted audio segments are augmented through speed perturbation, pitch shifting, and noise addition. On the other hand, a GAN-based network is applied to generate additional realistic audio segments. The corresponding formulation is defined as follows:
Y_{PRO} = \begin{cases} \mathcal{F}\big(\mathcal{F}^{-1}(Y) \cdot a\big), & \text{if speed perturbation} \\ Y \cdot e^{2\pi i b f}, & \text{if pitch shifting} \\ Y + \gamma N, & \text{if noise addition} \end{cases} \quad (4)

Y_{GAN} = G(Z; \theta; Y), \quad Y_{GAN} \in \mathbb{R}^{M \times N} \quad (5)

where a is the speed perturbation factor applied to modify speaking rate, and b is the pitch shift factor for adjusting vocal pitch. N denotes the noise matrix that follows a Gaussian distribution, and γ is the noise strength coefficient. G is the generator of the adversarial network; Z represents the noise input vector, which is randomly sampled and fed into the generator; θ denotes the generator's parameters, which control the properties of the generated audio. Y_PRO is the audio segment produced by applying signal-based operations to C_R, while Y_GAN represents the audio segment generated by the GAN. These two types of augmented data are then combined into a unified audio segment pool Y_POOL:

Y_{POOL} = Y_{PRO} + Y_{GAN}, \quad Y_{POOL} \in \mathbb{R}^{M \times N} \quad (6)
Next, several candidate segments from the audio segment pool Y_POOL, which stores the erased portions of the original audio, are aligned with the residual speech features C_R. An attention-based encoder–decoder (AED) module is then applied to integrate these segments, producing a supplementary data matrix C′:

C' = C_R + \mathrm{softmax}(C_R, Y_{POOL}) = C_R + \mathrm{softmax}\!\left(\frac{C_R \, Y_{POOL}^{T}}{\sqrt{N}}\right) \quad (7)
Finally, the original Tongan corpus is utilized to train a DRF model. The augmented dataset is subsequently input to the model for evaluation, with the metric defined in Equation (8) adopted as the evaluation criterion.
\mathrm{score} = \mathrm{sim}(C, C') = \frac{C \cdot C'}{\lVert C \rVert \, \lVert C' \rVert} \quad (8)
Audio segments with higher scores are classified as Yes and retained as valid augmentations, while the rest are discarded.
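To make the erase-and-refill procedure concrete, the following is a minimal NumPy sketch of Equations (1)–(3) and the similarity filter of Equation (8). It is an illustration rather than the authors' implementation: the mask ratio, the noise-only augmentation of the pooled segments, and the acceptance threshold are assumed values, and the mask convention follows Equations (1) and (3) literally (entries equal to 1 mark frames moved to the segment pool).

```python
import numpy as np

def erase_segments(C, erase_ratio=0.2, rng=None):
    """Split C into pooled segments Y (Eq. 1) and residual features C_R (Eq. 3)."""
    rng = rng or np.random.default_rng(0)
    M_erase = (rng.random(C.shape) < erase_ratio).astype(C.dtype)  # 1 = frame sent to the pool
    Y = C * M_erase                      # erased segments, kept in the audio segment pool
    C_R = C * (1.0 - M_erase)            # residual features after erasure
    return Y, C_R, M_erase

def similarity_score(C, C_aug):
    """Cosine similarity between original and augmented feature matrices (Eq. 8)."""
    num = float(np.sum(C * C_aug))
    den = np.linalg.norm(C) * np.linalg.norm(C_aug) + 1e-12
    return num / den

rng = np.random.default_rng(0)
C = rng.standard_normal((4, 100))                    # toy features: 4 utterances x 100 frames
Y_pool, C_R, _ = erase_segments(C, rng=rng)

Y_aug = Y_pool + 0.05 * rng.standard_normal(Y_pool.shape)  # noise addition stands in for Eq. (4)
C_aug = C_R + Y_aug                                  # refill the residual features

score = similarity_score(C, C_aug)
print(f"similarity score: {score:.4f}")  # retained as a valid augmentation if above a threshold (e.g. 0.9)
```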
After augmenting the speech data, the dataset must be partitioned in an appropriate manner to enhance the model’s generalization capability. A commonly adopted approach is K-Means clustering, which facilitates balanced class distribution [45,46,47]. Its mathematical formulation is defined as follows:
SSE = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2 \quad (9)

Here, C_i denotes the i-th cluster, x_j represents a sample in the cluster, and μ_i is the cluster centroid.
This method performs well when clustering data with balanced class distributions. However, for tasks involving low-resource languages such as Tongan, the small dataset size and class imbalance may result in biased cluster centers. Direct application of traditional clustering often leads to convergence issues, misalignment between training and test sets, and reduced model stability and generalization capability [48].
To overcome these limitations, we modify the objective function by introducing a class-aware weighting term, which balances the influence of minority-class samples during clustering. The revised formulation is defined as:
SSE = \sum_{i=1}^{k} \sum_{x_j \in C_i} \omega_j \lVert x_j - \mu_i \rVert^2 \quad (10)

The weight ω_j adjusts the influence of each sample in the cluster based on its class distribution and is defined as:

\omega_j = \frac{1}{\mathrm{frequency}(c(x_j))} \quad (11)

where frequency(c(x_j)) denotes the frequency, within the dataset, of the class label c(x_j) to which sample x_j belongs. The detailed definition is:

\mathrm{frequency}(c(x_j)) = \frac{n_{c(x_j)}}{N} \quad (12)

Here, N is the total number of samples in the dataset, and n_{c(x_j)} is the number of samples belonging to class c(x_j), which is calculated as:

n_{c(x_j)} = \sum_{x_i \in D} I\big(c(x_i) = c(x_j)\big) \quad (13)

I(·) is the indicator function, which equals 1 if the condition is satisfied and 0 otherwise. D denotes the full dataset. Finally, by substituting the above expressions, the improved K-Means clustering objective function is given as:

SSE = \sum_{i=1}^{k} \sum_{x_j \in C_i} \frac{N}{\sum_{x_i \in D} I\big(c(x_i) = c(x_j)\big)} \lVert x_j - \mu_i \rVert^2 \quad (14)
To achieve more balanced data partitioning, this study integrates stratified sampling with the improved weighted K-Means clustering method. Specifically, samples are first stratified according to their distances from the cluster centroids, and proportional sampling is then applied within each cluster to ensure balanced class representation across the training, validation, and test sets. The overall procedure of this data partitioning method is illustrated in Figure 3.
Figure 3. Tongan data partitioning flowchart based on weighted stratified sampling.
The basic steps of the algorithm are as follows:
(1) Based on the features extracted from the input Tongan speech data, perform statistical analysis to initialize k cluster centers μ_i (where i = 1, 2, …, k and k is the number of clusters).
(2) Calculate the weighted distance from each sample to each cluster center and assign the sample to the nearest cluster. For a given sample x_j and cluster center C_i, the distance is computed according to Equation (14).
(3) Update each cluster center based on the current cluster members and their weights. The new cluster center μ_i is computed as:

\mu_i = \frac{\sum_{x_j \in C_i} \omega_j x_j}{\sum_{x_j \in C_i} \omega_j} \quad (15)
(4) Check whether the new cluster centers meet the convergence condition. If convergence is achieved, the algorithm terminates; otherwise, return to step (2) for further updates.
(5) After clustering, compute the proportion of each class within each cluster:

p_{ij} = \frac{n_{ij}}{n_i} \quad (16)

Here, p_{ij} is the proportion of class c_j within cluster C_i, n_{ij} is the number of samples of class c_j in cluster C_i, and n_i is the total number of samples in cluster C_i.
(6) Finally, stratified sampling is performed within each cluster based on the class distribution and the specified data split ratios. The total number of samples selected from each cluster is then used to construct the training, validation, and test sets. The formulas are defined as follows:

N_T = \sum_{i=1}^{k} \sum_{j=1}^{m} p_{ij} \times n_{ij}^{T} \quad (17)

N_V = \sum_{i=1}^{k} \sum_{j=1}^{m} p_{ij} \times n_{ij}^{V} \quad (18)

N_S = \sum_{i=1}^{k} \sum_{j=1}^{m} p_{ij} \times n_{ij}^{S} \quad (19)

Here, N_T, N_V and N_S represent the total number of training, validation, and test samples, respectively. k is the number of clusters, m is the number of classes, and n_{ij}^{T}, n_{ij}^{V} and n_{ij}^{S} represent the number of training, validation, and test samples from class c_j within cluster C_i.
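The sketch below illustrates steps (1)–(6) in simplified form: class-frequency weights (Eqs. (11)–(13)), weighted centroid updates (Eq. (15)), and proportional sampling within each cluster. The 8:1:1 split ratio and the toy data are assumptions for illustration, and the near/medium/far distance stratification described above is omitted for brevity.

```python
import numpy as np

def class_weights(labels):
    """w_j = 1 / frequency(c(x_j)) per Eqs. (11)-(13)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes, counts / len(labels)))
    return np.array([1.0 / freq[c] for c in labels])

def weighted_kmeans(X, labels, k, n_iter=50, seed=0):
    """Weighted K-Means minimising Eq. (10); centroids updated with Eq. (15)."""
    rng = np.random.default_rng(seed)
    w = class_weights(labels)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for i in range(k):
            mask = assign == i
            if mask.any():
                centers[i] = (w[mask, None] * X[mask]).sum(0) / w[mask].sum()
    return assign, centers

def stratified_split(assign, ratios=(0.8, 0.1, 0.1), seed=0):
    """Proportional sampling within each cluster (step 6); returns index lists."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(assign):
        idx = np.where(assign == c)[0]
        rng.shuffle(idx)
        n_tr, n_va = int(ratios[0] * len(idx)), int(ratios[1] * len(idx))
        train += idx[:n_tr].tolist()
        val += idx[n_tr:n_tr + n_va].tolist()
        test += idx[n_tr + n_va:].tolist()
    return train, val, test

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 16))                    # toy utterance-level features
labels = rng.choice(["A", "B", "C", "D", "E"], size=300, p=[0.4, 0.3, 0.15, 0.1, 0.05])
assign, _ = weighted_kmeans(X, labels, k=5)
train, val, test = stratified_split(assign)
print(len(train), len(val), len(test))
```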

3.2. Layer-Wise Unfreezing for Adaptive Transfer Learning

Transfer learning, which adapts knowledge from a source domain to a target task, has proven particularly effective in low-resource or highly specialized scenarios and has therefore attracted considerable academic attention [49]. However, for large-scale models such as Transformer, Wav2Vec, and GPT, significant challenges remain, including the risk of negative transfer and high computational costs. To address these issues, this study proposes an adaptive transfer learning approach based on layer-wise unfreezing. The overall network architecture and algorithmic workflow are illustrated in the figure below.
The model architecture adopted in this study incorporates a fine-grained feedback mechanism with a clear distinction between frozen and tunable layers, thereby improving the precision and efficiency of transfer learning. The unfreezing process follows a top-down strategy, where the number of trainable layers and the corresponding learning rates are initialized according to cross-lingual similarity. This design preserves pre-trained knowledge while gradually adapting the model to the target task. The overall architecture of the proposed adaptive transfer learning network is illustrated in Figure 4. As training progresses, a loss-driven adjustment strategy is applied to dynamically refine the unfreezing process, enabling more adaptive and task-specific optimization. The detailed training procedure is summarized in Algorithm 1.
Figure 4. Layer-wise unfreezing adaptive transfer learning network architecture.
The specific steps of the proposed algorithm are as follows:
(1) Model Initialization: This includes setting the maximum number of iterations n, the pretrained model M_pre, the transfer model M_trans, the number of model layers N, and the initial learning rate η_0.
(2) Similarity and Loss Evaluation: The similarity between source and target languages is calculated using cosine similarity, denoted as sim_score. Simultaneously, the overall training loss L_total is computed to guide the subsequent unfreezing strategy.
(3) Determining the Number of Unfrozen Layers: The initial number of unfrozen layers L_0 is determined based on the computed similarity score. The definition and iterative update of the unfreezing depth are formulated as follows:

L_0 = N \cdot (1 - \mathrm{sim}_{score}) \quad (20)

L_{t+1} = L_t + \mathrm{sign}\big(\Delta L_{total}(t) - \Delta L_{total}(t-1)\big) \times \Delta L \quad (21)

where ΔL_total(t) and ΔL_total(t−1) represent the change in loss values at iterations t and t−1, respectively, and the step size ΔL is set to 1.
(4) Learning Rate Adjustment: The learning rate η is dynamically adjusted based on similarity and loss:

\eta = \begin{cases} \eta_0 \cdot e^{-\alpha(t) \cdot \mathrm{sim}_{score} - \beta(t) \cdot L_{total}}, & \text{if the layer is unfrozen} \\ 0, & \text{if the layer is frozen} \end{cases} \quad (22)

where α(t) and β(t) are time-dependent weighting coefficients that control the influence of similarity and loss, respectively. At the early stage of training, α(t) is set relatively high and decreases gradually, while β(t) starts low and increases gradually as training progresses.

\alpha(t) = \alpha_0 e^{-\gamma t}, \quad \beta(t) = \beta_0 (1 - e^{-\gamma t}) \quad (23)

where γ is a positive constant (set to 0.1) that controls the rate of decay and growth. The initial values are α_0 = 0.9 and β_0 = 0.1.
(5) Unfreezing Execution: Based on the determined unfreezing depth, the model is incrementally unfrozen from the top layer downward. The learning rate is applied accordingly to fine-tune each layer.
(6) Convergence Check: The process checks whether convergence criteria are met. If not, the algorithm returns to step (2) for further updates.
(7) Final Output: Once convergence is achieved, transfer training is completed, and the final target language recognition model is obtained.
To further assess the feature changes during the transfer process, this study employs the CKA similarity matrix to quantitatively analyze key layers of the network before and after fine-tuning. CKA measures the alignment of representations in kernel space across different layers or stages and is widely used in neural network visualization and interpretability research [50]. By comparing similarity scores across layers, the analysis reveals how hierarchical features are adjusted during transfer, providing theoretical support for the effectiveness of the layer-wise unfreezing strategy.
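For reference, linear CKA between the activation matrices of two layers (or of the same layer before and after fine-tuning) can be computed as in the following minimal NumPy sketch. This is the standard linear-kernel formulation and is shown for illustration; it is not necessarily the exact implementation used in the experiments.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activations X (n x d1) and Y (n x d2) for the same n inputs."""
    X = X - X.mean(axis=0, keepdims=True)   # centre the features
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(hsic / (norm_x * norm_y + 1e-12))

rng = np.random.default_rng(0)
acts_before = rng.standard_normal((128, 64))                      # toy layer activations before fine-tuning
acts_after = acts_before + 0.1 * rng.standard_normal((128, 64))   # slightly adapted layer
print(round(linear_cka(acts_before, acts_before), 4))   # identical representations -> 1.0
print(round(linear_cka(acts_before, acts_after), 4))    # high similarity, slightly below 1.0
```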
Algorithm 1: Adaptive layer-wise unfreezing and learning rate scheduling for transfer learning
Require: M_pre, n, N, η_0, α_0, β_0, ΔL
Ensure: M_trans
1:  M_trans ⇐ M_pre; η ⇐ η_0; L ⇐ 0;
2:  for t = 1 to n do
3:    /* Stage 1: Similarity and Loss Calculation */
4:    Compute sim_score and training loss L_total(t);
5:    /* Stage 2: Layer Unfreezing Decision */
6:    if t == 1 then
7:      L ⇐ floor[N · (1 − sim_score)];
8:    else
9:      L ⇐ L + sign(L_total(t) − L_total(t−1)) × ΔL;
10:   end if
11:   /* Stage 3: Learning Rate Adjustment */
12:   if unfreeze then
13:     η ⇐ η_0 · exp(−α(t) · sim_score − β(t) · L_total(t));
14:   end if
15:   α(t) ⇐ α_0 · exp(−γ · t); β(t) ⇐ β_0 · (1 − exp(−γ · t));
16:   /* Stage 4: Layer-Wise Training */
17:   Freeze the bottom N − L layers; unfreeze the top L layers;
18:   Train M_trans with η;
19:   /* Stage 5: Termination Check */
20:   if stopping criterion met then
21:     break;
22:   end if
23: end for
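The schedule of Algorithm 1 can be sketched in PyTorch-style code as below. It assumes the pretrained encoder exposes its blocks as an ordered module list and that the similarity score and loss values are supplied externally; the stand-in model, the example loss trajectory, and the optimiser choice are illustrative only.

```python
import math
import torch
import torch.nn as nn

def unfreeze_depth(N, sim_score, prev_L=None, d_loss_t=0.0, d_loss_prev=0.0, step=1):
    """Eq. (20) for the first iteration, Eq. (21) afterwards."""
    if prev_L is None:
        return max(1, math.floor(N * (1.0 - sim_score)))
    sign = 1 if (d_loss_t - d_loss_prev) > 0 else -1
    return min(N, max(1, prev_L + sign * step))

def layer_lr(eta0, sim_score, loss_total, t, alpha0=0.9, beta0=0.1, gamma=0.1):
    """Eqs. (22)-(23): learning rate applied to the currently unfrozen layers."""
    alpha = alpha0 * math.exp(-gamma * t)
    beta = beta0 * (1.0 - math.exp(-gamma * t))
    return eta0 * math.exp(-alpha * sim_score - beta * loss_total)

def apply_unfreezing(blocks, L):
    """Keep the lower blocks frozen and unfreeze the top L blocks."""
    for i, block in enumerate(blocks):
        trainable = i >= len(blocks) - L
        for p in block.parameters():
            p.requires_grad = trainable

blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(12)])   # stand-in for encoder blocks
sim_score, eta0, L = 0.6, 1e-3, None
loss_trace = [(5.0, 0.0, 0.0), (4.2, 0.0, -0.8), (3.9, -0.8, -0.3)]  # (loss, dLoss(t-1), dLoss(t))
for t, (loss_total, d_prev, d_now) in enumerate(loss_trace, start=1):
    L = unfreeze_depth(len(blocks), sim_score, L, d_now, d_prev)
    eta = layer_lr(eta0, sim_score, loss_total, t)
    apply_unfreezing(blocks, L)
    trainable_params = [p for p in blocks.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable_params, lr=eta)   # fine-tuning step would go here
    print(f"t={t}  unfrozen blocks={L}  lr={eta:.2e}")
```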

3.3. Dictionary Parameter Optimization Driven by MEA-AGA

In automatic speech recognition, dictionary construction is a critical component, typically achieved through sub-word unit segmentation to standardize output. Among these, NBPE serves as a core parameter that directly impacts vocabulary coverage, sequence length, and model complexity. A small NBPE can increase coverage and reduce annotation cost, but it leads to longer sequences and greater training difficulty. Conversely, a large NBPE may simplify training but could weaken the model’s linguistic expressiveness [36]. In low-resource scenarios such as Tongan, optimizing NBPE is crucial to improving recognition performance. To this end, this study proposes the MEA-AGA method to determine the optimal dictionary configuration for Tongan.
The genetic algorithm (GA) is an optimization method inspired by natural evolution, which searches for optimal solutions through selection, crossover, and mutation operations. However, traditional GA uses fixed parameter settings, making it prone to local optima and lacking adaptability. The adaptive genetic algorithm (AGA) addresses this by introducing dynamic parameter adjustment mechanisms, allowing the crossover and mutation rates to evolve throughout the search process, thereby enhancing global exploration and algorithm robustness. The functions are defined as follows:
P_c = \begin{cases} k_1 \dfrac{f_{max} - f'}{f_{max} - f_{avg}}, & f' \ge f_{avg} \\ k_2, & f' < f_{avg} \end{cases} \quad (24)

P_m = \begin{cases} k_3 \dfrac{f_{max} - f}{f_{max} - f_{avg}}, & f \ge f_{avg} \\ k_4, & f < f_{avg} \end{cases} \quad (25)

In the equations, k_1, k_2, k_3 and k_4 are constants between 0 and 1. f_max and f_avg denote the maximum and average fitness of the current population, respectively; f′ represents the fitness of the better individual among the two selected for crossover; f corresponds to the fitness of the individual undergoing mutation.
However, Equations (24) and (25) consider only individual fitness and neglect the overall evolution of the population, making it difficult to capture population-wide trends and potentially leading to local optima. To address this limitation, this study introduces improvements to both the crossover and mutation probability functions:
p_c = \begin{cases} p_{c\max} \, e^{-\frac{f' - f_{avg}}{f_{max} - f_{avg}}}, & f' \ge f_{avg} \\ p_{c\max}, & f' < f_{avg} \end{cases} \quad (26)

p_m = \begin{cases} p_{m\max} \, e^{-\frac{f - f_{avg}}{f_{max} - f_{avg}}}, & f \ge f_{avg} \\ p_{m\max}, & f < f_{avg} \end{cases} \quad (27)

p_{c max} represents the maximum crossover probability, and p_{m max} is the maximum mutation probability. As shown in the equations, regardless of an individual's fitness value, both the crossover and mutation probabilities are prevented from dropping to zero. This not only effectively preserves high-performing individuals but also avoids premature convergence to local optima.
However, the exponential terms in the equations tend to magnify differences in fitness values. When an individual’s fitness deviates substantially from the population mean, the resulting rapid growth or decay of the exponential component may induce excessive fluctuations in crossover and mutation probabilities. Such instability can lead to over-exploration and undermine the robustness of the algorithm. To mitigate this issue, fitness variance is introduced to further refine and stabilize the equations:
p_c = \begin{cases} p_{c\max} \, e^{-\frac{f' - f_{avg}}{\mathrm{Var}(f)}}, & f' \ge f_{avg} \\ p_{c\max}, & f' < f_{avg} \end{cases} \quad (28)

p_m = \begin{cases} p_{m\max} \, e^{-\frac{f - f_{avg}}{\mathrm{Var}(f)}}, & f \ge f_{avg} \\ p_{m\max}, & f < f_{avg} \end{cases} \quad (29)

Var(f) is a statistical measure used to quantify the dispersion of fitness values within the population. A larger variance indicates greater differences in individual fitness, while a smaller one implies a more uniform population. The calculation formula is as follows:

\mathrm{Var}(f) = \frac{1}{N} \sum_{i=1}^{N} (f_i - f_{avg})^2 \quad (30)
By incorporating fitness variance, the crossover and mutation probabilities are adjusted more smoothly, thereby suppressing drastic fluctuations induced by extreme fitness values. This adjustment enhances the stability of the algorithm and strengthens its capability to explore the solution space effectively.
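A brief sketch of the variance-stabilised probabilities in Equations (28)–(30) is shown below. The maximum rates (0.9 and 0.1) are illustrative assumptions, and for compactness a single fitness value is passed in, whereas in the method p_c is evaluated on the better individual of the crossover pair and p_m on the mutating individual.

```python
import numpy as np

def adaptive_rates(f_ind, pop_fitness, p_c_max=0.9, p_m_max=0.1):
    """Crossover and mutation probabilities per Eqs. (28)-(30)."""
    f = np.asarray(pop_fitness, dtype=float)
    f_avg = f.mean()
    var_f = f.var() + 1e-12                 # Eq. (30), guarded against a zero-variance population
    if f_ind >= f_avg:
        scale = np.exp(-(f_ind - f_avg) / var_f)
        return p_c_max * scale, p_m_max * scale
    return p_c_max, p_m_max                 # below-average individuals keep the maximum rates

pop_fitness = [0.62, 0.71, 0.68, 0.80, 0.65]
for f_ind in (0.80, 0.62):
    p_c, p_m = adaptive_rates(f_ind, pop_fitness)
    print(f"fitness={f_ind:.2f}  p_c={p_c:.3f}  p_m={p_m:.3f}")
```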
Although this strategy increases parameter flexibility, it is still susceptible to local optima in complex search spaces, particularly when fitness differences among individuals are minimal. This limitation reduces adaptability and hinders global exploration. Moreover, AGA encounters challenges in balancing evolutionary speed and search quality: overly aggressive adjustment rates may lead to divergence and inefficiency, whereas overly conservative rates restrict the search range and degrade performance. To address these limitations, this study integrates the MEA framework to further enhance AGA. MEA simulates human-like rapid evolution through learning and innovation, thereby improving both adaptability and global search capability. In each iteration, a randomly initialized population undergoes fitness-based selection to generate a central individual, which partitions the population into elite and temporary subpopulations. The elite group preserves high-quality solutions, while the temporary group performs exploratory search, together enabling cooperative optimization. The procedure is illustrated in Figure 5.
Figure 5. MEA flowchart.
As illustrated in the figure, the algorithm achieves a balance between global exploration and local optimization through a two-phase “diversification–convergence” strategy. During the diversification phase, population diversity is increased to promote broad exploration of the search space and identify multiple promising candidate solutions. In the subsequent convergence phase, these candidate solutions are refined to enhance solution quality and improve convergence efficiency. The detailed procedures of the diversification and convergence phases are depicted in Figure 6 and Figure 7, respectively.
Figure 6. Differentiation operation flowchart.
Figure 7. Aggregation operation flowchart.
In the global search phase, a diversification operation is first conducted. The population is divided into elite and temporary subpopulations according to fitness values, facilitating competitive interactions that help identify multiple promising global optima. If a temporary subpopulation outperforms the current elite subpopulation in terms of fitness, it replaces the elite group and is incorporated into the set of optimal solutions. Meanwhile, the temporary subpopulation with the lowest fitness is discarded and reinitialized across the entire solution space. Since this phase emphasizes search breadth and allows for larger parameter perturbations, the mutation mechanism from (29) is adopted to expand the search range.
In the local search phase, a convergence operation is carried out. This operation involves a two-stage refinement of the elite subpopulation obtained during the diversification phase. Guided by the fitness function, it seeks to further enhance solution quality and approximate the global optimum. To prevent excessive parameter perturbation, the crossover mechanism is applied, with the crossover rate calculated using (28), thereby improving the precision of local search. When the optimal individual remains stable across successive iterations and no longer changes, the subpopulation is considered matured, marking the completion of the convergence process.
In summary, the MEA-AGA is employed to optimize the NBPE parameter in Tongan speech recognition, with the objective of balancing recognition accuracy, decoding speed, and training cost. Prior to the optimization process, a fitness function is defined to comprehensively evaluate multiple factors in the recognition process, including model accuracy, decoding efficiency, and training time. The calculation formula is as follows:
\mathrm{fitness} = \alpha \frac{1}{WER_{dev}} + \beta \frac{1}{WER_{test}} + \lambda \frac{WPS}{10 \log_{10}(WPS)} + \mu \frac{10 \log_{10}(t_{train})}{t_{train}} \quad (31)

WER_dev, WER_test, WPS, and t_train represent the word error rate on the validation set, the word error rate on the test set, the number of words recognized per second, and the model training time, respectively. The corresponding weight coefficients α, β, λ, and μ are set to 0.3, 0.3, 0.3, and 0.1, reflecting the equal importance of recognition accuracy and decoding speed, while giving relatively less emphasis to training time. In addition, to eliminate the influence of differing units of measurement, WPS and training time are normalized to ensure that all evaluation indicators fall within a comparable range.
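Equation (31) can be written as a short function; the weights follow the values given in the text (0.3, 0.3, 0.3, 0.1), while the example WER, WPS, and training-time figures are invented purely to show the call and are not taken from the experiments.

```python
import math

def nbpe_fitness(wer_dev, wer_test, wps, t_train,
                 alpha=0.3, beta=0.3, lam=0.3, mu=0.1):
    """Composite fitness of Eq. (31): accuracy terms, decoding speed, and training cost."""
    return (alpha / wer_dev
            + beta / wer_test
            + lam * wps / (10.0 * math.log10(wps))
            + mu * 10.0 * math.log10(t_train) / t_train)

# illustrative candidate: WERs as fractions, WPS in words/s, training time in minutes (assumed units)
print(round(nbpe_fitness(wer_dev=0.2618, wer_test=0.2864, wps=68, t_train=120), 3))
```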
The overall procedure of the proposed optimization algorithm is summarized in Algorithm 2, and its detailed steps are described as follows:
(1) Parameter Initialization: Set the maximum number of iterations, population size N, crossover probability P_c, and mutation probability P_m.
(2) Population Initialization: Randomly generate 10 initial individuals pop_1^n ~ pop_10^n.
(3) Fitness Evaluation: Calculate the fitness scores of all individuals in the population using (31).
(4) Diversification Operation: Sort individuals based on their fitness scores. The top five individuals form the elite subpopulation pop_best1^n ~ pop_best5^n, and the bottom five form the temporary subpopulation pop_temp1^n ~ pop_temp5^n. Determine whether the fitness score of the newly generated temporary subpopulation is higher than that of the elite subpopulation. If so, apply mutation (29) to the lowest-scoring temporary individual to generate a new temporary subpopulation pop_temp1^n ~ pop_temp5^n, and return to step 3; otherwise, the diversification phase is complete, and the elite subpopulation is finalized.
(5) Fitness Calculation of the New Elite Subpopulation: Recalculate the fitness scores of all individuals in the new elite group using (31).
(6) Convergence Operation: Sort individuals in the elite subpopulation by fitness; the highest-scoring individual is identified as the winner. Determine whether the population has matured. If the winner changes, perform crossover (28) on the remaining individuals to generate new individuals, and return to step 5; otherwise, the convergence phase is complete, and the global optimal individual is obtained.
(7) Termination Check: Determine whether the stopping condition is met. If the maximum number of generations has been reached, the process ends and the current optimal NBPE value is output; otherwise, return to step 3 to continue iteration.
(8) Dictionary Construction: Based on the optimal NBPE value, segment Tongan words and construct the Tongan dictionary, which serves as a standard for both training and inference, ultimately enabling accurate and efficient Tongan speech recognition.
Algorithm 2: MEA-AGA for Dictionary Parameter Optimization
Require: N, n_max, P_c, P_m, pop_1~pop_10
Ensure: Optimal NBPE
1:  /* Stage 1: Initialization */
2:  Initialize parameters; generate population pop_1~pop_10;
3:  for iteration = 1 to n_max do
4:    /* Stage 2: Fitness Evaluation */
5:    Compute fitness for all individuals;
6:    /* Stage 3: Differentiation */
7:    Rank population; select top 5 as pop_elite, others as pop_temporary;
8:    if pop_temporary better than pop_elite then
9:      Replace pop_elite; mutate poor individuals (see Equation (29));
10:   else
11:     Regenerate pop_elite;
12:     continue;
13:   end if
14:   /* Stage 4: Aggregation */
15:   Optimal NBPE ⇐ NBPE of best individual;
16:   if not converged then
17:     crossover individuals (see Equation (28));
18:     continue;
19:   end if
20:   /* Stage 5: Termination */
21:   if termination met then
22:     break;
23:   end if
24: end for
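As a structural illustration of Algorithm 2, the skeleton below runs the differentiation/aggregation loop over integer NBPE candidates against a stand-in scoring function. The `evaluate_nbpe` stub, the search range, and the perturbation sizes are assumptions; in practice each evaluation would train and score a model via Equation (31), and the adaptive rates of Equations (28)–(29) would drive the crossover and mutation steps.

```python
import random

def evaluate_nbpe(nbpe):
    """Stand-in for training with a given NBPE and scoring it via Eq. (31).
    A smooth toy function peaking near 400 is used purely for illustration."""
    return 1.0 / (1.0 + abs(nbpe - 400) / 200.0)

def mea_aga_search(lo=100, hi=1200, pop_size=10, n_iter=20, seed=0):
    rng = random.Random(seed)
    pop = [rng.randint(lo, hi) for _ in range(pop_size)]
    best = max(pop, key=evaluate_nbpe)
    for _ in range(n_iter):
        ranked = sorted(pop, key=evaluate_nbpe, reverse=True)
        elite, temp = ranked[:5], ranked[5:]
        # differentiation: reinitialise the temporary subpopulation to keep exploring the space
        temp = [rng.randint(lo, hi) for _ in temp]
        # aggregation: refine the elite subpopulation around the current winner
        winner = elite[0]
        elite = [winner] + [max(lo, min(hi, (winner + e) // 2 + rng.randint(-20, 20)))
                            for e in elite[1:]]
        pop = elite + temp
        best = max([best] + pop, key=evaluate_nbpe)
    return best

print("selected NBPE:", mea_aga_search())
```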

4. Experimental Setup

4.1. Dataset Configuration

In transfer learning, the choice of source language plays a crucial role in determining model performance. This is especially relevant for low-resource speech recognition, where selecting a source language with abundant resources and phonetic structures similar to the target language can significantly improve transfer effectiveness and generalization. Tongan, the official language of the Kingdom of Tonga, belongs to the Polynesian language family and represents a typical low-resource case [51]. Due to historical colonization and the influence of modern education systems, English has gradually become one of the official languages and is widely taught as a second language in schools [52,53]. Prolonged language contact has resulted in extensive borrowing from English, both in vocabulary and phonological systems, creating notable similarities in phoneme distribution and speech structure. Furthermore, English, as the most resource-rich language, offers large-scale annotated corpora and well-developed pretrained models, providing a strong foundation for transfer learning. Therefore, this study selects English as the source language for building pretrained models to facilitate Tongan speech recognition.
This study employs two datasets. For English, the publicly available LibriSpeech corpus (960 h) is used for pretraining. For Tongan, the speech data were recorded in our laboratory by several professional researchers to ensure both speaker diversity and consistent recording quality. All recordings were conducted in a quiet indoor environment using a high-quality condenser microphone positioned approximately 20 cm from the speaker’s mouth. Each utterance was recorded by a single speaker without overlap to avoid multi-speaker interference. All recordings were captured in mono format, with no speaker diarization applied, as each utterance was produced by a single speaker. The sampling rate was set to 16 kHz. The transcription process was manually performed using Notepad++ and strictly aligned with the reference text to guarantee complete labeling accuracy. The recordings feature clear and standardized pronunciation, making them suitable for evaluating model performance and recognition robustness under low-resource conditions. In total, 1.44 h of Tongan speech data were obtained and divided into training, validation, and test sets. Detailed statistics are provided in Table 1.
Table 1. Tongan corpus configuration (raw data).
As shown in the table, the Tongan dataset is relatively small in scale. Direct training on such limited data is likely to cause overfitting and reduce the model’s generalization capability. Therefore, in the subsequent experiments, the data will be appropriately augmented and expanded to reach a more reasonable scale.

4.2. Experimental Design

This study adopts the Mixformer network as the base architecture for transfer learning. The pre-trained weights are obtained from previous English speech recognition experiments that were trained on the LibriSpeech corpus [54]. The Mixformer model, containing approximately 90 million parameters, performs loss-based weighted fusion of the Conformer, Unified Conformer, and U2++ Conformer architectures to improve overall robustness. During decoding, it combines both CTC and attention mechanisms, incorporating a penalty function to dynamically optimize decoding paths, thereby achieving a balance between recognition accuracy and inference efficiency. Building on this foundation, the proposed layer-wise adaptive transfer strategy is applied to fine-tune the model, improving its adaptability and training efficiency for the Tongan language task.
To provide a clearer view of the model’s architecture and performance, Table 2 and Table 3 present the Mixformer’s core configurations, along with its training settings and recognition results on the English corpus.
Table 2. Model architecture.
Table 3. Parameter configuration and recognition results in English pretraining stage.
Experiments were conducted on an NVIDIA RTX 4070 GPU (16 GB, NVIDIA Corporation, Santa Clara, CA, USA) with CUDA 11.8, running on 64-bit Ubuntu 18.04. The environment was configured with Python 3.8 and PyTorch 2.1.2. Detailed parameter settings for the Tongan training stage are summarized in Table 4.
Table 4. Parameter configuration in Tongan training stage.
In addition, standard training management techniques were employed, including early stopping based on validation loss and automatic checkpoint saving after each epoch. Furthermore, a warm-up followed by cosine-annealing learning rate scheduling was utilized to stabilize the training process and promote smoother convergence.
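A warm-up phase followed by cosine annealing can be configured in PyTorch roughly as follows; the stand-in model, optimiser, warm-up length, and total step count are illustrative and do not reproduce the exact settings of Table 4.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(80, 256)                       # stand-in for the acoustic model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps, total_steps = 1000, 20000
warmup = LambdaLR(optimizer, lr_lambda=lambda s: min(1.0, (s + 1) / warmup_steps))
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

for step in range(3):                                  # a few dummy steps just to advance the schedule
    loss = model(torch.randn(4, 80)).pow(2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
    print(step, optimizer.param_groups[0]["lr"])
```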

4.3. Performance Metrics

In speech recognition, evaluating model performance is critical. This study adopts Word Error Rate (WER) as the primary evaluation metric. WER is calculated by measuring the minimum edit distance—comprising substitutions, insertions, and deletions—between the recognized text and the reference text and dividing it by the total number of words in the reference. The formula is as follows:
WER = \frac{S_w + D_w + I_w}{N_w} \quad (32)

Here, S_w, D_w, and I_w represent the number of substitutions, deletions, and insertions, respectively, while N_w denotes the total number of words in the reference. A lower WER indicates higher recognition accuracy.
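WER as defined in Equation (32) can be computed with a standard word-level edit distance; the snippet below is a minimal reference implementation with a made-up example, not the scoring tool used in the experiments.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length, Eq. (32)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(1, len(ref))

print(wer("ko e lea faka-tonga", "ko lea faka tonga"))   # toy example -> 0.75
```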
In addition to recognition accuracy, decoding speed is also an important evaluation criterion. This study uses Words Per Second (WPS) to reflect the model’s real-time processing capability. The calculation formula is as follows:
WPS = \frac{N_w}{T_I} \quad (33)

where T_I represents the time required to complete an inference pass of speech recognition.
In addition, to systematically evaluate the effectiveness of the data augmentation and partitioning strategies, this study conducts quantitative analysis from two perspectives: the similarity quality of augmented data and the class balance of the partitioned subsets.
For similarity evaluation, the proposed augmentation algorithm incorporates a DRF module to assess feature similarity between synthetic pseudo-samples and original samples. DRF filters and retains high-quality augmented samples based on a scoring mechanism, as detailed in (8).
For class balance evaluation, given that audio signal features are high-dimensional and structurally complex, making them difficult to interpret directly, this study employs the t-SNE dimensionality reduction algorithm to project the features into a two-dimensional space for intuitive visualization of data clustering. To quantitatively assess distributional divergence between subsets, a weighted total variance metric is introduced as follows:
\sigma^{2}_{total} = \frac{n_{train}\,\sigma^{2}_{train} + n_{val}\,\sigma^{2}_{val} + n_{test}\,\sigma^{2}_{test}}{n_{train} + n_{val} + n_{test}} \quad (34)

σ²_train, σ²_val and σ²_test denote the variances of the training, validation, and test sets, respectively; n_train, n_val and n_test represent the number of samples in the training, validation, and test sets.
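The weighted total variance of Equation (34) is a sample-count-weighted average of the subset variances, as in this short sketch; the subset sizes and values are invented for illustration.

```python
import numpy as np

def weighted_total_variance(splits):
    """sigma^2_total per Eq. (34); `splits` is a list of 1-D feature arrays (train/val/test)."""
    sizes = np.array([len(s) for s in splits], dtype=float)
    variances = np.array([np.var(s) for s in splits])
    return float((sizes * variances).sum() / sizes.sum())

rng = np.random.default_rng(0)
train, val, test = rng.normal(0, 1.0, 800), rng.normal(0, 1.1, 100), rng.normal(0, 0.9, 100)
print(round(weighted_total_variance([train, val, test]), 4))
```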

5. Experimental Results

5.1. Data Augmentation Experiments

To address data scarcity and class imbalance in Tongan speech recognition, this study augments the original dataset using SRA-DRF combined with weighted stratified sampling. The effectiveness is evaluated from two perspectives: data partitioning and augmentation similarity.
For the data partitioning experiment, given the large volume of augmented samples, qualitative analysis uses only a subset for t-SNE visualization to avoid visual clutter. It is important to note that the x- and y-axes in the t-SNE plots reflect only relative sample distribution without physical meaning; therefore, axis ticks are omitted in the corresponding figures. In contrast, quantitative analysis is performed on the full dataset to compute the total variance, ensuring comprehensive and representative evaluation. The data are clustered into 3 to 7 groups, and samples are further categorized into near, medium, and far subgroups according to their distances from cluster centers. Stratified sampling is then applied to construct the training, validation, and test sets. Finally, the proposed method is compared with the baseline approach both qualitatively and quantitatively, as shown below.
Figure 8 presents the clustering results of the traditional algorithm and the proposed method under different numbers of clusters (3–7). When the number of clusters is set to 3, the clustering is relatively coarse, making it difficult to distinguish the data effectively. As the number increases to 4 or 5, the data distribution becomes clearer, and clustering quality improves markedly. However, with 6 or 7 clusters, the boundaries become blurred and class overlap intensifies, leading to a decline in clustering performance. Comparing performance across different cluster settings, the traditional K-Means performs reasonably well with fewer clusters but exhibits severe class mixing as the number increases. In contrast, the proposed method maintains clearer separation even with higher cluster counts, alleviating the overlap issue to some extent and consistently outperforming the baseline.
Figure 8. Comparison of clustering results.
Table 5 presents the dataset partitioning results obtained using K-Means and weighted stratified K-Means under different cluster numbers (three to seven). The experimental results indicate that although the total variance of K-Means generally decreases as the number of clusters increases, its performance becomes unstable at higher cluster counts (six or seven), particularly on the test set. In contrast, the proposed method maintains better balance even with fewer clusters and exhibits more stable variance trends as the cluster number increases, demonstrating improved consistency and uniformity in partitioning. Based on the overall evaluation, the configuration with five clusters—yielding the most balanced results—is selected for subsequent data partitioning. The final statistics are reported in Table 6 and Table 7.
Table 5. Quantitative comparison of dataset partitioning results.
Table 6. Dataset partition statistics.
Table 7. Final sample classification statistics.
To validate the effectiveness of the augmentation method, eight controlled experiments were conducted to examine the impact of different augmentation techniques and class sample distributions on classification accuracy. The experimental configurations and results are summarized in Table 8.
Table 8. Data augmentation effectiveness experiment.
In this study, the dataset consists of five categories (A, B, C, D, E), clustered by the weighted stratified sampling algorithm. In Experiment 1, data augmentation was performed using a Generative Adversarial Network (GAN), which achieved a similarity of 79.60%. Experiment 2 applied traditional signal processing techniques such as speed and pitch perturbation, achieving a similarity of 88.62%. Experiment 3 integrated both approaches with the proposed augmentation algorithm, achieving the highest similarity of 90.63%. This result outperformed Experiments 1 and 2, demonstrating that the proposed method effectively enhances data quality and improves model robustness. Experiments 4–8 involved removing one category at a time from the Tongan dataset. The results show that removing any category led to a significant drop in accuracy, indicating that maintaining class balance plays a crucial role in data augmentation. Therefore, the dataset generated in Experiment 3, which achieved the highest classification accuracy, was selected for subsequent transfer learning experiments.
After data augmentation and partitioning, the final Tongan speech dataset consists of 8686 audio samples, divided into five categories: A (1946 samples), B (1820), C (1870), D (1804), and E (1846), totaling approximately 11.44 h. All audio files are stored in FLAC format and sampled at 16 kHz. The dataset distribution is shown in Table 9.
Table 9. Configuration of the expanded data sample set.
In summary, the proposed data augmentation method not only expanded the corpus size but also preserved a high similarity in feature distributions between the augmented samples and the original data, confirming the effectiveness and practicality of the augmented corpus. The weighted stratified sampling strategy effectively improved class balance during dataset partitioning and enhanced the consistency across the training, validation, and test sets. Together, these two strategies enabled the development of a stable, high-quality low-resource speech dataset, providing a solid foundation for subsequent model training.

5.2. Transfer Learning Experiments

To comprehensively evaluate the effectiveness of the proposed transfer learning strategy, this section conducts experimental validation from two perspectives: a visual analysis of the model migration process and a comparison of recognition performance after transfer.
The experiment begins with a visualization-based analysis to examine structural changes in the model before and after fine-tuning. Given the complexity of the adopted architecture, it is impractical to cover all layers. To address this, a representative selection strategy is adopted: the lower layers are sampled using stratified sampling, and the last 30 upper layers are selected to align with the layer-wise unfreezing process and to observe the evolution of feature representations. The corresponding experimental results are presented below.
Figure 9 and Figure 10 present the CKA similarity matrices of four models before and after training, focusing on the lower and upper layers. The results show that the similarity patterns in the lower layers are highly consistent across all models, with values along the main diagonal close to 1. This suggests that lower-level features remain largely unchanged during fine-tuning and are minimally affected by parameter updates. In contrast, the upper-layer similarities gradually decrease with increasing depth, as reflected by the fading of the main diagonal. This trend suggests that parameter adjustments intensify progressively in higher layers. Such a pattern aligns well with the layer-wise unfreezing strategy discussed in Section 3.2, further confirming that higher-level representations gradually adapt to the target task, whereas lower-level structures remain stable. This demonstrates the effective implementation of the proposed transfer strategy.
Figure 9. CKA similarity of low-level structures before and after transfer across models.
Figure 10. CKA similarity of high-level structures before and after transfer across models.
Figure 11 quantitatively illustrates the changes in feature similarity across different layers. The similarity scores in the lower layers remain close to 1, indicating that these features remain stable during fine-tuning. In contrast, similarity in the upper layers decreases progressively with increasing depth, with the largest drop observed in the top layers, thereby confirming the effectiveness of the layer-wise unfreezing strategy. The extent of high-layer variation differs among models: Conformer exhibits the smallest change, while Unified Conformer and U2++ Conformer show more noticeable decreases. Mixformer demonstrates the most substantial drop, suggesting a more thorough adaptation of high-level features, which may enhance its potential for Tongan speech recognition tasks.
Figure 11. Layer-wise similarity variation before and after transfer in different models.
Building on the preceding visualization analysis, this section further evaluates recognition performance after transfer learning. Table 10, Figure 12 and Figure 13 summarize the experimental results for the combinations of three data augmentation strategies (without augmentation, random augmentation, and the proposed method) and two training approaches (direct training and layer-wise transfer learning).
Table 10. Comparison of data augmentation methods.
Figure 12. Development set comparison of data augmentation methods.
Figure 13. Test set comparison of data augmentation methods.
The results show that direct training without augmentation results in poor recognition accuracy. Random augmentation provides only modest and limited improvements. In contrast, the proposed method combined with layer-wise transfer markedly improves recognition accuracy while halving the required training epochs. A plausible explanation for this improvement lies in two main aspects. First, the SRA-DRF augmentation expands the training data with higher quantity and quality, avoiding invalid or overly dissimilar samples that may arise in random augmentation. Its selective mechanism guarantees that the generated audio remains acoustically consistent with the original recordings, thereby enhancing the model’s generalization ability and reducing learning bias. Second, the adopted transfer learning strategy enables the model to leverage the pretrained English representations, which share similar phonetic and acoustic characteristics with Tongan. This facilitates more efficient feature adaptation, leading to faster convergence and higher recognition accuracy. These findings highlight both the effectiveness and efficiency of the proposed method in addressing data scarcity in low-resource languages, providing a promising modeling approach for Tongan speech recognition.

5.3. Dictionary Parameter Optimization Experiments

This section performs automatic optimization of the core dictionary parameter, NBPE, using the MEA-AGA, in order to determine the optimal dictionary size. The process begins with the differentiation phase to generate the initial population. The results are presented below.
Table 11 lists ten sets of NBPE parameters generated during the initial Differentiation phase, along with their corresponding WER, training time, and WPS performance metrics. The fitness scores, calculated according to (31), are used to rank these candidates. The top five individuals are designated as the elite subpopulation, providing the foundation for subsequent optimization of dictionary parameters, while the remaining five are classified as temporary individuals, forming the temporary subpopulation.
Table 11. Results of differentiation operation.
Building on this foundation, an Aggregation operation is applied to both the elite and temporary subpopulations to enable fine-grained search and improve overall optimization performance. Figure 14 illustrates the variation in fitness scores of the two subpopulations throughout the Aggregation process.
Figure 14. Iteration process of the dominant and temporary subpopulation.
The results indicate that the MEA-AGA achieves stable performance in dictionary parameter optimization, with fitness scores steadily improving during the early stages and gradually converging in later iterations. The elite subpopulation consistently outperforms the temporary subpopulation, and the optimal solution stabilizes at a fitness value of 0.865. These findings validate the effectiveness of the algorithm. Detailed iterative results for each subpopulation are provided in Table 12 and Table 13.
Table 12. Results of aggregation operation (dominant subpopulation).
Table 13. Results of aggregation operation (temporary subpopulation).
Finally, the NBPE parameter was optimized using the MEA-AGA strategy. The experimental results indicate that when NBPE is set to 401, the model achieves WERs of 26.18% on the Dev set and 28.64% on the Test set, with a decoding speed of 68 WPS. These outcomes demonstrate a favorable balance between recognition accuracy and decoding efficiency, thereby validating the practical value of the proposed optimization strategy for low-resource language recognition tasks.

5.4. Comparative Analysis of Model Performance

To systematically evaluate the effectiveness of the proposed optimization strategies, a series of comparative experiments were conducted. The evaluation metrics included Dev-WER, Test-WER, and WPS. The results are summarized in Table 14, where “√” indicates that the corresponding strategy was applied, and “——” denotes that it was not.
Table 14. Comparative analysis of model performance under different optimization strategies.
The experimental results are further visualized in Figure 15, which illustrates the performance variations under different optimization strategy combinations.
Figure 15. Comparative analysis of model performance under different optimization strategies.
Experiment 1 serves as the baseline model without any optimization strategies. In Experiment 2, the proposed data augmentation and partitioning strategy is applied, which ensures both the quantity and quality of the expanded data while also addressing class imbalance. As a result, both dev-WER and test-WER are significantly reduced. This demonstrates its effectiveness in improving recognition accuracy and generalization, particularly for low-resource languages like Tongan. Experiment 3 additionally incorporates the layer-wise adaptive transfer learning strategy. By leveraging the similarity between English (source language) and Tongan (target language) in phonetic structures, the model achieves faster convergence and a further reduction in WER. These results highlight the adaptability and practicality of the proposed method for low-resource tasks. In Experiment 4, the NBPE value is optimized to refine vocabulary granularity, thereby enhancing the model’s capacity to represent Tongan linguistic features and improving both recognition accuracy and decoding speed.
In addition, Table 15 compares the performance of several mainstream models on the Tongan dataset. The results indicate that the proposed method surpasses the others in both recognition accuracy and inference efficiency, thereby demonstrating superior overall performance.
Table 15. Comparative analysis of speech recognition performance of state-of-the-art models.
To further evaluate the overall performance of the proposed model, a horizontal comparison was conducted between the recognition results for Tongan and those reported for other low-resource languages. As shown in Table 16 and Figure 16, although the available Tongan corpus is relatively limited (11.44 h), the achieved recognition accuracy remains within a reasonable range, demonstrating performance consistent with the dataset scale. This finding further underscores the practicality and adaptability of the proposed method in low-resource scenarios.
Table 16. Performance comparison of speech recognition on other low-resource languages.
Figure 16. Performance comparison of speech recognition on low-resource languages.

6. Conclusions

With the rapid development of deep learning and artificial intelligence, speech recognition technology has been widely adopted worldwide. However, most existing research has focused on resource-rich languages such as English and Chinese, while studies on low-resource languages like Tongan remain scarce. These languages typically lack systematic corpora and effective modeling approaches, leading to challenges such as data scarcity, limited model transferability, and inadequate dictionary modeling mechanisms.
To overcome these limitations, this study first constructs a pretrained model using resource-rich English corpora and then fine-tunes it on the target language, Tongan, through a layer-wise adaptive transfer learning strategy. This approach enables both efficient and accurate speech recognition. Moreover, it provides valuable theoretical and technical support for the preservation of Tongan linguistic and cultural resources, while also promoting international cultural exchange. The main contributions of this study are summarized as follows:
(1) To address the issue of limited Tongan language data, this study proposes an SRA-DRF algorithm. By combining GAN-based synthetic data generation with traditional signal processing techniques, high-quality audio samples are produced. The effectiveness of the augmented data is validated through similarity comparison, demonstrating that the dataset size increases from 1.43 h to 11.44 h, with a similarity score of 90.63% between the augmented and original data. This effectively alleviates the problem of data scarcity. Furthermore, a weighted stratified sampling strategy is employed to achieve class-balanced partitioning (see the partitioning sketch after this list), ensuring that the training, validation, and test sets maintain a complete and balanced sample distribution, thereby enabling the model to fully learn the features of each category.
(2) In the transfer learning phase, this study introduces a layer-wise adaptive strategy that preserves the low-level general features of the pretrained model while dynamically adjusting the learning rates of the higher layers according to loss values and source–target language similarity. Fine-tuning is applied primarily to the higher layers. The effectiveness and rationality of this strategy are further demonstrated through CKA similarity matrix analysis (see the linear-CKA sketch after this list), which reveals distinct patterns of hierarchical feature adaptation.
(3) For optimizing the critical NBPE parameter in Tongan dictionary construction, this study proposes the MEA-AGA optimization algorithm. The NBPE value is optimized from an initial setting of 300 to 401. With this configuration, the optimized model achieves a Dev-WER of 26.18% and a Test-WER of 28.64% on the Tongan dataset, along with a decoding speed (WPS) of 68. Compared with the baseline model (Dev-WER 53.77%, Test-WER 54.93%, WPS 49.22), these results represent approximately 51.3% and 47.9% relative reductions in WER, and a 38.2% increase in decoding speed. These results demonstrate substantial improvements in both recognition accuracy and inference efficiency. Compared with mainstream approaches, the proposed method exhibits clear advantages and achieves performance within a reasonable accuracy range for low-resource language speech recognition.
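As referenced in contribution (1), a minimal sketch of class-balanced partitioning is given below, using scikit-learn's stratified splitting to preserve per-category proportions across the training, validation, and test sets. The split ratios and label handling are illustrative; this is not the SRA-DRF implementation itself.

```python
from sklearn.model_selection import train_test_split

def stratified_partition(utt_ids, labels, seed: int = 42):
    # First carve out 20% for validation + test, keeping category proportions.
    train_ids, rest_ids, _, rest_labels = train_test_split(
        utt_ids, labels, test_size=0.2, stratify=labels, random_state=seed)
    # Split the remainder evenly into validation and test sets.
    dev_ids, test_ids = train_test_split(
        rest_ids, test_size=0.5, stratify=rest_labels, random_state=seed)
    return train_ids, dev_ids, test_ids  # roughly 80% / 10% / 10%
```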
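For contribution (2), the CKA similarity referred to above can be computed with the standard linear-CKA formulation sketched below, where X and Y are activation matrices (samples × features) from corresponding layers before and after fine-tuning; this is the textbook definition, not the authors' analysis script.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    # Center features, then compare cross-covariance against self-covariances.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))
```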
Beyond its technical contributions, the development of speech recognition for low-resource languages such as Tongan also carries important ethical and societal significance. From a cultural perspective, it supports linguistic diversity by promoting the digital preservation and accessibility of endangered languages, helping safeguard intangible cultural heritage and enabling inclusive global communication. From an ethical standpoint, enhancing AI inclusivity helps narrow the digital divide and prevent the marginalization of smaller linguistic groups. At the societal level, improving recognition for minority languages contributes to cultural sustainability and supports education and social inclusion in multilingual regions.

7. Future Work

Although the proposed transfer strategy and parameter optimization algorithm have achieved promising results in Tongan speech recognition, several limitations remain. Future research can therefore be pursued in the following directions:
(1) Dataset Expansion: The current Tongan speech corpus remains relatively small. Although the proposed data augmentation and balanced partitioning strategies have alleviated data scarcity and class imbalance to some extent, the corpus size and distribution are still limited compared with mainstream languages. This constraint may affect the model’s generalization performance across phonetic categories. Future work should prioritize the collection of larger, more balanced, and higher-quality Tongan speech datasets to further enhance model robustness and generalizability.
(2) Optimization of Transfer Strategies: Although the proposed framework effectively adapts pretrained models to low-resource speech recognition, it still relies on supervised fine-tuning. This dependency may limit adaptability when labeled data are scarce or domain variations occur. Future research could explore more flexible transfer learning paradigms to overcome these limitations. For instance, unsupervised or semi-supervised adaptation strategies could reduce reliance on annotated data, while zero-shot or few-shot learning mechanisms may enhance model generalization under extremely low-resource conditions.
(3) Broader Source Language Selection: In this study, English was selected as the source language due to its phonetic similarity to Tongan. Future research could investigate multilingual joint transfer strategies to enhance cross-lingual generalization. Moreover, given the demonstrated effectiveness of the proposed method for Tongan, its applicability to other low-resource languages deserves systematic evaluation to validate its universality and scalability.

Author Contributions

Conceptualization, J.G. and D.J.; methodology, J.G. and Z.L.; software, Z.L. and W.Z.; validation, J.G. and D.J.; formal analysis, J.G.; investigation, Z.H. and W.Z.; resources, N.W. and R.C.; data curation, N.W. and W.Z.; writing—original draft preparation, J.G.; writing—review and editing, D.J.; visualization, J.G.; supervision, D.J.; project administration, D.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program (Project No. 2023YFF0612100).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets generated and analyzed during the current study are not publicly available because the research is still ongoing, and early disclosure may affect subsequent work. However, the data can be made available from the corresponding author upon reasonable request.

Acknowledgments

The authors express their sincere gratitude to Jia for his invaluable guidance and support throughout this research. We also thank the students who participated in this project for their efforts and dedication, which were essential to the success of this study. Generative AI tools were used exclusively to improve the language and grammar of this manuscript. All scientific content, analyses, interpretations, and conclusions were entirely conceived, written, and verified by the authors.

Conflicts of Interest

Author Junhao Geng was employed by the company Beijing Research Institute of Automation for Machinery Industry Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
