3.1. Data Augmentation and Partitioning
In low-resource speech recognition tasks, data scarcity and class imbalance remain major factors that hinder model performance [40]. To improve training effectiveness and recognition accuracy, this section focuses on preprocessing and augmenting the original Tongan speech corpus, with the goal of constructing a balanced, diverse, and high-quality training dataset.
Data augmentation techniques are generally divided into signal processing-based and deep learning-based approaches. The former, such as speed perturbation, pitch shifting, and noise addition [41], are easy to implement but often introduce pronunciation distortions and fail to preserve language-specific features in low-resource settings. The latter, represented by Generative Adversarial Networks [42,43,44], generate higher-quality data but demand large datasets and significant computational resources, limiting their use in low-resource languages. To overcome these limitations, this study proposes an SRA-DRF algorithm that integrates signal processing techniques with deep learning methods. This hybrid strategy not only increases data diversity but also reduces reliance on large datasets and computational resources, thereby enabling effective augmentation of Tongan speech data. The overall workflow is illustrated in Figure 2.
Let $X$ denote the original Tongan audio feature matrix. Random segments are removed from $X$, and the erased time spans are stored in a separate matrix referred to as the audio segment pool. The residual features, after the erasure process, are preserved in a new matrix denoted as $\tilde{X}$:
$$\tilde{X} = M \odot X, \qquad \tilde{x}_{i,t} = m_{i,t}\, x_{i,t}$$
$M$ is a masking matrix composed of 0 and 1, having the same dimensionality as $X$. A value of 0 indicates that the corresponding data point is masked, while 1 means it is retained. $\tilde{x}_{i,t}$ denotes the amplitude of the $t$-th frame of the $i$-th utterance after the original data has been masked.
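The erasure step can be illustrated with a minimal NumPy sketch. It assumes a frames-by-dimensions feature layout and fixed-length erased spans. The function name `erase_random_spans` and its parameters are hypothetical and are not taken from the original method.

```python
import numpy as np

def erase_random_spans(features, n_spans=2, span_len=5, seed=0):
    """Mask random time spans; return residual features, mask, and erased pool."""
    rng = np.random.default_rng(seed)
    n_frames = features.shape[0]
    mask = np.ones_like(features)          # 1 = retained, 0 = masked
    pool = []                              # erased spans (the audio segment pool)
    for _ in range(n_spans):
        start = int(rng.integers(0, n_frames - span_len + 1))
        pool.append(features[start:start + span_len].copy())
        mask[start:start + span_len] = 0.0
    residual = features * mask             # element-wise masking
    return residual, mask, pool

# Toy feature matrix: 20 frames, 2 feature dimensions, all values non-zero.
features = np.arange(1, 41, dtype=float).reshape(20, 2)
residual, mask, pool = erase_random_spans(features)
```

Because the erased spans are copied into `pool` before masking, they remain available for the later augmentation stage.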
In the audio segment pool, signal processing and deep learning methods are jointly employed. On the one hand, the extracted audio segments are augmented through speed perturbation, pitch shifting, and noise addition. On the other hand, a GAN-based network is applied to generate additional realistic audio segments. The corresponding formulation is defined as follows:
$$X_{\text{sp}} = \mathrm{Pitch}_{p}\big(\mathrm{Speed}_{s}(X_{\text{pool}})\big) + \lambda E$$
$$X_{\text{gan}} = G(z; \theta)$$
where $s$ is the speed perturbation factor applied to modify speaking rate, and $p$ is the pitch shift factor for adjusting vocal pitch. $E$ denotes the noise matrix that follows a Gaussian distribution, and $\lambda$ is the noise strength coefficient. $G$ is the generator in the adversarial network; $z$ represents the noise input vector, which is randomly sampled and fed into the generator; $\theta$ denotes the generator’s parameters, which control the properties of the generated audio. $X_{\text{sp}}$ is the processed audio segment using signal-based operations on $X_{\text{pool}}$, the pool of erased segments, while $X_{\text{gan}}$ represents the audio segment generated by the GAN. These two types of augmented data are then combined into a unified audio segment pool $P$:
$$P = X_{\text{sp}} \cup X_{\text{gan}}$$
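The signal-based branch can be sketched as follows. This is a simplified illustration, not the authors' implementation: speed perturbation is approximated by linear-interpolation resampling (which also changes pitch), pitch shifting and the GAN branch are omitted, and the names `speed_perturb` and `add_gaussian_noise` and all numeric values are assumptions.

```python
import numpy as np

def speed_perturb(wave, factor):
    """Resample by linear interpolation; factor > 1 gives a shorter, faster signal."""
    n_out = max(1, int(round(len(wave) / factor)))
    src_positions = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(src_positions, np.arange(len(wave)), wave)

def add_gaussian_noise(wave, strength, rng):
    """Add zero-mean Gaussian noise scaled by the noise strength coefficient."""
    return wave + strength * rng.standard_normal(len(wave))

rng = np.random.default_rng(0)
# Hypothetical erased segment: 0.1 s of a 220 Hz tone sampled at 16 kHz.
segment = np.sin(2 * np.pi * 220 * np.arange(1600) / 16000)
faster = speed_perturb(segment, 1.1)
noisy = add_gaussian_noise(faster, 0.01, rng)
augmented_pool = [segment, faster, noisy]   # signal-based part of the pool
```

In a full pipeline, segments produced by a trained generator would be appended to the same list.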
Next, several candidate segments from the audio segment pool $P$, which stores the erased portions of the original audio, are aligned with the residual speech features $\tilde{X}$. An attention-based encoder–decoder (AED) module is then applied to integrate these segments, producing a supplementary data matrix $X_{\text{sup}}$:
$$X_{\text{sup}} = \mathrm{AED}(\tilde{X}, P)$$
Finally, the original Tongan corpus is utilized to train a DRF model. The augmented dataset is subsequently input to the model for evaluation, with the metric defined in Equation (8) adopted as the evaluation criterion.
Audio segments with higher scores are classified as Yes and retained as valid augmentations, while the rest are discarded.
After augmenting the speech data, the dataset must be partitioned in an appropriate manner to enhance the model’s generalization capability. A commonly adopted approach is K-Means clustering, which facilitates balanced class distribution [45,46,47]. Its mathematical formulation is defined as follows:
$$J = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2$$
Here, $C_i$ denotes the $i$-th cluster, $x_j$ represents the $j$-th sample in the cluster, and $\mu_i$ is the cluster centroid.
This method performs well when clustering data with balanced class distributions. However, for tasks involving low-resource languages such as Tongan, the small dataset size and class imbalance may result in biased cluster centers. Direct application of traditional clustering often leads to convergence issues, misalignment between training and test sets, and reduced model stability and generalization capability [48].
To overcome these limitations, we modify the objective function by introducing a class-aware weighting term, which balances the influence of minority-class samples during clustering. The revised formulation is defined as:
$$J_w = \sum_{i=1}^{k} \sum_{x_j \in C_i} w_j \lVert x_j - \mu_i \rVert^2$$
The weight $w_j$ is used to adjust the influence of each sample $x_j$ in its cluster $C_i$ (with centroid $\mu_i$) based on its class distribution, and is defined as:
$$w_j = \frac{1}{f(y_j)}$$
where $f(y_j)$ denotes the frequency of the class label $y_j$, to which sample $x_j$ belongs in the dataset. The detailed definition is:
$$f(y_j) = \frac{N_{y_j}}{N}$$
Here, $N$ is the total number of samples in the dataset, and $N_c$ is the number of samples belonging to class $c$, which is calculated as:
$$N_c = \sum_{x_j \in D} \mathbb{1}(y_j = c)$$
$\mathbb{1}(\cdot)$ is the indicator function, which equals 1 if the condition is satisfied, and 0 otherwise. $D$ denotes the full dataset. Finally, by substituting the above expressions, the improved K-Means clustering objective function is given as:
$$J_w = \sum_{i=1}^{k} \sum_{x_j \in C_i} \frac{N}{N_{y_j}} \lVert x_j - \mu_i \rVert^2 \qquad (14)$$
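A small numerical sketch may help make the class-aware weighting concrete. It assumes the weight is the inverse of the class frequency, so that rarer classes receive larger weights. The function names and the toy data are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Weight each sample by 1 / (relative frequency of its class)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes.tolist(), (counts / len(labels)).tolist()))
    return np.array([1.0 / freq[y] for y in labels.tolist()])

def weighted_objective(X, assign, centers, weights):
    """Sum of weighted squared distances from samples to their assigned centroids."""
    sq_dist = np.sum((X - centers[assign]) ** 2, axis=1)
    return float(np.sum(weights * sq_dist))

labels = ["a", "a", "a", "b"]                   # class "b" is the minority
weights = inverse_frequency_weights(labels)     # a: 1/0.75, b: 1/0.25
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
assign = np.array([0, 0, 0, 1])                 # cluster index per sample
centers = np.array([[0.0, 0.0], [4.0, 4.0]])
objective = weighted_objective(X, assign, centers, weights)
```

Here the single minority-class sample carries three times the weight of each majority-class sample, so a centroid that fits it poorly is penalized more heavily.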
To achieve more balanced data partitioning, this study integrates stratified sampling with the improved weighted K-Means clustering method. Specifically, samples are first stratified according to their distances from the cluster centroids, and proportional sampling is then applied within each cluster to ensure balanced class representation across the training, validation, and test sets. The overall procedure of this data partitioning method is illustrated in Figure 3.
The basic steps of the algorithm are as follows:
(1) Based on the features extracted from the input Tongan speech data, perform statistical analysis to initialize k cluster centers $\mu_i$ (where i = 1, 2, …, k and k is the number of clusters).
(2) Calculate the weighted distance from each sample to each cluster center and assign the sample to the nearest cluster. For a given sample $x_j$ and cluster center $\mu_i$, the distance is computed according to Equation (14).
(3) Update each cluster center based on the current cluster members and their weights. The new cluster center $\mu_i$ is computed as:
$$\mu_i = \frac{\sum_{x_j \in C_i} w_j\, x_j}{\sum_{x_j \in C_i} w_j}$$
where $w_j$ is the weight of sample $x_j$ and $C_i$ is the set of samples currently assigned to cluster $i$.
(4) Check whether the new cluster centers meet the convergence condition. If convergence is achieved, the algorithm terminates; otherwise, return to step (2) for further updates.
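(5) After clustering, compute the proportion of each class within each cluster:
$$p_{i,c} = \frac{n_{i,c}}{n_i}$$
Here, $p_{i,c}$, $n_{i,c}$, and $n_i$ represent, respectively, the proportion of class $c$ within cluster $i$, the number of samples of class $c$ within cluster $i$, and the total number of samples within cluster $i$.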
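(6) Finally, stratified sampling is performed within each cluster based on the class distribution and the specified data split ratios. The total number of samples selected from each cluster is then used to construct the training, validation, and test sets. The formulas are defined as follows:
$$N_{\text{train}} = \sum_{i=1}^{k} \sum_{c=1}^{C} n_{i,c}^{\text{train}}, \qquad N_{\text{val}} = \sum_{i=1}^{k} \sum_{c=1}^{C} n_{i,c}^{\text{val}}, \qquad N_{\text{test}} = \sum_{i=1}^{k} \sum_{c=1}^{C} n_{i,c}^{\text{test}}$$
Here, $N_{\text{train}}$, $N_{\text{val}}$, and $N_{\text{test}}$ represent the total number of training, validation, and test samples, respectively. $k$ is the number of clusters, $C$ is the number of classes, and $n_{i,c}^{\text{train}}$, $n_{i,c}^{\text{val}}$, and $n_{i,c}^{\text{test}}$ represent the number of training, validation, and test samples from class $c$ within cluster $i$.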
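The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under several assumptions: the initial centers are passed in explicitly rather than derived from statistical analysis, the weights are inverse class frequencies, and the 0.8/0.1/0.1 split ratios are placeholders. `weighted_kmeans` and `stratified_split` are hypothetical names.

```python
import numpy as np

def weighted_kmeans(X, weights, init_centers, n_iter=100):
    """Weighted K-Means: nearest-centroid assignment, weighted-mean update."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        # Step (2): assign each sample to its nearest centroid.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        # Step (3): move each centroid to the weighted mean of its members.
        new_centers = centers.copy()
        for i in range(len(centers)):
            members = assign == i
            if members.any():
                new_centers[i] = np.average(X[members], axis=0, weights=weights[members])
        # Step (4): stop once the centroids no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers

def stratified_split(assign, labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Steps (5)-(6): split every (cluster, class) group by the same ratios."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    splits = {"train": [], "val": [], "test": []}
    for i in np.unique(assign):
        for c in np.unique(labels):
            idx = np.flatnonzero((assign == i) & (labels == c))
            rng.shuffle(idx)
            n_train = int(round(ratios[0] * len(idx)))
            n_val = int(round(ratios[1] * len(idx)))
            splits["train"].extend(idx[:n_train].tolist())
            splits["val"].extend(idx[n_train:n_train + n_val].tolist())
            splits["test"].extend(idx[n_train + n_val:].tolist())
    return splits

# Toy data: two well-separated blobs with imbalanced class labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels = np.array([0] * 30 + [1] * 10)
weights = len(labels) / np.bincount(labels)[labels]   # inverse class frequency
assign, centers = weighted_kmeans(X, weights, init_centers=X[[0, 20]])
splits = stratified_split(assign, labels)
```

Splitting each (cluster, class) group separately is what keeps the class proportions similar across the three subsets.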
3.2. Layer-Wise Unfreezing for Adaptive Transfer Learning
Transfer learning, which adapts knowledge from a source domain to a target task, has proven particularly effective in low-resource or highly specialized scenarios and has therefore attracted considerable academic attention [49]. However, for large-scale models such as Transformer, Wav2Vec, and GPT, significant challenges remain, including the risk of negative transfer and high computational costs. To address these issues, this study proposes an adaptive transfer learning approach based on layer-wise unfreezing. The overall network architecture and algorithmic workflow are illustrated in the figure below.
The model architecture adopted in this study incorporates a fine-grained feedback mechanism with a clear distinction between frozen and tunable layers, thereby improving the precision and efficiency of transfer learning. The unfreezing process follows a top-down strategy, where the number of trainable layers and the corresponding learning rates are initialized according to cross-lingual similarity. This design preserves pre-trained knowledge while gradually adapting the model to the target task. The overall architecture of the proposed adaptive transfer learning network is illustrated in Figure 4. As training progresses, a loss-driven adjustment strategy is applied to dynamically refine the unfreezing process, enabling more adaptive and task-specific optimization. The detailed training procedure is summarized in Algorithm 1.
The specific steps of the proposed algorithm are as follows:
(1) Model Initialization: This includes setting the maximum number of iterations $n$, the pretrained model $M_{\text{pre}}$, the transfer model $M_{\text{trans}}$, the number of model layers $N$, and the initial learning rate $\eta_0$.
(2) Similarity and Loss Evaluation: The similarity between source and target languages is calculated using cosine similarity, denoted as $sim_{\text{score}}$. Simultaneously, the overall training loss $L_{\text{total}}$ is computed to guide the subsequent unfreezing strategy.
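(3) Determining the Number of Unfrozen Layers: The initial number of unfrozen layers $L_1$ is determined based on the computed similarity score. The definition and iterative update of the unfreezing depth are formulated as follows:
$$L_1 = \lfloor N \cdot (1 - sim_{\text{score}}) \rfloor$$
$$L_t = L_{t-1} + \operatorname{sign}\big(L_{\text{total}}(t) - L_{\text{total}}(t-1)\big) \cdot \Delta L, \qquad t > 1$$
where $N$ is the number of model layers, $L_{\text{total}}(t)$ and $L_{\text{total}}(t-1)$ represent the loss values at iterations $t$ and $t-1$, respectively, so that their difference is the change in loss, and the step size $\Delta L$ is set to 1.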
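(4) Learning Rate Adjustment: The learning rate $\eta$ is dynamically adjusted based on similarity and loss:
$$\eta = \eta_0 \cdot \exp\big(-\alpha(t) \cdot sim_{\text{score}} - \beta(t) \cdot L_{\text{total}}(t)\big)$$
where $\alpha(t)$ and $\beta(t)$ are time-dependent weighting coefficients that control the influence of similarity and loss, respectively. At the early stage of training, $\alpha(t)$ is set relatively high and decreases gradually, while $\beta(t)$ starts low and increases gradually as training progresses:
$$\alpha(t) = \alpha_0 \cdot \exp(-\gamma t), \qquad \beta(t) = \beta_0 \cdot \big(1 - \exp(-\gamma t)\big)$$
where $\gamma$ is a positive constant (set to 0.1) to control the rate of decay and growth. The initial values are denoted $\alpha_0$ and $\beta_0$.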
(5) Unfreezing Execution: Based on the determined unfreezing depth, the model is incrementally unfrozen from the top layer downward. The learning rate is applied accordingly to fine-tune each layer.
(6) Convergence Check: The process checks whether convergence criteria are met. If not, the algorithm returns to step (2) for further updates.
(7) Final Output: Once convergence is achieved, transfer training is completed, and the final target language recognition model is obtained.
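The depth and learning-rate rules used in steps (3) and (4) can be sketched with the standard library alone. The formulas follow Algorithm 1. Clamping the depth to the range 0 to N and the example values for the initial coefficients are assumptions added for illustration, and the function names are hypothetical.

```python
import math

def initial_depth(n_layers, sim_score):
    """Initial number of unfrozen layers: floor(N * (1 - sim_score))."""
    return math.floor(n_layers * (1.0 - sim_score))

def update_depth(depth, loss_t, loss_prev, n_layers, step=1):
    """Unfreeze one more layer when the loss rises, one fewer when it falls."""
    delta = loss_t - loss_prev
    sign = (delta > 0) - (delta < 0)
    # Clamping to [0, n_layers] is an added safeguard, not part of Algorithm 1.
    return min(n_layers, max(0, depth + sign * step))

def learning_rate(eta0, sim_score, loss_t, t, alpha0, beta0, gamma=0.1):
    """eta = eta0 * exp(-alpha(t) * sim_score - beta(t) * loss_t)."""
    alpha_t = alpha0 * math.exp(-gamma * t)          # decays over time
    beta_t = beta0 * (1.0 - math.exp(-gamma * t))    # grows over time
    return eta0 * math.exp(-alpha_t * sim_score - beta_t * loss_t)

# Hypothetical setting: 12 layers, similarity 0.75, alpha0 = beta0 = 1.0.
depth = initial_depth(12, 0.75)
depth_after_rise = update_depth(depth, loss_t=1.2, loss_prev=1.0, n_layers=12)
eta_start = learning_rate(1e-3, 0.5, 2.0, t=0, alpha0=1.0, beta0=1.0)
```

At t = 0 the loss term has no influence because the loss coefficient starts at zero, so the learning rate is governed by similarity alone.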
To further assess the feature changes during the transfer process, this study employs the CKA similarity matrix to quantitatively analyze key layers of the network before and after fine-tuning. CKA measures the alignment of representations in kernel space across different layers or stages and is widely used in neural network visualization and interpretability research [50]. By comparing similarity scores across layers, the analysis reveals how hierarchical features are adjusted during transfer, providing theoretical support for the effectiveness of the layer-wise unfreezing strategy.
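For reference, a common way to compute CKA between two sets of layer activations is the linear-kernel variant sketched below. Whether the study uses a linear or another kernel is not stated, so the linear form is an assumption here, and the random activations are placeholders.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activations X (n x d1) and Y (n x d2) on the same n inputs."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
before = rng.standard_normal((50, 8))            # placeholder activations before fine-tuning
Q, _ = np.linalg.qr(rng.standard_normal((8, 8))) # random orthogonal matrix
rotated = 2.0 * before @ Q                       # same representation, rotated and rescaled
unrelated = rng.standard_normal((50, 8))

same_score = linear_cka(before, rotated)
other_score = linear_cka(before, unrelated)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so `same_score` equals 1 up to rounding, while unrelated activations give a lower score.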
Algorithm 1: Adaptive layer-wise unfreezing and learning rate scheduling for transfer learning
Require: Mpre, n, η0, α0, β0, L
Ensure: Mtrans
Mtrans ⇐ Mpre; η ⇐ η0; L ⇐ 0;
for t = 1 to n do
    /* Stage 1: Similarity and Loss Calculation */
    Compute simscore and training loss Ltotal(t);
    /* Stage 2: Layer Unfreezing Decision */
    if t == 1 then
        L ⇐ floor[N · (1 − simscore)];
    else
        L ⇐ L + sign(Ltotal(t) − Ltotal(t − 1)) × ΔL;
    end if
    /* Stage 3: Learning Rate Adjustment */
    if unfreeze then
        η ⇐ η0 · exp(−α(t) · simscore − β(t) · Ltotal(t));
    end if
    α(t) ⇐ α0 · exp(−γ · t); β(t) ⇐ β0 · (1 − exp(−γ · t));
    /* Stage 4: Layer-Wise Training */
    Freeze layers < L; unfreeze layers ≥ L;
    Train Mtrans with η;
    /* Stage 5: Termination Check */
    if stopping criterion met then
        break;
    end if
end for
3.3. Dictionary Parameter Optimization Driven by MEA-AGA
In automatic speech recognition, dictionary construction is a critical component, typically achieved through sub-word unit segmentation to standardize output. Among the segmentation settings, NBPE serves as a core parameter that directly impacts vocabulary coverage, sequence length, and model complexity. A small NBPE can increase coverage and reduce annotation cost, but it leads to longer sequences and greater training difficulty. Conversely, a large NBPE may simplify training but could weaken the model’s linguistic expressiveness [36]. In low-resource scenarios such as Tongan, optimizing NBPE is crucial to improving recognition performance. To this end, this study proposes the MEA-AGA method to determine the optimal dictionary configuration for Tongan.
The genetic algorithm (GA) is an optimization method inspired by natural evolution, which searches for optimal solutions through selection, crossover, and mutation operations. However, traditional GA uses fixed parameter settings, making it prone to local optima and lacking adaptability. The adaptive genetic algorithm (AGA) addresses this by introducing dynamic parameter adjustment mechanisms, allowing the crossover and mutation rates to evolve throughout the search process, thereby enhancing global exploration and algorithm robustness. The functions are defined as follows:
In the equations, $k_1$, $k_2$, and $k_3$ are constants between 0 and 1. $f_{\max}$ and $f_{\text{avg}}$ denote the maximum and average fitness of the current population, respectively; $f'$ represents the fitness of the better individual among the two selected for crossover; $f$ corresponds to the fitness of the individual undergoing mutation.
However, Equations (24) and (25) consider only individual fitness and neglect the overall evolution of the population, making it difficult to capture population-wide trends and potentially leading to local optima. To address this limitation, this study introduces improvements to both the crossover and mutation probability functions:
$P_{c,\max}$ represents the maximum crossover probability, and $P_{m,\max}$ is the maximum mutation probability. As shown in the equations, regardless of an individual’s fitness value, both the crossover and mutation probabilities are prevented from dropping to zero. This not only effectively preserves high-performing individuals but also avoids premature convergence to local optima.
However, the exponential terms in the equations tend to magnify differences in fitness values. When an individual’s fitness deviates substantially from the population mean, the resulting rapid growth or decay of the exponential component may induce excessive fluctuations in crossover and mutation probabilities. Such instability can lead to over-exploration and undermine the robustness of the algorithm. To mitigate this issue, fitness variance is introduced to further refine and stabilize the equations:
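The fitness variance $\sigma^2$ is a statistical measure used to quantify the dispersion of fitness values within the population. A larger variance indicates greater differences in individual fitness, while a smaller one implies a more uniform population. The calculation formula is as follows:
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \big(f_i - f_{\text{avg}}\big)^2$$
where $f_i$ is the fitness of the $i$-th individual, $f_{\text{avg}}$ is the average fitness of the population, and $N$ is the population size.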
By incorporating fitness variance, the crossover and mutation probabilities are adjusted more smoothly, thereby suppressing drastic fluctuations induced by extreme fitness values. This adjustment enhances the stability of the algorithm and strengthens its capability to explore the solution space effectively.
Although this strategy increases parameter flexibility, it is still susceptible to local optima in complex search spaces, particularly when fitness differences among individuals are minimal. This limitation reduces adaptability and hinders global exploration. Moreover, AGA encounters challenges in balancing evolutionary speed and search quality: overly aggressive adjustment rates may lead to divergence and inefficiency, whereas overly conservative rates restrict the search range and degrade performance. To address these limitations, this study integrates the MEA framework to further enhance AGA. MEA simulates human-like rapid evolution through learning and innovation, thereby improving both adaptability and global search capability. In each iteration, a randomly initialized population undergoes fitness-based selection to generate a central individual, which partitions the population into elite and temporary subpopulations. The elite group preserves high-quality solutions, while the temporary group performs exploratory search, together enabling cooperative optimization. The procedure is illustrated in Figure 5.
As illustrated in the figure, the algorithm achieves a balance between global exploration and local optimization through a two-phase “diversification–convergence” strategy. During the diversification phase, population diversity is increased to promote broad exploration of the search space and identify multiple promising candidate solutions. In the subsequent convergence phase, these candidate solutions are refined to enhance solution quality and improve convergence efficiency. The detailed procedures of the diversification and convergence phases are depicted in Figure 6 and Figure 7, respectively.
In the global search phase, a diversification operation is first conducted. The population is divided into elite and temporary subpopulations according to fitness values, facilitating competitive interactions that help identify multiple promising global optima. If a temporary subpopulation outperforms the current elite subpopulation in terms of fitness, it replaces the elite group and is incorporated into the set of optimal solutions. Meanwhile, the temporary subpopulation with the lowest fitness is discarded and reinitialized across the entire solution space. Since this phase emphasizes search breadth and allows for larger parameter perturbations, the mutation mechanism from (29) is adopted to expand the search range.
In the local search phase, a convergence operation is carried out. This operation involves a two-stage refinement of the elite subpopulation obtained during the diversification phase. Guided by the fitness function, it seeks to further enhance solution quality and approximate the global optimum. To prevent excessive parameter perturbation, the crossover mechanism is applied, with the crossover rate calculated using (28), thereby improving the precision of local search. When the optimal individual remains stable across successive iterations and no longer changes, the subpopulation is considered matured, marking the completion of the convergence process.
In summary, the MEA-AGA is employed to optimize the NBPE parameter in Tongan speech recognition, with the objective of balancing recognition accuracy, decoding speed, and training cost. Prior to the optimization process, a fitness function is defined to comprehensively evaluate multiple factors in the recognition process, including model accuracy, decoding efficiency, and training time. The calculation formula is as follows:
$WER_{\text{val}}$, $WER_{\text{test}}$, $WPS$, and $T_{\text{train}}$ represent the word error rate on the validation set, the word error rate on the test set, the number of words recognized per second, and the model training time, respectively. The corresponding weight coefficients $w_1$, $w_2$, $w_3$, and $w_4$ are set to 0.3, 0.3, 0.3, and 0.1, so that each word error rate term carries the same weight as decoding speed, while training time receives relatively less emphasis. In addition, to eliminate the influence of differing units of measurement, WPS and training time are normalized to ensure that all evaluation indicators fall within a comparable range.
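To make the weighting concrete, the sketch below combines the four indicators with the stated weights. The exact functional form is an assumption: error rates and training time are turned into "higher is better" terms by subtracting from 1, and WPS and training time are min-max normalized across candidates. The candidate values are invented placeholders, not experimental results.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def fitness_scores(candidates, weights=(0.3, 0.3, 0.3, 0.1)):
    """Weighted score per candidate (assumed form; higher is better)."""
    w_val, w_test, w_wps, w_time = weights
    wps_norm = min_max([c["wps"] for c in candidates])
    time_norm = min_max([c["train_hours"] for c in candidates])
    scores = []
    for c, wps, t in zip(candidates, wps_norm, time_norm):
        scores.append(
            w_val * (1.0 - c["wer_val"])
            + w_test * (1.0 - c["wer_test"])
            + w_wps * wps
            + w_time * (1.0 - t)
        )
    return scores

# Placeholder candidates for two hypothetical NBPE settings.
candidates = [
    {"nbpe": 50, "wer_val": 0.30, "wer_test": 0.32, "wps": 40.0, "train_hours": 5.0},
    {"nbpe": 150, "wer_val": 0.25, "wer_test": 0.27, "wps": 60.0, "train_hours": 4.0},
]
scores = fitness_scores(candidates)
```

With these placeholder numbers the second candidate scores higher because it is better on every indicator.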
The overall procedure of the proposed optimization algorithm is summarized in Algorithm 2, and its detailed steps are described as follows:
(1) Parameter Initialization: Set the maximum number of iterations $n_{\max}$, population size $N$, crossover probability $P_c$, and mutation probability $P_m$.
(2) Population Initialization: Randomly generate 10 initial individuals $pop_1$ to $pop_{10}$.
(3) Fitness Evaluation: Calculate the fitness scores of all individuals in the population using Equation (30).
(4) Diversification Operation: Sort individuals based on their fitness scores. The top five individuals form the elite subpopulation $pop_{\text{elite}}$, and the bottom five form the temporary subpopulation $pop_{\text{temporary}}$. Determine whether the fitness score of the newly generated temporary subpopulation is higher than that of the elite subpopulation. If so, apply the mutation of Equation (29) to the lowest-scoring temporary individual to generate a new temporary subpopulation $pop_{\text{temporary}}$, and return to step (3); otherwise, the diversification phase is complete, and the elite subpopulation is finalized.
(5) Fitness Calculation of the New Elite Subpopulation: Recalculate the fitness scores of all individuals in the new elite group using Equation (30).
(6) Convergence Operation: Sort individuals in the elite subpopulation by fitness; the highest-scoring individual is identified as the winner. Determine whether the population has matured. If the winner changes, perform the crossover of Equation (28) on the remaining individuals to generate new individuals, and return to step (5); otherwise, the convergence phase is complete, and the global optimal individual is obtained.
(7) Termination Check: Determine whether the stopping condition is met. If the maximum number of generations has been reached, the process ends and the current optimal NBPE value is output; otherwise, return to step (3) to continue iteration.
(8) Dictionary Construction: Based on the optimal NBPE value, segment Tongan words and construct the Tongan dictionary, which serves as a standard for both training and inference, ultimately enabling accurate and efficient Tongan speech recognition.
Algorithm 2: MEA-AGA for Dictionary Parameter Optimization
Require: N, nmax, Pc, Pm, pop1~pop10
Ensure: Optimal NBPE
/* Stage 1: Initialization */
Initialize parameters; generate population pop1~pop10;
for iteration = 1 to nmax do
    /* Stage 2: Fitness Evaluation */
    Compute fitness for all individuals;
    /* Stage 3: Diversification */
    Rank population; select top 5 as popelite, others as poptemporary;
    if poptemporary better than popelite then
        Replace popelite; mutate poor individuals (see Equation (29));
    else
        Regenerate popelite;
        continue;
    end if
    /* Stage 4: Convergence */
    Optimal NBPE ⇐ NBPE of best individual;
    if not converged then
        crossover individuals (see Equation (28));
        continue;
    end if
    /* Stage 5: Termination */
    if termination met then
        break;
    end if
end for
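The elite/temporary loop of Algorithm 2 can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: the crossover and mutation probabilities are fixed rather than adapted, mutation is a uniform re-sample over the search range, crossover is a midpoint with the current winner, and `toy_fitness` is a hypothetical stand-in for the real fitness evaluation, which would require training a recognizer for each NBPE value.

```python
import random

def toy_fitness(nbpe):
    """Hypothetical stand-in for the recognition-based fitness; peaks at 150."""
    return -abs(nbpe - 150)

def mea_aga_search(fitness, low=20, high=500, pop_size=10, n_max=30,
                   p_c=0.8, p_m=0.3, seed=0):
    """Simplified elite/temporary search over integer NBPE candidates."""
    rng = random.Random(seed)
    population = [rng.randint(low, high) for _ in range(pop_size)]
    best = max(population, key=fitness)
    half = pop_size // 2
    for _ in range(n_max):
        # Rank and partition into elite and temporary subpopulations.
        population.sort(key=fitness, reverse=True)
        elite, temporary = population[:half], population[half:]
        # Diversification: re-sample some temporary individuals over the full range.
        temporary = [rng.randint(low, high) if rng.random() < p_m else ind
                     for ind in temporary]
        # Promote any temporary individual that now beats an elite one.
        merged = sorted(elite + temporary, key=fitness, reverse=True)
        elite, temporary = merged[:half], merged[half:]
        # Convergence: pull the other elites toward the winner by midpoint crossover.
        winner = elite[0]
        children = [(winner + ind) // 2 if rng.random() < p_c else ind
                    for ind in elite[1:]]
        population = [winner] + children + temporary
        best = max([best] + population, key=fitness)
    return best

best_nbpe = mea_aga_search(toy_fitness)
```

Keeping the winner unchanged in every iteration guarantees that the returned candidate is never worse than the best individual of the initial population.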