Article

A Readability-Driven Curriculum Learning Method for Data-Efficient Small Language Model Pretraining

Department of English Linguistics and Language Technology, Hankuk University of Foreign Studies, Seoul 02450, Republic of Korea
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(20), 3300; https://doi.org/10.3390/math13203300
Submission received: 21 September 2025 / Revised: 10 October 2025 / Accepted: 13 October 2025 / Published: 16 October 2025

Abstract

Large language models demand substantial computational and data resources, motivating approaches that improve the training efficiency of small language models. While curriculum learning methods based on linguistic difficulty measures have been explored as a potential solution, prior approaches that rely on complex linguistic indices are often computationally expensive, difficult to interpret, or fail to yield consistent improvements. Moreover, existing methods rarely incorporate the cognitive and linguistic efficiency observed in human language acquisition. To address these gaps, we propose a readability-driven curriculum learning method based on the Flesch Reading Ease (FRE) score, which provides a simple, interpretable, and cognitively motivated measure of text difficulty. Across two dataset configurations and multiple curriculum granularities, our method yields consistent improvements over baseline models without curriculum learning, achieving substantial gains on BLiMP and MNLI. Reading behavior evaluations also reveal human-like sensitivity to textual difficulty. These findings demonstrate that a lightweight, interpretable curriculum design can enhance small language models under strict data constraints, offering a practical path toward more efficient training.

1. Introduction

Large language models achieve strong performance across a variety of linguistic tasks, but only at the cost of extensive training resources and massive data consumption. This raises a central question in model design: how can we optimize training efficiency under limited data conditions? One promising direction for improving efficiency is curriculum learning, which organizes training data in a sequence that reflects increasing task difficulty. This paradigm has been shown to benefit both machine learning and cognitive modeling [1,2], highlighting data ordering as an effective strategy for guiding model learning in a stable and progressive manner. A central challenge in this approach lies in defining a principled measure of curriculum difficulty, since difficulty effectively acts as a regulator of distributional entropy during training. Therefore, the choice of difficulty metric can critically determine how efficiently a model learns under limited data conditions.
To address this challenge, we introduce a readability-based function that organizes training inputs by difficulty using the Flesch Reading Ease (FRE) score [3]. FRE quantifies textual comprehensibility through a regression formula grounded in human reading behavior, providing a principled metric that reflects developmental learning stages. Drawing inspiration from child language acquisition, we note that children achieve remarkable linguistic competence from limited input [4,5,6], suggesting that curricula modeled on human learning trajectories may yield more data-efficient and cognitively grounded training methodologies [7].
This observation is supported by developmental evidence showing that children can infer grammatical structure and derive meaning from remarkably sparse input, acquiring core elements of syntax and morphology within the first few years of life, despite limited vocabulary and exposure [4,5,6]. Children achieve this level of linguistic ability from relatively low amounts of input—especially when compared to modern language models, which often require well over 10,000 words of training data for every single word a 13-year-old child has encountered [7]. This gap in learning efficiency highlights a critical opportunity: by modeling training trajectories after human development, we may move toward training methodologies that are not only more data-efficient, but also cognitively grounded and empirically motivated.
Thus, we propose a curriculum learning framework based on established readability metrics, which we call the Reading-level Guided Curriculum. Among readability indices, we employ the Flesch Reading Ease (FRE) score, a widely validated psycholinguistic measure closely associated with human reading development. Owing to its simplicity and interpretability, FRE provides a cognitively grounded and reproducible basis for structuring our Reading-level Guided Curriculum.
Furthermore, unlike large-scale models that primarily aim to maximize performance through massive data consumption, our study deliberately constrains the training data to a human-comparable scale. The goal is not to rival the raw performance of large models, but rather to test whether cognitively inspired training trajectories—analogous to human language acquisition under limited exposure—can yield efficient learning and competitive improvements even in data-restricted conditions.
Our results also show that a curriculum guided by the psycholinguistic FRE score improves efficiency in low-resource settings while aligning with insights from human language development. This highlights the potential of our method as a structured and cognitively inspired training strategy for small language models. Our contributions can be summarized as follows:
  • We propose a curriculum that is substantially simpler and more interpretable than existing approaches.
  • By controlling for the confounding factor of large-scale data, we isolate pretraining methodology as the primary driver of improvement.
  • Unlike prior curriculum learning approaches that often yielded inconclusive results, our method produces clear and positive learning signals.

2. Related Works

2.1. Importance of Curriculum for Language Models

2.1.1. Curriculum Learning

Curriculum learning is a training strategy in which examples are presented in an organized order—from simpler to more complex—rather than randomly, allowing models to learn progressively as humans do [1].
The concept can also be understood as a form of continuation method. Instead of tackling a hard non-convex optimization problem directly, training begins with a simplified version of the problem and gradually shifts to the original one. This staged transition helps the model find better minima, achieving lower training loss and stronger generalization.
Formally, following Bengio et al. [1], let s denote a sample drawn from the target distribution D(s). At training stage α (0 ≤ α ≤ 1), the reweighted distribution is defined as:
Q_α(s) ∝ W_α(s) D(s),
where
  • s is a training sample;
  • D(s) is the original target distribution over samples;
  • W_α(s) is a weighting function that determines the importance of sample s at stage α;
  • Q_α(s) is the reweighted distribution at stage α.
The weighting function W_α(s) increases monotonically with α. At early stages, simpler data points are emphasized, while as α grows, increasingly complex examples are incorporated. At the final stage (α = 1),
Q_1(s) = D(s),
meaning that the model is trained on the full target distribution without any reweighting.
A curriculum is then characterized by the following two conditions. First, the entropy of the training distribution must increase monotonically over time:
H(Q_α) < H(Q_{α+ε}),  ∀ε > 0,
where H(Q_α) denotes the entropy of the reweighted data distribution Q_α, which reflects the overall diversity of training samples at stage α, and ε represents a small positive increment indicating the transition to a slightly later stage. This condition ensures that while early stages emphasize a narrow band of difficulty, later stages introduce broader diversity. Second, the weights of individual examples must also grow monotonically with α:
W_{α+ε}(s) ≥ W_α(s),  ∀s, ∀ε > 0,
where W_α(s) denotes the weighting function that determines the importance of each training sample s at stage α, and ε represents a small positive increment indicating a transition to a subsequent stage. Once an example is included, its importance does not diminish; instead, more difficult examples are layered on top, enabling the learner to build upon earlier knowledge.
In this way, curriculum learning provides a stable and intuitive framework that guides models to handle increasingly complex inputs, making it particularly well-suited for domains such as language modeling, where gradual acquisition plays a central role.
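To make the staged reweighting concrete, the short Python sketch below instantiates Q_α(s) ∝ W_α(s) D(s) with a binary weighting over toy difficulty scores; the scores and the linear threshold schedule are illustrative assumptions rather than the exact weighting of Bengio et al. [1].

import numpy as np

def curriculum_distribution(difficulty, alpha):
    """Return Q_alpha over samples, given per-sample difficulty scores in [0, 1].

    W_alpha(s) = 1 if the sample is easy enough for stage alpha, else 0, so weights
    grow monotonically with alpha; at alpha = 1 every sample is admitted and Q_1
    coincides with the (here uniform) target distribution D.
    """
    D = np.full(len(difficulty), 1.0 / len(difficulty))   # target distribution D(s)
    W = (difficulty <= alpha).astype(float)                # monotone weighting W_alpha(s)
    if W.sum() == 0:                                       # guard: always keep the easiest sample
        W = (difficulty == difficulty.min()).astype(float)
    Q = W * D
    return Q / Q.sum()                                     # normalized Q_alpha(s)

difficulty = np.array([0.1, 0.4, 0.9])                     # easy, medium, hard toy samples
for alpha in (0.2, 0.5, 1.0):
    print(alpha, curriculum_distribution(difficulty, alpha))
# The support (and hence the entropy) of Q_alpha grows as alpha increases.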

2.1.2. Linguistic Indicators in Curriculum Learning

Several studies have explored ways to leverage linguistic indicators for curriculum learning in order to improve model performance [8,9,10]. These works are often motivated by the intuition that, much like in human learning, models benefit from progressing from simpler inputs to more complex ones. What distinguishes this line of research from other curriculum methods is that difficulty is defined through linguistic indicators, which makes the curriculum not only an effective training strategy but also a linguistically meaningful one. While these studies share the same overarching motivation, they differ substantially in how linguistic difficulty is measured.
Oba et al. [9] operationalized difficulty through syntactic complexity, using measures such as dependency tree depth and the number of syntactic constituents to approximate stages of linguistic acquisition. While this syntactic approach offered a linguistically grounded perspective, the observed performance gains were relatively limited. Interestingly, other investigations [10] showed that even surface-level indicators such as sentence length and word rarity could yield improvements when used to design curricula. Recently, a multi-view curriculum framework was introduced in 2023 [8], drawing on more than 200 linguistic complexity indices. Their results showed meaningful gains across downstream tasks, underscoring the potential of linguistically rich curricula. At the same time, however, the reliance on such a large set of features made the framework less transparent and harder to apply.
These prior studies collectively demonstrate the promise of leveraging linguistic indicators for curriculum design, yet they also leave open questions about which indicators are most effective and how they should be operationalized. Syntactic-complexity-based curricula [9], while grounded in solid linguistic theory, translated into only modest performance gains in practice. At the other end of the spectrum, multi-view frameworks [8] that integrate hundreds of linguistic indices achieved stronger results but did so at the cost of interpretability and practical applicability, as their reliance on extensive feature sets made them cumbersome to adopt consistently. This tension suggests a broader gap: existing approaches either fall short in impact, oversimplify the notion of difficulty, or become too complex to implement reliably. Beyond such extremes, there is room to explore indicators that not only reflect surface-level statistics but also align more closely with cognitive perspectives on language learning.
Building on this view, we propose a Reading-level Guided Curriculum learning framework grounded in readability metrics that have been widely applied in educational and applied linguistic contexts. By relying on well-established readability measures, the framework keeps the training process computationally lightweight while remaining aligned with human intuitions about text difficulty. We investigate whether insights from human language development can promote more efficient learning dynamics in data-limited language models, regardless of model size or training scale.

2.2. Flesch Reading Ease Score

As discussed in the previous section, this study builds on insights from human developmental patterns and examines the Flesch Reading Ease (FRE) score as a readability-based measure that can serve as a cognitively grounded foundation. Among the many possible criteria, we focus on FRE for two main reasons: (1) it is an intuitive and quantitative metric that integrates sentence length and lexical complexity into a single interpretable value, and (2) it has been widely applied in educational contexts for English learners, aligning well with the philosophical goal of this work to treat language models as learning entities.

2.2.1. Interpretability Through Integration

The FRE score was originally designed to evaluate the accessibility of written English, integrating sentence length and lexical complexity into a single interpretable value. The metric was first developed using 363 passages from the McCall–Crabbs Standard Test Lessons in Reading, where children’s reading comprehension performance served as the foundation. The criterion, denoted as C_75, was defined as the average grade level at which children could correctly answer 75% of comprehension questions, providing a numerical benchmark for the grade level required to understand a given text.
To model the relationship between textual features and grade-level comprehension, a multiple regression analysis was conducted with two predictors: syllables per 100 words (wl) and average sentence length in words (si). The resulting regression equation was:
C_75 = 0.0846·wl + 0.1015·si − 5.6835   (multiple correlation R = 0.7047).
This indicates that texts with longer sentences and more syllabic words require higher grade levels to comprehend. To complement the regression analysis, Table 1 summarizes the descriptive statistics and pairwise correlations among the predictors (wl, si) and the criterion C_75.
As shown in the table, r(wl, si) = 0.4644 indicates a moderate positive association, meaning that passages with longer sentences also tend to contain more syllables per 100 words. Likewise, r(C_75, wl) = 0.6307 and r(C_75, si) = 0.6212 show that texts with more syllabic words and longer sentences tend to require higher grade levels.
Because this raw regression predicted grade levels rather than ease of reading, an additional transformation was introduced to yield an interpretable “ease” score. The predicted grade level was first multiplied by 10 to create a 0–100 scale (so that one point corresponds to one-tenth of a grade), and the signs of the predictors were reversed so that higher values would correspond to easier texts. The final FRE formula in terms of the original variables was:
FRE = 206.835 − 0.846·wl − 1.015·si.
As many implementations use average syllables per word (ASW = wl/100) rather than syllables per 100 words, the formula can equivalently be expressed as:
FRE = 206.835 − 1.015·ASL − 84.6·ASW,
where ASL denotes average sentence length (words per sentence) and ASW denotes average syllables per word.
In summary, the FRE score is derived through three steps: (1) defining a comprehension-based criterion from children’s performance, (2) linking this criterion to textual length features via regression, and (3) transforming the predicted grade level into a 0–100 ease scale. This process shows how sentence length and lexical complexity can be integrated into a single interpretable measure, thereby demonstrating interpretability through integration.
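For reference, the ease formula above can be computed directly from raw text. The following Python sketch is a minimal implementation; the vowel-group syllable counter is a crude heuristic, and the exact syllabification tool is not prescribed by the formula itself.

import re

def count_syllables(word):
    # Rough heuristic: count contiguous vowel groups, with a minimum of one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    asl = len(words) / len(sentences)                            # average sentence length (ASL)
    asw = sum(count_syllables(w) for w in words) / len(words)    # average syllables per word (ASW)
    return 206.835 - 1.015 * asl - 84.6 * asw

print(flesch_reading_ease("The cat sat on the mat. It was happy."))        # high score: easy text
print(flesch_reading_ease("Epistemological considerations notwithstanding, "
                          "the phenomenon resists straightforward categorization."))  # low score: hard text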

2.2.2. Practicality for Controlling Text Difficulty

As discussed above, the FRE score—originally designed to assess the accessibility of written English—has been widely adopted in educational contexts, particularly for evaluating materials targeted at children and second-language learners. In educational practice, it is employed to evaluate school texts for consistency with children’s developmental levels [11], with its role extending to research on child language development where FRE serves as an experimental tool for calibrating linguistic stimuli [12]. Studies on children’s reading comprehension and processing time often adopt FRE as a standardized scale for comparing textual difficulty [13].
While alternative readability or complexity indices, such as syntactic depth, lexical sophistication, or the multi-view linguistic feature sets mentioned above in Section 2.1.2, have also been explored in prior curriculum learning studies, these approaches often require extensive feature extraction, yield less interpretable signals, or introduce computational overhead. Notably, a recent study [14] also included FRE among multiple difficulty metrics and found it to be one of the most effective indicators for improving training efficiency. In contrast, the FRE score provides a single, transparent value that has been validated in both educational practice and psycholinguistic research. Its simplicity and interpretability make it particularly suitable for our goal of developing a lightweight yet cognitively motivated curriculum framework for small-model training.

3. Proposed Method

In this work, we extend the general framework of curriculum learning introduced by Bengio et al. [1] by incorporating the FRE score as a measure of linguistic difficulty. We introduce a curriculum parameter α (0 ≤ α ≤ 1), which controls the progression from easy to difficult inputs during training. At stage α, the distribution is defined as:
R_α(s) ∝ U_α(s) D(s),
where s denotes a training unit (e.g., a sentence or a paragraph, depending on the experimental setting), D(s) is the target distribution over such units, and U_α(s) is a weighting function explicitly determined by the FRE score of s. Intuitively, U_α(s) prioritizes easier units at earlier stages and gradually introduces more complex ones, thereby shaping a gradual progression of linguistic complexity.
At lower values of α , the weighting function favors sentences with higher FRE scores (i.e., easier sentences). As α increases, progressively lower FRE scores (i.e., more complex sentences) are incorporated. The progression of α thus corresponds to a transparent trajectory of increasing linguistic complexity:
  • α = 0: the distribution emphasizes easier sentences.
  • 0 < α < 1 : progressively more complex sentences are added.
  • α = 1 : the full target distribution D ( s ) is restored without any reweighting.
The mapping between FRE scores and the weighting function U α ( s ) can be illustrated through a simple example. Consider three sentences of different difficulty levels: Sentence A (FRE = 90, easy), Sentence B (FRE = 70, medium), and Sentence C (FRE = 40, hard). At the early stage of training ( α = 0 ), the weighting function prioritizes high-FRE sentences, and thus only Sentence A is emphasized. As α increases (e.g., α = 0.5 ), medium-level sentences such as Sentence B are also incorporated. Finally, at α = 1 , the full target distribution D ( s ) is recovered, and all sentences, including the more complex Sentence C, are included. This stepwise inclusion illustrates how the weighting function U α ( s ) is explicitly determined by the FRE score: higher values are favored earlier in the curriculum, and progressively lower values are introduced as training advances.
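Consistent with this worked example, the sketch below shows one simple way to realize U_α(s) as a hard FRE threshold that relaxes linearly with α; it is only an illustration, and the staged tertile construction actually used in the experiments is described in Section 4.

def u_alpha(fre, alpha, fre_max=90.0, fre_min=40.0):
    # Admit a unit once its FRE score exceeds a threshold that relaxes with alpha.
    # fre_max and fre_min are illustrative bounds matching Sentences A and C above.
    threshold = fre_max - alpha * (fre_max - fre_min)
    return 1.0 if fre >= threshold else 0.0

sentences = {"A": 90.0, "B": 70.0, "C": 40.0}
for alpha in (0.0, 0.5, 1.0):
    admitted = [name for name, fre in sentences.items() if u_alpha(fre, alpha)]
    print(f"alpha={alpha}: {admitted}")
# alpha=0.0: ['A']    alpha=0.5: ['A', 'B']    alpha=1.0: ['A', 'B', 'C']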
Unlike prior cognitively motivated curricula that struggled with effectiveness or relied on complex heuristics, our approach offers a lightweight and interpretable mechanism for controlling training difficulty. In this way, our formulation integrates FRE scores directly into the curriculum, providing a principled and interpretable mechanism for controlling linguistic difficulty during training.
By construction, R α preserves the two central conditions of curriculum learning: (1) monotonic increase of entropy—as α grows, the curriculum expands to cover a broader range of FRE scores; (2) monotonic growth of weights—once easy examples are introduced, their relative importance does not diminish, and harder examples are layered on top. Furthermore, we apply this simple equation to various curriculum designs and configurations in order to investigate its effectiveness in supporting efficient training and enhancing small model performance. Accordingly, the formulation offers a simple yet effective and interpretable approach to defining and regulating linguistic difficulty during training. The details of the experimental settings are presented in the following section.

4. Experimental Settings

To investigate how data diversity affects curriculum learning, we designed experiments under two dataset settings: single and merged. In the single setting, curricula were constructed at three units (sentence, group, and paragraph), while in the merged setting, only two units (sentence and group) were feasible. This results in five conditions in total, allowing us to analyze how reading-level guided difficulty ordering impacts learning. To operationalize text difficulty within these settings, we compute an FRE score for each text unit. We then sort all scored segments and divide them into three equal-sized bins (tertiles). As a result, the top third is categorized as easy, the middle third as medium, and the bottom third as hard. The specific difficulty boundaries used for the single and merged dataset settings are summarized in Table 2, which also reports the mean FRE score within each level.

4.1. Comparison of Settings

In our experiments, we consider a total of five configurations. First, based on the type of training dataset, we distinguish between two settings: the single setting, where curriculum learning is applied to a single corpus, and the merged setting, where different corpora are assigned to different curriculum levels to more closely simulate the developmental trajectory of child language acquisition. Within the single setting, the FRE score is computed at three units: sentence, group, and paragraph. In contrast, the merged setting includes only sentence- and group-based curricula, since the merged corpora do not consistently form coherent paragraphs. This yields three variants for the single setting and two for the merged setting, resulting in five configurations in total. This setup allows us to examine the impact of reading-level-guided difficulty ordering and to identify effective training strategies. The specific sources and characteristics of the datasets used in the single and merged settings are described in Section 4.2, while this section focuses on the three unit levels at which FRE is computed.
  • Group unit: Sentences are split and scored individually by FRE, then coarsely partitioned into three bins—easy, medium, and hard. Within each bin, samples are randomly shuffled without further ordering, resulting in a coarse-grained curriculum. This corresponds to the procedure of Algorithm 1 (group), where difficulty increases stage by stage, but fine-grained ordering is not enforced. Figure 1 shows the visualization of the overall training data organization, where the color shading indicates FRE difficulty (lighter = easier, darker = harder) and sentences within each stage are arranged without internal ordering.
  • Sentence unit: Sentences are again grouped into three stages, but unlike the group unit setting, each sentence is strictly ordered from easiest to hardest according to its FRE score. This corresponds to the procedure of Algorithm 2 (sentence), which enforces a fully ordered, fine-grained curriculum where difficulty increases both across and within stages. In Figure 1, this is represented by progressively darker sentence shading within each block, showing that the model processes inputs in a strictly increasing FRE sequence.
  • Paragraph unit: Instead of scoring at the sentence level, entire paragraphs are treated as indivisible units. FRE is computed across the full paragraph, and units are then partitioned into easy, medium, and hard stages based on tertile boundaries of the paragraph-level distribution. This corresponds to Algorithm 3 (paragraph), ensuring that contextual and semantic relations across sentences are preserved within each input. In Figure 1, paragraphs are highlighted as larger yellow-shaded blocks, emphasizing that multiple sentences are grouped and treated as one training unit rather than being split. The difficulty still follows the easy → medium → hard progression, but it is determined at the paragraph level.
As mentioned above, across all three settings, samples are ordered by their FRE values and split into three subsets—easy, medium, and hard—by dividing the ordered list into thirds. Training proceeds stage by stage for E_stage = 10 epochs, in the order easy → medium → hard.
Algorithm 1 Group
Require: D = {(x_i, y_i)}, epochs per stage E_stage
Ensure: Trained parameters θ*
 1: U ← ∅
 2: for each (x_i, y_i) ∈ D do
 3:     {s_{i,k}} ← SentenceSplit(x_i)
 4:     for each s_{i,k} do
 5:         FRE_{i,k} ← ComputeFRE(s_{i,k})
 6:         U ← U ∪ {(s_{i,k}, y_i, FRE_{i,k})}
 7: OrderBy(U, FRE, desc)
 8: Split U into thirds: (D_easy, D_med, D_hard)
 9: Shuffle each subset
10: Initialize parameters θ
11: for s ∈ {easy, med, hard} do
12:     for e ← 1 to E_stage do
13:         Update θ on D_s
14: return θ* ← θ
Algorithm 2 Sentence
Require: D = {(x_i, y_i)}, epochs per stage E_stage
Ensure: Trained parameters θ*
 1: U ← ∅
 2: for each (x_i, y_i) ∈ D do
 3:     {s_{i,k}} ← SentenceSplit(x_i)
 4:     for each s_{i,k} do
 5:         FRE_{i,k} ← ComputeFRE(s_{i,k})
 6:         U ← U ∪ {(s_{i,k}, y_i, FRE_{i,k})}
 7: OrderBy(U, FRE, desc)
 8: Split U into thirds: (D_easy, D_med, D_hard)
 9: Initialize parameters θ
10: for s ∈ {easy, med, hard} do
11:     for e ← 1 to E_stage do
12:         Update θ on D_s
13: return θ* ← θ
Algorithm 3 Paragraph
Require: D = {(x_i, y_i)}, epochs per stage E_stage
Ensure: Trained parameters θ*
 1: U ← ∅
 2: for each (x_i, y_i) ∈ D do
 3:     FRE_i ← ComputeFRE(x_i)
 4:     U ← U ∪ {(x_i, y_i, FRE_i)}
 5: OrderBy(U, FRE, desc)
 6: Split U into thirds: (D_easy, D_med, D_hard)
 7: Initialize parameters θ
 8: for s ∈ {easy, med, hard} do
 9:     for e ← 1 to E_stage do
10:         Update θ on D_s
11: return θ* ← θ
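The three procedures share most of their structure, so they can be rendered compactly in Python as sketched below; this assumes the flesch_reading_ease() helper from the sketch in Section 2.2.1, and the regex-based sentence splitter is a simplification of the actual preprocessing.

import random
import re

def build_curriculum(documents, unit="group", seed=0):
    """Return (easy, medium, hard) lists of text units, ordered for staged training."""
    if unit == "paragraph":                       # Algorithm 3: score whole paragraphs/documents
        units = list(documents)
    else:                                         # Algorithms 1 and 2: score individual sentences
        units = [s.strip() for d in documents
                 for s in re.split(r"(?<=[.!?])\s+", d) if s.strip()]

    # Order by FRE from easiest (highest score) to hardest, then split into tertiles.
    scored = sorted(units, key=lambda u: flesch_reading_ease(u) or 0.0, reverse=True)
    third = max(1, len(scored) // 3)
    easy, medium, hard = scored[:third], scored[third:2 * third], scored[2 * third:]

    if unit == "group":                           # Algorithm 1: coarse bins, shuffled internally
        rng = random.Random(seed)
        for bin_ in (easy, medium, hard):
            rng.shuffle(bin_)
    # unit == "sentence" (Algorithm 2) keeps the strict easiest-to-hardest order within bins.
    return easy, medium, hard

# Training then iterates easy → medium → hard, running E_stage = 10 epochs per stage.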

4.2. Datasets

4.2.1. Pretraining Datasets

Our dataset size was deliberately capped at 100 million words, which corresponds to roughly the number of words a 13-year-old child has been exposed to over their lifetime [15]. By constraining models to human-comparable input sizes, we aim to approximate plausible cognitive models of human learning. Training models on quantities of data closer to what humans actually encounter provides a valuable lens for understanding what enables humans to acquire language so efficiently, and can help illuminate the mechanisms underlying language learning.
Having defined the curriculum construction and training procedure within the scope of 100 M words, the subsequent experiments are divided according to two dataset settings. In the single setting, we used only the Cosmopedia [16] corpus, a synthetic text dataset that contains diverse formats such as textbooks, blogs, and stories generated with Mixtral-8x7B-Instruct-v0.1. Cosmopedia provides a sufficiently broad and coherent resource, which is expected to allow the model to capture general linguistic regularities.
In the merged setting, we constructed a mixture of corpora that naturally span a broad range of linguistic complexity. Specifically, we included data from CHILDES [17], Storybook [18], manually curated datasets from Gutenberg Children’s Literature (https://www.gutenberg.org/ebooks/bookshelf/20 (accessed on 6 May 2025)), and a subset of Cosmopedia, thereby covering a wide spectrum of linguistic variation.
We deliberately combined the above datasets that span different levels of readability, so that the merged corpus reflects a graded progression of linguistic complexity. Specifically, the sentences from CHILDES [17] and Storybook [18] primarily fall above an FRE score of 80, corresponding to the “easy” or “very easy” readability levels typically associated with fifth to sixth grade readers, and were included to provide the model with exposure to simple and coherent linguistic patterns resembling children’s early language input. The Gutenberg Children’s Literature corpus mostly occupies the FRE range of 40 to 80, covering “fairly easy” to “fairly difficult” levels, and was chosen to gradually introduce richer vocabulary and moderately complex syntax that reflect the progression to more advanced readers. Finally, a subset of Cosmopedia contributes a greater proportion of sentences below an FRE score of 40, representing challenging and low-readability content, thereby ensuring that the model is also exposed to advanced discourse structures and dense informational content (refer to Table 3 for the detailed interpretation of FRE scores). To verify that our readability-based categorization reflects consistent human perception, we conducted a human evaluation with three annotators, yielding a moderate–substantial inter-rater agreement (Fleiss’ κ ≈ 0.588).
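As a concrete illustration of these readability bands, the small sketch below maps an FRE score to the corresponding difficulty level using the thresholds of 80 and 40 quoted above; note that in the merged setting it is whole corpora, rather than individual scores, that are assigned to curriculum levels.

def merged_level(fre_score):
    if fre_score >= 80:        # range dominated by CHILDES and Storybook ("easy"/"very easy")
        return "easy"
    elif fre_score >= 40:      # range dominated by Gutenberg Children's Literature
        return "medium"
    else:                      # low-readability Cosmopedia subset
        return "hard"

print(merged_level(92.0), merged_level(63.5), merged_level(28.0))   # easy medium hard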

4.2.2. Evaluation Datasets

We evaluate our models using datasets from a subset of GLUE [19] and SuperGLUE [20], including seven tasks: BoolQ, MNLI, MRPC, MultiRC, QQP, RTE, and WSC, together with three additional zero-shot tasks: BLiMP, EWOK, and Reading.
GLUE provides a collection of nine natural language understanding tasks covering textual entailment, similarity, and classification, and has become a widely adopted benchmark for evaluating pretrained models on a diverse range of linguistic phenomena [19]. SuperGLUE consists of more challenging tasks that require deeper reasoning and diverse task formats [20]. For our evaluation, CoLA, SST2, MNLI-mm, and QNLI were not included, as these tasks are highly correlated with other datasets such as BLiMP or MNLI. Overall, adopting GLUE as one of the evaluation tasks provides a standardized and comprehensive testbed for evaluating whether our curriculum-learning approach leads to general improvements across diverse NLU tasks.
In addition to these established benchmarks, we further evaluate on three targeted zero-shot tasks, which enable us to measure the models’ ability to generalize beyond training without task-specific tuning.
BLiMP [21] is a suite of minimal pairs targeting grammatical phenomena. It evaluates whether models prefer the grammatically correct sentence over an ungrammatical counterpart. Since BLiMP directly measures fine-grained grammatical generalization, it is especially relevant for testing whether a readability-based curriculum strengthens models’ sensitivity to core linguistic rules. EWOK [22] assesses models’ world knowledge by testing their ability to distinguish plausible from implausible contexts. Evaluating factual and commonsense reasoning beyond pure syntax or semantics, EWOK tests whether gains from our training method extend to broader knowledge grounding.
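For illustration, minimal-pair scoring of this kind can be sketched with the Hugging Face transformers API as follows; the stock pretrained GPT-2 checkpoint is used here purely as a stand-in, whereas the reported results evaluate the checkpoints pretrained under the 100 M word constraint.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence):
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss       # mean negative log-likelihood per predicted token
    return -loss.item() * (ids.size(1) - 1)      # total log-probability of the sentence

good = "The cats that the dog chases are hungry."
bad = "The cats that the dog chases is hungry."
print(sentence_logprob(good) > sentence_logprob(bad))    # a well-trained LM should print True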
Reading [23] contains 205 English sentences (1726 words), for which cloze probabilities, predictability ratings, and computational surprisal estimates are aligned with behavioral and neural measures. Crucially, the reading component includes two complementary tasks: self-paced reading (SPR), which records reaction times as participants reveal words one by one, and eye-tracking, which captures fine-grained gaze measures such as fixations and regressions. Whereas SPR reflects more controlled, consciously paced reading behavior, eye-tracking captures more natural and immediate processing dynamics.
The evaluation of the reading task followed the framework of the BabyLM 2025 Challenge [24], where model predictions are assessed using regression analyses that measure the increase in explained variance ( Δ R 2 ) when surprisal is added as a predictor for human reading measures. Specifically, eye-tracking variables are analyzed without spillover effects, whereas self-paced reading includes a one-word spillover term to account for delayed processing influences from the previous word. The spillover effect captures the phenomenon that cognitive load from a word can extend to the subsequent word, influencing its reading time. Higher Δ R 2 values indicate that model-derived surprisal explains more variance in reaction times or gaze durations, providing a cognitively grounded test of how closely model processing aligns with human processing dynamics.
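A simplified sketch of this Δ R² computation is given below; the baseline feature set and the toy data are assumptions for illustration only, and the reported numbers follow the official BabyLM 2025 evaluation pipeline [24].

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def delta_r2(reading_times, baseline_feats, surprisal, spillover=True):
    # Optionally add a one-word spillover term (previous word's surprisal), as for SPR.
    surp = surprisal.reshape(-1, 1)
    if spillover:
        prev = np.roll(surprisal, 1)
        prev[0] = 0.0
        surp = np.column_stack([surp, prev])
    full_feats = np.column_stack([baseline_feats, surp])
    base = LinearRegression().fit(baseline_feats, reading_times)
    full = LinearRegression().fit(full_feats, reading_times)
    r2_base = r2_score(reading_times, base.predict(baseline_feats))
    r2_full = r2_score(reading_times, full.predict(full_feats))
    return r2_full - r2_base                      # gain in explained variance from surprisal

# Toy usage with synthetic data; real inputs are per-word measures from [23].
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 2))                 # e.g., word length and log frequency (assumed)
surpr = rng.normal(size=200)
rts = 300 + 20 * surpr + feats @ np.array([5.0, -3.0]) + rng.normal(size=200)
print(delta_r2(rts, feats, surpr))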
For all reported results, we averaged performance over three random seeds to ensure fairness. Details of the hyperparameter settings used for fine-tuning on GLUE tasks are provided in Appendix A.

4.3. Backbone Models

We conducted experiments using two widely adopted architectures: GPT-2 [25] and BERT [26]. These models served as the backbones for all curriculum and baseline training runs. GPT-2 was used for autoregressive language modeling, while BERT was used for masked language modeling in a bidirectional setting. Details of model hyperparameters are provided in Table A1.
Choosing these two complementary architectures allowed us to examine the effectiveness of curriculum learning across both the encoder-based and decoder-based paradigms. We deliberately employed moderate-sized, widely used models rather than larger state-of-the-art models, since smaller models provide a controlled environment in which the effects of curriculum learning can be isolated from confounds of extreme model capacity under the 100 M word constraint. In addition, given our deliberate 100 M word limitation, employing much larger recent models would be suboptimal, since their parameter scale requires substantially more data to train effectively. Using moderate-sized, widely validated models thus provides a more controlled testbed for evaluating the specific contribution of curriculum learning.
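For reference, the backbone configurations in Table A1 can be instantiated approximately as follows with the Hugging Face transformers configuration classes; settings not listed in Table A1 fall back to library defaults and may differ from the exact training code.

from transformers import BertConfig, BertForMaskedLM, GPT2Config, GPT2LMHeadModel

gpt2_cfg = GPT2Config(vocab_size=50258, n_embd=768, n_layer=12, n_head=12,
                      resid_pdrop=0.1, embd_pdrop=0.1, attn_pdrop=0.1,
                      layer_norm_epsilon=1e-5, initializer_range=0.02)   # values from Table A1
bert_cfg = BertConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12,
                      num_attention_heads=12, hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1, layer_norm_eps=1e-12,
                      initializer_range=0.02)

gpt2 = GPT2LMHeadModel(gpt2_cfg)      # autoregressive backbone, roughly 124 M parameters
bert = BertForMaskedLM(bert_cfg)      # masked-LM backbone, roughly 110 M parameters
print(sum(p.numel() for p in gpt2.parameters()), sum(p.numel() for p in bert.parameters()))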

5. Results

5.1. The Result of the Single Setting

In this section, we present the analysis of the single setting, where curriculum learning is applied to a single dataset, one of the two configurations defined by dataset composition. We first examine (1) zero-shot tasks to observe the intrinsic performance without relying on fine-tuning data, followed by (2) GLUE tasks that require fine-tuning, and finally (3) a detailed analysis to gain deeper insights into the effectiveness of our curriculum. All results reported in this section are averaged over three independent runs with different random seeds, ensuring that the improvements are not due to chance from a single run. Cross-validation was not applied, as it is computationally prohibitive in pretraining settings; the multi-seed evaluation serves as a practical and widely accepted alternative.

5.1.1. Evaluation on Zero-Shot Tasks

We evaluated six zero-shot tasks, as shown in Table 4, after training a small language model using a curriculum structured from easier to more difficult documents and sentences based on their FRE scores. Among the various training strategies, the paragraph setting—where the curriculum was constructed at the paragraph level—consistently yielded superior performance across most tasks regardless of model architecture. In particular, substantial improvements were observed in tasks evaluating the grammatical competence of language models. For BLiMP, the GPT-2 architecture achieved a 12.31% improvement over the baseline, while the BERT architecture achieved a 19.83% improvement. Similarly, on the more grammatically demanding BLiMP Supplement task, performance increased by 11.18% and 12.80%, respectively.
Furthermore, on the EWoK task, which evaluates factual knowledge acquisition, the FRE-based curriculum also resulted in performance improvements, suggesting that utilizing readability indicators—commonly applied to evaluate the difficulty of children’s books—can have practical benefits for knowledge-intensive tasks in small language models.

5.1.2. Evaluation on GLUE Tasks

Table 5 demonstrates that within the single setting, our curriculum learning approach can positively influence natural language understanding capabilities. In particular, the paragraph unit setting—where the curriculum was constructed based on FRE scores computed at the paragraph unit—achieved an average improvement of +6.07 over the baseline, delivering superior performance across most subtasks. While there was a slight drop in tasks such as WSC, the overall improvements were far greater. This suggests that in NLU tasks, not only text difficulty but also contextual understanding across consecutive sentences plays a crucial role.
A striking example is the Multi-Genre Natural Language Inference (MNLI) task, which showed an impressive improvement of 16.87. Since MNLI requires the model to classify the relationship between a pair of sentences as entailment, contradiction, or neutral, semantic and contextual reasoning is essential. For example, given the premise “How do you know? All this is their information again.” and the hypothesis “This information belongs to them.”, the correct label is entailment. This is because the premise already presupposes that the information belongs to “them,” and the hypothesis simply restates this fact in a more concise way. In other words, if the premise is true, the hypothesis must also be true, thereby establishing an entailment relation.
Since tasks like MNLI rely heavily on semantic and contextual inference between sentence pairs, preserving broader context and capturing complex structures are critical. This explains why paragraph-level training, which better retains contextual integrity, likely outperformed sentence-level training in this setting.

5.2. Detailed Analysis: The Effect of Reading-Level Guided Curriculum

5.2.1. Evaluation on Zero-Shot Tasks

In Table 6, we analyze whether each curriculum level exerts an appropriate positive effect when training with the paragraph unit, which showed the best performance in the single setting. The table reports the performance on three zero-shot tasks (BLiMP, BLiMP-S, EWoK) at each curriculum level (easy, medium, hard), relative to the baseline.
As the curriculum progresses, we observe consistent performance gains across both model architectures, indicating that each stage contributes positively in a balanced manner. At the easy level especially, tasks involving grammaticality judgment, such as BLiMP and BLiMP-S, show notable improvements: GPT-2 achieves gains of 8.86% and 11.96%, while BERT improves by 8.18% and 11.18%, respectively. These results suggest that grammaticality-related tasks particularly benefit from exposure to linguistically simpler text in the early stages of the curriculum. In contrast, the EWoK dataset, which evaluates factual knowledge acquisition, shows only a marginal improvement of 0.31, implying that such tasks may be less sensitive to gains from easier text and instead require more complex input to yield substantial benefits.

5.2.2. Evaluation on Reading Task

Table 7 shows the performance of the reading task in the single setting, where curriculum learning was conducted using FRE scores computed at the paragraph level. The task evaluates how closely the model mirrors human-like perceptions of textual difficulty, comparing results against the baseline after each curriculum stage.
Beyond grammaticality judgment tasks, improvements were also observed in the reading tasks, which measure how closely language models align with human processing patterns. In the baseline without a curriculum, the scores remained close to zero. However, when training was structured according to decreasing FRE scores (that is, progressing from easier to more difficult texts), performance consistently improved across all settings. Notably, the paragraph setting yielded the strongest gains, achieving improvements of 4.12 on the eye-tracking task and 2.37 on the self-paced reading task compared to the baseline. These results indicate that language models trained with an FRE-based curriculum tend to struggle at similar points as humans, such as on passages where readers naturally slow down and fixate longer due to perceived difficulty. In this sense, the approach encourages models to exhibit processing patterns more closely aligned with human cognitive responses.

5.2.3. Evaluation on GLUE Task

Table 8 presents the performance on the GLUE subtasks after fine-tuning with their respective training datasets, following curriculum learning based on FRE scores computed at the paragraph level in the single setting. Similar to the earlier zero-shot tasks, all curriculum levels show improvements over the baseline, and furthermore, performance steadily increases as the curriculum progresses, reaching 63.94, 66.01, and 67.46 at the easy, medium, and hard levels, respectively.
Among the notable observations, the first is that the WSC task diverges from the overall trend of gradual improvement. Because this task requires reasoning about the referents of pronouns (coreference resolution), the easy-level data—consisting of simpler structures and clearer sentences—appears to support the learning of “surface-level” coreference patterns. However, as training progresses and later stages become dominated by more complex data with long sentences and nested clauses, the model may dilute the coreference signal or absorb additional noise, leading to reduced performance. Second, for the MNLI task, we observe a sharp increase beginning at the medium stage. Since this task relies heavily on contextual inference, this suggests that a certain level of curriculum progression is necessary before robust performance emerges.

5.3. The Result of Merged Setting

5.3.1. Evaluation on Zero-Shot Tasks

Table 9 shows that our curriculum learning approach in the merged setting demonstrates strong performance on three zero-shot evaluation tasks: BLiMP, BLiMP Supplement, and EWoK. Across most evaluations, the grouped setting consistently outperformed the others, providing empirical evidence that our approach offers a cognitively aligned learning strategy that enables small models to perform better on zero-shot tasks without any example-based supervision.
An interesting observation is the contrast between the single setting and the merged setting. While the paragraph unit proved highly effective in the single setting, its performance was less stable in the merged setup. Unlike the single dataset, which is composed of coherent documents from a single source, the merged dataset combines documents from diverse domains and contexts. In such cases, although sentences are sorted by FRE scores, the disruption of contextual continuity appears to negatively impact learning.

5.3.2. Evaluation on GLUE Task

Table 10 demonstrates that the reading-level guided curriculum learning approach, when applied to models trained on our merged setting, leads to improved natural language understanding in BERT. Averaged GLUE scores reveal a clear trend: as the curriculum becomes more fine-grained—specifically when FRE scores are applied at the sentence level—performance improves. This suggests that our curriculum strategy contributes meaningfully to enhancing language understanding in small-scale models.
Among the tasks, MNLI and QQP exhibited relatively weaker performance. As shown in Figure 2, these datasets have notably low single-token coverage, with less than half of the unique words preserved as individual tokens. This high degree of subword fragmentation likely hindered the model’s ability to capture word-level semantics. In contrast, BoolQ did not suffer a performance drop. Our analysis suggests that, despite the impact of token fragmentation in MNLI and QQP, the model’s early exposure to simpler sentences—analogous to human child language acquisition—helped form a stronger linguistic foundation, leading to improved reasoning and inference in BoolQ.

6. Conclusions

This study demonstrated that a reading-level guided curriculum, based on the Flesch Reading Ease (FRE) score, can significantly enhance the learning efficiency of small language models under data-constrained conditions. The work makes three key contributions through a curriculum learning approach based on the FRE score. First, by leveraging a single readability metric—the FRE score—we propose a curriculum that is substantially simpler and more interpretable than existing approaches. Second, by restricting pretraining data to 100 M tokens, we control for the factor of large-scale data and isolate pretraining methodology as the primary driver of improvement. Third, unlike prior curriculum learning studies that often yielded inconclusive results, this work provides clear and positive learning signals, demonstrating the effectiveness of readability-based curricula. In particular, our approach improved both grammatical generalization and factual reasoning in zero-shot and downstream settings, with gains of up to +19.83% on tasks like BLiMP and consistent improvements on GLUE benchmarks.
In addition to these contributions, applying the curriculum across different levels of granularity—ranging from sentences to grouped segments and paragraphs—and under both single-source and heterogeneous datasets revealed coherent patterns of improvement. These outcomes emphasize that the value of this work lies not only in the method itself but also in its systematic application under diverse training conditions. Thus, the findings highlight that a carefully designed, readability-based curriculum can translate methodological simplicity into tangible efficiency gains and stronger generalization, offering a practical and cognitively motivated pathway for advancing small language models.
At a more fine-grained level, the experimental results also suggest concrete guidance for practice: the effectiveness of our curriculum varied depending on the dataset composition. In the single setting—where training data came from a coherent source—the paragraph-unit curriculum consistently yielded robust gains.
In contrast, for merged datasets drawn from heterogeneous sources, the grouped-level curriculum was more effective. Maintaining internal coherence within each phase led to more stable improvements. Therefore, when applying our method, we found the following:
  • For a single setting, adopting the paragraph unit allows for fine-grained progression and leads to strong generalization.
  • For a merged setting, it is more desirable to construct each curriculum level with the group unit from contextually similar sources to preserve coherence within each level.
In addition, our tokenizer coverage analysis in Figure 2 revealed that tasks with low single-token preservation (e.g., MNLI and QQP) exhibited diminished performance, suggesting that excessive subword segmentation may hinder semantic understanding in small models. This finding emphasized the importance of aligning tokenization granularity with curriculum design.
While our method demonstrated strong zero-shot and fine-tuning performance, limitations remained. The current framework relies on surface-level readability metrics and does not yet incorporate deeper semantic or discourse-level complexity. In addition, it remains to be verified whether the proposed approach can scale effectively to larger datasets and more powerful model architectures. Beyond scalability, since the Flesch Reading Ease score is an English-specific readability measure, its cross-linguistic applicability remains to be verified. Investigating cognitively grounded readability metrics in other languages will be an important step toward extending the framework’s generality beyond English. Future work will also explore its applicability to a broader range of downstream tasks, including question answering and summarization, as well as its extension to multilingual, instruction-tuned, or few-shot settings. Nevertheless, the simplicity, interpretability, and architecture-agnostic nature of our approach position it as a promising framework not only for advancing cognitively plausible and data-efficient NLP systems but also as a practical option in resource-limited environments where small-scale models must be deployed.

Author Contributions

Conceptualization, S.K. and J.K.; methodology, S.K. and J.P.; software, S.K. and J.P.; validation, S.K.; formal analysis, S.K.; investigation, S.K. and J.P.; data curation, J.P.; writing—original draft preparation, S.K. and J.P.; writing—review and editing, J.K.; visualization, S.K. and J.P.; supervision, J.K.; project administration, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Hankuk University of Foreign Studies Research Fund (of 2025).

Data Availability Statement

Publicly available datasets were analyzed in this study. These datasets can be found at the following sources (accessed on 6 May 2025): https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus, https://www.gutenberg.org/ebooks/bookshelf/20, and https://www.kaggle.com/datasets/edenbd/children-stories-text-corpus.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FRE      Flesch Reading Ease
CL       Curriculum Learning
GLUE     General Language Understanding Evaluation
BLiMP    The Benchmark of Linguistic Minimal Pairs for English
EWOK     Elements of World Knowledge
MRPC     Microsoft Research Paraphrase Corpus
MNLI     Multi-Genre Natural Language Inference Corpus
MultiRC  Multi-sentence Reading Comprehension
QQP      Quora Question Pairs
RTE      Recognizing Textual Entailment
WSC      Winograd Schema Challenge

Appendix A. Hyperparameters

Table A1 presents the hyperparameters used for our training models.
Table A1. Hyperparameters used for GPT-2 and BERT models during training.

Model                   GPT-2          BERT-Base
parameters              124 M          110 M
vocab size              50,258         30,522
hidden size             768            768
heads                   12             12
layers                  12             12
dropout                 0.1            0.1
layer norm eps          1 × 10−5       1 × 10−12
initializer range       0.02           0.02
Optimizer
algorithm               AdamW          AdamW
learning rate           5 × 10−5       5 × 10−5
betas                   (0.9, 0.999)   (0.9, 0.999)
weight decay            0.0            0.0
Scheduler
type                    linear         linear
Training
gradient accumulation   4              4
epochs                  10             10
batch size              32             32
line by line            true           true
NGPU                    2              2

Appendix B. Evaluation Dataset Sizes

Table A2. Evaluation dataset sizes. Each value indicates the number of samples used for evaluation in each benchmark.

Dataset    Number of Samples
BLiMP      59,875
BLiMP-S    5218
EWoK       7618
Reading    1725
WUG        250
GLUE       92,959

Appendix C. List of the Titles of Novels Manually Curated from Gutenberg Children’s Literature

Table A3 and Table A4 show the lists of the novels used in our medium-FRE content, which were manually curated from Gutenberg Children’s Literature.
Table A3. List of novels collected from Gutenberg Children’s Literature (Part 1).
Creatures That Once Were MenThe Odyssey
gutenberg filtered 1927 2178A Christmas Carol in Prose; Being a Ghost Story of Christmas
Uncle SamStory Hour Readings: Seventh Year
Boycotted, and Other StoriesThe King in Yellow
Anne of Green GablesThe Head of Kay’s
The Wonderful Wizard of Oz & The Rover Boys on the Farm; or, Last Days at Putnam Hall
Moby Multiple Language Lists of Common Words & My Life — Volume 1
An Old Chester SecretLiterature for Children
The Adventures of Sherlock HolmesA Princess of Mars
Ernest Bracebridge: School Days & Goose-Quill Papers
Anna KareninaBeyond Good and Evil
Leviathan & The New Girl at St. Chad’s: A Story of School Life
Gulliver’s Travels into Several Remote Nations of the WorldOn Liberty
What Katy Did at SchoolPride and Prejudice
The Romance of Lust: A classic Victorian erotic novelThe Philippines a Century Hence
The Camp Fire Girls at School; Or, The Wohelo WeaversThe Land of Little Rain
Le Morte d’Arthur: Volume 2The Adventures of Tom Sawyer, Complete
The PrincessThe Further Adventures of Robinson Crusoe
Bert Wilson on the GridironJo’s Boys
Little Men: Life at Plumfield With Jo’s BoysTime Enough at Last
A Day at Camp Killkare; Or, Aunt Jane and the Campfire GirlsGrimms’ Fairy Tales
Christmas Speakin’ at Skaggs’s SkuleJust Patty
Our Home and Personal DutyStory Lessons on Character-Building (Morals) and Manners
The Adventures of a Three-Guinea WatchThe Moon and Sixpence
The PriceThe Youngest Girl in the Fifth: A School Story
A brother to dragons, and other old-time talesThe Decameron of Giovanni Boccaccio
The World I Live InAround the World in Eighty Days
The Girl from MontanaAudrey
Lessons on Manners for School and Home UseGames Without Music for Children
Beowulf: An Anglo-Saxon Epic PoemThe Willoughby Captains
Adventures of Huckleberry FinnA Tale of Two Cities
Etiquette Made EasyThe Cock-House at Fellsgarth
Etheldreda the Ready: A School StoryThe Rover Boys at Colby Hall; or, The Struggles of the Young Cadets
Heart of DarknessMeditations
The Importance of Being Earnest: A Trivial Comedy for Serious PeopleThe Triple Alliance, Its Trials and Triumphs
A Prefect’s UncleA Popular Schoolgirl
The Flower-Patch Among the HillsFred Fenton on the Track; Or, The Athletes of Riverport School
Le Morte d’Arthur: Volume 1The Count of Monte Cristo
UlyssesSecond Treatise of Government
Tom and Some Other Girls: A Public School StoryMike and Psmith
Jack of Both Sides: The Story of a School WarThe Gold Bag
The Rebel of the SchoolBetty Gordon at Boarding School; Or, The Treasure of Indian Chasm
Practical EtiquetteThe Luckiest Girl in the School
Don QuixoteActon’s Feud: A Public School Story
For the Sake of the SchoolThe Cricket on the Hearth
Daddy-Long-LegsAmaryllis at the Fair
The PothuntersThe Mary Frances first aid book
Table A4. List of novels collected from Gutenberg Children’s Literature (Part 2).
Great Expectations | The Boys of Bellwood School; Or, Frank Jordan’s Triumph
Eric, or Little by Little | Emma
Ontario Teachers’ Manuals: Literature | Marjorie Dean, High School Freshman
Dreams and Dust | Interrupted
Walden, and On The Duty Of Civil Disobedience | gutenberg filtered 1343 1497
The Confessions of St. Augustine | Sara Crewe; Or, What Happened at Miss Minchin’s Boarding School
The Politeness of Princes, and Other School Stories | The Rover Boys Under Canvas; Or, The Mystery of the Wrecked Submarine
When Patty Went to College | A Study in Scarlet
The divine comedy | The Yellow Wallpaper
War and Peace | Louis’ School Days: A Story for Boys
Eric; Or, Little by Little | Frankenstein; Or, The Modern Prometheus
Life of St. Francis of Assisi | The Expedition of Humphry Clinker
Wuthering Heights | The Republic
Paradise Lost | The Story of Chautauqua
Fred Fenton on the Crew; Or, The Young Oarsmen of Riverport School | Round the World in Eighty Days
Scouting for Girls Adapted from Girl Guiding | The Reign of Greed
Jane Eyre: An Autobiography | The Brothers Karamazov
Glyn Severn’s Schooldays | Camp and Trail
Oliver Twist | The Scarlet Letter
Grace Harlowe’s Senior Year at High School | Educational Work of the Boy Scouts
Wilton School; or, Harry Campbell’s Revenge | Gulliver’s Travels into Several Remote Regions of the World
Aunt Crete’s Emancipation | Monitress Merle
Aunt Jane’s Nieces Abroad | The Mystery of Mary
The House Behind the Cedars | The Iliad
Mike | The Princess of the School
Caleb Wright: A Story of the West | Tom Brown’s School Days
gutenberg filtered 1618 1878 | Parkhurst Boys, and Other Stories of School Life
Little Women | Plays
Thus Spake Zarathustra: A Book for All and None | Little Men: Life at Plumfield with Jo’s Boys
The Hound of the Baskervilles | The Grammar School Boys of Gridley; or, Dick & Co. Start Things Moving
Ontario Teachers’ Manuals: Household Management | Follow My Leader: The Boys of Templeton
Teaching the Child Patriotism | The Rover Boys at School; Or, The Cadets of Putnam Hall
Aunt Jane’s Nieces in the Red Cross | Twenty years after
The White Feather | Education in the Home, the Kindergarten, and the Primary School
Pioneer Life in Illinois | Scouting For Girls, Official Handbook of the Girl Scouts
My Life — Volume 2 | The Witness
The Prince | The Rover Boys in Camp; or, The Rivals of Pine Island
The Adventures of Pinocchio | The Mystery at Putnam Hall: The School Chums’ Strange Discovery
The Gold Bat | The Chautauqua Girls At Home
Treasure Island | Camps and Trails
Dubliners | The Princess Aline
Charlotte Temple | My Friend Smith: A Story of School and City Life
Our town and civic duty | The Wheat Princess
The Souls of Black Folk | Spoon River Anthology
The Minister’s Wooing | Tales of St. Austin’s
The Secret Garden | Careers of Danger and Daring
Les Misérables | The Emerald City of Oz

References

1. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In ICML’09: Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; ACM: New York, NY, USA, 2009; pp. 41–48.
2. Soviany, P.; Ionescu, R.T.; Rota, P.; Sebe, N. Curriculum Learning: A Survey. arXiv 2022, arXiv:2101.10382.
3. Flesch, R. A new readability yardstick. J. Appl. Psychol. 1948, 32, 221–233.
4. Fenson, L.; Dale, P.S.; Reznick, J.S.; Bates, E.; Thal, D.J.; Pethick, S.J.; Tomasello, M.; Mervis, C.B.; Stiles, J. Variability in early communicative development. Monogr. Soc. Res. Child Dev. 1994, 59, 1–185.
5. Brown, R. Development of the first language in the human species. Am. Psychol. 1973, 28, 97.
6. Crain, W. Theories of Development: Concepts and Applications; Routledge: New York, NY, USA, 2015.
7. BabyLM Team. BabyLM Turns 3: Call for Papers for the 2025 BabyLM Workshop. 2025. Available online: https://babylm.github.io/ (accessed on 31 August 2025).
8. Elgaar, M.; Amiri, H. Ling-CL: Understanding NLP Models through Linguistic Curricula. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 13526–13542.
9. Oba, M.; Haga, A.; Fukatsu, A.; Oseki, Y. BabyLM challenge: Curriculum learning based on sentence complexity approximating language acquisition. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Singapore, 6–7 December 2023; Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., Williams, A., Linzen, T., et al., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 290–297.
10. Platanios, E.A.; Stretcu, O.; Neubig, G.; Poczos, B.; Mitchell, T. Competence-based Curriculum Learning for Neural Machine Translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 1162–1172.
11. Johnson, R.E. How Readable Are Our Elementary Social Studies Textbooks? In Proceedings of the International Reading Association Conference, Anaheim, CA, USA, 6–9 May 1970; ERIC Document ED043459; Institute of Education Sciences, U.S. Department of Education: Washington, DC, USA. Available online: https://eric.ed.gov/?id=ED043459 (accessed on 6 May 2025).
12. Leonard, M.A. Parent–Child Storytelling During Joint Picture-Book Reading and Relation to Language Scores of Children with ADHD. Master’s Thesis, University of Kentucky, Lexington, KY, USA, 2005.
13. Zainurrahman, Z.; Yusuf, F.N.; Sukyadi, D. Text readability: Its impact on reading comprehension and reading time. J. Educ. Learn. (EduLearn) 2024, 18, 1422–1432.
14. Zhang, Y.; Mohamed, A.; Abdine, H.; Shang, G.; Vazirgiannis, M. Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning. arXiv 2025, arXiv:2506.11300.
15. Choshen, L.; Cotterell, R.; Hu, M.Y.; Linzen, T.; Mueller, A.; Ross, C.; Warstadt, A.; Wilcox, E.; Williams, A.; Zhuang, C. The 2nd BabyLM challenge: Sample-efficient pretraining on a developmentally plausible corpus. arXiv 2024, arXiv:2404.06214.
16. Ben Allal, L.; Lozhkov, A.; Penedo, G.; Wolf, T.; von Werra, L. SmolLM-Corpus. 2024. Available online: https://huggingface.co/blog/smollm (accessed on 6 May 2025).
17. MacWhinney, B. The CHILDES Project: Tools for Analyzing Talk, 3rd ed.; Lawrence Erlbaum Associates: Mahwah, NJ, USA, 2000.
18. edenbd. Children Stories Text Corpus; Kaggle Dataset: San Francisco, CA, USA, 2019.
19. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; Linzen, T., Chrupała, G., Alishahi, A., Eds.; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 353–355.
20. Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. arXiv 2020, arXiv:1905.00537.
21. Warstadt, A.; Parrish, A.; Liu, H.; Mohananey, A.; Peng, W.; Wang, S.F.; Bowman, S.R. BLiMP: The benchmark of linguistic minimal pairs for English. Trans. Assoc. Comput. Linguist. 2020, 8, 377–392.
22. Ivanova, A.A.; Sathe, A.; Lipkin, B.; Kumar, U.; Radkani, S.; Clark, T.H.; Kauf, C.; Hu, J.; Pramod, R.T.; Grand, G.; et al. Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models. arXiv 2024, arXiv:2405.09605.
23. de Varda, A.G.; Marelli, M.; Amenta, S. Cloze probability, predictability ratings, and computational estimates for 205 English sentences, aligned with existing EEG and reading time data. Behav. Res. Methods 2024, 56, 5190–5213.
24. Charpentier, L.; Choshen, L.; Cotterell, R.; Gul, M.O.; Hu, M.; Jumelet, J.; Linzen, T.; Liu, J.; Mueller, A.; Ross, C.; et al. BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop. arXiv 2025, arXiv:2502.10645.
25. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://www.mikecaptain.com/resources/pdf/GPT-1.pdf (accessed on 10 May 2025).
26. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2019, arXiv:1810.04805.
Figure 1. FRE-based curriculum learning data organization by unit (sentence, group, paragraph) and difficulty level.
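For illustration, the segmentation behind these curriculum units can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes paragraphs are separated by blank lines, sentences end in terminal punctuation, and a "group" is a fixed window of consecutive sentences; the window size of 3 is purely an assumption, since the paper defines the actual group unit in the main text.

```python
import re

def split_units(document: str, unit: str = "paragraph", group_size: int = 3):
    """Segment a document into curriculum units: sentence, group, or paragraph.
    group_size = 3 is an illustrative assumption, not the paper's setting."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    if unit == "paragraph":
        return paragraphs
    # Split each paragraph into sentences on terminal punctuation.
    sentences = [s.strip() for p in paragraphs
                 for s in re.split(r"(?<=[.!?])\s+", p) if s.strip()]
    if unit == "sentence":
        return sentences
    # "group": non-overlapping windows of consecutive sentences.
    return [" ".join(sentences[i:i + group_size])
            for i in range(0, len(sentences), group_size)]
```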
Figure 2. Single-token coverage (%) for BERT tokenizers trained on our merged setting, across the GLUE benchmark tasks. Coverage is defined as the proportion of unique words represented as a single token. QQP, MNLI, and BoolQ show the lowest coverage, making them the bottom three datasets in terms of word-level token preservation.
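The coverage statistic reported in Figure 2 can in principle be reproduced directly from a trained tokenizer. The snippet below is a minimal sketch rather than the authors' script: it uses the Hugging Face AutoTokenizer API, a public bert-base-uncased checkpoint as a stand-in for the tokenizer trained on the merged corpus, and a toy word list.

```python
from transformers import AutoTokenizer

def single_token_coverage(words, tokenizer) -> float:
    """Share (%) of unique words that the tokenizer keeps as exactly one token."""
    unique = set(w.lower() for w in words)
    kept = sum(1 for w in unique if len(tokenizer.tokenize(w)) == 1)
    return 100.0 * kept / max(1, len(unique))

# Placeholder checkpoint; the paper trains its own tokenizer on the merged setting.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(single_token_coverage(["readability", "cat", "curriculum"], tokenizer))
```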
Table 1. Length predictors and criterion [3]. Means and SDs summarize the calibration corpus (McCall–Crabbs passages) used to estimate Equation (1); they are not used to compute RE for a new text.
Predictor/Criterion | Correlation with wl | Correlation with si | Mean | SD
wl (syllables per 100 words) | 1.000 | 0.4644 | 146.651 | 25.373
si (sentence length in words) | 0.4644 | 1.000 | 24.032 | 9.113
Criterion C75 | 0.6307 | 0.6212 | 7.3484 | 2.1345
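For reference, the regression estimated from this calibration corpus is the standard Flesch formula, RE = 206.835 − 0.846·wl − 1.015·si [3]. The snippet below is a minimal sketch of how a text can be scored with it; the regular-expression word splitter and the vowel-group syllable counter are simplifying assumptions, not the exact preprocessing used in the paper.

```python
import re

def count_syllables(word: str) -> int:
    """Naive syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """RE = 206.835 - 0.846*wl - 1.015*si, where wl is syllables per
    100 words and si is the average sentence length in words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    wl = 100.0 * sum(count_syllables(w) for w in words) / len(words)
    si = len(words) / len(sentences)
    return 206.835 - 0.846 * wl - 1.015 * si

print(round(flesch_reading_ease("The cat sat on the mat. It was happy."), 2))
```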
Table 2. Curriculum level summary with FRE ranges and mean scores. In all levels, models were trained on datasets consisting of approximately 33 M tokens each.
Dataset | Level | FRE Range | Mean FRE
Single | 1 (Easy) | 119.19 ≥ FRE > 45.09 | 67.76
Single | 2 (Medium) | 45.09 ≥ FRE > 15.31 | 31.30
Single | 3 (Hard) | 15.31 ≥ FRE | −8.64
Merged | 1 (Easy) | 119.19 ≥ FRE > 73.85 | 98.48
Merged | 2 (Medium) | 73.85 ≥ FRE > 40.69 | 58.43
Merged | 3 (Hard) | 40.69 ≥ FRE | 14.43
Table 3. Interpretation of Flesch Reading Ease (FRE) scores.
FRE Score | Readability Level | Estimated Grade Level
0–30 | Very Difficult | College graduate
30–50 | Difficult | 13th to 16th grade
50–60 | Fairly Difficult | 10th to 12th grade
60–70 | Standard | 8th or 9th grade
70–80 | Fairly Easy | 7th grade
80–90 | Easy | 6th grade
90–100 | Very Easy | 5th grade
Table 4. Performance of the single setting on the three zero-shot evaluation tasks: BLiMP, Supplementary BLiMP, and EWoK. Bold values indicate the highest performance; values in parentheses are differences from the corresponding baseline.
Model | Setting | BLiMP | BLiMP-S | EWoK
GPT2 | Baseline | 60.51 | 48.32 | 49.97
GPT2 | Group | 64.06 ± 4.25 (+4.15) | 47.59 ± 2.48 (−0.73) | 49.80 ± 0.22 (−0.17)
GPT2 | Sentence | 64.67 ± 0.85 (+4.16) | 48.90 ± 1.17 (+0.58) | 50.22 ± 0.12 (+0.25)
GPT2 | Paragraph | 72.82 ± 0.21 (+12.31) | 59.50 ± 1.25 (+11.18) | 51.51 ± 0.62 (+1.54)
BERT | Baseline | 53.23 | 44.93 | 49.74
BERT | Group | 53.72 ± 0.62 (+0.49) | 43.11 ± 1.87 (−1.82) | 49.83 ± 0.26 (+0.09)
BERT | Sentence | 56.24 ± 0.10 (+3.01) | 46.19 ± 2.76 (+1.26) | 49.97 ± 0.25 (+0.23)
BERT | Paragraph | 73.06 ± 1.37 (+19.83) | 57.73 ± 0.30 (+12.80) | 51.11 ± 0.17 (+1.37)
Table 5. Performance of the single setting on the GLUE tasks. Bold values indicate the highest performance; values in parentheses are differences from the baseline.
Model | Setting | BoolQ | MNLI | MRPC | MultiRC | QQP | RTE | WSC | AVG
BERT | Baseline | 65.19 | 41.95 | 69.60 | 62.62 | 70.87 | 56.11 | 63.46 | 61.44
BERT | Group | 66.11 ± 0.59 (+0.92) | 43.98 ± 1.10 (+2.03) | 70.58 ± 1.74 (+0.98) | 60.14 ± 3.08 (−2.48) | 70.47 ± 0.58 (−0.40) | 58.27 ± 2.27 (+2.16) | 63.46 ± 0.96 (+0.00) | 61.85 (+0.41)
BERT | Sentence | 68.07 ± 0.52 (+2.88) | 43.25 ± 0.30 (+1.30) | 73.03 ± 0.69 (+3.43) | 60.02 ± 0.21 (−2.60) | 69.72 ± 0.59 (−1.15) | 58.27 ± 4.07 (+2.16) | 65.38 ± 1.11 (+1.92) | 62.53 (+1.09)
BERT | Paragraph | 69.63 ± 0.57 (+4.44) | 58.82 ± 0.83 (+16.87) | 77.94 ± 0.69 (+8.34) | 65.01 ± 1.05 (+2.39) | 77.20 ± 0.46 (+6.33) | 61.23 ± 4.07 (+5.12) | 62.49 ± 1.36 (−0.97) | 67.47 (+6.03)
Table 6. Performance on the three zero-shot tasks in the single setting with paragraph unit training. Bold values indicate the highest performance; values in parentheses are differences from the baseline.
Model | Level | BLiMP | BLiMP-S | EWoK
GPT2 | Baseline | 60.51 | 48.32 | 49.97
GPT2 | 1 (Easy) | 69.37 ± 1.12 (+8.86) | 60.28 ± 0.51 (+11.96) | 50.28 ± 0.31 (+0.31)
GPT2 | 2 (Medium) | 72.62 ± 0.28 (+12.11) | 58.21 ± 0.49 (+9.89) | 52.12 ± 0.34 (+2.15)
GPT2 | 3 (Hard) | 72.82 ± 0.21 (+12.31) | 59.50 ± 1.25 (+11.18) | 51.51 ± 0.62 (+1.54)
BERT | Baseline | 53.23 | 44.93 | 49.74
BERT | 1 (Easy) | 61.41 ± 0.22 (+8.18) | 56.11 ± 1.24 (+11.18) | 49.98 ± 0.23 (+0.24)
BERT | 2 (Medium) | 71.05 ± 1.43 (+17.82) | 57.74 ± 0.38 (+12.81) | 50.75 ± 0.85 (+1.01)
BERT | 3 (Hard) | 73.06 ± 1.37 (+19.83) | 57.73 ± 0.30 (+12.80) | 51.11 ± 0.18 (+1.37)
Table 7. Performance on the reading task in the single setting with paragraph unit training. Bold values indicate the highest performance; values in parentheses are differences from the baseline.
Model | Level | Eye-Tracking | Self-Paced Reading
BERT | Baseline | 2.24 | 0.63
BERT | 1 (Easy) | 4.91 ± 0.62 (+2.67) | 2.60 ± 0.11 (+1.97)
BERT | 2 (Medium) | 5.52 ± 0.27 (+3.28) | 2.72 ± 0.19 (+2.09)
BERT | 3 (Hard) | 6.36 ± 0.55 (+4.12) | 3.00 ± 0.13 (+2.37)
Table 8. Performance on the GLUE tasks in the single setting with paragraph unit training. Bold values indicate the highest performance; values in parentheses are differences from the baseline.
Model | Level | BoolQ | MNLI | MRPC | MultiRC | QQP | RTE | WSC | AVG
BERT | Baseline | 65.19 | 41.95 | 69.60 | 62.62 | 70.87 | 56.11 | 63.46 | 61.44
BERT | 1 (Easy) | 67.85 ± 0.04 (+2.66) | 45.53 ± 0.08 (+3.58) | 71.80 ± 1.04 (+2.20) | 64.26 ± 0.23 (+1.64) | 71.26 ± 0.35 (+0.39) | 58.63 ± 0.51 (+2.52) | 68.26 ± 1.36 (+4.80) | 63.94 (+2.50)
BERT | 2 (Medium) | 67.95 ± 0.78 (+2.76) | 55.95 ± 0.47 (+14.00) | 74.99 ± 0.70 (+5.39) | 65.30 ± 0.35 (+2.68) | 75.48 ± 0.10 (+4.61) | 56.11 ± 0.00 (+0.00) | 66.34 ± 1.36 (+2.88) | 66.01 (+4.57)
BERT | 3 (Hard) | 69.63 ± 0.57 (+4.44) | 58.82 ± 0.84 (+16.87) | 77.94 ± 0.69 (+8.34) | 65.01 ± 1.05 (+2.39) | 77.20 ± 0.46 (+6.33) | 61.23 ± 4.07 (+5.12) | 62.49 ± 1.36 (−0.97) | 67.47 (+6.03)
Table 9. Performance on the three zero-shot tasks (BLiMP, BLiMP-S, EWoK) in the merged setting. Bold values indicate the highest performance; values in parentheses are differences from the baseline.
Model | Setting | BLiMP | BLiMP-S | EWoK
GPT2 | Baseline | 70.62 | 50.29 | 49.90
GPT2 | Group | 70.68 ± 1.09 (+0.06) | 53.13 ± 1.29 (+2.84) | 50.34 ± 1.95 (+0.44)
GPT2 | Sentence | 69.40 ± 0.28 (−1.22) | 52.72 ± 1.35 (+2.43) | 50.22 ± 0.24 (+0.32)
BERT | Baseline | 51.75 | 51.35 | 64.76
BERT | Group | 52.99 ± 0.37 (+1.24) | 52.10 ± 1.44 (+0.75) | 71.10 ± 0.92 (+6.34)
BERT | Sentence | 53.57 ± 0.56 (+1.82) | 47.17 ± 1.33 (−4.18) | 65.31 ± 0.82 (+0.55)
Table 10. Performance of the merged setting on the GLUE tasks. Bold values indicate the highest performance; values in parentheses are differences from the baseline.
Model | Setting | BoolQ | MNLI | MRPC | MultiRC | QQP | RTE | WSC | AVG
BERT | Baseline | 67.27 | 45.43 | 70.09 | 60.76 | 71.18 | 54.67 | 63.46 | 61.83
BERT | Group | 67.40 ± 1.39 (+0.13) | 44.33 ± 1.72 (−1.10) | 70.09 ± 1.03 (+0.00) | 61.46 ± 0.23 (+0.70) | 70.33 ± 0.74 (−0.85) | 57.53 ± 1.03 (+2.86) | 63.46 ± 1.37 (+0.00) | 62.08 (+0.26)
BERT | Sentence | 65.01 ± 0.00 (−2.26) | 45.15 ± 1.51 (−0.28) | 70.58 ± 2.35 (+0.49) | 61.59 ± 0.32 (+0.83) | 71.14 ± 0.04 (−0.04) | 58.27 ± 0.51 (+3.60) | 67.30 ± 1.06 (+3.84) | 62.72 (+0.89)