1. Introduction
Large language models achieve strong performance across a variety of linguistic tasks, but only at the cost of extensive training resources and massive data consumption. This raises a central question in model design: how can we optimize training efficiency under limited data conditions? One promising direction for improving efficiency is curriculum learning, which organizes training data in a sequence that reflects increasing task difficulty. This paradigm has been shown to benefit both machine learning and cognitive modeling [1,2], highlighting data ordering as an effective strategy for guiding model learning in a stable and progressive manner. A central challenge in this approach lies in defining a principled measure of curriculum difficulty, since difficulty effectively acts as a regulator of distributional entropy during training. Therefore, the choice of difficulty metric can critically determine how efficiently a model learns under limited data conditions.
To address this challenge, we introduce a readability-based function that organizes training inputs by difficulty using the Flesch Reading Ease (FRE) score [3]. FRE quantifies textual comprehensibility through a regression formula grounded in human reading behavior, providing a principled metric that reflects developmental learning stages. Drawing inspiration from child language acquisition, we note that children achieve remarkable linguistic competence from limited input [4,5,6], suggesting that curricula modeled on human learning trajectories may yield more data-efficient and cognitively grounded training methodologies [7].
This observation is supported by developmental evidence showing that children can infer grammatical structure and derive meaning from remarkably sparse input, acquiring core elements of syntax and morphology within the first few years of life, despite limited vocabulary and exposure [4,5,6]. Children achieve this level of linguistic ability from relatively low amounts of input, especially when compared to modern language models, which often require well over 10,000 words of training data for every single word a 13-year-old child has encountered [7]. This gap in learning efficiency highlights a critical opportunity: by modeling training trajectories after human development, we may move toward training methodologies that are not only more data-efficient, but also cognitively grounded and empirically motivated.
Thus, we propose a curriculum learning framework based on established readability metrics, which we call the Reading-level Guided Curriculum. Among readability indices, we employ the Flesch Reading Ease (FRE) score, a widely validated psycholinguistic measure closely associated with human reading development. Owing to its simplicity and interpretability, FRE provides a cognitively grounded and reproducible basis for structuring our Reading-level Guided Curriculum.
Furthermore, unlike large-scale models that primarily aim to maximize performance through massive data consumption, our study deliberately constrains the training data to a human-comparable scale. The goal is not to rival the raw performance of large models, but rather to test whether cognitively inspired training trajectories—analogous to human language acquisition under limited exposure—can yield efficient learning and competitive improvements even in data-restricted conditions.
Our results also show that a curriculum guided by the psycholinguistic FRE score improves efficiency in low-resource settings while aligning with insights from human language development. This highlights the potential of our method as a structured and cognitively inspired training strategy for small language models. Our contributions can be summarized as follows:
We propose a curriculum that is substantially simpler and more interpretable than existing approaches.
By controlling for the confounding factor of large-scale data, we isolate pretraining methodology as the primary driver of improvement.
Unlike prior curriculum learning approaches that often yielded inconclusive results, our method produces clear and positive learning signals.
2. Related Works
2.1. Importance of Curriculum for Language Models
2.1.1. Curriculum Learning
Curriculum learning is a training strategy in which examples are presented in an organized order, from simpler to more complex, rather than randomly, allowing models to learn progressively as humans do [1].
The concept can also be understood as a form of continuation method. Instead of tackling a hard non-convex optimization problem directly, training begins with a simplified version of the problem and gradually shifts to the original one. This staged transition helps the model find better minima, achieving lower training loss and stronger generalization.
Formally, following Bengio et al. [1], let $s$ denote a sample drawn from the target distribution $P(s)$. At training stage $\lambda$ ($0 \le \lambda \le 1$), the reweighted distribution is defined as:

$$Q_\lambda(s) \propto W_\lambda(s)\, P(s), \qquad \text{such that } \int Q_\lambda(s)\, ds = 1,$$

where
$s$ is a training sample;
$P(s)$ is the original target distribution over samples;
$W_\lambda(s) \in [0, 1]$ is a weighting function that determines the importance of sample $s$ at stage $\lambda$;
$Q_\lambda(s)$ is the reweighted distribution at stage $\lambda$.
The weighting function $W_\lambda(s)$ increases monotonically with $\lambda$. At early stages, simpler data points are emphasized, while as $\lambda$ grows, increasingly complex examples are incorporated. At the final stage ($\lambda = 1$), $W_1(s) = 1$ for all $s$ and $Q_1(s) = P(s)$, meaning that the model is trained on the full target distribution without any reweighting.
A curriculum is then characterized by the following two conditions. First, the entropy of the training distribution must increase monotonically over time:

$$H(Q_\lambda) \le H(Q_{\lambda + \epsilon}), \qquad \forall\, \epsilon > 0,$$

where $H(Q_\lambda)$ denotes the entropy of the reweighted data distribution $Q_\lambda$, which reflects the overall diversity of training samples at stage $\lambda$, and $\epsilon$ represents a small positive increment indicating the transition to a slightly later stage. This condition ensures that while early stages emphasize a narrow band of difficulty, later stages introduce broader diversity. Second, the weights of individual examples must also grow monotonically with $\lambda$:

$$W_{\lambda + \epsilon}(s) \ge W_\lambda(s), \qquad \forall\, s,\ \forall\, \epsilon > 0,$$

where $W_\lambda(s)$ denotes the weighting function that determines the importance of each training sample $s$ at stage $\lambda$, and $\epsilon$ represents a small positive increment indicating a transition to a subsequent stage. Once an example is included, its importance does not diminish; instead, more difficult examples are layered on top, enabling the learner to build upon earlier knowledge.
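To make these two conditions concrete, the following Python sketch (our illustration, not code from [1]) uses a simple threshold weighting over normalized difficulty scores and checks that the support of the reweighted distribution, and hence its entropy, grows with the stage parameter; the difficulty values and the stage schedule are illustrative assumptions.

```python
# Minimal sketch of Bengio-style curriculum reweighting (illustrative only).
import math
import random

def weight(difficulty: float, lam: float) -> float:
    """W_lambda(s): admit a sample once the stage parameter lambda has
    reached its difficulty (difficulty is normalized to [0, 1])."""
    return 1.0 if difficulty <= lam else 0.0

def reweighted_distribution(difficulties, lam):
    """Q_lambda(s) proportional to W_lambda(s) * P(s), assuming a uniform P(s)."""
    w = [weight(d, lam) for d in difficulties]
    z = sum(w) or 1.0
    return [wi / z for wi in w]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

random.seed(0)
difficulties = [random.random() for _ in range(100)]   # toy difficulty scores

prev_entropy = -1.0
for lam in (0.25, 0.5, 0.75, 1.0):                     # coarse stage schedule
    q = reweighted_distribution(difficulties, lam)
    h = entropy(q)
    assert h >= prev_entropy      # condition 1: entropy never decreases
    prev_entropy = h              # condition 2 holds because weight() is
                                  # non-decreasing in lam for every sample
    print(f"lambda={lam:.2f}  included={sum(p > 0 for p in q)}  H={h:.3f}")
```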
In this way, curriculum learning provides a stable and intuitive framework that guides models to handle increasingly complex inputs, making it particularly well-suited for domains such as language modeling, where gradual acquisition plays a central role.
2.1.2. Linguistic Indicators in Curriculum Learning
Several studies have explored ways to leverage linguistic indicators for curriculum learning in order to improve model performance [8,9,10]. These works are often motivated by the intuition that, much like in human learning, models benefit from progressing from simpler inputs to more complex ones. What distinguishes this line of research from other curriculum methods is that difficulty is defined through linguistic indicators, which makes the curriculum not only an effective training strategy but also a linguistically meaningful one. While these studies share the same overarching motivation, they differ substantially in how linguistic difficulty is measured.
Oba et al. [9] operationalized difficulty through syntactic complexity, using measures such as dependency tree depth and the number of syntactic constituents to approximate stages of linguistic acquisition. While this syntactic approach offered a linguistically grounded perspective, the observed performance gains were relatively limited. Interestingly, later investigations [10] showed that even surface-level indicators such as sentence length and word rarity could yield improvements when used to design curricula. More recently, a multi-view curriculum framework was introduced in 2023 [8], drawing on more than 200 linguistic complexity indices. Its results showed meaningful gains across downstream tasks, underscoring the potential of linguistically rich curricula. At the same time, however, the reliance on such a large set of features made the framework less transparent and harder to apply.
These prior studies collectively demonstrate the promise of leveraging linguistic indicators for curriculum design, yet they also leave open questions about which indicators are most effective and how they should be operationalized. Syntactic-complexity-based curricula [9], while grounded in solid linguistic theory, translated into only modest performance gains in practice. At the other end of the spectrum, multi-view frameworks [8] that integrate hundreds of linguistic indices achieved stronger results but did so at the cost of interpretability and practical applicability, as their reliance on extensive feature sets made them cumbersome to adopt consistently. This tension suggests a broader gap: existing approaches either fall short in impact, oversimplify the notion of difficulty, or become too complex to implement reliably. Beyond such extremes, there is room to explore indicators that not only reflect surface-level statistics but also align more closely with cognitive perspectives on language learning.
Building on this view, we propose a Reading-level Guided Curriculum learning framework grounded in readability metrics that have been widely applied in educational and applied linguistic contexts. By relying on well-established readability measures, the framework keeps the training process computationally lightweight while remaining aligned with human intuitions about text difficulty. We investigate whether insights from human language development can promote more efficient learning dynamics in data-limited language models, regardless of model size or training scale.
2.2. Flesch Reading Ease Score
As discussed in the previous section, this study builds on insights from human developmental patterns and examines the Flesch Reading Ease (FRE) score as a readability-based measure that can serve as a cognitively grounded foundation. Among the many possible criteria, we focus on FRE for two main reasons: (1) it is an intuitive and quantitative metric that integrates sentence length and lexical complexity into a single interpretable value, and (2) it has been widely applied in educational contexts for English learners, aligning well with the philosophical goal of this work to treat language models as learning entities.
2.2.1. Interpretability Through Integration
The FRE score was originally designed to evaluate the accessibility of written English, integrating sentence length and lexical complexity into a single interpretable value. The metric was first developed using 363 passages from the McCall–Crabbs Standard Test Lessons in Reading, where children's reading comprehension performance served as the foundation. The criterion, denoted as $C_{75}$, was defined as the average grade level at which children could correctly answer 75% of comprehension questions, providing a numerical benchmark for the grade level required to understand a given text.

To model the relationship between textual features and grade-level comprehension, a multiple regression analysis was conducted with two predictors: syllables per 100 words ($wl$) and average sentence length in words ($sl$). The resulting regression equation took the linear form

$$\hat{C}_{75} = b_0 + b_1 \, wl + b_2 \, sl, \qquad b_1, b_2 > 0.$$

This indicates that texts with longer sentences and more syllabic words require higher grade levels to comprehend. To complement the regression analysis, Table 1 summarizes the descriptive statistics and pairwise correlations among the predictors ($wl$, $sl$) and the criterion $C_{75}$.

As shown in the table, the correlation between $wl$ and $sl$ indicates a moderate positive association, meaning that passages with longer sentences also tend to contain more syllables per 100 words. Likewise, the correlations of $wl$ and $sl$ with $C_{75}$ show that texts with more syllabic words and longer sentences tend to require higher grade levels.

Because this raw regression predicted grade levels rather than ease of reading, an additional transformation was introduced to yield an interpretable "ease" score. The predicted grade level was first multiplied by 10 to create a 0–100 scale (so that one point corresponds to one-tenth of a grade), and the signs of the predictors were reversed so that higher values would correspond to easier texts. The final FRE formula in terms of the original variables was:

$$\mathrm{FRE} = 206.835 - 0.846 \, wl - 1.015 \, sl.$$

As many implementations use average syllables per word ($ASW$) rather than syllables per 100 words, the formula can equivalently be expressed as:

$$\mathrm{FRE} = 206.835 - 1.015 \cdot ASL - 84.6 \cdot ASW,$$

where $ASL$ denotes average sentence length (words per sentence) and $ASW$ denotes average syllables per word.
In summary, the FRE score is derived through three steps: (1) defining a comprehension-based criterion from children’s performance, (2) linking this criterion to textual length features via regression, and (3) transforming the predicted grade level into a 0–100 ease scale. This process shows how sentence length and lexical complexity can be integrated into a single interpretable measure, thereby demonstrating interpretability through integration.
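As a concrete reference, the following Python sketch computes the FRE score with the formula above. The syllable counter is a rough vowel-group heuristic of our own (established packages such as textstat implement more careful counting), so the outputs should be read as approximations.

```python
# Illustrative FRE computation; the syllable counter is a crude heuristic.
import re

def count_syllables(word: str) -> int:
    """Approximate the syllable count as the number of vowel groups."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    asl = len(words) / len(sentences)                           # words per sentence
    asw = sum(count_syllables(w) for w in words) / len(words)   # syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw

print(flesch_reading_ease("The cat sat on the mat. It was happy."))       # high = easy
print(flesch_reading_ease("Notwithstanding considerable methodological "
                          "heterogeneity, the investigation corroborated "
                          "the hypothesized association."))               # low = hard
```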
2.2.2. Practicality for Controlling Text Difficulty
As discussed above, the FRE score, originally designed to assess the accessibility of written English, has been widely adopted in educational contexts, particularly for evaluating materials targeted at children and second-language learners. In educational practice, it is employed to evaluate school texts for consistency with children's developmental levels [11], and its role extends to research on child language development, where FRE serves as an experimental tool for calibrating linguistic stimuli [12]. Studies on children's reading comprehension and processing time often adopt FRE as a standardized scale for comparing textual difficulty [13].
While alternative readability or complexity indices, such as syntactic depth, lexical sophistication, or the multi-view linguistic feature sets mentioned in Section 2.1.2, have also been explored in prior curriculum learning studies, these approaches often require extensive feature extraction, yield less interpretable signals, or introduce computational overhead. Notably, a recent study [14] also included FRE among multiple difficulty metrics and found it to be one of the most effective indicators for improving training efficiency. The FRE score provides a single, transparent value that has been validated in both educational practice and psycholinguistic research. Its simplicity and interpretability make it particularly suitable for our goal of developing a lightweight yet cognitively motivated curriculum framework for small-model training.
3. Proposed Method
In this work, we extend the general framework of curriculum learning introduced by Bengio et al. [1] by incorporating the FRE score as a measure of linguistic difficulty. We introduce a curriculum parameter, $\lambda$ ($0 \le \lambda \le 1$), which controls the progression from easy to difficult inputs during training. At stage $\lambda$, the distribution is defined as:

$$Q_\lambda(s) \propto W_\lambda\big(\mathrm{FRE}(s)\big)\, P(s),$$

where $s$ denotes a training unit (e.g., a sentence or a paragraph depending on the experimental setting), $P(s)$ is the target distribution over such units, and $W_\lambda\big(\mathrm{FRE}(s)\big)$ is a weighting function explicitly determined by the FRE score of $s$. Intuitively, $W_\lambda$ prioritizes easier units at earlier stages and gradually introduces more complex ones, thereby making the progression of linguistic complexity explicit.
At lower values of $\lambda$, the weighting function favors sentences with higher FRE scores (i.e., easier sentences). As $\lambda$ increases, progressively lower FRE scores (i.e., more complex sentences) are incorporated. The progression of $\lambda$ thus corresponds to a transparent trajectory of increasing linguistic complexity:
$\lambda \approx 0$: the distribution emphasizes easier sentences.
$0 < \lambda < 1$: progressively more complex sentences are added.
$\lambda = 1$: the full target distribution is restored without any reweighting.
The mapping between FRE scores and the weighting function can be illustrated through a simple example. Consider three sentences of different difficulty levels: Sentence A (FRE = 90, easy), Sentence B (FRE = 70, medium), and Sentence C (FRE = 40, hard). At the early stage of training ($\lambda$ close to 0), the weighting function prioritizes high-FRE sentences, and thus only Sentence A is emphasized. As $\lambda$ increases toward intermediate values, medium-level sentences such as Sentence B are also incorporated. Finally, at $\lambda = 1$, the full target distribution is recovered, and all sentences, including the more complex Sentence C, are included. This stepwise inclusion illustrates how the weighting function is explicitly determined by the FRE score: higher values are favored earlier in the curriculum, and progressively lower values are introduced as training advances.
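This stepwise inclusion can also be written out directly. The Python sketch below is our illustration of the example above; the linear mapping from $\lambda$ to an FRE cutoff is an assumption made for exposition, not a formula specified by the method.

```python
# Sketch of FRE-driven weighting for the three-sentence example (A/B/C).
def fre_weight(fre: float, lam: float,
               fre_min: float = 0.0, fre_max: float = 100.0) -> float:
    """W_lambda(s): admit a unit once the stage has reached its difficulty.
    lam = 0 keeps only the easiest (highest-FRE) units; lam = 1 keeps all."""
    cutoff = fre_max - lam * (fre_max - fre_min)   # cutoff sweeps 100 -> 0
    return 1.0 if fre >= cutoff else 0.0

sentences = {"A": 90, "B": 70, "C": 40}            # FRE scores from the example
for lam in (0.1, 0.35, 1.0):
    included = [name for name, fre in sentences.items() if fre_weight(fre, lam)]
    print(f"lambda={lam:.2f}: {included}")
# lambda=0.10: ['A']            (only the easy sentence)
# lambda=0.35: ['A', 'B']       (easy + medium)
# lambda=1.00: ['A', 'B', 'C']  (full target distribution)
```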
Unlike prior cognitively motivated curricula that struggled with effectiveness or relied on complex heuristics, our approach offers a lightweight and interpretable mechanism for controlling training difficulty. In this way, our formulation integrates FRE scores directly into the curriculum, providing a principled and interpretable mechanism for controlling linguistic difficulty during training.
By construction, $Q_\lambda$ preserves the two central conditions of curriculum learning: (1) monotonic increase of entropy, since as $\lambda$ grows the curriculum expands to cover a broader range of FRE scores; and (2) monotonic growth of weights, since once easy examples are introduced their relative importance does not diminish and harder examples are layered on top. Furthermore, we apply this simple formulation to various curriculum designs and configurations in order to investigate its effectiveness in supporting efficient training and enhancing small-model performance. Accordingly, the formulation offers a simple yet effective and interpretable approach to defining and regulating linguistic difficulty during training. The details of the experimental settings are presented in the following section.
4. Experimental Settings
To investigate how data diversity affects curriculum learning, we designed experiments under two dataset settings: single and merged. In the single setting, curricula were constructed at three units (sentence, group, and paragraph), while in the merged setting, only two units (sentence and group) were feasible. This results in five conditions in total, allowing us to analyze how reading-level guided difficulty ordering impacts learning. To operationalize text difficulty within these settings, we compute an FRE score for each text segment in each setting. We then sort all scored segments and divide them into three equal-sized bins (tertiles). As a result, the top third is categorized as easy, the middle third as medium, and the bottom third as hard. The specific difficulty boundaries used for the single and merged dataset settings are summarized in Table 2, which also reports the mean FRE score within each level.
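As an illustration of this tertile split, the sketch below scores each segment, sorts the list from easiest to hardest, cuts it into thirds, and reports the FRE range and mean of each bin; segments and flesch_reading_ease are assumed helpers, and the printed values are not those of Table 2.

```python
# Sketch of the tertile binning used to define the easy/medium/hard levels.
def tertile_bins(segments, fre_fn):
    ordered = sorted(segments, key=fre_fn, reverse=True)   # easiest (highest FRE) first
    n = len(ordered)
    bins = {
        "easy":   ordered[: n // 3],
        "medium": ordered[n // 3 : 2 * n // 3],
        "hard":   ordered[2 * n // 3 :],
    }
    for name, items in bins.items():
        scores = [fre_fn(seg) for seg in items]
        print(f"{name:6s}  n={len(scores):6d}  "
              f"FRE in [{min(scores):6.1f}, {max(scores):6.1f}]  "
              f"mean={sum(scores) / len(scores):6.1f}")
    return bins
```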
4.1. Comparison of Settings
In our experiments, we consider a total of five configurations. First, based on the type of training dataset, we distinguish between two settings: the single setting, where curriculum learning is applied to a single corpus, and the merged setting, where different corpora are assigned to different curriculum levels to more closely simulate the developmental trajectory of child language acquisition. Within the single setting, the FRE score is computed at three units: sentence, group, and paragraph. In contrast, the merged setting includes only sentence- and group-based curricula, since the merged corpora do not consistently form coherent paragraphs. This yields three variants for the single setting and two for the merged setting, resulting in five configurations in total. This setup allows us to examine the impact of reading-level guided difficulty ordering and to identify effective training strategies. The specific sources and characteristics of the datasets used in the single and merged settings are described in Section 4.2, while this section focuses on the three unit levels at which FRE is computed.
Group unit: Sentences are split and scored individually by FRE, then coarsely partitioned into three bins: easy, medium, and hard. Within each bin, samples are randomly shuffled without further ordering, resulting in a coarse-grained curriculum. This corresponds to the procedure of Algorithm 1 (group), where difficulty increases stage by stage, but fine-grained ordering is not enforced. Figure 1 visualizes the overall training data organization, where the color shading indicates FRE difficulty (lighter = easier, darker = harder) and sentences within each stage are arranged without internal ordering.
Sentence unit: Sentences are again grouped into three stages, but unlike the group unit setting, each sentence is strictly ordered from easiest to hardest according to its FRE score. This corresponds to the procedure of Algorithm 2 (sentence), which enforces a fully ordered, fine-grained curriculum where difficulty increases both across and within stages. In Figure 1, this is represented by progressively darker sentence shading within each block, showing that the model processes inputs in a strictly increasing-difficulty (decreasing-FRE) sequence.
Paragraph unit: Instead of scoring at the sentence level, entire paragraphs are treated as indivisible units. FRE is computed across the full paragraph, and units are then partitioned into easy, medium, and hard stages based on tertile boundaries of the paragraph-level distribution. This corresponds to Algorithm 3 (paragraph), ensuring that contextual and semantic relations across sentences are preserved within each input. In Figure 1, paragraphs are highlighted as larger yellow-shaded blocks, emphasizing that multiple sentences are grouped and treated as one training unit rather than being split. The difficulty still follows the easy → medium → hard progression, but it is determined at the paragraph level.
As mentioned above, across all three settings, samples are ordered by their FRE values and split into three subsets (easy, medium, and hard) by dividing the ordered list into thirds. Training proceeds stage by stage for $E$ epochs per stage, in the order easy → medium → hard.
Algorithm 1 Group
Require: dataset $D = \{x_i\}$, epochs per stage $E$
Ensure: Trained parameters $\theta$
1: $U \leftarrow \emptyset$
2: for each $x_i \in D$ do
3:  $S_i \leftarrow$ SentenceSplit($x_i$)
4:  for each $s_{i,k} \in S_i$ do
5:   $f_{i,k} \leftarrow$ ComputeFRE($s_{i,k}$)
6:   $U \leftarrow U \cup \{(s_{i,k}, f_{i,k})\}$
7: $U \leftarrow$ OrderBy($U$, FRE, descending) ▷ easiest (highest FRE) first
8: Split $U$ into thirds: $U_{\mathrm{easy}}, U_{\mathrm{medium}}, U_{\mathrm{hard}}$
9: Shuffle each subset
10: Initialize parameters $\theta$
11: for $U_{\mathrm{stage}} \in (U_{\mathrm{easy}}, U_{\mathrm{medium}}, U_{\mathrm{hard}})$ do
12:  for $e = 1$ to $E$ do
13:   Update $\theta$ on $U_{\mathrm{stage}}$
14: return $\theta$
Algorithm 2 Sentence
Require: dataset $D = \{x_i\}$, epochs per stage $E$
Ensure: Trained parameters $\theta$
1: $U \leftarrow \emptyset$
2: for each $x_i \in D$ do
3:  $S_i \leftarrow$ SentenceSplit($x_i$)
4:  for each $s_{i,k} \in S_i$ do
5:   $f_{i,k} \leftarrow$ ComputeFRE($s_{i,k}$)
6:   $U \leftarrow U \cup \{(s_{i,k}, f_{i,k})\}$
7: $U \leftarrow$ OrderBy($U$, FRE, descending) ▷ strict easiest-to-hardest order is kept
8: Split $U$ into thirds: $U_{\mathrm{easy}}, U_{\mathrm{medium}}, U_{\mathrm{hard}}$
9: Initialize parameters $\theta$
10: for $U_{\mathrm{stage}} \in (U_{\mathrm{easy}}, U_{\mathrm{medium}}, U_{\mathrm{hard}})$ do
11:  for $e = 1$ to $E$ do
12:   Update $\theta$ on $U_{\mathrm{stage}}$
13: return $\theta$
Algorithm 3 Paragraph
Require: dataset $D = \{x_i\}$, epochs per stage $E$
Ensure: Trained parameters $\theta$
1: $U \leftarrow \emptyset$
2: for each $x_i \in D$ do
3:  $f_i \leftarrow$ ComputeFRE($x_i$)
4:  $U \leftarrow U \cup \{(x_i, f_i)\}$
5: $U \leftarrow$ OrderBy($U$, FRE, descending) ▷ paragraphs kept intact, easiest first
6: Split $U$ into thirds: $U_{\mathrm{easy}}, U_{\mathrm{medium}}, U_{\mathrm{hard}}$
7: Initialize parameters $\theta$
8: for $U_{\mathrm{stage}} \in (U_{\mathrm{easy}}, U_{\mathrm{medium}}, U_{\mathrm{hard}})$ do
9:  for $e = 1$ to $E$ do
10:   Update $\theta$ on $U_{\mathrm{stage}}$
11: return $\theta$
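The three listings can be condensed into one builder. The following Python sketch is our rendering of Algorithms 1–3, assuming a sentence splitter, the FRE scorer, and a model-specific update function; it shows the control flow rather than the exact training code used in the experiments.

```python
# Condensed sketch of Algorithms 1-3 (group / sentence / paragraph units).
import random

def build_curriculum(docs, unit, fre_fn, split_fn=None):
    """Return the [easy, medium, hard] stages as lists of training units."""
    if unit == "paragraph":                     # Algorithm 3: score whole documents
        units = list(docs)
    else:                                       # Algorithms 1-2: score sentences
        units = [s for d in docs for s in split_fn(d)]
    units.sort(key=fre_fn, reverse=True)        # easiest (highest FRE) first
    n = len(units)
    stages = [units[: n // 3], units[n // 3 : 2 * n // 3], units[2 * n // 3 :]]
    if unit == "group":                         # Algorithm 1: coarse bins only
        for stage in stages:
            random.shuffle(stage)               # no ordering inside a stage
    return stages                               # "sentence" keeps strict FRE order

def train_with_curriculum(model, stages, epochs_per_stage, update_fn):
    for stage in stages:                        # easy -> medium -> hard
        for _ in range(epochs_per_stage):
            update_fn(model, stage)
    return model
```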
4.2. Datasets
4.2.1. Pretraining Datasets
Our dataset size was deliberately capped at 100 million words, which corresponds to roughly the number of words a 13-year-old child has been exposed to over their lifetime [15]. By constraining models to human-comparable input sizes, we aim to approximate plausible cognitive models of human learning. Training models on quantities of data closer to what humans actually encounter provides a valuable lens for understanding what enables humans to acquire language so efficiently, and can help illuminate the mechanisms underlying language learning.
Having defined the curriculum construction and training procedure within the scope of 100 M words, the subsequent experiments are divided according to two dataset settings. In the single setting, we used only the Cosmopedia [16] corpus, a synthetic text dataset that contains diverse formats such as textbooks, blogs, and stories generated with Mixtral-8x7B-Instruct-v0.1. Cosmopedia provides a sufficiently broad and coherent resource, which is expected to help the model capture general linguistic regularities.
In the merged setting, we constructed a mixture of corpora that naturally span a broad range of linguistic complexity. Specifically, we included data from CHILDES [17], Storybook [18], manually curated datasets from Gutenberg Children's Literature (https://www.gutenberg.org/ebooks/bookshelf/20 (accessed on 6 May 2025)), and a subset of Cosmopedia, thereby covering a wide spectrum of linguistic variation.
We deliberately combined the above datasets that span different levels of readability, so that the merged corpus reflects a graded progression of linguistic complexity. Specifically, the sentences from CHILDES [17] and Storybook [18] primarily fall above an FRE score of 80, corresponding to the "easy" or "very easy" readability levels typically associated with fifth to sixth grade readers, and were included to provide the model with exposure to simple and coherent linguistic patterns resembling children's early language input. The Gutenberg Children's Literature corpus mostly occupies the FRE range of 40 to 80, covering "fairly easy" to "fairly difficult" levels, and was chosen to gradually introduce richer vocabulary and moderately complex syntax that reflect the progression to more advanced readers. Finally, a subset of Cosmopedia contributes a greater proportion of sentences below an FRE score of 40, representing challenging and low-readability content, thereby ensuring that the model is also exposed to advanced discourse structures and dense informational content (refer to Table 3 for the detailed interpretation of FRE scores). To verify that our readability-based categorization reflects consistent human perception, we conducted a human evaluation with three annotators, yielding moderate to substantial inter-rater agreement (Fleiss' $\kappa \approx 0.588$).
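For completeness, this agreement check can be reproduced with standard tooling. The sketch below uses the Fleiss' kappa implementation in statsmodels on a toy label matrix; the labels shown are placeholders, not the study's actual annotations.

```python
# Sketch of the inter-rater agreement computation (toy labels, three annotators).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = annotated segments, columns = annotators; 0 = easy, 1 = medium, 2 = hard.
labels = np.array([
    [0, 0, 0],
    [1, 0, 1],
    [2, 2, 2],
    [1, 1, 2],
])
table, _ = aggregate_raters(labels)        # per-segment counts for each category
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")
```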
4.2.2. Evaluation Datasets
We evaluate our models using datasets from a subset of GLUE [19] and SuperGLUE [20], including seven tasks: BoolQ, MNLI, MRPC, MultiRC, QQP, RTE, and WSC, together with three additional zero-shot tasks: BLiMP, EWOK, and Reading.
GLUE provides a collection of nine natural language understanding tasks covering textual entailment, similarity, and classification, and has become a widely adopted benchmark for evaluating pretrained models on a diverse range of linguistic phenomena [19]. SuperGLUE consists of more challenging tasks that require deeper reasoning and diverse task formats [20]. For our evaluation, CoLA, SST2, MNLI-mm, and QNLI were not included, as these tasks are highly correlated with other datasets such as BLiMP or MNLI. Overall, adopting GLUE as one of the evaluation tasks provides a standardized and comprehensive testbed for evaluating whether our curriculum-learning approach leads to general improvements across diverse NLU tasks.
In addition to these established benchmarks, we further evaluate on three targeted zero-shot tasks, which enable us to measure the models’ ability to generalize beyond training without task-specific tuning.
BLiMP [21] is a suite of minimal pairs targeting grammatical phenomena. It evaluates whether models prefer the grammatically correct sentence over an ungrammatical counterpart. Since BLiMP directly measures fine-grained grammatical generalization, it is especially relevant for testing whether a readability-based curriculum strengthens models' sensitivity to core linguistic rules. EWOK [22] assesses models' world knowledge by testing their ability to distinguish plausible from implausible contexts. Evaluating factual and commonsense reasoning beyond pure syntax or semantics, EWOK tests whether gains from our training method extend to broader knowledge grounding.
Reading [23] contains 205 English sentences (1726 words), for which cloze probabilities, predictability ratings, and computational surprisal estimates are aligned with behavioral and neural measures. Crucially, the reading component includes two complementary tasks: self-paced reading (SPR), which records reaction times as participants reveal words one by one, and eye-tracking, which captures fine-grained gaze measures such as fixations and regressions. Whereas SPR reflects more controlled, consciously paced reading behavior, eye-tracking captures more natural and immediate processing dynamics.
The evaluation of the reading task followed the framework of the BabyLM 2025 Challenge [24], where model predictions are assessed using regression analyses that measure the increase in explained variance ($\Delta R^2$) when surprisal is added as a predictor for human reading measures. Specifically, eye-tracking variables are analyzed without spillover effects, whereas self-paced reading includes a one-word spillover term to account for delayed processing influences from the previous word. The spillover effect captures the phenomenon that cognitive load from a word can extend to the subsequent word, influencing its reading time. Higher $\Delta R^2$ values indicate that model-derived surprisal explains more variance in reaction times or gaze durations, providing a cognitively grounded test of how closely model processing aligns with human processing dynamics.
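The $\Delta R^2$ computation can be sketched with ordinary least squares. In the Python sketch below, the column names and baseline predictors are illustrative assumptions; the BabyLM 2025 evaluation pipeline defines its own predictor set and regression details.

```python
# Sketch of the delta-R^2 evaluation: how much variance surprisal adds
# over baseline predictors of per-word reading measures.
import pandas as pd
import statsmodels.formula.api as smf

def delta_r2(df: pd.DataFrame, spillover: bool = False) -> float:
    """Increase in R^2 when surprisal is added to a baseline regression.
    Assumed df columns: rt, word_len, log_freq, surprisal."""
    if spillover:  # self-paced reading: also include the previous word's surprisal
        df = df.assign(surprisal_prev=df["surprisal"].shift(1)).dropna()
        extra = "surprisal + surprisal_prev"
    else:          # eye-tracking: current-word surprisal only
        extra = "surprisal"
    base_r2 = smf.ols("rt ~ word_len + log_freq", data=df).fit().rsquared
    full_r2 = smf.ols(f"rt ~ word_len + log_freq + {extra}", data=df).fit().rsquared
    return full_r2 - base_r2

# Usage: delta_r2(eye_tracking_df)  and  delta_r2(spr_df, spillover=True)
```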
For all reported results, we averaged performance over three random seeds to ensure fairness. Details of the hyperparameter settings used for fine-tuning on GLUE tasks are provided in Appendix A.
4.3. Backbone Models
We conducted experiments using two widely adopted architectures: GPT-2 [25] and BERT [26]. These models served as the backbones for all curriculum and baseline training runs. GPT-2 was used for autoregressive language modeling, while BERT was used for masked language modeling in a bidirectional setting. Details of model hyperparameters are provided in Table A1.
Choosing these two complementary architectures allowed us to examine the effectiveness of curriculum learning across both the encoder-based and decoder-based paradigms. We deliberately employed moderate-sized, widely used models rather than larger state-of-the-art models, as smaller models provide a controlled environment in which the effects of curriculum learning can be isolated under the 100 M word constraint, without confounding from extreme model capacity. In addition, given our deliberate 100 M word limitation, employing much larger recent models would be suboptimal, since their parameter scale requires substantially more data to train effectively. Using moderate-sized, widely validated models thus provides a more controlled testbed for evaluating the specific contribution of curriculum learning.
5. Results
5.1. The Result of the Single Setting
In this section, we present the analysis of the single setting, where curriculum learning is applied to a single dataset, one of the two configurations defined by dataset composition. We first examine (1) zero-shot tasks to observe the intrinsic performance without relying on fine-tuning data, followed by (2) GLUE tasks that require fine-tuning, and finally (3) a detailed analysis to gain deeper insights into the effectiveness of our curriculum. All results reported in this section are averaged over three independent runs with different random seeds, ensuring that the improvements are not due to chance from a single run. Cross-validation was not applied, as it is computationally prohibitive in pretraining settings; the multi-seed evaluation serves as a practical and widely accepted alternative.
5.1.1. Evaluation on Zero-Shot Tasks
We evaluated six zero-shot tasks, as shown in Table 4, after training a small language model using a curriculum structured from easier to more difficult documents and sentences based on their FRE scores. Among the various training strategies, the paragraph setting, where the curriculum was constructed at the paragraph level, consistently yielded superior performance across most tasks regardless of model architecture. In particular, substantial improvements were observed in tasks evaluating the grammatical competence of language models. For BLiMP, the GPT-2 architecture achieved a 12.31% improvement over the baseline, while the BERT architecture achieved a 19.83% improvement. Similarly, in the more grammatically demanding Supplement BLiMP task, performance increased by 11.18% and 12.80%, respectively.
Furthermore, in the Ewok task, which evaluates factual knowledge acquisition, the FRE-based curriculum also resulted in performance improvements, suggesting that utilizing readability indicators—commonly applied to evaluate the difficulty of children’s books—can have practical benefits for knowledge-intensive tasks in small language models.
5.1.2. Evaluation on GLUE Tasks
Table 5 demonstrates that within the single setting, our curriculum learning approach can positively influence natural language understanding capabilities. In particular, the paragraph unit setting, where the curriculum was constructed based on FRE scores computed at the paragraph unit, achieved an average improvement of +6.07 over the baseline, delivering superior performance across most subtasks. While there was a slight drop in tasks such as WSC, the overall improvements were far greater. This suggests that in NLU tasks, not only text difficulty but also contextual understanding across consecutive sentences plays a crucial role.
A striking example is the Multi-Genre Natural Language Inference (MNLI) task, which showed an impressive improvement of 16.87. Since MNLI requires the model to classify the relationship between a pair of sentences as entailment, contradiction, or neutral, semantic and contextual reasoning is essential. For example, given the premise “How do you know? All this is their information again.” and the hypothesis “This information belongs to them.”, the correct label is entailment. This is because the premise already presupposes that the information belongs to “them,” and the hypothesis simply restates this fact in a more concise way. In other words, if the premise is true, the hypothesis must also be true, thereby establishing an entailment relation.
Since tasks like MNLI rely heavily on semantic and contextual inference between sentence pairs, preserving broader context and capturing complex structures are critical. This explains why paragraph-level training, which better retains contextual integrity, likely outperformed sentence-level training in this setting.
5.2. Detailed Analysis: The Effect of Reading-Level Guided Curriculum
5.2.1. Evaluation on Zero-Shot Tasks
In Table 6, we analyze whether each curriculum level exerts an appropriate positive effect when training with the paragraph unit, which showed the best performance in the single setting. The table reports the performance on three zero-shot tasks (BLiMP, BLiMP-S, EWoK) at each curriculum level (easy, medium, hard), relative to the baseline.
As the curriculum progresses, we observe consistent performance gains across both model architectures, indicating that each stage contributes positively in a balanced manner. At the easy level in particular, tasks involving grammaticality judgment, such as BLiMP and BLiMP-S, show notable improvements: GPT-2 achieves gains of 8.86% and 11.96%, while BERT improves by 8.18% and 11.18%, respectively. These results suggest that grammaticality-related tasks particularly benefit from exposure to linguistically simpler text in the early stages of the curriculum. In contrast, the EWoK dataset, which evaluates factual knowledge acquisition, shows only a marginal improvement of 0.31, implying that such tasks may be less sensitive to gains from easier text and instead require more complex input to yield substantial benefits.
5.2.2. Evaluation on Reading Task
Table 7 shows the performance of the reading task in the single setting, where curriculum learning was conducted using FRE scores computed at the paragraph level. The task evaluates how closely the model mirrors human-like perceptions of textual difficulty, comparing results against the baseline after each curriculum stage.
Beyond grammaticality judgment tasks, improvements were also observed in the reading tasks, which measure how closely language models align with human processing patterns. In the baseline without a curriculum, the scores remained close to zero. However, when training was structured according to decreasing FRE scores (that is, progressing from easier to more difficult texts), performance consistently improved across all settings. Notably, the paragraph setting yielded the strongest gains, achieving improvements of 4.12 on the eye-tracking task and 2.37 on the self-paced reading task compared to the baseline. These results indicate that language models trained with an FRE-based curriculum tend to struggle at similar points as humans, such as on passages where readers naturally slow down and fixate longer due to perceived difficulty. In this sense, the approach encourages models to exhibit processing patterns more closely aligned with human cognitive responses.
5.2.3. Evaluation on GLUE Task
Table 8 presents the performance on the GLUE subtasks after fine-tuning with their respective training datasets, following curriculum learning based on FRE scores computed at the paragraph level in the single setting. Similar to the earlier zero-shot tasks, all curriculum levels show improvements over the baseline, and furthermore, performance steadily increases as the curriculum progresses, reaching 63.94, 66.01, and 67.46 at the easy, medium, and hard levels, respectively.
Among the notable observations, the first is that the WSC task diverges from the overall trend of gradual improvement. Because this task requires reasoning about the referents of pronouns (coreference resolution), the easy-level data—consisting of simpler structures and clearer sentences—appears to support the learning of “surface-level” coreference patterns. However, as training progresses and later stages become dominated by more complex data with long sentences and nested clauses, the model may dilute the coreference signal or absorb additional noise, leading to reduced performance. Second, in the case of the MNLI task, we observe exponential growth beginning at the medium stage. Since this task relies heavily on contextual inference, it can be interpreted that a certain level of curriculum progression is necessary before robust performance emerges.
5.3. The Result of Merged Setting
5.3.1. Evaluation on Zero-Shot Tasks
Table 9 shows that our curriculum learning approaches in the merged setting demonstrate strong performance in three zero-shot evaluation tasks: BLiMP, Supplementary BLiMP, and Ewok. Across most evaluations, the grouped setting consistently outperformed the others, providing empirical evidence that our approach offers a cognitively aligned learning strategy that enables small models to perform better in zero-shot tasks without any example-based supervision.
An interesting observation is the contrast between the single setting and the merged setting. While the paragraph unit proved highly effective in the single setting, its performance was less stable in the merged setup. Unlike the single dataset, which is composed of coherent documents from a single source, the merged dataset is constructed by merging documents from diverse domains and contexts. In such cases, although sentences are sorted by FRE scores, the disruption of contextual continuity appears to negatively impact learning.
5.3.2. Evaluation on GLUE Task
Table 10 demonstrates that the reading-level guided curriculum learning approach, when applied to models trained on our merged setting, leads to improved natural language understanding in BERT. Averaged GLUE scores reveal a clear trend: as the curriculum becomes more fine-grained—specifically when FRE scores are applied at the sentence level—performance improves. This suggests that our curriculum strategy contributes meaningfully to enhancing language understanding in small-scale models.
Among the tasks, MNLI and QQP exhibited relatively weaker performance. As shown in Figure 2, these datasets have notably low single-token coverage, with less than half of the unique words preserved as individual tokens. This high degree of subword fragmentation likely hindered the model's ability to capture word-level semantics. In contrast, BoolQ did not suffer a performance drop. Our analysis suggests that, despite the impact of token fragmentation in MNLI and QQP, the model's early exposure to simpler sentences, analogous to human child language acquisition, helped form a stronger linguistic foundation, leading to improved reasoning and inference in BoolQ.
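The single-token coverage statistic behind Figure 2 can be computed as follows. The sketch uses the Hugging Face tokenizers API; "gpt2" here is an illustrative choice and stands in for whichever tokenizer the evaluated model actually uses.

```python
# Sketch: fraction of unique words in a task's texts kept as a single token.
from transformers import AutoTokenizer

def single_token_coverage(texts, tokenizer_name="gpt2"):
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    vocab = set()
    for text in texts:
        vocab.update(text.lower().split())
    # Prefix a space so byte-level BPE tokenizers see word-initial forms.
    kept = sum(1 for w in vocab if len(tok.tokenize(" " + w)) == 1)
    return kept / max(1, len(vocab))

# Usage (hypothetical inputs): single_token_coverage(mnli_premises + mnli_hypotheses)
```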
6. Conclusions
This study demonstrated that a reading-level guided curriculum, based on the Flesch Reading Ease (FRE) score, can significantly enhance the learning efficiency of small language models under data-constrained conditions. Our work makes three key contributions through a curriculum learning approach based on the FRE score. First, by leveraging a single readability metric, the FRE score, we propose a curriculum that is substantially simpler and more interpretable than existing approaches. Second, by restricting pretraining data to 100 M words, we control for the factor of large-scale data and isolate pretraining methodology as the primary driver of improvement. Third, unlike prior curriculum learning studies that often yielded inconclusive results, this work provides clear and positive learning signals, demonstrating the effectiveness of readability-based curricula. In particular, our approach improved both grammatical generalization and factual reasoning in zero-shot and downstream settings, with gains of up to +19.83% on tasks like BLiMP and consistent improvements in GLUE benchmarks.
In addition to these contributions, applying the curriculum across different levels of granularity—ranging from sentences to grouped segments and paragraphs—and under both single-source and heterogeneous datasets revealed coherent patterns of improvement. These outcomes emphasize that the value of this work lies not only in the method itself but also in its systematic application under diverse training conditions. Thus, the findings highlight that a carefully designed, readability-based curriculum can translate methodological simplicity into tangible efficiency gains and stronger generalization, offering a practical and cognitively motivated pathway for advancing small language models.
At a more fine-grained level, the experimental results suggest concrete guidance for practice. The effectiveness of our curriculum varied depending on the dataset composition. In the single setting, where training data came from a coherent source, the paragraph unit curriculum consistently yielded robust gains.
In contrast, for merged datasets drawn from heterogeneous sources, the grouped-level curriculum was more effective. Maintaining internal coherence within each phase led to more stable improvements. Therefore, when applying our method, we found the following:
For a single setting, adopting the paragraph unit allows for fine-grained progression and leads to strong generalization.
For a merged setting, it is more desirable to construct each curriculum level with the group unit from contextually similar sources to preserve coherence within each level.
In addition, our tokenizer coverage analysis in Figure 2 revealed that tasks with low single-token preservation (e.g., MNLI and QQP) exhibited diminished performance, suggesting that excessive subword segmentation may hinder semantic understanding in small models. This finding emphasizes the importance of aligning tokenization granularity with curriculum design.
While our method demonstrated strong zero-shot and fine-tuning performance, limitations remained. The current framework relies on surface-level readability metrics and does not yet incorporate deeper semantic or discourse-level complexity. In addition, it remains to be verified whether the proposed approach can scale effectively to larger datasets and more powerful model architectures. Beyond scalability, since the Flesch Reading Ease score is an English-specific readability measure, its cross-linguistic applicability remains to be verified. Investigating cognitively grounded readability metrics in other languages will be an important step toward extending the framework’s generality beyond English. Future work will also explore its applicability to a broader range of downstream tasks, including question answering and summarization, as well as its extension to multilingual, instruction-tuned, or few-shot settings. Nevertheless, the simplicity, interpretability, and architecture-agnostic nature of our approach position it as a promising framework not only for advancing cognitively plausible and data-efficient NLP systems but also as a practical option in resource-limited environments where small-scale models must be deployed.