Article

Identification of Writing Strategies in Educational Assessments with an Unsupervised Learning Measurement Framework

by Cheng Tang, Jiawei Xiong * and George Engelhard
Department of Educational Psychology, Mary Frances Early College of Education, University of Georgia, Athens, GA 30605, USA
* Author to whom correspondence should be addressed.
Educ. Sci. 2025, 15(7), 912; https://doi.org/10.3390/educsci15070912
Submission received: 12 May 2025 / Revised: 30 June 2025 / Accepted: 15 July 2025 / Published: 17 July 2025
(This article belongs to the Section Education and Psychology)

Abstract

This study proposes a framework that leverages natural language processing and unsupervised machine learning techniques to measure, identify, and classify examinees’ writing strategies. The framework integrates three categories of writing strategies (text complexity, evidence use, and argument structure) to identify the characteristics of examinees’ writing. Additionally, a measurement model is used to calibrate examinees’ writing proficiency. An empirical example is presented to demonstrate the performance of the framework. The data comprise 430 Grade 8 examinees’ responses to English Language Arts (ELA) assessments in the United States. Using K-means clustering, distinct patterns were identified in each category. The one-parameter logistic measurement model was applied to estimate examinees’ writing proficiency. Analyses revealed significant effects of text complexity and evidence use on writing proficiency, while argument structure was not significant. This study has implications for writing instruction and assessment design, highlighting that effective writing is not simply a matter of isolated skill acquisition but rather the coordinated implementation of complementary strategies, a finding that supports cognitive developmental theories of writing.

1. Introduction

Educational assessments typically include constructed response items, which require students to generate written answers (Seifert & Sutton, 2009). These open-ended items assess students’ ability to articulate arguments, demonstrate critical thinking, and apply logical reasoning, while also evaluating their writing proficiency (Xiong et al., 2024). Measurement models have commonly emphasized analyses through holistic or rubric-based scoring, where the rater makes an overall judgment about the quality of performance (Jonsson & Svingby, 2007). As a result, these methods focus predominantly on calibrating final writing proficiency and may overlook the underlying cognitive and strategic processes that examinees employ during the assessment.
Research has indicated that investigating the responding process can provide a comprehensive evaluation of examinees’ proficiency and reveal a more nuanced relationship between cognitive resources and performance outcomes (M. Kim et al., 2021). Particularly, there has been a growing interest in studying how different writing strategies may influence examinees’ writing proficiency (Chuang & Yan, 2022). Research has suggested that students’ strategy selection plays a critical role in demonstrating their writing performance (McCutchen, 2000). For example, skilled writers use strategies that integrate extensive knowledge, empirical evidence, and topic expertise compared with novice writers who demonstrate less coherent writing. Furthermore, studies have indicated that strategic approaches to writing, such as meaning-based complexity and form-based complexity, may significantly impact both the quality of written responses and overall test performance (Yasuda, 2024). For example, research has shown that examinees who employ systematic planning strategies tend to produce more coherent and well-structured responses (Kormos, 2011). Moreover, strategies related to language use, content, and organization are core aspects of text quality (Pun & Li, 2024; Toth, 2025). These findings highlight the importance and necessity of evaluating writing processes, rather than just final outcomes, in writing assessment measurement.
There are, however, at least two measurement challenges when evaluating examinees’ writing strategies. First, there could be various writing strategies across the examinees’ writing compositions. In other words, examinees might use more than one writing strategy when generating responses (McCutchen, 2000; Sagredo-Ortiz & Kloss, 2025). Second, the relationship between strategy use and performance is complex and difficult to measure. Common strategy evaluation approaches mainly rely on various think-aloud methods, such as protocol analysis (Ericsson, 2017). The focus is on collecting verbal reports from the examinees to infer real-time cognitive processes, and the aim is to model the sequence and structure of thoughts during writing. These approaches are usually limited by various factors. First, scalability constraints could impede the application due to its reliance on meticulous data collection, systematic encoding, and model building (Hayes & Flower, 1986). This limitation greatly constrains the efficiency of analyzing constructed responses when the number of examinees is relatively large. Moreover, potential biases might exist because of the influence of inaccurate interpretation and prompting bias, and this might distort the participants’ original thinking process (Charters, 2003). In addition, there might also be difficulties in capturing the full range of strategic behaviors because individual writers have different composing strategies, such as linear planners or recursive revisers (Kellogg, 2008).
Recent advances in natural language processing (NLP; Chowdhary, 2020) and artificial intelligence have created unprecedented opportunities to investigate student writing from various perspectives. For example, research has demonstrated that NLP techniques and tools can detect latent strategic behaviors by analyzing temporal patterns (Ben-Porat et al., 2020; Chen et al., 2018) and extracting patterns from linguistic features (Kyle & Crossley, 2016). However, they often focus on isolated aspects of writing rather than the combination of different strategies. This may oversimplify examinees’ dynamic utilization of strategies because it is common for them to use multiple strategies in the writing generation process. Additionally, in recent years, AI tools have been widely applied to the assessment of essays and constructed responses to provide fast and consistent evaluations of scores and feedback. However, they fall short of assessing higher-order writing skills such as organization, coherence, and logical flow (Alharbi, 2023; H. Kim et al., 2024). Thus, it is crucial and necessary to develop an approach to comprehensively and efficiently identify and measure the writing strategies.
Machine learning frameworks have shown potential in modeling the complex relationship between strategy use and writing quality (Talebinamvar & Zarrabi, 2022). Therefore, this study addresses these limitations above by introducing a novel and comprehensive computational framework that leverages NLP and unsupervised machine learning techniques to efficiently identify, measure, and classify writing strategies from constructed responses in educational assessments.
It is commonly recognized that there are some major categories of writing strategies. For example, linguistic development, such as syntactic complexity, lexical diversity, and word frequency, has been an important topic in writing (McNamara et al., 2010). Evidence use from source material is also a critical component in students’ writing to expand depth and add credibility (Cumming et al., 2016; Driscoll & Brizee, 2013). Moreover, the characteristics of argumentation in essay performances can provide additional measures for the quality of the writing. Chuang and Yan (2022) analyzed argument structure in 150 argumentative essays of different levels to see if they reflect proficiency differences. Considering previous research, we constructed three categories of writing strategy: text complexity, evidence use, and argument structure. In these categories, text complexity is used to capture linguistic patterns that reflect organizational approaches. Evidence use distinguishes between direct quotation, paraphrasing, and original content. The argument structure focuses on how students construct and connect claims throughout their responses.
By integrating the identification of these categories into a unified framework, the purpose of this study is to provide a more comprehensive understanding of writing strategy profiles and their relationships to performance outcomes as measured by rubric-based scoring assessment systems. To provide a more integrated view, the interpretation of this study is guided by the cognitive developmental theory of writing proposed by Bereiter and Scardamalia (2013), specifically two fundamental composing models: knowledge-telling and knowledge-transforming. The knowledge-telling model, often used by novice writers, involves a straightforward transcription of retrieved ideas. In contrast, the knowledge-transforming model, employed by expert writers, reconceptualizes writing as a complex problem-solving process that reshapes the writer’s own understanding. This theoretical lens suggests that proficiency is not merely the sum of discrete strategies but an indicator of the underlying cognitive model a writer employs.
In the following sections, we detail the theoretical foundations of our approach, describe the computational methods employed, present findings from empirical applications, and discuss implications for educational measurement and instruction. The specific research questions addressed in this study are as follows:
  • Can the proposed framework effectively identify distinct writing strategy patterns in examinees’ constructed responses?
  • How do the identified writing strategies relate to the examinee’s writing performance?

2. Writing Strategies

This section examines the relevant literature on the three categories of writing strategies that form the foundation of the proposed unified framework: text complexity, evidence use, and argument structure. By reviewing research on these three categories, the goal is to establish the theoretical underpinnings necessary for an integrated analysis.

2.1. Text Complexity

Text complexity represents a construct that requires various measurement approaches to capture its nuanced aspects (Lahmann et al., 2019). It usually refers to characteristics in two groups: lexical complexity and syntactic complexity (C. Lu et al., 2019). These two groups provide complementary insights into writing proficiency and development (Yasuda, 2024).
Lexical complexity represents the degree of lexical richness, such as density, sophistication, and variation in word choice (X. Lu, 2012). Peng et al. (2023) proposed that lexical complexity should be understood as having two main components, lexical diversity and lexical sophistication. There are several popular metrics to measure them. For example, the Type–Token Ratio (TTR) is a simple measure of lexical diversity, defined as the ratio of unique word types to total tokens in a text, where a token refers to each individual word occurrence (Kettunen, 2014). In addition to TTR, average word length is considered a strong quantitative measure of lexical sophistication because it is easy to compute and correlates robustly with the maturity of academic style (Verspoor et al., 2017). Additionally, more sophisticated measures such as the measure of textual lexical diversity (Mazgutova & Kormos, 2015) and the vocd-D (Yoon & Polio, 2017) have been proposed to measure lexical complexity.
Syntactic complexity has been broadly defined as the diversity and elaborateness of grammatical structures used in the writing process (Lyu et al., 2022; Zhang & Zhang, 2024). Research has suggested the use of detailed clausal complexity measures to complement comprehensive measures with analytical tools such as Coh-Metrix (Graesser et al., 2004), L2 syntactic complexity analyzer (X. Lu, 2010), and the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (Kyle, 2016).

2.2. Evidence Use

The analysis of evidence use identifies how examinees incorporate source materials into their writing. For example, when sharing specific ideas from a source, writers can choose direct quotations, which are verbatim reproductions of a speaker’s words, or paraphrases, which restate a speaker’s ideas in the examinee’s own words (Weaver et al., 1974). Patterns of source material use have been detected with approaches such as automated lexical analysis and statistical modeling, but these approaches can face challenges such as multicollinearity (Kyle & Crossley, 2016).
Word embeddings provide computational methods for detecting patterns in evidence use. Significant progress has been made in the techniques of word embedding from the Word2Vec method (Mikolov et al., 2013) to the GloVe method (Pennington et al., 2014). In addition to these advances, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) have recently revolutionized NLP technologies because they provide both positional and contextualized embeddings, which makes them suitable for semantic similarity detection (Zhou et al., 2024).
Complementing these approaches, the Term Frequency–Inverse Document Frequency (TF-IDF) (Ramos, 2003) algorithm determines the relevance of a word in a document by comparing its frequency in a specific document to its inverse frequency across the entire corpus, which can reflect how important a word is to a document. It can be used to assess lexical similarity between texts. When combined with semantic similarity measures, this technique creates a robust approach for distinguishing between direct quotations, which typically exhibit high lexical and semantic similarity to source materials, and paraphrases, which maintain semantic alignment while demonstrating greater lexical divergence.

2.3. Argument Structure

Argument structure encompasses the organizational patterns and logical frameworks that examinees use to present and support their main ideas. Research in this field has identified several argumentation architectures among students with theory-based frameworks. Mann and Thompson (1988) discussed elements relevant to argumentation, such as Contrast-based relationships, sequential organization, evidential support, justification, and concession elements. Building on this foundation, Crammond (1998) examined structural features through Toulmin’s framework (Toulmin, 2003). This research showed that argument chains with increasing depth represent a progression-based approach, while expert writers and older students used more counter-rebuttals. This work also proposed that some student texts contained nonfunctional segments that were either unrelated or not linked by semantic or syntactic relational links to the arguments. Nussbaum and Schraw (2007) focused on practical strategies for integrating arguments and counterarguments in students’ writing, categorizing them as refutation, synthesis, and weighing.
NLP technologies have recently been applied to this field as well. For instance, Lippi and Torroni (2015) discussed structured argumentation based on models such as the claim/premise model and the Toulmin model and used machine learning to predict argument relations with argument components. Lawrence and Reed (2015) further proposed argument structure identification through techniques such as discourse indicators, topic-based similarity, and argumentation schemes. Additionally, Lawrence and Reed (2020) analyzed the interdependencies between argument components, relations, and contextual knowledge to find the patterns of reasoning and dialogical relations. Hua and Wang (2022) proposed a framework to automatically extract argument structure using a Transformer model. This convergence of theoretical frameworks and computational methods has significantly enhanced our ability to automatically analyze argument structure in writing.

3. Measurement Framework

3.1. Conceptual Framework for Identifying Writing Strategies

A three-category conceptual framework is proposed to integrate the aforementioned categories, text complexity, evidence use, and argument structure, to generate comprehensive profiles of examinee writing. This approach allows for the identification and measurement of distinct writing strategy patterns while accommodating natural variations in examinees’ responses. Each component of the analytical framework is presented below in Figure 1.
This framework illustrates a comprehensive methodology for analyzing writing strategies and their relationship to proficiency. The process begins with the examinees’ constructed responses, which undergo a series of procedures such as feature extraction and preprocessing to derive three distinct categories of features: text complexity, evidence use, and argument structure. Each feature set is then subjected to the unsupervised machine learning algorithm, K-means clustering, resulting in distinct strategy clusters for each category. Concurrently, examinees’ scores are processed through the Rasch model based on item response theory to establish writing proficiency metrics. The strategy clusters and writing proficiency measures are then combined to inform statistical analyses that investigate the relationships and patterns between writing strategies and writing proficiency.

3.1.1. Text Complexity Analysis

The text complexity analysis captures different aspects of writing sophistication through four feature groups: basic measures, lexical complexity, syntactic complexity, and readability metrics. In the basic measures, fundamental metrics were computed, including word count and average sentence length, which provide baseline indicators of text volume and complexity. For lexical complexity, we assessed the lexical diversity by calculating the TTR and the lexical sophistication by calculating the average word length. In syntactic complexity, clauses per sentence and the dependent clauses ratio were identified to estimate structural complexity. Additionally, the Flesch–Kincaid grade level formula was incorporated (Flesch, 1948; Kincaid et al., 1975), which combined sentence length and syllable counts to estimate the academic grade level required to comprehend the text. This provides a standardized measure of overall text complexity.
For implementation, a pipeline is developed based on Python 3.13 (Van Rossum & Drake, 1995) utilizing the Natural Language Toolkit library (NLTK) (Hardeniya et al., 2016) for text tokenization, the spaCy library (Vasiliev, 2020) for linguistic processing, including dependency parsing, and the textstat library (Mayahi & Alshatti, 2023) for readability calculations.
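To make the feature set concrete, the following is a minimal sketch of how these text complexity features might be computed with the libraries named above. It is illustrative rather than the authors’ exact pipeline: the clause-related measures are approximated from spaCy dependency labels, and the function name, label set, and capping of edge cases are assumptions.

```python
# Illustrative sketch of the text complexity feature extraction (not the authors'
# exact pipeline). Assumes: pip install nltk spacy textstat, the spaCy model
# en_core_web_sm, and nltk.download("punkt") for the tokenizers.
import nltk
import spacy
import textstat

nlp = spacy.load("en_core_web_sm")

def text_complexity_features(response: str) -> dict:
    tokens = [t for t in nltk.word_tokenize(response) if t.isalpha()]
    sentences = nltk.sent_tokenize(response)
    doc = nlp(response)

    # Basic measures
    word_count = len(tokens)
    avg_sentence_length = word_count / max(len(sentences), 1)

    # Lexical complexity: Type-Token Ratio and average word length
    ttr = len(set(w.lower() for w in tokens)) / max(word_count, 1)
    avg_word_length = sum(len(w) for w in tokens) / max(word_count, 1)

    # Syntactic complexity approximated from the dependency parse:
    # dependent clauses counted via clause-level dependency labels
    clause_deps = {"ccomp", "advcl", "acl", "relcl", "xcomp", "csubj"}
    dependent_clauses = sum(1 for tok in doc if tok.dep_ in clause_deps)
    finite_clauses = len(sentences) + dependent_clauses
    clauses_per_sentence = finite_clauses / max(len(sentences), 1)
    dependent_clause_ratio = dependent_clauses / max(finite_clauses, 1)

    # Readability: Flesch-Kincaid grade level
    fk_grade = textstat.flesch_kincaid_grade(response)

    return {
        "word_count": word_count,
        "avg_sentence_length": avg_sentence_length,
        "ttr": ttr,
        "avg_word_length": avg_word_length,
        "clauses_per_sentence": clauses_per_sentence,
        "dependent_clause_ratio": dependent_clause_ratio,
        "fk_grade": fk_grade,
    }
```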

3.1.2. Evidence Use Analysis

The evidence use analysis of the framework aims to identify how examinees incorporate source materials in their constructed responses. Effective use of evidence is a critical aspect of academic writing. Students demonstrate varying levels of sophistication in how they incorporate source materials; some rely heavily on direct quotations, while others demonstrate higher-level skills by paraphrasing or synthesizing information. This methodology enables the automated detection of direct quotations, paraphrases, and original content in student writing through analysis.
The proposed method employs dual similarity measures to capture different aspects of textual relationships between student responses and source passages. Specifically, the calculation of similarity is the key component in this method because it enables the precise classification of evidence use based on the numerical values of the dual metrics. For the similarity computation, two distinct similarities are computed between each student’s response and the source passages. The first uses the TF-IDF vectorization technique to represent texts as high-dimensional sparse vectors that capture the importance of words relative to the corpus; the TF-IDF vectorizer is fitted on the combined corpus of examinees’ responses and source passages. The second uses pre-trained BERT embeddings to capture deep semantic relationships beyond lexical overlap; the pre-trained BERT model generates contextualized embeddings for both examinees’ responses and source passages. Cosine similarity is then calculated between the representations of each response–passage pair for both measures.
Throughout this analysis, the following specialized libraries within Python 3.13 were employed: NLTK for the tokenization, scikit-learn for the TF-IDF vectorization, and the Transformer library for the BERT model (Wolf et al., 2020).
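As an illustration of this dual-similarity computation, the sketch below pairs a TF-IDF representation with mean-pooled BERT embeddings and scores a response–passage pair with cosine similarity. The model name (bert-base-uncased), the mean-pooling choice, and fitting the vectorizer on a single pair are simplifying assumptions, not the authors’ exact configuration.

```python
# Illustrative sketch of the dual-similarity computation for evidence use.
# Assumes: pip install scikit-learn transformers torch
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def bert_embedding(text: str) -> torch.Tensor:
    """Mean-pooled contextual embedding for a (short) text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)

def dual_similarity(response: str, source_passage: str) -> tuple[float, float]:
    # Lexical similarity: TF-IDF vectors (here fitted on the pair alone;
    # the full framework fits the vectorizer on the whole corpus)
    tfidf = TfidfVectorizer(stop_words="english")
    vectors = tfidf.fit_transform([response, source_passage])
    tfidf_sim = float(cosine_similarity(vectors[0], vectors[1])[0, 0])

    # Semantic similarity: cosine similarity of BERT embeddings
    emb_r, emb_s = bert_embedding(response), bert_embedding(source_passage)
    bert_sim = float(cosine_similarity(emb_r.numpy(), emb_s.numpy())[0, 0])
    return bert_sim, tfidf_sim
```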

3.1.3. Argument Structure Analysis

The argumentation structure analysis identifies distinctive patterns in how students construct their arguments in written responses. This methodology enables the detection of various argumentative strategies through computational analysis of semantic relationships between sentences. In this study, the Sentence Transformer model ("all-MiniLM-L6-v2") is used to generate dense vector representations of sentences that capture their semantic content (Reimers & Gurevych, 2019). With these embeddings, five features that characterize different aspects of argumentation structure were successfully extracted: linearity score, linearity variance, contrast score, first–last similarity, and claim centrality.
The linearity score measures sequential coherence between adjacent sentences by calculating the mean cosine similarity between consecutive sentence embeddings. Higher values indicate a more linear progression in which each idea builds directly upon the previous one, creating a clear logical thread throughout the response. Building upon the concept of linearity, we also examine its consistency through linearity variance, which computes the variance of similarities between consecutive sentences. Higher linearity variance indicates a more variable connection strength between adjacent sentences.
While the previous metrics focus on sequential relationships, the contrast score captures the frequency and nature of thematic transitions within the response. This metric is derived by calculating each sentence’s similarity to topic-specific keyword centroids and tracking shifts between topics. Higher values suggest more comparative or contrastive reasoning strategies, indicative of different argumentative approaches compared with the previous ones.
The structural relationship between beginning and ending points is captured by first–last similarity, which evaluates the semantic relationship between introductory and concluding sentences. Higher values suggest circular argumentation structures where conclusions explicitly connect back to the initial claims. Similar to the first–last similarity, claim centrality assesses the degree to which the response maintains reference to its central proposition, operationalized as the first sentence, by measuring the mean cosine similarity between the initial sentence and all subsequent sentences. Higher values indicate arguments that maintain strong connections to the central claim throughout.
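A minimal sketch of how four of these features could be derived from sentence embeddings is given below, assuming the same "all-MiniLM-L6-v2" model. The contrast score is omitted because it additionally requires topic-specific keyword centroids, and the exact feature definitions here are illustrative rather than the authors’ implementation.

```python
# Illustrative sketch of argument structure features from sentence embeddings.
# Assumes: pip install sentence-transformers nltk scikit-learn
import numpy as np
import nltk
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def argument_structure_features(response: str) -> dict:
    sentences = nltk.sent_tokenize(response)
    if len(sentences) < 2:
        return {}
    emb = model.encode(sentences)                  # (n_sentences, 384)
    sims = cosine_similarity(emb)                  # pairwise sentence similarities

    # Similarities between consecutive sentences drive linearity and its variance
    consecutive = np.array([sims[i, i + 1] for i in range(len(sentences) - 1)])

    return {
        "linearity_score": float(consecutive.mean()),
        "linearity_variance": float(consecutive.var()),
        "first_last_similarity": float(sims[0, -1]),
        # Claim centrality: mean similarity of all later sentences to the opening claim
        "claim_centrality": float(sims[0, 1:].mean()),
        # Contrast score omitted: it requires topic-keyword centroids (see text)
    }
```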

3.2. Clustering Approach

The proposed framework employs unsupervised K-means clustering (Sinaga & Yang, 2020) to identify meaningful patterns in student writing across all three categories. This data-driven approach allows us to discover natural groupings without requiring predefined categories or manual labeling.
For evidence use analysis, the K-means clustering algorithm is applied to identify three clusters directly using two features, the BERT and TF-IDF similarity scores. The classification schema for the three clusters depends on the values of the two similarities: high BERT and high TF-IDF scores indicate direct quotes, high BERT and low TF-IDF scores represent paraphrases, and low BERT scores indicate original content regardless of the TF-IDF value. The cluster centers are sorted by their combined similarity scores. Specifically, the threshold for high BERT similarity is set at the midpoint between the highest and second-highest cluster centers on the BERT dimension. The same rule is applied to the TF-IDF dimension. This adaptive technique allows classification boundaries to adjust automatically to dataset-specific characteristics.
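The sketch below illustrates this adaptive-threshold classification under the assumption that each response is represented by a two-column feature vector of BERT and TF-IDF similarity scores; the function name and label strings are illustrative.

```python
# Sketch of the adaptive-threshold classification of evidence use, assuming a
# feature matrix with columns [bert_sim, tfidf_sim] (illustrative, not the
# authors' exact code). Assumes: pip install scikit-learn numpy
import numpy as np
from sklearn.cluster import KMeans

def classify_evidence_use(features: np.ndarray) -> list[str]:
    kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features)
    centers = kmeans.cluster_centers_

    # Threshold = midpoint between the highest and second-highest cluster
    # centers on each similarity dimension
    bert_sorted = np.sort(centers[:, 0])[::-1]
    tfidf_sorted = np.sort(centers[:, 1])[::-1]
    bert_thresh = (bert_sorted[0] + bert_sorted[1]) / 2
    tfidf_thresh = (tfidf_sorted[0] + tfidf_sorted[1]) / 2

    labels = []
    for bert_sim, tfidf_sim in features:
        if bert_sim >= bert_thresh and tfidf_sim >= tfidf_thresh:
            labels.append("direct_quote")       # high semantic and lexical overlap
        elif bert_sim >= bert_thresh:
            labels.append("paraphrase")         # high semantic, low lexical overlap
        else:
            labels.append("original_content")   # low semantic similarity to sources
    return labels
```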
For text complexity and argument structure, unlike the evidence use analysis, the optimal number of clusters is not certain and needs to be determined at the beginning of the analysis. Explorations with multiple clustering solutions ranging from 2 to 10 clusters were conducted. For each potential solution, both silhouette scores (Shahapure & Nicholas, 2020) and distortion scores with the elbow method (Shi et al., 2021; Thorndike, 1953) were computed as the evaluation metrics. The silhouette score measures how similar an object is to its own cluster compared to other clusters, with higher values indicating better-defined clusters. The distortion score represents the sum of squared distances from each point to its assigned center, with the “elbow point” in this metric suggesting an optimal balance between model complexity and explanatory power.
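The following sketch shows how such an evaluation over candidate cluster numbers might be run with scikit-learn, using silhouette scores and K-means inertia as the distortion measure; it assumes an already standardized feature matrix X and is not the authors’ exact code.

```python
# Sketch of the cluster-number search with silhouette and distortion scores.
# Assumes: pip install scikit-learn numpy; X is a standardized feature matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def evaluate_cluster_solutions(X: np.ndarray, k_range=range(2, 11)) -> dict:
    results = {}
    for k in k_range:
        km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
        results[k] = {
            "silhouette": silhouette_score(X, km.labels_),
            # inertia_ is the sum of squared distances to the nearest center,
            # i.e., the distortion used by the elbow method
            "distortion": km.inertia_,
        }
    return results
```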
This clustering methodology provides a robust foundation for identifying and characterizing different clusters in certain writing strategies across multiple strategy categories. By selecting optimal cluster solutions through systematic evaluation, the framework maintains interpretability while capturing meaningful variation in student writing practices.

3.3. One-Parameter Logistic Measurement Model

To investigate the relationship between identified writing strategies and student performance outcomes, the one-parameter logistic measurement model (1PL; Engelhard, 2013), which is widely used in educational assessment to estimate examinees’ proficiency within the framework of item response theory, was employed. Equation (1) shows the 1PL model for dichotomous item scores:
$$P(X_{ni} = 1) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}$$
where P(X_ni = 1) is the probability that person n correctly answers item i, θ_n represents the proficiency of person n, and b_i denotes the difficulty parameter of item i. This model places person and item parameters on the same logit scale, enabling invariant measurement. That is, person ability estimates are independent of the specific set of items used, and item difficulty estimates are independent of the sample of persons when a good model fit is obtained.
In order to model polytomous responses where examinees can receive partial credit across ordered categories, the Partial Credit Model (PCM; Masters, 1982), a generalization of the 1PL model, was employed for items with more than two scoring levels. The PCM models the probability of a person responding in category k of item i, conditional on their latent ability θ_n and the step (threshold) parameters δ_ik, as follows:
$$P(X_{ni} = k \mid \theta_n) = \frac{\exp\left(\sum_{m=0}^{k} (\theta_n - \delta_{im})\right)}{\sum_{j=0}^{M_i} \exp\left(\sum_{m=0}^{j} (\theta_n - \delta_{im})\right)} \quad \text{for } k = 0, 1, \ldots, M_i$$
where X_ni is the response of person n to item i, taking a value from 0 to M_i, where M_i is the maximum score category of item i; θ_n is the latent trait (e.g., writing proficiency) of person n; and δ_im is the step (threshold) parameter associated with the transition between categories m − 1 and m for item i. These step parameters δ_im reflect the relative difficulty of achieving each successive score level, allowing for varying category structures across items. This formulation assumes the response process proceeds through a series of ordered thresholds, and the probability of a particular score is determined by the cumulative differences between a person’s ability and these thresholds. As with the dichotomous 1PL model, the PCM maintains the principle of specific objectivity, permitting comparisons of person parameters independent of the items used, provided the model holds.
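For readers who prefer a numerical illustration of Equations (1) and (2), the sketch below computes the 1PL correct-response probability and the PCM category probabilities directly. It is for illustration only; the calibration in this study was performed with the mirt package as described next.

```python
# Numerical sketch of the 1PL and Partial Credit Model probabilities from
# Equations (1) and (2); illustrative only.
import numpy as np

def p_correct_1pl(theta: float, b: float) -> float:
    """P(X = 1) for a dichotomous item under the 1PL model."""
    return np.exp(theta - b) / (1 + np.exp(theta - b))

def pcm_probabilities(theta: float, deltas: np.ndarray) -> np.ndarray:
    """Category probabilities P(X = k | theta) for k = 0..M under the PCM.
    deltas[m-1] is the step parameter for the transition from category m-1 to m;
    a conventional delta_0 = 0 is prepended so the sum can start at m = 0."""
    steps = np.concatenate(([0.0], deltas))           # delta_0 = 0
    numerators = np.exp(np.cumsum(theta - steps))     # sum_{m=0}^{k} (theta - delta_m)
    return numerators / numerators.sum()

# Example: a 3-category item (scores 0, 1, 2) with two step parameters
print(pcm_probabilities(theta=0.5, deltas=np.array([-0.4, 0.8])))
```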
In this study, the 1PL model and PCM were implemented using the mirt package (Chalmers, 2012) in RStudio 4.3.3 (R Core Team, 2010). The ability estimates were calculated using the Expected A Posteriori (EAP; De Ayala, 1995) method based on the whole test, including multiple choices and constructed responses. In this way, we obtained a more theoretically sound measure of writing proficiency that could be meaningfully compared across different writing strategy clusters rather than relying solely on raw scores.

4. Methodology

An empirical study was conducted to demonstrate the proposed conceptual framework with real writing data.

4.1. Participants

This section describes the data used in the empirical study, which consist of 430 Grade 8 students’ responses from English Language Arts (ELA) assessments administered within several school districts in Georgia, the United States. These ELA assessments are designed to evaluate writing proficiency in extended reasoning and argumentative writing, consisting of both multiple-choice and constructed-response questions. Examinees’ writing answers to the constructed-response questions were investigated to obtain their writing strategies, while their writing proficiencies were estimated using the scores from both the multiple-choice questions and constructed-response questions.
During data cleaning, to ensure the quality of the analysis, responses with fewer than 50 tokens were removed from the data (Zenker & Kyle, 2021), leaving 406 valid responses. The responses vary considerably in length, with word counts ranging from a minimum of 1 to a maximum of 1044 words. The average response contains 310.5 words (median = 301.5), with a standard deviation of 162.4 words.

4.2. Procedures

4.2.1. Feature Preprocessing

Each analysis in this framework requires specific preprocessing techniques to ensure optimal performance of subsequent unsupervised clustering algorithms. For text complexity analysis, several preprocessing steps were implemented to enhance clustering accuracy. For example, extreme values in features such as average sentence length, word count, and clauses per sentence were capped at reasonable upper bounds (50 words per sentence, 1000 words per response, and 5 clauses per sentence, respectively) to prevent outliers from distorting the clustering process. For features with wide value ranges and positive skewness, such as word count, sentence length, clauses per sentence, and Flesch–Kincaid grade level, a logarithmic transformation was implemented to normalize their distributions. Similarly, in the argument structure analysis, all features were normalized to ensure that each feature could contribute proportionally to the clustering process, preventing any single feature from dominating the process due to a disproportionately large value range.
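A brief sketch of this preprocessing, using the capping bounds and logarithmic transformation described above, might look as follows; the column names and the use of a pandas DataFrame are assumptions made for illustration.

```python
# Sketch of the text complexity preprocessing: capping extreme values and
# log-transforming positively skewed features before standardization.
# Illustrative column names; assumes pip install pandas scikit-learn numpy.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess_complexity_features(df: pd.DataFrame) -> np.ndarray:
    df = df.copy()
    # Cap extreme values at the bounds described in the text
    caps = {"avg_sentence_length": 50, "word_count": 1000, "clauses_per_sentence": 5}
    for col, upper in caps.items():
        df[col] = df[col].clip(upper=upper)
    # Log-transform wide-range, positively skewed features
    for col in ["word_count", "avg_sentence_length", "clauses_per_sentence", "fk_grade"]:
        df[col] = np.log1p(df[col].clip(lower=0))
    # Standardize so no single feature dominates the clustering
    return StandardScaler().fit_transform(df)
```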
For evidence use analysis, the preprocessing focuses on text normalization and representation. A tokenization process was deployed to split texts into individual words, the case normalization was applied by converting all text to lowercase, and common English stopwords were removed using scikit-learn’s built-in stopwords list (Pedregosa et al., 2011). Additionally, special characters and extra whitespace were removed, which guarantees that the focus is on the analysis of meaningful content words rather than formatting or function words. When handling longer passages that exceeded BERT’s token limit, a sliding window approach was utilized with a window size of 510 tokens and a stride of 255 tokens to ensure the comprehensive coverage of source materials while maintaining computational efficiency (Beltagy et al., 2020).
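The sliding-window handling of long source passages could be sketched as below, re-adding the special tokens around each 510-token window and mean-pooling the window embeddings; the pooling strategy and model choice are illustrative assumptions rather than the authors’ exact implementation.

```python
# Sketch of sliding-window encoding for passages longer than BERT's 512-token
# limit (window = 510 tokens, stride = 255); illustrative only.
# Assumes: pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def long_text_embedding(text: str, window: int = 510, stride: int = 255) -> torch.Tensor:
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    window_embeddings = []
    start = 0
    while True:
        chunk = token_ids[start:start + window]
        # Re-add [CLS] and [SEP] around each window
        input_ids = torch.tensor([[cls_id] + chunk + [sep_id]])
        attention_mask = torch.ones_like(input_ids)
        with torch.no_grad():
            out = bert(input_ids=input_ids, attention_mask=attention_mask)
        window_embeddings.append(out.last_hidden_state.mean(dim=1))  # (1, 768)
        if start + window >= len(token_ids):
            break
        start += stride
    # Mean-pool across windows to obtain one passage-level embedding
    return torch.cat(window_embeddings).mean(dim=0, keepdim=True)
```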

4.2.2. Text Complexity Analysis

Our methodology began with determining the optimal number of clusters through silhouette and distortion score analysis. Figure 2 and Figure 3 present the two analyses across different cluster plans, respectively. The silhouette scores showed the highest value, 0.3884, for a two-cluster solution, indicated by the red dashed line, with scores decreasing as the number of clusters increased. Concurrently, the distortion score analysis suggested four clusters, indicated by the green dashed line, as an optimal solution based on the elbow method. To balance both metrics and cluster interpretability, a three-cluster solution was selected because it provides the most meaningful differentiation between text complexity strategies while maintaining statistically significant distinctions between groups.
Principal Component Analysis (PCA) reveals a clear separation between the three identified complexity levels, as shown in Figure 4. The first principal component explains 43.0% of the total variance in the text complexity features, while the second component explains an additional 27.1%. The clear separation between clusters in this two-dimensional projection demonstrates that the identified complexity levels represent meaningfully distinct writing patterns. The descriptive statistics for the text complexity clusters are presented in Table 1.
The boxplot of the seven key complexity features across the three clusters is shown in Figure 5. The Basic Composition cluster represents responses characterized by the lowest values for several metrics, including word count, sentence length, clauses per sentence, dependent clauses ratio, and Flesch–Kincaid grade. However, it shows the highest Type–Token Ratio (0.55), indicating greater vocabulary diversity despite using shorter sentences. This pattern suggests concise responses with varied vocabulary but simpler sentence structures.
The Intermediate Composition cluster shows the highest word count among all groups, indicating these are the longest responses. They have moderate sentence length and the lowest Type–Token Ratio, which suggests repetitive vocabulary despite the length. These responses have moderate syntactic complexity with medium values for clauses per sentence, the dependent clauses ratio, and the Flesch–Kincaid grade level. This pattern suggests verbose responses that use longer words but with limited vocabulary diversity.
The Elaborate Composition cluster exhibits the most sophisticated syntactic structure, with substantially higher sentence length, clauses per sentence, and especially a dependent clauses ratio that is significantly higher than both other clusters. While having a moderate word count and Type–Token Ratio, these responses show the highest log Flesch–Kincaid grade level. This pattern indicates responses with complex, sophisticated sentence construction that efficiently uses fewer but more structurally complex sentences.

4.2.3. Evidence Use Analysis

In this part, we examined how students incorporate source materials into their responses. The proposed framework distinguishes the responses between direct quotations, paraphrases, and original content. For each student response, similarity scores were calculated against the provided source passages based on BERT and TF-IDF. Thresholds for classification were determined through unsupervised K-means clustering to identify natural groupings in the similarity space.
Unlike the text complexity, which employed unsupervised clustering to discover the optimal number of clusters, the evidence use analysis utilized a rule-based system with clear theoretical underpinnings based on two similarity metrics. The high BERT similarity with a high TF-IDF similarity pattern indicates direct quotation. High BERT similarity with a low TF-IDF similarity indicates paraphrasing, and the low similarity in BERT represents original content regardless of the value in TF-IDF. This classification scheme is grounded in distinctions about how students engage with source materials rather than being derived from the data itself. Therefore, examining silhouette or distortion scores was unnecessary in this category.
As shown in Figure 6, the three categories of evidence use demonstrate a clear separation in the similarity space. The red cluster represents the direct quote, which has high BERT similarity (≥ 0.93) and high TF-IDF similarity (≥ 0.40). It shows high similarity to the source material in both semantic meaning and lexical choice. The green cluster represents the paraphrase, which has high BERT similarity (≥ 0.93) but lower TF-IDF similarity (< 0.40). This indicates that these responses maintain high semantic similarity while demonstrating lexical variation, reflecting students’ ability to restate source information in their own words. The blue cluster represents the original content, which has lower BERT similarity (< 0.93), regardless of the TF-IDF score. It shows a greater distance from source materials in semantic space, with varying degrees of lexical similarity.
The descriptive statistics for the evidence use clusters are presented in Table 1. The distribution of classification results indicates that more than half of the students chose to write original content (n = 234) rather than use direct quotes (n = 116) or paraphrasing (n = 56) when expressing ideas. This suggests that a significant portion of students prefer to develop their own perspectives rather than rely heavily on source materials.

4.2.4. Argument Structure Analysis

For the argument structure analysis, we implemented an unsupervised approach that identifies distinct argumentation patterns based on the structural relationships between sentences within a response. Following a similar methodology to the text complexity analysis, we first determined the optimal number of argument structure clusters by examining silhouette and distortion scores. As shown in Figure 7, the distortion score plot with the elbow method indicates a notable bend at 6 clusters, where the improvement in distortion begins to level off (845), as indicated by the green dashed line. However, when examining the silhouette scores in Figure 8, we observe the highest value at 2 clusters (0.2501), as indicated by the red dashed line, followed by a local peak at 4 clusters (0.2335).
After checking all the results from 2 to 10 clusters, a 3-cluster solution was selected based on both statistical and interpretive considerations. While statistical measures suggested various potential clustering solutions, the 3-cluster approach provided the most interpretable argument structures. This decision was also supported by the clear separation of clusters in the PCA visualization that is shown in Figure 9, where we can observe three distinct groupings with minimal overlap.
The distribution of the cluster features is shown in Figure 10. The distribution of responses across the three identified argument structures was relatively balanced. There are 153 responses in the Linear Progression cluster, 129 responses in the Contrast-based cluster, and 124 responses in the Discrete Arguments cluster. In the PCA results, the first principal component explains 42.7% of the total variance in the argument structure features, while the second component explains an additional 24.4%.
The Linear Progression cluster represents responses characterized by high linearity, moderate first–last similarity, and high claim centrality, paired with low linearity variance and contrast score. These responses demonstrate a coherent, step-by-step development of ideas where each sentence builds logically upon previous content. The strong first–last similarity indicates that these arguments often return to their initial claims, creating a well-rounded structure.
The Contrast-based cluster shows distinctive features of comparative argumentation with near-zero linearity, high linearity variance, high contrast score, and moderate first–last similarity and claim centrality. This pattern indicates arguments constructed through comparisons and contrasts, likely between the two passages discussed in the source materials. These responses frequently switch between topics, creating a balanced presentation of alternative viewpoints rather than a single Linear Progression of ideas.
The Discrete Arguments cluster exhibits negative values across multiple features, most notably extremely low claim centrality and first–last similarity. These responses do not focus on maintaining connections to these initial statements throughout the argument. The low linearity and contrast scores suggest these responses tend to present loosely connected ideas without strong structural organization through either linear development or systematic comparison. This result suggests that students exhibit different argumentative approaches when responding to complex prompts that involve multiple source materials.

4.2.5. Measurement of Student Writing Proficiency

The 1PL model was used to calibrate students’ writing proficiency using the mirt package in RStudio 4.3.3. The resulting proficiency estimates are approximately normally distributed with a mean near 0.00, which is consistent with the centering convention in 1PL modeling. The ability estimates ranged from approximately −3 to 2.5 logits, with the majority of students falling between −1 and 1.5. This distribution suggests that the assessment provided good measurement precision across a wide range of student abilities.

5. Results

After identifying the distinct writing strategies employed by students across text complexity, evidence use, and argument structure dimensions, we examined the relationship between these strategies and writing proficiency.
Table 2 displays the mean writing proficiency by writing strategy combinations. This table also includes information about the frequency of various strategy combinations in our sample. Out of 27 theoretically possible combinations of the three strategy dimensions, 20 occurred more than five times, indicating that students naturally employ diverse combinatorial approaches to writing. The most common strategy combination was “Direct Quote—Intermediate Composition—Linear Progression” (n = 43), followed by “Original Content—Intermediate Composition—Linear Progression” (n = 35). Other frequent combinations included various permutations of Intermediate Composition with different evidence and argument approaches. The mean proficiency of different combinations of strategies is discussed in the next subsection.

5.1. Writing Proficiency by Strategy Type

Figure 11, Figure 12 and Figure 13 illustrate the distribution of proficiency estimates across the different strategy types, respectively. For evidence use, students employing direct quote strategies demonstrated the highest median proficiency scores, followed closely by those using paraphrasing. Students relying primarily on original content without explicit source integration showed substantially lower performance. Our results strongly confirm the importance of source integration in academic writing, as emphasized in writing pedagogy. The significantly higher proficiency of students who used direct quotes or paraphrased, compared to those who relied on original content, aligns with research from Cumming et al. (2016), who found that students with lower English proficiency tend to focus on vocabulary and grammar when composing from sources, while those with higher proficiency concentrate more on cohesion, content, and rhetoric.
For text complexity, the Intermediate Composition strategy showed markedly higher performance compared to both Basic Composition and Elaborate Composition. This suggests that moderate complexity is associated with optimal performance in this assessment context. This is an interesting finding, as prior research has often correlated increased text complexity with higher writing proficiency (Kisselev et al., 2022). The superior performance of the Intermediate Composition group over the Elaborate Composition group indicates that, beyond a certain point, linguistic complexity may cease to be productive and could even hinder communication, a finding that aligns with Yasuda’s (2024) investigation into the different roles of form-based complexity.
Argument structure types showed the least variation in writing performance. All three argument approaches exhibited similar median proficiency estimates with substantial overlap in their distributions.
Table 2 above presents the comprehensive breakdown of all 25 strategy combinations observed across the three writing dimensions: text complexity, evidence use, and argument structure. The table is sorted by frequency to show which strategy combinations were most common. The highest-performing combination was Intermediate Composition with paraphrase evidence and Linear Progression arguments (mean proficiency = 0.892), though this combination was used by relatively few students. Conversely, the most common combination, used by 43 students, was Intermediate Composition with direct quote evidence and Linear Progression arguments, which also showed strong performance (mean proficiency = 0.685). The lowest-performing combination was Basic Composition with direct quote and Discrete Arguments (mean proficiency = −0.880).
In summary, the data suggest that text complexity and evidence use are more strongly associated with measured writing proficiency than the particular argumentative structure students employ. The relationship between text complexity and student proficiency may vary somewhat depending on which evidence use strategy is employed, although this effect is relatively weak. Argument structure alone may not predict performance; however, certain combinations of argument structures with specific complexity and evidence patterns are associated with higher or lower performance outcomes.

5.2. Heatmaps

The heatmaps in Figure 14, Figure 15 and Figure 16 provide a detailed visualization of mean proficiency scores across different strategy combinations. Figure 14 examines evidence use and text complexity, with the highest performance scores appearing in combinations involving Intermediate Composition with either paraphrasing or direct quoting. All Basic Composition combinations showed low proficiency scores, regardless of evidence use strategy. Notably, original content coupled with Intermediate Composition yielded positive but modest scores, suggesting that even without explicit source integration, intermediate complexity can support moderate performance. Elaborate Composition generally showed lower proficiency estimates across all evidence use approaches.
Figure 15 provides the results for evidence use and argument structure. The highest performance was observed in strategies combining paraphrasing with Linear Progression, followed closely by direct quoting with Linear Progression. All original content combinations yielded negative proficiency estimates regardless of argument structure, with the most negative scores in Linear Progression. This pattern suggests that while Linear Progression appears advantageous when combined with source material integration, it may be detrimental without appropriate evidence.
Finally, Figure 16 examines text complexity and argument structure. Intermediate Composition yielded positive proficiency estimates across all argument structures, with Linear Progression showing the highest performance. In contrast, both basic and Elaborate Composition showed negative proficiency estimates across all argument approaches. The lowest performance was observed in the combination of Elaborate Composition with Discrete Arguments, suggesting that sufficient linguistic resources without proper argument structure may be particularly problematic.
From these heatmaps, the interactive nature of writing strategies can be observed. While theories of writing expertise, such as the knowledge-transforming model (Bereiter & Scardamalia, 2013), posit that mature writing requires the orchestration of multiple cognitive processes, our findings provide quantitative evidence for this view.

5.3. Relationship Between Proficiency and Strategy Choice

Table 3 illustrates the predicted probability of strategy employment as a function of proficiency. This offers insights into the developmental trajectory of writing strategy acquisition.
Panel A in Table 3 provides information about text complexity. Lower proficiency students primarily use Basic Composition approaches, with minimal employment of intermediate complexity. As proficiency values increase, there is a strong shift toward Intermediate Composition strategies. Elaborate Composition shows consistently low probability across the proficiency scale, with a slight decline as proficiency increases. This may suggest that this approach reflects misguided attempts at sophistication.
Panel B in Table 3 shows the relationship of writing proficiency with evidence use—students with lower proficiency levels (θ < −0.5) predominantly rely on original content with minimal use of direct quotes or paraphrasing. As proficiency values increase, the probability of using original content decreases dramatically, while direct quotes become increasingly prevalent. At higher proficiency levels (θ > 1.0), direct quoting becomes the dominant strategy, with paraphrasing showing modest increases. This pattern suggests a progression from original content to direct quotation to paraphrase as proficiency increases.
Panel C in Table 3 shows argument structure with the relationship between proficiency and argument structure choice showing less distinctive patterns. Linear Progression probability increases steadily as proficiency value increases, while Discrete Arguments decrease correspondingly. Contrast-based approaches show only modest increases with proficiency value increases. These gentler slopes align with earlier findings that argument structure alone has weaker associations with performance compared to other facets.

6. Discussion

This study proposes a novel computational framework to investigate examinees’ writing strategies and the relationships of these strategies to their writing proficiency. An empirical study is presented to demonstrate the performance of this framework. The results revealed that examinees’ writing strategies can be effectively characterized into three categories: text complexity, evidence use, and argument structure, each with measurable patterns that provide insights into writing cognition and performance. Methodologically, the framework offers a method of educational assessment that moves beyond a single holistic score toward a diagnostic profile of a writer’s strategic approach across different writing strategies. The results also suggested that writing strategies may interact in complex ways to influence writing quality, which can serve as strong empirical support for developmental models of writing to reflect a writer’s progression from a fragmented “knowledge-telling” process to a more coherent “knowledge-transforming” one.
Text complexity was a significant predictor of writing performance, with Intermediate Composition strategies consistently associated with higher ability estimates compared to both basic and elaborate approaches. Examinees who use the Basic Composition strategies appear to be operating within a “knowledge-telling” framework; their linguistic resources are sufficient for transcribing ideas but may be inadequate for building the more complex arguments required for “knowledge transformation”.
By contrast, examinees using Intermediate and Elaborate Composition strategies are both likely attempting a more sophisticated, knowledge-transforming approach, but with differing results. Those in the Intermediate Composition group possess a functional command of language that is complex enough to effectively structure and express sophisticated ideas. For this group, linguistic skill serves as a tool for transforming knowledge. Conversely, writers using Elaborate Composition strategies may be producing overly complex syntax that hinders clarity, ultimately failing to transform their ideas into a coherent and persuasive argument, which leads to a drop in performance. This distinction explains why Intermediate Composition shows the best performance outcomes, as it represents the successful application of the knowledge-transforming model. It may also be the case that the rubrics as used and understood by raters impact the rater-mediated evaluation of writing quality by rewarding effective communication over complexity for its own sake.
Evidence use demonstrated equally strong associations with performance, with both direct quoting and paraphrasing linked to substantially higher proficiency estimates compared to the original content. Students relying on original content and performing poorly are likely engaging in “knowledge-telling”, where they are simply reporting their pre-existing ideas without integrating the new information available in the source materials. This approach is characteristic of a less developed writing process. In contrast, the effective use of evidence is a hallmark of the more sophisticated “knowledge-transforming” model. The process of selecting a quote or constructing a paraphrase forces a writer to move beyond simply reporting what they know. They must actively engage with external sources, evaluate their relevance, and restructure that information to build and support their arguments. This act of source integration is fundamental to the process of transforming knowledge. This highlights the importance of source integration in academic writing, and this may reflect the need for knowledge construction and success in educational contexts.
While argument structure alone did not significantly predict performance, the data suggest, as evidenced by the heatmap results, that argumentative approaches may function as contextual modulators rather than independent determinants of quality. For example, in Figure 15, Linear Progression generally yields high scores (0.60–0.65) when paired with direct quote or paraphrase evidence. However, Figure 16 shows that this effectiveness is primarily confined to Intermediate Composition (0.57). With Basic Composition and Linear Progression, the heatmaps show the lowest scores (−0.48). This suggests that Linear Progression arguments are only effective when the right text complexity is also in place. Another example is that evidence use effectiveness depends on both of the other variables, too. Figure 14 indicates that paraphrase evidence yields high scores (0.76) with Intermediate Composition. The data also suggests this effect varies by argument type (0.29 with Contrast-based vs. 0.65 with Linear Progression). This indicates that the effectiveness of paraphrasing evidence depends simultaneously on both text complexity and argument structure. Our findings suggest that the relationship between any two writing strategies may change depending on the third strategy in use because successful knowledge transformation requires the simultaneous coordination of linguistic, structural, and evidence-integration skills. For educational purposes, this finding suggests that writing strategies do not operate in isolation but rather form a coherent system where optimal outcomes depend on the specific configuration of all three dimensions working together.
There are several specific educational implications of this study. Generally, the significance of this research extends beyond measurement methodology. Understanding the relationship between writing strategies and writing proficiency can be beneficial to instructional interventions and assessment design, as teachers can emphasize certain combinations of strategies if they are consistently correlating with higher proficiency across assessment contexts. Conversely, if different strategies prove effective in other contexts, instructions might need to address this diversity. Moreover, the ability to automatically detect and classify writing strategies at scale offers opportunities for providing support to students during and after the writing process.
The substantial performance differences observed across strategy combinations offer important guidance for writing instruction in school. Rather than teaching individual writing skills in isolation, these findings suggest that curriculum designers should focus on combining strategy development across multiple dimensions. The relationship between strategy combination and writing proficiency can serve as a reference for instruction. For example, for lower-performing students who predominantly use original content with Basic Composition, explicit instruction in source integration techniques may be particularly beneficial. The significant performance advantages associated with direct quotations suggest this could serve as an accessible entry point before advancing to more sophisticated paraphrasing approaches.
The relatively poor performance associated with Elaborate Composition cautions against encouraging linguistic complexity beyond students’ functional capacity. Writing instruction that emphasizes “more is better” in terms of text complexity may inadvertently impede meaningful communication. Instead, instruction might focus on helping students achieve balanced intermediate complexity that supports rather than hinders meaning-making.
The interaction between argument structure and other dimensions suggests that argumentative frameworks may need to be tailored to students’ current capabilities. Linear Progression structures appear most effective for students who have already developed intermediate text complexity and source integration skills, while Contrast-based approaches may provide viable alternatives for students still developing these capabilities.
This study also points toward methodological advancements. The analysis focuses on the final written product; a significant direction for future research is to integrate our framework with process data, such as keystroke logging or eye-tracking. This would allow researchers to investigate whether the strategy clusters we identified, such as Linear Progression or Discrete Arguments, correlate with specific, observable writing behaviors such as pausing, planning, or revision patterns, and would provide deeper validation of the cognitive processes our framework aims to capture. Another meaningful direction is to enhance automated essay scoring (AES) with the framework. Instead of relying on surface-level features alone, AES developers could use the methodology to train systems to identify the underlying strategy clusters. An AES system could then not only score an essay but also classify its strategic profile, offering targeted, automated feedback that guides students toward more effective, knowledge-transforming approaches to writing. This would shift AES from a purely evaluative tool to a powerful pedagogical one; a sketch of this idea follows.
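As one illustration of this direction, the sketch below adds strategy cluster labels to the feature set of a simple scoring model. It assumes scikit-learn and pandas; the column names, toy responses, and rubric scores are hypothetical, and the model is only a stand-in for a production AES system rather than the authors’ pipeline.

```python
# Sketch under stated assumptions: augmenting a simple AES model with the
# strategy cluster labels produced by the framework. Data are illustrative.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({
    "response": ["The author states that ...", "I think dogs are great ...",
                 "As the passage explains ...", "My idea is that ..."],
    "evidence_use": ["Direct Quote", "Original Content",
                     "Paraphrase", "Original Content"],
    "text_complexity": ["Intermediate Composition", "Basic Composition",
                        "Intermediate Composition", "Elaborate Composition"],
    "score": [3, 1, 3, 1],  # rubric scores used as training targets
})

features = ColumnTransformer([
    # Surface lexical features from the essay text.
    ("tfidf", TfidfVectorizer(min_df=1), "response"),
    # Strategy cluster labels identified by the unsupervised framework.
    ("strategies", OneHotEncoder(handle_unknown="ignore"),
     ["evidence_use", "text_complexity"]),
])

aes_model = Pipeline([("features", features), ("regressor", Ridge())])
X = data[["response", "evidence_use", "text_complexity"]]
aes_model.fit(X, data["score"])
print(aes_model.predict(X))  # predicted scores; the fitted strategy weights
                             # could also drive strategy-level feedback
```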
This study makes several methodological contributions to the study of writing. First, it demonstrates the use of unsupervised machine learning to identify writing strategies without imposing predetermined categories or thresholds. The data-driven clustering methodology reveals hidden patterns that might be missed through theory-driven categorization alone; it complements common qualitative analysis methods, offers the potential for scaling writing research to larger datasets, and provides researchers with a new line of inquiry into constructed-response assessments as well as an efficient tool to assist human experts. Second, the integration of item response measurement models with cluster analysis allows the examination of how latent traits relate to emergent strategy patterns. This combined approach bridges psychometric and computational perspectives on writing assessment, potentially enriching both fields and supporting a more comprehensive analysis; a brief sketch of this combined workflow is given below.
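A minimal sketch of this combined workflow, assuming scikit-learn, pandas, and NumPy, is shown below. The complexity features, the simulated values, and the “theta” column (standing in for 1PL proficiency estimates exported from a measurement model) are illustrative assumptions rather than the authors’ exact implementation.

```python
# Minimal sketch: standardize text-complexity features, choose k with the
# silhouette score (cf. Figure 2), fit K-means, then compare 1PL proficiency
# estimates across the resulting clusters. Values are simulated for illustration.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
essays = pd.DataFrame({
    "mean_sentence_length": rng.normal(15, 4, 200),
    "type_token_ratio": rng.uniform(0.3, 0.8, 200),
    "flesch_kincaid_grade": rng.normal(8, 2, 200),
    "theta": rng.normal(0, 1, 200),  # 1PL proficiency estimates, in logits
})

X = StandardScaler().fit_transform(
    essays[["mean_sentence_length", "type_token_ratio", "flesch_kincaid_grade"]])

# Evaluate candidate numbers of clusters with the silhouette score.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))

# Fit the chosen solution and relate emergent clusters to measured proficiency.
essays["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(essays.groupby("cluster")["theta"].agg(["count", "mean", "std"]))
```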
There are several limitations of this study that should be addressed in future research. First, the proposed framework analyzes only the final written product. While effective for identifying strategic patterns in completed texts, this approach does not allow us to directly observe the cognitive processes writers engage in during composition. A crucial next step is to integrate our framework with process data, such as keystroke logging or think-aloud protocols, to investigate whether the strategy clusters we identified correlate with specific, real-time writing behaviors. Second, our findings are based on a specific group of Grade 8 students responding to a single type of ELA assessment. Future research should examine different populations, developmental levels, and writing contexts to test the generalizability of these findings and to refine our understanding of how writing strategies vary across contexts. Additionally, expanding this methodology to include a wider range of linguistic and rhetorical features could reveal additional strategy dimensions relevant to writing development.

7. Conclusions

In conclusion, this study advances understanding of writing strategy use by introducing a comprehensive framework for examining the relationship between writing proficiency and the use of strategies across multiple categories. By leveraging natural language processing and unsupervised machine learning, we show that writing proficiency is not the result of mastering isolated skills but emerges from the coordinated implementation of strategies across categories. Our findings provide strong empirical support for a developmental view of writing, showing a clear distinction between a fragmented, “knowledge-telling” approach and a coherent, systemic “knowledge-transforming” process.
The results revealed that high-quality writing involves a delicate balance. Optimal performance is associated not with maximum complexity, but with an intermediate level that allows for clear and effective expression. Furthermore, the ability to integrate external sources through quoting or paraphrasing proved to be a far stronger indicator of proficiency than reliance on creating original content alone. Additionally, there are important interactions among text complexity, evidence use, and argument structure that highlight the interconnected nature of writing proficiency and the use of various writing strategies.
These findings challenge approaches that treat writing skills in isolation. Instead, our results show that examinees must coordinate several categories of writing strategies simultaneously, adapting their approach to the specific demands of each writing task. As examinees develop higher writing proficiency, they not only acquire more sophisticated strategies within individual categories but also learn to coordinate these strategies more effectively across categories. By illuminating these complex relationships between writing strategies and writing proficiency, this research offers valuable guidance for designing instruction, curricula, assessments, and targeted interventions that recognize and foster the interconnected nature of writing proficiency. As computational methods continue to advance, they offer powerful tools to help us better understand and cultivate the sophisticated cognitive skills that define proficient writing.

Author Contributions

Conceptualization, C.T.; methodology, C.T. and G.E.; software, C.T.; validation, C.T. and J.X.; formal analysis, C.T.; investigation, C.T.; resources, J.X.; data curation, J.X.; writing—original draft preparation, C.T.; writing—review and editing, J.X. and G.E.; visualization, C.T.; supervision, G.E.; project administration, J.X.; funding acquisition, C.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was determined to be Not Human Research by our Institutional Review Board (IRB) because the project activities were limited to the analysis of de-identified data. All data used in this research consisted solely of de-identified student scores and responses, with no personally identifiable information and no direct interaction with human subjects. Under institutional and ethical guidelines, studies that involve only de-identified data and do not include human subjects, human material, or human tissue are not classified as human subjects research and therefore do not require IRB approval. IRB documentation for the assesslet data used in this study is available, and a previous publication in Education Sciences used the same data and IRB documentation.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The empirical data are protected under state law; a description of the assessment can be found at https://coe.uga.edu/directory/k-12-assessment-solutions/ (accessed on 6 April 2025), at the unit formerly named the “Georgia Center for Assessment”.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BERT: Bidirectional Encoder Representations from Transformers
1PL: One-Parameter Logistic Measurement Model
PCM: Partial Credit Model
TF-IDF: Term Frequency–Inverse Document Frequency
TTR: Type–Token Ratio
PCA: Principal Component Analysis
NLTK: Natural Language Toolkit library
ELA: English Language Arts

References

  1. Alharbi, W. (2023). AI in the foreign language classroom: A pedagogical overview of automated writing assistance tools. Education Research International, 2023, 4253331. [Google Scholar] [CrossRef]
  2. Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The long-document transformer. arXiv, arXiv:2004.05150. [Google Scholar] [CrossRef]
  3. Ben-Porat, O., Hirsch, S., Kuchy, L., Elad, G., Reichart, R., & Tennenholtz, M. (2020). Predicting strategic behavior from free text. Journal of Artificial Intelligence Research, 68, 413–445. [Google Scholar] [CrossRef]
  4. Bereiter, C., & Scardamalia, M. (2013). The psychology of written composition. Routledge. Available online: https://www.taylorfrancis.com/books/mono/10.4324/9780203812310/psychology-written-composition-carl-bereiter-marlene-scardamalia (accessed on 6 April 2025).
  5. Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48, 1–29. [Google Scholar] [CrossRef]
  6. Charters, E. (2003). The use of think-aloud methods in qualitative research an introduction to think-aloud methods. Brock Education Journal, 12(2), 68–82. [Google Scholar] [CrossRef]
  7. Chen, C., Kim, S., Bui, H., Rossi, R., Koh, E., Kveton, B., & Bunescu, R. (2018, October 22–26). Predictive analysis by leveraging temporal user behavior and user embeddings. 27th ACM International Conference on Information and Knowledge Management (pp. 2175–2182), Torino, Italy. [Google Scholar] [CrossRef]
  8. Chowdhary, K. R. (2020). Natural language processing. In K. R. Chowdhary (Ed.), Fundamentals of artificial intelligence (pp. 603–649). Springer India. [Google Scholar] [CrossRef]
  9. Chuang, P.-L., & Yan, X. (2022). An investigation of the relationship between argument structure and essay quality in assessed writing. Journal of Second Language Writing, 56, 100892. [Google Scholar] [CrossRef]
  10. Crammond, J. G. (1998). The uses and complexity of argument structures in expert and student persuasive writing. Written Communication, 15(2), 230–268. [Google Scholar] [CrossRef]
  11. Cumming, A., Lai, C., & Cho, H. (2016). Students’ writing from sources for academic purposes: A synthesis of recent research. Journal of English for Academic Purposes, 23, 47–58. [Google Scholar] [CrossRef]
  12. De Ayala, R. J. (1995). An investigation of the standard errors of expected a posteriori ability estimates. Available online: https://eric.ed.gov/?id=ED392840 (accessed on 6 April 2025).
  13. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 4171–4186). Association for Computational Linguistics. Available online: https://aclanthology.org/N19-1423/?utm_campaign=The%20Batch&utm_source=hs_email&utm_medium=email&_hsenc=p2ANqtz-_m9bbH_7ECE1h3lZ3D61TYg52rKpifVNjL4fvJ85uqggrXsWDBTB7YooFLJeNXHWqhvOyC (accessed on 6 April 2025).
  14. Driscoll, D. L., & Brizee, A. (2013). Quoting, paraphrasing, and summarizing. The Purdue OWL. Purdue U Writing Lab, 15. [Google Scholar]
  15. Engelhard, G., Jr. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. Routledge. Available online: https://www.taylorfrancis.com/books/mono/10.4324/9780203073636/invariant-measurement-george-engelhard-jr (accessed on 6 April 2025).
  16. Ericsson, K. A. (2017). Protocol analysis. In W. Bechtel, & G. Graham (Eds.), A companion to cognitive science (1st ed., pp. 425–432). Wiley. [Google Scholar] [CrossRef]
  17. Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221. [Google Scholar] [CrossRef]
  18. Graesser, A. C., McNamara, D. S., Louwerse, M. M., & Cai, Z. (2004). Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 36(2), 193–202. [Google Scholar] [CrossRef]
  19. Hardeniya, N., Perkins, J., Chopra, D., Joshi, N., & Mathur, I. (2016). Natural language processing: Python and NLTK. Packt Publishing Ltd. Available online: https://books.google.com/books?hl=en&lr=&id=0J_cDgAAQBAJ&oi=fnd&pg=PP1&dq=NLTK+library&ots=lgstr0lzWT&sig=oVZ3c2gMUEp0_T1jZi2JmVTQ-lY (accessed on 6 April 2025).
  20. Hayes, J. R., & Flower, L. S. (1986). Writing research and the writer. American Psychologist, 41(10), 1106. [Google Scholar] [CrossRef]
  21. Hua, X., & Wang, L. (2022). Efficient argument structure extraction with transfer learning and active learning. arXiv, arXiv:2204.00707. [Google Scholar] [CrossRef]
  22. Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130–144. [Google Scholar] [CrossRef]
  23. Kellogg, R. T. (2008). Training writing skills: A cognitive developmental perspective. Journal of Writing Research, 1(1), 1–26. [Google Scholar] [CrossRef]
  24. Kettunen, K. (2014). Can type-token ratio be used to show morphological complexity of languages? Journal of Quantitative Linguistics, 21(3), 223–245. [Google Scholar] [CrossRef]
  25. Kim, H., Baghestani, S., Yin, S., Karatay, Y., Kurt, S., Beck, J., & Karatay, L. (2024). ChatGPT for writing evaluation: Examining the accuracy and reliability of AI-generated scores compared to human raters. In Exploring artificial intelligence in applied linguistics (pp. 73–95). Iowa State University Digital Press. [Google Scholar]
  26. Kim, M., Tian, Y., & Crossley, S. A. (2021). Exploring the relationships among cognitive and linguistic resources, writing processes, and written products in second language writing. Journal of Second Language Writing, 53, 100824. [Google Scholar] [CrossRef]
  27. Kincaid, J. P., Fishburne, R. P., Jr., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Available online: https://stars.library.ucf.edu/istlibrary/56/ (accessed on 6 April 2025).
  28. Kisselev, O., Soyan, R., Pastushenkov, D., & Merrill, J. (2022). Measuring writing development and proficiency gains using indices of lexical and syntactic complexity: Evidence from longitudinal Russian learner corpus data. The Modern Language Journal, 106(4), 798–817. [Google Scholar] [CrossRef]
  29. Kormos, J. (2011). Task complexity and linguistic and discourse features of narrative writing performance. Journal of Second Language Writing, 20(2), 148–161. [Google Scholar] [CrossRef]
  30. Kyle, K. (2016). Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication. Available online: https://scholarworks.gsu.edu/items/8046a5d0-2d60-44c8-9ab7-207f263ce8b5 (accessed on 6 April 2025).
  31. Kyle, K., & Crossley, S. (2016). The relationship between lexical sophistication and independent and source-based writing. Journal of Second Language Writing, 34, 12–24. [Google Scholar] [CrossRef]
  32. Lahmann, C., Steinkrauss, R., & Schmid, M. S. (2019). Measuring linguistic complexity in long-term L2 speakers of English and L1 attriters of German. International Journal of Applied Linguistics, 29(2), 173–191. [Google Scholar] [CrossRef]
  33. Lawrence, J., & Reed, C. (2015, June 4). Combining argument mining techniques. 2nd Workshop on Argumentation Mining (pp. 127–136), Denver, CO, USA. Available online: https://aclanthology.org/W15-0516.pdf (accessed on 6 April 2025).
  34. Lawrence, J., & Reed, C. (2020). Argument mining: A survey. Computational Linguistics, 45(4), 765–818. [Google Scholar] [CrossRef]
  35. Lippi, M., & Torroni, P. (2015). Argument mining: A machine learning perspective. In E. Black, S. Modgil, & N. Oren (Eds.), Theory and applications of formal argumentation (Vol. 9524, pp. 163–176). Springer International Publishing. [Google Scholar] [CrossRef]
  36. Lu, C., Bu, Y., Dong, X., Wang, J., Ding, Y., Larivière, V., Sugimoto, C. R., Paul, L., & Zhang, C. (2019). Analyzing linguistic complexity and scientific impact. Journal of Informetrics, 13(3), 817–829. [Google Scholar] [CrossRef]
  37. Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474–496. [Google Scholar] [CrossRef]
  38. Lu, X. (2012). The relationship of lexical richness to the quality of ESL learners’ oral narratives. The Modern Language Journal, 96(2), 190–208. [Google Scholar] [CrossRef]
  39. Lyu, J., Chishti, M. I., & Peng, Z. (2022). Marked distinctions in syntactic complexity: A case of second language university learners’ and native speakers’ syntactic constructions. Frontiers in Psychology, 13, 1048286. [Google Scholar] [CrossRef] [PubMed]
  40. Mann, W. C., & Thompson, S. A. (1988). Rhetorical structure theory: Toward a functional theory of text organization. Text—Interdisciplinary Journal for the Study of Discourse, 8(3), 243–281. [Google Scholar] [CrossRef]
  41. Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. [Google Scholar] [CrossRef]
  42. Mayahi, A. J., & Alshatti, E. N. (2023, October 29–30). Assessing English language writing and readability skills using long short-term memory model. 2023 Computer Applications & Technological Solutions (CATS) (pp. 1–4), Mubarak Al-Abdullah, Kuwait. Available online: https://ieeexplore.ieee.org/abstract/document/10424106/?casa_token=jTPotnF7lWEAAAAA:8CJNyt7CwqPiG5LMd6LpqH7Y3KPx9EzTrM4Kb4lqdVI3lRYk8mKud4RqPr2LZYn7gQV-qZ2YxUk (accessed on 6 April 2025).
  43. Mazgutova, D., & Kormos, J. (2015). Syntactic and lexical development in an intensive English for Academic Purposes programme. Journal of Second Language Writing, 29, 3–15. [Google Scholar] [CrossRef]
  44. McCutchen, D. (2000). Knowledge, processing, and working memory: Implications for a theory of writing. Educational Psychologist, 35(1), 13–23. [Google Scholar] [CrossRef]
  45. McNamara, D. S., Crossley, S. A., & McCarthy, P. M. (2010). Linguistic features of writing quality. Written Communication, 27(1), 57–86. [Google Scholar] [CrossRef]
  46. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv, arXiv:1301.3781. [Google Scholar] [CrossRef]
  47. Nussbaum, E. M., & Schraw, G. (2007). Promoting argument-counterargument integration in students’ writing. The Journal of Experimental Education, 76(1), 59–92. [Google Scholar] [CrossRef]
  48. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., & Dubourg, V. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825–2830. [Google Scholar]
  49. Peng, Y., Sun, J., Quan, J., Wang, Y., Lv, C., & Zhang, H. (2023). Predicting Chinese EFL learners’ human-rated writing quality in argumentative writing through multidimensional computational indices of lexical complexity. Assessing Writing, 56, 100722. [Google Scholar] [CrossRef]
  50. Pennington, J., Socher, R., & Manning, C. D. (2014, October 25–29). Glove: Global vectors for word representation. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543), Doha, Qatar. Available online: https://aclanthology.org/D14-1162.pdf (accessed on 6 April 2025).
  51. Pun, J., & Li, W. K. (2024). A structural equation investigation of linguistic features as indices of writing quality in assessed secondary-level EMI learners’ scientific reports. Assessing Writing, 62, 100897. [Google Scholar] [CrossRef]
  52. Ramos, J. (2003). Using tf-idf to determine word relevance in document queries. Proceedings of the First Instructional Conference on Machine Learning, 242(1), 29–48. [Google Scholar]
  53. R Core Team. (2010). R: A language and environment for statistical computing. Available online: https://cir.nii.ac.jp/crid/1370294721063650048 (accessed on 6 April 2025).
  54. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv, arXiv:1908.10084. [Google Scholar] [CrossRef]
  55. Sagredo-Ortiz, S., & Kloss, S. (2025). Academic writing strategies in university students from three disciplinary areas: Design and validation of an instrument. Frontiers in Education, 10, 1600497. [Google Scholar] [CrossRef]
  56. Seifert, K., & Sutton, R. (2009). Educational psychology. Available online: http://117.250.119.200:8080/jspui/bitstream/123456789/112/1/Kelvin%20Seifert_Educational%20Psychology.pdf (accessed on 6 April 2025).
  57. Shahapure, K. R., & Nicholas, C. (2020, October 6–9). Cluster quality analysis using silhouette score. 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) (pp. 747–748), Sydney, Australia. Available online: https://ieeexplore.ieee.org/abstract/document/9260048/?casa_token=7vv0LNndPKQAAAAA:ZOS9it2int31pwL_8KBy16vsxHLL8nh0VxhJOBEao-7wQ_hLjs8aBtBpnCiy4EgZWtXAJSuQdPk (accessed on 6 April 2025).
  58. Shi, C., Wei, B., Wei, S., Wang, W., Liu, H., & Liu, J. (2021). A quantitative discriminant method of elbow point for the optimal number of clusters in clustering algorithm. EURASIP Journal on Wireless Communications and Networking, 2021(1), 31. [Google Scholar] [CrossRef]
  59. Sinaga, K. P., & Yang, M.-S. (2020). Unsupervised K-means clustering algorithm. IEEE Access, 8, 80716–80727. [Google Scholar] [CrossRef]
  60. Talebinamvar, M., & Zarrabi, F. (2022). Clustering students’ writing behaviors using keystroke logging: A learning analytic approach in EFL writing. Language Testing in Asia, 12(1), 6. [Google Scholar] [CrossRef]
  61. Thorndike, R. L. (1953). Who belongs in the family? Psychometrika, 18(4), 267–276. [Google Scholar] [CrossRef]
  62. Toth, Z. (2025). The measurement of text quality: Current methods and open challenges. Open Research Europe, 5(98), 98. [Google Scholar] [CrossRef]
  63. Toulmin, S. E. (2003). The uses of argument. Cambridge University Press. Available online: https://books.google.com/books?hl=en&lr=&id=8UYgegaB1S0C&oi=fnd&pg=PR7&dq=Toulmin,+S.+E.+(2003).+The+uses+of+argument+(Updated+edition).+Cambridge+University+Press.&ots=Xg20shCPuU&sig=RJqKcvgV-puC5OeiWPYBmw7J6eg (accessed on 6 April 2025).
  64. Van Rossum, G., & Drake, F. L. (1995). Python reference manual (Vol. 111). Centrum voor Wiskunde en Informatica Amsterdam. Available online: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/gwydion-1/OldFiles/OldFiles/python/Doc/ref.ps (accessed on 6 April 2025).
  65. Vasiliev, Y. (2020). Natural language processing with Python and spaCy: A practical introduction. No Starch Press. Available online: https://books.google.com/books?hl=en&lr=&id=Au-_DwAAQBAJ&oi=fnd&pg=PR15&dq=spacy+library&ots=0oalSVonWQ&sig=lxuMaNqfaCqc93ac-XXfcdRfw2E (accessed on 6 April 2025).
  66. Verspoor, M., Lowie, W., Chan, H. P., & Vahtrick, L. (2017). Linguistic complexity in second language development: Variability and variation at advanced stages. Recherches En Didactique Des Langues et Des Cultures. Les Cahiers de l’Acedle, 14(14-1). Available online: https://journals.openedition.org/rdlc/1450 (accessed on 6 April 2025).
  67. Weaver, D. H., Hopkins, W. W., Billings, W. H., & Cole, R. R. (1974). Quotes vs. paraphrases in writing: Does it make a difference to readers? Journalism Quarterly, 51(3), 400–404. [Google Scholar] [CrossRef]
  68. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., & Funtowicz, M. (2020, November 16–20). Transformers: State-of-the-art natural language processing. 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 38–45), Online. Available online: https://aclanthology.org/2020.emnlp-demos.6/ (accessed on 6 April 2025).
  69. Xiong, J., Engelhard, G., & Cohen, A. S. (2024). Analysis of mixed-format assessments using measurement models and topic modeling. Measurement: Interdisciplinary Research and Perspectives, 23, 101–115. [Google Scholar] [CrossRef]
  70. Yasuda, S. (2024). Does “more complexity” equal “better writing”? Investigating the relationship between form-based complexity and meaning-based complexity in high school EFL learners’ argumentative writing. Assessing Writing, 61, 100867. [Google Scholar] [CrossRef]
  71. Yoon, H.-J., & Polio, C. (2017). The linguistic development of students of English as a second language in two written genres. TESOL Quarterly, 51(2), 275–301. [Google Scholar] [CrossRef]
  72. Zenker, F., & Kyle, K. (2021). Investigating minimum text lengths for lexical diversity indices. Assessing Writing, 47, 100505. [Google Scholar] [CrossRef]
  73. Zhang, L. J., & Zhang, J. (2024). EFL students’ syntactic complexity development in argumentative writing: A latent class growth analysis (LCGA) approach. Assessing Writing, 61, 100877. [Google Scholar] [CrossRef]
  74. Zhou, C., Qiu, C., Liang, L., & Acuna, D. E. (2024). Paraphrase identification with deep learning: A review of datasets and methods. arXiv, arXiv:2212.06933. [Google Scholar] [CrossRef]
Figure 1. The flowchart of the conceptual framework.
Figure 2. The silhouette scores of text complexity clusters.
Figure 3. The distortion scores of text complexity clusters.
Figure 4. The PCA projection of text complexity clusters.
Figure 5. The feature boxplot of text complexity clusters.
Figure 6. The categories of evidence use clusters.
Figure 7. The distortion scores of argument structure clusters.
Figure 8. The silhouette scores of argument structure clusters.
Figure 9. The PCA projection of argument structure clusters.
Figure 10. The feature distribution of argument structure clusters.
Figure 11. The distribution of writing proficiency estimates across evidence use strategies.
Figure 12. The distribution of writing proficiency estimates across text complexity strategies.
Figure 13. The distribution of writing proficiency estimates across argument structure strategies.
Figure 14. The mean proficiency with standard errors across evidence use and text complexity.
Figure 15. The mean proficiency with standard errors across evidence use and argument structure.
Figure 16. The mean proficiency with standard errors across argument structure and text complexity.
Table 1. Descriptive statistics for writing strategies.

Writing Strategy               Number 1   Mean 2     SD 2
Text Complexity Type
  Intermediate Composition     215        0.498      0.688
  Basic Composition            130        −0.364     0.687
  Elaborate Composition        61         −0.414     0.617
Evidence Use Type
  Paraphrase                   56         0.491      0.682
  Direct Quote                 116        0.468      0.856
  Original Content             234        −0.202     0.680
Argument Structure Type
  Linear Progression           153        0.127      0.828
  Contrast-Based               129        0.106      0.729

1 Number refers to the number of essays. 2 Writing proficiency was estimated using a 1PL model. The unit is logits, with higher values indicating higher writing proficiency.
Table 2. Mean writing proficiency by writing strategy combinations.

Index  Text Complexity            Evidence Use       Argument Structure    Number 1   Mean Proficiency
1      Intermediate Composition   Direct Quote       Linear Progression    43         0.685
2      Basic Composition          Original Content   Linear Progression    37         −0.673
3      Intermediate Composition   Direct Quote       Contrast-based        32         0.621
4      Intermediate Composition   Original Content   Contrast-based        31         0.204
5      Intermediate Composition   Original Content   Discrete Arguments    29         0.143
6      Basic Composition          Original Content   Contrast-based        27         −0.318
7      Intermediate Composition   Original Content   Linear Progression    27         0.220
8      Basic Composition          Original Content   Discrete Arguments    26         −0.220
9      Elaborate Composition      Original Content   Discrete Arguments    24         −0.650
10     Elaborate Composition      Original Content   Linear Progression    21         −0.164
11     Intermediate Composition   Direct Quote       Discrete Arguments    18         0.828
12     Intermediate Composition   Paraphrase         Linear Progression    13         0.892
13     Elaborate Composition      Original Content   Contrast-based        12         −0.456
14     Intermediate Composition   Paraphrase         Contrast-based        11         0.487
15     Intermediate Composition   Paraphrase         Discrete Arguments    11         0.867
16     Basic Composition          Direct Quote       Contrast-based        9          −0.281
17     Basic Composition          Direct Quote       Discrete Arguments    7          −0.880
18     Basic Composition          Paraphrase         Discrete Arguments    7          −0.134
19     Basic Composition          Paraphrase         Contrast-based        6          −0.062
20     Basic Composition          Paraphrase         Linear Progression    6          0.416
21     Basic Composition          Direct Quote       Linear Progression    5          −0.117
22     Elaborate Composition      Direct Quote       Contrast-based        1          −0.835
23     Elaborate Composition      Direct Quote       Discrete Arguments    1          0.220
24     Elaborate Composition      Paraphrase         Discrete Arguments    1          1.031
25     Elaborate Composition      Paraphrase         Linear Progression    1          −1.175

1 Number refers to the number of essays.
Table 3. Predicted probability of writing strategy usage by writing proficiency. The table comprises three figure panels: (A) Text Complexity, (B) Evidence Use, and (C) Argument Type.