An NLP-Based Exploration of Variance in Student Writing and Syntax: Implications for Automated Writing Evaluation

In writing assessment, expert human evaluators ideally judge individual essays with attention to variance among writers’ syntactic patterns. There are many ways to compose text successfully or less successfully. For automated writing evaluation (AWE) systems to provide accurate assessment and relevant feedback, they must be able to consider similar kinds of variance. The current study employed natural language processing (NLP) to explore variance in syntactic complexity and sophistication across clusters characterized in a large corpus (n = 36,207) of middle school and high school argumentative essays. Using NLP tools, k-means clustering, and discriminant function analysis (DFA), we observed that student writers employed four distinct syntactic patterns: (1) familiar and descriptive language, (2) consistently simple noun phrases, (3) variably complex noun phrases, and (4) moderate complexity with less familiar language. Importantly, each pattern spanned the full range of writing quality; there were no syntactic patterns consistently evaluated as “good” or “bad”. These findings support the need for nuanced approaches in automated writing assessment while informing ways that AWE can participate in that process. Future AWE research can and should explore similar variability across other detectable elements of writing (e


Introduction
Writing and written expression are almost infinitely variable. There are numerous techniques for communicating our ideas [1,2], and authors may demonstrate flexibility in meeting their discursive goals [3]. Importantly, variance in writing is not merely the product of rhetorical decision making, but also emerges from the conscious and unconscious knowledge, styles, preferences, and cultures of the authors [4–6]. Such variations complicate writing assessment because there are many ways to "succeed" [7]. Training and well-defined rubrics offer structure that draws expert human evaluators' attention to key features and variations [8,9], although evaluators may also possess implicit biases that color their perceptions of student writing [10,11]. These demands exacerbate the already substantial workload of writing assessment. Educators understand that offering frequent writing assignments, deliberate practice, and formative feedback is crucial for writing and intellectual development [12], but enacting these goals stresses constrained instructor resources.
We contend that a valuable and necessary opportunity to improve AWE technologies is to further explore the issue of variance. Automation relies upon predetermined (i.e., algorithmic) evaluative processes and metrics, which are driven by similarly predetermined expectations about "good" versus "poor" writing. AWE systems can only "reward" and provide feedback on aspects of writing that they have been designed to detect and recognize as worthy. Valid critiques of AWE have thus noted that AWE tools may promote constrained writing norms, contexts, and processes [30–33]. Compared to human evaluators, automated systems have limited access to contextual information about students as whole persons. In classrooms, teachers might possess a deeper understanding of their students' diverse assets and needs, which they could flexibly consider when teaching or assessing writing [14,34]. AWE technologies may provide less appropriate or personalized assessment and feedback because they lack human empathy, inferencing, and interpersonal knowledge [35–37]. AWE systems must be (re)designed to examine and account for variance in student writing.
In this paper, we attend to the variability of syntactic sophistication and complexity within student writing to (a) affirm the reality of variance and (b) demonstrate an approach for addressing this variance in AWE using natural language processing (NLP). We acknowledge that syntax is only one component of writing [38]. However, a focused inspection of one component is useful for encouraging others to explore similar and more expansive lines of work. In the following sections, we further discuss the importance of acknowledging variance in student writing and AWE. We then explore clusters of syntactic variation within a large corpus of student essays using NLP indices. Finally, we consider the implications of this approach and findings for AWE.

Recognizing Writing Variance in Automated Writing Evaluation
The development of AWE algorithms employs an ever-expanding toolbox of methods spanning simple correlations, linear regressions, machine learning, and neural networks [19]. Regardless of the specific methodology, the process typically begins with "training data" texts that have been assessed by human raters. Such ratings may be holistic (e.g., overall quality), address specific subscales (e.g., organization and content, register, and genre), and include annotated features (e.g., rhetorical moves). Next, NLP tools extract linguistic properties of the texts, ranging from descriptive features (e.g., number of words and average sentence length) to more fine-grained calculations (e.g., average number of adjectives per noun phrase). Finally, statistical methods (e.g., regression and machine learning) are implemented to map human-assigned ratings to sets of NLP metrics. These predictive relationships form the basis for AWE algorithms; patterns of NLP-derived features are interpreted as reliable and valid indicators of writing characteristics.
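As a minimal illustration of this three-step process (human-rated training texts, feature extraction, statistical mapping), the following Python sketch fits a one-predictor regression from average sentence length to holistic score. The three essays, their scores, and the two toy features are hypothetical and chosen only for illustration; operational AWE systems rely on far richer NLP indices and models.

```python
import re
from statistics import mean

def extract_features(text: str) -> list[float]:
    """Two toy descriptive features: word count and mean sentence length in words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return [float(len(words)), len(words) / max(len(sentences), 1)]

def fit_simple_regression(x: list[float], y: list[float]) -> tuple[float, float]:
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical training data: (essay text, human-assigned holistic score).
training = [
    ("Cars are good. I like cars.", 2.0),
    ("Driverless cars may reduce accidents because software does not get tired.", 4.0),
    ("Some people fear new technology. However, careful testing and clear laws "
     "can make driverless cars safe for everyone.", 5.0),
]

# Step 2: extract a feature (here, mean sentence length) for each essay.
sentence_lengths = [extract_features(text)[1] for text, _ in training]
scores = [score for _, score in training]

# Step 3: map the feature to the human ratings.
intercept, slope = fit_simple_regression(sentence_lengths, scores)

def predict(text: str) -> float:
    """Predicted holistic score for a new essay under the toy model."""
    return intercept + slope * extract_features(text)[1]

print(round(predict("This is a new essay. It has two short sentences."), 2))
```

A model of this kind can only respond to the features it was given, which is the limitation taken up in the following paragraphs.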
Although these methods have generated numerous accurate algorithms, they might neglect variance in several ways. First, algorithms can only explicitly attend to features when (a) detectors are present and (b) those indices are included in assessment models. For instance, to examine vocabulary, NLP tools might rely on measures of average word concreteness, age-of-acquisition, specificity, familiarity, and more [39]. However, other properties may be inaccessible (e.g., personal emotional associations) and thus unusable. Likewise, metrics might be excluded from algorithms if initial analyses reveal "no statistically significant relationship" to human-assigned ratings. When metrics are missing or excluded, resulting algorithms cannot be readily sensitive to variance associated with those features.
Another neglect of variance may occur when algorithms do not account for nested or contextualized patterns. For example, essays naturally vary in length, but the meaning of length may depend on the task, environment, or writer. When prompts ask for "a brief explanation", then short essays are perfectly reasonable; prompts that request "detailed exploration" might warrant a longer essay. Similarly, students may write more when given ample time but write less under artificial time constraints. Students' prior knowledge, motivation, life experiences, and strategies also influence how much they write about a topic regardless of their actual ability to produce text [4]. Finally, optimal text length may vary based on other features like vocabulary, syntax, and cohesion. Skilled writers may use precise word choices to convey ideas concisely. By contrast, knowledgeable writers may include ample details and elaboration that make an essay longer [40]. Thus, the interpretation of any given feature as an indicator of "quality" may be nested within the variance of other features.
In the current paper, we focus attention on a third aspect of variance: student writers can enact their skills in different ways [7,40–42]. In AWE, typical assumptions conceptualize the variance of holistic quality or multiple dimensions on linear continua from "poor" (or "low" or "weak") to "good" (or "high" or "strong"). Thus, student essays might be rated as having "poor logical flow" versus demonstrating "a clear flow of ideas and arguments", or may be described as showing "unsophisticated word choices" versus "skilled command of vocabulary". However, as noted above, writers might achieve success by leveraging very different kinds or combinations of rhetorical or vocabulary strategies. Which essays are "better"? Which writers are "more skilled"?
Crossley and colleagues [7] similarly employed cluster analysis and discriminant function analysis (DFA) to identify distinct profiles of "successful" student writing using diverse NLP indices. Specifically, they first constructed a corpus of 148 "successful" essays (i.e., human-assigned scores of 4.5 or better on a linear 6-point scale). Next, nearly 200 NLP indices were extracted via Coh-Metrix and related tools [47,48] spanning lexical, syntactic, cohesive, structural, semantic, and rhetorical features. Hierarchical cluster analyses were conducted to reveal distinct groupings and DFA was used to characterize those groups based on NLP measures. Within this small sample, the researchers observed four patterns of successful writing: (1) action and depiction, (2) academic, (3) accessible, and (4) lexical. Essays in the sample were able to achieve high scores via more descriptive language, academic language, accessible and cohesive language, or more skillful vocabulary usage, respectively. Such findings, derived from ostensibly linear human ratings, argue against straightforward or linear mapping between writing features, styles, and quality.
An important implication of [7] and similar work [30,41] is that AWE can feasibly address greater variance in student writing. Crossley and colleagues utilized NLP metrics to characterize essays, relying on the same indices and tools that underlie several AWE systems (e.g., Writing Pal; [49–53]). Additional human corpus judgments or annotations were not required. However, one limitation was that this work focused on a small sample of only high-scoring essays; their analyses characterized only a few successful writers. There is value in extending that work by considering a larger pool of student authors and a wider range of quality. Scores are also only one window into the variance of student writing. We argue that it is valuable to first examine variance in how students write before constraining such patterns within specific "quality" expectations.

A Focus on Syntax
The current exploration focuses on syntax, which refers to how words (and word units) are combined, structured, and sequenced to produce larger units of meaning (e.g., clauses), and eventually entire sentences [54–56]. Syntax is often linked to grammar, which reflects the rules by which linguistic units are "allowed" to be combined or transformed (e.g., verb conjugation). Notably, the current study is not concerned with grammatical errors or "typos", but rather the overall sophistication and complexity with which students construct their sentences. We acknowledge that syntax is only one component of writing, which also comprises lexical, semantic, rhetorical, pragmatic, and other dimensions. Conceptually, however, syntax operates at a level of language (see [57–59]) that connects lexical and discursive features, thus making it a meaningful and feasible target for this work.
Syntax is one of the central components of language in general and writing in particular. Syntactic behaviors and patterns have been shown to be related to writing quality as measured by academic evaluation and scoring (e.g., [3,60,61]). A focus on syntax is also motivated by prior research demonstrating the capacity for assessing syntax via NLP. In the current study, we employ the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC) developed and validated by [46,62,63]. In that work, [64] (p. 8) defined syntactic complexity and sophistication as follows: "Syntactic complexity refers to the formal characteristics of syntax (e.g., the amount of subordination) [...]. In contrast, syntactic sophistication refers to the relative difficulty of learning particular syntactic structures [...], which (from a usage-based perspective) is related to input frequency and contingency. The term sophistication [...] refers to less frequent words as more sophisticated because they tend to be produced by more proficient writers."
More generally, syntax is a central component in efforts to automate writing evaluation through NLP (e.g., [65]). For example, Jagaiah and colleagues [54] examined 36 studies on syntactic complexity measures and found variance in syntactic complexity measures across genres, but also variance across individuals within those groupings. Similarly, Kyle and Crossley [62] observed that incorporating usage-based measures (e.g., frequency of verb argument constructions) helped explain variance in L2 authors' writing quality scores. The authors proposed that these measures should be incorporated into the automated assessment of syntactic complexity, as part of the automation of writing evaluation as a whole. This research was expanded upon in [64], which focuses on the developmental trajectories of L2 writers from the same usage-based perspective, through indices of verb argument construction sophistication. Findings show a trajectory of improvement in writing (as measured by scores) over the course of two years, which is correlated with changes in syntactic complexity and verb argument construction sophistication measures (e.g., number of dependent clauses per clause and main verb frequency). These findings illustrate the importance of syntactic complexity and sophistication to writing outcomes.
Prior studies with TAASSC have observed meaningful relationships between syntactic complexity and L1/L2 writing quality [54,61], lexical diversity [62], and writing development [64]. In general, syntax is a central component in efforts to automate academic writing evaluation through NLP [65]. NLP algorithms map various writing properties of existing essay corpora (e.g., structures, word frequency, meaning, and relationship to prompt) to human evaluations. Through this process of inference, algorithms produce mappings from writing patterns to the evaluative behaviors of human raters. For example, [66] used a combination of syntactic, semantic, and sentiment-related features of essay writing to estimate essay quality. Syntactic features (e.g., unique parts-of-speech used, sentence length, and words ending with "-ing") helped the reported model achieve significant agreement with human raters (quadratic weighted kappa (QWK) = 0.793).

Research Questions
The current study is driven by three research questions embedded within overarching considerations for AWE development and implementation. To answer these questions, we took direct inspiration from [7] to explore potential patterns using cluster analysis that are then characterized using DFA.

1. What variance do student writers display with regard to syntactic sophistication and complexity? AWE algorithms typically capture syntactic variance by a single dimension that varies linearly from "lower" to "higher" sophistication and complexity.
The initial corpus comprised 39,511 argument essays collected by a state-level education agency in the United States for standardized testing [67]. Argument essays are a common format for assessing students' writing and rhetorical skills wherein students craft a persuasive response to a prompt (e.g., [63,68]). All essays were composed in response to one of five topics (i.e., driverless cars, exploring Venus, facial action coding, the face on Mars, or seagoing cowboys). Essays were assigned holistic scores by trained human raters on a scale of "1" (lowest quality) to "6" (highest). The writers were 6th-, 8th-, and 10th-grade students. Limited demographic data included "gender" (reported as binary "female" or "male"), "race/ethnicity" (reported as American Indian/Alaska Native, Asian/Pacific Islander, Black/African American, Hispanic/Latino, White, Two or More Race/Other), "economic disadvantage" (reported as binary "no" or "yes"), "disability status" (reported as binary "no" or "yes"), and "English language learner" (reported as binary "no" or "yes").

Syntactic NLP Features
Linguistic features related to syntax were extracted using the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC; Version 1.3.8) [62,63]. TAASSC included 355 indices and component scores pertaining to clause complexity, noun phrase complexity, and syntactic sophistication.
First, clauses are sentence components that comprise a subject and predicate but may not constitute a complete sentence on their own. Specifically, independent clauses may stand alone as complete sentences (e.g., "The red carpet added color to the room") that vary in complexity based on additional nouns, adjectives, and other details (i.e., complements). Dependent clauses modify other components within a sentence; they depend on the presence of an independent clause (e.g., "The interior decorator decided that the red carpet added color to the room"). Dependent clauses add complexity. An adjective complement modifies or adds information to an adjective within the clause. Similarly, a nominal complement modifies or adds information to a noun within the clause. Both kinds of complements can increase specificity and clarity, but too many can contribute to sentence processing difficulties. Examples (1) and (2) below represent adjective complements (in brackets, with the adjective) that increase clarity or make the sentence harder to parse, respectively:

1. Anna is [delighted with her new job].
2. Anna is [delighted that the people who interviewed her last week have made an offer and the salary is what she had hoped for].

Examples (3) and (4) below demonstrate nominal complements that make a sentence more specific or harder to parse, respectively:

3. Ryan is a [teacher of Portuguese].
4. Ryan is a [teacher who really likes doing fun activities and creating fun lesson plans for his students every semester].
Second, noun phrases or "nominals" are linguistic units wherein a focal noun (e.g., "carpet") is described or modified by other words (e.g., "red" or "on the floor"), but the entire phrase serves the same grammatical role as the noun (e.g., "the red carpet on the floor"). The simplest noun phrases may comprise only the noun; more complex noun phrases may also incorporate objects, adjectives, adverbs, dependents, prepositional relations, and other details that add information, nuance, and context.
Finally, syntactic sophistication may also be developed based on the use of less common vocabulary, phrases, and sentence constructions. Uncommon words (e.g., "vermilion" or "carmine") are harder to understand and parse than familiar words (e.g., "red") and thus add complexity. The same is true for phrases and sentence constructions. In TAASSC, the typicality of word roots (i.e., lemmas) and sentence constructions is assessed based on their frequency ratings in the Corpus of Contemporary American English (COCA, [69]). Higher ratings indicate that a lemma or construction occurs more frequently in the English language.
TAASSC has been productively implemented in numerous studies (e.g., [54,61,70,71]). For instance, [70] studied how neural networks and NLP tools (e.g., TAASSC) can reveal the contribution of linguistic features to rubric scores, and explored which features are important in effective rubric scoring models. The researchers found that it was possible to train a model to produce a transparent grading rubric where the most predictive NLP properties were similar to human judgments. In a systematic analysis of measures of syntactic complexity, writing ability, and writing quality, Jagaiah and colleagues [54] observed a lack of straightforward connections between these constructs, in part due to a lack of research using the same metrics and the variance associated with these measures. Outside of the field of writing, Clarke and colleagues [72] explored the potential for syntax (and other metrics) to provide early indicators of Alzheimer's disease. In sum, a growing body of literature has documented that TAASSC offers a reliable, valid, and meaningful tool for exploring the syntactic features and impact of text.
Notably, the more than 350 indices available through TAASSC include numerous redundant or highly correlated metrics; many metrics capture the same information in different ways. To reduce the number of indices used in the current analysis, we (a) reviewed the literature to identify metrics that demonstrated meaningful effects in prior studies and (b) examined metrics for multicollinearity (i.e., Pearson's r > 0.70). This theory-driven and data-driven process identified 18 concrete indices to be used in the current study. Table 1 summarizes these metrics. For measures of clause complexity and noun phrase complexity, higher values indicate a more complex structure (i.e., higher average frequency or larger variation). Measures of syntactic sophistication captured the use of common words and constructions. Thus, higher values on these metrics indicate simpler syntax and more familiar language.
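The multicollinearity screen in step (b) can be sketched as a greedy filter: an index is retained only if its correlation with every previously retained index stays at or below the threshold. The Python sketch below is illustrative only. The index names and values are hypothetical, the use of the absolute value of r is an assumption, and the order in which indices are considered (which determines which member of a correlated pair survives) stands in for the literature-informed choices described above.

```python
from math import sqrt

def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equally long lists of values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def drop_collinear(columns: dict[str, list[float]], threshold: float = 0.70) -> list[str]:
    """Keep an index only if |r| with every already-kept index is <= threshold."""
    kept: list[str] = []
    for name, values in columns.items():
        if all(abs(pearson_r(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical index values for five essays (names are placeholders, not TAASSC indices).
indices = {
    "index_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "index_b": [2.0, 4.1, 5.9, 8.2, 10.0],  # nearly a rescaled copy of index_a
    "index_c": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to index_a
}

kept = drop_collinear(indices)
print(kept)
```

In this toy data, index_b is dropped because it is almost perfectly correlated with index_a, whereas index_c is retained.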

Corpus Filtering and Analysis
Several steps were implemented to "clean" the corpus for analysis. First, essays that lacked accompanying demographic author data were excluded (n = 1491). Second, essays that generated two or more "0" scores on NLP indices were excluded (n = 1119). Inspection of these essays revealed textual details or errors (e.g., use of nonstandard notation or punctuation) that caused errors in the NLP tools and prevented analysis. Finally, a qualitative review of the data observed that many essays assigned a score of "1" by human raters were not valid attempts at authoring an essay (e.g., they comprised a single repeated word or highly off-topic commentary). To avoid skewed measurements and analyses, we excluded essays with a score of "1" (n = 694). The final analysis corpus comprised 36,207 essays. Summary details for the analysis corpus are provided in Tables 2 and 3.
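The corpus counts can be checked with simple arithmetic. The short Python sketch below assumes that the three exclusions were applied in sequence, so that no essay is counted in more than one exclusion group; the reported totals are consistent with that reading.

```python
# Counts as reported in the text.
initial_essays = 39_511
exclusions = {
    "missing demographic data": 1_491,
    "two or more zero-valued NLP indices": 1_119,
    "holistic score of 1": 694,
}

# Assumption: exclusion groups do not overlap (filters applied one after another).
final_essays = initial_essays - sum(exclusions.values())
print(final_essays)  # 36207
```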

Analysis
To address the primary research questions, two analytical methods were implemented. K-means clustering was used to identify potential syntactic patterns of student writing based on syntactic sophistication and complexity. Discriminant function analysis (DFA) was used to characterize the resulting clusters based on patterns of predictive variables. Other approaches (e.g., multidimensional analysis (MDA) [1,30,41,44,46]) are similarly informative for capturing variance in writing. MDA has been specifically applied to studying variance in register and genre, school writing [73], certain AWE settings [30], and writing evaluation [44]. More complex methodologies such as hierarchical clustering (e.g., [74,75]) and random forest analysis (e.g., [76,77]) can also shed light on and add nuance to analyses of variance in student writing behaviors. For the current work, we selected k-means clustering and DFA due to their relative simplicity, accessibility, and speed. These methods enable exploration of clear patterns that can then drive more precise and detailed analyses. In addition, in taking inspiration from [7], we mirrored their methodology to facilitate comparison and connections to AWE. Other clustering methods, such as hierarchical clustering [78], are equally valid for grouping items by relative similarity.

K-Means Clustering
K-means clustering is an algorithm that classifies data into a certain number of groupings based on variance among the input variables (e.g., [79,80]). Specifically, algorithms identify clusters of cases that are most similar to each other (i.e., low within-cluster variance) while distinct from other clusters (i.e., high between-cluster variance) across input variables. Importantly, the number of clusters generated in the analysis is prespecified (i.e., k = number of clusters). K-means clustering is commonly used to identify categories in language research [78,81].
To identify the optimal number of clusters, the outputs for each k from 2 to 20 (in this case) are plotted and compared. The scree plot illustrating the sum of squared errors (SSE) for each value of k can be inspected to identify the "elbow" in the curve, which is the inflection point beyond which additional clusters explain minimal additional variance (i.e., increasing k results in only minor shifts in SSE). Similar cluster number selection processes are attested in other work on linguistic data [81].
The analysis was conducted using the kmeans function in the stats package in R [81]. Input data included the 18 TAASSC syntax metrics identified in Table 1. Thus, this analysis reveals clusters of student writers characterized by different "patterns" or "profiles" of syntactic sophistication and complexity.
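The elbow procedure can be illustrated with a small, self-contained Python sketch. This is not the analysis code used in the study (which relied on R); it is a plain implementation of Lloyd's k-means algorithm on synthetic two-dimensional data with three well-separated groups. The deterministic farthest-point initialization is a simplifying assumption made for reproducibility, and all data and function names are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids

def kmeans_sse(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the final sum of squared errors (SSE)."""
    centroids = farthest_point_init(points, k)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Synthetic data: three groups of 30 points around hypothetical centers.
rng = random.Random(1)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(30)]

sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 7)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 1))
```

In this toy setting the SSE drops sharply up to k = 3 and changes little afterward, which is the elbow pattern used to select the number of clusters.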

Discriminant Function Analysis
Discriminant function analysis (DFA) is a statistical process that classifies cases into distinct categories based on patterns of input variables (e.g., [82,83]). The categories are prespecified; multivariate analyses reveal the input variables that most discriminate between these groups. DFA produces a number of outputs that enable characterization of the predicted clusters, including (a) descriptive statistics for target cluster and input variables, (b) the functions (similar to linear regression equations) that determine cluster membership, (c) eigenvalues and tests of statistical significance for each function, (d) a structure matrix that reports the loadings of input variables on each function, and (e) group centroids (i.e., mean values computed from each function for each cluster).
The DFA was conducted using IBM SPSS 29.0. The target categories were the clusters identified by the k-means clustering algorithm (see Results). Input data included the 18 TAASSC syntax metrics identified in Table 1. This analysis describes the syntactic variables that best define or describe observed syntactic clusters, if any.

Analysis of Variance (ANOVA) and Linear Regression
To examine associations between observed clusters and writing quality, ANOVAs were conducted to test whether clusters differed in human-assigned scores. Subsequently, linear regression analyses were conducted to reveal the variables that most predicted variance in scores (a) across the corpus and (b) within each cluster.
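For readers less familiar with the ANOVA step, the following Python sketch computes a one-way F statistic and the η² effect size from grouped scores. The three small groups are hypothetical stand-ins for cluster-wise holistic scores; the study itself used standard statistical software rather than this code.

```python
def one_way_anova(groups: list[list[float]]) -> tuple[float, int, int, float]:
    """Return (F, df_between, df_within, eta_squared) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared

# Hypothetical holistic scores for three clusters.
scores_by_cluster = [[2.0, 3.0, 4.0], [3.0, 4.0, 5.0], [4.0, 5.0, 6.0]]
f_stat, df_b, df_w, eta_sq = one_way_anova(scores_by_cluster)
print(f_stat, df_b, df_w, eta_sq)  # 3.0 2 6 0.5
```

Here η² is the share of total score variance attributable to group membership, which is why a statistically significant F can coexist with a very small η² in a large corpus.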

Variance in Syntactic Sophistication and Complexity among Student Writers
A four-cluster solution was optimal and parsimonious; five or more clusters offered minimal further impact on observed SSE. Table 4 reports the number of essays per cluster, along with the means and standard deviations for syntactic variables (see [78,81] for similar k-means analyses). All indices demonstrated statistically significant differences across clusters. Effect sizes were generally large, and several metrics were especially notable: average number of dependents per nominal (η² = 0.56), average lemma frequency (η² = 0.38), average lemma constructions combinations frequency (η² = 0.35), average number of nominal complements per clause (η² = 0.34), average number of dependents per nominal (η² = 0.33), and average number of prepositions per nominal (η² = 0.33). Thus, aspects of clause complexity, noun phrase complexity, and sophistication all contributed to differences between clusters.
Three statistically significant discriminant functions were reported. Function 1 accounted for 68.7% of the variance in the clusters (eigenvalue = 2.37), Function 2 accounted for 27.3% of the variance (eigenvalue = 0.94), and Function 3 accounted for 4.0% of the variance (eigenvalue = 0.14). Multivariate tests were statistically significant for Functions 1 through 3, Wilks' λ = 0.13, χ²(54) = 72,787.50, p < 0.001; Functions 2 through 3, Wilks' λ = 0.45, χ²(34) = 28,765.54, p < 0.001; and Function 3, Wilks' λ = 0.88, χ²(16) = 4703.70, p < 0.001. In simpler terms, Function 1 was the primary driver of category membership; many cases could be sorted based on this function alone. Function 2 also contributed substantively to classifying cases. The contribution of Function 3 was very small yet statistically significant due to the large sample size (i.e., high power).
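The reported DFA statistics are linked by two standard relationships: each function's share of variance is its eigenvalue divided by the sum of all eigenvalues, and Wilks' λ for a set of functions is the product of 1/(1 + eigenvalue) over that set. The Python sketch below applies both to the rounded eigenvalues given above; because the inputs are rounded to two decimals, the recomputed percentages match the reported values only to within about a tenth of a percentage point.

```python
from math import prod

# Eigenvalues as reported in the text (rounded to two decimals).
eigenvalues = [2.37, 0.94, 0.14]

# Percentage of between-cluster variance carried by each function.
total = sum(eigenvalues)
percent_variance = [100 * ev / total for ev in eigenvalues]

# Wilks' lambda for Functions 1-3, Functions 2-3, and Function 3 alone.
wilks_lambda = [
    1 / prod(1 + ev for ev in eigenvalues[i:])
    for i in range(len(eigenvalues))
]

print([round(p, 1) for p in percent_variance])
print([round(w, 2) for w in wilks_lambda])  # [0.13, 0.45, 0.88]
```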
The DFA structure matrix (Table 5) summarizes the sophistication and complexity variables that loaded most strongly on (i.e., correlated with) each function. For readability, correlations below 0.30 are not reported.
Function 1 was characterized by variations in noun phrase complexity. Specifically, influential components of Function 1 included phrases with more dependents, prepositions, and determiners per noun phrase, on average. This function was driven by complicated noun phrases (e.g., "the angry dog on the short leash") in contrast to simpler noun phrases (e.g., "the dog"). This function also included higher variation in dependents per nominal. Instead of uniformly complex noun phrases, there could be a mix of simpler and complex phrasing.
Function 2 was characterized by variations in familiar language. Specifically, components of Function 2 are primarily related to the use of more frequent, and thus more common and familiar, words and sentence constructions. In addition, this pattern demonstrated fewer dependents per preposition (e.g., "on the leash") instead of more complex phrases with more dependents (e.g., "on the long leash loosely held by the owner"). Finally, Function 3 was characterized by variations in clause complexity. Components of Function 3 were negatively related to the number of adjectival complements per clause and positively related to nominal complements per clause. Thus, this function captured clauses and sentences with more nouns but fewer adjectives (e.g., "the owner held a leash as she walked her dog" compared to "the nervous owner tightly held the frayed leash as she walked her energetic dog").
Similar to linear regression, linear discriminant functions can be used to calculate mean values for each function based on their constituent variables. These "group centroids" reveal discriminating patterns across the clusters (Table 6). Function 1 (noun phrase complexity) strongly discriminated between Clusters 2 and 3. A larger positive Function 1 value was associated with Cluster 3, whereas a larger negative value was associated with Cluster 2. Function 2 (familiar language) further discriminated between Clusters 1 and 4. A larger positive Function 2 value was associated with Cluster 1, whereas a larger negative value was associated with Cluster 4. Thus, given Functions 1 and 2, many essays might be classified within one of four distinct clusters. Function 3 (clause complexity) provided additional nuance to further discriminate and characterize the clusters. For example, Clusters 1 and 4 both exhibited somewhat negative values for Function 3, whereas Clusters 2 and 3 displayed somewhat positive values. The four clusters are further described in the following sections.

Summary of Clusters
Cluster 1 was distinguished by a use of familiar language (i.e., Function 2), defined as words and sentence constructions that occur more frequently in English. These essays also exhibited relatively higher use of adjectival complements per clause than other clusters (i.e., negative value for Function 3). Thus, adjective structures were more structurally complex and potentially more descriptive. For example, compare the adjective (in brackets) in "the student was [happy]" versus the more complex adjectival clause in "the student was [happy that he passed all his math and engineering tests]". Cluster 1 can be tentatively named Familiar and Descriptive Language.
Cluster 2 was characterized by simpler noun phrases with fewer dependents per nominal, per direct object, per preposition, and so on (i.e., Function 1). These essays also demonstrated the lowest variance in these metrics. Thus, authors of these essays consistently employed simpler syntax at the noun phrase level. Cluster 2 can be named Consistently Simple Noun Phrases.
Cluster 3 demonstrated many of the highest values for measures of noun phrase complexity (i.e., Function 1) along with the highest variance in these indicators. Thus, essays in this cluster employed more complex sentence structures but also varied in levels of complexity. In addition, essays in this cluster demonstrated the highest mean value for average number of nominal complements per clause (i.e., Function 3), further adding to overall complexity. These dual patterns of complexity and variability are often noted as hallmarks of "skillful" syntax in writing [84,85]. Notably, these essays also tended to use more familiar words and sentence constructions (i.e., Function 2). Cluster 3 can be named Variably Complex Noun Phrases.
Cluster 4 was distinguished by words and sentence constructions that are less frequent in the English language (i.e., Function 2).These essays also demonstrated moderately complex noun phrases (i.e., Function 1) and more adjective complements per nominal (i.e., negative value for Function 3).Taken together, these patterns suggest that authors perhaps displayed a more extensive or sophisticated vocabulary, which was implemented descriptively and via moderately complex sentences.Cluster 4 might be named Moderate Complexity with Less Familiar Language.

Relationships between Clusters, Syntactic Sophistication, and Writing Quality
Although writing assessment encompasses more than "quality", the ability to assign valid "scores" to student writing remains an important goal for instructors and AWE [19,83,84]. Thus, it is meaningful to consider how the observed syntactic clusters were associated with variations in writing quality, and whether distinct clusters achieved successful writing in different ways.
Mean holistic scores were computed and compared for each cluster (ANOVA), revealing the following significant main effect of cluster: F(3, 36,203) = 326.16, p < 0.001, η² = 0.03. Specifically, Cluster 4 (Moderate Complexity with Less Familiar Language) received the highest mean score (M = 3.64, SD = 0.90), followed by Cluster 3 (Variably Complex Noun Phrases) (M = 3.52, SD = 0.97), Cluster 1 (Familiar and Descriptive Language), and then Cluster 2 (Consistently Simple Noun Phrases). All pair-wise comparisons were significant (i.e., all p < 0.001). Superficially, Clusters 4 and 3 both exhibited signs of syntactic complexity that is often rewarded in assessment, whereas Clusters 1 and 2 perhaps align with simpler writing. Thus, this statistically significant "ordering" of clusters may seem to confirm expectations about "good" writing. However, the overall main effect and differences between clusters were quite small.
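The small size of this effect can be checked from the reported statistics alone. For a one-way ANOVA, η² equals the between-groups sum of squares divided by the total sum of squares, which can be rewritten in terms of F and its degrees of freedom. The sketch below applies that identity to the values reported above; the function name is our own illustrative choice.

```python
def eta_squared_from_f(f_value, df_between, df_within):
    """Eta squared for a one-way ANOVA, recovered from F and its degrees of freedom.

    Uses the identity eta^2 = (F * df_between) / (F * df_between + df_within).
    """
    effect = f_value * df_between
    return effect / (effect + df_within)


# Values reported for the main effect of cluster on holistic score.
eta_sq = eta_squared_from_f(326.16, 3, 36203)
print(round(eta_sq, 3))
```

The result rounds to the reported η² = 0.03: with more than 36,000 essays, even a very large F statistic corresponds to roughly 3% of score variance explained by cluster membership.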
Figure 1 provides the following revealing illustration: every possible score (i.e., from 2 to 6) was observed in every possible cluster. In other words, each of the four clusters encompassed a range of writing quality. Although not equally likely, student writers who demonstrated "familiar and descriptive language" (Cluster 1) could achieve the same levels of success as students who exhibited "variably complex noun phrases" (Cluster 3), and so on. These patterns provide evidence that observed clusters were not merely incremental manifestations of linear syntactic sophistication (i.e., from "less" to "more"). In other words, distinct writing patterns are not inherently "good" or "bad" but can be enacted in varying ways that receive better or worse evaluations.
Linear regression analyses (Table 7) were conducted to explore how well the 18 syntactic and sophistication variables might predict holistic essay scores. We first conducted a linear regression for the entire corpus to examine how syntax predicted quality overall. We then investigated each cluster to explore how and whether within-cluster estimates differed from whole-corpus estimates. Importantly, we recognize that syntax alone should not account for much variance in writing quality. Nonetheless, syntax contributes to perceived writing quality because syntax is a part of holistic writing skills [7,58]. "Bad grammar" results in lower perceived quality (e.g., Johnson and colleagues [86]). For brevity, we omit correlation matrices for each analysis. However, for any given analysis, correlations for individual metrics were small (|r| < 0.20) but nearly all were statistically significant (i.e., p ≤ 0.001).
For the entire corpus, the linear regression was significant, F(18, 36,206) = 256.38, p < 0.001, R² = 0.11. Thus, a model based on a small number of syntactic indices accounted for about 11% of the variance in scores. Standardized beta coefficients suggest that a variety of factors influenced scores, such as noun phrase complexity (e.g., standard deviation for dependents per nominal, and standard deviation for dependents per direct object) and sophistication (e.g., average frequency of lemmas and average proportion of lemma construction combinations appearing in the reference corpus). Essays attained higher scores when they demonstrated variable complexity (i.e., a mix of simpler and complex structures) and used recognizable but less common vocabulary and language.
These analyses further revealed that the variables contributing to score variations were similar but not identical across clusters. For Cluster 1 (Familiar and Descriptive Language), higher scores were most associated with (i.e., the largest β coefficients) the use of less frequent vocabulary (β = −0.18), higher variability in average dependents per direct object (β = 0.17), and higher variability in average dependents per object of the preposition (β = 0.16). Notably, in comparison to the whole corpus, measures of clausal complexity (e.g., average number of adjectival complements per clause) and noun phrase complexity (e.g., average number of dependents per nominal or preposition) mattered less. Thus, when writers adopted a more familiar and descriptive style, they were more successful when using sophisticated vocabulary (e.g., precise and meaningful wording) and variable syntax, but increased complexity by itself was less meaningful. Cluster 1 might be exemplified by sentences (5) and (6) below. These sentences (and later examples) illustrate properties exhibited in real student writing. However, none of the examples are direct quotes as per nondisclosure agreements. Sentence (5) is modeled after a sentence from a higher scoring essay. This sentence uses familiar yet meaningful words to convey ideas with precision. In contrast, sentence (6) demonstrates that familiarity can coincide with a lack of clarity and sophistication. The words are highly familiar, yet tend to be vague in meaning (e.g., "old" and "kinds of things"). Sentence (6) is modeled after a sentence from a lower scoring essay.

5. Venus is the most comparable planet to earth, and sometimes, the closest in distance.

6. The Earth is old and has many different kinds of things living on Earth.
For Cluster 2 (Consistently Simple Noun Phrases), higher scores were most associated with higher variability in the average number of dependents per nominal subject (β = 0.13), variability in the average number of dependents per direct object (β = 0.15), and higher proportion of lemma construction combinations appearing in the reference corpus (β = 0.18). In comparison to the whole corpus, overall noun phrase complexity and the use of less frequent words were less important. However, unlike Cluster 1, the impact of clausal complexity was similar to the corpus mean. Overall, student writers whose syntactic pattern demonstrated simplicity attained better scores when they used recognizable language (e.g., fewer spelling and grammatical errors) and demonstrated syntactic variability. When syntax is generally simpler, occasional instances of complexity likely "stand out". Indeed, skillful writers may even strategically rely on simpler writing to communicate most ideas, but then use greater complexity only when necessary for the topics at hand. Example (7) presents two sentences with varying complexity of noun phrase structure, modeled after a higher scoring essay, whereas example (8) emulates sentences with similar properties from a lower scoring essay. In both cases, the constituent noun phrases are simple, yet writers display varying degrees of skill in communicating ideas coherently. Example (7) communicates in a relatively straightforward manner. In contrast, example (8) strings together multiple ideas and noun phrases in a more tangled structure. Example (8) is more complex in a less effective way.

7. Each time a person gets into a car, they put themselves at the risk of being killed or severly injured in a car accident from the second they turn the ignition to the moment they put the car back in "park". Traffic accidents claim the lives of countless innocent people each and every day.

8. driverless cars should not be made or thought about personal. Also in the reading it states that driverless cars arent fully driverless some of them need to have the hands on the sensors on the steering wheel and the seats will vibrate when something is wrong and the car cant take control of it and you have to control the car yourself.
For Cluster 3 (Variably Complex Noun Phrases), higher ratings were associated with a less common vocabulary (β = −0.20), lower average number of dependents per object of the preposition (β = −0.20), less variability in the number of dependents per nominal (β = −0.19), and more adjectival complements per clause (β = 0.15). Essays in this cluster received higher scores when writers used more sophisticated vocabulary and when noun phrase complexity did not involve overly complicated prepositions and prepositional phrases. Thus, when writers used more advanced syntax, it was perhaps important not to "overdo it". The incorporation of more descriptive detail or precision was also beneficial (i.e., adjectival complements).
Sentences (9) and (10) illustrate ways in which complex noun phrases manifested in higher and lower scored essays, respectively. In sentence (9), higher complexity serves to establish the writer's stance and contribute meaningful information. In sentence (10), similar properties result in a sentence that is less well organized and harder to parse.

9. With car companies such as [company name] already planning the release of these self-driving cars, this future of transportation will increase safety, efficiency, and entertainment for humans going from one place to another and eventually make standard automobiles obsolete.

10. I never want there to be flying cars because thats when people get lazy and the cars would be useless i want to be able to hop in my cars and go race around and not hop in it and read a book and watch the car drive.
For Cluster 4 (Moderate Complexity with Less Familiar Language), higher essay scores were associated with more recognizable word and construction combinations appearing in the reference corpus (β = 0.19), lower average number of dependents per object of the preposition (β = −0.15), less variability in the number of dependents per nominal (β = −0.14), and more adjectival complements per clause (β = 0.14). Similar to Cluster 3, essays in this cluster received higher scores when noun phrase complexity did not rely overmuch on complicated prepositions and prepositional phrases. Given that Cluster 4 tended to exhibit more prepositional complexity, it seemed particularly worthwhile for writers to moderate that tendency. The incorporation of more descriptive detail or precision was again beneficial (i.e., adjectival complements). However, the factors that contributed to higher scores in Cluster 4 differed from Cluster 3 in a few ways. The use of sophisticated vocabulary was less important than using recognizable language (e.g., fewer typos, grammar errors, or slang terms). In addition, a higher average number of dependents per nominal (β = 0.13) and higher variability in the number of dependents per direct object (β = 0.13) somewhat contributed to higher scores for Cluster 4.
Sentences (11) and (12) below are somewhat lengthy and complex, thus requiring some attention to parse. However, whereas sentence (11) uses that complexity to support its message, sentence (12) presents many ideas in one sentence in a way that is harder to follow.
11. Since automobiles were first invented, they have been continuously updated in all aspects of the car, it's design, how aerodynamic it is, the amount of cylinders an engine can have, the fuel efficiency, and a large variety of other properties.

12. Self-driving cars could be a more productive way for transportation and could also save a lot of lives in the process, a long with making the common person's life just a bit easier in this hard world.
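The within-cluster regressions reported above matter because a predictor can relate to scores differently in different clusters, and a pooled estimate can mask those differences. The toy sketch below makes that point with a single predictor, for which the standardized β equals Pearson's r. All data are invented, the cluster labels A and B are hypothetical, and the analysis is far simpler than the 18-predictor models in Table 7.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation; equals the standardized beta in a one-predictor regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data: x is a complexity index, y is a holistic score.
cluster_a = ([1, 2, 3], [2, 3, 4])   # more complexity, higher score
cluster_b = ([4, 5, 6], [4, 3, 2])   # more complexity, lower score

r_a = pearson_r(*cluster_a)
r_b = pearson_r(*cluster_b)
r_pooled = pearson_r(cluster_a[0] + cluster_b[0], cluster_a[1] + cluster_b[1])

print(r_a, r_b, r_pooled)
```

In this contrived case the two within-cluster relationships point in opposite directions and cancel out when the essays are pooled, which is an exaggerated version of the pattern in which a given index helps one cluster and hurts another.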

Discussion
Appreciating variance in writing is an important component of valid assessment because students express themselves and achieve their writing goals in diverse ways. Consequently, for AWE systems to optimally facilitate appropriate writing assessment, these technologies must be designed to also recognize variability. In the current paper, we explored how variance in writing variables pertaining to syntactic complexity and sophistication could be captured in a large corpus of middle school and high school argumentative essays using natural language processing (NLP) tools. Specifically, the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC, [63,68]) was used to detect clausal complexity, noun phrase complexity, and syntactic sophistication. We then conducted clustering and DFA analyses to characterize possible "syntactic styles" in student essays. To the extent that NLP tools and quantitative analyses can perform these tasks, they demonstrate how AWE tools might implement similar approaches.

Syntactic Variance in Writing
Our primary research questions considered syntactic variance displayed by student writers (RQ1) and how such variations could be characterized via NLP indices (RQ2). Our analyses demonstrated four possible clusters representing different syntactic complexity and sophistication profiles. Inspection of means and DFA allowed us to define these clusters. Observed patterns also partially corroborated styles reported in prior NLP-based analyses (e.g., [7]; see also [44,87,88]).
Cluster 1 was defined by descriptive and familiar language. For this cluster, overall noun phrase complexity was moderate, but essays tended to use more frequently occurring words and constructions (COCA, [69]). Clausal complexity was somewhat higher in this cluster than in others, particularly with respect to adjectival complements. Cluster 1 essays tended to include more adjectives in clauses or information that elaborated the meaning of adjectives. Although not perfectly aligned, this cluster perhaps captured syntactic elements of the "action and depiction" style described by [7], which was characterized by an increased number of adjectives, adverbs, rhetorical devices, and words overall.
Cluster 2 was defined by consistently simple noun phrases. Cluster 2 was the most syntactically simple of all four clusters, with fewer dependents (e.g., per nominal, direct object, and preposition) and low variability. Essays also tended to use less recognizable and frequently occurring language. The syntactic patterns displayed in this cluster potentially resemble the "accessible" style displayed in [7], which was characterized by the use of more common words and constructions, lower syntactic complexity, higher cohesion, and higher lexical and semantic overlap. Thus, as above, our focus on syntax may have captured a portion of that style.
Cluster 3 was defined by variably complex noun phrases. Essays in this cluster demonstrated high (or the highest) mean values for noun phrase complexity. In addition, these essays exhibited high variability in these measures, ranging from moderate to very high complexity. Such complexity was perhaps balanced by using more frequent and familiar words and constructions. Our Cluster 3 perhaps displays some similarity to the "lexical" style shown in [7], which also featured words and constructions that are less common. In addition, their lexical cluster was described as having greater lexical diversity, more imageable words, and more specific words. Our analyses did not include word-based measures, but we can speculate that a more sophisticated vocabulary might lead to more detailed and complex noun phrases.
Finally, Cluster 4 was defined by moderately complex noun phrases and clauses overall, along with the use of less frequently occurring words and sentence constructions. The syntactic properties displayed in this cluster may resemble the "academic" style present in [7], which was similarly characterized by syntactic complexity and less frequent lemma and construction patterns. Their "academic" style also included strong structural components and rhetorical choices, which were beyond the scope of TAASSC to detect in this study.

Style and Score
Given the fundamental work of evaluating student writing, this research also considered the associations between observed styles and essay ratings (RQ3). One possibility was that clusters might be ordered linearly by quality, representing a range from "good" or "skilled" syntax to "poor" or "unskilled" syntax (e.g., see research on writing assessment rubrics, [89][90][91]). Indeed, findings showed that Cluster 4 (Moderate Complexity with Less Familiar Language) earned the highest ratings, followed by Cluster 3 (Variably Complex Noun Phrases), Cluster 1 (Familiar and Descriptive Language), and then Cluster 2 (Consistently Simple Noun Phrases). The higher scoring clusters demonstrated moderate-to-high syntactic complexity balanced by use of familiar language; these patterns align with prior research on syntax and writing quality [54,62,92,93].
Crucially, differences in average cluster scores exhibited very small effect sizes. Statistical significance was likely due to the large number of essays analyzed. Most importantly, all possible scores were distributed across all observed clusters: successful writing was possible regardless of syntactic pattern. In addition, the pathway to success differed somewhat across clusters. When writers implemented greater syntactic complexity, it was worthwhile to moderate such complexity with more familiar language, avoid overly convoluted syntax, and perhaps interweave more and less complex sentences. However, when writers favored syntactic simplicity, it was perhaps worthwhile to demonstrate meaningful and sophisticated word choices, and occasional sentence complexity, for more precise communication. These findings corroborate broad guidance for students to improve their syntax, diction, and varied sentence structure, but also underscore that not all students may equally benefit from generalized feedback recommendations.

Implications for Automated Writing Evaluation
AWE systems employ diverse computational and machine learning processes to detect linguistic features (e.g., vocabulary, syntax, cohesion, and semantics) and perform writing evaluation based on statistical generalizations derived from prior ratings [17,19,84]. Once an AWE algorithm is developed and deployed, all essays and writers can be evaluated in a rapid, consistent, and scalable manner. However, this approach may neglect critical variance because students can navigate the same tasks and goals of writing in different ways (e.g., [7,44,93,94]) that may defy uniform assessment. Although such variance is perhaps understood by expert human writing instructors, many or most AWE systems are not equipped to detect or respond to the different ways that students write.

AWE Development
Current findings offer evidence that greater algorithmic sensitivity to syntactic variance in writing is both possible and necessary for AWE. Syntactic variance is only one dimension of a more complex variance that includes other linguistic properties (e.g., lexicon, cohesion). Future systems need the capacity to automatically detect and respond to distinct writing styles, patterns, behaviors, or strategies exhibited by different students. Importantly, this approach specifically avoids prescriptive (and potentially biased) notions of what constitutes "good" or "desirable" writing. Moreover, this formulation avoids linear assumptions that student writing only varies from "less" (or "poor") to "more" (or "good") on given features of language. Sensitivity to variance emphasizes acknowledging how students write before evaluating how well students write, because assessment may need to differ based on students' pattern or approach.
Improved algorithmic sensitivity is attainable through at least two advancements. First, the field should continue to develop expanded automated indices that capture a broad range of writing features, behaviors, and more. For example, Kyle, Crossley, and colleagues have contributed an impressive variety of tools to the Suite of Automatic Linguistic Analysis Tools (SALAT) (e.g., [39,63,95,96]). Other teams have innovated methods for detecting and assessing student revisions [97-99], use of rhetorical moves [92,100], writing behaviors and keystrokes [94,98,99], and more. Syntax indices alone can already reveal multiple distinct profiles, which is almost certainly an understatement of the true variation among student writers that can be captured via a rich toolkit of NLP packages.
Importantly, metrics need not be limited to writing features and processes. For instance, the current study did not formally analyze variations across different demographic backgrounds, but preliminary inspection found that all four syntactic clusters were observed across all reported grades, races and ethnicities, genders, and language backgrounds (i.e., English language learners and native English speakers). Future work will need to consider how nonlinguistic variables and demographic data could further enrich our understanding of variability, context, and nuances in student writing when paired with NLP metrics (see [101]). Writers' motivations and cultural experiences shape the knowledge and experiences they bring to writing [4][5][6], which should be respected throughout the assessment of writing.
Enhanced NLP detection (and additional variables) is only the initial step towards improved AWE sensitivity to variance in writing. The second necessary advancement is to develop alternative approaches for operationalizing that variance. The current study employed simple but accessible methods for clustering and characterization (i.e., k-means clustering and DFA) that can be readily replicated. These accessible analytical methods were intentionally chosen to conduct a coarse analysis, but nonetheless revealed meaningful clusters of student writing. Moreover, for three out of four clusters, estimations of essay scores were better within-cluster than when derived from the entire corpus. More sophisticated clustering and profiling methods (for example, MDA, e.g., [30]; latent profile analysis, e.g., [93]) will almost certainly contribute to even more nuanced understanding of student writing. Future approaches may also benefit from deeper examination of interdependencies between writing features, such as combinations of nested variables that account for the changing influences of metrics in context (e.g., the impact of clausal complexity may depend on vocabulary usage).
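To illustrate how accessible the clustering step is, the following is a minimal pure-Python sketch of the standard k-means procedure (Lloyd's algorithm). It is not the software used in the study, which is not described here; the two-dimensional points, the fixed starting centroids, and the function names are all assumptions made for illustration. In practice each essay would be a vector of index values, typically standardized before clustering.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid-update steps until assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Invented two-feature "essays" forming two well-separated groups.
points = [(0.1, 0.2), (0.0, 0.0), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
assignments, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(assignments)
```

Fixed starting centroids keep the example deterministic; real analyses usually rely on repeated random initializations and a principled choice of the number of clusters.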
Large Language Models (LLMs) are another technological advancement that has already begun transforming educational technologies (see [102] for review of recent LLM use in education). LLMs are machine learning algorithms that use deep learning [103] to develop generalizations from training data. Although this process can occur without human intervention, fine-tuning is often required at a later stage (i.e., supervision).
A recent study [104] tested the performance of several LLMs (i.e., Google's PaLM 2, Anthropic's Claude 2, and OpenAI's GPT-3.5 and GPT-4) compared to humans in essay scoring. Findings showed that GPT-4 had the best performance as measured by intra-rater reliability and validity. However, GPT-4 performance worsened over time. In another study [105], researchers conducted several experiments testing fine-tuned GPT-3.5 and GPT-4 performance in conducting essay scoring and aiding instructors. Findings showed that the fine-tuned GPT-3.5 produced consistent and accurate scoring. An additional finding was that the LLM helped human scorers perform better. Specifically, novice raters were able to learn faster, and experienced raters were able to become more consistent and efficient. To conclude, LLMs are a promising avenue in automated scoring and evaluation, which can allow for nuanced generalizations and variance. At the same time, LLMs must be supervised and fine-tuned to avoid the perpetuation of biases in the training data [101].

Implications for Instruction with AWE
Research on AWE has observed mixed but encouraging findings for the effectiveness of these systems [13,14], with critiques arising due to the formulaic, decontextualized, and/or impersonal way that AWE systems assess writing. Improving algorithmic sensitivity may address these challenges.
Instead of assessing student writing from a singular perspective (i.e., the same algorithms applied to all essays), our findings suggest an approach in which AWE systems might (a) first detect the approach(es) exhibited within an essay, and then (b) provide assessment and feedback attuned to those patterns. Variance-sensitive models and systems might operate in a multi-stage or "nested" fashion. Our data showed that essays in all four syntactic clusters could attain high scores, but the pathway to success may differ somewhat. Instead of encouraging all writers to use more sophisticated syntax and vocabulary, some writers might benefit from guidance in varying or even reducing overall complexity. Other writers may benefit from strategies for leveraging familiar and accessible language to complement complex syntax. Importantly, the implication here is not to sort or lock students into a handful of stagnant profiles that determine their destiny. Rather, the purpose is to recognize and appreciate variance in written expression, which then enables instruction and support that aligns with writers' current strengths and needs in context (e.g., scholarship on asset-based and student-centered assessment, [106,107]). AWE systems that are sensitive and responsive to variance may be better able to provide feedback that is centered on the students and their writing rather than the software.
From the increasingly popular perspective of explainable AI (e.g., [108]), it might also be pedagogically worthwhile to explain to students how AWE systems define, detect, and assess different "patterns". Students might be invited to reflect on their own preferred patterns and/or practice new patterns. Instead of learning to write in a single formulaic manner to "get a good grade from the computer", students might learn to purposefully explore or enact different patterns that will be assessed on their own merits. A more nuanced AWE system may serve to reinforce that there are multiple ways to write successfully and express ideas. Scoring and feedback need not happen in the same way for every student, and thus, diverse patterns (and students) can achieve comparable success. Through transparency and explainability regarding variance, students might (a) better understand, discuss, or debate how "scores" are determined, and (b) gain greater awareness of how intentional writing choices can influence writing quality and audiences.
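One way to picture the two-stage approach is sketched below: stage (a) assigns an essay to the nearest cluster centroid in discriminant-function space, and stage (b) looks up feedback tied to that pattern. This is a simplified illustration rather than a proposed system. The centroid coordinates are invented (only their signs follow the pattern reported for Functions 1 and 2), the nearest-centroid rule is a simplification, and the feedback strings loosely paraphrase the within-cluster findings.

```python
# Hypothetical centroids in (Function 1, Function 2) space. Values are invented;
# only their signs follow the pattern described for the four clusters.
CENTROIDS = {
    "Familiar and Descriptive Language": (0.0, 1.5),
    "Consistently Simple Noun Phrases": (-2.0, 0.0),
    "Variably Complex Noun Phrases": (2.0, 0.0),
    "Moderate Complexity with Less Familiar Language": (0.0, -1.5),
}

# Illustrative feedback messages, loosely paraphrasing the within-cluster findings.
FEEDBACK = {
    "Familiar and Descriptive Language":
        "Try more precise word choices and vary your sentence structures.",
    "Consistently Simple Noun Phrases":
        "Consider adding occasional complexity where an idea needs more detail.",
    "Variably Complex Noun Phrases":
        "Watch for overly long prepositional phrases that obscure your point.",
    "Moderate Complexity with Less Familiar Language":
        "Check that less common wording stays recognizable and error-free.",
}


def detect_pattern(function_scores):
    """Stage (a): return the name of the nearest cluster centroid."""
    def squared_distance(name):
        return sum((s - c) ** 2 for s, c in zip(function_scores, CENTROIDS[name]))
    return min(CENTROIDS, key=squared_distance)


def evaluate(function_scores):
    """Stage (b): pair the detected pattern with pattern-specific feedback."""
    pattern = detect_pattern(function_scores)
    return pattern, FEEDBACK[pattern]


pattern, message = evaluate((1.8, 0.2))
print(pattern)
print(message)
```

A fuller design would also fit a separate scoring model per pattern and would treat the detected pattern as a description of the current essay, not as a fixed label for the student.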

Table 1. List of 18 TAASSC indices analyzed in the current study.
Dependents per Preposition (stdev): dependents per object of the preposition, standard deviation (pobj_stdev)
10. Determiners per Nominal (avg): average number of determiners per nominal (det_all_nominal_deps_struct)
11. Prepositions per Nominal (avg): average number of prepositions per nominal (prep_all_nominal_deps_struct)
12. Adjectival Modifiers (avg): average number of adjectival modifiers per direct object (amod_dobj_deps_struct)
13. Prepositions per Preposition (avg): average number of prepositions per object of the preposition (prep_pobj_deps_struct)

Table 2. Student information for the analyzed corpus (n = 36,207).

Table 3. Number of essays per score level in the analyzed corpus (n = 36,207).

Table 4. Mean TAASSC index values (and SDs) for the four clusters identified in the k-means analysis.

Table 5. DFA function loading using TAASSC indices, ordered by function and magnitude.

Table 6. Group centroids based on eigenvalues for each function by each cluster.

Table 7. Linear regression analyses for predicting essay score across all clusters and within individual clusters.