1. Introduction
Automated essay scoring (AES), the task of employing natural language processing (NLP) technology to score student essays at scale, plays a vital role in lightening educators’ workload [1,2]. Recently, with the rise of massive open online courses (MOOCs), valid and reliable automated assessment tools have become vital for testing the learning outcomes of large numbers of learners [3,4]. In addition, during grading activities, the same teacher may assign different scores to the same essay at different times, and different educators may score the identical essay differently [5]. AES systems can effectively alleviate this intra-rater and inter-rater inconsistency [6]. Two categories of approaches have been investigated to tackle the AES task: feature-based and neural-based. Feature-based approaches require expert knowledge to design linguistic or rubric indices [7,8,9] reflecting essay grammar, content, and structure, and these manual indices serve as input features for linear regression methods. Neural-based approaches, on the other hand, automatically learn the features and relations between student essays and their scores through deep learning networks in an end-to-end fashion, eliminating the need for feature engineering and generally outperforming feature-based AES systems [6,10,11,12]. The vast majority of neural-based AES systems share the same goal of competing to improve upon state-of-the-art (SOTA) benchmark performance on a holistic predicted essay score that reflects general essay quality, by designing deep and complex neural network architectures.
Nevertheless, despite their remarkable AES benchmark performance, neural-based AES systems work like black boxes [7], and the holistic scores they produce are far from providing adequate pedagogical information to educators in practice, which raises ethical issues about their algorithms [13]. Researchers have worked on automated writing evaluation (AWE) systems to provide students with formative writing feedback on specific quality dimensions such as coherence, grammar errors, rhetorical moves, and topical development [14,15,16,17]. However, these AWE systems are built either on separate algorithms divorced from AES systems or on feature-based AES systems targeting specific rubrics with less competitive automated essay scoring results.
It is challenging to design neural-based AES systems that can provide AWE feedback. Unlike feature-based methods, whose input features are explicit rubric indices, most neural-based AES systems take the student essay as their sole input, neglecting another essential source of information: the essay instructions. In contrast, when human beings compose argumentative essays, they first read and attend to the topics in the essay instructions and then conduct argumentative writing “in a principled way to support a claim using reasons and evidence from multiple sources” [18]. This human reading and writing process inspired us to develop a neural-based AES system that predicts essay scores based on both essay instruction information and student essays. We believe that by feeding in essay instruction information, a neural-based AES system can learn prompt-specific knowledge that enhances automated essay scoring performance. In addition, from a pedagogical point of view, student academic success requires the ability to produce a high-quality argument, complete with statements, warrants, and evidence [19,20]. Thus, we aimed to provide AWE feedback by spotting key topical sentences (KTS) in student essays that reflect topical information through either reasons or evidence. Together with a predicted score, we believe that spotting key topical sentences in argumentative essays can facilitate teachers’ grading process when judging essay quality.
Specifically, our method deploys a topical sequence extraction agent to extract topical information from the essay prompt and then feeds both the topical information and the student essays to BERT [21] to train the AES system (denoted Topic-aware BERT). For AWE feedback, we retrieve KTS by ranking the self-attention weights between sentences in student essays and the topical information in the trained Topic-aware BERT, motivated by the argument by Clark et al. [22] that self-attention maps in BERT can help us understand what neural networks learn about language. To evaluate our method, we use an open dataset to measure AES performance with the official quadratic weighted kappa (QWK) metric. Moreover, we manually create a dataset to evaluate the model’s ability to retrieve KTS. The results show that Topic-aware BERT achieves a competitive AES performance compared to state-of-the-art models and that our KTS-retrieving method is effective when applied to argumentative essays. To the best of our knowledge, this is the first study to train a robust BERT-based AES system with prompt-specific knowledge, and it is among the very few works that develop systems connecting neural-based AES and AWE. We summarize the contributions of this work below:
We successfully link neural AES and AWE by designing a fully automatic multiagent AES + AWE system.
We propose Topic-aware BERT and improve AES performance significantly by introducing prompt-specific knowledge.
This is the first study to build an AWE system by interpreting self-attention layers in BERT. The experiments show that Topic-aware BERT achieves robust performance in spotting key topical sentences from argumentative essays as AWE feedback, by probing attention weight maps.
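The core of Topic-aware BERT’s input construction — prepending the extracted topical sequence to the student essay in BERT’s sequence-pair format — can be sketched as follows. This is a minimal illustration only: whitespace tokenization stands in for BERT’s WordPiece tokenizer, and the function name is ours, not the paper’s.

```python
def build_bert_input(topical_sequence, essay, max_len=512):
    """Concatenate an extracted topical sequence with a student essay
    in BERT's sequence-pair format: [CLS] topic [SEP] essay [SEP].

    Whitespace tokenization stands in for WordPiece here; the real
    system would use BERT's own tokenizer instead.
    """
    topic_tokens = topical_sequence.split()
    essay_tokens = essay.split()
    # Reserve 3 slots for the [CLS] marker and the two [SEP] markers.
    budget = max_len - 3 - len(topic_tokens)
    if len(essay_tokens) > budget:
        # Long topical sequences shrink the budget, so sentences at the
        # end of long essays get truncated away -- the source of the
        # KTS-retrieval limitation discussed in Section 5.
        essay_tokens = essay_tokens[:budget]
    return ["[CLS]"] + topic_tokens + ["[SEP]"] + essay_tokens + ["[SEP]"]
```

The returned token list never exceeds `max_len`, mirroring BERT’s 512-token limit that the later analysis attributes truncation effects to.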
This paper is organized as follows. Section 2 reviews related work on automated essay scoring and automated writing evaluation. Section 3 presents the data used in this paper, including the ASAP dataset and a human-annotated KTS dataset, together with the formulation of the topical sequence extraction, Topic-aware BERT, and KTS-retrieving approaches. Section 4 describes the experimental settings, baselines and comparison models, and evaluation metrics. Section 5 presents the empirical results and analysis. Finally, we conclude this paper and discuss future work in Section 6.
5. Results and Analysis
This section presents and analyzes the results of automated essay scoring and key topical sentence retrieval experiments.
5.1. Automated Essay Scoring Results and Analysis
Table 4 illustrates the QWK scores achieved by the baseline models and Topic-aware BERT on automated essay scoring. In general, all baseline models’ AES performance was negatively affected by the smaller training data size, as we only conducted experiments with the 5152 essays from prompts one, two, and seven, whereas all prompts together comprise 12,978 essays. For instance, CNN-LSTM achieved a QWK score of 0.821 on essays from prompt one with the complete training data, which deteriorated to 0.789 in our experiment. All RNN-based baseline models achieved lower average QWK scores with less training data: CNN-LSTM’s performance fell from 0.772 to 0.760 and CNN-LSTM-Att’s from 0.768 to 0.757. However, there were exceptions in which some baseline models benefited from less training data. For example, CNN-LSTM-Att achieved a better AES performance (0.825) than when trained with all prompts (0.821), and the average QWK of Vanilla BERT improved from 0.767 to 0.774. As topics and grading rubrics differ from prompt to prompt, we suspect that training with essays from fewer prompts yields training data that is smaller but also less varied, which could make it easier for non-topic-aware models to learn AES features from training essays with more consistent topics and grading rubrics.
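For reference, the QWK metric used throughout this section can be computed as below. This is a standard implementation of quadratic weighted kappa, not the competition’s official script.

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic weighted kappa between two lists of integer ratings.

    kappa = 1 - sum(w * O) / sum(w * E), where O is the observed
    agreement matrix, E the expected matrix from the two raters'
    marginal distributions, and w[i][j] = (i - j)^2 / (n - 1)^2.
    """
    n = max_rating - min_rating + 1
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1
    hist_a = Counter(a - min_rating for a in rater_a)
    hist_b = Counter(b - min_rating for b in rater_b)
    total = len(rater_a)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = ((i - j) ** 2) / ((n - 1) ** 2)
            expected = hist_a[i] * hist_b[j] / total
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den
```

Perfect agreement yields a kappa of 1, and larger disagreements are penalized quadratically with the distance between the two scores.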
On the other hand, the results of Topic-aware BERT indicate that the AES systems’ performance was significantly improved by introducing topical information from the essay instructions. All four variants of Topic-aware BERT showed robust and competitive AES performance and outperformed all three strong baseline models on the average QWK. In fact, Topic-aware BERT achieved the second-best AES performance on essays from prompt one and led on prompts two and seven, beating the baselines trained with either the experimented essays or all essays. In particular, the best-performing Topic-aware BERT models outperformed Vanilla BERT, which directly shows that making BERT aware of topical information from the essay instructions boosted AES performance.
Looking at the AES results of the Topic-aware BERT variants, Manual-T BERT achieved the highest QWK of 0.822 on prompt one, while the automatic variants (YAKE-T BERT, Xsum-T BERT, and CNN-T BERT) gained better performance on prompts two and seven, and similarly on the average QWK. Specifically, Xsum-T BERT performed very close to Manual-T BERT with a QWK score of 0.821. For prompt two, YAKE-T BERT and CNN-T BERT were the best models, with QWK scores of 0.717 and 0.714, respectively. Xsum-T BERT and YAKE-T BERT gained the best performance on prompt seven, with QWK scores of 0.836 and 0.837, leading jointly on the average QWK metric with 0.789. This result demonstrates that instead of providing Topic-aware BERT with human-picked topic keywords, feeding it topical sequences from automatic extraction approaches, such as automated key-phrase extraction and automatic one- or multi-sentence summarization, can further enhance AES performance.
5.2. Key Topical Sentence Retrieval Results and Analysis
In this section, we present the empirical KTS retrieval results of the Topic-aware BERT variants (Manual-T BERT, YAKE-T BERT, Xsum-T BERT, and CNN-T BERT) and the baselines. In particular, we analyze the impact of the topical sequence extraction approaches and of the different transformer layers of Topic-aware BERT on KTS retrieval performance for the argumentative (prompts one and two) and narrative (prompt seven) essay genres. The KTS retrieval results achieved by different transformer layers of Topic-aware BERT on essays from prompts one, two, and seven are shown in Figure 4, Figure 5, and Figure 6, respectively.
Figure 4 illustrates the KTS performance of different transformer layers from Manual-T BERT, YAKE-T BERT, Xsum-T BERT, and CNN-T BERT on essays from prompt one, which are of the argumentative genre. The random-selection and TF-IDF baselines achieved MAP scores of 0.46 and 0.50, respectively. The best-performing transformer layers from all four variants of Topic-aware BERT acquired MAP scores over 0.60, beating the baselines by notable margins. In detail, Manual-T BERT performed best with a MAP score of 0.692, followed by Xsum-T BERT with a MAP score of 0.625. While beating the baselines, YAKE-T BERT and CNN-T BERT showed less competitive KTS performance than Manual-T BERT and Xsum-T BERT, achieving MAP scores of 0.606 and 0.615, respectively. This was because the topical sequences generated by YAKE and CNN were longer than those from Manual and Xsum. The concatenation of a long topical sequence and a student essay can exceed the 512-token limit that BERT can process, truncating sentences from the end of the student essay. The truncated sentences might contain KTS, preventing Topic-aware BERT from retrieving them. For both Manual-T BERT and Xsum-T BERT, transformer layer 10 outperformed the other layers on KTS retrieval for prompt one essays, while for YAKE-T BERT and CNN-T BERT, transformer layer 7 performed best.
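The probing procedure behind these figures can be sketched as follows: at a chosen layer, average the self-attention weight that each essay sentence’s tokens receive from the topical-sequence tokens, then rank sentences by that score. The function name, the toy attention matrix, and the single-head view are illustrative assumptions; in the real system the weights come from a trained Topic-aware BERT.

```python
def rank_sentences_by_topic_attention(attention, topic_idx, sentence_spans):
    """Rank essay sentences by the average attention they receive
    from the topical-sequence tokens.

    attention:      square matrix (list of lists); attention[i][j] is the
                    weight from token i to token j at one layer/head.
    topic_idx:      indices of the topical-sequence tokens.
    sentence_spans: (start, end) token index ranges, one per essay sentence.
    Returns sentence indices sorted from most to least topical.
    """
    scores = []
    for start, end in sentence_spans:
        weights = [attention[t][j] for t in topic_idx for j in range(start, end)]
        scores.append(sum(weights) / len(weights))
    return sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)
```

The top of the returned ranking gives the candidate key topical sentences reported to the teacher as AWE feedback.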
A similar pattern of KTS performance on essays from prompt two, which are also argumentative, is shown in Figure 5. The best transformer layers of Manual-T BERT, YAKE-T BERT, Xsum-T BERT, and CNN-T BERT outperformed the random-selection (MAP 0.47) and TF-IDF (MAP 0.49) baselines. Transformer layer 10 from Manual-T BERT (MAP 0.692) and Xsum-T BERT (MAP 0.625) were the best- and second-best-performing KTS retrieval systems. YAKE-T BERT and CNN-T BERT gained relatively lower MAP scores than Manual-T BERT and Xsum-T BERT. Transformer layer 10 was also the best layer in YAKE-T BERT (MAP 0.594) and CNN-T BERT (MAP 0.672) for prompt two essays, while layer 7 achieved a very close KTS retrieval performance with MAP scores of 0.563 and 0.572, respectively. The possibility of truncated essays likely caused the inconsistency in the KTS retrieval performance of YAKE-T BERT and CNN-T BERT between prompt one and prompt two essays.
As shown in Figure 6, the TF-IDF baseline achieved a strong KTS retrieval performance with a MAP score of 0.59 on essays from prompt seven, which are of the narrative genre. Transformer layer 10 from YAKE-T BERT was the only KTS retrieval system that outperformed TF-IDF, acquiring a MAP score of 0.640. Manual-T BERT, Xsum-T BERT, and CNN-T BERT achieved MAP scores of 0.572, 0.580, and 0.580, failing to surpass TF-IDF. This result indicates that Topic-aware BERT shows robust KTS retrieval performance when dealing with argumentative essays but cannot effectively retrieve KTS from narrative essays.
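The MAP figures above can be reproduced from binary relevance judgments as follows. This is a standard mean-average-precision implementation under our assumption that each ranked sentence list is scored against the human KTS annotations; the function names are ours.

```python
def average_precision(ranked_relevance):
    """Average precision for one ranked list, where ranked_relevance[k]
    is True if the sentence at rank k+1 is an annotated key topical
    sentence."""
    hits, total = 0, 0.0
    for k, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision at this relevant rank
    return total / hits if hits else 0.0

def mean_average_precision(all_rankings):
    """MAP: the mean of average precision over the ranked KTS lists
    of all evaluated essays."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)
```

Because average precision rewards placing annotated KTS near the top of the ranking, MAP directly reflects how useful the retrieved sentence list is to a grading teacher.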
5.3. Summary of AES and KTS Retrieval Results and Analysis
We summarize the AES and KTS retrieval performance and analysis in Table 5. All variants of Topic-aware BERT outperformed strong AES baselines on the average QWK, demonstrating that, with the help of topical sequence extraction, Topic-aware BERT achieved robust and competitive automated essay scoring performance. In addition, the automatic variants (YAKE-T BERT, Xsum-T BERT, and CNN-T BERT) further improved AES performance with a thoroughly automatic topic-awareness strategy, which improves the scalability of Topic-aware BERT AES systems. Regarding KTS retrieval, Manual-T BERT, YAKE-T BERT, Xsum-T BERT, and CNN-T BERT jointly surpassed the baseline approaches on essays from prompts one and two, indicating reliable and effective key topical sentence retrieval on argumentative essays. Specifically, as shown in Table 2, Manual-T BERT and Xsum-T BERT had topical sequence lengths of 4 and 12, much shorter than those of YAKE-T BERT (32) and CNN-T BERT (42). Correspondingly, Manual-T BERT and Xsum-T BERT demonstrated more robust KTS retrieval than YAKE-T BERT and CNN-T BERT, because a longer topical sequence can truncate the student essay and make KTS at the end of the essay invisible to Topic-aware BERT. This result suggests interesting future work on the relationship between topical sequence length and KTS retrieval performance. Looking at the performance of different layers in KTS retrieval, transformer layer 10 consistently served as the best-performing layer in Manual-T BERT and Xsum-T BERT for argumentative essays, so we recommend layer 10 of Manual-T BERT and Xsum-T BERT for retrieving key topical sentences from student essays. In particular, Xsum-T BERT, as a fully automatic approach, achieved competitive performance in both AES and KTS retrieval, which is promising for deployment with good scalability. For the narrative essays from prompt seven, despite the effective automated essay scoring performance, the KTS retrieval competence of Topic-aware BERT was less competitive than on argumentative essays; we will investigate the relationship between essay genres and KTS retrieval performance in future work. In summary, we conclude that:
With awareness of the essay topics, all variants of Topic-aware BERT outperformed the current best AES baselines on the average QWK.
Automatic Topic-aware BERT further improved AES performance, indicating potential for practical deployment at scale.
All variants of Topic-aware BERT showed reliable KTS retrieval performance on argumentative essays.
Topical sequence extraction strategies that produce topical sequences of appropriate length, such as Xsum, can boost both AES and KTS retrieval performance.
6. Conclusions and Future Work
In this paper, we proposed Topic-aware BERT to connect automated essay scoring with automated writing evaluation. The experiments illustrated that, by being fed both extracted topical sequences and student essays, Topic-aware BERT achieved solid and robust AES performance compared with various previous best AES methods. Moreover, by probing self-attention scores, the 10th layer of Topic-aware BERT achieved robust performance in spotting key topical sentences in argumentative essays. Together with a reliably predicted essay score, the extracted key topical sentences serve as AWE information that can accelerate teachers’ grading process, enhance plagiarism control, and improve the transparency of the AES system. In particular, one of the proposed variants, Xsum-T BERT, thoroughly automates the AES and KTS retrieval process and achieved strong performance in both tasks, giving it the potential to be widely deployed in practice. We also identified interesting future work, such as exploring the relationship between the KTS retrieval performance of Topic-aware BERT and essay genres, as well as topical sequence lengths. As with previous BERT-based AES systems, Topic-aware BERT can only process up to 512 tokens, which might limit AES and KTS retrieval performance on long essays; we will therefore investigate models that can process long documents to address this limitation. We will also conduct more experiments with other AES datasets covering more genres and varying essay lengths, such as CLC-FCE and TOEFL11, together with larger annotated KTS datasets, to validate the generalization and stability of Topic-aware BERT. Finally, we plan to apply Topic-aware BERT in real classrooms to investigate its scalability and sustainability regarding resource consumption and whether it benefits teachers during essay-grading activities.