# Automated Essay Scoring Using Transformer Models

## Abstract

## 1. Introduction

- To what extent does a transformer-based NLP model produce benefits compared to a traditional regression-based approach for AES?
- In which ways can transformer-based AES be used to increase the accuracy of scores by human raters?

## 2. Methodological Background for Automated Essay Scoring and NLP Based on Neural Networks

#### 2.1. Terms and General Methodological Background

#### 2.2. Traditional Approaches

#### 2.3. Approaches Based on Neural Networks

#### 2.3.1. Methodological Background on Recurrent Neural Nets (RNN)

#### 2.3.2. Results of Recurrent Neural Networks for Automated Essay Scoring Tasks

#### 2.3.3. Methodological Background on Transformer Models

#### 2.3.4. Results of Transformer Models for Automated Essay Scoring Tasks

## 3. Data

## 4. Method

#### 4.1. Basic Data Preparation Used in Both Approaches

```python
from sklearn.model_selection import train_test_split

# Split texts and labels into 70% training data and 30% test data
train_text_series, test_text_series, train_label_series, test_label_series = train_test_split(
    data["text"], data["label"], test_size=0.30, random_state=42
)

# Convert the resulting Series objects to plain Python lists
train_text = train_text_series.to_list()
test_text = test_text_series.to_list()
train_label = train_label_series.to_list()
test_label = test_label_series.to_list()
```

#### 4.2. Regression Model Estimation

- Creating a frequency dictionary, with the information on how often a word was used in a polite or impolite response, and
- Computing for each response a sum score for politeness and for impoliteness, based on the words included in the responses and their values in the frequency dictionary.
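
The two steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function bodies, the regex-based `tokenize` helper, the toy responses, and the coding of labels (1 = polite, 0 = impolite) are all assumptions made for this example, and every word occurrence is counted rather than every distinct word.

```python
import re
from collections import defaultdict

import numpy as np


def tokenize(text):
    """Lowercase a response and split it into word tokens (simplified, hypothetical)."""
    return re.findall(r"\w+", text.lower())


def build_freqs(texts, labels):
    """Count how often each word occurs in responses of each class."""
    freqs = defaultdict(int)
    for text, label in zip(texts, labels):
        for word in tokenize(text):
            freqs[(word, label)] += 1
    return freqs


def extract_features(text, freqs):
    """Return [politeness sum score, impoliteness sum score] for one response."""
    polite_score = 0
    impolite_score = 0
    for word in tokenize(text):
        polite_score += freqs.get((word, 1), 0)    # assumption: label 1 = polite
        impolite_score += freqs.get((word, 0), 0)  # assumption: label 0 = impolite
    return np.array([polite_score, impolite_score], dtype=float)


# Toy data, invented for illustration only
toy_text = ["thank you very much", "please help, thank you", "that is stupid"]
toy_label = [1, 1, 0]

toy_freqs = build_freqs(toy_text, toy_label)
toy_features = extract_features("thank you, stupid", toy_freqs)
print(toy_features)  # two sum scores: politeness, impoliteness
```

In this toy case, "thank" and "you" each occur twice in polite responses and "stupid" occurs once in an impolite response, so the new response receives a politeness score of 4 and an impoliteness score of 1.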

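```python
import numpy as np

# Create frequency dictionary (build_freqs and extract_features are helper functions not shown here)
freqs = build_freqs(train_text, train_label)

# Extract the two sum-score features for each training response
train_features = np.zeros((len(train_text), 2))
for i in range(len(train_text)):
    train_features[i, :] = extract_features(train_text[i], freqs)

# Extract the test features in the same way; they are needed for the evaluation
test_features = np.zeros((len(test_text), 2))
for i in range(len(test_text)):
    test_features[i, :] = extract_features(test_text[i], freqs)
```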

```python
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" reweights the classes inversely to their frequencies
log_model = LogisticRegression(class_weight="balanced").fit(train_features, train_label)
```

```python
from sklearn import metrics

# Predict the class labels for the test data once and reuse them
test_pred = log_model.predict(test_features)

print("Confusion Matrix:\n", metrics.confusion_matrix(test_label, test_pred))
print("Mean Accuracy:\n", log_model.score(test_features, test_label))
print("F1 Score:\n", metrics.f1_score(test_label, test_pred))
# ROC AUC is computed here from the predicted class labels, not from probabilities
print("ROC AUC:\n", metrics.roc_auc_score(test_label, test_pred))
print("Cohen's Kappa:\n", metrics.cohen_kappa_score(test_label, test_pred))
```
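
To make the relation between these metrics explicit, the following sketch recomputes them by hand from the confusion matrix reported for the regression model on the test data (31, 15, 87, 494). It assumes that the polite class is the positive class (label 1); the variable names are chosen for this illustration only.

```python
# Confusion matrix of the regression model on the test data
# (rows: actual class, columns: predicted class)
tn, fp = 31, 15    # actual impolite: predicted impolite / predicted polite
fn, tp = 87, 494   # actual polite:   predicted impolite / predicted polite
n = tn + fp + fn + tp

# Accuracy: share of correctly classified responses
accuracy = (tp + tn) / n

# F1 score for the positive class (assumption: polite = 1)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# With hard class predictions, ROC AUC reduces to the mean of
# sensitivity (recall) and specificity
specificity = tn / (tn + fp)
roc_auc = (recall + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
p_observed = accuracy
p_chance = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n**2
kappa = (p_observed - p_chance) / (1 - p_chance)

print(f"Accuracy: {accuracy:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC AUC: {roc_auc:.2f}")
print(f"Cohen's Kappa: {kappa:.2f}")
```

Rounded to two decimals, these values match the reported results for logistic regression (0.84, 0.91, 0.76, and 0.30). The gap between the high F1 score and the low kappa reflects the strong class imbalance in the test data.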

#### 4.3. Transformer-Based Classification

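```python
from transformers import AutoTokenizer

checkpoint = "deepset/gbert-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenize the texts; pad to the longest text and truncate overly long texts
train_encodings = dict(tokenizer(train_text, padding=True,
                                 truncation=True, return_tensors="np"))
# The test texts are encoded in the same way; they are needed for the prediction
test_encodings = dict(tokenizer(test_text, padding=True,
                                truncation=True, return_tensors="np"))
```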
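
What `padding = True` and `truncation = True` do to a batch can be illustrated with a simplified stand-in. The function `pad_and_truncate`, the token ids, the padding id of 0, and the maximum length of 4 are hypothetical; a real tokenizer additionally handles special tokens and takes the maximum length from the model configuration.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Simplified stand-in for padding and truncation in a tokenizer call."""
    # Truncation: cut every sequence to at most max_length tokens
    truncated = [ids[:max_length] for ids in batch]
    # Padding: fill shorter sequences up to the longest sequence in the batch
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids, invented for illustration only
toy_batch = [[5, 8, 13], [7, 2, 9, 4, 11, 6], [3]]
toy_encodings = pad_and_truncate(toy_batch, max_length=4)

for ids, mask in zip(toy_encodings["input_ids"], toy_encodings["attention_mask"]):
    print(ids, mask)
```

All sequences end up with the same length, which is what allows them to be stacked into a single array and passed to the model.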

```python
import numpy as np

unique, counts = np.unique(train_label, return_counts=True)
# Upweight class 0 by the ratio of class-1 to class-0 training samples
class_weight = {0: counts[1] / counts[0], 1: 1.0}
```
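
A small worked example shows the effect of this weighting. The toy labels are invented for illustration; with six responses of class 1 and two of class 0, each class-0 response counts three times as much as a class-1 response.

```python
import numpy as np

# Toy labels, invented for illustration: class 0 is the minority class
toy_label = [1, 1, 1, 0, 1, 1, 0, 1]

# np.unique returns the sorted class values and how often each occurs
unique, counts = np.unique(toy_label, return_counts=True)

# Weight class 0 by the ratio of class-1 to class-0 samples
# (float() only converts the NumPy scalar to a plain Python number)
class_weight = {0: float(counts[1] / counts[0]), 1: 1.0}
print(class_weight)
```

With these weights, both classes contribute the same total weight to the loss (2 × 3.0 for class 0 and 6 × 1.0 for class 1).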

```python
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from transformers import TFAutoModelForSequenceClassification

# Definition of batch size and number of epochs
batch_size = 8
num_epochs = 3

# Definition of the learning rate scheduler.
# The number of training steps is the number of samples in the dataset,
# divided by the batch size, then multiplied by the total number of epochs.
num_train_steps = (len(train_label) // batch_size) * num_epochs
lr_scheduler = PolynomialDecay(initial_learning_rate=5e-5,
                               end_learning_rate=0,
                               decay_steps=num_train_steps)

# Definition of the optimizer using the learning rate scheduler
opt = Adam(learning_rate=lr_scheduler)

# Definition of the model architecture and initial weights
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Definition of the loss function
loss = SparseCategoricalCrossentropy(from_logits=True)

# Definition of the full model for training (or fine-tuning)
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])
```
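
With its default exponent of 1, `PolynomialDecay` lowers the learning rate linearly from the initial value to the end value over the given number of steps. The following sketch reproduces that schedule in plain Python; the function name `linear_decay` and the sample size of 1600 are hypothetical and serve only to produce round numbers.

```python
def linear_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=600, power=1.0):
    """Learning rate at a given training step under polynomial decay."""
    step = min(step, decay_steps)  # the rate stays at end_lr after decay_steps
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr


# Hypothetical training set size, chosen for illustration only
num_samples = 1600
batch_size = 8
num_epochs = 3
num_train_steps = (num_samples // batch_size) * num_epochs  # 200 steps per epoch

for step in (0, num_train_steps // 2, num_train_steps):
    print(step, linear_decay(step, decay_steps=num_train_steps))
```

The learning rate starts at 5e-5, reaches half of that value after half of the training steps, and arrives at 0 with the last step.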

```python
# Fine-tune the model on the training data, using the class weights
model.fit(train_encodings, np.array(train_label),
          class_weight=class_weight, batch_size=batch_size,
          epochs=num_epochs)
```

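```python
import tensorflow as tf

# Convert the predicted logits for the test data to class probabilities
test_pred_prob = tf.nn.softmax(model.predict(dict(test_encodings))["logits"])
# The predicted class is the class with the highest probability
test_pred_class = np.argmax(test_pred_prob, axis=1)
```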
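
The conversion from logits to class predictions can be traced with a small NumPy sketch. The `softmax` helper and the toy logits are assumptions for illustration; they stand in for `tf.nn.softmax` and for the model output.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids numerical overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Toy logits for three responses and two classes, invented for illustration
toy_logits = np.array([[2.0, -1.0],
                       [-0.5, 1.5],
                       [0.0, 0.0]])

toy_prob = softmax(toy_logits)            # each row sums to 1
toy_class = np.argmax(toy_prob, axis=1)   # index of the most probable class
print(toy_prob)
print(toy_class)
```

Because softmax preserves the ordering of the logits, applying `np.argmax` to the logits directly would yield the same classes; the softmax step is only needed when the probabilities themselves are of interest.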

## 5. Results

## 6. Discussion

## 7. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A

#### Appendix A.1

#### Appendix A.2

#### Appendix A.3

## References

Study | Task | Data | Model | Kappa |
---|---|---|---|---|
Taghipour & Ng (2016) | Scoring essay answers to eight different questions, some of which depend upon source information | 12,978 essays with a length of 150 to 550 words | LSTM + CNN | 0.76 |
Alikaniotis et al. (2016) | Scoring essay answers to eight different questions, some of which depend upon source information | 12,978 essays with a length of 150 to 550 words | LSTM combined with score-specific word embeddings | 0.96 |

Architecture | Examples | Tasks |
---|---|---|
Encoder | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa | Sentence classification, named entity recognition, extractive question answering |
Decoder | CTRL, GPT, GPT-2, GPT-3, Transformer XL, GPT-J-6B, Codex | Text generation |
Encoder-decoder | BART, T5, Marian, mBART | Summarization, translation, generative question answering |

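Study | Task | Data | Model | Kappa |
---|---|---|---|---|
Rodriguez et al. (2019) | Scoring essay answers to eight different questions, some of which depend upon source information | 12,978 essays with a length of 150 to 550 words | BERT | 0.75 |
Rodriguez et al. (2019) | Scoring essay answers to eight different questions, some of which depend upon source information | 12,978 essays with a length of 150 to 550 words | XLNet | 0.75 |
Mayfield & Black (2020) | Scoring essay answers to five different questions, some of which depend upon source information | 1800 essays for each of the five questions, each with a length of 150 to 350 words | N-Gram | 0.76 |
Mayfield & Black (2020) | Scoring essay answers to five different questions, some of which depend upon source information | 1800 essays for each of the five questions, each with a length of 150 to 350 words | DistilBERT | 0.75 |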

Model (results for test data) | Actual | Predicted: Impolite | Predicted: Polite |
---|---|---|---|
Regression | Impolite | 31 | 15 |
Regression | Polite | 87 | 494 |
German BERT (small) | Impolite | 29 | 17 |
German BERT (small) | Polite | 35 | 546 |
German BERT (large) | Impolite | 32 | 14 |
German BERT (large) | Polite | 24 | 557 |

Model | Accuracy | F1 Score | ROC AUC Score | Cohen’s Kappa |
---|---|---|---|---|
Logistic Regression | 0.84 | 0.91 | 0.76 | 0.30 |
German BERT (small) | 0.92 | 0.95 | 0.77 | 0.52 |
German BERT (large) | 0.94 | 0.97 | 0.82 | 0.59 |

