Automated Essay Scoring: A Siamese Bidirectional LSTM Neural Network Architecture

Abstract: Essay scoring is a critical task in education. Automated essay scoring (AES) helps reduce the manual workload and speeds up learning feedback. Recently, neural network models have been applied to AES and have demonstrated tremendous potential. However, existing work considers only the essay itself, without considering the rating criteria behind the essay, partly because the various kinds of rating criteria are very hard to represent. In this paper, we represent rating criteria by sample essays provided by domain experts and define a new input pair consisting of an essay and a sample essay. To accept this new input pair, we propose a symmetrical neural network AES model. The model, termed Siamese Bidirectional Long Short-Term Memory Architecture (SBLSTMA), can capture not only the semantic features in the essay but also the rating-criteria information behind the essays. We apply the SBLSTMA model to the task of AES and use the Automated Student Assessment Prize (ASAP) dataset for evaluation. Experimental results show that our approach outperforms previous neural network methods.


Introduction
Manual scoring involves a large workload and is sometimes subjective, varying across experts. The goal of automated essay scoring (AES) is to enable computers to score students' essays automatically, thereby reducing the subjectivity of manual ratings and the workload of teachers, and speeding up feedback in the learning process. Currently, several AES systems, such as Project Essay Grade (PEG) [1], Intelligent Essay Assessor (IEA) [2], E-rater [3], and Betsy, are applied in educational practice, but these systems are still far from satisfactory. AES is quite complicated; it depends on how well the machine can understand the language, including spelling, grammar, semantics, and other grading information. Traditional AES approaches treat the task as a machine learning problem, such as classification [4,5], regression [3,6], or ranking [7,8]. These approaches make use of various features, such as the length of the essay, Term Frequency-Inverse Document Frequency (TF-IDF), etc., to achieve AES. One drawback of this kind of feature extraction is that it is often time-consuming; moreover, the extracted features are often sparse, instantiated by discrete pattern matching, and hard to generalize.
Neural networks and distributed representations [9,10] have provided tremendous potential for natural language processing. A neural network can process an essay in its distributed representation and produce a single dense vector that represents the whole essay. Furthermore, the network is trained so that this dense vector and the score form a one-to-one correspondence. Without any handcrafted features, nonlinear neural network models have shown a particular advantage: they are much more robust than traditional statistical models across different domains. Recently, many researchers have studied AES using neural networks [11][12][13][14][15][16] and made quite good progress. This work mainly focuses on convolutional neural networks (CNN) [17][18][19], recurrent neural networks (RNN) [20] (the most widely used RNN being long short-term memory (LSTM) [21]), combinations of CNN and RNN (LSTM), attention mechanisms, and special internal feature representations, such as coherence features among sentences [18]. CNNs are well established in image processing [22,23] and can also be applied to sequence models [24]; RNNs are very advantageous for sequence modeling. Google applied the attention module to the language model directly [25,26]. However, at present, researchers have applied all these kinds of models to AES considering only the essay itself while neglecting the rating criteria behind the essay. In this paper, we consider this kind of information and give an interpretable, novel, end-to-end neural network AES approach. We represent rating criteria by introducing sample essays (hereafter, samples) of different ranks, provided by domain experts (if experts are unavailable, an average one can be computed from the dataset instead). Thereby, we obtain essay pairs as new inputs to AES, where each pair consists of an essay and a sample.
We propose a Siamese Bidirectional Long Short-Term Memory Architecture (SBLSTMA) that receives this new input to achieve AES. Because the rating information is also involved, our SBLSTMA model can capture not only the semantic information in the essays but also information beyond the dataset: the rating criteria. We explore the SBLSTMA model for the task of AES and use the Automated Student Assessment Prize (ASAP) dataset (ASAP, https://www.kaggle.com/c/asap-aes/data) for evaluation. The results show that our model empirically outperforms the previous neural network AES methods. Figure 1 shows the overall framework of the approach. Different from the previous approaches that train on or predict from the dataset directly (top of Figure 1), we add rating criteria as a part of the input (bottom of Figure 1). Experience shows that human raters give scores not only by the essays themselves but also by the rating criteria (we use samples instead); our model imitates this behavior of human raters. We believe that essays alone do not carry all the rating information; some of it lies beyond the essays. Therefore, taking this kind of information as a part of the input benefits scoring. We briefly describe how the samples are used. We denote by v the distributed representation function; then, v(e) and v(s) are the word embeddings of essay e and sample s, respectively. The difference between the essay vector v(e) and the sample vector v(s) is defined as the distance information between the two: dist = v(e) − v(s). Subsequently, as shown in Figure 2, dist and v(e) are fed into the model together. We denote the pair (v(e), v(s)) as the new input, and we can also construct a mapping to represent the label of the pair (v(e), v(s)). The input is described in detail in Section 3.1.
The prime contributions of our paper are as follows:
• For the first time, we introduce samples to represent the rating criteria, thereby increasing the available rating information, and construct a pair consisting of an essay and a sample as the new input. This can be understood as asking how similar, or how close, the essay and the sample are. To a certain extent, this resembles semantic similarity [27] and question-answer matching [14]; we introduce it to AES.

• We provide a self-feature mechanism at the LSTM output layer. We compute two kinds of similarities: the similarity between sentences within the essay, and the similarity between the essay and the sample. The experiments show that this benefits essays that are long and complicated. This idea is inspired by the SKIPFLOW [14] approach, which we extend.

• We propose a Siamese Bidirectional Long Short-Term Memory Architecture (SBLSTMA), a Siamese neural network architecture that receives the essay and the sample on its two sides. We use the ASAP dataset for evaluation. The results show that our model empirically outperforms the previous neural network AES approaches.
This paper is organized as follows: Section 2 discusses related work, Section 3 describes automated essay scoring, and Section 4 covers the experiment, results, and discussion. Finally, conclusions are drawn in Section 5.

Related Works
Research on AES began decades ago. In the field of application, the first AES system, named Project Essay Grade (PEG) [1] and built for automating educational assessment, appeared in 1967. Intelligent Essay Assessor (IEA) [2] adopts the Latent Semantic Analysis (LSA) [28] algorithm to produce semantic vectors for essays and computes the semantic similarity between the vectors. The E-rater system [3], which can extract various grammatical structure features of the essay, now plays a facilitating role in the Graduate Record Examination and the Test of English as a Foreign Language. Early research regarded AES as a semi-automated machine learning task based on various feature extractions. Larkey [4] and Rudner and Liang [5] treated AES as a kind of classification using bag-of-words features. Attali and Burstein [3] and Phandi [2] used regression approaches to achieve AES. Yannakoudakis et al. [7] took automated essay scoring as a ranking problem, ranking pairs of essays by their quality; features such as words, Part-of-Speech (POS) tags, n-grams, and complex grammatical features are extracted. Tandalla [29] used traditional machine learning approaches to extract multiple features for AES, including regular expressions over the text, and trained ensemble learning approaches such as Random Forests (RF) and Gradient Boosting Machines (GBM). Arif Mehmood et al. [30] also proposed a model performing AES using multiple text features and ensemble machine learning. Chen and He [8] described AES as a ranking problem that takes the order relation among all the essays into account; the features include syntactic features, grammar and fluency features, as well as content and prompt-specific features. Shristi Drolia et al. [31] proposed a regression-based approach for automatically scoring essays written in English, using standard Natural Language Processing (NLP) techniques to extract features from the essays. Phandi et al.
[6] made use of a correlated Bayesian Linear Ridge Regression approach to tackle domain-adaptation tasks. McNamara et al. [32] evaluated the use of a hierarchical classification approach to the automated assessment of essays; this research computes the essay scores with a hierarchical approach analogous to an incremental algorithm for hierarchical classification. Fauzi et al. [33] used an automatic essay scoring system based on n-grams and cosine similarity to extract features, also taking word order into account. Building on existing automated essay evaluation systems, Zupanc et al. [34] proposed an approach that incorporates additional semantic coherence and consistency attributes; they extracted the coherence attributes by transforming sequential parts of an essay into the semantic space and calculating the changes between them to estimate the coherence of the essay. All of the methods mentioned above are machine learning approaches that require handcrafted feature extraction; their fields of application have certain limits, and their average accuracy is not always good.
Since deep learning was introduced into natural language processing, more and more researchers have carried out related research. Cícero Nogueira dos Santos [17] proposed a deep convolutional neural network that exploits different levels of analysis, from character-level to sentence-level information, to perform sentiment analysis of short texts. Wenpeng et al. [18] investigated machine comprehension on a question answering (QA) benchmark called MCTest. They proposed a neural network framework, termed hierarchical attention-based convolutional neural network (HABCNN), to address this task without any handcrafted features; HABCNN employs an attention mechanism to weight the key phrases, key sentences, and key snippets that are relevant to answering the question. Zhang et al. [19] gave a sensitivity analysis of one-layer CNNs, exploring the effect of architecture components on model performance to distinguish important from comparatively inconsequential design decisions for sentence classification. Yang et al. [35] proposed a hierarchical attention network for document classification; the model has a hierarchical structure that mirrors the hierarchical structure of documents, with two levels of attention mechanisms applied at the word and sentence levels, enabling it to attend differentially to more and less important content when constructing the document representation. Dong and Zhang [24] employed a convolutional neural network (CNN) to learn features automatically. Kumar et al. [36] introduced a novel architecture for AES grading by combining three neural building modules: Siamese bidirectional LSTMs applied to a model answer and a student answer, a new pooling layer based on earth-mover distance across all hidden states from both LSTMs, and a flexible final regression layer to output scores.
In 2012, Kaggle launched a competition on AES called the 'Automated Student Assessment Prize' (ASAP, https://www.kaggle.com/c/asap-aes/data), sponsored by the Hewlett Foundation. Hewlett hoped that data scientists and machine learning specialists would help develop fast, effective, and affordable solutions for automated grading of student-written essays. At that time, the competitors mostly used machine learning algorithms requiring handcrafted feature extraction. Recently, many researchers have conducted a series of neural network-based AES studies using the ASAP dataset. Alikaniotis et al. [11] employed a neural model to learn features for essay scoring automatically, leveraging a score-specific word embedding (SSWE) for word representations. Alikaniotis's experiment shows that SSWE is better than other pre-trained word embeddings such as word2vec, and that an LSTM [21] structure can capture the semantic information of the essay better than a support vector machine (SVM). Taghipour et al. [12] developed an approach based on recurrent neural networks to learn the relation between an essay and its assigned score without any feature engineering. They combined convolutional and recurrent neural networks for AES and demonstrated that LSTMs and CNNs are capable of outperforming systems that extensively rely on handcrafted features; in their work, a CNN was used as an optional layer before the LSTM, especially for long essays. Dong et al. [13] argued that, when using RNNs and CNNs to model input essays, the relative advantages of RNNs and CNNs cannot be compared based on single vector representations of the essays. In addition, different parts of the essay contribute differently to the score. Therefore, they introduced an attention mechanism on top of the CNN and RNN and found that the attention mechanism helps to find the keywords and sentences that contribute to judging the quality of essays.
By building a hierarchical sentence-document model to represent essays, their model uses the attention mechanism to decide the relative weights of words and sentences automatically. The model can learn text representations with LSTMs that model the coherence among a sequence of sentences, and attention pooling is used to capture the words and sentences most relevant to the final quality of essays. Borrowing this idea from Dong, we also use an attention mechanism at the LSTM layer. Tay et al. [14] described a new neural architecture that enhances vanilla neural network models with auxiliary neural coherence features and proposed a new SKIPFLOW mechanism. SKIPFLOW alleviates two problems: the inability of current neural network architectures to model flow, coherence, and semantic relatedness over time, and the burden placed on the recurrent model. To do so, SKIPFLOW models the relationships between multiple snapshots of the LSTM's hidden state over time. As the model reads the essay, it models the semantic relationship between two points of the essay using a neural tensor layer; eventually, multiple features of semantic relatedness are aggregated across the essay and used as auxiliary features for prediction. The SKIPFLOW mechanism, based on an LSTM architecture incorporating neural coherence features, implements an end-to-end AES approach. Inspired by this, we put forward a self-feature mechanism that extends the idea from the essay alone to the essay and the sample (rating criteria). Refs. [13,14] are also taken as baselines in this paper.

Automated Essay Scoring
In this section, we define the input data, the evaluation metric, model architecture, and model training.

Description of Input
In supervised learning, we train the model on examples and their labels. In this paper, the inputs are reconstructed to contain an essay and a sample; we therefore need to construct a mapping that assigns a label to each new input, and, after training, we must be able to recover the original essay score through the inverse of this mapping. We define this formally as follows.
Let G be the score set; i ∈ G is a score, |G| = K, i ∈ [0, K]. Let E be the essay set; e_i is the i-th essay, |E| = N, i ∈ [1, N]. Let S be the sample set; s_j is a sample, where j is a score, j ∈ G, and |S| = C is the size of the sample set S. Usually, C is less than or equal to K. Let v be the word embedding function; we simply write v(x) for the word embedding of text x. We write dist_{i,j} = v(e_i) − v(s_j) for the distance information between e_i and s_j. Let f be the score function: for the essay e_i whose score is j, we write f(e_i) = j; similarly, for the sample s_j whose score is j, we write f(s_j) = j. We write p_{i,j} = (e_i, s_j) for an input; then, the set P = {p_{i,j} | e_i ∈ E, s_j ∈ S} is the input dataset. Compared with the original essay dataset E, the new dataset P is expanded by a factor of C.
We use the score function ϕ to represent the score of input p_{i,j}; that is, the score of p_{i,j} is ϕ(p_{i,j}). We define ϕ(p_{i,j}) in Equation (1), where C = |S| is the size of the sample set. Equation (1) is a monotone function that is used to initialize the input's label. In particular, when C = 1, Equation (1) degenerates into Equation (2). From Equation (1), we obtain Equation (3). From Equations (1) and (3), we know that f(e_i) is independent of f(s_j); however, if we use ϕ̂(p_{i,j}) to denote the predicted value of ϕ(p_{i,j}), then f(e_i) changes accordingly. Writing f̂(e_i) for the predicted value of f(e_i) in Equation (3), we obtain Equation (4). Equations (3) and (4) are used to evaluate the test results of the model. In particular, when C = 1, Equation (4) degenerates into Equation (5); Equations (2) and (5) are consistent in form. We thus obtain the new inputs and their scores (labels). In actual training, we can gradually increase the size of the sample set. Empirical results show that C ≤ 5 usually gives a good result; in rare cases, the circumstance C > 5 needs further discussion. For now, we just use the samples as a part of the input. Samples can be obtained in two ways. One is for experts to provide samples of different ranks. The other, also used in this paper, is to take the average of the vector representations of all essays with the same rank as the sample vector; specifically, we obtain the sample vectors according to Equation (6).
Assume that M is the number of essays with the same score j and that e_i is one of them; then, the sample vector v(s_j) is given by Equation (6) as the mean embedding

v(s_j) = (1/M) Σ_{f(e_i)=j} v(e_i), (6)

where v is the word embedding function defined earlier in this section. For each score j, we can thus easily obtain the sample vector v(s_j). The experiment shows that this way of obtaining samples is feasible.
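The mean-embedding construction of Equation (6) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code used in the paper; the names `sample_vector` and `build_samples` are our own.

```python
# Sketch of Equation (6): the sample vector v(s_j) is the component-wise
# mean of the embeddings of all essays whose score is j.
# Names (sample_vector, build_samples) are illustrative, not from the paper.

def sample_vector(essay_vectors):
    """Average a list of equal-length embedding vectors component-wise."""
    m = len(essay_vectors)
    dim = len(essay_vectors[0])
    return [sum(vec[d] for vec in essay_vectors) / m for d in range(dim)]

def build_samples(essay_vectors, scores):
    """Group essay vectors by score j and average each group into v(s_j)."""
    by_score = {}
    for vec, score in zip(essay_vectors, scores):
        by_score.setdefault(score, []).append(vec)
    return {score: sample_vector(vecs) for score, vecs in by_score.items()}
```

In practice, the per-essay vectors here would themselves be derived from the word embeddings of each essay; the grouping-and-averaging step is the part Equation (6) specifies.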

Evaluation Metric of Output
Essay score predictions are evaluated using an objective criterion. Quadratic Weighted Kappa (QWK) measures the agreement between two raters. Unlike plain Kappa, QWK applies quadratic weights via a quadratic weight matrix. The metric typically varies from 0 (only random agreement between raters) to 1 (complete agreement between raters); if there is less agreement between the raters than expected by chance, it may go below 0. The QWK is calculated between the automated scores for the essays and the resolved score from the human raters on each set of essays. QWK is the official evaluation metric of the ASAP Kaggle competition, and many follow-up researchers who study AES on the ASAP dataset adopt it as well. Our experiments also use the ASAP dataset, so, for better comparison with related research, we adopt QWK as the evaluation metric too.
The QWK is defined as follows. First, a quadratic weight matrix W is constructed as

W_{i,j} = (i − j)^2 / (N − 1)^2, (7)

where i and j are the human rating and machine rating, respectively, and N is the number of possible ratings. The matrix O is constructed over the essay ratings, such that O_{i,j} is the number of essays that received a rating i from the human and a rating j from the machine. A histogram matrix of expected ratings, E, is calculated, assuming that there is no correlation between rating scores; it is the outer product of the two raters' histogram vectors of ratings, normalized such that E and O have the same sum. From these three matrices W, E, and O, the quadratic weighted kappa is calculated by Equation (8):

κ = 1 − (Σ_{i,j} W_{i,j} O_{i,j}) / (Σ_{i,j} W_{i,j} E_{i,j}). (8)
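The QWK computation described above can be written out directly in plain Python. This is a minimal reference sketch, assuming integer ratings in a known range; evaluations in practice often use a library implementation instead.

```python
def quadratic_weighted_kappa(human, machine, min_rating, max_rating):
    """Quadratic weighted kappa between two lists of integer ratings."""
    n = max_rating - min_rating + 1
    # Observed rating matrix O: O[i][j] counts essays rated i by the
    # human and j by the machine.
    O = [[0.0] * n for _ in range(n)]
    for h, m in zip(human, machine):
        O[h - min_rating][m - min_rating] += 1
    total = float(len(human))
    # Rater histograms; the expected matrix E is their outer product,
    # normalized so that E and O have the same sum.
    hist_h = [sum(row) for row in O]
    hist_m = [sum(O[i][j] for i in range(n)) for j in range(n)]
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            w = ((i - j) ** 2) / ((n - 1) ** 2)  # quadratic weight W[i][j]
            e = hist_h[i] * hist_m[j] / total    # expected count E[i][j]
            num += w * O[i][j]
            den += w * e
    return 1.0 - num / den
```

Perfect agreement yields 1.0, and systematic maximal disagreement can drive the value below 0, matching the description above.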

Model Architecture
In this section, we introduce the overall architecture of the model. Figure 2 shows the SBLSTMA model. As shown in Figure 2, the model can be decomposed into submodels that accept different parts of the input: the essay alone, the distance information alone, or both together. Usually, the third is the best, the first is the worst, and the second lies between the two. This confirms our previous hypothesis: the more scoring information in the input, the better the scoring results. The details are discussed in Section 4.3.

Embedding Layer
Our model accepts a pair as a training instance each time. Each pair contains an essay e_i and a sample s_j, as shown in Figure 2. Each essay is represented as a fixed-length sequence: all sequences are padded to the maximum length. Subsequently, each sequence is converted into a sequence of low-dimensional vectors via the embedding layer. For convenience of description, we use the function v to represent the word embedding process. v(e_i) ∈ R^{|V|×D} and v(s_j) ∈ R^{|V|×D} are the word embedding outputs, where |V| is the size of the vocabulary and D is the dimension of the word embedding.
After word embedding, we use dist_{i,j} = v(e_i) − v(s_j) to represent the distance information between v(e_i) and v(s_j). We believe that the distance information can be trained in the model and that the new inputs make the model easier to converge, especially on datasets with smaller data volumes.
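The distance computation is a simple element-wise difference; a minimal sketch (illustrative names, not the paper's code) is:

```python
# Sketch: dist_{i,j} is the element-wise difference between the essay
# embedding and the sample embedding; the model then receives the essay
# embedding together with this distance vector.

def distance_info(essay_vec, sample_vec):
    """dist = v(e) - v(s), computed element-wise."""
    return [e - s for e, s in zip(essay_vec, sample_vec)]

def make_input(essay_vec, sample_vec):
    """Return the (v(e), dist) pair fed into the two sides of the model."""
    dist = distance_info(essay_vec, sample_vec)
    return essay_vec, dist
```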

Convolution Layer
This layer is optional and can be skipped, especially for essays of short length. We apply the convolutional operation on prompt 8, which has the longest average length and the fewest examples; the dataset is described in Section 4.1. After the dense representation of the long input sequence is calculated, it is fed into the LSTM layer of the network. For long essays, it might be beneficial for the network to extract local features from the sequence before applying the recurrent operation. This optional behavior is achieved by applying a convolution layer on the output of the embedding layer.
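The local-feature extraction performed by this optional layer can be illustrated with a one-dimensional sliding-window operation over a scalar sequence. This is a toy sketch of the idea (the actual layer operates on vector-valued embeddings with learned filters):

```python
# Toy sketch of 1-D "valid" convolution (cross-correlation form, as used
# in neural network libraries): slide a kernel over the sequence and take
# a weighted sum at each position, producing local features for the LSTM.

def conv1d(seq, kernel):
    k = len(kernel)
    return [sum(seq[t + i] * kernel[i] for i in range(k))
            for t in range(len(seq) - k + 1)]
```

The output sequence is shorter than the input by `len(kernel) - 1`, and each output position summarizes a local window, which is the benefit the text describes for long essays.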

LSTM Layer
The sequence of word embeddings obtained from the embedding layer (or convolution layer) is then passed into a long short-term memory (LSTM) network [21]:

h_t = LSTM(x_t, h_{t−1}), (9)

where x_t and h_t are the input and hidden vectors at time t, respectively. The LSTM is parameterized by output, input, and forget gates, controlling the information flow within the recursive operation. The following equations formally describe the LSTM function:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i), (10)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f), (11)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o), (12)
g_t = tanh(W_g x_t + U_g h_{t−1} + b_g), (13)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t, (14)
h_t = o_t ⊙ tanh(c_t), (15)

where σ is the sigmoid function and ⊙ denotes element-wise multiplication. At every time step t, the LSTM outputs a hidden vector h_t that reflects the semantic representation of the essay at position t. The final representation of the essay is further feature-extracted in the self-feature layer. In the experiment, we use a bidirectional LSTM [37,38] and the attention mechanism [18] in the LSTM layer.
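One LSTM time step, following the standard gate equations above, can be written out for scalar states as a sanity-check sketch (illustrative only; real layers use matrix parameters and a deep learning framework):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step with scalar input/state, for illustration.

    params maps gate name ("i", "f", "o", "g") to scalar (w, u, b)."""
    def gate(name, act):
        w, u, b = params[name]
        return act(w * x_t + u * h_prev + b)
    i = gate("i", sigmoid)      # input gate
    f = gate("f", sigmoid)      # forget gate
    o = gate("o", sigmoid)      # output gate
    g = gate("g", math.tanh)    # candidate cell state
    c = f * c_prev + i * g      # new cell state
    h = o * math.tanh(c)        # new hidden state
    return h, c
```

A bidirectional variant simply runs two such recurrences, one left-to-right and one right-to-left, and concatenates the two hidden states at each position.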

Self-Feature Layer
In this layer, we describe how to extract self-features from the vectors obtained from the bidirectional LSTM layer. We believe that the essay vector e_i and the distance information vector dist_{i,j} should have some external relationship, and that adjacent sentences in the essay should have some internal relationship, so we try to describe these relationships. Let he be the essay hidden-state sequence, with he_t denoting the vector at position t of he; let hd be the distance information hidden-state sequence, with hd_t denoting the vector at position t of hd. Let δ be the length of a sentence (we assume the lengths of different sentences are the same). Then, we compute the similarity between vector he at positions t and t + δ, which we call the inner-feature:

inner_t = he_t · he_{t+δ}. (16)

Furthermore, we compute the similarity at the same position t of vectors he and hd, which we call the cross-feature:

cross_t = he_t · hd_t. (17)

'·' in Equations (16) and (17) denotes the dot product. Then, the inner-features and cross-features are each concatenated into vectors (which we also call the inner-feature and cross-feature directly) and output to the next layer. Besides the inner-feature and cross-feature, we have two other main outputs: the essay hidden layer and the distance information hidden layer. These two layers can be processed in two ways: one is to take the vectors at the last position of he and hd directly; the other is to take the mean vector over time. We name these two vectors the he-vector and hd-vector. As Figure 2 shows, four vectors are output to the fully-connected layer.
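The two similarity computations can be sketched directly from Equations (16) and (17) (an illustrative sketch with made-up names; hidden states here are plain Python lists):

```python
# Sketch of the self-feature layer: dot-product similarities between
# hidden states delta positions apart (inner-feature, Eq. (16)) and
# between essay and distance hidden states at the same position
# (cross-feature, Eq. (17)).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def inner_features(he, delta):
    """inner_t = he_t . he_{t+delta}, for all valid t."""
    return [dot(he[t], he[t + delta]) for t in range(len(he) - delta)]

def cross_features(he, hd):
    """cross_t = he_t . hd_t, at matching positions t."""
    return [dot(he[t], hd[t]) for t in range(min(len(he), len(hd)))]
```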

Fully-Connected Layer
Subsequently, we have the four vectors obtained from the self-feature layer: the he-vector, hd-vector, inner-feature, and cross-feature. We concatenate these four vectors into one and output the concatenated vector to the Softmax layer.

Softmax Layer
This layer classifies the output of the fully-connected layer. The classification is achieved by Equation (18):

s = softmax(wX + b), (18)

where X is the input vector (the output of the fully-connected layer), w is the weight vector, and b is the bias.
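The softmax step of Equation (18) can be sketched in plain Python (illustrative names; real models use a framework layer, and the max-subtraction is a standard numerical-stability trick not spelled out in the equation):

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_layer(x, W, b):
    """Class probabilities: one logit per class, logit_k = W[k] . x + b[k]."""
    logits = [sum(wd * xd for wd, xd in zip(Wk, x)) + bk
              for Wk, bk in zip(W, b)]
    return softmax(logits)
```

The output is a probability distribution over the score classes, which is what the cross-entropy loss in the next subsection consumes.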

Training
The optimization algorithm we adopt is the Adaptive Gradient Algorithm [39], and the loss function we use is the cross-entropy loss, defined as Equation (19):

H(p, q) = − Σ_x p(x) log q(x), (19)

where Y and ỹ are the true and predicted labels of the training essays, respectively, and p and q are their probability distributions. In addition, we use the dropout mechanism to avoid overfitting. Our training method is to train for a fixed number of epochs; after each epoch, the QWK value is measured on the validation data, and the parameters with the best QWK value are saved and used for prediction on the test dataset. The specific training hyper-parameters are listed in Table 1.
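Equation (19) for a single example can be sketched as follows (illustrative sketch; `eps` is a small constant we add for numerical safety, not part of the equation):

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x).

    p is the true distribution (e.g. a one-hot label vector) and
    q is the predicted distribution from the softmax layer."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))
```

For a one-hot label, this reduces to the negative log probability the model assigns to the correct class, so the loss is near zero exactly when the model is confident and correct.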

Experiments
In this section, we describe the procedure of the experiment, including setup, baseline, results, and discussion.

Setup
The dataset we use is ASAP, a Kaggle competition dataset sponsored by the William and Flora Hewlett Foundation (Hewlett Foundation) in 2012. Many researchers have studied AES on this dataset, so choosing it lets us compare with previous experimental results. It contains eight prompts, each of a different genre, as described in Table 2.
We take Stanford's publicly available GloVe 50-dimensional embeddings [40] as pre-trained word embeddings instead of training them ourselves, because we believe that using third-party pre-trained word embeddings makes the model more general and more open. The data is tokenized with the Natural Language Toolkit (NLTK, http://www.nltk.org/) tokenizer. Words that cannot be found in the pre-trained word embeddings are replaced with an UNKNOW token. In addition, we adopt the QWK metric described in Section 3.2 to measure the output results and use 5-fold cross-validation to evaluate our model.
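The out-of-vocabulary handling can be sketched as a simple lookup with a fallback (illustrative sketch; `tokens_to_vectors` and the toy embedding table are our own names, and the real pipeline works with the GloVe table and NLTK tokens):

```python
# Sketch: map tokens to pre-trained vectors, substituting the UNKNOW
# vector for any token missing from the embedding table.

def tokens_to_vectors(tokens, embeddings, unk_token="UNKNOW"):
    return [embeddings.get(tok, embeddings[unk_token]) for tok in tokens]
```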

Baseline
To evaluate the performance of our model, we take two models with the best Kappa values at present as our baselines. One is the SKIPFLOW model [14], which demonstrates state-of-the-art performance on the benchmark ASAP dataset. The other, also evaluated on ASAP, is the attention-based recurrent convolutional neural network (LSTM-CNN-att) [13], which incorporates recent neural components such as the attention mechanism, CNN, and LSTM. Both models adopt 5-fold cross-validation, and the evaluation metric is QWK. The results of the two baseline models are listed in Table 3.

Results and Discussion
The results are listed in Table 3. Our model SBLSTMA outperforms both baseline models (LSTM-CNN-att and SKIPFLOW) by approximately 5% on average QWK (quadratic weighted Kappa). The results are statistically significant with p < 0.05 by a one-tailed t-test.
From Table 3, we know that the empirical results are significantly improved. We believe this is because the knowledge of the rating criteria, i.e., the distance information, plays a very significant role. To explain this, we further decompose the SBLSTMA model into two other submodels, as described in Table 4; the sample sets used are listed in Table 5.
The distance information is based on the sample set described in Section 3.1 and is directly related to the quality of the experimental results; we need to find samples that reflect the rating criteria as accurately as possible. The maximum size of the sample set depends on the range of the essay scores, but we cannot select essays of every different score as samples, especially for essays with a large score range; if we did, training would be very time-consuming, and the results would not necessarily be good. Empirical results show that, for a dataset with a narrow score range, we can usually take samples of all the different scores as the sample set, as for prompts 3, 4, 5, and 6; for a dataset with a large score range, we select only some of the samples as the sample set, as for prompts 1, 2, 7, and 8. For a dataset with a large score range, we build the sample set according to the following steps:
1. According to Equation (6), compute all the samples s_j of each prompt.
2. For each s_j in a prompt, run a pre-training under Mb + Mc and obtain a ranking, ordered by the Kappa value of the training results.
3. Take the first sample in the ranking from step 2 as the initial sample set. If the training result is below the threshold (a result expectation initialized beforehand), add the second sample in the ranking to the sample set, and so on, until the result exceeds the threshold or all the samples have been added to the sample set.
Take prompt 4, for example: the scores are 0, 1, 2, 3, and the corresponding samples are s_0, s_1, s_2, s_3. By pre-training, we obtain the ranking [s_2, s_1, s_3, s_0], which means that the training result of s_2 is the best, s_1 is the second best, and so on. Then, we first take {s_2} as the initial sample set, {s_2, s_1} as the second, and so on. Table 5 shows the samples used in the experiment.
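The greedy growth procedure above can be sketched as a short loop. This is an illustrative sketch, not the authors' code: `evaluate` stands in for the pre-training-and-QWK measurement, which the caller supplies.

```python
# Sketch of the greedy sample-set selection (steps 1-3 above): starting
# from the best-ranked sample, keep adding the next-ranked sample until
# the evaluation (e.g. validation QWK) reaches the threshold or the
# ranked samples run out. `evaluate` is a hypothetical caller-supplied
# function scoring a candidate sample set.

def select_sample_set(ranked_samples, evaluate, threshold):
    chosen = []
    for s in ranked_samples:
        chosen.append(s)
        if evaluate(chosen) >= threshold:
            break
    return chosen
```

With the prompt 4 ranking [s_2, s_1, s_3, s_0], this reproduces the growth sequence {s_2}, {s_2, s_1}, ... described in the text.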
The results of each decomposed submodel listed in Table 4 show that the Kappa value under model Mb + Mc is better than under Ma + Mc. This means that the distance information is useful as a training input: such an input, based on rating criteria, contains more rating information, and it does reflect a certain distance between the essay and the sample. For a more intuitive explanation, we provide the Kappa value diagrams of the first 100 epochs of all eight prompts under Ma + Mc and Mb + Mc in Figure 3. Table 6 also shows that the mean value and standard deviation of prompt 8 are relatively worse for the first 100 epochs. We attribute this to prompt 8 having the fewest essays, the longest essay length, and the largest score range. For the other prompts, we can increase the size of the sample set to improve the training effect, but for prompt 8 we cannot: when the size of its sample set is increased, the training process is unstable and hard to converge. Therefore, in the experiment, the sample set of prompt 8 is the smallest. Furthermore, from Table 4, we know that the results under Ma + Mb + Mc are the best. The average Kappa value of Ma + Mb + Mc is 0.44 greater than that of Mb + Mc. In particular, prompt 2 and prompt 3, which have the worst Kappa values in the baseline models, are obviously improved in our model. We believe the input under this model contains more information: the essay, the distance information, and the self-feature mechanism, which are good for rating. The value of the parameter δ, which denotes the sentence length defined in Section 3.3.4, was set to 10. To explain this clearly, we take prompt 2 and prompt 3 as examples. We give these two prompts'

Conclusions
In this paper, we represent the rating criteria behind the essays by a set of samples and take them as part of the input. Meanwhile, a self-feature mechanism at the LSTM output layer is provided as well. We then propose a novel model, the Siamese Bidirectional Long Short-Term Memory Architecture (SBLSTMA), to learn text semantics and grade essays automatically. Our approach outperforms the baselines by approximately 5%. By decomposing the model, we find that the model with distance information as input is much better than the one without, which means that representing rating criteria by samples is feasible. We also hypothesize that distance information derived from the difference between the examples and the mean example benefits other supervised learning methods. We will try this approach in other fields in the future to check whether the hypothesis holds. In addition, we will consider applying data augmentation to enhance essay datasets with relatively few examples.