Item Difficulty Prediction Using Item Text Features: Comparison of Predictive Performance across Machine-Learning Algorithms

Abstract: This work presents a comparative analysis of various machine learning (ML) methods for predicting item difficulty in English reading comprehension tests using text features extracted from item wordings. A wide range of ML algorithms are employed within both the supervised regression and the classification tasks, including regularization methods, support vector machines, trees, random forests, back-propagation neural networks, and Naïve Bayes; moreover, the ML algorithms are compared to the performance of domain experts. Using f-fold cross-validation and considering the root mean square error (RMSE) as the performance metric, elastic net outperformed other approaches in continuous item difficulty prediction. Among classifiers, random forests returned the highest extended predictive accuracy. We demonstrate that the ML algorithms implementing item text features can compete with predictions made by domain experts, and we suggest that they should be used to inform and improve these predictions, especially when item pre-testing is limited or unavailable. Future research is needed to study the performance of the ML algorithms using item text features on different item types and respondent populations.


Introduction
In educational assessment, the analysis of test items is crucial for designing reliable, valid and fair tests. Item difficulty, the most important item characteristic, is commonly estimated using classical test theory (CTT) and item response theory (IRT) models based on test-taker responses [1]; however, item pre-testing is not always possible, or it may be limited, e.g., due to security or legal reasons. In such situations, automated estimation of item difficulty based on item wording can inform test construction.
Various properties of the text wording of a given test item determine how difficult the item is for a test-taker. The item text features, such as length, word frequencies related to established corpora, characteristics of linguistic similarities, and readability indices, can be used to predict item difficulty using machine learning (ML) algorithms. ML and natural language processing (NLP) are already used in different areas of education for automated essay or item scoring [2][3][4], automated item generation [5][6][7][8][9], data-driven intelligent tutoring systems [10], online proctoring and cheating detection [11][12][13], and in other situations [14][15][16][17]. In addition to commonly used methods such as linear regression or decision trees [18], regularization approaches and neural networks are sometimes used to estimate the item difficulty from item wording based on item features [19]. A wide range of ML algorithms has been used in this context in the past [18,20]. However, their predictive performance is usually not compared; moreover, ML algorithms are rarely compared to the performance of domain experts, which is crucial to determine to what extent the ML algorithms are capable of improving the predictive accuracy of human raters. This comparison is the focus of the present study.
To address this gap, we introduce a framework for predicting item difficulty using textual features from item wording. We assess the predictive accuracy of multiple ML methods, and we compare them with the predictions made by domain experts. The tools of choice for the prediction we apply on the item features are supervised ML regression methods, namely regularization techniques (the least absolute shrinkage and selection operator, ridge regression and elastic net regression), support vector machines, regression trees, random forests, and artificial neural networks with back-propagation [9,21]. We predict the item difficulty as a continuous dependent variable, as it would be returned from student response data. Furthermore, using the classification variants of the same algorithms, we predict the membership of each item in one of the predefined difficulty intervals. We assume that classification into one of a few item difficulty intervals could be easier and more accurate for the algorithms than predicting a precise difficulty point value. We hypothesize that ML algorithms are able to compete with human domain experts in predicting (or classifying) item difficulty and that they may further inform and improve the experts' predictions.
The paper proceeds as follows. We start by describing the data preparation needed for the implementation of ML algorithms on cognitive test items, including the text preprocessing and extraction of item features. We then describe the ML algorithms used in this study in Section 2, Materials and Methods. We briefly describe applied software, model architecture, algorithms' pre-setting, and tuning parameter values in Section 3, Implementation. Next, in Section 4, Results, we compare the accuracy of item difficulty predictions returned by different ML algorithms with those made by domain experts. Finally, we discuss the key findings in Section 5, Discussion, and draw conclusions in Section 6, Conclusions.

Materials and Methods
This section describes the dataset we used for item difficulty prediction and the ML algorithms we applied.

Dataset and Item Text Processing
For this study, we use item wordings from the English as a foreign language test administered over eight years (2016-2023) as a part of the Czech matura exam. We use items from reading comprehension sections containing multiple-choice items with a single-paragraph passage and four response options, denoted as Section 5. We also utilize a dataset of test-takers' answers for the calculation of difficulty for each item, as described in more detail in the next section. Finally, item difficulty evaluation by domain experts comes from another (internal) dataset.
Item text wordings are extracted from portable document format files (with the suffix .pdf) using optical character recognition (OCR). Then, we apply scraping methods employing empirical approaches such as regular-expression masking, by which we obtain an unstructured text for each item's wording, split into item passage, item question, key option (the correct answer), and distractors (incorrect answers). Next, the text is tokenized, i.e., sentences are split into atomic parts (tokens), in this case, words. In the next step, stopwords and special characters are removed, and the tokens are lemmatized, i.e., they are transformed into their corresponding lemmas [22], as schematically indicated in Figure 1. Finally, item text features are derived [23].
We consider four types of item text features. Firstly, the word count features are easily calculated as lengths of vectors of item text tokens. Secondly, using The Corpus of Contemporary American English (COCA) [24,25], the word frequencies are assessed relative to the usual frequencies of the given words in ordinary language. Thirdly, the lexical similarity is calculated using Euclidean and cosine metrics to describe how textually similar (or close) the vectors of tokens of different parts of the item wording are, e.g., how similar the item question and its key option, i.e., the correct answer, are, considering that their high lexical similarity may tend to make the item easier. Additionally, the lexical similarity between the key option and the distractors, i.e., incorrect answers, is calculated, considering that large dissimilarity can make the item easier. Lastly, we compute the readability indices depicting how easy to read and easy to understand the wording of the text is. In general, the readability indices usually follow formulae of the form
$$\text{readability index} = f\left(\nu^{T}_{\text{word counts}},\ \nu^{T}_{\text{word frequencies}},\ \nu^{T}_{\text{word counts} \,\otimes\, \text{word frequencies}}\right),$$
where $f$ is a function in an explicit form using a vector of absolute and relative counts of words and parts of speech of a given text, $\nu^{T}_{\text{word counts}}$, a vector of frequencies of common or unique words compared to everyday language, $\nu^{T}_{\text{word frequencies}}$, and various combinations of the previous two properties, $\nu^{T}_{\text{word counts} \,\otimes\, \text{word frequencies}}$, as suggested by [26]. A more detailed explanation of individual item features derived using the above-described approaches is in Appendix A.
Eventually, using the above techniques, we derive more than 60 text features per item and list them into a structured dataset of size n × k so that each column represents one feature for each of n items, and each row contains a vector of all k features for a given item.
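As an illustration of the lexical similarity features described above, the following minimal Python sketch computes the cosine similarity and the Euclidean distance between word-count vectors of two already tokenized and lemmatized item parts. The helper names (`count_vectors`, `cosine_similarity`, `euclidean_distance`) and the toy tokens are hypothetical and are not taken from the study, whose analysis was carried out in R.

```python
import math
from collections import Counter


def count_vectors(tokens_a, tokens_b):
    """Build aligned word-count vectors over the joint vocabulary of two token lists."""
    vocabulary = sorted(set(tokens_a) | set(tokens_b))
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    return ([counts_a[w] for w in vocabulary],
            [counts_b[w] for w in vocabulary])


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Euclidean distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy lemmatized tokens of an item question and its key option (illustrative only).
question = ["author", "suggest", "city"]
key_option = ["city", "grow", "fast"]
u, v = count_vectors(question, key_option)
print(cosine_similarity(u, v), euclidean_distance(u, v))
```

A higher cosine similarity (or a lower Euclidean distance) between the question and the key option would, following the reasoning above, point towards an easier item.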

Item Difficulty Based on Student Responses
Having data from more than 50 thousand test-takers answering the items each year, we enrich the dataset of item text features constructed in the previous step with the item difficulty estimated from student responses using the Rasch model [1] (p. 158), [27,28]. The Rasch model is relatively simple but can estimate item difficulty for each item; more complex models can describe other item parameters, such as item discrimination or item guessing, that are not of interest in this study. The Rasch model assumes that a test-taker $p$ with ability $\theta_p$ answers item $i$ correctly with a probability
$$P(\text{test-taker } p \text{ answers item } i \text{ correctly} \mid \theta_p, y_i) = \frac{e^{\theta_p - y_i}}{1 + e^{\theta_p - y_i}}, \tag{1}$$
where $y_i$ is the difficulty $Y$ of item $i$, which is of main interest in this study (thus the notation). We use the conditional maximum likelihood method [1] (p. 165) to estimate difficulties for each item $i \in \{1, 2, \ldots, n\}$ based on the Rasch model (1). The conditional likelihood method accounts for the overall ability of the tested sample, which may differ each year; the estimate of item difficulty is proportional to the proportion of incorrect answers to the item, adjusted by the proportion of the total number of correct answers. As an output, we obtain a vector $(y_1, y_2, \ldots, y_n)^T$ of $n$ values of item difficulty $Y$, one for each item. Note that estimates of item difficulty based on student responses are close to the true item difficulties when a representative and sufficiently large sample of test-takers is available. This was the case in our study; however, such a respondent sample may not be available in all situations.
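The Rasch model referred to as Formula (1) has the standard logistic form, in which the probability of a correct answer depends only on the difference between the test-taker's ability and the item difficulty. A minimal Python sketch of this probability follows; the function name `rasch_probability` is illustrative, and the sketch does not cover the conditional maximum likelihood estimation itself.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct answer under the Rasch model.

    theta      -- ability of the test-taker
    difficulty -- difficulty y_i of the item (same logit scale as theta)
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


# A test-taker whose ability equals the item difficulty succeeds with probability 0.5.
print(rasch_probability(0.0, 0.0))
# The same test-taker is less likely to solve a more difficult item.
print(rasch_probability(0.0, 1.0) < rasch_probability(0.0, -1.0))
```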

Machine Learning Algorithms
In this study, we compare the performance of several ML methods for predicting and classifying item difficulty. Let us define the regression and classification tasks more formally before describing the supervised regression and classification algorithms.
Assume we initially have $k \in \mathbb{N}$ item text features $X_1, X_2, \ldots, X_k$ and a $(k+1)$-th variable $Y$, a dependent one, i.e., item difficulty, derived as indicated in the previous section. As an output of the Rasch model from Formula (1), the item difficulty $Y$ is estimated as a continuous variable. The regression task algorithms predict a value $y_i$ of the item difficulty $Y$ for item $i$ using values $x_{i,1}, x_{i,2}, \ldots, x_{i,k}$ of all item text features $X_1, X_2, \ldots, X_k$ as predictors.
However, for test construction, predicting an exact point value from an item difficulty continuum is unnecessary; test developers often rely on the item difficulty category, so classifying item difficulty into only a few, e.g., five, categories is sufficient. Thus, we also implement the classification task. As the first step, the item difficulty $Y$ is categorized, obtaining $Y_c$, so that it is split into $m \in \mathbb{N}$ disjoint intervals $\{c_1, c_2, \ldots, c_m\}$ of the same size using appropriate quantiles. Thus, the union $\bigcup_{\ell=1}^{m} c_\ell$ is the range of the item difficulty variable $Y$. Then, within the classification task, each item feature $X_j$, where $j \in \{1, 2, \ldots, k\}$, is treated as an independent variable for a classification model, which predicts the most likely interval $c^* \in \{c_1, c_2, \ldots, c_m\}$ of the categorized item difficulty $Y_c$.
A flowchart of the regression task is in Figure 2; similarly, a classification task scheme is in Figure 3. Regardless of the regression or classification task, the predicted item difficulty values are compared to the 'true' ones as estimated using the Rasch model. To increase reproducibility as much as possible, algorithms are trained on training subsets, while point estimates of the predictive metrics are obtained on testing subsets. This is repeated several times to obtain a more robust estimate of the predictive performance, averaging all point estimates collected from individual iterations. Domain experts estimate the item difficulty $Y$ using their empirical knowledge in the field, and their item difficulty estimates might also be categorized to create $Y_c$. Thus, domain experts can be treated as "another" regression and "another" classification algorithm, and their performance can be compared to the predictive and classification performance of ML algorithms. Many ML algorithms have both a regression and a classification version [29], as we describe in more detail in the next section.
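The quantile-based categorization of the continuous difficulty into $m$ equally sized classes can be sketched as follows. This is a rank-based Python illustration under the assumption that ties are broken by the original item order; the function name `quantile_classes` and the toy difficulties are hypothetical.

```python
def quantile_classes(values, m):
    """Assign each value to one of m classes (0 = lowest) by its rank.

    When len(values) is divisible by m, every class receives the same
    number of items, which mimics splitting by quantiles.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # item indices, easiest first
    classes = [0] * n
    for rank, i in enumerate(order):
        classes[i] = rank * m // n
    return classes


# Ten toy Rasch difficulties split into m = 5 classes of two items each.
difficulties = [-2.1, 0.4, -0.6, 1.2, -0.9, 0.1, 0.7, -1.5, -0.3, 1.6]
print(quantile_classes(difficulties, 5))
```

Class 0 would correspond to the easiest interval $c_1$ and class $m - 1$ to the most difficult interval $c_m$.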

Regularization
Although regularization techniques could serve as regression algorithms, they also offer an option to select a subset of item features used for model building. Therefore, regularization methods enable feature selection, which helps reduce the problem's dimensionality with minimal loss of information.
LASSO (Least Absolute Shrinkage and Selection Operator) regression estimates a value $y_i$ of item $i$'s difficulty $Y$ using least squares and L1 regularization-based coefficients $\beta_0, \beta_1, \ldots, \beta_k$ minimizing the following term,
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{i,j}\right)^2 + \lambda_{\text{LASSO}}\sum_{j=1}^{k}|\beta_j|, \tag{2}$$
where $x_{i,j}$ is a value of the $j$-th feature of the $i$-th item with $j \in \{1, 2, \ldots, k\}$ and $\lambda_{\text{LASSO}} > 0$ is a penalization term [30]. Similarly, ridge regression uses L2 penalization and penalization term $\lambda_{\text{ridge}} > 0$ to minimize
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{i,j}\right)^2 + \lambda_{\text{ridge}}\sum_{j=0}^{k}\beta_j^2, \tag{3}$$
while item difficulty $Y$'s value $y_i$ is estimated for item $i$ using its item text features $x_{i,j}$ with $j \in \{1, 2, \ldots, k\}$ [31]. Finally, elastic net regression combines both L1 and L2 penalization and minimizes the following function,
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{i,j}\right)^2 + \lambda_{\text{LASSO}}\sum_{j=1}^{k}|\beta_j| + \lambda_{\text{ridge}}\sum_{j=0}^{k}\beta_j^2, \tag{4}$$
to estimate item $i$'s difficulty $Y$ using its text features $x_{i,j}$, where $j \in \{1, 2, \ldots, k\}$. Assuming both penalizations, i.e., the L1 and L2 terms in Formula (4), are convex [32], elastic net usually reaches values of the function in Formula (4) as low as LASSO or ridge regression individually does, and, thus, performs at least as well as the previous two regularization algorithms [33].
Since Formulae (2)-(4) are minimized while coefficients $\beta_0, \beta_1, \ldots, \beta_k$ are estimated, the terms $\lambda_{\text{LASSO}} \sum_{j=1}^{k} |\beta_j|$ and $\lambda_{\text{ridge}} \sum_{j=0}^{k} \beta_j^2$ are also minimized. Thus, if the penalization parameters are set to zero ($\lambda_{\text{LASSO}} = 0$ in Formula (2), $\lambda_{\text{ridge}} = 0$ in Formula (3), or both in Formula (4)), the penalization terms are removed, and the functions in the formulae become the ordinary least squares criterion usual for multiple linear regression. Otherwise, whenever $\lambda_{\text{LASSO}} > 0$ and $\lambda_{\text{ridge}} > 0$, a coefficient $\beta_j$ that is close to zero tends to be shrunk towards zero, and, consequently, the $j$-th item feature is removed from the model once $\hat{\beta}_j = 0$. Thus, regularization techniques could also work as feature selectors. LASSO is considered a better feature selector than ridge regression [34]. Intuitively, assume the $j$-th item feature $X_j$ is likely to be removed from the model, so that $0 < |\beta_j| < 1$ and, therefore, $\beta_j^2 < |\beta_j|$ (†). Whenever the penalization term is large enough so that removing the $j$-th item feature $X_j$ from the model would reduce the penalization term significantly, the $j$-th item feature is removed. Thus, for constant values of $\lambda_{\text{LASSO}} = \lambda_{\text{ridge}}$, due to (†), the term $\lambda_{\text{ridge}} \cdot \beta_j^2$ in the ridge regression is not as large as the term $\lambda_{\text{LASSO}} \cdot |\beta_j|$ in the LASSO, and, consequently, it is less likely that the $j$-th item feature $X_j$ is removed from the ridge regression model than from the LASSO model, keeping the penalization levels the same for the two models.
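The argument that a small coefficient contributes less to the ridge penalty than to the LASSO penalty can be checked numerically. The Python sketch below compares the contribution of a single coefficient to both penalties for the same penalization level; the function names and the chosen coefficient values are illustrative assumptions.

```python
def lasso_term(beta_j, lam):
    """Contribution of one coefficient to the L1 (LASSO) penalty."""
    return lam * abs(beta_j)


def ridge_term(beta_j, lam):
    """Contribution of one coefficient to the L2 (ridge) penalty."""
    return lam * beta_j ** 2


lam = 1.0  # the same penalization level for both methods
for beta_j in (0.05, 0.5, 2.0):
    print(beta_j, lasso_term(beta_j, lam), ridge_term(beta_j, lam))
```

For coefficients with absolute value below one, the ridge contribution is the smaller of the two, so dropping such a feature reduces the ridge penalty less than the LASSO penalty; for coefficients above one, the ordering reverses.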

Naïve Bayes Classifier
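The Naïve Bayes classifier classifies the $i$-th item into the most likely class $c^*$ of item difficulty $Y$. The Bayes theorem gives the relationship between the conditional probabilities $P(Y_i = c_\ell \mid \forall x_{i,j})$ and $P(\forall x_{i,j} \mid Y_i = c_\ell)$,
$$P(Y_i = c_\ell \mid \forall x_{i,j}) = \frac{P(Y_i = c_\ell)\, P(\forall x_{i,j} \mid Y_i = c_\ell)}{P(\forall x_{i,j})}, \tag{5}$$
where the term $\forall x_{i,j}$ stands for the joint proposition $X_{i,1} = x_{i,1} \wedge X_{i,2} = x_{i,2} \wedge \cdots \wedge X_{i,k} = x_{i,k}$. The non-conditional probabilities $P(Y_i = c_\ell)$ and $P(\forall x_{i,j})$ are constant for a given dataset [35] and can be easily estimated from relative frequencies, e.g.,
$$\hat{P}(Y = c_\ell) = \frac{1}{n}\sum_{i=1}^{n} I(y_i = c_\ell), \tag{6}$$
where $I(A)$ is an identifier function which is equal to 1 if and only if proposition $A$ is true, otherwise it is equal to 0, i.e.,
$$I(A) = \begin{cases} 1, & \text{if } A \text{ is true}, \\ 0, & \text{otherwise}. \end{cases} \tag{7}$$
Thus, the proportion $\frac{P(Y_i = c_\ell)}{P(\forall x_{i,j})}$ is constant and Formula (5) can be rewritten as
$$P(Y_i = c_\ell \mid \forall x_{i,j}) \propto P(\forall x_{i,j} \mid Y_i = c_\ell),$$
and as far as we assume that the item features are mutually independent given the class (the "naïve" assumption), we may also write
$$P(Y_i = c_\ell \mid \forall x_{i,j}) \propto \prod_{j=1}^{k} P(X_{i,j} = x_{i,j} \mid Y_i = c_\ell).$$
With Naïve Bayes, item $i$ is classified into interval $c^*$ so that
$$c^* = \underset{\ell \in \{1, 2, \ldots, m\}}{\arg\max} \prod_{j=1}^{k} P(X_{i,j} = x_{i,j} \mid Y_i = c_\ell).$$
For categorical item features $X_j$, the probability $P(X_{i,j} = x_{i,j} \mid Y_i = c_\ell)$ is estimated similarly to Formula (6); for continuous variables $X_j$, it is estimated using the cumulative normal distribution function, i.e., $P(X_{i,j} = x_{i,j} \mid Y_i = c_\ell) \approx P(x_{i,j} - \varepsilon < X_{i,j} < x_{i,j} + \varepsilon \mid Y_i = c_\ell)$ for small positive $\varepsilon > 0$.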
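A minimal Python sketch of a Naïve Bayes classifier for continuous features follows. It assumes equally sized classes (so class priors can be dropped), conditionally independent features, and class-wise normal distributions whose cumulative distribution function is evaluated on a small interval around the observed value. The function names, the `eps` default, and the toy data are hypothetical and do not reproduce the `naiveBayes()` implementation used in the study.

```python
import math
from collections import defaultdict
from statistics import NormalDist, mean, stdev


def fit_gaussian_nb(X, y):
    """Estimate one normal distribution per class and feature."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        columns = list(zip(*rows))  # transpose rows into feature columns
        model[label] = [NormalDist(mean(col), stdev(col)) for col in columns]
    return model


def predict_gaussian_nb(model, row, eps=1e-3):
    """Return the class maximizing the product of per-feature probabilities.

    Each probability is approximated by the normal CDF mass on
    (x - eps, x + eps); logarithms are summed to avoid numerical underflow.
    """
    best_label, best_score = None, -math.inf
    for label, distributions in model.items():
        score = 0.0
        for dist, x in zip(distributions, row):
            p = dist.cdf(x + eps) - dist.cdf(x - eps)
            score += math.log(max(p, 1e-300))  # guard against log(0)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: two features per item, two difficulty classes.
X = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.0], [3.0, 20.0], [3.2, 21.0], [2.8, 19.0]]
y = ["easy", "easy", "easy", "hard", "hard", "hard"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 10.5]))
```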

Support Vector Machines
Assuming the space of all item features $X_1 \times X_2 \times \cdots \times X_k$, support vector machines use a hyperplane to split the space into two disjoint subspaces (of different classes). The splitting maximizes the margins, i.e., the distance between the two closest points, so that the first comes from one subspace (of the first class) while the latter comes from the second subspace (of the latter class). The hyperplane is orthogonal to the line connecting the two closest points, assuming each subspace ideally contains observations of only one class; see Figure 4 for details. Assuming $m$ classes, since one model of support vector machines can classify into only two classes, $\binom{m}{2}$ models in total are built [36]. Each model of support vector machines searches for a splitting hyperplane that follows the form
$$w^T x - b = 0,$$
where $w$ is a vector orthogonal to the splitting hyperplane, and $b$ is the maximally tolerated margin width. Additionally, the two closest points from both subspaces are elements of mutually parallel hyperplanes (also parallel to the splitting hyperplane), i.e., $w^T x_i - b > 0$ and $w^T x_i - b < 0$, respectively. Finally, the distance between the two closest points of different classes is $\frac{2b}{\|w\|}$, i.e., the width of both margins, and it should be maximized with respect to the existence of two distinguishable hyperplanes for the two closest points of different classes, so that
$$\hat{w} = \underset{w}{\arg\max}\ \frac{2b}{\|w\|},$$
where $b$, the tolerated margin width, i.e., a user's tuning parameter, is usually chosen as $b \geq 1$.
A kernel trick with various kernel functions is applied when the points that belong to different classes are not linearly separable. In principle, the universe of item features, $X_1 \times X_2 \times \cdots \times X_k$, is extended by additional dimensions that increase the universe dimensionality [37], and, eventually, it becomes linearly separable, as indicated in Figure 5.
Figure 4. The margin between the hyperplane of the support vector machines (solid line) and the closest points of both subspaces (dashed lines) is maximized by the algorithm.
The classification of item $i$ into difficulty $Y$'s class $c^*$ is then performed using a voting scheme, i.e., the class $c^*$ is the one that the majority of all $\binom{m}{2}$ models votes for, i.e.,
$$c^* = \underset{\ell \in \{1, 2, \ldots, m\}}{\arg\max} \sum_{s=1}^{\binom{m}{2}} I(\text{the } s\text{-th model votes for class } c_\ell),$$
using the same mathematical notation and identifier function as defined in Formula (7). When regression is applied, trivial (usually constant) models are built for each subspace of the space divided by the splitting hyperplane. Therefore, averages of all coordinates of all observations belonging to a given subspace are calculated, representing the regression model of the given subspace.
Figure 5. A visualization of the kernel trick's principle.
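The one-against-one voting scheme over all pairs of classes can be sketched in Python as follows. The pairwise decision rule `toy_pairwise_winner` is a purely hypothetical stand-in for a trained binary support vector machine; only the vote counting reflects the scheme described above.

```python
from collections import Counter
from itertools import combinations

CLASSES = ["very easy", "easy", "moderate", "difficult", "very difficult"]


def one_vs_one_vote(classes, pairwise_winner):
    """Let one binary model per pair of classes vote and return the majority class."""
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    winner = votes.most_common(1)[0][0]
    return winner, votes


def toy_pairwise_winner(a, b):
    """Hypothetical binary decision: prefer the class closer to 'moderate'."""
    target = CLASSES.index("moderate")
    distance_a = abs(CLASSES.index(a) - target)
    distance_b = abs(CLASSES.index(b) - target)
    return a if distance_a <= distance_b else b


winner, votes = one_vs_one_vote(CLASSES, toy_pairwise_winner)
print(winner, sum(votes.values()))  # with m = 5 classes, 10 pairwise models vote
```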

Regression and Classification Trees and Random Forests
Classification trees, also called decision trees, partition the dataset into subdatasets that ideally contain observations of only one class of item difficulty $Y$. The partitioning is performed successively from the original dataset by binary splitting; a given criterion is minimized within each dataset splitting. In other words, the universe of item features is split into subspaces, each including, if not all, then the vast majority of points from one class of item difficulty $Y$. Each step of the dataset partitioning, i.e., splitting a parent dataset into two new child subdatasets, enables the growth of a typical tree plot, a dendrogram, by adding two new child branches; see Figure 6. The partitioning is applied multiple times until the dataset is split according to the distribution of item difficulty $Y$'s classes [38]. Assuming $\rho_{\eta,\ell}$ is a proportion of observations of class $c_\ell$ in the part of the dataset that is defined by all node rules from the root to node $\eta$, then $\rho_{\eta,\ell}$ should be maximized as much as possible using an impurity criterion $Q(\eta)$. The most often used impurity measures are the misclassification error,
$$Q(\eta) = 1 - \max_{\ell \in \{1, 2, \ldots, m\}} \rho_{\eta,\ell}, \tag{8}$$
the Gini index,
$$Q(\eta) = \sum_{\ell=1}^{m} \rho_{\eta,\ell}\,(1 - \rho_{\eta,\ell}), \tag{9}$$
and the deviance, also called cross-entropy,
$$Q(\eta) = -\sum_{\ell=1}^{m} \rho_{\eta,\ell} \log \rho_{\eta,\ell}. \tag{10}$$
The impurity measure $Q(\eta)$ is minimized in each dataset partitioning since the lower the impurity measure is, the larger the proportion $\rho_{\eta,\ell}$ of the dominant class is. Trees tend to overfit the distribution of classes in the dataset; that is, the tree growth is stopped no sooner than all leaf nodes have the impurity criterion as low as possible. To avoid overfitting, various stopping criteria or pruning are applied [39].
Once the tree is grown, it enables us to classify item $i$ into difficulty $Y$'s class $c^*$, so that
$$c^* = \underset{\ell \in \{1, 2, \ldots, m\}}{\arg\max}\ \rho_{\eta,\ell} = \underset{\ell \in \{1, 2, \ldots, m\}}{\arg\max}\ \frac{1}{n_\eta} \sum_{i' \in \eta} I(y_{c,i'} = c_\ell),$$
where $\eta$ is the leaf node determined for item $i$ by all node rules from the root to the node and $n_\eta$ is the number of training items in that leaf node, using the introduced notation and identifier function from Formula (7). Trivial (constant) models constructed for each subspace transform the classification trees into regression trees [40].
Multiple trees create a structure called a random forest. Individual trees of a given random forest are mutually independent and different. This is ensured by applying only a subset of all item features, pre-selected using a bootstrap, for each new tree grown in a random forest. Finally, the classification or regression output of the random forest is determined by the voting scheme of individual trees [41], similarly as for the support vector machines: item $i$ is classified into such a difficulty $Y$'s class $c^*$ for which the majority of all trees in the random forest votes, i.e.,
$$c^* = \underset{\ell \in \{1, 2, \ldots, m\}}{\arg\max} \sum_{t=1}^{\#\text{ of trees}} I(\text{the } t\text{-th tree votes for class } c_\ell).$$
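The three impurity measures named above can be written down directly from the class proportions in a node. The Python sketch below uses their standard textbook forms (an assumption as far as the exact notation of Formulae (8)-(10) is concerned); the function names are illustrative.

```python
import math


def misclassification_error(proportions):
    """One minus the proportion of the most frequent class in the node."""
    return 1.0 - max(proportions)


def gini_index(proportions):
    """Sum of rho * (1 - rho) over all classes in the node."""
    return sum(rho * (1.0 - rho) for rho in proportions)


def cross_entropy(proportions):
    """Deviance: minus the sum of rho * log(rho); empty classes contribute zero."""
    return -sum(rho * math.log(rho) for rho in proportions if rho > 0)


pure_node = [1.0, 0.0, 0.0]    # all items in the node share one class
mixed_node = [0.5, 0.5, 0.0]   # two classes equally represented
for node in (pure_node, mixed_node):
    print(misclassification_error(node), gini_index(node), cross_entropy(node))
```

All three measures are zero for a pure node and grow as the class proportions in the node become more balanced, which is why a split is chosen to minimize them.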

Neural Networks
Neural networks are universal algorithms suitable for regression and classification tasks. The architecture of a neural network consists of a layer of input and output neuron(s) and several hidden layers so that each hidden layer consists of multiple neurons [42].
An example of the neuron is in Figure 7. On the input of the neuron, there is a vector of signals from neurons of a previous layer, i.e., $z_{l-1} = (z_{l-1,1}, z_{l-1,2}, \ldots)^T$, multiplied by a vector of weights $w_{l-1} = (w_{l-1,1}, w_{l-1,2}, \ldots)^T$. If $l = 1$, the neurons of the first layer accept weighted signals from a vector of item $i$'s features, $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,k})^T$. Weighted signals from the $(l-1)$-th layer are summed up together with the bias term $b_l$ within the $\Sigma$ function, i.e.,
$$\Sigma = b_l + w_{l-1}^T z_{l-1},$$
and passed to the $\sigma$ function, which is an activating function, usually of the sigmoid form, so that the signal $y_{l,1}$ on the output from the neuron of the $l$-th layer is
$$y_{l,1} = \sigma(\Sigma) = \frac{1}{1 + e^{-\Sigma}},$$
which is finally passed to the next, $(l+1)$-th layer. Vectors of weights, $w_l = (w_{l,1}, w_{l,2}, \ldots)^T$, are adjusted within each iteration of so-called backpropagation, when the weights are increased or decreased by small gradient steps to minimize the loss function, often implemented using the L1 or L2 norm [43]. In the regression framework, besides neurons in a hidden layer, we implement a single neuron in the output layer, returning a continuous estimate $\hat{y}_i$ of item $i$'s difficulty. In the classification framework, there are $m$ output neurons representing classes $\{c_1, c_2, \ldots, c_m\}$, and we adopt voting for $c^*$ in the classifying network [44], as follows,
$$c^* = \underset{\ell \in \{1, 2, \ldots, m\}}{\arg\max}\ y_{\#\text{ of layers},\, \ell}.$$
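The computation performed by a single neuron, a weighted sum of incoming signals plus a bias passed through a sigmoid activation, can be sketched in Python as follows. The function names and the toy weights are illustrative; the sketch covers only the forward pass, not backpropagation.

```python
import math


def sigmoid(s):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-s))


def neuron_output(inputs, weights, bias):
    """Forward pass of one neuron: bias plus weighted sum of inputs, then sigmoid."""
    weighted_sum = bias + sum(w * z for w, z in zip(weights, inputs))
    return sigmoid(weighted_sum)


# Toy signals from a previous layer with hypothetical weights and bias.
signals = [1.0, 2.0]
weights = [0.5, -0.25]
print(neuron_output(signals, weights, bias=0.0))  # weighted sum is 0, so output is 0.5
```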

Variable Importance Analysis
While the importance analysis is not a stand-alone algorithm for item difficulty (or its categorized variant) prediction, it enables us to evaluate how "important" a given variable is for a model, considering the predictive performance; in other words, how much poorer the model would predict if it lacked the given variable [45].
We apply two measures of variable importance; each variable, i.e., item feature, has its own value of the importance measure, considering a given dataset and model. Before we introduce the measures, we define the mean square error, MSE, as
$$\text{MSE}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \tag{11}$$
for vectors $y = (y_1, y_2, \ldots, y_n)^T$ and $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)^T$ of observed and predicted difficulties of $n$ items, respectively. The first importance measure is $\text{MSE}_{\text{increase}}(X_j)$, which is equal to an increase of the mean square error of item difficulty prediction in such a model where values of the given item feature, $X_j$, are randomly permuted [45]. To be more specific, we first calculate the mean square error $\text{MSE}_{\{-\emptyset\}}$ of a full model with all original item features, then we compute the mean square error $\text{MSE}_{\{-j\}}$ of a model where item feature $X_j$ has randomly shuffled values. Finally, $\text{MSE}_{\text{increase}}(X_j)$ is defined as
$$\text{MSE}_{\text{increase}}(X_j) = \text{MSE}_{\{-j\}} - \text{MSE}_{\{-\emptyset\}}. \tag{12}$$
The more important item feature $X_j$ is for adequate and accurate prediction of item difficulty, the larger the prediction error, measured using the mean square error MSE, when the information in item feature $X_j$ is missing in the model. Thus, the greater the value of $\text{MSE}_{\text{increase}}(X_j)$, the more important the item feature $X_j$ is for item difficulty prediction.
The second importance measure, node purity increase, $\text{NodePurity}_{\text{increase}}(X_j)$, is defined similarly. Once the impurity metric $Q(\eta)$ is chosen, i.e., either misclassification error (8), Gini index (9) or deviance (10), the node purity increase, $\text{NodePurity}_{\text{increase}}(X_j)$, for item feature $X_j$ is simply an increase of the "1 minus impurity metric" term averaged over all leaf nodes if the item feature $X_j$ is newly introduced into a new model [45]. Thus, having the averaged "1 − node impurity" term, $\overline{(1 - Q(\eta))}_{\{-j\}}$, of a tree model with all original item features except for item feature $X_j$, and the averaged "1 − node impurity" term, $\overline{(1 - Q(\eta))}_{\{-\emptyset\}}$, of a model where item feature $X_j$ is already included, the $\text{NodePurity}_{\text{increase}}(X_j)$ is then
$$\text{NodePurity}_{\text{increase}}(X_j) = \overline{(1 - Q(\eta))}_{\{-\emptyset\}} - \overline{(1 - Q(\eta))}_{\{-j\}}. \tag{13}$$
Again, the more important item feature $X_j$ is for the predictive model performance, the higher the average "1 − node impurity" increase, i.e., the higher the average purity increase we can expect once the item feature $X_j$ is introduced into the model. Thus, the larger the value of $\text{NodePurity}_{\text{increase}}(X_j)$, the more important the item feature $X_j$ is for item difficulty prediction. According to some sources, e.g., [46], the $\text{MSE}_{\text{increase}}(X_j)$ measure should be preferred to $\text{NodePurity}_{\text{increase}}(X_j)$, since the latter is biased.
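The permutation-based importance measure can be sketched in Python as the difference between the mean square error after shuffling one feature and the mean square error of the unshuffled data. The function names, the fixed random seed, and the toy prediction function are assumptions made for illustration; the sketch evaluates an already fitted prediction function and does not reproduce the `randomForest` implementation.

```python
import random


def mse(y, y_hat):
    """Mean square error between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mse_increase(predict, X, y, j, seed=0):
    """Increase of MSE after randomly permuting the values of feature j."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    shuffled = [row[j] for row in X]
    rng.shuffle(shuffled)
    X_permuted = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, shuffled)]
    permuted = mse(y, [predict(row) for row in X_permuted])
    return permuted - baseline


def toy_predict(row):
    """Hypothetical fitted model that uses only the first feature."""
    return 2.0 * row[0]


X = [[float(i), float((7 * i) % 5)] for i in range(10)]
y = [2.0 * row[0] for row in X]
print(mse_increase(toy_predict, X, y, j=0))  # positive: the model relies on feature 0
print(mse_increase(toy_predict, X, y, j=1))  # zero: feature 1 is ignored by the model
```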

Evaluation of Algorithm Performance
Regression and classification tasks are evaluated using mutually different performance metrics. To obtain more robust estimates of the performance metrics, both regression and classification models are trained multiple times using various training sets, which enables us to average the metrics using all point estimates, collected one per each cross-validation iteration [47]; see Figures 2 and 3. We also compare the item difficulty prediction performance of the ML approaches with the performance of domain experts.

Evaluation of Regression Performance
The models within the regression task are evaluated and compared using root mean square error (RMSE), i.e.,
$$\text{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \tag{14}$$
for vectors $y = (y_1, y_2, \ldots, y_n)^T$ and $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)^T$ of observed and predicted difficulties of $n$ items, respectively. Inspecting Formulae (11) and (14), we obtain the identity $\text{MSE}(y, \hat{y}) = \text{RMSE}(y, \hat{y})^2$. Since RMSE indicates the magnitude of the error between observed and predicted item difficulties, a lower RMSE indicates better predictive performance of a given regression algorithm.

Evaluation of Classification Performance
Assuming there are $m$ observed classes that are predicted using a classifier, we could calculate the number of cases $n_{u,v}$ when the 'true' class $c_u$ is predicted as class $c_v$, where $u \in \{1, 2, \ldots, m\}$ and $v \in \{1, 2, \ldots, m\}$. Listing these frequencies in a table, we obtain Table 1, called the confusion matrix.
Table 1. Confusion matrix: rows correspond to the observed classes ($Y_c$), columns to the predicted classes ($\hat{Y}_c$), and the cell in row $u$ and column $v$ contains the frequency $n_{u,v}$.
The better and more accurate the classification is, the higher the frequencies $n_{u,u}$ aligned across the confusion matrix's principal diagonal are. Thus, marking the confusion matrix as $C$ and assuming vectors $y_c$ of observed item difficulty classes and $\hat{y}_c$ of predicted difficulty classes, we define predictive accuracy as the ratio of correctly classified items,
$$\text{predictive accuracy}(y_c, \hat{y}_c) = \frac{1}{n}\sum_{i=1}^{n} I(\hat{y}_{c,i} = y_{c,i}) = \frac{\sum_{u=1}^{m} n_{u,u}}{\sum_{u=1}^{m}\sum_{v=1}^{m} n_{u,v}}.$$
The higher the predictive accuracy, the better and more accurate the classification is [48]. Each of the $m$ classes of item difficulty $Y_c$ is of equal size in the dataset (classes are split using quantiles) (†). Assuming a classifier would predict difficulties $y_c$ as a vector $\hat{y}_{c,r}$ in the manner of a random guessing algorithm, then, given (†), the expected value of its predictive accuracy is
$$E\big(\text{predictive accuracy}(y_c, \hat{y}_{c,r})\big) = \frac{1}{m}.$$
Values of predictive accuracy greater than $\frac{1}{m}$ indicate that a classifier performs better than a random guessing algorithm.
In practice, a fully accurate prediction of the correct difficulty class is unnecessary. A prediction close enough to the correct difficulty class, i.e., the correct one or one class below or above the correct one, is still useful. Thus, we also measure the classifiers' performance using an extended predictive accuracy. The item $i$ is evaluated as correctly classified if it is classified in the correct difficulty class, $\hat{y}_{c,i} = y_{c,i} = c_\ell$, or one class higher if such a class exists, $\hat{y}_{c,i} = c_{\ell+1}$, or one class lower if such a class exists, $\hat{y}_{c,i} = c_{\ell-1}$, compared to the difficulty class estimated from student response data, thus
$$\text{extended predictive accuracy}(y_c, \hat{y}_c) = \frac{1}{n}\sum_{i=1}^{n} I\big(\hat{y}_{c,i} \in \{c_{\ell-1}, c_\ell, c_{\ell+1}\} \text{ for } y_{c,i} = c_\ell\big),$$
where $c_{\ell-1}$ is one class below $c_\ell$, and $c_{\ell+1}$ is one class above $c_\ell$, respectively, if it exists, and an empty set otherwise. Thus, the probability that a classifier in this sense correctly classifies a category with subscript $\ell \in \{2, 3, \ldots, m-1\}$ by chance is equal to $\frac{|\{\ell-1, \ell, \ell+1\}|}{m} = \frac{3}{m}$, while the probability that a classifier correctly classifies the first or last category, with subscript $\ell = 1$ or $\ell = m$, is $\frac{|\{1, 2\}|}{m} = \frac{2}{m}$ or $\frac{|\{m-1, m\}|}{m} = \frac{2}{m}$, respectively. Again, assuming a classifier would predict difficulties $y_c$ as a vector $\hat{y}_{c,r}$ in the manner of a random guessing algorithm, then the expected value of its extended predictive accuracy is
$$E\big(\text{extended predictive accuracy}(y_c, \hat{y}_{c,r})\big) = \frac{1}{m}\left((m-2)\cdot\frac{3}{m} + 2\cdot\frac{2}{m}\right) = \frac{3m-2}{m^2}.$$
Thus, any values of extended predictive accuracy that are greater than $\frac{3m-2}{m^2}$ show that a classifier predicts better than a random guessing procedure.
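Both accuracy measures and their random-guessing baselines can be sketched in a few lines of Python. Classes are encoded here as integers 1 to m (an illustrative assumption), so that "one class below or above" becomes an absolute difference of at most one; the function names and toy vectors are hypothetical.

```python
def predictive_accuracy(y_true, y_pred):
    """Proportion of items classified into exactly the observed class."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def extended_predictive_accuracy(y_true, y_pred):
    """Proportion of items classified into the observed class or a neighbouring one."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= 1)
    return hits / len(y_true)


def random_guess_baselines(m):
    """Expected accuracy and extended accuracy of uniform random guessing over m classes."""
    return 1 / m, (3 * m - 2) / m ** 2


observed = [1, 2, 3, 4, 5]
predicted = [1, 3, 5, 4, 4]
print(predictive_accuracy(observed, predicted))           # 2 of 5 items hit exactly
print(extended_predictive_accuracy(observed, predicted))  # 4 of 5 items within one class
print(random_guess_baselines(5))                          # baselines for m = 5
```

For m = 5, the baselines are 1/5 and 13/25, so an extended predictive accuracy has to exceed 0.52 before a classifier can be said to beat random guessing.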

Cross-Validation
To obtain more robust estimates, the performance metrics are re-estimated multiple times within f-fold cross-validation, where $f \in \mathbb{N}$ and $f \geq 2$, using dataset splitting into training and testing subsets of sizes of $\frac{f-1}{f} \cdot 100\,\%$ and $\frac{1}{f} \cdot 100\,\%$ of the dataset, respectively, and then averaged [49]; see Figure 8. In particular, for even better comparison and integer sizes of both the training and testing subsets, it might be optimal to choose $f$ as a divisor of sample size $n$; then the portions $\frac{f-1}{f} \cdot 100\,\%$ and $\frac{1}{f} \cdot 100\,\%$ for training and testing subsets, respectively, correspond to integer numbers of items.
Assuming that the $p$-th iteration of the f-fold cross-validation outputs a point estimate $\hat{M}_p$ of root mean square error, predictive or extended predictive accuracy, we could finally average the estimates as
$$\bar{M} = \frac{1}{f}\sum_{p=1}^{f}\hat{M}_p$$
to obtain a robust and unbiased estimate $\hat{E}(M) = \bar{M}$ of the root mean square error, predictive or extended predictive accuracy [50], respectively.
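The splitting and averaging steps of f-fold cross-validation can be sketched in Python as follows. The sketch assumes that f divides n and that the items are already in random order (no shuffling is performed); the function names and the placeholder `evaluate` callback are hypothetical.

```python
def f_fold_splits(n, f):
    """Yield (train_indices, test_indices) for each of the f folds; f must divide n."""
    if n % f != 0:
        raise ValueError("f must be a divisor of n")
    fold_size = n // f
    indices = list(range(n))
    for p in range(f):
        test = indices[p * fold_size:(p + 1) * fold_size]
        train = indices[:p * fold_size] + indices[(p + 1) * fold_size:]
        yield train, test


def cross_validated_metric(n, f, evaluate):
    """Average the point estimates returned by evaluate(train, test) over all folds."""
    estimates = [evaluate(train, test) for train, test in f_fold_splits(n, f)]
    return sum(estimates) / f


# With n = 40 items and f = 20 folds, each test subset holds two items.
splits = list(f_fold_splits(40, 20))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

In a real analysis, `evaluate` would fit a model on the training indices and return, e.g., the RMSE or the (extended) predictive accuracy on the testing indices.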

Relationship between Model's Predictive Performance and a Number of Item Features in a Model
A value of the root mean square error, RMSE, following Formula (14) is not closely related to the number of item features considered within a model. Thus, enriching a model with newly extracted item text features does not necessarily improve its predictive performance. More details, a formal derivation, and the mathematical rationale of the relationship between the model predictive performance and the number of item features on model input are in the Online Supplement listed in the Data Availability Statement at the end of the article.

Implementation
Text preprocessing and the entire analysis were implemented in statistical language and environment R [51].For evaluation of the classification task, the continuous difficulty Y, estimated from student response data, of an original range −2.48, +1.63) was split into m = 5 disjunctive intervals, denoted Y c ∈ {c 1 , c 2 , c 2 , c 4 , c 5 }, of the same size using quintiles, specifically −2.48, −0.80), −0.80, −0.44), −0.44, +0.03), +0.03, +0.52), +0.52, +1.63), and labeled as {very easy, easy, moderate, difficult, very difficult}.Thus, regarding item difficulty, the dataset of item text wordings is well-balanced.While the final number of item features derived from their text wording is k = 69, the number of items is n = 40.Regarding the f -fold cross-validation, due to a straightforward advantage of whenever n is divisible by f ≥ 2, we choose for f = 20.Thus, since n f = 40 20 = 2, we applied a leave-two-out cross-validation.
Domain experts' evaluation of item difficulty originally uses an arbitrary scale of [1.0, 2.5]. To make the experts' evaluation comparable with the outputs of the classifiers, we split the experts' scale following the logic with which the scale was originally designed, i.e., we consider m = 5 equidistant intervals over the range [1.0, 2.5]. Thus, we create m = 5 intervals of length 0.3 and label them also as {very easy, easy, moderate, difficult, very difficult}. Given the assumed Rasch model (1), the obtained 'true' item difficulty is on a logistic scale where very low and very high values are less common, yet possible. For this reason, we split the Rasch-based item difficulty using quantiles. The domain experts, on the other hand, naturally designed the difficulty evaluation scale in a linear fashion, which is our rationale for equidistant scale splitting.
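The two categorization schemes can be illustrated with a short Python sketch. The quintile cut points are those reported in the text; the helper names (`classify`, `equidistant_cuts`) are hypothetical, and treating the experts' intervals as left-closed is an assumption made here for illustration only.

```python
from bisect import bisect_right

LABELS = ["very easy", "easy", "moderate", "difficult", "very difficult"]

# Inner cut points of the quintile-based intervals reported in the text.
RASCH_CUTS = [-0.80, -0.44, 0.03, 0.52]


def classify(value, cuts):
    """Return the label of the left-closed interval that contains value."""
    return LABELS[bisect_right(cuts, value)]


def equidistant_cuts(low, high, m):
    """Inner cut points of m equal-width intervals over [low, high]."""
    width = (high - low) / m
    return [low + width * j for j in range(1, m)]


# Experts' scale [1.0, 2.5] split into five intervals of length 0.3
# (left-closed intervals are an assumption of this sketch).
EXPERT_CUTS = equidistant_cuts(1.0, 2.5, 5)

rasch_class = classify(-0.80, RASCH_CUTS)   # lower bound belongs to "easy"
expert_class = classify(2.0, EXPERT_CUTS)   # falls into [1.9, 2.2)
```

Note that the Rasch-based classes contain equal numbers of items by construction, whereas the equidistant expert classes need not.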
The difficulty of items was estimated from student response data using the Rasch model by the function RM() of the eRm package [52]. Text preprocessing was performed using the R package quanteda [53]. Regularization was implemented with the function glmnet() of the glmnet package [54]. The Naïve Bayes classifier and support vector machines were built using the naiveBayes() and svm() functions of the e1071 package [55]. The radial kernel function was chosen for the kernel trick, if applied. Classification and regression trees were built by the function rpart() of the rpart package [56]. Random forest models were learned using the function randomForest() from the randomForest package [57], each time using 500 trees in a model. Neural networks were modeled using the neuralnet() function of the neuralnet package [58]; they contain one hidden layer with the same number of neurons as item features on input.

Results
To assess the feasibility and performance of predicting item difficulty from textual wording using ML methods, we applied the above-described methodology to our dataset. First, we built supervised regression models to estimate item difficulty as a continuous variable. The outcomes are detailed in Table 2, which reports the root mean square errors (RMSE) for the n = 40 single-paragraph items, averaged over all f = 20 iterations of the f-fold cross-validation, for seven regression algorithms and for the domain experts' estimates. The lower the RMSE an algorithm yields, the more accurate and reliable its item difficulty estimates are.
A comparison of the algorithms highlights the varying performance levels between the models. Among the evaluated models, the regularization algorithms, i.e., LASSO regression, ridge regression, and elastic net, demonstrated superior performance by yielding the lowest RMSE values, indicating the highest accuracy and reliability. In particular, the elastic net returned the lowest RMSE of 0.666 among the regularization approaches (and, thus, among all models, too). Additionally, considering the data and model settings, the elastic net model outperformed domain experts in the continuous item difficulty prediction since domain experts reached an RMSE of 1.004. On the other hand, among the ML algorithms, regression trees and neural networks produced the highest RMSE values of about 0.978 and 0.971, respectively, suggesting less accuracy and reliability than the other models. The remaining algorithms displayed moderate performance levels. Meanwhile, regression trees and domain experts had higher but comparable RMSE values, further emphasizing the superior performance of the elastic net algorithm in this analysis.
Since the domain experts evaluate item difficulty mostly using numbers such as 1.0, 1.5, 2.0, 2.5, as described in Section 3, Implementation, they are a priori handicapped in estimating an exact point value of the item difficulty. Applying Sheppard's correction [59], their RMSE, as a measure following the logic of the second moment, is overestimated by a term of $\frac{(\text{width of the interval between valid values})^2}{12} = \frac{0.5^2}{12} \approx 0.02$. However, if all domain experts systematically over- or underestimated the true item difficulty, their RMSE could, in theory, be overestimated by the width of the interval between valid values, i.e., by 0.5.
Additionally, Table 3 presents the predictive and extended predictive accuracies of different classification algorithms, including the Naïve Bayes classifier, support vector machines, classification trees, random forests, and neural networks, as well as of domain experts. Assuming that only an approximate match of the true and predicted category of item difficulty is sufficient for applications, we focus on extended predictive accuracy. Among the ML algorithms, random forests output the highest extended predictive accuracy with a score of 0.650, while the Naïve Bayes classifier showed the lowest extended predictive accuracy, achieving a score of only 0.425. Domain experts achieved a superior accuracy of 0.650, indicating their important role in the classification of item difficulty.
For a better understanding of the individual classifiers' predictive capacity, we plot the confusion matrices for each algorithm (see Figure 9), where each row represents the numbers of items in each of the observed classes, while each column represents the numbers of items in each of the difficulty classes predicted by the algorithm. The numbers in the cells of the confusion matrices are sums over all iterations of the f-fold cross-validation. Overall, the results suggest that the ML algorithms could benefit from further improvement to accurately classify items in all classes of difficulty, especially in the middle classes, i.e., from easy to difficult. The domain experts rarely used the highest category, very difficult, for these items; this may be because the test is generally easy and this type of item, in particular, may appear simple compared to exercises from high school textbooks.
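How both accuracies follow from a summative confusion matrix can be sketched as below. The sketch assumes that the extended predictive accuracy counts a prediction as a hit when the predicted class equals the observed class or is adjacent to it; the formal definition is given earlier in the paper, outside this section, so this reading is an assumption. The matrix is hypothetical and is not taken from Figure 9.

```python
def predictive_accuracy(cm):
    """Share of items on the main diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def extended_predictive_accuracy(cm, tolerance=1):
    """Share of items predicted within `tolerance` classes of the observed class."""
    total = sum(sum(row) for row in cm)
    hits = sum(
        cm[i][j]
        for i in range(len(cm))
        for j in range(len(cm))
        if abs(i - j) <= tolerance
    )
    return hits / total


# Hypothetical 5 x 5 summative confusion matrix for n = 40 items
# (rows: observed classes, columns: predicted classes).
cm = [
    [4, 3, 1, 0, 0],
    [2, 3, 2, 1, 0],
    [1, 2, 2, 2, 1],
    [0, 1, 3, 3, 1],
    [0, 1, 1, 2, 4],
]

accuracy = predictive_accuracy(cm)            # 16 / 40
extended = extended_predictive_accuracy(cm)   # 33 / 40
```

The extended measure is never lower than the plain accuracy, because the diagonal cells are a subset of the cells it counts.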
Tables 4 and 5 present the variable importance analysis of the item text features applied in our models for item difficulty prediction and classification. While Table 4 uses the MSE increase metric, Table 5 utilizes the NodePurity increase metric of variable importance. Both measures are reported in Tables 4 and 5 as an average ± standard deviation based on f = 20 point estimates from all iterations of the f-fold cross-validation. The MSE increase, as a metric of an item feature's importance, operates with the mean square error (MSE), which is the squared value of RMSE; it is more suitable for regression models and the prediction of item difficulty as a continuous variable. By contrast, the NodePurity increase, as a metric of an item feature's importance, calculates the impurity of leaf nodes when classifying into a category of item difficulty; thus, it performs better in the classification of item difficulty. Both measures can provide valuable insights into feature importance; however, they may result in different rankings as they capture distinct aspects of model prediction performance. By considering both metrics, we can comprehensively understand item feature importance and make informed decisions for analysis and interpretation.
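The idea behind an MSE-increase importance can be illustrated with a permutation-based sketch: one feature column is shuffled and the resulting growth of the mean square error is recorded. Whether the MSE increase in Table 4 was computed exactly this way is an assumption; the data, the `predict` function, and the helper `permutation_mse_increase` are hypothetical and serve only to show the principle.

```python
import random


def mse(y_true, y_pred):
    """Mean square error of predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_mse_increase(predict, X, y, feature, rng):
    """MSE after shuffling one feature column minus the baseline MSE."""
    baseline = mse(y, [predict(row) for row in X])
    column = [row[feature] for row in X]
    rng.shuffle(column)
    X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, column)]
    return mse(y, [predict(row) for row in X_perm]) - baseline


rng = random.Random(0)
# Synthetic data: the target depends on feature 0 only; feature 1 is noise.
X = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(200)]
y = [2.0 * row[0] for row in X]


def predict(row):
    # Hypothetical fitted model that has learned the true relationship.
    return 2.0 * row[0]


increase_informative = permutation_mse_increase(predict, X, y, 0, rng)
increase_noise = permutation_mse_increase(predict, X, y, 1, rng)
```

Shuffling the informative feature raises the error, while shuffling the noise feature leaves the predictions, and hence the error, unchanged.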
According to Table 4, the number of all characters in the item wording seems to be the most crucial feature for item difficulty, with an MSE increase of 5.912 ± 0.673, followed by the word length's standard deviation (in characters) with an MSE increase of about 4.845 ± 0.799. Various other features follow, such as readability indices, indices of similarity or the portion of shared words between the item passage, distractors, item question, or key option, as well as the longest and average word length in the item wording, with an MSE increase between about 0.900 and 3.500.
In Table 5, the same two features seem to determine the classification of item difficulty the most: the word length's standard deviation (in characters) with a NodePurity increase of about 1.644 ± 0.121, and the number of all characters in the item wording with a NodePurity increase of 1.455 ± 0.137. Additionally, some of the readability indices, the numbers of monosyllabic and rare words, and the similarity between different parts of the item wording are important for correct item difficulty prediction, returning a NodePurity increase in an interval of 0.030-0.080.
A detailed explanation of individual item features listed in Tables 4 and 5 is in Appendix A. Note that although we sorted the item features in decreasing order according to the importance measures in Tables 4 and 5, the intervals for importance measures' mean values, indicated by ± standard deviation terms, overlap between various item features.Thus, the importance analysis is only illustrative.
Table 6 provides a summary of the elastic net regression model following Formula (4) that minimized the root mean square error, RMSE, with $\lambda_{\text{LASSO}} \approx 1$ and $\lambda_{\text{ridge}} \approx 0$. While most item features were removed by shrinking their coefficients towards zero, the item features listed in Table 6 are those that remained in the model. In contrast to the item features' importance analysis, the elastic net model tells us not only which item features are essential for the final model but also the approximate direction of the relationship between the features and item difficulty. The elastic net model suggests that a larger total number of characters in the item text wording increases item difficulty (β̂ = 0.002 > 0), and that greater Dale-Chall and FOG readability indices also make the item more difficult (β̂ = 0.004 > 0 and β̂ = 0.026 > 0, respectively). In addition, an increased standard deviation of word lengths within the item wording (β̂ = 0.809 > 0) and a greater average sentence length (in words) in distractors (β̂ = 0.002 > 0) increase item difficulty, as does a greater proportion of common words in the passage and distractors (β̂ = 0.630 > 0); here, the passage and distractors-common words 1 feature is the ratio of the number of common words in the item passage that are also found in the wording of distractors to the number of all words in the item passage. Finally, considering Table 6, increased word2vec similarity between the key option and distractors is associated with a higher item difficulty (β̂ = 0.023 > 0); the key option and distractors-word2vec similarity is a similarity of the key option and distractors of the item wording based on word2vec algorithms, where vectors of tokens for both parts are generated and the similarity between them is captured from the context. These features were also detected as important by the importance analysis.
An example of a decision tree, estimating categorized item difficulty as an interval, is in Figure 10. The tree in the figure uses various item features such as the word length's standard deviation (in characters), frequency of uncommon words-according to COCA corpus, item passage and key option-common words 1, key option and distractors-number of features from a document-feature matrix, distractors-average sentence length (in words), passage and distractors-word2vec similarity, and number of all characters in item wording. An interpretation is possible and relatively straightforward: in general, if the item's words vary significantly in their lengths, the frequency of uncommon words is high, the proportion of words common to the key option and distractors is low enough, the item passage and distractors are dissimilar enough, or the item wording is long enough, then the item's difficulty is relatively high.
More specifically, if the word length's standard deviation (in characters) is not lower than 2.3, then the item is difficult (in [+0.03, +0.52)) or very difficult (in [+0.52, +1.63)). Otherwise, when the frequency of uncommon words-according to COCA corpus is lower than 0.26, the item could be easy (in [−0.80, −0.44)) or difficult (in [+0.03, +0.52)), depending on the common words 1 among the item passage and key option. Conditional on the previous rules, whenever the number of features from a document-feature matrix of key option and distractors, i.e., the number of common words in both the item key option and distractors, is less than 15, the item is very easy (in [−2.48, −0.80)), easy (in [−0.80, −0.44)), or could be difficult (in [+0.03, +0.52)) for a number of characters of at least 1062. If the word2vec similarity between passage and distractors is lower than 0.79, then the item difficulty is moderate (in [−0.44, +0.03)). Otherwise, the item difficulty depends on the number of characters in the item wording: usually, if the difficulty could be one of two different difficulty classes, a lower number of characters in the item wording tends to classify the item into the easier class, as seen in the last-but-one nodes of the tree in Figure 10.

Discussion
In this work, we provided a framework for predicting the difficulty of cognitive test items from their wording. We extracted various text features from English reading comprehension items and employed a number of ML algorithms. Our work is unique in that it compares a wide range of ML algorithms, both for regression and classification tasks, as well as in relating the predictions to those of domain experts. We also provide reproducible R code, which can be used and built on in future studies. The prediction of item difficulty using item text features may save time and resources needed for pre-testing and may help especially in situations when pre-testing is limited or not feasible. The ML prediction of item difficulty presented in this work has the potential to be more precise than domain experts, and if not fully replacing domain experts, it may be used to guide and improve their predictions, as well as any imprecise estimates coming from pre-testing based on small or less representative samples. Among all regression task algorithms, regularization approaches seemed to outperform the others, similar to [60,61]. This is to be expected given that the amount of data included in the training subset was relatively low. All ML algorithms outperformed domain experts in this task, although the domain experts are handicapped by not using a continuous scale, as mentioned in Section 4.
To govern the accuracy-precision trade-off towards higher accuracy [62], we also considered the task of classifying the item difficulty into only a few categories. Domain experts slightly outperformed ML algorithms in the accuracy of difficulty classification when the task was to classify the item difficulty into five categories. Among the ML algorithms, random forests predicted with the highest extended predictive accuracy and performed almost as well as domain experts. We suppose that random forests returned the best predictive performance because this algorithm is an ensemble by design, combining multiple decision trees.
It is hard to compare our results to those of other studies, given that different studies train ML algorithms on data that may differ in the topic, the number of available items, and the variability of item content and difficulty, as well as in the difficulty scale used or the distribution of difficulty across the scale. Benedetto et al. in [63] applied ML techniques to multiple true-false questions from CloudAcademy to predict the question difficulty and obtained an RMSE of about 0.700-0.900 for random forests, decision trees, support vector machines, and linear regression. In another paper, Benedetto et al. [64] introduced an R2DE model for newly generated items and automatically predicted their difficulty, originating from the interval [−5, +5], with an RMSE of 0.823, which is approximately comparable to our results, i.e., an RMSE of 0.668 (elastic net) on item difficulty coming from the interval [−2.48, +1.63). Using word embedding and a support vector machine with the radial kernel, Ehara in [65] reported an RMSE of about 3.632 for item difficulty prediction on English vocabulary tests with a pre-estimated difficulty range of [−2, +4]; since our dataset is of a similar difficulty range, we obtained a better performance for item difficulty prediction in the case of support vector machines, namely an RMSE of 0.716. Lee et al. in [66] predicted item difficulty for C-tests, i.e., tests where the second part of every second word is missing and has to be filled in by the test-taker, and reached an RMSE of 0.240 using advanced architectures of support vector machines and neural networks. Regarding the adaptive scenarios, Pandarova et al. in [67] predicted the difficulty of cued gap-filling items using common item features and several ridge regression models and obtained an RMSE of 0.770. Qiu et al. in [68] trained a document-enhanced attention-based neural network on data from medical online education websites in China to predict the correct-answer ratio (in the range of 0 to 1) and obtained an RMSE of 0.131. They also compared the approach with support vector machines-based prediction, yielding an RMSE of about 0.172, which is, considering their difficulty range [0, 1], comparable with our results. Ha et al. in [69] and Xue et al. in [70] published, besides response times, predictions of item difficulty using medical datasets based on correct-answer ratios (i.e., difficulty in a range of 0 to 1) and employing various ML methods and transfer learning, resulting in an RMSE in the range of 0.200-0.300. Similar approaches and results as those of Ha et al. in [69] are also reported by Yaneva et al. in [71]. Yin et al. in [72] proposed a new text-embedded and hierarchical pre-trained model, QuesNet, for item representation, which is able to predict item difficulty, ranging over the interval 0-1, with an RMSE of 0.253.
Several studies went deeper into item difficulty classification rather than continuous prediction. Hsu et al. in [73] predicted item difficulty (of five levels, i.e., very easy, easy, moderate, difficult, very difficult) in social studies tests using semantic spaces and word embedding techniques, by which they reached an accuracy of about 0.350 and an extended accuracy of about 0.780. Similar to our study, they also found that semantic similarity between an item stem and the options strongly impacts item difficulty. One year later, Lin et al. in [74] repeated the analysis by Hsu et al. and applied long short-term memory to the same problem and datasets; they obtained an accuracy of 0.370 and an extended accuracy of 0.840.
Compared with the above-mentioned studies, our analysis is limited by the number of items available for training the ML algorithms, as well as by the relatively low and homogeneous item difficulty related to the level of the exam, which was set to B1 according to the Common European Framework of Reference for Languages (CEFR) standard.
This study opens several paths for further research. One possible path to improving the algorithms presented here is to extend or improve the extracted item text features while keeping in mind that simply increasing the number of item features would not necessarily improve the model's predictive performance; see Section 2.4.4. Within the item difficulty prediction based on text wording, we focused on text content rather than context. In our case, various readability indices and indices of similarity between individual parts of the item text wording seemed to be important for the difficulty prediction, similarly to [73]. Additionally, considering the elastic net summary, the standard deviation of item words' length (in characters) was of significant importance. Content-based features are easier to extract, but they may significantly reduce the information encoded in the textual wording [75]. Further research may also consider incorporating contextual analysis, which, however, requires extensive samples of textual data [76]. Other future paths include tuning the settings of the involved ML algorithms or even including further ML methods.
Involving a wider range of training datasets is another possible path to follow. Our work focused on predicting item difficulty in the reading comprehension section of the English language test; however, the possible usage of the methods presented here is much wider. Similar methods may find their use in the prediction of item difficulty in other knowledge tests [69,70,77], or to provide a better understanding of the rating of the quality of grant proposals [78,79] when a text complementing numerical ratings is available. Text analysis and ML methods may provide a deeper insight into item-level differences in responding and explain so-called differential item functioning (DIF) [80][81][82] or item-level between-group differences in change after treatment (differential item functioning in change, DIF-C) [83]. Given the increasing computational power, we expect that more research implementing textual data analysis will complement the analysis of rating data in the future.

Conclusions
To conclude, the text analysis of item wording may be useful for the prediction of item difficulty, especially when item pre-testing is limited or not available. Machine learning algorithms, particularly regularization or random forests, may be able to inform and improve the item difficulty estimates of domain experts. Future studies should consider more complex and deeper text analysis, including context analysis, as well as other ML methods and method tuning, to further improve the performance of item difficulty prediction.

Funding:
The study was supported by the Czech Science Foundation Grant Number 21-03658S, by the institutional support RVO 67985807, and by the Charles University programme Progres Q15 "Life course, lifestyle and quality of life from the perspective of individual adaptation and the relationship of the actors and institutions".

Data Availability Statement:
Data, source code for the ML analysis, and further supplementary material are available at the OSF platform at https://osf.io/nzfgk/ (accessed on 27 September 2023). Original data with item wordings are available at https://data.cermat.cz/ (in Czech) (accessed on 30 March 2023).

Acknowledgments:
The authors thank the Centre for Evaluation of Educational Achievement for sharing insights on item difficulty evaluation and for data of preliminary difficulty predictions by domain experts. We also thank anonymous reviewers and Eva Potužníková for suggestions to previous versions of the manuscript and Filip Martinek for assistance with software computations.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A
In this part of the appendix, we describe selected item features and their definitions in more detail, particularly those listed in Tables 4 and 5. The wording of an item usually consists of the following parts: an item passage, a question, a key option, and distractors. The item passage is an introductory text of varying length that mentions important terms or definitions asked in the following item question or describes the item's context. The item question is followed by a permutation of a key option, i.e., a correct answer, and several distractors, i.e., incorrect answers. In the summary below, we mark any of the item wording parts as {A}, {A} ∈ {item passage, question, key option, distractors}, and any pair of the item wording parts as {A and B}, {A and B} ∈ {key option and distractors, item passage and distractors, item passage and key option, question and distractors, question and key option, item passage and question}.
Each item feature is either a characteristic of an entire item text wording (i.e., there is one numerical value of the item feature for the item), of each item wording part (i.e., there is one numerical value for each wording part), or of a pair of item wording parts. In case the item feature is a numerical characteristic of part A of the item wording, or of a pair of parts {A and B} of the item wording, it is indicated below using the {A}: "item feature label" or {A and B}: "item feature label" notation, respectively.

Dale-Chall index
In the formula for the Dale-Chall index, $n_{\text{difficult}}$ is the number of words not included in the Dale-Chall list of 3000 familiar words, $n_w$ is the total number of words in a text of the item wording, and w is a value computed as the number of words divided by the number of sentences, i.e., it is the average number of words per sentence [84]. The greater the value of the Dale-Chall index for a given text, the more difficult the text is to read.

FOG index
The readability score of a text of the item wording based on Gunning's Fog Index. The formula is $\text{FOG index} = 0.4 \cdot \left( w + 100 \cdot \frac{n_{\text{words with} \geq 3 \text{ syllables}}}{n_w} \right)$, where, again, w is a value computed as the number of words divided by the number of sentences, i.e., it is the average number of words per sentence, $n_w$ is the total number of words in a text of the item wording, and $n_{\text{words with} \geq 3 \text{ syllables}}$ is the number of words with three or more syllables in a text of the item wording [85]. If the average length of a sentence or the number of words with three or more syllables in a text increases, the FOG index increases, too.

SMOG index
The readability score of a text of the item wording based on the Simple Measure of Gobbledygook (SMOG) index, so $\text{SMOG index} = 1.043 \cdot \sqrt{n_{\text{words with} \geq 3 \text{ syllables}} \cdot \frac{30}{n_s}} + 3.129$, where $n_{\text{words with} \geq 3 \text{ syllables}}$ is the number of words with three or more syllables in a text of the item wording and $n_s$ is the number of sentences in a text of the item wording [86].

In the formula for the Traenkle-Bailer index, c is the average number of characters per word, w is the average number of words per sentence, $n_{\text{prep}}$ is the number of prepositions, and $n_w$ is the total number of words in a text of the item wording [87]. The Traenkle-Bailer index decreases if the average number of characters per word, the average number of words per sentence, or the average number of prepositions per word increases.
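The FOG, SMOG, and Traenkle-Bailer formulas of this appendix can be written as small Python functions. This is a sketch only: the function and argument names are hypothetical, the counts (words, sentences, syllable-based counts, prepositions) are assumed to be supplied by a separate text-processing step, and the constants are those quoted in the appendix.

```python
import math


def fog_index(n_words, n_sentences, n_complex):
    """Gunning's Fog Index; n_complex counts words with three or more syllables."""
    return 0.4 * (n_words / n_sentences + 100 * n_complex / n_words)


def smog_index(n_sentences, n_complex):
    """SMOG index; n_complex counts words with three or more syllables."""
    return 1.043 * math.sqrt(n_complex * 30 / n_sentences) + 3.129


def traenkle_bailer_index(n_chars, n_words, n_sentences, n_prepositions):
    """Traenkle-Bailer index from raw counts of a text."""
    c = n_chars / n_words      # average number of characters per word
    w = n_words / n_sentences  # average number of words per sentence
    return 224.68 - 79.83 * c - 12.24 * w - 129.29 * n_prepositions / n_words


# Hypothetical counts for a short text: 100 words, 5 sentences,
# 10 words with three or more syllables, 500 characters, 10 prepositions.
fog = fog_index(100, 5, 10)
smog = smog_index(5, 10)
tb = traenkle_bailer_index(500, 100, 5, 10)
```

Taking counts rather than raw text as input keeps the sketch independent of any particular tokenizer or syllable counter.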

{A and B}-euclidean distance
Let us assume two textual parts of item wording, A and B, so that a union of their tokens has a length l ∈ N. Additionally, let us assume two vectors of the same length l, i.e., $t_A = (t_{A,1}, t_{A,2}, \ldots, t_{A,l})^T$ and $t_B = (t_{B,1}, t_{B,2}, \ldots, t_{B,l})^T$, where $t_{A,i} = 1$ (or $t_{B,i} = 1$) if and only if text A (text B) contains token i, otherwise $t_{A,i} = 0$ (or $t_{B,i} = 0$), for all i ∈ {1, 2, . . ., l}. The euclidean distance between the parts A and B is $d(A, B) = \sqrt{\sum_{i=1}^{l} (t_{A,i} - t_{B,i})^2}$. The more similar the parts A and B of the item wording are, the lower the value of the euclidean distance d(A, B) is.

{A and B}-cosine similarity
Again, let us assume two textual parts of item wording, A and B, so that a union of their tokens has a length l ∈ N. The more similar the parts A and B of the item wording are, the higher the value of the cosine similarity cos(A, B) is.
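Both similarity measures operate on binary token-indicator vectors built over the union of tokens of the two parts. A minimal Python sketch follows; the function names are hypothetical, the tokens are assumed to be already extracted, and the standard definitions of the euclidean distance and the cosine similarity are used.

```python
import math


def indicator_vectors(tokens_a, tokens_b):
    """Binary vectors over the union of tokens: 1 if the part contains the token."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    vocabulary = sorted(set_a | set_b)
    t_a = [1 if token in set_a else 0 for token in vocabulary]
    t_b = [1 if token in set_b else 0 for token in vocabulary]
    return t_a, t_b


def euclidean_distance(t_a, t_b):
    """Square root of the summed squared differences of the two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_a, t_b)))


def cosine_similarity(t_a, t_b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(t_a, t_b))
    norm_a = math.sqrt(sum(a * a for a in t_a))
    norm_b = math.sqrt(sum(b * b for b in t_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an empty part")
    return dot / (norm_a * norm_b)


# Hypothetical tokenized parts A and B of an item wording.
part_a = ["the", "cat", "sat"]
part_b = ["the", "dog", "sat"]
t_a, t_b = indicator_vectors(part_a, part_b)
distance = euclidean_distance(t_a, t_b)    # two tokens differ
similarity = cosine_similarity(t_a, t_b)   # two of three tokens are shared
```

For identical parts, the distance is 0 and the similarity is 1, which matches the interpretation that more similar parts give a lower distance and a higher cosine similarity.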

Figure 1. A scheme of text processing procedures and extraction of item text features.

Figure 6. Linear splitting of the variables' space (on the left) and an appropriate tree representation (on the right).

Figure 7. A scheme of one neuron in a neural network.

Figure 8. Within the p-th iteration of f-fold cross-validation, where p ∈ {1, 2, . . ., f}, f > 1 and f ∈ N, a model is trained using the training set (colored in white) and tested using the test set (colored in grey), i.e., the (f − p + 1)-th of f equal-size parts into which the entire dataset was originally split.

Figure 9. Summative confusion matrices for five classification algorithms and domain experts, respectively. For each algorithm, within each iteration of the f-fold cross-validation, a partial confusion matrix was calculated from the testing $\frac{1}{f}$ fraction of the dataset, and the resulting f confusion matrices were combined into one final summative confusion matrix, which is displayed. The blue color indicates cells considered for calculating the extended predictive accuracy.

Figure 10. An example of a decision tree classifying the categorized item difficulty into a difficulty class (and an appropriate interval).

Whenever the term $\sqrt{\frac{n_{\text{words with} \geq 3 \text{ syllables}}}{n_s}}$ in the SMOG index increases, i.e., the square root of the number of words with three or more syllables per sentence, the text becomes more difficult to read and the SMOG index increases.

Traenkle-Bailer index
The readability score of a text of the item wording based on the Traenkle-Bailer index (mostly used in German-speaking countries) is calculated as $\text{T-B index} = 224.68 - (79.83 \cdot c) - (12.24 \cdot w) - 129.29 \cdot \frac{n_{\text{prep}}}{n_w}$.

Table 2. Values of root mean square error (RMSE) for seven regression algorithms and domain experts, respectively, estimating item difficulty as a continuous variable, calculated over f = 20 iterations of the f-fold cross-validation.

Table 3. Values of averaged predictive and extended predictive accuracies for five classification algorithms and domain experts, respectively, estimating item difficulty as a categorized variable, calculated over f = 20 iterations of the f-fold cross-validation.

Table 4. Top twenty item features with the highest value of importance for item difficulty prediction, measured using MSE increase. The MSE increase measure is reported as an average ± standard deviation based on f = 20 point estimates from all iterations of the f-fold cross-validation. A detailed explanation of individual item features listed in the table is in Appendix A. The abbreviation COCA stands for The Corpus of Contemporary American English, DF matrix for document-feature matrix.

Table 5. Top twenty item features with the highest value of importance for item difficulty prediction, measured using NodePurity increase. The NodePurity increase measure is reported as an average ± standard deviation based on f = 20 point estimates from all iterations of the f-fold cross-validation. A detailed explanation of individual item features listed in the table is in Appendix A. The abbreviation CEFR stands for The Common European Framework of Reference for Languages, DF matrix for document-feature matrix.

| Item Feature | Description or Definition of the Item Feature |
| --- | --- |
| number of tokens | Total number of unique tokens, i.e., words in a text of the item wording. |
| {A}-number of tokens | Total number of unique tokens, i.e., words in a text of part A of the item wording. |
| number of monosyllabic words | Number of monosyllabic words, i.e., words with only one syllable in a text of the item wording. |
| {A}-number of monosyllabic words | Number of monosyllabic words, i.e., words with only one syllable in a text of part A of the item wording. |
| number of multi-syllable words | Number of multi-syllable words, i.e., words with more than three syllables in a text of the item wording. |
| {A}-number of multi-syllable words | Number of multi-syllable words, i.e., words with more than three syllables in a text of part A of the item wording. |
| average word length (characters) | Average number of characters in words in a text of the item wording. |
| {A}-average word length (characters) | Average number of characters in words in a text of part A of the item wording. |
| longest word length (characters) | Number of characters contained by the longest word in a text of the item wording. |
| {A}-longest word length (characters) | Number of characters contained by the longest word in a text of part A of the item wording. |
| average sentence length (words) | Average number of words in sentences in a text of the item wording. |
| {A}-average sentence length (words) | Average number of words in sentences in a text of part A of the item wording. |
| word length's standard deviation (characters) | Standard deviation of a number of characters in words in a text of the item wording. |
| {A}-word length's standard deviation (characters) | Standard deviation of a number of characters in words in a text of part A of the item wording. |
| number of uncommon words, according to COCA corpus | Number of words in a text of the item wording that appear uncommonly as defined in COCA (Corpus of Contemporary American English) corpus. |
| number of rare words, according to COCA corpus | Number of words in a text of the item wording that appear rarely as defined in COCA (Corpus of Contemporary American English) corpus. |
| frequency of the A1 words (CEFR) | Frequency of words in a text of the item wording at A1 level in CEFR (Common European Framework of Reference for Languages) scale. |
| frequency of the B2-C2 words (CEFR) | Frequency of words in a text of the item wording at B2-C2 levels in CEFR (Common European Framework of Reference for Languages) scale. |
| number of footnotes (hints) in the item | Total number of footnotes or hints in a text of the item wording. |
Additionally, let us assume two vectors of the same length l, i.e., $t_A = (t_{A,1}, t_{A,2}, \ldots, t_{A,l})^T$ and $t_B = (t_{B,1}, t_{B,2}, \ldots, t_{B,l})^T$, where $t_{A,i} = 1$ (or $t_{B,i} = 1$) if and only if text A (text B) contains token i, otherwise $t_{A,i} = 0$ (or $t_{B,i} = 0$), for all i ∈ {1, 2, . . ., l}. The cosine similarity between the parts A and B is $\cos(A, B) = \frac{\sum_{i=1}^{l} t_{A,i} \, t_{B,i}}{\sqrt{\sum_{i=1}^{l} t_{A,i}^2} \, \sqrt{\sum_{i=1}^{l} t_{B,i}^2}}$.