Semantic Indexing of 19th-Century Greek Literature Using 21st-Century Linguistic Resources

Manual classification of works of literature with genre/form concepts is a time-consuming task requiring domain expertise. Building automated systems based on language understanding can help humans achieve this work faster and more consistently. In this direction, we present a case study on the automatic classification of Greek literature books of the 19th century. The main challenges in this problem are the limited number of literature books and resources of that age and the quality of the source text. We propose an automated classification system based on the Bidirectional Encoder Representations from Transformers (BERT) model, trained on books from the 20th and 21st century. We also dealt with BERT's constraint on the maximum sequence length of the input, leveraging the TextRank algorithm to construct representative sentences or phrases from each book. The results show that BERT trained on recent literature books correctly classifies most of the books of the 19th century, despite the disparity between the two collections. Additionally, the TextRank algorithm improves the performance of BERT.


Introduction
The role of cultural heritage in sustainability is mainly perceived in terms of monuments and sites and their role in raising awareness of the landscape or in contributing to the touristic development of the area. Literature can have an important role in this respect, as it helps people understand the cultural context of an era, of a nation, of a monument, of a place, etc., thus enabling them to adopt more inclusive and equitable attitudes and behaviors. From an opposite perspective, the sustainability of cultural heritage through information technology is also very important, as several pieces of cultural content have not yet been fully digitized. This is particularly relevant to works of literature, especially of the past centuries.
Semantic indexing of works of literature using concepts related to their subject, such as genre or form terms, is an important process enabling the work to be searched and retrieved by such concepts. Taking into account that people often exhibit preferences for specific genres [1][2][3], information about genre is a useful search filter for finding literature. Moreover, without indexing with relevant metadata, works of literature are not easy to discover and eventually use, which puts the maintenance and sustainability of such cultural content at risk. A well-known literature classification scheme is the Genre/Form Terms for Library and Archival Materials (LCGFT) [4] , which is used by most libraries around the world [5]. Manual indexing of works of literature with concepts is a time-consuming task that requires domain expertise. Automated indexing systems based on natural language understanding can help humans do this work faster and more consistently.
This paper documents our approach towards the construction of a system for automatically classifying Greek books of the 19th century with genre/form concepts. Our work is part of the ECARLE research project [6], which concerns the semantic enrichment of cultural content. Our work aims to answer the following research questions:
1. Is it possible to create an accurate learning model for automatically classifying Greek books of the 19th century, written in the katharevousa variety of the Greek language, using resources from the 20th and 21st century written in the demotic (modern Greek) variety of the Greek language?
2. Can a representative set of sentences/phrases extracted from each book improve the results of a transformer text classification model, compared to using consecutive parts of the book?
To answer the first question, we experimented with two data sets, one from the 19th century (ECARLE) and one from the 20th and 21st century (OL). We used the BERT model, which has previously achieved state-of-the-art results. We hypothesize that BERT will manage to transfer knowledge from modern Greek to katharevousa, as it models language at the sub-word level and the two Greek language varieties share common sub-words. To answer the second question, we experimented with the TextRank extractive summarizer to extract sentences/phrases, hypothesizing that it will manage to distill the necessary information for genre/form classification from each book.
The rest of this paper is structured as follows. Section 2 presents a comprehensive review of related work and the current state of the art in text classification. Section 3 outlines the methodological approach we used. Section 4 describes the evaluation of our approach, including the experimental setup (parameter tuning and datasets preprocessing), the results on both the OL and the ECARLE datasets, and the discussion of the results. Finally, Section 5 concludes this paper and mentions directions for future work.

State of the Art and Related Work
We first discuss traditional and state-of-the-art methods for text classification in general. Next, we review related work focused on semantic indexing of literature, contrasting them with the approach we adopted in this work.

State of the Art
Traditional approaches to text classification typically involve feature engineering and supervised learning. The most common machine learning algorithms used in this task are decision trees, pattern rule-based classifiers, support vector machines (SVMs), neural networks, and Bayesian classifiers [21], while in the feature selection phase, several evaluation metrics have been explored, such as term frequency, information gain, and chi-square [22].
Recent approaches work directly with the text, using complex neural networks to extract patterns. The authors of [23] use dynamic embedding projection-gated convolutional neural networks, outperforming other approaches on four well-known text classification datasets. A hybrid approach was used by [24], incorporating predictive text embedding and graph convolutional networks for text classification to address the limitations of both methods, enabling faster training and improving performance in small labeled training set scenarios. The study in [25] uses a vector representation based on a supervised codebook for document classification in Nepali. These approaches perform well in text classification, but they have not been tested on a large number of benchmarks and natural language processing tasks.
Transformer-based models generally perform well in many natural language processing tasks, including text classification. A recent study shows that BERT and its variations outperform all the previous machine and deep learning techniques, such as SVMs, naive Bayes, convolutional, and recurrent neural networks [11]. The study shows the results of many text classification tasks and benchmarks, such as sentiment analysis, question answering, and topic classification. Towards this direction, we use the BERT model to classify Greek books based on their form/genre concepts, and we build on previous research to overcome the problem of long documents by leveraging the TextRank algorithm to create appropriate inputs for the model.

Related Work
For genre classification of literary texts, the most common features are based on stylometrics [26], which are extracted by statistical analysis of the texts [27][28][29][30]. The authors of [27] used as style markers (i.e., countable linguistic features) the frequencies of occurrence of the most frequent words of the entire written English language, based on the British National Corpus, to automatically detect the genre of a given text, showing that these frequent words are reliable discriminators of text genre. In the same spirit, we use the TextRank algorithm to extract sentences/phrases, considering that the output of the algorithm can be used as a discriminator of genre/form concepts. Stylometrics, content-based features, and social features were used in [28] for genre classification of German novels. The authors showed that even though topics are orthogonal to genre, the SVMs considering topic-based features achieved the highest accuracy compared to other traditional machine learning algorithms, such as k-nearest neighbors, naive Bayes, and multilayer neural networks. Our study shares the same limitations as this work in terms of the amount of data for training the models. However, we also face an extra obstacle related to the language varieties.
Other studies on analyzing literary texts place more emphasis on the representation of each document as a multidimensional vector, which is given as input to a classifier. Towards this direction, Yu [31] experimented with four different text representations (based on absence/presence of a word, frequency of words, normalized frequency of the words, and idf-weighted frequency of the words) in eroticism classification of Dickinson's poems and sentimentalism classification of chapters in early American novels. The authors of [32] propose the encoding of books as binary vectors, where each dimension corresponds to the existence or not of a character 4-gram in the document. In contrast to these approaches and other similar ones, we use the BERT model, which can both represent the given input as a dense vector and categorize it into a genre/form concept. In this way, the model itself encodes the information needed for solving the task, instead of being affected by human choices on document representation.
In the case of very long documents such as books, the extraction of smaller parts of texts is necessary. In this context, Worsham and Kalita [18] propose different methods to select a representative text for training the learning models, such as extracting the first, last, or random 5K words of a book or 2.5K words of each chapter of the book. We similarly sliced the text into parts, but we also proposed the use of TextRank for constructing more sophisticated sets of sentences/phrases for the classifier.
There are also approaches which categorize books by focusing on other parts of the book, such as the book cover [33], as well as approaches which categorize web documents [34]. These approaches are out of the scope of this paper, as the first one considers text with images or only images, while the second one considers web documents that have a very different structure to literature books.
There are few studies on text classification of Greek literature. Most of them show the language independence of their approach in the genre identification task [8,9,35] or test several feature engineering approaches [7,13]. None of the existing studies dealt with the language variety problem, while, as far as we know, none of them have experimented with text classification of Greek literature leveraging transformer-based models.

Methodological Approach
This section introduces the approach that we used for constructing a 19th-century Greek literature semantic indexing model, as well as the two datasets that are involved in our study.

Approach
Our approach is based on a BERT model that was trained on the Greek language [36]. Specifically, it was trained on the Greek part of Wikipedia, the Greek part of European Parliament Proceedings Parallel Corpus, and the Greek part of OSCAR, a cleaned version of Common Crawl. We fine-tune this model on the semantic indexing task of our case, using the modern Greek books of the OL dataset. We then evaluate it on the 19th-century books of the ECARLE dataset.
As BERT cannot accept very long sequences of text as input, we had to find a way to distill representative content from the books that could both fit as input to BERT and contain useful information for discriminating among the four different literature categories. A common way to deal with this problem is to select parts of a long document, e.g., randomly, up to the desired number of tokens. Another way is to use transformer-based models to represent several parts of the document, which are then used by a traditional supervised model as input. Here, we adopt a simpler method that employs TextRank [14], an extractive summarization algorithm, to distill a small set of representative sentences/phrases from each document. These pieces of text are then given as input to the BERT model for the classification task. Figure 1 illustrates this approach. The long document is given as input to TextRank, which extracts the top N sentences/phrases and passes them to the BERT model. The input sentences/phrases are separated by the [SEP] special token which corresponds to the end of a sentence.
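To make the pipeline concrete, the following is a minimal, self-contained sketch of the idea: a TextRank-style ranker (PageRank over a sentence-similarity graph, in the spirit of [14]) followed by the construction of a [SEP]-separated input string. It is an illustration in plain Python, not the PyTextRank implementation used in the study; the function names and the similarity measure are our simplifications.

```python
import math
import re

def textrank_top_sentences(text, n=5, iters=30, d=0.85):
    """Return the top-n TextRank-ranked sentences, in document order."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?;])\s+", text) if s.strip()]
    words = [set(s.lower().split()) for s in sents]
    k = len(sents)
    # Edge weight: word overlap normalized by log sentence lengths.
    sim = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i != j:
                denom = math.log(len(words[i]) + 1) + math.log(len(words[j]) + 1)
                if denom > 0:
                    sim[i][j] = len(words[i] & words[j]) / denom
    # Power iteration of the PageRank-style update with damping factor d.
    scores = [1.0] * k
    for _ in range(iters):
        new = []
        for i in range(k):
            rank = 0.0
            for j in range(k):
                out = sum(sim[j])
                if sim[j][i] > 0 and out > 0:
                    rank += sim[j][i] / out * scores[j]
            new.append((1 - d) + d * rank)
        scores = new
    top = sorted(sorted(range(k), key=lambda i: -scores[i])[:n])
    return [sents[i] for i in top]

def build_bert_input(text, n=5):
    # Extracted sentences are joined with BERT's [SEP] special token.
    return " [SEP] ".join(textrank_top_sentences(text, n))
```

The output string can then be tokenized and passed to the classifier as a single sequence.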

Data
Literature experts selected 107 Greek books [20] from the library of the Aristotle University of Thessaloniki for the purposes of the ECARLE project. The experts classified these books into eight categories: prose, poetry, letters, essays, lexicons, encyclopedias, magazines, and manuals. The books were already in digital form through scanning. However, extracting their text via OCR proved to be very challenging. Despite employing several state-of-the-art tools, from open source libraries and commercial software to training deep learning models on sample transcribed pages of the books [37], the OCR accuracy was far from satisfactory. Some characters were not recognized correctly, e.g., πέχνην instead of τέχνην (art), or were completely missed, e.g., ςρεθλόν instead of παρελθόν (past), by the OCR. In some cases, the intensity of this phenomenon led to whole words and phrases missing or without meaning. In addition, sometimes the OCR process misinterpreted page headers and footers as normal page text. Due to these difficulties, the extraction was accomplished for only 57 of the books, 29 of which were essays, 15 prose, 11 poems, and 2 manuals.
As the size of the ECARLE dataset was too small for training machine learning models, we constructed a second dataset by leveraging the content of the Open Library, a repository of more than seven thousand Greek digital books distributed freely and legally on the Internet in PDF format. The digital books of the Open Library are classified into 40 thematic categories. In addition, each book is accompanied by one or more tags and metadata, entered by the platform's administrators. The literature books of this repository are classified into 8 main categories: classic literature, novels-novellas, short stories, poems, essays, plays, children's literature, and comics. Two of the four categories of interest exist in this categorization: poems and essays. The novels-novellas and short stories categories were jointly considered as members of the prose category. Lacking a category related to manuals, we considered as manuals the books having the word manual in their metadata.
Typically, the category of each book is also included in the metadata. This is not the case for all books, however. In addition, we observed that some books that were classified in one of the categories of our interest had another one in the metadata. For example, the book ΄Ησουνα κάποτε εδώ (You were once here) belongs to the poetry class, but includes prose in the metadata. To avoid noisy examples, we decided to keep only books that include their category in the metadata and at the same time do not contain another member of our category set in their metadata. Furthermore, we manually removed books that did not contain any readable characters due to the PDF extraction process. The final dataset contains 124 essays, 254 prose books, 177 poems, and 200 manuals from the 20th and 21st century. For extracting the plain text from the PDF files of the Open Library, we used the Python library PDFMiner [38]. Figure 2 illustrates the workflow of creating the two datasets, including the conversion and preprocessing of the original collections. Firstly, we mined the books from the Open Library and the library of the Aristotle University and applied PDF extraction using the PDFMiner tool and OCR conversion, respectively (technical details about the OCR can be found in [37]). Some books of the Open Library did not have readable characters at all, so we manually removed them. Each dataset passes through the TextRank algorithm, using PyTextRank [39] from the spaCy library, to create sentences/phrases for each book. The resulting texts are then tokenized with the BERT tokenizer provided by the transformers [40] Python library. After the tokenization phase, the OL and ECARLE datasets are ready for training and testing. Regarding publication years, there is a gap from 1900 to 1970 between the two datasets, apart from three books which were published in 1917, 1959, and 1963, respectively. All the books in the ECARLE dataset were published before 1900, while most of the books in the OL dataset were published after 2010.
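The noise-filtering rule described above can be expressed compactly. The sketch below is our own illustration, not code from the project (the category names and function name are ours): a book is kept only if its own category appears among its metadata tags and no other category from the target set does.

```python
# Target category set of the study (labels here are illustrative English names).
CATEGORIES = {"essays", "prose", "poems", "manuals"}

def keep_book(category, metadata_tags):
    """Apply the noise-filtering rule: keep a book only if its own category
    is present in the metadata and no conflicting target category appears."""
    tags = {t.strip().lower() for t in metadata_tags}
    if category not in tags:
        return False                                    # category missing from metadata
    return not (tags & (CATEGORIES - {category}))       # no conflicting category present
```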
One important difference between the books of the two datasets, stemming from the different centuries in which they were written, is the variety of the Greek language that they use. Books in the ECARLE dataset are mainly written in the katharevousa variety, while books in the OL dataset are written in modern Greek (the demotic variety). Differences between the two datasets have also been observed in the number of words contained in each book (Figure 4). The books in the OL dataset have approximately three thousand words on average, while the books in the ECARLE dataset have more than four thousand. There are also some outlier books with more than seven thousand words. To count the words in the datasets, we used the el_core_news_lg vocabulary of the spaCy [41] Python library, and we ignored tokens belonging to the PUNCT and SPACE classes, since the first one includes all punctuation marks and the second one all space characters.

Evaluation
This section describes the experimental setup and discusses the results. We first present the datasets that were used for training and testing. Next, we mention the hyperparameter tuning process and present the results on both datasets. Finally, we discuss and explain the results.

Experimental Setup
We split the OL dataset into a train and a test set, in such a way that the distribution of the classes in the test set is the same as in the ECARLE dataset. This allows for a more informative comparison between the accuracy of the model on in-sample modern Greek text and on out-of-sample katharevousa text. As a result, the training set consists of 698 books and the test set of 57. Table 1 presents the number of instances per class and the total number of instances for the train and test sets of the OL dataset, as well as for the ECARLE dataset. TextRank was used in three different variations: we extracted phrases with a rank score greater than 0.01, as well as the top 5/10 sentences. In addition, we experimented with splitting each book into three equal parts and considering the first 256 tokens of each part, as 256 is the maximum sequence length that our BERT model can accept.
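The consecutive-parts baseline mentioned above can be sketched as follows; this is our illustrative reading of the procedure (the function name is ours): the token sequence is cut into three equal parts and only the first 256 tokens of each part are kept.

```python
def three_part_inputs(tokens, max_len=256):
    """Split a book's token sequence into three equal consecutive parts
    and truncate each part to the model's maximum sequence length."""
    third = max(1, len(tokens) // 3)
    parts = [tokens[:third], tokens[third:2 * third], tokens[2 * third:]]
    return [p[:max_len] for p in parts]
```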
To find the appropriate hyper-parameters for fine-tuning BERT to our classification task, we used stratified 5-fold cross-validation on the train set of OL. We followed the instructions of BERT's creators [42] for the fine-tuning process, experimenting with the following sets of parameters: (i) learning rate 2 × 10−5, 3 × 10−5, 5 × 10−5; (ii) batch size 16, 32; and (iii) epochs 2, 3, 4. Table 2 shows the selected hyper-parameters with respect to each different method of input selection for the BERT model, along with the corresponding accuracy of the model. To further test our assumption about the effectiveness of BERT in generalizing beyond the training set, we also experimented with the most common traditional machine learning algorithms that have achieved great performance in text classification. In particular, we experimented with support vector machines (SVMs), naive Bayes (NB), and logistic regression (LG). To give appropriate inputs to the classifiers, we used count vectorization, converting the training/test sets of text documents to a matrix of token counts. As the vocabulary, we used the top 60,000 tokens ordered by term frequency across the training set. Since the entire document can be represented using such a method, we did not experiment with different input methods. We used stratified 5-fold cross-validation to select the models with the highest accuracy, considering a set of parameters for each one. For SVMs, we experimented with the regularization parameter (C) with values (0.1, 1, 10, 100), the kernel coefficient for radial basis function (rbf), polynomial, and sigmoid kernels (gamma) (1, 0.1, 0.01, 0.001), and the kernel type to be used in the algorithm (kernel) (rbf, polynomial, sigmoid, linear). The degree of the polynomial kernel function was fixed to 3 and the tolerance for the stopping criterion to 1 × 10−3.
To support multiclass classification, we used the one-against-one scheme as the multiclass strategy, which performs better than other schemes, such as one-against-all [43]. For NB, we experimented with the additive smoothing parameter (alpha) with values (0.5, 1, 2). Finally, for LG, we experimented with different solvers (newton-cg, lbfgs, sag, and saga), the norm used in penalization (L1, L2), and the inverse of the regularization strength (C) (0.1, 0.5, 1.0). The tolerance for the stopping criterion was fixed to 1 × 10−4. To implement the infrastructure for the traditional machine learning algorithms, we used the Scikit-learn Python library [44]. Table 3 summarizes the results of the classifiers, along with the selected parameters and the mean accuracy over the five folds during validation. To evaluate the performance of the learning models, we used the following measures:
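A stratified k-fold split like the one used here can be generated in a few lines of standard-library Python. The sketch below is a simplified illustration of the idea (the study itself relies on scikit-learn's implementation): the indices of each class are distributed round-robin over the folds, so every fold approximately preserves the class proportions.

```python
from collections import defaultdict

def stratified_kfold(labels, k=5):
    """Yield (train_indices, val_indices) pairs; each validation fold
    approximately preserves the class distribution of `labels`."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for pos, idx in enumerate(idxs):   # round-robin assignment per class
            folds[pos % k].append(idx)
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val
```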

1.
Accuracy counts the correct predictions over the total number of examples:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN correspond to the true positives, true negatives, false positives, and false negatives, respectively.

2.
Kappa coefficient [45] indicates how much better a trained classifier is performing over the performance of a classifier that simply guesses at random according to the frequency of each class. The smaller the value, the more likely it is that the classifier randomly classifies the instances:

K = (c · s − Σ_{k=1}^{C} p_k · t_k) / (s² − Σ_{k=1}^{C} p_k · t_k)

where c is the total number of instances correctly predicted, C the total number of classes, s the total number of instances, p_k the number of times that class k was predicted, and t_k the number of times that class k truly occurs.

3.
F1 score is the harmonic mean of the precision and recall for a class:

F1 = 2 · precision · recall / (precision + recall)

where precision = TP / (TP + FP) and recall = TP / (TP + FN).

4.
Weighted average F1 score estimates the weighted average of the harmonic means of the precision and recall for all classes:

WAF1 = Σ_{i=1}^{C} (n_i · F1_i) / Σ_{i=1}^{C} n_i

where n_i is the number of instances of class i, C the total number of classes, and F1_i the F1 score of class i.
Although accuracy is an appropriate measure for balanced datasets, in our case it can be misleading, since the dataset is significantly imbalanced. F1 score and Kappa coefficient can give us better insights on the outcomes.
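For reference, the four measures can be computed in a few lines of plain Python. This is an illustrative sketch under our own function names, consistent with the definitions above, not the evaluation code of the study.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of correct predictions, i.e., (TP + TN) over all examples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def kappa(y_true, y_pred):
    """Cohen's kappa: agreement beyond the chance level implied by class frequencies."""
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    p_k, t_k = Counter(y_pred), Counter(y_true)   # predicted / true counts per class
    expected = sum(p_k[cls] * t_k[cls] for cls in set(y_true) | set(y_pred))
    return (c * s - expected) / (s * s - expected)

def f1(y_true, y_pred, cls):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores weighted by class support n_i."""
    support = Counter(y_true)
    return sum(n * f1(y_true, y_pred, cls) for cls, n in support.items()) / len(y_true)
```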

Results
Firstly, we present results on the OL test set (Table 4) and the ECARLE dataset (Table 5) in terms of accuracy and kappa coefficient (K), based on the hyper-parameters selected earlier. As expected, the results on the ECARLE dataset are worse than those on the OL test set. The models trained on the first part of the book or on TextRank phrases have high performance on the OL test set. On the ECARLE dataset, the model trained on five sentences extracted by TextRank has the best accuracy (68.42%). The models trained on the second/third parts of the books have the worst performance on the OL test set, with 75.44% and 71.93% accuracy, respectively, while on the ECARLE dataset, the models trained on the second part of the book or on TextRank phrases share the worst performance with 59.65% accuracy. The model trained on 10 sentences extracted by TextRank has high performance on the OL test set with 85.96% accuracy; however, its performance is lower on the ECARLE dataset. Regarding the K score, there is greater agreement between the raters (actual and predicted values) for the model trained on TextRank phrases on the OL test set (0.8140), while on the ECARLE dataset, the model trained on 5 sentences extracted by TextRank has the highest K, equal to 0.5090. Table 6 presents the results on the OL test set and the ECARLE dataset considering the F1 score. The models trained on TextRank phrases/sentences have the highest weighted average F1 score on the OL test set (88.32%) and on the ECARLE one (67.75%). The difference in performance between the two datasets for the model trained on the TextRank 5 sentences is the second-lowest (15.04%). The model trained on the first part of the book has a high weighted average F1 score on the OL test set; however, its performance on the ECARLE dataset is low (55.19%). Finally, the model trained on the third part has the lowest difference between the results on the two datasets (10.13%), but this stems from its lowest performance on the OL test set (74.81%).
We further present the corresponding confusion matrices to show the predictions of the models over the true labels of the books, with and without TextRank. Table 7 shows the confusion matrices for the OL test set. In all cases, the models correctly predict the manuals. Firstly, we observe that only the models trained on an input constructed by TextRank correctly predict all the poems. The model trained on TextRank phrases is the only one that did not misclassify other books as manuals, while the models trained on the second/third parts of the books misclassified 10 and 7 books, respectively, as manuals. The model trained on the first part of the books is biased towards the essay class, while it is the only one that successfully classified 25/29 books as essays. The models trained either on the second or on the third parts of the books are biased towards the manual and prose classes, which explains their low accuracy. Table 8 shows the confusion matrices for the ECARLE dataset. Models trained either on the 5 or the 10 sentences extracted by TextRank correctly predict 9/11 poems, while the model trained on the third part of the book has the worst performance in this class (6/11). The model trained on the first part classifies 26/29 essays correctly, but misclassifies 13/15 prose books as essays. Only the model trained on TextRank phrases classifies 11/15 prose books correctly, while it has the worst performance in the essay class, correctly predicting 16/29 books. Furthermore, the model trained on the third part of the book predicts 1/2 manuals. Table 7. Confusion matrices of the OL test set. Rows correspond to predicted values and columns to the actual ones for the four categories (essays (E), prose (Pr), poems (P), and manuals (M)).
Finally, we present the results with traditional machine learning (ML) algorithms for the OL test set (Table 9) and the ECARLE one (Table 10). As we expected, the algorithms perform very well on the OL test set, since they have previously achieved great performance in a variety of text classification tasks. The LG algorithm outperforms all models with 89.47% accuracy and a WAF1 of 89.84%. However, on the ECARLE dataset, the results are significantly worse. The NB algorithm has the worst performance (19.30% accuracy), while the best algorithm, LG, has 52.63% accuracy.

Discussion
The results indicate that we can build an efficient classifier for Greek books of the 19th century using resources from the 20th and 21st century, since the BERT model classified most of the ECARLE books correctly. We observe that the best model performs well on both the OL test set and the ECARLE dataset. Furthermore, our assumption about the effectiveness of the BERT model has been confirmed, since the alternatives, the traditional machine learning algorithms, performed considerably worse on the ECARLE dataset.
All learning models had high performance during the hyperparameter tuning process with stratified 5-fold validation on the OL training set. We expected good performance of the traditional machine learning algorithms, since they have achieved high accuracy in text classification in general and also because all books are from the same collection and produced by the same extraction method. We also expected good performance of the BERT model since (1) it has achieved state-of-the-art results in many natural language processing tasks; (2) all books are from the same collection, produced by the same extraction method, and written in modern Greek; (3) the BERT model has also been pre-trained on modern Greek texts; and (4) BERT is highly adaptable to downstream tasks.
The TextRank algorithm improves the results of the BERT model. The model fine-tuned on TextRank phrases has the highest weighted average F1 score (88.93%), the highest kappa coefficient (81.40%), and the equal-highest overall accuracy (87.72%). An explanation for the performance of TextRank is that BERT discovered more patterns in the phrases than in the sentences. These are not syntactic and semantic patterns but perhaps simple statistics, such as word and punctuation frequencies. Considering also the fact that most previous approaches achieve high performance using stylometric features, including word and punctuation frequencies [27][28][29][30], the results can be justified. The performance of the models fine-tuned on consecutive parts of the texts is affected by the category distribution in the test set. Although the models trained on the second/third part of the text have high accuracy during hyperparameter tuning (90.12% and 86.93%, respectively), they have the lowest performance on the test set (75.44% and 71.93%, respectively).
We expected the results on the ECARLE dataset for a multitude of reasons. First, we were unable to find a large number of 19th-century books, and the training of the learning models was performed using modern Greek books of the 20th and 21st century. Thus, we expected some failures due to fundamental differences between the language used in 19th-century and 20th-century Greek texts. These failures are more apparent when we use traditional machine learning algorithms. Descriptive statistics showed that there is a chronological gap between the two datasets, as well as a difference between the distributions of the words. The differences stemming from the language itself relate to the so-called Greek language question, between the high variety of katharevousa and the low variety of demotic. This conflict regarding the dominance of one variety over the other took place in the late 19th and the 20th century.
The high variety of katharevousa retained the ancient Greek synthetic character and was established as the official national language of the Greek state in 1827. Regarding literary texts, katharevousa was mostly used in official documents, essays, and prose, while the dominant language variety in poetry was demotic [46]. In the late 19th century, many poets and scholars started defending the use of the demotic variety, establishing the movement of demoticism. The demotic variety was the one spoken by the Greeks of the time. In contrast to katharevousa, demotic is not a synthetic language but an analytic one. In practice, this means that demotic uses more words and phrases to express a meaning than katharevousa. In the 20th century, the Greek language question took on political dimensions; conservatives supported the use of katharevousa, while communists supported the use of demotic. The use of demotic in written texts expanded in the 20th century and became dominant in the 1960s. In 1976, demotic was finally recognized as the official language of the Greek state.
The above observations justify the high accuracy of the TextRank phrases in predicting the prose books (11/15). The input to the BERT model based on TextRank phrases is a concatenation of short, non-consecutive pieces of text, which mitigates the differences between demotic and katharevousa: the extracted phrases do not exhibit the extra words and phrases that demotic books use to express a meaning compared with books in katharevousa. Although prose books were written in katharevousa in the 19th century and in demotic in the 20th century, this difference did not affect the model's performance. By contrast, the model trained on the first part of each book predicts 14/15 prose books correctly in the OL test set, but classifies only 2/15 correctly in the ECARLE dataset.
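The phrase-selection step can be sketched as follows. This is a minimal, self-contained illustration of TextRank over sentences, using the word-overlap similarity of the original TextRank formulation and a plain power iteration in place of a graph library; it is a sketch of the technique, not the exact implementation used in our experiments:

```python
import math

def similarity(a, b):
    """Word-overlap similarity between two sentences, normalized by
    the log of their lengths (as in the original TextRank paper)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))

def textrank(sentences, top_k=3, d=0.85, iters=50):
    """Rank sentences by running PageRank-style power iteration on the
    weighted sentence-similarity graph; return the top_k in book order."""
    n = len(sentences)
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) or 1.0 for row in w]  # guard isolated sentences
    scores = [1.0] * n
    for _ in range(iters):
        scores = [(1 - d) + d * sum(w[j][i] / out_sum[j] * scores[j]
                                    for j in range(n))
                  for i in range(n)]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(top)]  # keep original order
```

The selected sentences are then concatenated (in book order) into a single input string, which keeps the most central material of a book within BERT's maximum sequence length.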
The performance of the models is also explicable for the poems, although many books have been misclassified as poems. The sentences extracted by TextRank classify 9/11 poems correctly. Poems were already written in demotic in the 19th century; thus, the poetry books in the OL dataset have a style of writing similar to those of the ECARLE dataset.
Another observation is that, except for the model trained on TextRank phrases, the models are biased towards essays in the ECARLE dataset. This is evidence that the OL dataset can be used to train models for predicting books from an earlier century. The performance of the models on the essay class is high both on the OL test set and on the ECARLE dataset, despite the small number of essays available during training compared with the other classes (e.g., prose).
The sets corrupted by OCR conversion and PDF extraction do not seem to affect the performance of the learning models: neither the noisy and missing sentences nor the differences between demotic and katharevousa significantly degraded the BERT model. Although the test sets are too small to provide a full explanation of the performance of the BERT model and the TextRank algorithm, we observe that, despite the limitations and obstacles presented in the paper, the models classify an important share of the books correctly.
BERT is known to learn complex features, such as syntactic patterns and semantic dependencies [47]. Considering that previous studies have shown that stylometric features play a key role in genre identification, we believe that BERT manages to learn such features during fine-tuning.

Conclusions and Future Work
This paper addressed the problem of constructing a model for classifying 19th-century Greek literature by genre/form concepts, under the limitations of a small collection of works of literature and low-quality source text. To address these challenges, we compiled a collection of modern Greek books and employed the state-of-the-art BERT model in conjunction with the TextRank algorithm for extracting significant sentences/phrases from each book. We posed two research questions and experimented with state-of-the-art algorithms to answer them. We found that recent books written in modern Greek helped us train efficient models that correctly classify most of the literature books in our target collection, which were written in katharevousa. The assumption that the BERT model can efficiently build such a classifier was confirmed, considering that the traditional machine learning algorithms performed worst on katharevousa. In addition, we found that using TextRank leads to better results than consecutive text parts extracted from the start, middle, or end of each book.
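The fixed-position baselines that TextRank outperformed can be illustrated with a short sketch. Here, plain whitespace tokens stand in for BERT's WordPiece tokens, and the 512-token budget mirrors BERT's maximum sequence length; the function name and details are illustrative, not our exact code:

```python
def excerpt(text, budget=512, part="start"):
    """Return one consecutive window of at most `budget` tokens taken
    from the start, middle, or end of a book's text."""
    tokens = text.split()  # stand-in for BERT's WordPiece tokenizer
    if part == "start":
        window = tokens[:budget]
    elif part == "middle":
        mid = len(tokens) // 2
        lo = max(0, mid - budget // 2)
        window = tokens[lo:lo + budget]
    elif part == "end":
        window = tokens[-budget:]
    else:
        raise ValueError(f"unknown part: {part}")
    return " ".join(window)
```

Each such excerpt spends the whole length budget on one contiguous slice of the book, whereas TextRank fills the same budget with non-consecutive high-scoring sentences drawn from the entire text.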
In future work, we aim to extend our collections of literature books to conduct more data-intensive experiments. Furthermore, we aim to experiment with several extractive and abstractive summarizers to confirm that a set of representative sentences/phrases can carry enough information for training an efficient classifier for this task. Finally, more experiments with traditional machine learning and deep learning models will give us a better perspective on the efficiency of transformer-based models in this task and domain of interest.