Computing the Affective-Aesthetic Potential of Literary Texts

In this paper, we compute the affective-aesthetic potential (AAP) of literary texts by using a simple sentiment analysis tool called SentiArt. In contrast to other established tools, SentiArt is based on publicly available vector space models (VSMs) and requires no emotional dictionary, thus making it applicable in any language for which VSMs have been made available (>150 so far) and avoiding issues of low coverage. In a first study, the AAP values of all words of a widely used lexical databank for German were computed and the VSM’s ability to represent concrete and more abstract semantic concepts was demonstrated. In a second study, SentiArt was used to predict ~2800 human word valence ratings and was shown to have a high predictive accuracy ($R^2$ > 0.5, p < 0.0001). A third study tested the validity of SentiArt in predicting emotional states over (narrative) time using human liking ratings from reading a story. Again, the predictive accuracy was highly significant ($R^2_{adj}$ = 0.46, p < 0.0001), establishing the SentiArt tool as a promising candidate for lexical sentiment analyses at both the micro- and macrolevels, i.e., short and long literary materials. Possibilities and limitations of lexical VSM-based sentiment analyses of diverse complex literary texts are discussed in the light of these results.


AI 2020, 1, 11-27; doi:10.3390/ai1010002; www.mdpi.com/journal/ai

Introduction
Emotion recognition is a vital aspect of daily human life, important for survival, social, or professional reasons. However, only very recently, in evolutionary terms, has it become a challenge for both human readers and computer algorithms to read out emotional information from (literary) texts, e.g., when using machine-learning-assisted sentiment analysis (SA) tools. Perhaps more than other objects of culture, written texts can induce emotions, since narratives are inseparable from the emotional content of the plots [1,2]. These emotions or sentiments can determine the most ubiquitous and basic affective decision of daily life, namely deciding whether we like or dislike something/somebody [3,4]. What we read about something or somebody can also determine our behavior, e.g., choosing a movie, buying a book, or voting for someone. SA can be defined as 'the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc., is positive, negative, or neutral' (Oxford English dictionary: https://www.lexico.com/en/definition/sentiment_analysis). According to Liu (2015), an opinion is a quintuple $(e_i, a_{ij}, s_{ijkl}, h_k, t_l)$, where $e_i$ is a named entity (e.g., Abraham), $a_{ij}$ an aspect of $e_i$ (e.g., a word or phrase expressing an aspect such as 'Abraham's son is sad'), $s_{ijkl}$ the sentiment on aspect $a_{ij}$ (e.g., a valence value or a discrete emotion label such as 'sad'), $h_k$ the opinion holder, and $t_l$ the time of the opinion-expressing event. Although the majority of SAs today are applied to book or movie reviews, with several gold standards allowing one to evaluate the SA tool's performance [5], an increasing number of studies apply SA to literature and poetry [6-22].
The standard and most straightforward SA approach nowadays in natural language processing (NLP), digital humanities, computational linguistics and stylistics, psychology, or neurocognitive poetics is the lexical one: it simplifies complex emotional information analysis to a vocabulary-based computation of the polarity, valence, or some other sentiment variable of single keywords contained in the sentences of the text. Following Miller's psycholinguistic doctrine [23], whoever wants to understand how larger text segments can induce emotional processes must start with those basic units at which all relevant processes and representations in language use come together: single words [24]. Words, as has long been known [25,26], are embodied stimuli with the potential to elicit overt and covert sensorimotor and affective responses [27]. They can even 'stink', suggesting that the affective processes we experience when reading rely on the reuse of phylogenetically ancient brain structures that process basic emotions in other domains and species [28].
Regarding lexical SA, Bestgen's pioneering study [8] already suggested that lexical valence can predict the affective tones of sentences and entire texts quite well, and there is also considerable recent evidence for the usefulness and empirical validity of the lexical approach to SA using different tools like VADER [5], HU-LIU [29], or SentiArt [13]. Like most other tools in the field, the former two are both based on word lists containing human rating data, i.e., what are sometimes called emotional dictionaries or prior-polarity lexicons [30]: VADER uses ~7,500 entries (https://www.kaggle.com/nltkdata/vader-lexicon) and HU-LIU ~6,800 (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon). In contrast, SentiArt uses an unsupervised learning approach introduced by Turney [31], which is based on vector space models (VSMs) and a label list representing prototypes of positive and negative semantic orientation or emotional valence, such as the labels GOOD, NICE vs. BAD, NASTY [32]. Optimally, word-list-based methods should cover every (content) word, or at least as many as possible, in the test texts to be 'sentiment analyzed' in order to increase both the reliability and validity of the tool. Practically, however, such tools often run into problems for several reasons. First, when dealing with highly literary or ancient text materials, the word lists' overall coverage or hit rate can sink below 50%, making the sentiment analysis unreliable. An example is given in Study 3 below, in which two word-list-based methods that are compared with SentiArt yield suboptimal results due to their low coverage when applied to a classical text in German, E.T.A. Hoffmann's (1816) The Sandman.
Secondly, if there are no or only limited word lists available in the language of a researcher's country, simply translating existing English lists into that language without empirical cross-validation is problematic [33] since sensitivity to emotional content varies across languages, which differ considerably in their emotion vocabularies [34][35][36]. Collecting human rating data to create new word lists in other languages or to enlarge existing English ones is costly, but most importantly, there are serious methodological and epistemological issues about the reliability and validity of human sentiment ratings when they are turned from a dependent variable (i.e., a 'subjective' behavioral measure in response to a stimulus) into an independent variable (i.e., an 'objective' predictor of say the positivity of a text [37,38]).
The big advantage of VSM-based methods like SentiArt is that they avoid these problems: (i) they require no word lists based on human ratings; (ii) thanks to the public availability of VSMs in >150 languages (https://fasttext.cc/docs/en/pretrained-vectors.html), they can be applied to a multitude of texts from different countries, even in special dialects; and (iii) by creating task- or domain-specific VSMs, they can be flexibly adapted to different research purposes, e.g., predicting human behavior of participants reading children's books or Shakespeare sonnets [39,40]. The next section describes the workflow and exact procedure of SentiArt.

SentiArt
However, as mentioned above, some users might want to create their own task-specific VSM, e.g., because they have reasons to think that the wiki.en VSM is not the optimal one for their test texts at hand. In this case, the workflow for applying SentiArt is as follows:
1. Selection and evaluation of an appropriate VSM, e.g., one can use the procedure described on the fasttext homepage (https://fasttext.cc/docs/en/pretrained-vectors.html) to directly download the (German) VSM called 'wiki.de.vec', providing 300d sublexical vectors for each of >2 million words (e.g., in the original uncleaned version [41]).
2. Selection and evaluation of an appropriate label list, e.g., for valence, one could use the model-based label lists empirically validated in [13,42].
3. Computation of AAP and evaluation of predictive accuracy, i.e., cross-validation with empirical data (e.g., human ratings).
Each of these three steps involves multiple choices, which influence the results of the SA and will be described in detail below.
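Step 1 of the workflow can be illustrated with a minimal sketch. It assumes the plain-text '.vec' layout used by fastText (a header line with vocabulary size and dimensionality, then one word and its vector per line). The function names `load_vec` and `clean` and the toy data are hypothetical and are not part of SentiArt; the cleaning rule mirrors the one stated in the note to Table 1 (dropping words with non-alphabetic characters).

```python
import io


def load_vec(handle, limit=None):
    """Parse a fastText-style text file: 'count dim' header, then 'word v1 ... vd' rows."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed rows, e.g. tokens that contain spaces
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # optional cap so that huge files can be sampled quickly
    return dim, vectors


def clean(vectors):
    """Keep only words made up entirely of alphabetic characters."""
    return {w: v for w, v in vectors.items() if w.isalpha()}


# Toy stand-in for a real file; the numbers are invented for illustration.
toy_file = io.StringIO(
    "4 2\n"
    "gut 0.9 0.1\n"
    "schlecht -0.8 0.2\n"
    "Haus 0.0 1.0\n"
    "E-Mail 0.3 0.3\n"
)
dim, vsm = load_vec(toy_file)
cleaned = clean(vsm)
print(dim, sorted(cleaned))
```

With a downloaded model, the same function could be called on an opened file, e.g. `load_vec(open("wiki.de.vec", encoding="utf-8"), limit=100000)`, where the `limit` is an arbitrary choice made here to keep memory use small.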

The Present Study
In this paper, we test the accuracy and validity of SentiArt and its underlying computational tools by applying it to different test materials, and we discuss possibilities and limitations of lexical SA of complex literary texts.

Study 1. Selecting and Evaluating the VSMs
Two crucial ingredients of VSM-based SA are the training corpus and the VSM. The VSM (a $v \times d$ matrix, where $v$ is the size of the vocabulary and $d$ the dimensionality) is always based on a training corpus, and the choice of the latter will affect the quality and utility of the VSM for the SA purposes at hand. For example, the publicly available VSM called 'german.model' (GM, with GM $\in \mathbb{R}^{610k \times 300}$; https://devmount.github.io/GermanWordEmbeddings/) was trained with the 'word2vec' algorithm [43] on the German Wikipedia and news articles of a single day in a specific year (15 May 2015). Thus, the VSM incorporates choices regarding the size, representativeness, or specificity of the training corpus, all of which will influence the quality and validity of the VSM used to compute the semantic relatedness values, which are crucial for establishing the 2D emotion potential space and thus the results of the SA (for an overview of relevant training corpora, see [12]).
For the present purpose, we chose the widely used, publicly available, and ecologically valid subtlex database for German [44] as a reference lexicon, which allowed an evaluative comparison between three VSMs using an identical set of items. The ~120k words of subtlex overlap sufficiently with those of a large range of both non-literary and literary texts to obtain stable SA results. We further chose three publicly available German VSMs as a basis for 'sentiarting' the ~120k words of subtlex (i.e., assigning VSM-based valence and AAP values to each word); the resulting table was then used to predict the valence and AAP values of the words of our test texts. Each VSM was evaluated using a face-validity approach based on the t-distributed stochastic neighbor embedding (tsne) algorithm [45], as well as a cross-validation procedure using human valence rating data from the Berlin Affective Word List (BAWL) [46,47]. Table 1 summarizes the data for the three VSMs. (Note to Table 1: The cleaning procedure simply deleted all words containing non-alphabetic characters.)

VSM Evaluation
When human rating or other empirical data are available, the validity of a VSM and the VSM-based SA can straightforwardly be cross-validated [13,49-52]. If such data are not available, face validity tests, e.g., using semantic arithmetic experiments, are a viable option. The model evaluation tests proposed for the 'german.model' are exemplary in this regard, including multiple systematic semantic arithmetic and syntactic tests. Figures 1-3 show evaluations of the three VSMs from Table 1 using the tsne algorithm. The idea behind this tsne-based evaluation is that if the VSM is any good for the present purposes, concepts obviously related to each other, such as the emotionally rather neutral CATS and DOGS ('Katze', 'Hund'; Figure 2), or the more emotionally valenced concepts in Figure 3 (e.g., DISGUST/'Ekel'), should be separated but relatively close in semantic space, while a concept like 'house' should be clearly apart. This is indeed the case for all three VSMs in German.
The results in these three Figures can be summarized as follows. First, the data show that, as expected, overall each VSM generates distinct semantic neighborhoods as represented by the tsne method. In Figure 1a, GM produces STORY ('Geschichte') and MOVIE ('Film') as the closest semantic neighbors of the target word BOOK ('Buch'); for SDEWAC it is MOVIE ('Film') and THEME ('Thema'), while WIKI produces OPUS ('Werk') and THEME ('Thema'). The data for the list of rather neutral words in Figure 2 (WOMAN, MAN, HOUSE, CAT, DOG) suggest that with increasing VSM size, the concepts corresponding to these words become better separated in the 2D computational semantic space. Thus, while in GM (Figure 2a,b) the concepts CAT ('Katze', yellow dots) and DOG ('Hund', red dots) still widely overlap and are close to WOMAN ('Frau', magenta dots) and MAN ('Mann', cyan dots), in both SDEWAC (Figure 2c,d) and WIKI (Figure 2e,f) they are clearly apart from each other and from the concept HOUSE ('Haus', green dots). The violin plots show semantic neighborhood density (snd) for each concept, as quantified by the average cosine of the target item with its N nearest neighbors (where N was set to 50 here) [53].
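The snd measure just described (average cosine between a target and its N nearest neighbors) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the 2D toy vectors are invented, the VSM is represented as a plain dictionary, and a brute-force search over the whole vocabulary stands in for whatever neighbor search was actually used.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def snd(target, vsm, n=50):
    """Mean cosine between `target` and its n most similar words in `vsm`."""
    sims = sorted(
        (cosine(vsm[target], vec) for word, vec in vsm.items() if word != target),
        reverse=True,
    )
    top = sims[:n]  # fewer than n neighbors are used if the vocabulary is small
    return sum(top) / len(top)


# Invented 2D toy vectors; real VSMs have 300 dimensions.
toy_vsm = {
    "Katze": [1.0, 0.0],
    "Hund": [0.6, 0.8],
    "Haus": [0.0, 1.0],
}
print(snd("Katze", toy_vsm, n=1), snd("Katze", toy_vsm, n=2))
```

For a vocabulary of the size reported here, a vectorized implementation would be preferable; the brute-force loop is chosen only for readability.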
Finally, the emotional word list in Figure 3 corresponds to the five negative labels from the 'Ekman99' model [42], i.e., DISGUST ('Ekel', magenta), EMBARRASSMENT ('Verlegenheit', cyan), FEAR ('Angst', green), SADNESS ('Traurigkeit', orange), and SHAME ('Scham', red). The data show that the conceptual overlap is much larger than for the neutral words in Figure 2, a finding that can be expected given the relatively abstract nature of emotion terms compared to concrete categories like DOG. As can also be seen in Figure 3b,d,f, the three VSMs produce different snd values with different distributional shapes, although the concept SADNESS (orange) appears to be the 'clearest' (i.e., highest snd) in all three VSMs. On the basis of these descriptive data in Figures 1-3, one can expect notable differences between the three VSMs with regard to predictions concerning the valence and AAP of texts. The next study describes the computation of the AAP together with the evaluation of the label lists.

Study 2. Computation and Validation of Lexical Valence and AAP Values
Each word of the subtlex database was 'sentiarted' as follows. Using the vectors from the three models summarized in Table 1, we computed the valence values based on the semantic relatedness (estimated via the cosine similarity between two vectors) between each word in subtlex and the theoretically motivated and empirically validated 'Ekman99' emotion labels [42]. The model-based valence of a test word, $v(w)$, is computed according to Equation (1), i.e., as the difference between two average similarity values: first, the average similarity ($s$) between the test word ($w$) and the seven ($m$) positive emotion labels ($l_{pos,1}, \dots, l_{pos,7}$ = CONTENTMENT, HAPPINESS, PLEASURE, PRIDE, RELIEF, SATISFACTION, SURPRISE), and, second, the average similarity between the test word and the five ($n$) negative emotion labels ($l_{neg,1}, \dots, l_{neg,5}$ = DISGUST, EMBARRASSMENT, FEAR, SADNESS, SHAME):

$$v(w) = \frac{1}{m}\sum_{i=1}^{m} s(w, l_{pos,i}) - \frac{1}{n}\sum_{j=1}^{n} s(w, l_{neg,j}) \qquad (1)$$

The similarity between a word and a label, $s(w, l)$, is computed as the cosine between the 300d vectors for word and label, as given by the VSM, shown in Equation (2):

$$s(w, l) = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2}\,\sqrt{\sum_{i} B_i^2}} \qquad (2)$$

where $A_i$ and $B_i$ are the components of the vectors for word and label, respectively. As an example, using the SDEWAC VSM, the computational valence, $v(w)$, for the theoretically most positive test word 'reizvoll' (APPEALING) in the subtlex database yielded an average similarity with the seven positive labels of 0.23 and an average similarity with the five negative labels of 0.19, resulting in a theoretical valence of 0.23 - 0.19 = 0.04.
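The valence computation described above (mean cosine with the positive labels minus mean cosine with the negative labels) can be sketched as follows. This is a hedged illustration, not the SentiArt code itself: the function names, the 2D toy vectors, and the single-label lists are invented, whereas the actual procedure uses seven positive and five negative labels with 300d vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_similarity(word, labels, vsm):
    """Average cosine between `word` and every label in `labels`."""
    return sum(cosine(vsm[word], vsm[label]) for label in labels) / len(labels)


def valence(word, vsm, pos_labels, neg_labels):
    """Difference between mean positive and mean negative label similarity."""
    return mean_similarity(word, pos_labels, vsm) - mean_similarity(word, neg_labels, vsm)


# Invented 2D toy vectors and a one-label-per-pole list, for illustration only.
toy_vsm = {
    "reizvoll": [0.8, 0.6],
    "Freude": [1.0, 0.0],
    "Angst": [0.0, 1.0],
}
toy_valence = valence("reizvoll", toy_vsm, pos_labels=["Freude"], neg_labels=["Angst"])
print(round(toy_valence, 2))

# The reported example for 'reizvoll' uses the same subtraction: 0.23 - 0.19.
print(round(0.23 - 0.19, 2))
```

The same function would apply unchanged to arousal or AAP once the corresponding label lists are substituted.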
The same procedure is applied for the computation of arousal and AAP values for each test word. For the latter, we used the extended 120-label list [12,48], which is given in the Appendices of both papers. All these values are summarized in an '.xlsx' table available via email (ajacobs@zedat.fu-berlin.de). The SDEWAC sheet, for example, gives the valence, arousal, and AAP values for each of the ~120k words from the subtlex database that overlap with the SDEWAC words. (The original German subtlex database has ~200k words, but here we only used those for which a spelling check had been made (~125k). Of these words, ~115k overlap with those in the wiki.de VSM, ~90k with the german.model, and ~120k with SDEWAC.)

Label List Evaluation Predicting Human Valence Rating Data
Among the 12 models based on psychological emotion theories that Westbury [42] tested as candidate label lists, the 'Ekman99' [54] model with the 12 labels outlined above was the winner, accounting for about 34% of the variance in a validation set of >10,000 human valence ratings [42,55]. Here, we used publicly available German valence rating data from the BAWL [46,47], a very successful tool that has been applied in >100 studies in different fields of research [10], to test the validity of the translated 12 'Ekman99' labels when used with the three VSMs. It is worth noting that the same validation procedure should, in principle, be applied when using alternative word-list-based SA tools, i.e., before using them, one should test how well they predict human ratings from an independent database, such as the BAWL or the one by Warriner [55]. As far as we can tell, such a cross-validation procedure is not yet standard practice, though.
The data in Figure 4 establish an interesting novel finding: the best predictor of human valence rating data, at least for the German BAWL, is not computational valence based on the 'Ekman99' label list, but AAP based on the AAP list [48], the latter accounting for more than twice as much variance as the former. In a way, this is not astonishing, since the former uses only 12 labels and the latter uses 120 (60 positive and 60 negative items, including almost all 'Ekman99' labels), thus making it much less context sensitive and more accurate. This is true for all three VSMs. The VSM yielding the best performance ($R^2$ > 0.5) is SDEWAC (middle panel), followed by WIKI and GM. Model performance can be increased to $R^2$ > 0.6 when reducing the rating data to those items which have the highest inter-rater agreement, e.g., items with a standard deviation of ≤ 0.9. Even without this restriction, compared to the results of Westbury [42], an $R^2$ > 0.5 appears pretty good. Human valence ratings are very likely based, to a yet unknown extent, on information retrieved from semantic memory which is of experiential/embodied origin, as opposed to distributional semantics [2,56]. Perhaps the simplest assumption would be to consider this 'embodied part' to account for ~50% of the variance. If this held, accounting for ~50% of the variance by means of distributional semantic models would seem very promising to us. Given these cross-validation results, we used this best-fitting VSM (SDEWAC) for all following SAs.
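The cross-validation logic (predicting human ratings from computed values, then restricting the items to those with high inter-rater agreement) can be sketched as below. The sketch assumes a simple linear regression with one predictor, for which $R^2$ equals the squared Pearson correlation; all item values are invented and the function names are hypothetical. Only the SD threshold of 0.9 is taken from the text.

```python
def r_squared(x, y):
    """R^2 of a one-predictor linear regression (squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when a variable has zero variance")
    return sxy ** 2 / (sxx * syy)


# (word, computed AAP, mean human valence rating, SD of the ratings); all invented.
items = [
    ("w1", -1.0, -2.0, 0.5),
    ("w2", 0.0, 0.1, 1.4),
    ("w3", 0.5, 1.0, 0.8),
    ("w4", 1.0, 1.5, 0.7),
    ("w5", -0.5, 0.5, 1.2),
]


def fit(rows):
    """R^2 for predicting the human rating (column 2) from computed AAP (column 1)."""
    return r_squared([row[1] for row in rows], [row[2] for row in rows])


all_items_r2 = fit(items)
high_agreement = [row for row in items if row[3] <= 0.9]  # threshold from the text
high_agreement_r2 = fit(high_agreement)
print(len(items), all_items_r2, len(high_agreement), high_agreement_r2)
```

Any improvement seen on these toy rows says nothing about the real data; the example only shows where the agreement filter enters the computation.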

Study 3. Predicting Human Liking Ratings and Emotional States over Time with Different AAP Indices
Using the look-up table approach, the computation of the mean AAP of stories or book chapters is straightforward: it comes down to simply cross-referencing a list of words representing a book chapter (or a sentence or paragraph) with the corresponding words in the 'sentiarted' subtlex table, as described above.
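The look-up step can be illustrated with a minimal sketch that also reports the hit rate, i.e., the share of tokens found in the table. The table entries, the tokens, and the function name `mean_aap` are invented for illustration; a real application would load the 'sentiarted' subtlex table instead.

```python
def mean_aap(tokens, table):
    """Return (mean AAP of the tokens found in `table`, hit rate over all tokens)."""
    hits = [table[token] for token in tokens if token in table]
    hit_rate = len(hits) / len(tokens) if tokens else 0.0
    mean = sum(hits) / len(hits) if hits else None  # None when nothing is covered
    return mean, hit_rate


# Invented AAP values standing in for the look-up table.
toy_table = {"dunkel": -0.5, "Auge": 0.1, "schön": 1.0}
segment = ["dunkel", "Auge", "Coppelius", "schön"]

segment_mean, segment_hit_rate = mean_aap(segment, toy_table)
print(segment_mean, segment_hit_rate)
```

Reporting the hit rate alongside the mean makes it easy to see when an estimate rests on only a small fraction of a segment's words.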

Evaluation of the AAP Construct Predicting Human Liking Ratings
The superior predictive validity of the present AAP construct (over the valence construct) was established using human valence rating data (Figure 4). Here, we tested it a second time against human liking ratings from a reading study using E.T.A. Hoffmann's The Sandman, a prototypically uncanny narrative from 1816 representative of the 'black romantic' that evokes feelings of suspense and immersion in readers [14,57]. (In this study, 20 participants first read the story 'The Sandman' (divided into 65 segments of approximately equal length; M = 105.5 words; SD = 26.1 words) on paper in one go. Afterwards, they had to answer five comprehension questions to ensure that they had actually read and understood the story. Finally, the novella was returned to them and they rated each of the 65 sections separately on a computer on different scales (liking scale = 1-7). All ratings referred to subjects' experience during the first reading, which was explicitly pointed out to them. The entire experiment lasted between 90 and 140 min depending on the reader. The data were averaged across readers.) For comparison, the liking ratings were predicted by three different predictors (Figure 5): (i) empirically measured BAWL valence ratings, (ii) SentiWS ratings, and (iii) AAP values.
While all $R^2$ values are moderate, the model fits are highly significant and, most importantly, the fit for the two empirically derived predictors (BAWL and SentiWS ratings) is not better than the one for AAP; on the contrary. Presumably, part of the superior performance of the AAP method lies in the fact that the hit rate (content words only) of the other two is low, which makes their estimates unreliable (SentiWS ~15%, BAWL ~30%, AAP ~90%). Together with successful previous applications of SentiArt in the prediction of word beauty ratings [48] or the classification of text segments from the Harry Potter books [13,59], these data establish the AAP, as computed by SentiArt with the SDEWAC VSM, as a viable alternative to using human rating data as predictors of other human rating data, an often costly and, in general, epistemologically questionable method [37,38].

Predicting Emotional States Over (Narrative) Time
One challenge addressed by the present research topic of this journal is defined by the fact that 'modeling and predicting the emotional state over time is not a trivial problem, because continuous data labeling is costly and not always feasible. This is a crucial issue in real-world applications, where the labeling of the features is sparse and eventually describes only the most prominent emotional events'. Stories or books are natural candidates for analyzing emotional states over time, since they offer the possibility to plot the emotion potential across different chapters or other units of narrative time. It should be noted, though, that in many cases narrative time is not linear and thus cannot always be directly compared to the results provided by the present approach. (We are grateful to an anonymous reviewer for mentioning this.) However, for reasons of comparability, here we followed the standard procedure proposed by successful macroanalytic approaches for analyzing emotional time series of entire books, such as the 'hedonometer' [9] or the 'Syuzhet' package [15]. These methods aggregate lexical SA information across large units of text in a linear way (e.g., 10,000 words for the hedonometer [60]).
While not all literary texts are equally well suited to such macroanalyses, the emotion potential method proposed in an earlier paper [12] can also be reliably applied to small text units, i.e., for microanalyses, such as a single Shakespeare sonnet (~115 words) or short text segments of ~100 words like the present ones from The Sandman (overall length ~7000 words). Such smaller texts are well suited to empirically cross-validate the theoretical predictions derived from SA tools by collecting (quasi-)continuous rating data. At least two such datasets have been examined in previous studies [57,61], which were not concerned with computational SA, though.
Using the above data from The Sandman study, we next tested the prediction of human liking ratings over time, i.e., the evolution of the story across the 65 segments, by different AAP indices. When readers judge the emotional content of subsequent coherent pieces of text, which text features they actually use for that complex decision is still a big open question for both NLP and neurocognitive poetics approaches [3,12]. Thus, focusing on the lexical level, we do not yet know whether readers take into account all words of the text or just a few key words, and whether or not they pay attention to word forms (conjugation, inflection, derivation) and/or recurrent words (integrating word frequency information). If they do not take into account each and every word, other open questions are whether content words count most (or exclusively), or whether words with extreme valence values weigh more. To shed some light on this issue, we tested several models here. For each of the four models, we further computed several AAP indices: (i) mean AAP (mean $R^2_{adj}$ = 0.32); (ii) frequency-weighted mean AAP (mean $R^2_{adj}$ = 0.28); (iii) lens mean AAP (mean $R^2_{adj}$ = 0.05; the 'lens' option was proposed by Dodds [9] to obtain a strong signal by only keeping words residing in the tails of the valence distribution; here we took all words into account for which AAP was <25% or >75% of the distribution); and (iv) frequency-weighted lens mean AAP (mean $R^2_{adj}$ = 0.11). The AAP values resulting from all 4 × 4 = 16 computing methods were then used as predictors of the human liking ratings in 16 linear regression models. The mean $R^2$ values for the models and indices indicated above suggest that using lemmata or lens extremes did not help the present SA. The winning model C was based on the simple mean of the AAP values for all unique content words without frequency weighting ($R^2_{adj}$ = 0.34), closely followed by the frequency-weighted variant ($R^2_{adj}$ = 0.33).
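The four AAP indices can be illustrated with a small sketch. Several details are assumptions made here rather than facts from the study: the 'lens' cut-offs are taken as the first and third quartiles of the whole look-up table, the unweighted mean is taken over unique words, and the table, tokens, and function name are invented.

```python
import statistics
from collections import Counter


def aap_indices(tokens, table):
    """Compute four illustrative AAP indices for one text segment."""
    counts = Counter(token for token in tokens if token in table)
    if not counts:
        return None
    # Assumed lens rule: keep words below Q1 or above Q3 of the table's AAP values.
    q1, _, q3 = statistics.quantiles(list(table.values()), n=4)
    lens = {w: c for w, c in counts.items() if table[w] < q1 or table[w] > q3}

    def unweighted(words):
        return sum(table[w] for w in words) / len(words) if words else None

    def weighted(word_counts):
        total = sum(word_counts.values())
        if total == 0:
            return None
        return sum(table[w] * c for w, c in word_counts.items()) / total

    return {
        "mean": unweighted(list(counts)),
        "freq_weighted_mean": weighted(counts),
        "lens_mean": unweighted(list(lens)),
        "freq_weighted_lens_mean": weighted(lens),
    }


# Invented AAP values and tokens, for illustration only.
toy_table = {"a": -2.0, "b": -1.0, "c": 0.0, "d": 1.0, "e": 2.0}
toy_tokens = ["a", "c", "c", "d", "unknown"]
indices = aap_indices(toy_tokens, toy_table)
print(indices)
```

Each index would then serve as the single predictor in its own linear regression on the segment-wise liking ratings.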
Interestingly, this suggests that, at least for the present short segments of a mystery story, readers seem to have focused on content words but not to have relied much on a cumulative AAP value, largely ignoring how often a given content word occurred. Of course, readers very likely also use inter- or supralexical information in their liking ratings of literary texts [3], which would explain the moderate R² values that leave about 70% of the variance unaccounted for. Still, the data in Figure 6 look very promising in showing the potential of a purely lexical micro-SA for predicting emotional states over narrative time.
Figure 6 shows the smoothed average (SMA, window size = 5) curves for the rating data (in blue) and the corresponding AAP values computed with the winning model C (in red). The synchrony between the curves is high (adjusted R² = 0.46, p < 0.0001), suggesting that mean AAP for content words is a useful option for predicting the temporal dynamics of human liking ratings. The adjusted R² of about 50% sets an upper bound for more sophisticated SA tools that take into account, e.g., aspect-based SA [62] or inter- and supralexical text features [12].
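The smoothing and fit statistic behind such a comparison can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used for Figure 6: "SMA" is read here as a simple moving average, the window is applied without padding (so the smoothed series is shorter than the input), and the helper names `sma` and `adjusted_r2` are hypothetical.

```python
from statistics import mean


def sma(series, window=5):
    """Simple moving average; returns len(series) - window + 1 values."""
    return [mean(series[i:i + window]) for i in range(len(series) - window + 1)]


def adjusted_r2(x, y):
    """Adjusted R^2 of a one-predictor least-squares regression of y on x."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    # Penalise for the single predictor: 1 - (1 - R^2)(n - 1)/(n - p - 1), p = 1.
    return 1 - (1 - r2) * (n - 1) / (n - 2)


# Toy usage with invented per-segment values (not the study data).
ratings = [3.1, 3.4, 2.9, 2.5, 2.8, 3.6, 3.9, 3.2]
aap = [0.2, 0.3, 0.1, -0.1, 0.0, 0.4, 0.5, 0.2]
print(adjusted_r2(sma(aap), sma(ratings)))
```

Note that smoothing both series before regressing one on the other tends to raise the fit relative to the unsmoothed segment-level values, which is consistent with the higher value reported for the smoothed curves.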

Summary, Discussion, Limitations and Outlook
The general face validity of three publicly available German VSMs for representing lexico-semantic concepts was established using a t-SNE approach. The VSMs were then used in the SentiArt algorithm to compute the valence and AAP values of the ~120 k words of a German-language database (SUBTLEX). In a first cross-validation study, it was shown that the computational AAP values predicted ~2800 human valence ratings from the BAWL better than the computational valence values, establishing the SDEWAC VSM as the best-fitting of the three VSMs (R² > 0.5, r = 0.72, p < 0.0001). A second cross-validation study showed that the computational AAP values predicted human liking ratings from an empirical study in which participants read the story The Sandman better than empirically obtained valence ratings from the BAWL. It also showed that the time course of human liking ratings was well predicted by the AAP values (r = 0.65, p < 0.0001).
In sum, the present studies establish SentiArt's AAP variable as a useful predictor of human valence ratings of single words (BAWL) and liking ratings for story segments (The Sandman). The predictive validity of the former (R² ~0.52) was higher than for the latter (R² ~0.23). This could be expected, since features other than lexical ones influence the complex ratings of entire segments or paragraphs. One example is interlexical features, which concern the relation between two or more words in a line, sentence, stanza, or paragraph and may well represent dynamic changes or contrasts in readers' affective experience [23,63]. Thus, the interlexical features valence span and arousal span (i.e., the range of valence or arousal values, respectively, of single words across a text segment) are indicative of emotional shifts in a piece of text that can influence readers' mood and indicate an update of the mental situation model [64]. Affective responses to texts can be seen as the dynamic attribution of emotional valence and arousal to every state of the (text) world that an adaptive agent (reader) might visit [23,65]. Valence and arousal spans appear to be appropriate interlexical features serving as proxies for such a dynamic. Indeed, empirical evidence shows that a strong variation in lexical arousal in a piece of text can lead readers on an emotional rollercoaster, as indicated by online measures of heart-rate variability, brain activity, or liking ratings [57,61,66].
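As a minimal illustration of the span features just defined (the range of word-level values within a segment), the following Python sketch computes valence and arousal spans per segment. The function name `segment_spans`, the lexicon format, and the toy values are hypothetical; treating segments without any covered word as missing (None) is likewise an assumption.

```python
def segment_spans(segments, lexicon):
    """Return (valence_span, arousal_span) per segment, or None if no word is covered.

    segments : list of token lists (hypothetical pre-tokenised segments)
    lexicon  : dict mapping word -> (valence, arousal) (hypothetical values)
    """
    spans = []
    for tokens in segments:
        scored = [lexicon[t] for t in tokens if t in lexicon]
        if not scored:
            spans.append(None)  # no covered word, so no span can be computed
            continue
        valences = [v for v, _ in scored]
        arousals = [a for _, a in scored]
        # Span = range (maximum minus minimum) of the word-level values.
        spans.append((max(valences) - min(valences),
                      max(arousals) - min(arousals)))
    return spans


# Toy usage with invented valence/arousal values.
toy_lexicon = {"joy": (2.0, 1.0), "fear": (-2.5, 3.0), "table": (0.1, 0.5)}
print(segment_spans([["joy", "fear", "table"], ["table"]], toy_lexicon))
```

A segment mixing strongly positive and strongly negative words yields a large valence span, whereas a segment with a single covered word yields a span of zero.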
Of course, supralexical features will also affect the emotional experience when reading an entire text. The supralexical features given as examples in Jacobs' 4 × 4 matrix for QNAs [23] (global swing at the metric level, global affective meaning at the phonological level, syntactic complexity at the morpho-syntactic level, and action density at the semantic level) can all potentially affect readers' sentiments and thus be relevant for future integrative SA tools. However, there is very little research on how these features can best be quantified and integrated into current SA tools [12]. Another important aspect for SAs of entire books concerns the emotional and figure personality profiles for main characters as computed by an extended SentiArt algorithm [13]. These profiles can help in predicting the empathy for and identification with story characters, which undoubtedly are important factors influencing readers' sentiments and moods during the reading of entire novels [67].