Automated Classification of Evidence of Respect in Communication through Twitter

Volcanoes of hate and disrespect erupt in societies, often with fatal consequences. To address this negative phenomenon, scientists have struggled to understand and analyze its roots and language expressions, described as hate speech. As a result, it is now possible to automatically detect and counter hate speech in textual data spreading rapidly, for example, in social media. Recently, however, another approach to tackling the roots of disrespect was proposed: promoting positive behavior instead of only penalizing hate and disrespect. In our study, we followed this approach and discovered that it is hard to find any textual data sets or studies discussing the automatic detection of respectful behaviors and their textual expressions. Therefore, we decided to contribute probably one of the first human-annotated data sets that allows for supervised training of text analysis methods for the automatic detection of respectful messages. By choosing a data set of tweets that already possessed sentiment annotations, we were also able to discuss the correlation of sentiment and respect. Finally, we provide a comparison of recent machine and deep learning text analysis methods and their performance, which allowed us to demonstrate that automatic detection of respectful messages in social media is feasible.


Background
Treating every person with respect [1] seems to be a timeless commandment that everyone would readily agree upon. Unfortunately, this commandment is not practiced in many societies, groups, and enterprises. Outbreaks of disrespectful human behavior are witnessed regularly, especially on social media, which significantly influence emotions in humans [2]. As a result, numerous researchers have begun to address the problem of hate speech propagated via micro-blogging platforms such as Twitter [3-7]. Because hate is not limited to any concept or language, diverse studies have addressed hate expressed toward specific topics, such as sexism [8], racism [3,5-7], nationalism [9], and immigration [10] in English, as well as in other languages [8,9,11]. Various entities have attempted to mitigate the negative effects of hate speech; for example, the United Nations and European Union have their own strategies for addressing hate speech [12,13]. In contrast, the United States Navy has proposed to address the problem from another direction, by strengthening positive behavior. Accordingly, the Navy has published a list of signature behaviors that 21st-century sailors should exhibit [1], in which "treating every person with respect" is placed first. Interestingly, this approach seems to be novel for studies focused on micro-blogging language analysis, because studies aimed at identifying the positive signature behaviors of social media users, with a focus on the use of respectful language or expressions of respect, are difficult to find.
After conducting a search on Google Scholar [14] with queries including "expressing admiration Twitter," "expressing respect Twitter," "language of respect Twitter," "respectful language Twitter," "expressing appreciation Twitter," "appreciation on Twitter," and "polite language Twitter," we were able to identify only a few relevant Twitter-related studies addressing admiration in sports [15,16], compliments for celebrities [17], self-organization of a racial minority group [18], politeness strategies [19], polite language style [20] and gender-dependent language style [21,22]. From the above findings, we hypothesized that the use of polite, respectful language and expressions of respect has been widely discussed in many areas and contexts, although not yet regarding Twitter.
We believe that the introduction of the first data set of a new kind is a significant contribution. When a new data set is presented, its owners often demonstrate a possible use. In our case, we demonstrate that this data set can be used to train automated classification methods from the Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP) domains. We believe that comparing the performance of 14 models is a significant contribution, as other authors rarely compare so many models in a single study. We also believe that demonstrating the correlation of respect with sentiment is crucial, as it indicates that these two notions are not the same, although, according to the obtained results, they are somewhat connected.

Our Focus and Related Research
Our study aimed to address the use of respectful language and expressions of respect on Twitter and to demonstrate that whether a person is exhibiting a positive signature behavior can be assessed from textual data through automated text analysis.
The most closely related research to our study is probably that in [23], which addressed the problem of assessing respect in utterances of officers on duty. The study utilized a hand-annotated sample of 414 data instances to perform regression analysis on the influence of chosen linguistic features on the respectfulness of the analyzed utterance. Furthermore, the model was used to assign a "respect score" to previously unseen data instances in accordance with phrases found in the analyzed sentences. That is, a lexicon-based analysis with a regression model was used to solve a regression task of assigning a "respect score." The advantage of this method is its transparency, because it allows for easy demonstration of which linguistic features contribute to the "respect score" at the instance level.

Defining Respect
Respectful language can be defined in many ways, depending on the context, the persons involved in the context, or the domain. Accordingly, understanding how prior researchers analyzing "respect" have approached the topic of inquiry is important. The main questions that academics and philosophers have asked about respect include how respect should be understood at a general level. Most researchers in the field have identified the concept of respect in various ways, including as a style of conduct, an attitude, a feeling, a right, or a moral virtue [24]. The concept of respect has always had important relevance to people's daily lives because people almost universally live together in social groups. Humans are called upon to give respect in various value paradigms, for example, human life; members of minority racial and ethnic groups; those discriminated against on the basis of gender, sexual orientation, age, religious beliefs, or economic status; the respect for nature urged by environmentalists; and the respect demanded in recognizing some people as social and moral equals and appreciating their cultural differences [25,26]. Academics interested in this matter have widely recognized the existence of different types of respect. For example, the relative ideas regarding respect may differ significantly from other ideas, given the particular context of society, culture, religion, and age [27,28].
Respect as a concept has also been highlighted in discussions of justice and equality, injustice and duties, moral motivation and development, cultural variety and tolerance, punishment, and political violence. According to [29], interest in respect has focused mainly on respect for people, that is, for others who live throughout the world, and therefore regarding differences in religious and cultural beliefs. Thus, the idea that all people should be treated with respect has become more refined: all people should be treated respectfully simply because they are people. Duty and the associated moral approach can be traced to the philosopher Kant, who said that all people are appropriate objects of the most morally significant attitude of respect [30].
However, although most humans recognize the importance of respect and the idea of a moral and political sentiment, owing to the specific actions of people and societies, agreement is lacking regarding issues such as how the concept of respect should be understood, and who or what the appropriate objects of respect should be [31]. As a consequence, the attitude of respect is important to discuss. The attitude of respect necessarily has an object: respect is always directed, felt, or shown to some object or person. Although a wide variety of objects can be appropriate for one type of respect or another (such as the flag, a statue, or a symbol), the subject of respect (the respecter) is nonetheless always a human, that is, a capable conscious rational being who can acknowledge and respond intentionally, who has and expresses values with respect to the object, and who is responsible for bearing respectful or disrespectful attitudes [32].
Hudson [33] has proposed four kinds of respect. (1) We can respect people for their personalities or their work, e.g., respecting a colleague as an academic and/or having respect for someone with "guts." (2) We can respect people for their achievements, e.g., having respect for a professional swimmer or a soccer player having respect for the goalkeeper of the opposing team. (3) We can respect the terms of an agreement and the rights of a person. Finally, (4) we can show respect for people symbolically; e.g., when a judge enters a room, people stand up. To the original classification by Hudson, Dillon [32] has added a fifth form, respect for care, which involves considering that the object has a deep and perhaps unique value, and therefore appreciating it and perceiving it as fragile or requiring special care; as a result, we choose either to act or to refrain from acting, owing to the benevolent concern that we feel for the object.
People can be recipients of different forms of respect. One can discuss the legal rights of a person and respect those rights; one can show respect for the institution that a person represents, e.g., for a president by calling her "Ms. President," or by respecting someone for being committed to a worthy project; and one can accord a person the same basic moral respect that most humans believe anyone deserves. Because a variety of possible dimensions exist, the idea of respect for people remains somewhat vague.
Nonetheless, respectful language, which is defined on the basis of the concept of showing respect, that is, how a person behaves in a respectful way regarding others, is a key component. Authors such as Chapman have linked respectful language with "professional language," [34] which is dependent on the skills and level of education of the person who is speaking or writing. According to the broad discussions presented in different studies of respectful language, the concept also depends on the intellectual characteristics of the person. Chapman argues that using respectful language encourages people to take responsibility for what they say or write because words are an expression of a person's personality.
This definition clearly identifies respectful behavior with the person's psychological expression and their emotions and sentiments related to a topic, and thus with the degree of empathy that can be shown to a topic or person. Thus, Chapman states the following regarding empathy: "Empathy requires intentional thinking, the recognition that other people's feelings and circumstances are separate from our own, and a willingness to act appropriately in response to these. Respectful language, therefore, begins with an intention to respond to what others want. Showing respect does not involve benevolence, guesswork, or simply giving what we are comfortable with in professional conversations" [34].
After studying various definitions and authors' opinions regarding the arguments presented to define respect, we have found that the expression of respect depends on characteristics inherent to the person issuing the message. That is, the context in which the expression, whether oral or written, is given determines the degree of respect within the message. Characteristics such as language, culture, political vision, religion, and even the use of sarcasm influence the subjective perception of respect. Here, we present a variety of contexts in which respect and respectful expression toward one or more people have been defined.
Holtgraves discusses respectful language by considering social psychology [35]. Others, such as Thompson [36] and Wolf [37], have studied the concept of respect from the point of view of politics and politicians, by examining the negative implications that the use of disrespectful language can have for a community, commensurate with political trends. In addition, implications can exist at the national and international levels.
Regarding the use of respectful language in the treatment of customers, such as in medicine, Beach [38] has discussed how a professional must behave toward workers, co-workers, clients, and patients (in the case of a medical doctor). The definition of respect in the context of medicine is "recognition of the unconditional value of patients as persons" [38].
In our study, we sought to address Twitter data by considering language specifics. Unfortunately, this type of data lacks most of the context or knowledge regarding the person expressing the statement, thus complicating the task of deciding whether a given tweet is respectful. The lack of knowledge regarding authors and the context of a given tweet is challenging, especially given other authors' findings regarding definitions of respect and generally how respect can be perceived.

Relationship between Sentiment and Expression of Respect
As our research project proceeded, and we prepared data for our experiments, we often asked what the relationship might be between the sentiment of a tweet and its respectfulness, i.e., are we, in fact, analyzing the same thing but merely calling it a different name? To discuss the relationship between the two notions in the context of Twitter, we selected the same set of tweets that had already been hand-annotated for Twitter-sentiment analysis in [39].

Our Contribution
To the best of our knowledge, this study contributes:

1. A new data set of tweets that, to the best of our knowledge, is the first open data set annotated with a focus on the expression of respect;
2. A comparison of 14 selected approaches from the fields of deep learning, natural language processing, and machine learning used for the extraction of features and classification of tweet respectfulness in the new data set;
3. An analysis of the correlation between tweet sentiment and respectfulness, answering the two questions of whether positive tweets are always respectful and whether negative tweets are always disrespectful;
4. Finally, to enable full reproducibility of our experiments, openly published data and code.

Analyzed Data and Annotation Scheme
In our study, we focused on the detection of respectful tweets. Because of the aforementioned lack of knowledge regarding the author or context of a given tweet, deciding whether a tweet is respectful is not an easy task. Thus, we decided to accept the subjective judgment of what is respectful, as perceived by the annotators who were employed to label the analyzed data. Importantly, in this context, our annotators were prepared for the labeling task by participating in a literature review regarding the task of "defining respect." During this preparation process, we agreed that respectfulness could be understood from two different points of view: either as a straightforward use of words that the reader or listener believes are respectful (such as "Mr. President, can you please give some details about the executive order?") or as expressing respect, which might include the use of "bad" words but is nonetheless an expression of respect (such as "you are a badass at making money"). For this project, we adopted the latter type of example, and we interchangeably refer to it within our work as either "respectfulness" or "expressing respect." It may seem obvious that a text entity could be considered to exhibit a range of respectful sentiment; however, studies elaborating on annotation quality in hate speech analysis [6,40] have demonstrated that, in many cases, humans do not agree on whether a single tweet should be considered racist, and prior studies have rarely gone into detail regarding whether a tweet is "more or less racist." Similarly, we believe that the same tweet could be perceived as either "respectful" or "disrespectful" by different people and that increasing the granularity of the notion of respectfulness causes additional complications. Our data preparation process was consistent with the above statement.
Because we observed complications related to annotator agreement before our final annotation process, we held a series of meetings within the annotator group to improve our common understanding and definition of the "respectfulness" of text. Ultimately, we decided to simplify the annotation task by not providing regression-like "respect scores" as in [23] but instead creating classification labels with only three "respect" classes, to limit confusion (Table 1). Table 1. Annotation scheme adopted in the study.

Label Name      | Label | Tweet Description
Disrespectful   | 0     | Is aggressive and/or strongly impolite; seems "evidently" disrespectful.
Respectful      | 1     | Certainly not disrespectful; written in "standard" language without any evidently negative or positive attitude. If it is unclear whether the tweet should be considered very respectful or respectful, the tweet is labeled respectful.
Very respectful | 2     | Undoubtedly exhibits respect.
The adopted set of 5000 tweets had already been released in [39]. When working with the data set, we detected and erased 36 tweets that were not written 100% in English and then labeled the remaining 4964 tweets according to the defined annotation scheme and the following procedure: (1) three annotators independently annotated each data instance; (2) we computed the Krippendorff's alpha annotator agreement of the resulting labels; (3) to obtain a single respectfulness label for each data instance, we retained the label if all three annotators were in agreement (3102 tweets), or we adopted the label assigned by the majority of annotators (1833 tweets) if the difference between the labels was not greater than 1. In the other cases (29 tweets), a new single annotator was asked to decide and provide a decisive annotation. The data regarding users are anonymized, i.e., it is impossible to connect the given tweets to users automatically.
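The label-resolution procedure above can be sketched as follows (a minimal illustration; the function name `resolve_label` and the `None` placeholder for escalation to an additional annotator are ours, not taken from the project repository):

```python
from collections import Counter

def resolve_label(votes):
    """Resolve three annotator votes (0/1/2) into a single label,
    following the three-step procedure described in the text."""
    # Step 1: full agreement -- keep the common label.
    if len(set(votes)) == 1:
        return votes[0]
    # Step 2: majority vote, accepted only when the labels
    # differ by no more than 1.
    if max(votes) - min(votes) <= 1:
        return Counter(votes).most_common(1)[0][0]
    # Step 3: labels spread by more than 1 (e.g. 0 and 2) --
    # escalate to a new annotator for a decisive annotation.
    return None  # placeholder for manual adjudication

print(resolve_label([1, 1, 1]))  # full agreement -> 1
print(resolve_label([1, 2, 2]))  # majority within distance 1 -> 2
print(resolve_label([0, 1, 2]))  # spread > 1 -> escalate (None)
```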
The annotation procedure resulted in 849 disrespectful, 3730 respectful, and 385 very respectful data instances consisting only of text (no tweet-related meta data were used). Example data instances were (original spelling): "Rest in peace, shipmates.A senseless tragedy, but know your service was not in vain..you made a difference.Condolences to your families." and "@USNavy Surgeon Thomas Harris was born on this day in 1784. @NavyHistoryNews #Medical #History _URL".
Generally, the greater the classification granularity of the assessed concept, the more difficult it is to maintain annotation quality. Annotators are always subjective, and maintaining a common perception of a given phenomenon is not easy even with three classes. This is also one of the first studies in the domain of classifying respect; in the early days of the similar domain of sentiment analysis, many authors likewise agreed that three sentiment classes were sufficient.
We are aware of the problem of using a limited amount of data in our study; however, this cannot be resolved in this case because, to the best of our knowledge, with this study we are publishing the first data set with respectfulness labels that allow predictions at the message level.

Annotation Correlations
Owing to the choice of data set, we were able to compute correlations between annotators regarding the respectfulness of the tweet and the sentiment label already existing in the data set. Because the original sentiment labels were divided into five classes (0, very negative; 1, negative; 2, neutral; 3, positive; and 4, very positive), for the purpose of computing a correlation with our three-class respectfulness labels, we modified the original sentiment labels in the following manner: very negative and negative classes were treated as class 0; neutral tweets were treated as class 1; and positive and very positive tweets were treated as class 2.
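The class-collapsing step above can be expressed directly (a trivial sketch; the dictionary name is ours):

```python
# Collapse the original five sentiment classes into three,
# matching the mapping described in the text.
SENT_5_TO_3 = {
    0: 0,  # very negative -> negative
    1: 0,  # negative      -> negative
    2: 1,  # neutral       -> neutral
    3: 2,  # positive      -> positive
    4: 2,  # very positive -> positive
}

original = [0, 1, 2, 3, 4, 4, 1]
collapsed = [SENT_5_TO_3[s] for s in original]
print(collapsed)  # [0, 0, 1, 2, 2, 2, 0]
```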

Feature Extraction Methods
Automatic classification of text is feasible with machine learning (ML), deep learning (DL), and natural language processing (NLP) methods. ML classifiers are algorithms that operate on features extracted from data instances to perform predictions. In this context, various classifiers can be used, e.g., gradient boosting [41], random forest [42], or support vector machines [43].
There are many use cases that mix DL, ML, and NLP methods in various domains, for example, stock data mining [44], online opinion mining [45], and sentiment analysis [46]. In text classification, researchers in the NLP field have provided methods for preprocessing text instances, including the extraction of features based on: (a) n-grams; (b) token occurrences, such as bag of words (BOW), term frequency (TF), and inverse document frequency (IDF); and (c) lexicon-based methods, such as LIWC [47] and SÉANCE [48]. Developments in the area of NLP enabled word-level or token-level embeddings (i.e., vector representations of text) obtained by trainable models such as GloVe [49] and Word2vec [50], which were later followed by a family of other models capable of creating embeddings at various levels of text granularity, such as the character, sub-word, or token level. To obtain sentence-level or document-level feature vectors from embeddings that correspond to smaller text entities, various strategies have been proposed, with simple pooling (i.e., averaging the embeddings that belong to the larger text entity) as probably one of the first naive approaches. For some time, DL convolutional neural networks (CNNs) and recurrent neural networks (RNNs) demonstrated superiority over simple pooling in the conversion of sub-entity embeddings. However, with the introduction of the transformer model architecture [50] and the famous model "pre-training of deep bidirectional transformers for language understanding" (BERT) [51,52], researchers have achieved new quality levels when creating text representations. Specifically, for text classification, transformer model architectures most often allow embeddings to be obtained for the whole analyzed text entity, without any intermediate steps, via the so-called classification token (CLS).
In brief, the transformer model attention and self-attention mechanisms are prepared to create high-quality entity-level embeddings while the whole model is being pre-trained. Sub-entity-level embeddings can also be obtained from transformer models and can then be further converted into entity-level representations by the previously mentioned pooling or CNNs or RNNs; however, such an approach has inferior performance [50].
Here, we use a selection from the described methods for obtaining tweet-level feature representations to compare their performance on the task of the prediction of respectfulness in tweets. We focus on recent models from the transformer family that could be fine-tuned for our specific classification task to provide the highest possible quality of tweet-level vector representations. In addition, we demonstrate the performance of an LSTM responsible for creating tweet-level vector representations from token-level vector representations obtained from selected language models. We also present the performance that can be achieved by selected pre-trained DL language models used to create vector representations for each token without any data-specific training and then averaging these embeddings to provide a tweet-level vector representation.
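The naive pooling strategy mentioned above, averaging token-level embeddings into a single tweet-level vector, can be sketched with NumPy (the token vectors here are toy values, not outputs of any actual language model):

```python
import numpy as np

def mean_pool(token_embeddings):
    """Naive pooling: average token-level embeddings into a
    single tweet-level vector."""
    return np.mean(token_embeddings, axis=0)

# Toy example: a 4-token tweet with 3-dimensional token embeddings.
tokens = np.array([
    [0.2, 0.0, 1.0],
    [0.4, 0.0, 0.0],
    [0.0, 0.6, 0.0],
    [0.2, 0.2, 1.0],
])
tweet_vector = mean_pool(tokens)
print(tweet_vector)  # a single 3-dimensional tweet-level vector
```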
The rationale for not using the GRU networks was that they are comparable to or slightly inferior to LSTMs, especially if bidirectional LSTM models are concerned, as in our study. An example of a renowned comparison study in this regard can be found in [53].
Finally, we demonstrate models utilizing features extracted with the known LIWC [47] and SÉANCE [48] lexicons. The complete list of the tested feature extraction methods is presented in Table 2. Some of the demonstrated feature extraction methods require data-specific training and others do not; we believe that the methods not requiring data-specific training are easier to adapt, although, at the same time, they provide poorer results. If not specified otherwise, the pre-trained models were downloaded with the Transformers [54] and Flair [55] Python modules. Table 2. List of the feature extraction methods used in our study.

Method | Description | Data-Specific Training | Method for Obtaining Tweet-Level Embeddings | Source
Term Frequency | Top 300 features selected according to a mutual information method implemented in the Python scikit-learn module | Required | Native output of features for the whole text data instance | [56]
SEANCE | Lexicon-based method, "Sentiment analysis and social cognition engine" | None | Native output of features for the whole text data instance | [48]
LIWC | Lexicon-based method, "Linguistic inquiry and word count" | None | Native output of features for the whole text data instance | [47]
Albert Pooled | Tiny version of BERT, model version "base-v2" | None | Mean of token embeddings | [57]
Distilbert Pooled | "Distilled" [58] version of BERT pre-trained to output sentence-level embeddings, model version "base-nli-stsb-mean-tokens" | None | Mean of token embeddings | [59]
Roberta Pooled | Robustly pre-trained BERT ready to output sentence-level embeddings, model version "roberta-large-nli-stsb-mean-tokens" | None | Mean of token embeddings | [59]
Fasttext LSTM
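As an illustration of the term-frequency entry in the table above, count features can be combined with mutual-information selection in scikit-learn (a sketch on a toy corpus with toy labels; the study selected the top 300 features, whereas here k = 5 to fit the tiny vocabulary):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy corpus and toy respectfulness labels (illustrative only).
tweets = [
    "thank you for your service",
    "much respect to the crew",
    "this is a disgrace",
    "what a pathetic statement",
]
labels = [1, 1, 0, 0]

# Term-frequency features, then keep the k features with the
# highest mutual information with the labels.
counts = CountVectorizer().fit_transform(tweets)
selector = SelectKBest(mutual_info_classif, k=5)
features = selector.fit_transform(counts, labels)
print(features.shape)  # (4, 5): four tweets, five selected features
```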

Configuration of Models Which Used LSTMs
Proper configuration of ML and DL models requires experiments and studies. To configure the model parameters in this study, we relied on past research [63] and our experience in the field.
When LSTMs were used to obtain entity-level feature vectors from embeddings corresponding to sub-entities, we adopted the following LSTM hyperparameters: bidirectional = true (the LSTM analyzed the sequence of embeddings from the beginning of the tweet to its end and then in the opposite direction), number of LSTM layers = 2, and size of the tweet embedding created by the LSTM (hidden size) = 512. To train the LSTMs, the following parameters were used: initial learning rate = 0.1, minimal learning rate = 0.002, factor by which the learning rate was decreased after training with the given learning rate ended (anneal factor) = 0.5, and number of epochs without improvement of the validation loss for which training continued with the given learning rate (patience) = 20. In addition, the data instances were shuffled before each training epoch. Other parameters were set to the default values proposed in the Flair module.
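A minimal PyTorch sketch of a bidirectional, two-layer LSTM producing a 512-dimensional tweet embedding follows (we illustrate with `torch.nn.LSTM` rather than the Flair module used in the study; the split of the 512-dimensional output into 256 per direction and the input token dimension are our assumptions):

```python
import torch
import torch.nn as nn

token_dim = 100          # dimension of input token embeddings (illustrative)
lstm = nn.LSTM(
    input_size=token_dim,
    hidden_size=256,     # 256 per direction -> 512 after concatenation
    num_layers=2,
    bidirectional=True,
    batch_first=True,
)

tokens = torch.randn(1, 12, token_dim)  # one tweet of 12 tokens
outputs, (h_n, c_n) = lstm(tokens)
# Concatenate the last hidden states of both directions of the top layer
# to form the tweet-level vector.
tweet_vector = torch.cat([h_n[-2], h_n[-1]], dim=-1)
print(tweet_vector.shape)  # torch.Size([1, 512])
```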

Configuration of Fine-Tuned Models
To fine-tune the transformer models, we utilized the following parameters: Adam optimizer (as implemented in the PyTorch Python module), learning rate = 3 × 10 −6 , number of data instances provided in the model input during a single training pass (i.e., mini-batch size) = 8, and number of times that the training procedure ran over the whole data set (i.e., epochs) = 4. In addition, the data instances were shuffled before each training epoch. Other parameters were set to the default values proposed in the Flair module.
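The fine-tuning settings above translate into the following PyTorch sketch (the linear model and random data are stand-ins for the actual transformer and tweets; only the optimizer, learning rate, mini-batch size, epoch count, and per-epoch shuffling follow the description in the text):

```python
import torch

model = torch.nn.Linear(768, 3)          # stand-in: 3 respect classes
optimizer = torch.optim.Adam(model.parameters(), lr=3e-6)
loss_fn = torch.nn.CrossEntropyLoss()

features = torch.randn(32, 768)          # toy "tweet embeddings"
labels = torch.randint(0, 3, (32,))

for epoch in range(4):                   # epochs = 4
    perm = torch.randperm(len(features)) # shuffle before each epoch
    for i in range(0, len(features), 8): # mini-batch size = 8
        idx = perm[i:i + 8]
        optimizer.zero_grad()
        loss = loss_fn(model(features[idx]), labels[idx])
        loss.backward()
        optimizer.step()
print("final mini-batch loss:", float(loss))
```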

Cross-Validation
During machine learning classification, all experiments in our study were five-fold cross-validated, and the whole data set was divided into proportions of 80% and 20% for the training and test sets, respectively. In addition, all feature extraction methods that required training on our data set were five-fold cross-validated. In this case, the test sets were ensured to remain the same both during training of the feature extraction method and later during machine learning classification.
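The five-fold scheme above, where each fold uses 80% of the data for training and 20% for testing, can be sketched with scikit-learn (stratification keeps class proportions similar across folds, which is our assumption here; the exact splitter used is in the project repository):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(-1, 1)             # toy feature "vectors"
y = np.array([0] * 20 + [1] * 70 + [2] * 10)  # imbalanced toy labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    # Each fold: 80 training and 20 test instances.
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```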

Machine Learning Classification
For the final machine learning classification of the respectfulness of tweets, we utilized the gradient boosting classifier implemented in the xgboost Python module version 1.2.0. For training the classifiers, we adopted the following parameters: number of gradient boosted trees (n_estimators) = 250, training objective = multi:softprob (multi-class classification with results for each data point belonging to each class), and learning rate = 0.03. For a detailed list of all parameters, please refer to the code repository [64].

Classification Metrics
Assessment of the quality of the trained classification models was performed with F1 macro (F1) and Matthews correlation coefficient (MCC) scores. The F1 score was chosen because of its popularity in the ML community. MCC values were also selected because this metric is known to provide more reliable results for unbalanced data sets [64], as was clearly relevant to the dataset used in this work. Therefore, in our study, we treated MCC scores as the decisive quality metric. Classification metrics were computed once for a list of predictions created from the sub-lists of predictions obtained for the test sets from each cross-validated fold.
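Both reported metrics are available in scikit-learn and can be computed on the concatenated cross-fold predictions (toy label lists shown; 0/1/2 are the respect classes):

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# Toy true labels and predictions in place of the concatenated
# test-set predictions from the five folds.
y_true = [0, 1, 1, 2, 1, 0, 2, 1]
y_pred = [0, 1, 1, 1, 1, 0, 2, 0]

f1_macro = f1_score(y_true, y_pred, average="macro")
mcc = matthews_corrcoef(y_true, y_pred)
print(f"F1 macro = {f1_macro:.3f}, MCC = {mcc:.3f}")
```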

Software, Code, and Computing Machine
The computations required to perform this study were conducted on a single computing machine equipped with a single NVIDIA Titan RTX GPU with 24 GB of RAM. All experiments were implemented in Python 3, and the corresponding code and data set are available in [64]. For ease of reproduction, all experiments can be repeated by executing a single bash script. Most elements of the experimental pipeline responsible for the feature extraction methods were implemented with Flair module version 0.6.post1. A precise description of the software requirements is available in the project repository [64].

Relationship between Respectfulness and Sentiment
As demonstrated in Figure 1, annotators 2 and 3 (AN2 and AN3) exhibited the highest correlation of the proposed respectfulness labels. The respectfulness-sentiment correlations per annotator ranged from 0.49 to 0.62. After creation of the final unified respectfulness label, the overall correlation between sentiment and respectfulness was 0.594.

Comparison of Classification Performance
Training machine learning classifiers with the features provided by the methods described in Table 2 resulted in predictions from which we computed the quality metrics presented in Table 3. In Table 4, we also present a worst-versus-best model comparison in the form of confusion matrices for the Term Frequency and RoBERTa large models. Figure 1. Matrix of correlations between annotators regarding respectfulness of data instances ("AN1," "AN2," and "AN3") and sentiment ("SENT") labels provided in the original data set.

Table 4. Confusion matrix for the worst- and best-performing classification models based on features from the term frequency and Roberta large fine-tuned (RoBERTa) feature extraction methods. Numbers corresponding to class labels are: 0, disrespectful; 1, respectful; and 2, very respectful. In the best-performing model, no data instances were misclassified between the "disrespectful" and "very respectful" classes.

Relationship between Respectfulness and Sentiment
The computed correlation based on the respectfulness and sentiment labels was 0.594, thus necessitating subsequent interpretation and consideration. The respectfulness-sentiment correlations computed per annotator ranged from 0.49 to 0.62; therefore, the differentiation between respectfulness and sentiment varies among people. When considering the final respectfulness labels unified across annotators, we found that 2934 of 4964 data instances were assigned the same respectfulness and sentiment classes. However, inspection of the per-class details of the created data set revealed that disrespectful tweets were considered negative 73.73% of the time, respectful tweets were considered neutral in 51.9% of instances, and very respectful tweets were positive in 96.62% of cases. Therefore, we conclude that, for our annotators, very respectful tweets were almost always found to be positive, and disrespectful tweets were likely to be found negative.
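Per-class shares such as "disrespectful tweets were considered negative 73.73% of the time" are simple conditional frequencies. The sketch below computes such a share over hypothetical toy labels; the variable names and data are illustrative assumptions.

```python
def per_class_share(respect, sentiment, respect_class, sentiment_class):
    """Share of tweets in `respect_class` that carry `sentiment_class`."""
    in_class = [s for r, s in zip(respect, sentiment) if r == respect_class]
    return sum(1 for s in in_class if s == sentiment_class) / len(in_class)

# Hypothetical toy labels; respect: 0/1/2, sentiment: "neg"/"neu"/"pos"
respect = [0, 0, 0, 0, 1, 1, 2, 2]
sentiment = ["neg", "neg", "neg", "neu", "neu", "pos", "pos", "pos"]
print(per_class_share(respect, sentiment, 0, "neg"))  # 0.75
```

Running the same computation for every (respect class, sentiment class) pair yields the full cross-tabulation discussed above.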
When responding specifically to the previously defined research question of whether positive tweets are always respectful, we observed that, according to our annotators, in most cases, such tweets were considered either respectful or very respectful. In our data set, there were 1875 positive tweets, only 28 of which (1.49%) were simultaneously considered to be disrespectful. An example of this small group can be represented by the tweet: "Holy Christ, are you going to New Orleans to help with Katrina survivors next?" This tweet, although labeled as very positive, also received a disrespectful annotation. In the given case, one could question why this tweet was labeled disrespectful. When labeling for respect, the annotators agreed that this was a sarcastic tweet that in fact suggests that "you shouldn't be going there," whereas presumably earlier in the sentiment annotation process, the annotators perceived this tweet to be straightforward and non-sarcastic. This example demonstrates how tricky tweet-level classification can be if sarcastic language is introduced without the corresponding broader context.
To answer the second research question of whether a negative tweet is always disrespectful, we begin with the observation that there were 947 negative tweets in the data set. From this group, the majority (626, or 66.1%) were considered disrespectful; however, the remainder were considered respectful. Therefore, negative tweets are not always disrespectful. Some example tweets that can provide background for this conclusion are "The link isn't working" (obviously negative, but not disrespectful) or "@USNavy I do not accept any culpability, blame or responsibility." The latter was labeled as negative and yet is obviously not disrespectful.

Table 3 displays the differences in the classification quality of the gradient boosting classifiers, depending on the provided independent variables. The order of quality of the obtained MCC and F1 results mimics the historical development of the feature extraction methods that were described briefly in Section 2.3, i.e., simple Term Frequency and Lexicon-based methods provide the lowest quality, whereas fine-tuning of the recent transformer model RoBERTa large provides the highest quality. In general, from Table 3, we conclude that fine-tuning the recent transformer models, including the tiny Albert model, allowed us to obtain significantly higher MCC scores than those with the other feature extraction approaches. In our study, the exception to this rule was apparent when methods using LSTMs were considered. The quality of the feature extraction models that utilize bidirectional LSTMs to produce tweet-level embeddings strongly depends on the token-level embedding quality. If simpler fasttext embeddings are used, then the trained fasttext+LSTM embedding method can be surpassed by a pre-trained but more recent token embedding (roberta-large-nli-stsb-mean-tokens, pooled).
In addition, if a bidirectional LSTM is provided with the highest quality embeddings from the RoBERTa large model, then the resulting prediction quality can even surpass that of some of the fine-tuned transformer models.
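The "mean-tokens, pooled" approach mentioned above reduces token-level vectors to a single tweet-level embedding by averaging. The sketch below shows the pooling step in pure Python with tiny 4-dimensional vectors; real RoBERTa large embeddings have 1024 dimensions, and this is an illustrative assumption about the pooling, not the paper's code.

```python
def mean_pool(token_embeddings):
    """Average token-level vectors into a single tweet-level embedding."""
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(vec[i] for vec in token_embeddings) / n for i in range(dim)]

# Hypothetical 4-dimensional token vectors for a three-token tweet
tokens = [[1.0, 0.0, 2.0, 4.0],
          [3.0, 2.0, 0.0, 0.0],
          [2.0, 4.0, 1.0, 2.0]]
print(mean_pool(tokens))  # [2.0, 2.0, 1.0, 2.0]
```

The resulting fixed-length vector can then be fed to a downstream classifier, or, as in the LSTM variants above, the token sequence can instead be passed through a bidirectional LSTM before pooling.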

Worst-Best Model Comparisons
The best-performing model based on the fine-tuned RoBERTa large feature extractor achieved an MCC score of 0.6337, whereas the worst term frequency method provided features that allowed for an MCC score of only 0.4049.
A detailed view of how the worst- and best-performing models performed the given prediction task is displayed in the confusion matrices in Table 4. The main difference in model quality lies in model error regarding minority class 2 (very respectful); that is, the best-performing model did not confuse a very respectful tweet with a disrespectful tweet even once. However, interestingly, the best model misclassified disrespectful tweets as respectful more frequently (167 errors) than the worst model did (87 errors).
In addition, a critic might note that the worst model, based on features extracted by the Term Frequency method, was nonetheless able to correctly predict 79.9% of data instances, whereas the best model achieved 86.0%. From this perspective, the difference between the compared models seems minimal and thus might incorrectly suggest that model choice is not significant. This line of thinking can be readily dismissed because significant class imbalance was present in the data set; as a result, naively assigning the majority class to each data instance would already yield 75.1% accuracy. Thus, an appropriate quality metric such as MCC should be utilized; moreover, the actual model use-case matters most. In the given example, not mistaking disrespectful tweets for very respectful tweets is probably more important than not confusing disrespectful tweets with respectful tweets. Of course, in another use-case, it could be appropriate for the models to penalize a different type of error. In this study, we chose the MCC score as the decisive quality metric, and the result of the comparison is clear.

Comparison with Results from Other Studies
Comparison of our results with those of other studies on the assessment of respectfulness is difficult because no open data sets exist. The study closest to our work [23] approached the task of predicting respect in a different manner. However, the ordering of our models by quality can be compared with the orderings found in other text analysis studies, and in this context our observations align with theirs. In [65] the researchers demonstrate that a pretrained BERT model provides results superior to those of TF-IDF models in several text classification tasks. A survey of text classification algorithms [66] has also ranked deep learning methods ahead of TF or BoW feature extraction techniques. Another study [67] investigated two classification tasks and found that when an LSTM is used to create entity-level embeddings, the quality of the language model used for embedding tokens plays an important role, with older methods such as Glove, Flair, and Elmo [68] being outperformed by the more recent BERT and RoBERTa techniques.

Example Practical Benefits of Carrying out Respect Analysis
There is already evidence of online social media penalizing some users for their adverse behavior on the basis of "hate speech" analysis, with prominent examples being the blocking of accounts by Twitter [69] and Facebook [70] in early 2021. However, no gratification system exists for users who write their posts in a respectful manner. Implementing such a system could, for example, increase the quality of online discourse, which has indisputable moral virtue.

Limitations of Our Study
Conclusions regarding the comparison of the quality of the feature extraction methods presented in this study should be made with care, because the comparison was performed by using only a single data set with a limited number of data instances. In addition, the adopted training procedures were not optimized for each model, thus potentially favoring some models over others. The data set was annotated by 3 + 1 researchers, such that their subjective perception of respectfulness defined what the later trained models were taught. As mentioned in Section 1.3, the perception of respectfulness depends strongly on people and context. Therefore, if another annotator group were used, the same tweets could be labeled differently. Accepting the subjective judgment of four annotators as to what is respectful is not the optimal choice; however, in the absence of user and context information, the only alternative we see is averaging a larger number of subjective judgments.
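One way to unify several annotators' labels into a single training label, as was done for the "3 + 1" annotator setup above, is a majority vote with a tie-breaking fallback. The paper does not detail its unification procedure, so the sketch below is purely a hypothetical illustration of the idea.

```python
from collections import Counter

def unify(labels, fallback):
    """Majority vote over three annotators; a hypothetical fourth
    annotator's label (`fallback`) resolves three-way ties."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]

print(unify([1, 1, 2], fallback=0))  # clear majority -> 1
print(unify([0, 1, 2], fallback=1))  # three-way tie -> fallback label 1
```

Whatever the exact rule, the unified label inherits the annotators' subjective perception of respectfulness, which is precisely the limitation discussed above.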

Conclusions
Promoting respect toward other people is a just cause. Interestingly, this seems to be a novel area of research regarding Twitter, for which researchers have focused mostly on analyzing ubiquitous "hate speech." This study took the first steps in the "Twitter respectfulness" field by (1) discussing how respect is defined by other authors, (2) creating what is probably the first open Twitter data set annotated with respectfulness labels, and (3) demonstrating correlations of the newly created data with well-studied sentiment annotations. We found that in our data respectfulness was correlated with sentiment at a moderate level of almost 0.6, which can be interpreted as indicating that respect is connected with sentiment, but the two are distinguishable notions.
To demonstrate how data with respectfulness labels can be used, our study presents an approach for the application of recently developed methods to the automated classification of the respectfulness of tweets. Our results demonstrate that training automated respectfulness classifiers for Twitter is feasible and that such classifiers can achieve promising performance. Even though the quantity of data used was limited, the comparison of model quality was consistent with other studies.
We hope that this demonstration will allow others to continue efforts toward promoting respect by, for example, implementing enterprise-level positive motivation measures on the basis of the automated assessment of textual data.

Data Availability Statement: A precise description of the software requirements is available in the project repository: https://github.com/krzysztoffiok/respectfulness_in_twitter.