Predicting the Volume of Response to Tweets Posted by a Single Twitter Account

Social media users, including organizations, often struggle to maximize the number of responses they receive from other users, so the ability to predict the response to a post before publication is highly desirable. Previous studies have analyzed why a given tweet may become more popular than others, and have used a variety of trained models to predict the response that a given tweet will receive. The present research addresses the prediction of response measures available on Twitter, including likes, replies and retweets. Data from a single publisher, the official US Navy Twitter account, were used to develop a feature-based model derived from structured tweet-related data. Most importantly, a deep learning feature extraction approach for analyzing unstructured tweet text was applied. A classification task with three classes, representing low, moderate and high responses to tweets, was defined and addressed using four machine learning classifiers. All proposed models were symmetrically trained in a fivefold cross-validation regime using various feature configurations, which allowed for a methodologically sound comparison of prediction approaches. The best models achieved F1 scores of 0.655. Our study also used SHapley Additive exPlanations (SHAP) to demonstrate limitations in the research on explainable AI methods involving Deep Learning Language Modeling in NLP. We conclude that model performance can be significantly improved by leveraging additional information from the images and links included in tweets.


Introduction
Information published on social media is often meant to gain the attention of other users. On Twitter, one of the most widely used social media platforms at the time of writing this paper [1], whether published information successfully gains attention can be assessed by several measures, such as replies, likes or retweets. Petrovic et al. [2] have demonstrated that humans can predict, with a certain probability, whether a given tweet will receive a substantial response. Indeed, some researchers [3] still use human coding for tweet classification. However, much effort is committed to automating Twitter-related predictions. Table 1 provides a brief review of selected work on the automated prediction of responses to tweets. Similar to Cotelo et al. [4], many authors have explored the integration of the textual and structural information available in each tweet. Suh et al. [5] have conducted a large-scale investigation of tweet features responsible for tweet popularity, and have explored the relationships among these variables by using a generalized linear model. Some studies have focused on modeling "cascades of retweets," i.e., the number of retweets over time. Gao et al. [6] have used a general reinforced Poisson process model that is fed data on the number of retweets over time. Kupavskii et al. [7] have used a gradient boosting decision tree model, fed with various structured features, including social and content features, as well as time-sensitive features of the initial tweet publisher, along with the "infected nodes," i.e., users who "retweeted" the initial news. Cheng et al. [8] have investigated many linear and non-linear classifiers and features regarding news content, including image analysis, "root" features of the publisher of the original tweet, features of users who re-shared a given tweet, and structural and time-dependent features. In Zhao et al.
[9], no features were used; instead, only information regarding the number of retweets over time was fed into a self-exciting point process model.

Table 1. Selected work on the automated prediction of responses to tweets.
Gao et al. [6]. Model: general reinforced Poisson process model used for regression analysis. Target: number of retweets over time. Features: number of retweets over time.
Kupavskii et al. [7]. Model: gradient boosting decision tree model; regression and classification tasks. Target: number of retweets over time. Features: structural features including social features, content features (i.e., tweet length, number of URLs, mentions, hashtags, negative and positive terms and smileys, question and exclamation marks, arousal, valence and dominance), Affective Norms of English Words (ANEW), time-sensitive features of the initial node, features of the infected nodes, and page rank.
Cheng et al. [8]. Model: many linear and non-linear classifiers, e.g., logistic regression and RF; classification task.
Hong et al. [12]. Target: total number of retweets. Features: previously retweeted, TF-IDF content features (terms used in the tweet text), Latent Dirichlet Allocation topic distribution, the number of retweets of a given account, and many others briefly mentioned.
Zhang et al. [15]. Target: total number of retweets. Features: broad set of user and tweet features.
A further entry cites [13] and lists a broad selection of structural features as well.

Researchers have also pursued the more challenging goal of predicting the total replies that a tweet will receive before publication. Petrovic et al. [2] have investigated a passive-aggressive algorithm, including social features, such as those reflecting the publishing user, along with tweet features that "encompass various statistics of the tweet itself, along with the actual text of the tweet." A generalized linear model fed only structural features, such as "contains hashtags" or "contains URL," was used in Suh et al. [5]. In Jenders et al. [10], a generalized linear model and naive Bayes models were fed structured tweet and user features, such as the sentiment of the tweet, tweet length, number of mentions, number of hashtags, number of followers, emotional divergence and number of URLs. A random forest (RF) classifier model was adopted in Oliveira et al. [11], which also benefited from the inclusion of structured user and tweet features, such as the number of hashtags, URLs, mentions, tweet length, number of words, whether the tweet is a reply, the hour of the tweet's timestamp, the number of images and videos, and the sentiment of the tweet. Hong et al. [12] have used a logistic regression model fed user features, such as the number of retweets of a given account, and content features extracted through slightly more sophisticated methods, including Term Frequency-Inverse Document Frequency (TF-IDF) analysis of the terms used in the tweet text, and Latent Dirichlet Allocation topic distribution analysis. The paper also briefly mentions many other features. Zhang et al. [15] used a support vector machine model and fed it various structured user and tweet features, such as the number of followers, friends, past tweets, favorites, number of times the user was listed, age of account, user activity, user screen name length, the verification status of the user, average number of followers gained from a tweet, average number of times a user was listed through a tweet, number of URLs, hashtags, mentions, words, characters, whether the tweet was a reply, whether the tweet had been retweeted previously, and the time at which the tweet was published.
For some specific applications, such as detecting spamming accounts [16], even more structured user features, such as the URL rate and the interaction rate, are believed to be highly informative. Interestingly, a recent study [14] has reversed the prediction logic and based the analysis on replies, but this approach struggled to predict the popularity of the original source tweet. Importantly, this study used complex Deep Learning Language Modeling to automatically extract content feature vectors from tweets, rather than using hand-selected features.
A separate line of Twitter research is also worth mentioning, namely the detection of events in Twitter using wavelet-based analysis. For example, one work representing this approach introduced EDCoW (Event Detection with Clustering of Wavelet-based Signals) [17], and demonstrated that detecting events through news spreading in Twitter is feasible with the proposed method.
Given the abundance of structural tweet features used by various authors, it is understandable that many works, like Keib et al. [3], Cotelo et al. [4] and Jenders et al. [10], struggled to identify which of these features influence the predictive capabilities of trained models, and to what extent. In this context, owing to the revived interest in explainable artificial intelligence (XAI) after "explainability winter" [18], it is possible that exploiting new interpretability techniques could be beneficial.
Our research aimed to compare selected machine learning classifiers fed with structured tweet features, and features extracted with the recently developed Deep Learning Language Models (LMs), for predicting the total number of replies to tweets published by @USNavy, the official US Navy account. For each tweet, we accounted for only the information available before publishing. We also wished to demonstrate how a recently introduced XAI tool can be leveraged to improve the understanding of the importance of structural features, but not of the features provided by Deep Learning LMs. Finally, to provide information that is valuable from an ML practitioner's perspective, we also give insight into the computation times of the deployed methods.
We believe that our choice of data source, namely a single Twitter account, is beneficial for Natural Language Processing (NLP) practitioners who, while working for an entity owning a Twitter account, are obliged to predict responses to a future tweet by this entity. In our study, the selection of the particular @USNavy account was dictated by the funding source of our research specified in the funding section. We also hope that the small size of the training data sample analyzed here is informative for readers asking whether a small number of available historical tweets from their organization is an obstacle to applying the methods described here. Because unstructured tweet text is written in a highly specific manner, numerous studies [19][20][21][22] have used tools from the NLP field and proposed tweet-filtering techniques before addressing the machine learning task. Our work benefitted from such tweet pre-processing concepts; however, given the high quality of the language used by the official US Navy account, we defined our own simplified approach.
Our feature extraction efforts began with exploiting structured tweet information, such as whether the tweet included an image or contained any hashtags. Petrovic et al. [2] demonstrated that social features, such as the number of followers and friends, and whether the user's language is English, are very informative regarding reply prediction. In addition, Mbarek et al. [23] and others, as previously mentioned, have suggested various user profile-related features that can improve the quality of classification. Our research could not benefit from these approaches because we sought to analyze tweets published by a single user. Instead, we included the date of publication as an indirect feature correlated, for example, with changes in the number of followers over time. However, we did not seek to define precise hour-by-hour models, as proposed in Petrovic et al. [2]. Rather than concentrating on features engineered by hand, we decided to focus on gathering information from unstructured text data by using a Deep Learning architecture based on recently developed LMs.
Our work contributes to the field primarily by comparing the performances of four machine learning models in the same classification tasks, on the basis of features extracted primarily with a recently developed Deep Learning Language Modeling approach and four different LMs. The comparison was performed independently for three different target variables: the total numbers of replies, likes, and retweets. We also used SHapley Additive exPlanations (SHAP) [24], a state-of-the-art XAI technique, to demonstrate that the high performance of Deep Learning Language Modeling comes at the price of model explainability. To provide full experimental reproducibility, we have released our code and data set in an open repository [25].

Analyzed Data
To gather and analyze Twitter data, it was necessary to gain acceptance for the proposed use case from Twitter by obtaining a Twitter Developer Account. In this work, we analyzed Twitter data published by the official @USNavy account from January 2011 to December 2019. Our search within this period was conducted on 14 January 2020, and resulted in a total of 23,951 tweets. The annual numbers of replies, likes and retweets to the gathered tweets increased over time, as shown in Figure 1. For all three target variables, the years 2017-2019 showed a substantial increase, as compared with the previous years. To analyze more up-to-date and uniform data, we narrowed our analysis to these three most recent years. In this period, the official @USNavy account published 4853 tweets, which we further limited to 4498 tweets according to the procedure described in the unstructured text pre-processing section of the manuscript. Descriptive statistics of the target variables for the analyzed data are presented in Table 2.
Table 2. Descriptive statistics of target variables for the analyzed 4498 tweets.

Classification of Target Variables
In our study, rather than predicting the precise number of user responses to a given tweet (i.e., solving a regression task), we decided to address a classification task, in which the classes generally reflected the number of responses. Class definitions were derived from descriptive statistics of the analyzed response data, and are presented in Table 3.

Classification Framework
To solve the defined classification task, we propose a framework with the workflow presented in Figure 2. This framework divides each tweet's data instance into structured non-textual and unstructured textual data, and performs separate feature extractions for both data types. The extracted features from the tweet instance are then fed into a Machine Learning Classifier, which predicts the reply class. It is essential to mention that in our work, we use the terms "unstructured text" and "unstructured textual data" solely to distinguish free text from structured data. Specifically, these terms do not refer to the quality of language used in the Twitter posts that we analyze.

Feature Extraction
When structured data were considered, each tweet instance was flagged in a binary manner according to several categories: includes image, includes links to an external web resource, includes any hashtags, was posted as a reply to another tweet, and includes a retweet of another tweet. In addition, the tweet publication date was included as a separate feature, expressed as the number of months after January 2017. This approach resulted in the definition of six features derived from structured non-textual data for each tweet data instance. Table 4 presents the percentage of "true" values for each binary feature for all tweet instances analyzed. Feature extraction from the unstructured text was conducted through a complex approach involving several steps, as presented in Figure 3.
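A minimal sketch of how the six structured features could be assembled is given below. The `Tweet` container, its field names and the example tweet are hypothetical and are not taken from the released code; only the five binary categories and the month offset from January 2017 follow the description above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Tweet:
    # Hypothetical container; field names are illustrative only.
    text: str
    published: date
    has_image: bool = False
    has_link: bool = False
    has_hashtag: bool = False
    is_reply: bool = False
    includes_retweet: bool = False


REFERENCE_MONTH = date(2017, 1, 1)


def months_since_reference(day, reference=REFERENCE_MONTH):
    """Whole calendar months between the reference month and `day`."""
    return (day.year - reference.year) * 12 + (day.month - reference.month)


def structured_features(tweet):
    """Return the five binary flags followed by the month offset."""
    return [
        int(tweet.has_image),
        int(tweet.has_link),
        int(tweet.has_hashtag),
        int(tweet.is_reply),
        int(tweet.includes_retweet),
        months_since_reference(tweet.published),
    ]


# Made-up example tweet.
example = Tweet("Fair winds and following seas.", date(2019, 3, 14),
                has_image=True, has_hashtag=True)
print(structured_features(example))
```

Counting whole calendar months ignores the day of the month; whether the original feature was computed this way is an assumption of the sketch.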

Pre-Processing and Filtering of Unstructured Text Data
Unstructured Twitter text substantially differs from standard text, and previous research has accordingly proposed a special approach to pre-processing [19][20][21][22]. In our research, we borrowed from these proposals and modified them by adding our new steps, which resulted in the pre-processing and filtering procedure presented in Figure 4. After pre-processing, all tweets with duplicated text were deleted; 4498 tweets remained for the final analysis.

Figure 4. Procedure for the pre-processing of tweets.

Deep Learning Feature Extractor
To extract features from the pre-processed unstructured text, we used the Flair NLP framework (version 0.4.5) presented by Akbik et al. [26]. This allowed us to create and train a Deep Learning Feature Extractor (DLFE) via the procedure presented in Figure 5. First, we used an LM to convert tokenized textual data into corresponding single-token vectors. The procedure was conducted with three LMs for subsequent quality comparison: (a) a FastText [27] LM (Gensim version [28]) trained on Twitter data, with the model word dictionary covering 61.4% of data set tokens; (b) a distilled version of Bidirectional Encoder Representations from Transformers (DistilBERT) LM [29]; and (c) a GloVe LM [30], with the model word dictionary covering 64.9% of data set tokens. Second, we trained a two-layer bidirectional Long Short-Term Memory (LSTM) neural network with hidden_size = 512 to create tweet-level embeddings from the single-token vectors provided by the LM. For each LM, the training procedure of the DLFE used parameters previously demonstrated to provide a high performance with a reasonable training time [31], namely: initial learning rate = 0.1, minimal learning rate = 0.002, annealing rate = 0.5, mini-batch size = 8, hidden size = 256, and shuffle data during training = true. Other parameters were set to the default values proposed by the Flair framework. As a result, we obtained three ready-to-use DLFEs optimized for the analyzed data.
To add a state-of-the-art transformer LM to our comparison, we also introduced a Robustly Optimized BERT Pretraining Approach (RoBERTa) large model [32]. In this case, the LM was not used to output single-token embeddings, and therefore no LSTM was used. Instead, RoBERTa was fine-tuned on our data, and the built-in transformer model special classification token "CLS" was used to obtain tweet-level embeddings directly from the transformer model. The fine-tuning procedure was performed with the following parameters, inspired by Devlin et al. [13]: initial learning rate = 0.00003, mini-batch size = 8, maximum number of epochs = 4, minimal learning rate = 0.000003, and patience = 3. Other parameters were set to the default values proposed by the Flair framework.
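The learning-rate settings reported for the DLFE (initial rate 0.1, annealing rate 0.5, minimal rate 0.002) bound the number of annealing steps. The sketch below assumes a simplified rule: the rate is multiplied by the annealing rate each time the validation score stops improving, and training ends once the rate falls below the minimal learning rate. The function name and this stopping rule are assumptions for illustration; the scheduler used by the Flair framework may differ in detail.

```python
def annealing_schedule(initial_lr, anneal_rate, min_lr):
    """List the learning rates visited before the rate drops below min_lr."""
    if not 0 < anneal_rate < 1:
        raise ValueError("anneal_rate must lie strictly between 0 and 1")
    if initial_lr <= 0 or min_lr <= 0:
        raise ValueError("learning rates must be positive")
    rates = []
    lr = initial_lr
    while lr >= min_lr:
        rates.append(lr)
        lr *= anneal_rate  # one annealing step after a plateau
    return rates


dlfe_rates = annealing_schedule(0.1, 0.5, 0.002)
print(len(dlfe_rates), dlfe_rates)
```

Under these assumptions the DLFE training passes through six distinct learning rates, the smallest being 0.003125, before the next halving would fall below the minimal rate.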

Division of the Data during the Training Process
Our research used fivefold cross validation. For training of the DLFE, 70% of all data instances were used for training, 10% were used for validation, and 20% were used for testing. When machine learning models were trained, 80% of all data instances were used for training, and 20% were used for testing. The same data instances were used for training and testing in both the training of the DLFE and, subsequently, the machine learning prediction of user response.
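One way to realize this division is sketched below using only the standard library. How the original experiments shuffled the data and selected the validation part is not stated here, so the seed, the interleaved fold assignment and the helper name `fivefold_splits` are assumptions. The classifier's training part is taken to be the union of the DLFE training and validation parts, which keeps the test instances identical in both stages.

```python
import random


def fivefold_splits(n_items, n_folds=5, val_fraction=0.1, seed=0):
    """Yield (train, val, test) index lists for each cross-validation fold.

    val_fraction is relative to all items, so with five folds the
    proportions are roughly 70% / 10% / 20%.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = round(val_fraction * n_items)
    for k in range(n_folds):
        # Every n_folds-th shuffled index, starting at k, forms the test fold.
        test = indices[k::n_folds]
        rest = [i for j, i in enumerate(indices) if j % n_folds != k]
        val, train = rest[:n_val], rest[n_val:]
        yield train, val, test


for train, val, test in fivefold_splits(4498):
    # The classifier is trained on train + val and tested on the same test part.
    classifier_train = train + val
    print(len(train), len(val), len(test), len(classifier_train))
```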

Feature Sets Fed to the Machine Learning Classifiers
In Table 5, we present three feature groups and several defined feature sets used to compare the quality of solving prediction tasks. The groups and sets were defined in the same manner for each target variable.
For each cross-validated trial, we conducted statistical analyses in Python with the statsmodels (version 0.10.1) and pingouin (version 0.3.4) software packages. The adopted procedure was as follows: we tested for the normality of the distribution with the test proposed by Shapiro and Wilk [33]; a one-way ANOVA was carried out; and this was followed by a Tukey Honest Significant Difference multiple comparison test to verify significant differences between trials. The significance threshold was set to p = 0.05.
Other parameters of each classifier were set to the defaults proposed by the Python sklearn software package (version 0.22.1). The F1 micro score was used as the outcome measure. For clarity, only the mean F1 micro score is reported in all cases. All experiments were coded in Python 3 and performed on the same computing machine equipped with a single NVIDIA Titan RTX GPU with 24 GB of RAM.
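To make the outcome measure concrete, the following sketch computes a micro-averaged F1 score from scratch by pooling true positives, false positives and false negatives over all classes. It is a plain-Python illustration rather than the sklearn implementation used in the experiments, and the label vectors are made up.

```python
def f1_micro(y_true, y_pred):
    """Micro-averaged F1 for single-label, multi-class predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1  # predicted this class, but the truth differs
            elif truth == label:
                fn += 1  # missed an instance of this class
    return 2 * tp / (2 * tp + fp + fn)


# Made-up labels for the three classes (0 = low, 1 = moderate, 2 = high).
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
score = f1_micro(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(score, 3), round(accuracy, 3))
```

When every instance receives exactly one label, each misclassification counts once as a false positive and once as a false negative, so the micro-averaged F1 score coincides with plain accuracy.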

Summary of the Algorithm
For improved clarity of adopted procedures, we provide appropriate pseudo code in Figure 6.

Explaining Model Decisions
To provide an improved understanding of the rationale of the machine learning models' predictions, we used SHAP (version 0.35.0), a state-of-the-art XAI technique. We used the SHAP Tree Explainer [34] to generate visualizations of model-level explanations of several selected RF and GB model variants.
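SHAP attributes a prediction to individual features by means of Shapley values. The toy example below computes exact Shapley values by enumerating all feature coalitions. It is not the Tree Explainer algorithm, which obtains these values efficiently for tree ensembles; the toy model, its coefficients and the all-zero baseline are assumptions chosen only to make the idea tangible.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to a baseline input.

    Features outside a coalition are replaced by their baseline value.
    The cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def toy_model(features):
    # Hypothetical scoring function with one interaction term.
    has_image, has_link, months = features
    return 2.0 * has_image + 0.5 * months + 1.0 * has_image * has_link


x = [1, 1, 4]
baseline = [0, 0, 0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is shared equally between the two features involved.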

Methods Computation Time
For ML practitioners, not only quality but also the computation time of deployed methods plays an important role. To provide such information for all employed LMs and a selected dependent variable, we have computed the times involved in training feature extractor models, creating tweet-level vector representations for a selected data fold, and the whole procedure of training and testing machine learning classifiers.

Results and Discussion
In our opinion, there are several notable observations regarding our experimental results. Figure 7 depicts partial results of the prediction of the number of "replies," which can be treated as an example for comparing the prediction quality of the trained models. Here, using features from group I, i.e., derived only from structured tweet data, resulted in a prediction quality inferior to that obtained using group II features, extracted from unstructured tweet text by the DLFE, independently of the selection of the machine learning classifier. Therefore, our results support the intuitive hypothesis that the written content of a tweet matters more than hand-crafted features, such as having an image or the date of publication. A comparison of the results for groups II and III also supports another intuitive assumption: the six features derived from structured tweet data provide meaningful information and improve the quality of predictions that are based mostly on features extracted from unstructured tweet text. Examination of the full results presented in Table 6 strengthens this conclusion because, in most cases, using group III features provided a quality equal to, or slightly higher than, that obtained using group II features. However, exceptions to this rule should be mentioned, for instance, in the prediction of replies, for which the results without structured data were marginally higher: a 0.558 F1 score for DB features and an RF classifier versus 0.557 for SDB features and the same classifier.
Assessing the full results presented in Table 6 allowed us to draw additional conclusions:
1. The MLP and R classifiers were usually, but not always, outperformed by the GB and RF classifiers. No clear pattern indicated which classifier performed best;
2. Predicting the number of replies was more difficult than predicting the other two target variables for all tested feature sets;
3. For likes and retweets, for all compared LMs, RoBERTa provided the highest prediction performance for group II features as well as group II features in combination with structured features (group III features). However, this result was not the case for the prediction of replies. We hypothesize that this finding was caused by the unoptimized training regime for this target variable, and we discuss this aspect further in "Limitations of the study";
4. DistilBERT LM most often had the second-best performance after RoBERTa LM; however, in this case, the improvement in the prediction quality over that of GloVe and FastText LMs was marginal;
5. The best quality of results for replies, likes and retweets was associated with F1 scores of 0.558, 0.655 and 0.65, respectively.
As already mentioned in the Methods section, we have carried out statistical analyses for all presented experiments. Full results of these analyses are available, along with data and code [31], and their possible interpretation is that most results had a normal distribution. One-way ANOVA indicated significant differences between trials, and Tukey HSD tests indicated significant differences in around 50% of the compared pairs.
Information regarding the computation times of the deployed methods, presented in Table 7, shows that improved prediction quality comes at the cost of speed, in both model training and inference. If a system operating in real time is developed, using DistilBERT and RoBERTa may considerably prolong the whole data processing procedure. However, we believe it is essential to underline that the reported times are only illustrative and will differ strongly between computing machines and code implementations.
Similar F1 score values were obtained by Hong et al. [12]; however, Hong et al. used different features, and the analyzed data were published by various user accounts, which allowed them to leverage account-specific features that are known to provide valuable information and improvements in classification scores, as demonstrated, for instance, by Zhang et al. [15]. Our findings can also be compared to those of Kupavskii et al. [7]. In addition to solving a regression task, Kupavskii et al. [7] conducted a two-class classification by using a gradient-boosting decision tree model and achieved F1 scores as high as 0.775 and 0.67 for the two analyzed classes. These higher F1 scores might be attributable to the use of information available after a tweet's publication, whereas in our work we assumed that no post-publishing information was available. The authors themselves demonstrated that even incorporating information regarding the number of retweets from the first 15 s after a tweet is published can substantially increase predictive performance. In addition, solving a classification task with two classes is usually simpler than solving a similar task with three classes.
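To make the F1 scores compared above more concrete, the following minimal sketch computes per-class F1 scores and their unweighted (macro) average for a three-class problem. The labels and predictions are hypothetical, and macro averaging is an assumption made for illustration; the averaging scheme is not restated in this section.

```python
def f1_per_class(y_true, y_pred, labels):
    """Return a dict mapping each label to its F1 score."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    per_class = f1_per_class(y_true, y_pred, labels)
    return sum(per_class.values()) / len(labels)


# Hypothetical response classes and predictions (illustrative only)
labels = ["low", "moderate", "high"]
y_true = ["low", "low", "moderate", "moderate", "high", "high"]
y_pred = ["low", "moderate", "moderate", "moderate", "high", "low"]

per_class = f1_per_class(y_true, y_pred, labels)
overall = macro_f1(y_true, y_pred, labels)
print(per_class, round(overall, 3))
```

With three classes, an error can fall into either of two wrong classes, which is one reason why scores from two-class and three-class tasks are not directly comparable.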
The consistent quality of our deep learning methods is probably attributable to their ability to create context-aware, tweet-level representations, i.e., to capture the context of the whole tweet and extract more valuable information from the unstructured text. LMs such as Glove and FastText provide only context-independent features, which causes the performance to drop.
Further increasing the performance of our machine learning models is likely to be possible with the proper engineering of additional structured features. Many possible features could be adopted, including those as simple as the length of a tweet, as proposed in Duan et al. [35]. Figures 8-10 demonstrate the importance of well-engineered structured features. These figures depict SHAP explanations for the same machine learning classifier, GB, and features from groups I, II and III. Analysis of Figure 8 indicates that time-dependent information regarding when the tweet was published was most informative for the model trained solely on structured features. In Figure 9, the importance of features created by DLFE can also be shown, but unfortunately, there is no information on what these features represent. This observation shows that while deep learning modeling in NLP provides a significant performance boost, it makes state-of-the-art XAI techniques useless in some cases. Figure 10 shows the importance of the structured features, compared with DLFE features, for a model trained on these combined features. The single most crucial structured feature was among the 15 most important features. Thus, properly engineered structured features appear to be truly valuable, even in conjunction with DLFE features.
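As an illustration of the kind of simple structured features discussed above, the following sketch derives tweet length and publication-time features from a tweet's text and timestamp. The function name, the feature names and the example timestamp are hypothetical; this is not the feature set used in our experiments.

```python
from datetime import datetime, timezone


def structured_features(text, published_at):
    """Build a small dict of structured features (hypothetical names)."""
    weekday = published_at.weekday()  # Monday == 0, Sunday == 6
    return {
        "length_chars": len(text),
        "length_tokens": len(text.split()),
        "hour": published_at.hour,
        "weekday": weekday,
        "is_weekend": int(weekday >= 5),
    }


# Example text taken from the tweets quoted in this paper; the timestamp is invented
example = structured_features(
    "Around the fleet in today's #USNavy photos of the day.",
    datetime(2020, 1, 4, 15, 30, tzinfo=timezone.utc),
)
print(example)
```

Because each such feature has a clear meaning, its contribution can be read directly from a SHAP summary, unlike the dimensions of a DLFE vector.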
In fact, we believe that the key to significant improvements in prediction quality lies in crucial information that is available in tweets but is currently neglected. A representative detail that illustrates the underlying problem can be seen in the pre-processing of tweets. Our tweet pre-processing procedure resulted in the removal of 351 duplicated tweets. Of course, the @USNavy account did not publish the same tweets several times; however, the procedure converting all links and images to the same tokens resulted in the creation of identical tweets such as "LIVE NOW: Watch #USNavy's newest Sailors graduate boot camp-_URL _IMAGE" or "Around the fleet in today's #USNavy photos of the day. info and download: _URL . . . _IMAGE." Consequently, among the 351 deleted tweets, many differed only in image or link content. As shown in Table 4, 83.5% of the data instances in the analyzed set of tweets included images. Intuitively, the content of an image should influence the likelihood of "liking" or "retweeting" a tweet, but our features are not capable of reflecting image content in any manner. We believe that extracting information from the images attached to tweets could clearly improve the quality of predictions regarding user responses.
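The effect described above can be reproduced with a minimal sketch. The regular expressions, the example tweets and the function names are assumptions made for illustration and do not reproduce our actual pre-processing code; only the `_URL` and `_IMAGE` tokens are taken from the text.

```python
import re

# Assumed patterns: image links hosted on pic.twitter.com, any other http(s) link
IMAGE_PATTERN = re.compile(r"(?:https?://)?pic\.twitter\.com/\S+")
URL_PATTERN = re.compile(r"https?://\S+")


def normalize(text):
    """Replace images and links with placeholder tokens and collapse whitespace."""
    text = IMAGE_PATTERN.sub("_IMAGE", text)  # images first, so they are not tagged as _URL
    text = URL_PATTERN.sub("_URL", text)
    return " ".join(text.split())


def deduplicate(tweets):
    """Keep the first occurrence of each normalized tweet, preserving order."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalize(tweet)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical tweets: the first two differ only in link and image content
raw = [
    "Photos of the day: https://example.com/a pic.twitter.com/abc123",
    "Photos of the day: https://example.com/b pic.twitter.com/xyz789",
    "Fleet update: https://example.com/c",
]
unique = deduplicate(raw)
print(unique)
```

The first two tweets collapse into one instance after normalization, although they point to different images and web resources. This is precisely the information that the current features cannot capture.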
Future efforts to address this issue could begin with an approach similar to that of Mbarek et al. [23], in which the authors experimented with leveraging publicly available Convolutional Neural Network-based tools and simple color analysis of images for feature extraction. In addition, for the 60.38% of data instances that contain links, the classifiers used include no information regarding the web resources to which the links direct. In this context, prediction quality could be improved by analyzing the URL type, as proposed in the study by Suh et al. [5], which indicated that some tweets are more likely to be retweeted than others, depending on the URL target. Structured features extracted by the proposed approaches could also contribute to improving understanding of the rationale for model predictions if XAI tools similar to those used in our study were implemented.
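One simple way to approximate the URL type is to use the domain of the link target as a categorical feature. The sketch below assumes that links have already been expanded to their final targets; the domain list, the feature names and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical set of domains treated as video platforms (illustration only)
VIDEO_DOMAINS = {"video.example.com"}


def url_domain(url):
    """Return the lowercased host of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def url_features(urls):
    """Build structured features describing the link targets of one tweet."""
    domains = [url_domain(u) for u in urls]
    return {
        "n_links": len(urls),
        "domains": domains,
        "has_video_link": int(any(d in VIDEO_DOMAINS for d in domains)),
    }


features = url_features(
    ["https://video.example.com/watch?v=abc", "https://www.example.org/news/1"]
)
print(features)
```

Shortened links would first need to be resolved, which requires network access and is therefore omitted here. Features of this kind remain interpretable, so their influence could be inspected with the same SHAP analysis applied to the other structured features.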

Limitations of the Study
One limitation of our study is the small data set, which prevents us from drawing strong conclusions from our experiments. Another possible source of data-related bias is the choice of a single Twitter account as a data source. It cannot be excluded that the methods described here would perform differently for another Twitter account.
Another limitation, specific to the topic addressed, is that we did not focus on testing the many engineered structured tweet features that could improve prediction quality. This decision was deliberate, because the main aim of this work was to demonstrate the utility and quality of the Deep Learning Feature Extraction approach regarding unstructured tweet text.
To ensure the comparability of the various LMs and all target variables, we performed the training procedures with the same set of parameters. This design could have introduced bias, because the chosen training regime could be more beneficial for some LMs and target variables than for others. This negative effect is apparent in the results of predicting replies; unexpectedly, RoBERTa LM was outperformed by simpler LMs, probably because of the unoptimized training regime.

Conclusions
Predicting the number of likes, replies or retweets that a tweet will receive before publication is a difficult task. Other researchers have experimented with various models and features, and some have analyzed different scenarios by using available post-publishing information. In our work, we presented models trained primarily on features extracted from unstructured tweet text, via deep learning feature extraction based on recently published LMs, i.e., DistilBERT and RoBERTa. Our findings confirm that using these recent models for text-based feature extraction provides a higher quality of prediction results than using simple structured features and earlier-introduced LMs such as Glove and FastText. We also found that, of the three analyzed dependent variables, the number of replies was the most difficult to predict. We believe that substantial room for improvement still remains, and we hypothesize that improving prediction quality will be possible with proper leveraging of the information contained in the images and links published with tweets. Our study also demonstrated that when more structured features containing additional information are introduced, it is possible to assess their influence on the prediction quality if proper XAI techniques are employed. This may allow optimization at the stages of feature engineering and selection, and may improve understanding of the rationale for model predictions. Unfortunately, the tested XAI method did not prove useful for features provided by deep learning language models.