Popularity Prediction of Instagram Posts

Abstract: Predicting the popularity of posts on social networks has taken on significant importance in recent years, and several social media management tools now offer solutions to improve and optimize the quality of published content and to enhance the attractiveness of companies and organizations. Scientific research has recently moved in this direction, with the aim of exploiting advanced techniques such as machine learning, deep learning, and natural language processing to support such tools. In light of the above, in this work we address the challenge of predicting the popularity of a future post on Instagram, by defining the problem as a classification task and by proposing an original approach based on Gradient Boosting and feature engineering, which led us to promising experimental results. The proposed approach exploits big data technologies for scalability and efficiency, and it is general enough to be applied to other social media as well.


Differently from other works in the literature, which commonly formulate the problem as a regression task with the goal of estimating the so-called engagement factor (i.e., the ratio between the expected likes of a future post and the account followers), we model the problem as a binary classification task. More specifically, our goal is to determine whether a future Instagram post will be popular or unpopular, regardless of the type of visual content published (image or video), but mainly focusing on the post metadata such as the caption, the time of publication, and the account typology. In particular, we label as popular a post whose number of (expected) likes will exceed a specified threshold (roughly, the moving average of likes of the account), and as unpopular otherwise.

To formalize this concept, we first need to introduce the preliminary definition of the Likes Moving Average (LMA). Intuitively, given the i-th post of an Instagram account, the LMA represents the average number of likes achieved by its previous K posts. In formulae, let P_A be the ordered set of posts published by an account A; then, we have:

LMA_K(A, i) = (1/K) · Σ_{j=i−K}^{i−1} like_count(P_A[j])

where K is the number of previous posts considered (i.e., the size of the moving average window), and like_count(P_A[j]) is the number of likes obtained by the j-th post of A.

In this section, we describe in detail the process adopted to build the dataset used for the study and testing of our approach. Hence, following this scheme, we collected the post and profile features described in Table 1.
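As an illustration, the LMA and the derived popularity label can be sketched in a few lines of Python. This is our own minimal sketch, assuming the like counts of an account are available as a chronologically ordered list; the function names and data layout are not from the original implementation.

```python
def likes_moving_average(like_counts, i, K):
    """Average number of likes of the K posts preceding the i-th post.

    like_counts: chronologically ordered list of like counts for one account.
    """
    if i < K:
        raise ValueError("the i-th post must have at least K predecessors")
    return sum(like_counts[i - K:i]) / K

def popularity_label(like_counts, i, K):
    """1 (popular) if the i-th post's likes exceed the account's LMA, else 0."""
    return 1 if like_counts[i] > likes_moving_average(like_counts, i, K) else 0
```

Note that the actual labeling threshold is only roughly the LMA (it may be shifted by a margin ∆, discussed later), so this sketch shows the ∆ = 0 case.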

FEATURE         DESCRIPTION
is_video        A binary feature that indicates if the post content is an image or a video.
likes_num       The number of likes received by the post.
timestamp       The publication date and time of the post.
followers_num   The number of followers of the post author.
caption         The full caption of the post, including emojis and hashtags.

In this section, we describe the proposed method to address the problem of predicting the popularity class of a future Instagram post (as defined in Section 3), through a supervised learning technique and by exploiting user-generated metadata (caption, hashtags, publication time, etc.) as features. As previously mentioned, our method does not perform any kind of analysis on the type of visual content that the user intends to publish (picture or video), since the goal is to verify whether the background information generated by the user is useful to promote the aforementioned content or, vice versa, whether it may penalize the expected popularity. Moreover, such a consideration allows us to generalize our approach and apply it to other social networks where only text is reported.

In particular, the proposed method, which we named XGBoost Instagram Predictor (XGB-IP), consists of two main steps:

1. a feature engineering phase, in which, starting from the fetched data (according to the procedure described in Section 4), we enriched the dataset with some derived information, as well as removing data whose contribution was negligible or not interesting for the class prediction;
2. a classification phase, in which the resulting features are used to train the popularity classifier.

The details of these two steps are described below. The derived features are summarized in Table 2.

FEATURE TYPE            DESCRIPTION
Average likes           Average number of likes of the K most recent posts of the account, for different values of K.
Recent likes            The exact number of likes achieved by the most recently published posts.
Time features           The scheduled date and time of the post to be published.
Text-related features   The features derived from the caption (number of words, sentiment score, hashtag popularity, emojis, etc.).

The use of binary features proved to be more effective than the use of discrete features which also count the number of emojis present in the text.
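The binary emoji features mentioned above can be sketched as follows. This is a hypothetical illustration: the category names and the emoji sets are placeholders of ours, not the categories actually used in the paper; what the sketch shows is only the design choice of recording presence (0/1) rather than a count.

```python
# Illustrative emoji categories; the paper's actual grouping is not reproduced here.
EMOJI_CATEGORIES = {
    "positive": {"😀", "😍", "👍"},
    "negative": {"😢", "😡"},
}

def emoji_binary_features(caption):
    """One binary feature per category: 1 if the caption contains at least
    one emoji of that category, 0 otherwise (counts are deliberately ignored)."""
    return {
        f"has_{name}_emoji": int(any(ch in emojis for ch in caption))
        for name, emojis in EMOJI_CATEGORIES.items()
    }
```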

In parallel, we also collected all the hashtags referenced in the posts of our dataset, and then we associated a weight with each of them, intended as the number of posts in which they appear, relative to the whole Instagram network. Then, similarly to what we did for the emojis, we created 10 macro-categories of hashtags, corresponding to 10 different levels of hashtag popularity (determined by dividing the range of weights between the most used and the least used hashtag into 10 parts, according to a logarithmic scale). In this way, each macro-category corresponds to a binary feature that is set to 1 when the caption includes at least one hashtag belonging to that level.

We tested the classification algorithms with different configurations, by performing a parameter exploration, mainly on the following: learning_rate, which represents the shrinkage of each tree's contribution; n_estimators, which is the number of boosting stages to perform; and max_depth, which limits the number of nodes in each tree. Table 3 shows the final configurations obtained for both the classification algorithms considered. In addition to those mentioned above, several tests were also performed on the other parameters; however, since they did not bring significant improvements in accuracy or execution time, we opted to use the default values.
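The logarithmic binning of hashtag weights into 10 popularity levels can be sketched as below. This is a minimal sketch under our own assumptions (function names are ours; weights are assumed to be positive post counts); the paper does not specify the exact binning code.

```python
import math

def hashtag_popularity_level(weight, min_w, max_w, n_levels=10):
    """Map a hashtag weight (number of posts it appears in) to one of
    n_levels popularity levels, dividing the [min_w, max_w] range of
    weights into n_levels parts on a logarithmic scale."""
    log_min, log_max = math.log(min_w), math.log(max_w)
    if log_max == log_min:
        return 0
    pos = (math.log(weight) - log_min) / (log_max - log_min)
    return min(int(pos * n_levels), n_levels - 1)

def hashtag_level_features(caption_hashtags, weights, min_w, max_w):
    """One binary feature per level: 1 if the caption contains at least
    one hashtag belonging to that popularity level."""
    feats = [0] * 10
    for tag in caption_hashtags:
        feats[hashtag_popularity_level(weights[tag], min_w, max_w)] = 1
    return feats
```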

Experimental evaluation

Below, we describe the experiments conducted to validate the effectiveness of our method, carried out on the dataset described in Section 4.
We also observe that, although these baselines globally obtain good performance (especially as the value of the considered K parameter increases), the best one is given by fixing j = 1 (Baseline 1); in particular, for values of j > 3, we found a rapid decay of the baseline accuracy. Therefore, hereafter, we only consider j ∈ {1, 2, 3} for our comparison.

However, we introduce an additional baseline, which represents a special case, and we indicate it with index j = 4 (Baseline 4). It is simply defined as the average of the first three baselines:

Baseline_4(i) = (1/3) · Σ_{j=1}^{3} Baseline_j(i)    (4)
The baselines defined above will then be used in the remainder of this section for the comparison with the proposed method.

The recall for a generic class c can be defined as follows:

R_c = TP_c / (TP_c + FN_c)

Using this definition, we can obtain the balanced accuracy in the following way:

Balanced Accuracy = (R_c0 + R_c1) / 2

where R_c0 and R_c1 represent the recall values for class 0 (unpopular) and class 1 (popular), respectively.
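As a minimal from-scratch sketch (function names are ours; in practice a library implementation such as scikit-learn's would typically be used), the balanced accuracy of the two classes can be computed as:

```python
def recall(y_true, y_pred, cls):
    """Recall of a single class: TP / (TP + FN)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if tp + fn else 0.0

def balanced_accuracy(y_true, y_pred):
    """Mean of the recalls of class 0 (unpopular) and class 1 (popular)."""
    return (recall(y_true, y_pred, 0) + recall(y_true, y_pred, 1)) / 2
```

Averaging the per-class recalls, rather than counting correct predictions globally, prevents the majority class from dominating the score on imbalanced data.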
This metric is calculated independently for each of the two classes c0 and c1, resulting in two values F1_c0 and F1_c1. At this point, the final result is obtained as follows:

F1_weighted = W_0 · F1_c0 + W_1 · F1_c1

where W_0 and W_1 represent the weights associated with the two classes, whose values depend on the number of true instances of each class.
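A from-scratch sketch of the weighted F1, assuming (as is standard for this metric) that each class weight is the fraction of true instances belonging to that class; the function names are ours.

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 score of a single class, from precision and recall."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def weighted_f1(y_true, y_pred):
    """Per-class F1 scores, weighted by each class's share of true instances."""
    n = len(y_true)
    return sum(
        (sum(t == cls for t in y_true) / n) * per_class_f1(y_true, y_pred, cls)
        for cls in (0, 1)
    )
```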

We now show the results of the experimental evaluation of our algorithm, in terms of the metrics described in Section 6.3, and by comparing it with the baselines outlined above. We performed a total of 12 experiments, spanning different combinations of K (10, 30, and 50) and of the ∆ threshold. We observe that the best baseline is Baseline 4, although for ∆ = 0, Baseline 1 reaches slightly better results.

In this context, our method (XGB-IP) achieves the best overall performance for all the ∆ thresholds (except for the F1-Score with ∆ = 0.15). Here, the most significant value is a balanced accuracy of 57.22% for ∆ = 0: in this case, we get a relative improvement of +8.59% compared to the best baseline. This result makes it possible to affirm that the distributed implementation of the method is scalable and efficient.

Author Contributions: writing, original draft, A.S.P. and G.U.; writing, review and editing, S.C., A.S.P., D.R.R., R.S., and G.U. All authors have read and agreed to the published version of the manuscript.