Sequential Recommendations on GitHub Repository

Abstract: The software development platform industry is expanding steadily, driven by active research on and sharing of artificial intelligence and deep learning. Predicting users' propensities in this huge community and recommending new repositories is beneficial for both researchers and users. Despite this, only a few studies have addressed recommendation systems for such platforms. In this study, we propose a method to model the extensive user data of an online community with a deep learning-based recommendation system. We show that new repositories can be effectively recommended based on the big data accumulated from users. Moreover, this is the first sequential recommendation study to provide a new software development platform dataset that is as large as the prevailing datasets. The experiments show that the proposed dataset can be applied to various recommendation tasks.


Introduction
GitHub is one of the biggest software development platforms, where a large number of users upload their open-source projects. It has gained popularity at an expanding rate, as depicted in Figure 1. It stores a vast number of source-code repositories related to researchers' projects or papers, allowing them to be shared with more people online. As of August 2020, the platform hosted more than 280 million repositories and 69 million users (GitHub API, https://api.github.com/, accessed on 20 November 2020). Driven by the increasingly computation-intensive nature of modern scientific discovery, this trend is growing particularly fast in the deep learning (DL) and machine learning fields; a clear example is that published research papers often provide links to GitHub. However, to find useful information or repositories on GitHub, one must inspect projects manually using a search filter or browse popular projects on the Explore GitHub page (https://github.com/explore). That page provides categories limited to static themes such as 'recently visited repositories', 'recently visited topics', 'starred repositories', or trending projects based on recently visited topics. GitHub thus offers recommendations on general topics, but in our experience it depends heavily on temporally close and content-based relationships (category, language, etc.).
A recommendation system is an algorithm that proposes items related to or preferred by a user. Effective recommendations have become essential as recommenders are applied in a large number of fields whose content grows at an exponential rate, and many studies have consequently been conducted on the subject. The emergence of deep learning has contributed to significant improvements in this research. Various models and techniques have been proposed, including neural collaborative filtering (NCF) [1], neural factorization machines [2,3], recurrent neural networks (RNNs) [4][5][6][7], convolutional neural networks (CNNs) [8,9], and reinforcement learning models [10]. Each model tends to recommend items based on a different point of interest depending on its particular task: some systems handle user-based settings, while others deal with item-based settings regarding the user's general long-term preferences, personalization, or sequential interactions. Most traditional recommendation systems are content-based or collaborative filtering (CF)-based. They model preferences for items from explicit or implicit interactions between users and items; specifically, they assume that all user-item interactions in the historical record are equally important and try to learn a user's static preferences from the interaction history. However, this assumption may not hold in real-life scenarios, where a user's next action relies heavily on current intentions as well as static long-term preferences, and may be deduced from, and influenced by, a small set of the most recent interactions. Furthermore, conventional approaches ignore the sequential dependencies between user interactions and therefore model user preferences incorrectly. Consequently, sequential recommendation is gaining popularity in both academic research and practical applications.
Prevailing recent recommendation systems, such as the convolutional sequence embedding recommendation model (Caser) [8] and self-attentive sequential recommendation (SASRec) [11] utilize interaction sequences that contain useful information about each user's behavior, e.g., listening to music, purchasing merchandise, watching YouTube, and the similarity between the items, by capturing both long-range and short-range dependencies of user-item interaction sequences.
In this study, we tested recommendation algorithms on GitHub interaction datasets to determine whether they can adequately capture sequential interactions reflecting academic and research interests in large datasets. We evaluated our dataset by implementing gated recurrent units for recommendation (GRU4Rec) [4], Caser, and SASRec, which are DNN-based sequential recommendation algorithms. Based on sequential recommendation studies [8,9,11,12], we assume that supplying the model with only each user's item-preference sequence is sufficient for testing recommendation in this field. In a recommendation task, cold start refers to the situation in which there is not enough recorded data to recommend items to new users. In summary, when cold starters are defined as users with fewer than 40 interactions and repositories with fewer than 160 interactions, normalized discounted cumulative gain (NDCG) [13] scores of 0.067 and 0.145 were achieved for ten items, along with precision scores of 0.013 and 0.018. The contributions of this research are as follows.
• As far as we are aware, we are the first to provide a large-scale GitHub dataset for recommendation. We provide the dataset at two different scales, containing approximately 19 million interactions with over 230,000 users and 102,000 repositories.
• We present an in-depth experiment on recommendation with the GitHub dataset.
• We introduce the potential of sequential recommendation on a researchers' platform.

Related Works
In this section, we describe the related works based on three aspects: recommendation system studies in GitHub, general recommendation systems, and sequential recommendation systems.

Recommendations in GitHub
Millions of repositories and users have greatly facilitated researchers' studies; however, only a small number of studies have focused on this particular platform [14][15][16][17][18][19][20][21][22]. Among the pioneering works, [15,19,20] are based on term frequency-inverse document frequency, which is commonly used to reflect the importance of a word to a document in a collection or corpus [23]. Furthermore, Zhang et al. [22] and Shao et al. [16] are the most recent, and the only, DNN-based recommendation studies in this specific field; they utilize a deep auto-encoder and graph convolutional networks (GCNs), respectively. Zhang et al. [22] introduce FunkR-pDAE (Funk singular value decomposition recommendation using the Pearson correlation coefficient and deep auto-encoders), which applies a modified singular value decomposition to optimize the similarity of user and item matrices, and a deep auto-encoder to learn latent sparse user and item features. Shao et al. [16] encode both the graph neighborhood and the content information of created nodes to be recommended or classified [24], capturing interactions between users and items. Although these deep learning-based recommendation systems are superior to traditional item-based or machine learning-based approaches, the studies are content-based and cannot represent temporal or sequential interaction patterns.

General Recommendation Systems
Generally, a recommendation system recommends to a user an item that another user with similar preferences was interested in, based on the interaction history [25][26][27] using CF. Although such traditional machine learning-based algorithms have been replaced by modern deep learning-based algorithms, they pioneered this field of study. Specifically, non-negative matrix factorization [26] projects items and users into latent spaces by exploiting global information through inner products of vectors and predicts the user's interest in items. The two-layer restricted Boltzmann machine [28] is another popular CF algorithm; Salakhutdinov et al. [28] pioneered DNN-based algorithms and won the Netflix Prize (https://www.netflixprize.com). NCF [1], AutoRec [29], and the collaborative denoising auto-encoder [30], which use auto-encoders along the lines of DNNs, have replaced the earlier algorithms; however, the principal cause of the decline of all these systems is that they cannot represent sequential information.

Sequential Recommendation Systems
As GitHub hosts recent projects in various fields, researchers tend to consult projects depending on their current interests, e.g., ongoing projects or studies. Therefore, temporal and sequential interactions on GitHub are essential information for a recommendation system. A sequential recommendation system mainly aims to forecast consecutive items based on the user's past sequential interaction patterns in chronological order. The growing popularity of these systems is due to their robustness to the cold-start problem; moreover, no information within the user's sequential behavior (e.g., purchasing items, listening to music, and streaming videos) is ignored. The initial approaches were Markov chain-based models (Rendle et al., He et al., and Zhang et al. [31][32][33]), which recommend based on the L previous interactions with an L-order Markov chain. The next paradigm comprised RNN-based models, such as GRU4Rec [4] and other RNN models [5,7], used to learn representations of user behavior sequences. This approach has attracted attention because RNNs are proficient at modeling sequential data [34]. In contrast, Tang and Wang [8] proposed a CNN-based model that uses horizontal and vertical convolutional filters to address the vulnerability of RNNs to the vanishing gradient problem while modeling sequences, as an alternative way to learn sequential patterns.
Recently, several studies have attempted to utilize neural attention mechanisms to exploit sequential interactions and improve recommendation performance [35][36][37]. Attention mechanisms learn to focus only on the features that are important for a given input, and have therefore been studied actively in natural language processing (NLP) and computer vision. In addition, unlike standard algorithms, the self-attention mechanism [38] models complex sentence structure and, when generating the next word, searches for related words by considering the relationships between words. Various approaches using the attention mechanism [11,12,39] have been proposed to represent the sequential behavior of users through high-level semantic combinations of elements, and have achieved state-of-the-art results. Consequently, we adopted sequential recommendation models to evaluate recommenders in modeling the extensive user data of an online community.

GitHub Dataset
In this section, we discuss the dataset obtained from the GitHub database. The data were obtained and crawled from the public GitHub Archive (https://www.gharchive.org/) (accessed 18 December 2020). Although activity archives are available from 12 February 2011 to the present day, we selected data only from the year 2018. The database includes the repository id, hashed user id, programming language, and description of each repository. Based on this online database, we constructed the GitHub dataset, as shown in Table 1. The dataset is stored as a Python dictionary serialized with the pickle module. We assume that an interaction between a user and an item is created when the user stars a repository. This assumption was made because users' repository visitation logs were not available: users may repeatedly or simultaneously visit multiple repositories, but a record of such actions is not publicly accessible, so we are limited to users' deliberate starring of repositories. Additionally, as the timestamp of each activity could not be retrieved, and under the assumption that exact time information is not a necessary feature, we excluded it from our GitHub dataset.
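Under the assumption stated above that a star creates an interaction, star events can be extracted from the archive's hourly gzipped JSON-line dumps, where a star is recorded as a WatchEvent. A minimal sketch (field names follow the 2018 archive schema; error handling is omitted):

```python
import gzip
import json

def extract_star_events(archive_path):
    """Yield (user_login, repo_name) pairs for star events in one
    GitHub Archive hourly dump (gzipped JSON lines). A star appears
    as a 'WatchEvent' in the archive's event stream."""
    with gzip.open(archive_path, "rt", encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "WatchEvent":
                yield event["actor"]["login"], event["repo"]["name"]
```

Events of other types (pushes, issues, etc.) are simply skipped, leaving only the star interactions used in the dataset.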
Each feature requires further explanation owing to the pre-processing applied for efficiency. First, each dataset under a different cold-start interaction threshold is divided into train, validation, and test data in the Python dictionary format; this is elaborated in the next section. Second, within that dictionary, each repository, hashed user index, and programming language is given a unique id. Finally, the raw description of each repository is provided with the parsed vocabulary and its unique ids for anyone interested in content-based recommendation. A GitHub link (https://www.github.com/John-K92/Recommendation-Systems-for-GitHub) to download our dataset and a custom dataset pre-processing method are provided so that readers can use this dataset with any DL model. This pre-processing method is based on the Interactions module of Spotlight, a PyTorch-based library widely used in recommendation tasks to handle sequences of user-item actions; refer to the Spotlight documentation [40] for details. Other popular recommendation datasets, such as MovieLens and Amazon (http://jmcauley.ucsd.edu/data/amazon/), are compared on their statistics in Table 2. For every dataset, interactions were assigned by the implicit feedback (reviews or ratings of items) of each user and item. As seen in Table 2, the domains and sparsity of real-world datasets differ significantly, and our GitHub dataset is clearly the largest among the well-known datasets. A dataset that is too large may be considered to contain many items rather than the user-item interaction information that is the powerful signal in a recommendation system. However, when the interaction cold-start threshold is set to 2, the number of users, number of items, and total log size are reduced to 2,572,450, 2,054,002, and 55,380,271, respectively.
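A minimal sketch of loading one split of the pickled dataset into the parallel id arrays that Spotlight-style Interactions objects expect. The dictionary keys shown ('train', 'user_ids', 'item_ids') are illustrative assumptions, not the released file's confirmed layout; check the repository for the actual schema:

```python
import pickle
import numpy as np

def load_interactions(pkl_path, split="train"):
    """Load one split of the pickled GitHub dataset and return
    parallel user/item id arrays, the shape expected by sequence
    recommenders (e.g. Spotlight's Interactions). Key names here
    are assumed for illustration."""
    with open(pkl_path, "rb") as f:
        data = pickle.load(f)
    split_data = data[split]
    user_ids = np.asarray(split_data["user_ids"], dtype=np.int32)
    item_ids = np.asarray(split_data["item_ids"], dtype=np.int32)
    assert user_ids.shape == item_ids.shape  # one item id per user id
    return user_ids, item_ids
```

The two arrays can then be passed directly to a Spotlight `Interactions(user_ids, item_ids)` constructor or any equivalent wrapper.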
As a result, the GitHub dataset was utilized in our experiment by adjusting the cold-start threshold appropriately. Given its size, relative sparsity, and average number of interactions, the GitHub dataset is an interesting database for researching and developing recommendation systems.

Baseline Recommendation Models
In our experiment, we assume a set of given users $U = \{u_1, u_2, \ldots, u_{|U|}\}$ and a set of repositories $R = \{r_1, r_2, \ldots, r_{|R|}\}$. $S^u = \{r^u_1, \ldots, r^u_t, \ldots, r^u_{|S^u|}\}$ represents the interaction sequence of each user in chronological order, where user $u$ interacts with $r^u_t \in R$ at step $t$. However, unlike in other tasks, $t$ stands for the sequential order of interactions, not the absolute timestamp used in temporal recommendation systems [5,41,42]; this follows the prominent DL-based recommendation models Caser and BERT4Rec [12] and sequential methods [6,43].
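The per-user sequences $S^u$ can be built from a time-ordered interaction log as follows. This is a minimal sketch for illustration; the released dataset already ships pre-built splits:

```python
from collections import defaultdict

def build_sequences(interactions):
    """Group a chronologically ordered list of (user, repo) pairs
    into per-user sequences S_u. Only the order of each user's
    events is kept, matching the paper's use of sequential position
    rather than absolute timestamps."""
    sequences = defaultdict(list)
    for user, repo in interactions:  # input assumed already time-sorted
        sequences[user].append(repo)
    return dict(sequences)
```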

GRU4Rec
The GRU4Rec [44] model focuses on modeling interaction sequences in session-based scenarios. The backbone of GRU4Rec is a variant of the RNN, the gated recurrent unit (GRU), a more sophisticated cell that effectively alleviates the vanishing gradient problem of an RNN with an update gate and a reset gate. The gates in a GRU mainly learn to control how much information is used to update and forget the hidden state of the unit. For instance, the reset gate controls how much information to remember from the previous state; similarly, the update gate controls how much of the new state is a copy of the old state. Considering $W$ and $U$ as the learned weights at each gate and $h$ as the hidden state, the update gate is calculated from the input $x_t$ at time step $t$ as
$$z_t = \sigma(W_z x_t + U_z h_{t-1}),$$
and the reset gate is determined by
$$r_t = \sigma(W_r x_t + U_r h_{t-1}),$$
where the sigmoid $\sigma(\cdot)$ is an activation function for the non-linear transformation. The memory content that stores information from the previous state using the reset gate is given as
$$\hat{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1})),$$
where $\odot$ is an element-wise product. What to store in the current memory content from the past and current steps is determined by a final linear interpolation:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t.$$
GRU4Rec outperforms RNN-based algorithms; its basic architecture is shown in Figure 2. In this illustration, $L$ represents the previous items and $T$ the target item sequences within the user interaction sequence $S^u$, and $E$ represents the embedding matrix of the $L$ previous consecutive interactions. To capture inter-dependencies, Hidasi et al. [44] further improved recommendation accuracy by introducing new ideas such as the TOP1 loss [44], the Bayesian personalized ranking loss [45], and mini-batch negative sampling.
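As an illustration of the gate equations above, a single GRU step can be sketched in NumPy. Biases are omitted for brevity and the parameter names are ours, not GRU4Rec's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU step, following the update/reset-gate equations above.
    params holds the weight matrices W_z, U_z, W_r, U_r, W_h, U_h."""
    z = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev)  # update gate
    r = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev)  # reset gate
    # candidate state: reset gate scales the contribution of h_prev
    h_hat = np.tanh(params["W_h"] @ x_t + params["U_h"] @ (r * h_prev))
    # linear interpolation between old state and candidate state
    return (1.0 - z) * h_prev + z * h_hat
```

Running the cell over a session's embedded items, one step per interaction, yields the hidden state from which GRU4Rec scores the next item.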

Caser
To learn sequential representations from user-repository interactions, this model applies both horizontal and vertical convolutional filters to leverage latent factors with a CNN. The convolution treats an embedding matrix $E \in \mathbb{R}^{L \times d}$ as an input image, where $L$ is the number of historical interacted items and $d$ is the dimensionality of the latent space [8]. Caser [8] learns sequential interaction patterns by combining local features of this image with the general preferences of users. The horizontal convolution computes
$$c^k_i = \phi_c(E_{i:i+h-1} \odot F^k),$$
where $\phi_c$ is the activation function of the convolution layers and $h$ is the height of the filter $F^k$, as illustrated in Figure 3. Subsequently, the vertical convolution captures latent features by
$$\tilde{c}^k = \sum_{l=1}^{L} \tilde{F}^k_l \cdot E_l,$$
with a fixed filter size of $1 \times L$, unlike the horizontal convolution layer. The outputs of these two convolution streams are then fed into a fully connected layer to obtain abstract, higher-level features. Finally, the output $z$ and the user's embedded general preference $P_u$ are concatenated, and the result is linearly projected to an output layer:
$$y = W \begin{bmatrix} z \\ P_u \end{bmatrix} + b,$$
where $W \in \mathbb{R}^{|J| \times 2d}$ and $b \in \mathbb{R}^{|J|}$, with $|J|$ nodes, are the weight matrix and bias term of the output layer, respectively.
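Caser's two convolution streams over an embedded sequence can be sketched as a toy NumPy version. The ReLU activation and max-pooling follow the original design, but the variable names are illustrative:

```python
import numpy as np

def caser_features(E, horiz_filters, vert_filters):
    """Toy version of Caser's two convolution streams over an
    embedded sequence E of shape (L, d). Each horizontal filter has
    shape (h, d) and slides over consecutive items; each vertical
    filter has shape (L,) and forms a weighted sum of the L rows."""
    L, d = E.shape
    horiz_out = []
    for F in horiz_filters:                              # F: (h, d)
        h = F.shape[0]
        conv = [np.maximum(0.0, np.sum(E[i:i + h] * F))  # ReLU(E_{i:i+h-1} . F)
                for i in range(L - h + 1)]
        horiz_out.append(max(conv))                      # max-pool over positions
    vert_out = [F_v @ E for F_v in vert_filters]         # each yields a (d,) vector
    return np.concatenate([np.asarray(horiz_out)] + vert_out)
```

The concatenated feature vector corresponds to the input of Caser's fully connected layer, before being joined with the user embedding $P_u$.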

SASRec
Kang and McAuley [11] used a self-attention module, comprising a left-to-right unidirectional single-head attention model, to capture interaction representations. The architecture of the model is elaborated in Figure 4. The model identifies relevant or essential items within the historical sequential interactions and forecasts the next item in the sequence. Unlike CNN- or RNN-based models, it is cost-efficient and able to model semantic and syntactic patterns among interactions. SASRec tends to focus on long-term dependencies in dense datasets while focusing on relatively recent interactions in sparse datasets. The scaled dot-product attention is
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V,$$
where the inputs fed into the self-attention block, the queries, keys, and values, are denoted as $Q$, $K$, and $V$, respectively. Here, $d$ is the dimensionality, and the scale factor $\sqrt{d}$ helps avoid undesirably large inner-product values in the output of the weighted sum of values computed from the interactions between queries and keys. Each self-attention head is calculated as
$$S = \mathrm{Attention}(EW^Q, EW^K, EW^V),$$
where $E$ is the embedding of the input sequence and $W^Q$, $W^K$, and $W^V$ are learned projection matrices. The output is then fed to a point-wise two-layer feed-forward network that shares parameters across positions:
$$F_i = \mathrm{FFN}(S_i) = \mathrm{ReLU}(S_i W^{(1)} + b^{(1)})\,W^{(2)} + b^{(2)}.$$
The output $F_i$ thus captures the entire user-item sequence. Kang and McAuley [11] also stacked additional self-attention blocks to learn more complex interactions from the sequential item information.

Experiments

This section first describes the pre-processed GitHub dataset with statistics, and then evaluates the recommender models and their variations. We then elaborate on the results and analyses of the recommendation experiments. For a fair comparison, the training and evaluation code was imported from each respective author.
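The scaled dot-product attention at the core of SASRec can be sketched as follows. Shapes are illustrative, and the causal mask that keeps SASRec unidirectional is omitted here:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, as in the
    SASRec self-attention block. Q, K, V have shape (n, d). SASRec
    additionally applies a causal mask, not shown in this sketch."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n, n) scaled similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted sum of values
```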
As mentioned in Section 3.1, we focus on the interactions between users and repositories. Therefore, although a repository has attributes such as timestamp, description, and programming language, we exclusively use the interactions. In that respect, we ruled out cold-start users and repositories with fewer than n interactions, where n was set to 40 and 160. The reasoning is that sparsely interacted items may act as outliers within the dataset, and filtering ensures strong relations among the interactions, as practiced in common recommendation systems [1,8,32,43,46]. A cold-start size of n in the GitHub dataset means that no item or user with fewer than n interactions remains. Along with the n cold-start threshold, we set the number of previous interactions L to 5 and the target sequence length T to 3. Table 3 shows a brief analysis of the two datasets used for our training and experiments. The total number of interactions is reduced from the original 37,068,153 to 19,064,945 and 2,968,165, respectively, for the two n values. n can be varied; however, we set it to these specific values to acquire adequate correlation for modeling recommendation tasks in consideration of data sparsity. As the sparsity of a dataset significantly affects recommendation modeling, we defined two values of n to vary the sparsity. Consequently, the number of interactions dropped from 37 million to a minimum of approximately 2 million, which is still significant for evaluating recommendation performance. The data were split into 20% for testing and 10% for validation for each n. When applying recommendation systems to a given dataset, setting an appropriate minimum number of historical interactions, and often modeling subsidiary information such as language and description, is imperative.
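The cold-start filtering described above can be sketched as follows. This is a minimal sketch under our own assumption that the filtering is iterated to a fixed point, since dropping users can push repositories below the threshold and vice versa; the paper does not spell out its exact procedure:

```python
from collections import Counter

def filter_cold_start(interactions, n_user, n_item):
    """Repeatedly drop users and repositories with fewer than the
    given numbers of interactions. `interactions` is a list of
    (user, repo) pairs; iteration stops when no pair is removed."""
    while True:
        user_counts = Counter(u for u, _ in interactions)
        item_counts = Counter(r for _, r in interactions)
        kept = [(u, r) for u, r in interactions
                if user_counts[u] >= n_user and item_counts[r] >= n_item]
        if len(kept) == len(interactions):  # fixed point reached
            return kept
        interactions = kept
```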
First, the GitHub dataset is skewed: only a small number of repositories account for a significant share of interactions, similar to other recommendation datasets. As shown in Figure 5a, fewer than 20% of the most-interacted repositories account for almost 90% of all interactions in the dataset. In comparison, the users are relatively evenly distributed, as shown in Figure 5b. This indicates that the amount of information lost by setting the value of n differs between users and repositories. Second, for the minimum cold-start size in our experiment, n = 40, Figure 6 presents brief statistics of the 'language' feature in the dataset. Although we used only sequential interactions, not the language, description, or time information that could also be exploited in a recommendation system, we present concise statistics of the 'language' feature. Figure 6a illustrates the top 10 languages by repository count. One notable characteristic is the large share of "None" values, and a few languages account for most repositories in the dataset, as seen in Figure 6b. Therefore, careful attention is needed when using the 'language' feature of the GitHub dataset. Nevertheless, because many repositories use common programming languages, it can still be a prominent feature in recommendation tasks, and supplementary feature handling may help model performance.
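The skew statistic discussed above, the share of interactions covered by the most-interacted fraction of repositories, can be computed as follows (a minimal sketch over per-repository interaction counts):

```python
import numpy as np

def interaction_concentration(item_counts, top_frac=0.2):
    """Fraction of all interactions covered by the most-interacted
    `top_frac` of repositories, the skew statistic of Figure 5a."""
    counts = np.sort(np.asarray(item_counts, dtype=float))[::-1]
    k = max(1, int(len(counts) * top_frac))  # size of the top group
    return counts[:k].sum() / counts.sum()
```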

Evaluation Metrics
To evaluate the recommendation systems, we implemented common evaluation metrics for the task, including precision@N, recall@N, mean average precision (MAP), normalized discounted cumulative gain (NDCG@N), and mean reciprocal rank (MRR), as in previous studies [8,11,12,31,47,48]. Precision@N and recall@N are determined by
$$\mathrm{Precision@}N = \frac{|\hat{R}_{1:N} \cap R|}{N}, \qquad \mathrm{Recall@}N = \frac{|\hat{R}_{1:N} \cap R|}{|R|},$$
where $\hat{R}_{1:N}$ is the list of top-$N$ estimated repositories per user and $R$ is the set of ground-truth repositories. MAP is the mean of the average precision (AP) over all users in $U$; given $\mathrm{rel}(N) = 1$ if the $N$th predicted item in $\hat{R}$ is in the ground truth $R$ and 0 otherwise, AP is computed by
$$\mathrm{AP} = \frac{\sum_{N} \mathrm{Precision@}N \times \mathrm{rel}(N)}{|R|}.$$
NDCG@N is a position-aware metric that assigns rank-specific weights to measure ranking quality. It can be calculated as
$$\mathrm{NDCG@}N = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\log_2(1 + R_{u,g_u})},$$
where $R_{u,g_u}$ is the predicted rank of $g_u$, the ground-truth repository for a given user $u$. In recommendation tasks, MRR, like NDCG, measures how well the recommender assigns rankings. To distinguish the two indicators: MRR assesses whether ground-truth items are assigned high ranks in the forecasts, whereas NDCG weights rankings by assigning larger weights to higher positions. MRR is defined as
$$\mathrm{MRR} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\mathrm{rank}_u},$$
where $\mathrm{rank}_u$ is the rank of the first relevant item for user $u$.
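The metrics above can be sketched per user with binary relevance; dataset-level scores average these over all users. The NDCG sketch uses the common set-valued generalization with ideal-DCG normalization, which reduces to the single-ground-truth formula above when $|R| = 1$:

```python
import numpy as np

def precision_recall_at_n(ranked, ground_truth, N):
    """Precision@N and Recall@N for one user, given the model's ranked
    repository list and the set of held-out ground-truth repositories."""
    hits = len(set(ranked[:N]) & set(ground_truth))
    return hits / N, hits / len(ground_truth)

def ndcg_at_n(ranked, ground_truth, N):
    """NDCG@N with binary relevance: a hit at rank k (1-based)
    contributes 1 / log2(k + 1), normalized by the ideal DCG."""
    dcg = sum(1.0 / np.log2(k + 2)
              for k, item in enumerate(ranked[:N]) if item in ground_truth)
    ideal = sum(1.0 / np.log2(k + 2)
                for k in range(min(N, len(ground_truth))))
    return dcg / ideal

def mrr(ranked, ground_truth):
    """Reciprocal rank of the first relevant item (0 if none appears)."""
    for k, item in enumerate(ranked, start=1):
        if item in ground_truth:
            return 1.0 / k
    return 0.0
```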

Performance Analysis
In this experiment, we set L to 5 and T to 3, which heuristically gave the best performance when training all given models. That is, the input to each model is a sequence of 5 items, from which it predicts the next 3 items for each user. The cold-start thresholds of the two datasets are set to 40 and 200, under the assumption that datasets with similar sparsity will produce similar results.
The results of the recommendation algorithms described previously are shown in Table 4. The CNN-based model surpassed the other models on all metrics. Generally, self-attention-based recommendation systems such as SASRec, BERT4Rec, and AttRec [39] better predict subsequent items in relatively sparse datasets, especially where CNN-based recommenders have a limited receptive field [11] relative to the dataset size. These results suggest that short-term sequential interactions carry more weight on GitHub and similar platforms. The RNN-based recommender, GRU4Rec, performed worse than the other algorithms. This is likely because the model learns the sequence in a strictly step-by-step manner, which suggests that concentrating too heavily on the sequence degrades overall performance. Consequently, it is necessary to capture dependencies effectively not only in short-term but also in long-term information (e.g., users' general preferences) [8,9,11,43]. Considering the different cold-start thresholds n, Caser led the performance indicators Prec@10 and MAP at more than 200 interactions. The metrics were extremely close to each other at n = 40, but the gap widened at the larger threshold because the models then focus more on users with larger numbers of interactions. This clearly demonstrates that the CNN-based model, Caser, robustly captures short-term sequential interactions while also accurately modeling long-term preferences, especially in the less receptive areas. Figure 7 presents the predicted repositories for users randomly sampled from the validation dataset. The five sequential interactions are the model's input items, and we present the predicted output items. In this figure, 'category' denotes a hand-crafted feature of each item, and the 'language' feature is obtained from each output repository.
We manually checked the task or project of the output repositories and categorized the topics to visualize the example. Predicting the repository at time step t, we found that the item at the previous step t − 1 affected not only the forecast of the next item but also nearby subsequent steps in the sequential recommender. The highest-ranked item, item_{t,r}, shares one feature ('category') with the immediately previous item and another ('language') with the one before it. However, item_{t,r−1} is much closer to item_{t−3} and item_{t−4} than to sequentially nearby items such as the second most recently interacted item. This reveals that the outputs incorporate various features from a sequence perspective and that the recommender can represent users' sequential and general preferences. Figure 7 briefly illustrates the notion of deep learning-based sequential recommendation and the correlation of typical class features ('type' and 'category') in GitHub recommendation. In summary, the GitHub platform is clearly differentiated from other platforms in recommendation tasks.

Conclusions and Future Work
This study presented a DNN-based recommendation system using a large-scale dataset collected from GitHub. The pervasive growth rate and distinct usage patterns of the GitHub platform suggest that more studies should be carried out in this particular area. Consequently, we investigated the GitHub dataset using prevailing deep learning-based recommendation systems. The recommenders described in this study were evaluated on the GitHub platform using past user-repository interaction sessions. The results show that the CNN-based approach, Caser, is consistently better than the other recommendation algorithms and models personalized preferences and interactions on this platform regardless of the cold-start interaction threshold.
Future studies should therefore consider further state-of-the-art recommendation systems to study the properties of GitHub interactions, including the RNN-based VLaReT [49], which outperformed the baseline models mentioned here, an improved CNN-based recommender [50], and a hypergraph CNN-based recommender [51]. Moreover, using multiple labeled features, such as language types and NLP-based features modeled with a pre-trained language model, may be essential for analyzing user interactions and predicting future behavior.

Data Availability Statement: Publicly available datasets were analyzed in this study. The data are available in a publicly accessible repository that does not issue DOIs: https://www.github.com/John-K92/Recommendation-Systems-for-GitHub.

Conflicts of Interest:
The authors declare no conflict of interest.