Article

Semantic Interest Modeling and Content-Based Scientific Publication Recommendation Using Word Embeddings and Sentence Encoders

Social Computing Group, Faculty of Engineering, University of Duisburg-Essen, 47057 Duisburg, Germany
* Authors to whom correspondence should be addressed.
Multimodal Technol. Interact. 2023, 7(9), 91; https://doi.org/10.3390/mti7090091
Submission received: 12 August 2023 / Revised: 6 September 2023 / Accepted: 13 September 2023 / Published: 15 September 2023

Abstract

The rapid growth of data in the academic field has made recommendation systems for scientific papers increasingly popular. Content-based filtering (CBF), a pivotal technique in recommender systems (RS), holds particular significance in the realm of scientific publication recommendations. In a content-based scientific publication RS, recommendations are generated by observing the features of users and papers. Content-based recommendation encompasses three primary steps, namely, item representation, user modeling, and recommendation generation. A crucial part of generating recommendations is the user modeling process. Nevertheless, this step is often neglected in existing content-based scientific publication RS. Moreover, most existing approaches do not capture the semantics of user models and papers. To address these limitations, in this paper we present a transparent Recommendation and Interest Modeling Application (RIMA), a content-based scientific publication RS that implicitly derives user interest models from their authored papers. To address the semantic issues, RIMA combines word embedding-based keyphrase extraction techniques with knowledge bases to generate semantically-enriched user interest models, and additionally leverages pretrained transformer sentence encoders to represent user models and papers and compute their similarities. The effectiveness of our approach was assessed through an offline evaluation by conducting extensive experiments on various datasets, along with a user study (N = 22), demonstrating that (a) combining SIFRank and SqueezeBERT as an embedding-based keyphrase extraction method with DBpedia as a knowledge base improved the quality of the user interest modeling step, and (b) using the msmarco-distilbert-base-tas-b sentence transformer model achieved better results in the recommendation generation step.

1. Introduction

Every year, thousands of papers are published in journals and conferences by researchers in many different fields. Owing to the increasing amount of digital data resulting from the development of information technologies, literature search is becoming a challenging and time-consuming task in which it is ever more difficult to reach the desired information. With the constantly increasing number of papers, users frequently turn to current academic paper search engines (e.g., Google Scholar and Semantic Scholar) to search for relevant papers based on a set of keywords. Existing search engines often fail to satisfy users’ demands efficiently because they do not take individual user profiles into account. In fact, for a given search query, a search engine provides the same information to all users even though individual users may have their own interests and information needs. This drawback necessitates a personalized information system, such as a scientific publication recommendation system (RS) (sometimes denoted in the literature as paper RS, research paper RS, academic paper RS, scientific paper RS, article RS, scholar RS, etc.), to automatically present the most relevant papers to researchers based on their interests and information needs while minimizing the time they spend searching [1,2].
The field of recommending scientific publications has been extensively researched [3,4,5,6]. Common approaches for scientific publication recommendation include collaborative filtering (CF) (e.g., [7,8]), content-based filtering (CBF) (e.g., [9,10,11]), graph-based methods (e.g., [12,13]), and hybrid approaches (e.g., [14,15,16,17]), each of which attempts to measure relevance among research papers using different methods. CBF methods have been widely used in the literature [3,18,19]. Their popularity can be attributed to their effectiveness in comprehending the content of items, particularly textual ones, which leads to recommendations that are highly aligned with user interests. In addition, they are able to mitigate cold start and data sparsity issues, and are inherently transparent [3,4,20].
A fundamental part of generating recommendations is the user modeling process that identifies a user’s information needs [3,21]. CBF relies on inferring the interests of users, which can be explicitly provided by users as input queries (e.g., a paper or keywords) or implicitly inferred from the items that users have interacted with in the past (e.g., papers that the user authored, cited, tagged, browsed, or downloaded). These interests are then used to build user models, which are utilized to find relevant recommendations based on matching features between user models and papers [3,4,6,22]. Thus, a good user model plays an important role in enhancing the performance of the RS by providing more accurate recommendations [3,6,23]. Nevertheless, the majority of existing scientific publication RS neglect the user modeling process. In their survey of the literature on scientific publication RS between 1998 and 2013, Beel et al. [3] observed that many authors neglected the user modeling process. According to the authors, the majority (81%) of the surveyed approaches made their users provide keywords, text snippets, or a single input paper to represent their information needs. Only a few approaches automatically inferred information needs from the user’s historical item interactions. In their comprehensive review of the literature on scientific publication RS between 2019 and 2021, Kreutz and Schenkel [6] found that the problem of neglected user modeling persists.
Another major issue in current content-based scientific publication RS relates to capturing the semantics of user models and papers, which is essential to developing more accurate and effective RS [19]. Most current content-based scientific publication RS use the classical bag-of-words method, which represents the number of times each word occurs in a document. These methods do not consider the context of the words or the semantic similarity between words during the extraction and representation of the paper and user model features [6,9,19]. Recently, text and sentence embedding techniques have gained more and more attention due to the good performance they have shown in a broad range of NLP-related scenarios. By examining how words are used in large corpora of textual data, word embedding algorithms generate a low-dimensional vector space representation of words in an entirely unsupervised manner, enabling machines to understand and process textual information more effectively. This approach captures the semantic meaning of words or documents and the contextual relationships between them, which can be effectively used to extract meaningful data representations, obtain a semantic and relational understanding of the data, and measure semantic similarities between words or documents [6,9,24,25].
To address these limitations, in this paper we propose the transparent Recommendation and Interest Modeling Application (RIMA), a content-based scientific publication RS that leverages word embeddings and sentence encoders to improve the accuracy and effectiveness of the user modeling and recommendation generation tasks. Concretely, RIMA implicitly infers semantically enriched user interest models from users’ past publications by combining embedding-based keyphrase extraction techniques with knowledge bases, then utilizes pretrained transformer sentence encoders to encode semantic information of user models and papers and compute their similarities. We conducted extensive experiments on different datasets to evaluate our approach, along with an online user study (N = 22). Our results revealed that combining SIFRank [26] and SqueezeBERT [27] as an embedding-based keyphrase extraction method with DBpedia [28] as a knowledge base can improve the quality of the interest model generation task. For recommendation generation, we were able to generate a more accurate and better-ranked recommendation list using the sentence transformer model msmarco-distilbert-base-tas-b (https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b, accessed on 18 September 2022) to extract semantic representations of user models and papers in order to capture the semantic similarity between them.
The rest of this paper is organized as follows. Section 2 reviews related works on different methods of user interest model generation and content-based scientific publication recommendation. Section 3 presents the two pipelines related to user model construction and recommendation generation in RIMA. Section 4 presents the results of the offline evaluations and the user study. Finally, Section 5 and Section 6 point out limitations, summarize the work, and outline our future research plans.

2. Related Work

In this section, we provide an overview of the literature addressing different methods of interest model generation and scientific publication recommendation, with a focus on content-based approaches and NLP techniques.

2.1. Interest Model Generation

User modeling is a crucial task to achieve personalized services such as recommendation. The main aim of the user modeling process is to build a user profile by analyzing users’ shared information. User interests are one of the most critical pieces of information in the user model [29]. The process of automatically acquiring the user’s interests is known as interest modeling [30]. “User interest modeling”, “interest mining”, and “interest profiling” are all synonyms for “interest modeling” in the academic literature on user modeling. Interest modeling can be seen as the process of constructing a model to represent individual user interests based on their long-term and/or short-term information. User interest models can be generated through various approaches, including explicit user interest detection and implicit user interest mining [22]. Data in a user interest model are acquired through different methods, either manually, in which the user explicitly provides information about their interests and preferences, or implicitly, by analyzing user data such as behaviors, preferences, and other contextual information [31]. The widespread use of social media and digital publications has led researchers to focus on generating user interest models based on textual content containing keyphrases, which can be self-annotated by the user or automatically extracted using keyphrase extraction algorithms.
Several text mining methods have been used to generate user interest models. Text classification [32,33,34], named-entity recognition [35,36], and keyphrase extraction are popular techniques that are commonly used to construct interest models from text-based and social media-based data sources [22]. In this work, we focus on user interest modeling based on keyphrase extraction approaches. Keyphrase extraction plays a vital role in creating user interest models by uncovering meaningful patterns and insights from textual data. Keyphrase extraction approaches fall into various categories, the two most common being supervised and unsupervised. In supervised approaches, classification algorithms are commonly used to allocate users into predefined interest classes based on their data. Supervised approaches are relatively simple and easy to apply; however, they are domain-dependent and limited to identifying only the predefined interests which were used to train the prediction model [37]. Unsupervised approaches, on the other hand, can be applied in various domains and are not dependent on any predefined prediction model. Thus, they can generate a more diverse set of user interests. Moreover, they are able to automatically identify and capture user interests without relying on labeled training data, which necessitates a lot of human labor [38,39]. In this work, we focus on unsupervised keyphrase extraction methods. These can be broken down into statistical-based, graph-based, and embedding-based approaches. Statistical models such as Latent Dirichlet Allocation (LDA), Term Frequency–Inverse Document Frequency (TF-IDF) [40], Rapid Automatic Keyword Extraction (RAKE) [41], and YAKE! [42] often rely on features such as word frequency, n-gram features, location, and document grammar. Graph-based methods such as TextRank [43], SingleRank [44], ExpandRank [45], PositionRank [46], TopicalPageRank [39], TopicRank [47], and MultipartiteRank [48] attempt to model the relationships between words or phrases in the text.
Both statistical-based and graph-based methods are widely used for keyphrase extraction. However, neither approach considers semantics during the keyphrase extraction task. A disadvantage of these approaches is that they cannot provide additional information about the semantic relationships of the entities or concepts present in the text [30,49]. In the user interest modeling task, user models might contain similar interests represented in the form of acronyms (e.g., MOOC and massive open online course), synonyms (e.g., technology enhanced learning and elearning), and lexical variants (e.g., elearning and E-learning). In addition, there may be overgeneration problems (e.g., the keyphrases open learning analytics and learning analytics represent the same interest, namely, learning analytics) [29]. Due to this lack of semantic knowledge, traditional statistical-based and graph-based keyphrase extraction methods can identify semantically similar interests as different, which might reduce RS accuracy. To address semantic problems in the keyphrase extraction task, word and sentence embeddings are increasingly used as the basis of embedding-based keyphrase extraction techniques. Word and sentence embedding models are trained on large corpora to map text into a vector space [50]. Pretrained language model encoders are commonly used and have greatly advanced keyphrase extraction from textual data. Using sentence embedding techniques, Bennani-Smires et al. [38] developed the EmbedRank model for extracting keyphrases. To determine the document’s sentence embedding and the candidate keyphrases, the model employs two pretrained embedding models, Doc2Vec [51] and Sent2Vec [52]. The cosine similarity between the document embedding and the candidate keyphrase embeddings determines which keyphrases are chosen. By applying an embedding-based Maximal Marginal Relevance (MMR) criterion, EmbedRank increases coverage and diversity among the selected keyphrases. SIFRank, a more recent keyphrase extraction model utilizing ELMo [53] and SIF [54], was proposed by Sun et al. [26]. Noun phrases are extracted using tokenization and Part-of-Speech (POS) tagging, and their embeddings are calculated together with the document embedding; the cosine similarity is then used to pick the keyphrases. Due to SIFRank’s limitations on longer texts, the authors created SIFRank+, an extension of SIFRank that incorporates a position-biased weighting scheme to increase extraction accuracy.
In order to address these semantic issues, research works have incorporated knowledge bases such as Wikipedia [29,55,56,57,58,59,60,61], DBpedia [57,62,63,64], WordNet [65,66,67], Freebase [68], the Linked Open Data (LOD) cloud [49,69,70], and YAGO [71] to semantically represent user models. Semantic enrichment in the user modeling process is motivated by the need to enhance the accuracy of user models [22], increase the breadth of the keyphrases used to represent the users’ interests [22,49], gather additional contextual knowledge about the entities and the relationships between them [30,49], infer more transparent and serendipitous user models [56], and bypass the problems of acronyms, synonyms, lexical variants [29], and polysemy, i.e., when a word may have multiple meanings which cannot be distinguished using a keyword-based representation [30].
Our approach for interest model generation moves beyond existing works by combining word embedding-based keyphrase extraction techniques with Wikipedia/DBpedia as knowledge bases to generate semantically-enriched user interest models, thereby improving the quality of keyphrase extraction and user modeling by considering the semantic meanings of words.

2.2. Content-Based Scientific Publication Recommendation

Scientific publication RS are well studied in the literature. We refer the interested reader to four comprehensive literature reviews in this area [3,4,6,20]. The four predominant categories are content-based filtering (CBF), collaborative filtering, graph-based, and hybrid systems. In this work, we are interested in a scientific publication RS based on CBF. The key procedure in a content-based scientific publication RS is to match information between users (i.e., researchers) and items (i.e., publications). In general, recommendations in CBF methods are generated by observing features of users and publications. CBF mainly considers the users’ historical preferences and personal library to build the user interest model (i.e., the user profile). Then, CBF extracts keywords from the candidate publications and calculates the similarity of the keywords extracted from user profiles and candidate publications. Finally, publications with high similarity are recommended to users [4]. CBF includes three main steps: item representation, user modeling, and recommendation generation [4].

2.2.1. Item Representation

The appropriate item representation is very important, and is closely related to the performance of the RS [20]. In content-based scientific publication RS, items are represented by a content model containing the items’ features, which are typically word-based, i.e., single words, phrases, or n-grams [3]. Publications are mostly represented as TF-IDF vectors or based on keyphrase extraction models [4,6]. For example, Renuka et al. [72] used TF-IDF representations of automatically extracted keywords and keyphrases. A few approaches have used a topic modeling component, mostly based on LDA, to represent publications’ content. For example, Subathra and Kumar [73] used LDA on publications to find their top n words, then used LDA again on these words’ Wikipedia articles. To counter the semantic problem in content-based approaches that rely on basic TF-IDF representations of publications, recent research on content-based scientific publication recommendation increasingly adopts text embedding methods based on different parts of a publication (i.e., titles, abstracts, keywords, and bodies) [6]. The most common embedding methods used to represent the content of scientific publications include Word2Vec [9,23,74], Doc2Vec [2,74,75], GloVe [10], and SciBERT [11]. However, while widely used in graph-based and hybrid scientific publication RS (e.g., [76,77,78,79]), transformer-based embedding techniques, e.g., BERT, SBERT, and DistilBERT, remain under-investigated in content-based scientific publication RS.

2.2.2. User Modeling

One central component of a content-based scientific publication RS is the user modeling process. The user model typically consists of the features of a user’s publications [3]. The literature on scientific publication RS distinguishes between two ways to capture user preferences, implicitly and explicitly [19,23]. Implicit user modeling identifies needs automatically by inferring them from the user’s item interactions. Concretely, the interests of users are automatically inferred from the publications that users have authored or interacted with through actions such as reading, citing, tagging, browsing, or downloading [9,18,23,25,80,81]. In the explicit user modeling approach, the RS asks users to specify their preferences by explicitly providing a list of keywords or an input paper [9,82,83,84,85,86,87,88]. However, in this case an RS behaves similarly to a search engine and loses the capability to recommend publications when users do not know exactly what they need [3]. In our work, we focus on content-based scientific publication recommendation approaches that implicitly derive user interest models from the user’s authored papers. Only a few works follow this approach [2,89,90,91,92,93,94,95,96,97]. These works have built user models with keyphrases, concepts, or topics extracted from the researcher’s past publications using a bag-of-words (BoW) model, TF-IDF, topic modeling, keyphrase extraction, or embedding techniques. For example, Lee et al. [89] modeled researchers using a BoW model based on their papers retrieved from different digital libraries. Sugiyama and Kan [90] noted that an author’s published works constitute a clean signal of the latent interests of a researcher, and constructed researcher profiles using a feature vector comprising unique terms obtained from their list of previous publications based on TF. Nishioka et al. [91,92,93] constructed user models from research papers and tweets based on different variants of TF-IDF. To generate user models, Bulut et al. [94,95] considered a user’s past publications and represented users as the sum of the features of their publications. All the required metadata, such as the title, year, author, abstract, and keywords of each publication, were extracted and merged together in a profile represented by TF-IDF. Chen and Ban [96] used LDA as a topic modeling technique to topically cluster user interests mined from their published papers. First, a user’s publications were divided into different interest points by clustering technologies. Then, the user’s interests were represented in terms of pattern equivalence classes. Similarly, Amami et al. [97] constructed a user profile based on LDA-generated topics from the user’s publication corpus. Bulut et al. [2] used the Doc2Vec embedding method to construct user models while taking the user’s past articles into consideration. They found that the Doc2Vec-based representation of the user model achieved better results than TF-IDF. While keyphrase extraction techniques have been used to infer user models from users’ interactions with publications (e.g., in [80]), to the best of our knowledge there are no works that have utilized keyphrase extraction to implicitly derive user interest models from users’ authored publications.
In summary, while various methods have been utilized for building user interest models from researchers’ authored papers, approaches relying on keyphrase extraction or embedding techniques are lacking. Moreover, these methods do not consider the semantic issues in the user interest modeling task. In order to fill these research gaps, we combine keyphrase extraction, word embeddings, and knowledge bases to build semantically-enriched user interest models to be used as input for our content-based scientific publication RS.

2.2.3. Recommendation Generation

To generate a recommendation list, the similarity between user interest models and recommendation candidates is calculated using a vector space model and a similarity measure to ensure that candidate publications with high similarity are recommended to the researcher [3,4]. In most content-based scientific publication RS, cosine similarity is applied either between papers or between users and papers [6]. Between papers, similarity is computed between the feature vectors of the input paper on the one hand and the set of candidate papers to recommend on the other [11,24,72,73,74,75,98,99,100,101]. Between users and papers, similarity is computed using the constructed user profile and the feature vectors of the set of candidate papers to recommend [2,9,25,80,82,83,84,87,88,89,90,91,92,93,94,95]. Most similarity computations are based on papers and users represented by TF-IDF [72,80,83,84,87,88,90,91,92,93,94,95,99,100,101]. We found that whereas embedding techniques are often applied to compute similarities between papers (e.g., [11,24,74,75]), approaches utilizing embeddings of user models and papers remain scarce in the literature on content-based scientific publication recommendation [2,9,25].
Overall, our investigation reveals limited previous research utilizing embedding-based approaches to compute similarities between vector representations of the constructed user models and candidate papers to be recommended in the context of content-based scientific literature recommendation. Our work aims to fill this gap by adopting pretrained transformer sentence encoders in a scientific literature RS for embedding users and papers as well as for similarity computation.

3. RIMA Application

The transparent Recommendation and Interest Modeling Application (RIMA) serves as a content-based recommendation system for scientific publications [102,103,104,105,106,107,108,109,110]. RIMA was designed to automatically extract users’ interests from their past scientific publications and then utilize them to provide relevant publication recommendations. In this work, we focus on the generation process of interest models and recommendations. Each process is elaborated through a conceptual pipeline, where the methodology is explained, and a technical pipeline, where the implementation details are presented.

3.1. Interest Model Generation

3.1.1. Conceptual Pipeline

The pipeline for generating the interest model is depicted in Figure 1. The first step involves collecting all publications authored by a user in the last five years. Next, an unsupervised keyphrase extraction method is applied to the publications to obtain keyphrase-based interests. Subsequently, a knowledge base is utilized to semantically enrich the keyphrase-based interests. To introduce dynamism to the interest model, the interests are periodically updated over time using a forgetting function. In the following sections, we discuss these steps in detail.
Keyphrase-based interest model. In this work, our focus is on embedding-based keyphrase extraction techniques in comparison to other statistical- and graph-based approaches. To this end, we initially assessed the performance of various statistical- and graph-based keyphrase extraction algorithms to select the best-performing one as a baseline based on the Precision, Recall, and F-measure metrics, which we computed using exact matching (see the sketch below). The performance of the different keyphrase extraction algorithms, namely, TextRank [43], SingleRank [44], TopicRank [111], TopicalPageRank [112], PositionRank [46], MultipartiteRank [113], RAKE [41], and YAKE! [114], was benchmarked using the Inspec dataset [115]. The Inspec dataset is designed for benchmarking keyphrase extraction and generation techniques on abstracts of English scientific papers. It comprises a collection of 2000 scientific abstracts with sets of keyphrases identified by expert annotators. The results of the computation are summarized in Table 1, which indicates that SingleRank outperforms all other selected algorithms when extracting the top ten and top fifteen keyphrases. In particular, SingleRank (marked bold in Table 1) improves upon the strongest baseline, TopicalPageRank (underlined in Table 1), with respect to precision by 2.4%, recall by 2.7%, and F1-score by 2.4% when extracting ten keyphrases. When extracting fifteen keyphrases, the improvement is 0.9% in precision, 2.1% in recall, and 1.4% in F1-score. Therefore, we selected and implemented SingleRank as the baseline for our work.
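To make the evaluation protocol concrete, the following minimal sketch (in Python; the function and variable names are our illustration, not the actual benchmark script) shows how exact-match Precision, Recall, and F1 can be computed for one document:

```python
# Illustrative helper: exact-match evaluation of extracted keyphrases
# against gold annotations for a single document.
def exact_match_prf(predicted, gold, top_k=10):
    pred = [p.lower().strip() for p in predicted[:top_k]]
    gold_set = {g.lower().strip() for g in gold}
    matches = sum(1 for p in pred if p in gold_set)
    precision = matches / len(pred) if pred else 0.0
    recall = matches / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Corpus-level scores are then averaged over all Inspec abstracts.
```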
We employed SIFRank [26] as an embedding-based keyphrase extraction method to extract keyphrases from an author’s publications for the purpose of generating the interest model. SIFRank is a method for unsupervised keyphrase extraction based on a pretrained language model. It combines the sentence embedding model SIF [54] and the autoregressive pretrained language model ELMo [53]. The selection of SIFRank was based on its performance, as it achieved state-of-the-art results on short documents compared to other unsupervised keyphrase extraction techniques based on pretrained language models [26]. The evaluation conducted by the authors of SIFRank showed that it performed better when employing ELMo as the word embedding approach than with core transformer models such as BERT [116], RoBERTa [117], and XLNet [118] [26]. However, LSTM-based approaches such as ELMo can be time-consuming [116]. Therefore, we substituted the ELMo word embedding method in SIFRank with the pretrained model SqueezeBERT [27]. SqueezeBERT is a novel neural architecture which uses grouped convolutions. It runs 4.3x faster than BERT-base on the Google Pixel 3 smartphone while achieving competitive accuracy on the General Language Understanding Evaluation (GLUE) set of tasks, a standard evaluation benchmark for NLP research [27]. In addition to its speed, SqueezeBERT was chosen because of the increased information flow between its layers and its lightweight transformer design. Henceforth, we refer to this method as SIFRank_SqueezeBERT.
Wikipedia/DBpedia-based interest model. The use of a knowledge base to infer interest models has the potential to resolve several semantic-related problems, including the merging of synonymous interests, the reduction of acronym interests, and the elimination of noise caused by irrelevant keyphrases. Consequently, knowledge-based interest models should be more comprehensive and precise than keyphrase-based models. In this work, we employed two distinct knowledge bases, namely, Wikipedia and DBpedia, to construct semantically-enriched user interest models. Wikipedia is used to map the generated keyphrases to entities/concepts in the knowledge base: if a matching Wikipedia article title is found, the keyphrase is included in the interest model; otherwise, it is removed. To connect keyphrases to concepts in the DBpedia knowledge base [28], we utilized DBpedia Spotlight [119] as an entity linking service.
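As an illustration of this enrichment step, the sketch below queries the public Wikipedia API and the public DBpedia Spotlight endpoint; the endpoints are real, but the function names and parameter choices (e.g., confidence=0.5) are our illustrative assumptions rather than RIMA’s actual configuration:

```python
# Hedged sketch: filter keyphrases through Wikipedia and link them to
# DBpedia concepts via DBpedia Spotlight. Parameters are assumptions.
import requests

def wikipedia_title_exists(phrase: str) -> bool:
    # Keep a keyphrase only if a Wikipedia article with that title exists.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "titles": phrase, "format": "json"},
    ).json()
    return all("missing" not in page
               for page in resp["query"]["pages"].values())

def dbpedia_concepts(text: str, confidence: float = 0.5):
    # Link surface forms in `text` to DBpedia concept URIs.
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    ).json()
    return [(r["@surfaceForm"], r["@URI"]) for r in resp.get("Resources", [])]

interests = [kp for kp in ["learning analytics", "asdfgh"]
             if wikipedia_title_exists(kp)]  # the nonsense term is dropped
```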
Dynamic interest model. If interests are not regularly updated, they can lose their significance; hence, a forgetting function was deemed necessary. The weight of an interest diminishes with the time elapsed between the date it was generated and the current date. Cheng et al. [120] proposed a forgetting function to characterize the diminishment of human interests. By adjusting the half-life hl, they represented the gradual loss of interest in things that have not been recently updated:
$$ F(t) = e^{-\ln(2)\,(t - est)/hl} $$

where the forgetting coefficient F(t) represents the fraction of the original interest weight that remains, t represents the current date, and est represents the date when the original model was constructed. Here, hl represents the half-life (in days) that regulates the forgetting rate; a larger hl value results in a slower decline of interest. The update period for the publication-based interest models was set at 365 days (one year). Assuming t − est = hl, we have F(t) = 1/2, which means that the interest weight for publication data decreases by half per year.
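A minimal sketch of this decay, assuming Python and the hl = 365 days stated above (the function name is ours):

```python
# Forgetting function F(t) = exp(-ln(2) * (t - est) / hl); each interest
# weight is multiplied by F(t) when the long-term model is refreshed.
import math
from datetime import date

def forgetting_coefficient(t: date, est: date, hl: float = 365.0) -> float:
    return math.exp(-math.log(2) * (t - est).days / hl)

# Exactly one half-life after construction the coefficient is 0.5,
# so an interest weight of 4.0 decays to 2.0 after one year.
decayed = 4.0 * forgetting_coefficient(date(2023, 9, 15), date(2022, 9, 15))
```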

3.1.2. Technical Pipeline

Figure 2 illustrates the steps to generate the user interest model in the RIMA application. Users sign up using their Semantic Scholar ID, initiating an API request to the Django server. This request is then forwarded to the Celery worker, which triggers three tasks: (a) collecting user data, (b) generating short-term interest models, and (c) creating long-term interest models. The first task sends an HTTP request to the Semantic Scholar API to collect the user’s publications, including titles and abstracts, from the last five years. Upon receiving the API response with the requested publications, they are forwarded to the second Celery task, where keyphrases are extracted along with their corresponding weights. Next comes the weight normalization step, in which the weights of the extracted keyphrases, which range from 0 to 1, are mapped to a range from 1 to 5 (see the sketch below). The extracted keyphrases with their normalized weights are then semantically enriched using either the Wikipedia or DBpedia APIs. The short-term interests are then stored in the database and scheduled for regeneration by the second Celery task on an annual basis. The third task takes the short-term interests as input and utilizes the forgetting function to generate the long-term interest model, which is subsequently stored in the database. When users log into their accounts, a request is sent to the Django server, which requests the long-term interests from the Django model. The Django model communicates with the database to retrieve the long-term interests, and the response is sent to the front-end through the Django view to be visualized.
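One plausible reading of the normalization step is a linear min-max rescaling; the sketch below is our assumption, not RIMA’s exact formula:

```python
# Map raw keyphrase weights in [0, 1] onto the interval [1, 5].
def normalize_weights(weights: dict) -> dict:
    lo, hi = min(weights.values()), max(weights.values())
    span = (hi - lo) or 1.0  # guard against all-equal weights
    return {kp: 1.0 + 4.0 * (w - lo) / span for kp, w in weights.items()}

interests = normalize_weights(
    {"learning analytics": 0.91, "user modeling": 0.64, "word embeddings": 0.37})
# -> weights now lie between 1 (least important) and 5 (most important)
```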

3.2. Recommendation Generation

3.2.1. Conceptual Pipeline

The pipeline for generating the publication recommendations is depicted in Figure 3. It begins with the collection of the publications most similar to the user’s interest model using the Semantic Scholar API. Subsequently, keyphrases are extracted from the collected publications. After that, we represent the user’s interest model and the keyphrases extracted from the collected publications as embedding vectors. To calculate the weighted average embedding vector of the interest model, we multiply each interest’s embedding vector by its weight, sum these vectors, and then divide the sum by the total of all interest weights. Similarly, we compute a weighted average embedding vector for each publication based on its extracted keyphrases. Finally, we calculate the cosine similarity between the weighted average vector of the interest model and the weighted average vector of each collected publication. The top ten most similar publications are then recommended to the user.
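In vector form, the weighted average is simply (Σᵢ wᵢvᵢ)/(Σᵢ wᵢ); the following NumPy sketch (our illustration; the vectors are assumed to come from whichever sentence encoder is in use) spells out both operations:

```python
import numpy as np

def weighted_average_embedding(vectors, weights):
    # (sum_i w_i * v_i) / (sum_i w_i), one row per interest or keyphrase
    v = np.asarray(vectors, dtype=float)
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * v).sum(axis=0) / w.sum()

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```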

3.2.2. Technical Pipeline

Figure 4 illustrates the steps to generate the top ten scientific publications to be recommended to the user. Initially, a request containing the user’s top five interests and their corresponding weights is sent to the Django server, which in turn communicates with the Semantic Scholar API to retrieve the most relevant publications based on the user’s interests. Keyphrases are extracted from each obtained publication and their weights are calculated by considering the frequency of these keyphrases in the publication’s title and abstract. Subsequently, the resulting weights are normalized to a scale ranging from 1 to 5. After that, a pretrained transformer language model encoder is used to generate the weighted embedding vectors for the user interest model and each publication. Following that, the cosine similarity function is used to determine the similarity between the user’s interest model and each of the obtained publications. Finally, the top ten most relevant publications with the highest similarity score are recommended to the user.
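Putting the pieces together, below is a condensed sketch of this pipeline using the sentence-transformers library and the msmarco-distilbert-base-tas-b model named earlier; the interest and publication data are made-up placeholders, and the ranking logic is our paraphrase of the pipeline rather than RIMA’s source code:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")

def weighted_vector(weighted_terms: dict) -> np.ndarray:
    vecs = model.encode(list(weighted_terms))           # one vector per term
    w = np.array(list(weighted_terms.values()), dtype=float)
    return (w[:, None] * vecs).sum(axis=0) / w.sum()

interests = {"learning analytics": 5, "recommender systems": 3}   # placeholder
publications = {                                                   # placeholder
    "paper A": {"open learning analytics": 4, "dashboards": 2},
    "paper B": {"image segmentation": 5, "medical imaging": 3},
}

user_vec = weighted_vector(interests)
scores = {pid: float(util.cos_sim(user_vec, weighted_vector(kps)))
          for pid, kps in publications.items()}
top_ten = sorted(scores, key=scores.get, reverse=True)[:10]
```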

4. Evaluation

The overall goal of this work was to improve the interest modeling and recommendation mechanisms in a content-based RS by leveraging word embedding techniques. In this section, we first present the results of the evaluation conducted to gauge the quality of the generated interest models through a user study. The best-performing approach was then selected to generate the user interest models used as input for publication recommendation generation. Finally, we present the offline and user study evaluation results related to the quality of the generated recommendations. For the user study evaluations, we used the statistical measures Precision at K (Precision@K), Mean Reciprocal Rank (MRR), and Mean Average Precision (MAP).

4.1. Interest Model Generation

4.1.1. Participants

The target group for our study consisted of researchers and students. Participants were recruited via e-mail, word of mouth, and groups in social media networks, and had to fulfill two participation requirements, namely, having at least one scientific publication and possessing a Semantic Scholar ID, which is necessary for the interest model generation step. A total of 43 people were contacted, of whom 22 participants (9 male and 13 female) completed the study, ranging from PhD students to professors of various countries, ages, and backgrounds.

4.1.2. Procedure

We conducted an online user study using a questionnaire to assess the quality of the generated interest models. User information was anonymized and all participants provided their informed consent for study participation. Our goal was to investigate the best approach among three different combinations for generating an accurate interest model: (a) SingleRank as a keyphrase extraction method with Wikipedia as a knowledge base for semantic enrichment; (b) SIFRank_SqueezeBERT as a keyphrase extraction method with Wikipedia for semantic enrichment; and (c) SIFRank_SqueezeBERT as a keyphrase extraction method with DBpedia as a knowledge base. The average time taken to complete the questionnaire was seven minutes. The questionnaire consisted of two questions for each of the three generated interest models: (1) “Please rate the relevance of the following interests which were extracted from your publications” and (2) “Are any of your top five interests not represented in this interest model? If yes, how many?”. Additionally, there was one general question: “Which interest model, in your opinion, most accurately represents your interests?”.
For each generated interest model, users were provided with a list of the top k interests sorted by weight and were asked to assign a relevance value to each interest (1: not at all relevant, 2: low relevance, 3: relevant, and 4: high relevance). In the subsequent calculations, we considered ratings 1 and 2 to indicate non-relevant interests and ratings 3 and 4 to indicate relevant interests. With the first question, we were able to calculate how many relevant interests were found among the top k interests (Precision@K), how early in the ranked list of generated interests a relevant interest could be found (MRR), and the accuracy with which the interests were ranked and how early relevant interests appear (MAP). The K in Precision@K is the total number of extracted interests. In our case, it differs from one user to another, because our approaches generate a different number of interests for each user, with a maximum of fifteen interests depending on the number of publications per user and the number of keyphrases per publication. With the second question, we were able to gain a subjective perspective on the completeness of each interest model by estimating how many interests were missing.
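For concreteness, the three measures as used here can be computed as follows (a sketch assuming Python; `ratings` is one participant’s ranked list of 1–4 relevance values, with 3–4 counted as relevant):

```python
def precision_at_k(ratings):
    # fraction of relevant items among the k rated items
    return sum(r >= 3 for r in ratings) / len(ratings)

def reciprocal_rank(ratings):
    # inverse position of the first relevant item (0 if none)
    return next((1.0 / i for i, r in enumerate(ratings, 1) if r >= 3), 0.0)

def average_precision(ratings):
    # mean of the precision values at each relevant position
    hits, total = 0, 0.0
    for i, r in enumerate(ratings, 1):
        if r >= 3:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

# MRR and MAP are the means of reciprocal_rank and average_precision
# over all participants.
```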

4.1.3. Analysis and Results

Table 2 shows the results for Precision@k, MRR, and MAP. It can be seen that Model 3, generated by (SIFRank_SqueezeBERT + DBpedia), has the highest Precision@k value at 0.73, which indicates that it is the most accurate interest model. However, Model 2 (SIFRank_SqueezeBERT + Wikipedia) has the highest MRR value at 0.86, meaning that this interest model provides a better ranking for the highest-ranked relevant interests. Both of these interest models share a similar MAP value of 0.78. In contrast, Model 1 (SingleRank + Wikipedia), which served as the baseline, yields the lowest results across all three metrics. Overall, these results demonstrate that the utilization of word embedding techniques can enhance the quality of interest model generation.
Because the results based on Precision@k, MRR, and MAP were close to each other for Models 2 and 3, we relied on the subjective opinions of the users to decide which model to use to generate the recommendations. As can be seen in Figure 5, 59% of the users selected Model 3 (SIFRank_SqueezeBERT + DBpedia) as the best interest model, suggesting that they considered it the most complete model covering most of their interests. Furthermore, Model 3 had the lowest percentage of missing interests at 40%, followed by Model 2 (SIFRank_SqueezeBERT + Wikipedia) at 44%, while Model 1 (SingleRank + Wikipedia) had the highest percentage of missing interests at 53%. In summary, the offline and user evaluations showed that combining SIFRank_SqueezeBERT as a keyphrase extraction method with DBpedia as a knowledge base can improve the quality of the interest model generation task.

4.2. Recommendation Generation

4.2.1. Offline Evaluation

We conducted an offline experiment with the goal of identifying the best approaches for delivering accurate and relevant scientific publication recommendations. Initially, we tested various keyphrase extraction methods, then subsequently evaluated different embedding models.
Keyphrase extraction from publications. To determine the keyphrase extraction approach for publications, we conducted an experiment comparing the accuracy and performance of SingleRank and SIFRank_SqueezeBERT when extracting keyphrases from publication titles and abstracts. Using various user interest models, we sent requests to the Semantic Scholar API to obtain lists of publications relevant to each interest model. Assuming that the publications should have high semantic similarity to the interest model used to find them, we computed semantic similarities between the interest model and the publication keyphrases extracted using both SingleRank and SIFRank_SqueezeBERT. Figure 6 and Figure 7 show the distribution of semantic similarity scores calculated between an example interest model and the publications’ keyphrases; the x-axis represents the similarity scores and the y-axis the number of publications with these scores. For brevity, we present only one example here. Overall, we observed no significant difference in accuracy between the two keyphrase extraction methods, as indicated by the distributions of semantic similarity scores. However, SingleRank consistently outperformed SIFRank_SqueezeBERT in terms of extraction speed. Consequently, we decided to use SingleRank to extract keyphrases from publications.
Embedding representation. Different models were selected for testing in the embedding step of the recommendation generation pipeline. We compared different pretrained transformer sentence embedding techniques. These included the Universal Sentence Encoder (USE), which has been shown to outperform the BERT [116], ELMo [53], and InferSent [121] models [24]. In addition, we included the SciBERT [122] model, as it was trained on publications. Furthermore, the Hugging Face documentation includes a list of models for the sentence embedding task. Among these models, we selected all-mpnet-base-v2 (https://huggingface.co/sentence-transformers/all-mpnet-base-v2, accessed on 18 September 2022), which has the highest average performance in the Hugging Face documentation; all-distilroberta-large-v1 (https://huggingface.co/roberta-base, accessed on 18 September 2022), which has the highest performance on the sentence embedding task; all-MiniLM-L12-v2 (https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2, accessed on 18 September 2022), which is smaller and faster than the other selected models; and msmarco-distilbert-base-tas-b, which achieved the highest performance on the asymmetric semantic search task. Asymmetric semantic search means that we have a short query (in our case, the user’s interest model) and want to find a longer paragraph answering the query (in our case, the publications).

4.2.2. Analysis and Results

To determine the optimal model, the embedding performance for uni-grams, bi-grams, and sentences was tested using three benchmarks. First, we used the SimLex-999 dataset [123], a benchmark dataset for evaluating the performance of semantic models. It consists of 999 pairs of words with human-annotated similarity ratings; the ratings are based on the genuine similarity of the words, i.e., the degree to which the words are semantically related. Second, we used the BiRD dataset [124], another benchmark dataset for evaluating the performance of semantic composition models. This dataset consists of fine-grained relatedness ratings for 3345 bi-gram pairs, based on a comparative annotation technique that asks human annotators to compare the relatedness of two bi-grams. The last dataset was the STS Benchmark [125], a benchmark for evaluating the performance of semantic textual similarity (STS) systems; the task is divided into two subtasks, multilingual STS and cross-lingual STS. We used Pearson correlation to compare the quality of machine similarity scores to the quality of human judgments for the six selected models. The Pearson correlation test evaluates the strength and direction of the association between two continuous variables. We used the r-value, often known as the Pearson correlation coefficient, to determine the direction and strength of the correlation: the closer it is to 1 (−1), the stronger the positive (negative) correlation, while a value of 0 indicates no correlation. Table 3 shows that the all-mpnet-base-v2 model achieved the best performance at the bi-gram and sentence levels, while msmarco-distilbert-base-tas-b achieved the highest performance at the word level and the USE model was the fastest. It is apparent that model performance improves as the length of the grams increases. This makes sense, as these embedding models are context-dependent.
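The comparison itself boils down to correlating model cosine similarities with human ratings. Below is a sketch under the assumption that each benchmark is available as (text1, text2, human_score) triples; the loading code is omitted and the function name is ours:

```python
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def benchmark_correlation(pairs):
    # Pearson's r between human ratings and model cosine similarities.
    human = [score for _, _, score in pairs]
    machine = [float(util.cos_sim(model.encode(a), model.encode(b)))
               for a, b, _ in pairs]
    r, _p = pearsonr(human, machine)
    return r

# e.g., benchmark_correlation([("cat", "dog", 0.58), ("car", "auto", 0.93)])
```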
Further, we investigated the performance of these models in our context to obtain the embeddings of the interest models and the publications. For the publication embedding, we compared the embedding performance at the document level, which means representing the titles and abstracts of publications as a whole, and at the keyphrase level, where we extracted the important keyphrases first.
Table 4 and Table 5 show the results of calculating the similarity between a user’s interest model and fifty relevant publications from the Semantic Scholar API at the keyphrase level and the document level, respectively. The similarity scores presented in the tables correspond to the range between the maximum and minimum scores achieved within the list of candidate publications. The results show that the keyphrase level achieves higher similarity scores than the document level. However, the keyphrase level is slower, as keyphrases need to be extracted first. It can be seen that SciBERT and msmarco-distilbert-base-tas-b were the best-performing models in terms of similarity score. We believe that the good performance of SciBERT in this context is due to the fact that it was trained on publications. However, msmarco-distilbert-base-tas-b was faster. Based on these results, we decided to compute embeddings of the publications at the keyphrase level and use msmarco-distilbert-base-tas-b in the embedding step of the recommendation generation pipeline to obtain embeddings of both interest models and publications before computing their similarities.

4.2.3. Online Evaluation

We conducted an online user study to evaluate the accuracy and ranking of the recommended publications. These recommendations were generated based on the most accurate interest model (i.e., SIFRank_SqueezeBERT + DBpedia), which we selected based on the results of the user study related to interest model generation (see Section 4.1.3). We invited the same 22 participants from the previous user study, of whom 16 responded. All participants provided their informed consent for study participation. Our goal was to compare the accuracy and ranking performance of our generated recommendation list with the list provided by the Semantic Scholar API, with the assumption that while the list from the Semantic Scholar API is relevant, its ranking could be improved. The questionnaire used in this study comprised a question per recommendation list: “Please rate the relevance of the following publications suggested based on your interest model”, and a general question: “According to you, which recommendation list best reflects your preferences?”. The average time to complete the questionnaire was 10 minutes. In the first question, users were asked to assign a relevance score to each of the top ten recommendations in each list. Users could rate the recommendations using one of four options (1: not at all relevant, 2: low relevance, 3: relevant, and 4: high relevance). In the subsequent calculations, we considered ratings 1 and 2 to be non-relevant recommendations and ratings 3 and 4 to be relevant. We calculated the statistical measures Precision@k (how many relevant publications are among the top k recommended publications), MRR (the position of the highest-ranked relevant item), and MAP (the accuracy with which the top k publications are ranked and how early relevant results appear), where k is the total number of recommendations, ten in our case, as shown in Table 6.
The results show that recommendation list 1 (our approach) outscored the recommendation list provided by Semantic Scholar (recommendation list 2) in all three metrics, indicating that our approach was able to generate a more accurate and better-ranked recommendation list. In addition, 63% of the participants found that our recommendation list better reflected their interests.

5. Limitations

As a first analysis of the benefits of the application of word/sentence embedding techniques for user modeling and recommendation generation tasks in a content-based RS, this study is not without limitations. First, we performed this analysis in a single domain. It must be verified whether our findings transfer to domains beyond the recommendation of scientific publications. In addition, it must be assessed whether the results generalize to recommendations made by another publication RS as a baseline. Moreover, the proposed pipelines were evaluated with PhD students and professors from various backgrounds. While we achieved a diverse user group, a user study with a larger sample would probably have yielded more significant and reliable results.

6. Conclusions and Future Work

In this paper, we aimed to address the neglect of user modeling and of capturing the semantics of user models and papers in content-based scientific publication recommender systems (RS). To address these research gaps, we have presented the transparent Recommendation and Interest Modeling Application (RIMA), which leverages word embeddings and sentence encoders to improve the quality of the user modeling and recommendation generation tasks. Moreover, we conducted extensive experiments on different datasets to evaluate our approach, as well as an online user study. The results of our study demonstrate that pretrained transformer word embeddings and sentence encoders can provide a simple yet powerful method to improve the accuracy and performance of the user modeling and recommendation generation processes in content-based scientific publication RS. While we are aware that our results are based on one particular RS and cannot be generalized, we are confident that they represent valuable anchor points for the implementation of effective future content-based RS based on embedding techniques.
As future work related to this research, we plan to validate our findings through a quantitative and qualitative user study with a larger sample. Additionally, we aim to enhance the publication extraction process in terms of both time and accuracy, as our current approach can be time-consuming and occasionally extracts sections of the publication beyond the abstract. Further, we intend to explore and compare other approaches, e.g., graph-based and hybrid ones, for scientific publication recommendation.

Author Contributions

Conceptualization, M.G. and M.A.C.; Methodology, M.G. and M.A.C.; Validation, M.A.C.; Software, L.K. and S.J.; Writing—original draft preparation, M.G. and L.K.; Writing—review and editing, M.G. and M.A.C.; Visualization, M.G., L.K., S.J. and Q.U.A.; Supervision, M.A.C. and M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of the Department of Computer Science and Applied Cognitive Science of the Faculty of Engineering at the University of Duisburg-Essen.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

Acknowledgments

We acknowledge support from the Open Access Publication Fund of the University of Duisburg-Essen.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Isinkaye, F.O.; Folajimi, Y.O.; Ojokoh, B.A. Recommendation systems: Principles, methods and evaluation. Egypt. Inform. J. 2015, 16, 261–273. [Google Scholar] [CrossRef]
  2. Bulut, B.; Gündoğan, E.; Kaya, B.; Alhajj, R.; Kaya, M. User’s Research Interests Based Paper Recommendation System: A Deep Learning Approach. In Putting Social Media and Networking Data in Practice for Education, Planning, Prediction and Recommendation; Kaya, M., Birinci, Ş., Kawash, J., Alhajj, R., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 117–130. [Google Scholar] [CrossRef]
  3. Beel, J.; Gipp, B.; Langer, S.; Breitinger, C. Paper recommender systems: A literature survey. Int. J. Digit. Libr. 2016, 17, 305–338. [Google Scholar] [CrossRef]
  4. Bai, X.; Wang, M.; Lee, I.; Yang, Z.; Kong, X.; Xia, F. Scientific paper recommendation: A survey. IEEE Access 2019, 7, 9324–9339. [Google Scholar] [CrossRef]
  5. Yue, W.; Wang, Z.; Zhang, J.; Liu, X. An overview of recommendation techniques and their applications in healthcare. IEEE/CAA J. Autom. Sin. 2021, 8, 701–717. [Google Scholar] [CrossRef]
  6. Kreutz, C.K.; Schenkel, R. Scientific paper recommendation systems: A literature review of recent publications. Int. J. Digit. Libr. 2022, 23, 335–369. [Google Scholar] [CrossRef]
  7. Chen, T.T.; Lee, M. Research paper recommender systems on big scholarly data. In Proceedings of the Knowledge Management and Acquisition for Intelligent Systems: 15th Pacific Rim Knowledge Acquisition Workshop, PKAW 2018, Nanjing, China, 28–29 August 2018; Springer International Publishing: New York, NY, USA, 2018; Volume 15, pp. 251–260. [Google Scholar]
  8. Sakib, N.; Ahmad, R.B.; Haruna, K. A Collaborative Approach Toward Scientific Paper Recommendation Using Citation Context. IEEE Access 2020, 8, 51246–51255. [Google Scholar] [CrossRef]
  9. Hassan, H.A.M. Personalized research paper recommendation using deep learning. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, Taichung City, Taiwan, 13–16 November 2017; pp. 327–330. [Google Scholar]
  10. Nair, A.M.; Benny, O.; George, J. Content based scientific article recommendation system using deep learning technique. In Proceedings of the Inventive Systems and Control: ICISC, Divnomorskoe, Russia, 5–10 September 2021; Springer: Singapore, 2021; pp. 965–977. [Google Scholar]
  11. Singh, R.; Gaonkar, G.; Bandre, V.; Sarang, N.; Deshpande, S. Scientific Paper Recommendation System. In Proceedings of the 2023 IEEE 8th International Conference for Convergence in Technology (I2CT), Shanghai, China, 28–30 October 2023; pp. 1–4. [Google Scholar]
  12. Tanner, W.; Akbas, E.; Hasan, M. Paper recommendation based on citation relation. In Proceedings of the 2019 IEEE International Conference on Big Data, Belgrade, Serbia, 25–27 November 2019; pp. 3053–3059. [Google Scholar]
  13. Liu, H.; Kou, H.; Yan, C.; Qi, L. Keywords-driven and popularity-aware paper recommendation based on undirected paper citation graph. Complexity 2020, 2020, 817. [Google Scholar] [CrossRef]
  14. Sinha, A.; Shen, Z.; Song, Y.; Ma, H.; Eide, D.; Hsu, B.J.; Wang, K. An overview of microsoft academic service (mas) and applications. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 243–246. [Google Scholar]
  15. Beel, J.; Aizawa, A.; Breitinger, C.; Gipp, B. Mr. DLib: Recommendations-as-a-service (RaaS) for academia. In Proceedings of the 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), Toronto, ON, Canada, 19–23 June 2017; pp. 1–2. [Google Scholar]
  16. Mataoui, M.; Sebbak, F.; Sidhoum, A.H.; Harbi, T.E.; Senouci, M.R.; Belmessous, K. A hybrid recommendation system for researchgate academic social network. Soc. Netw. Anal. Min. 2023, 13, 53. [Google Scholar] [CrossRef]
  17. Gündoğan, E.; Kaya, M. A novel hybrid paper recommendation system using deep learning. Scientometrics 2022, 127, 3837–3855. [Google Scholar] [CrossRef]
  18. Chaudhuri, A.; Sarma, M.; Samanta, D. SHARE: Designing multiple criteria-based personalized research paper recommendation system. Inf. Sci. 2022, 617, 41–64. [Google Scholar] [CrossRef]
  19. Mohamed, H.A.I.M. Deep Learning Models for Research Paper Recommender Systems. Ph.D. Thesis, Roma Tre University, Roma, Italy, 2020. [Google Scholar]
  20. Zhi, L.; Zou, X. A Review on Personalized Academic Paper Recommendation. Comput. Inf. Sci. 2019, 12, 33–43. [Google Scholar]
  21. Ricci, F.; Rokach, L.; Shapira, B. Introduction to recommender systems handbook. In Recommender Systems Handbook; Springer: Boston, MA, USA, 2010; pp. 1–35. [Google Scholar]
  22. Zarrinkalam, F.; Faralli, S.; Piao, G.; Bagheri, E. Extracting, Mining and Predicting Users’ Interests from Social Media. Found. Trends Inf. Retr. 2020, 14, 445–617. [Google Scholar] [CrossRef]
  23. Chaudhuri, A.; Samanta, D.; Sarma, M. Modeling user behaviour in research paper recommendation system. arXiv 2021, arXiv:2107.07831. [Google Scholar]
  24. Hassan, H.A.M.; Sansonetti, G.; Gasparetti, F.; Micarelli, A.; Beel, J. Bert, elmo, use and infersent sentence encoders: The panacea for research-paper recommendation? In Proceedings of the RecSys (Late-Breaking Results), Copenhagen, Denmark, 16–20 September 2019; pp. 6–10. [Google Scholar]
  25. Guo, G.; Chen, B.; Zhang, X.; Liu, Z.; Dong, Z.; He, X. Leveraging title-abstract attentive semantics for paper recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 67–74. [Google Scholar]
  26. Sun, Y.; Qiu, H.; Zheng, Y.; Wang, Z.; Zhang, C. SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-Trained Language Model. IEEE Access 2020, 8, 10896–10906. [Google Scholar] [CrossRef]
  27. Iandola, F.N.; Shaw, A.E.; Krishna, R.; Keutzer, K.W. SqueezeBERT: What can computer vision teach NLP about efficient neural networks? arXiv 2020, arXiv:2006.11316. [Google Scholar]
  28. Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P.; Hellmann, S.; Morsey, M.; Van Kleef, P.; Auer, S.; et al. DBpedia—A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semant. Web J. 2014, 6, 140134. [Google Scholar] [CrossRef]
  29. Chatti, M.A.; Ji, F.; Guesmi, M.; Muslim, A.; Singh, R.K.; Joarder, S.A. Simt: A semantic interest modeling toolkit. In Proceedings of the 29th ACM Conference on User Modeling, Adaptation and Personalization, Adjunct Proceedings, Utrecht, The Netherlands, 21–25 June 2021; pp. 75–78. [Google Scholar]
  30. Piao, G.; Breslin, J.G. Inferring user interests in microblogging social networks: A survey. User Model. User-Adapt. Interact. 2018, 28, 277–329. [Google Scholar] [CrossRef]
  31. Dhelim, S.; Aung, N.; Ning, H. Mining user interest based on personality-aware hybrid filtering in social networks. Knowl.-Based Syst. 2020, 206, 106227. [Google Scholar] [CrossRef]
  32. Liu, K.; Chen, W.; Bu, J.; Chen, C.; Zhang, L. User modeling for recommendation in blogspace. In Proceedings of the 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology-Workshops, Fukuoka, Japan, 10–15 October 2007; pp. 79–82. [Google Scholar]
  33. Pratama, B.Y.; Sarno, R. Personality classification based on Twitter text using Naive Bayes, KNN and SVM. In Proceedings of the 2015 International Conference on Data and Software Engineering (ICoDSE), Yogyakarta, Indonesia, 25–26 November 2015; pp. 170–174. [Google Scholar] [CrossRef]
  34. Stern, M.; Beck, J.; Woolf, B.P. Naive Bayes Classifiers for User Modeling; Center for Knowledge Communication, Computer Science Department, University of Massachusetts: Boston, MA, USA, 1999. [Google Scholar]
  35. Michelson, M.; Macskassy, S.A. Discovering users’ topics of interest on twitter: A first look. In Proceedings of the Fourth Workshop on Analytics for Noisy Unstructured Text Data, Brisbane, Australia, 14–18 July 2010; pp. 73–80. [Google Scholar]
  36. Pu, X.; Chatti, M.A.; Schroeder, U.; Pratama, B.Y.; Sarno, R. Wiki-LDA: A mixed-method approach for effective interest mining on twitter data. In Proceedings of the 8th International Conference on Computer Supported Education (CSEDU), Rome, Italy, 21–23 April 2016; SciTePress: Setúbal, Portugal, 2016; Volume 1, pp. 426–433. [Google Scholar]
  37. Caragea, C.; Bulgarov, F.; Godea, A.; Gollapalli, S.D. Citation-enhanced keyphrase extraction from research papers: A supervised approach. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1435–1446. [Google Scholar]
  38. Bennani-Smires, K.; Musat, C.; Hossmann, A.; Baeriswyl, M.; Jaggi, M. Simple unsupervised keyphrase extraction using sentence embeddings. arXiv 2018, arXiv:1801.04470. [Google Scholar]
  39. Liu, Z.; Huang, W.; Zheng, Y.; Sun, M. Automatic Keyphrase Extraction via Topic Decomposition. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA, 9–11 October 2010; pp. 366–376. [Google Scholar]
  40. Jones, K.S. A statistical interpretation of term specificity and its application in retrieval. J. Doc. 1972, 28, 11–21. [Google Scholar] [CrossRef]
  41. Rose, S.; Engel, D.; Cramer, N.; Cowley, W. Automatic keyword extraction from individual documents. In Text Mining: Applications and Theory; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2010; pp. 1–20. [Google Scholar]
  42. Campos, R.; Mangaravite, V.; Pasquali, A.; Jorge, A.M.; Nunes, C.; Jatowt, A. Yake! collection-independent automatic keyword extractor. In Proceedings of the Advances in Information Retrieval: 40th European Conference on IR Research, ECIR 2018, Grenoble, France, 26–29 March 2018; Springer: Cham, Switzerland, 2018; Volume 40, pp. 806–810. [Google Scholar]
  43. Mihalcea, R.; Tarau, P. TextRank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 404–411. [Google Scholar]
  44. Wan, X.; Xiao, J. CollabRank: Towards a Collaborative Approach to Single-Document Keyphrase Extraction. In Proceedings of the COLING, Manchester, UK, 18–22 August 2008. [Google Scholar]
  45. Wan, X.; Xiao, J. Single Document Keyphrase Extraction Using Neighborhood Knowledge. In Proceedings of the 23rd National Conference on Artificial Intelligence, AAAI’08, Chicago, IL, USA, 13–17 July 2008; AAAI Press: Menlo Park, CA, USA, 2008; Volume 2, pp. 855–860. [Google Scholar]
  46. Florescu, C.; Caragea, C. PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1, pp. 1105–1115. [Google Scholar] [CrossRef]
  47. Bougouin, A.; Boudin, F.; Daille, B. Topicrank: Graph-based topic ranking for keyphrase extraction. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Nagoya, Japan, 14–18 October 2013; pp. 543–551. [Google Scholar]
  48. Boudin, F. Unsupervised keyphrase extraction with multipartite graphs. arXiv 2018, arXiv:1803.08721. [Google Scholar]
  49. Manrique, R.; Herazo, O.; Mariño, O. Exploring the use of linked open data for user research interest modeling. In Proceedings of the Advances in Computing: 12th Colombian Conference, CCC 2017, Cali, Colombia, 19–22 September 2017; Springer: Cham, Switzerland, 2017; Volume 12, pp. 3–16. [Google Scholar]
  50. Liang, Y.; Zaki, M.J. Keyphrase Extraction Using Neighborhood Knowledge Based on Word Embeddings. arXiv 2021, arXiv:2111.07198v1. [Google Scholar] [CrossRef]
  51. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China, 21–26 June 2014; pp. 1188–1196. [Google Scholar]
  52. Pagliardini, M.; Gupta, P.; Jaggi, M. Unsupervised Learning of Sentence Embeddings Using Compositional n-Gram Features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Volume 1. [Google Scholar] [CrossRef]
  53. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Volume 1, pp. 2227–2237. [Google Scholar] [CrossRef]
  54. Arora, S.; Liang, Y.; Ma, T. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  55. Di Tommaso, G.; Faralli, S.; Stilo, G.; Velardi, P. Wiki-MID: A very large multi-domain interests dataset of Twitter users with mappings to Wikipedia. In Proceedings of The Semantic Web–ISWC 2018: 17th International Semantic Web Conference, Monterey, CA, USA, 8–12 October 2018; Springer: Cham, Switzerland, 2018; Volume 17, pp. 36–52. [Google Scholar]
  56. Narducci, F.; Musto, C.; Semeraro, G.; Lops, P.; De Gemmis, M. Leveraging encyclopedic knowledge for transparent and serendipitous user profiles. In Proceedings of the International Conference on User Modeling, Adaptation, and Personalization, Rome, Italy, 10–14 June 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 350–352. [Google Scholar]
  57. Jadhav, A.S.; Purohit, H.; Kapanipathi, P.; Anantharam, P.; Ranabahu, A.H.; Nguyen, V.; Mendes, P.N.; Smith, A.G.; Cooney, M.; Sheth, A.P. Twitris 2.0: Semantically Empowered System for Understanding Perceptions from Social Data. 2010. Available online: https://corescholar.libraries.wright.edu/cgi/viewcontent.cgi?article=1253&context=knoesis (accessed on 3 April 2023).
  58. Jean-Louis, L.; Gagnon, M.; Charton, E. A knowledge-base oriented approach for automatic keyword extraction. Comput. Sist. 2013, 17, 187–196. [Google Scholar]
  59. Lu, C.; Lam, W.; Zhang, Y. Twitter user modeling and tweets recommendation based on wikipedia concept graph. In Proceedings of the Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, Toronto, ON, Canada, 22–26 July 2012. [Google Scholar]
  60. Xu, T.; Oard, D.W. Wikipedia-based topic clustering for microblogs. Proc. Am. Soc. Inf. Sci. Technol. 2011, 48, 1–10. [Google Scholar] [CrossRef]
  61. Besel, C.; Schlötterer, J.; Granitzer, M. On the quality of semantic interest profiles for online social network consumers. ACM SIGAPP Appl. Comput. Rev. 2016, 16, 5–14. [Google Scholar] [CrossRef]
  62. Piao, G.; Breslin, J.G. User modeling on Twitter with WordNet Synsets and DBpedia concepts for personalized recommendations. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, 24–28 October 2016; pp. 2057–2060. [Google Scholar]
  63. Piao, G.; Breslin, J.G. Analyzing aggregated semantics-enabled user modeling on Google+ and Twitter for personalized link recommendations. In Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization, Halifax, NS, Canada, 13–17 July 2016; pp. 105–109. [Google Scholar]
  64. Piao, G.; Breslin, J.G. Exploring dynamics and semantics of user interests for user modeling on Twitter for link recommendations. In Proceedings of the 12th International Conference on Semantic Systems, Leipzig, Germany, 12–15 September 2016; pp. 81–88. [Google Scholar]
  65. Degemmis, M.; Lops, P.; Semeraro, G. A content-collaborative recommender that exploits WordNet-based user profiles for neighborhood formation. User Model. User-Adapt. Interact. 2007, 17, 217–255. [Google Scholar] [CrossRef]
  66. Lops, P.; de Gemmis, M.; Semeraro, G.; Musto, C.; Narducci, F.; Bux, M. A semantic content-based recommender system integrating folksonomies for personalized access. In Web Personalization in Intelligent Environments; Springer: Berlin/Heidelberg, Germany, 2009; pp. 27–47. [Google Scholar]
  67. Abel, F.; Herder, E.; Houben, G.J.; Henze, N.; Krause, D. Cross-system user modeling and personalization on the social web. User Model. User-Adapt. Interact. 2013, 23, 169–209. [Google Scholar] [CrossRef]
  68. Yu, X.; Ma, H.; Hsu, B.J.; Han, J. On building entity recommender systems using user click log and freebase knowledge. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, New York, NY, USA, 24–28 February 2014; pp. 263–272. [Google Scholar]
  69. Manrique, R.; Mariño, O. How does the size of a document affect linked open data user modeling strategies? In Proceedings of the International Conference on Web Intelligence, Leipzig, Germany, 23–26 August 2017; pp. 1246–1252. [Google Scholar]
  70. Abel, F.; Hauff, C.; Houben, G.J.; Tao, K. Leveraging user modeling on the social web with linked data. In Proceedings of the Web Engineering: 12th International Conference, ICWE 2012, Berlin, Germany, 23–27 July 2012; Springer: Berlin/Heidelberg, Germany, 2012; Volume 12, pp. 378–385. [Google Scholar]
  71. Shen, W.; Wang, J.; Luo, P.; Wang, M. Linking named entities in tweets with knowledge base via user interest modeling. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 68–76. [Google Scholar]
  72. Renuka, S.; Raj Kiran, G.; Rohit, P. An unsupervised content-based article recommendation system using natural language processing. In Proceedings of the Data Intelligence and Cognitive Informatics ICDICI, Kyoto, Japan, 22–24 August 2020; Springer: Singapore, 2021; pp. 165–180. [Google Scholar]
  73. Subathra, P.; Kumar, P. Recommending research article based on user queries using latent dirichlet allocation. In Proceedings of the 2nd ICSCSP Soft Computing and Signal Processing, Alcala de Henares, Spain, 3–5 October 2019; Springer: Singapore, 2020; pp. 163–175. [Google Scholar]
  74. Tao, M.; Yang, X.; Gu, G.; Li, B. Paper recommend based on LDA and PageRank. In Proceedings of the Artificial Intelligence and Security: 6th International Conference, ICAIS 2020, Hohhot, China, 17–20 July 2020; Springer: Singapore, 2020; Volume 6, pp. 571–584. [Google Scholar]
  75. Collins, A.; Beel, J. Document embeddings vs. keyphrases vs. terms for recommender systems: A large-scale online evaluation. In Proceedings of the 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), Champaign, IL, USA, 2–6 June 2019; pp. 130–133. [Google Scholar]
  76. Zhao, X.; Kang, H.; Feng, T.; Meng, C.; Nie, Z. A hybrid model based on LFM and BiGRU toward research paper recommendation. IEEE Access 2020, 8, 188628–188640. [Google Scholar] [CrossRef]
  77. Ali, Z.; Qi, G.; Muhammad, K.; Ali, B.; Abro, W.A. Paper recommendation based on heterogeneous network embedding. Knowl.-Based Syst. 2020, 210, 106438. [Google Scholar] [CrossRef]
  78. Bereczki, M. Graph neural networks for article recommendation based on implicit user feedback and content. arXiv 2021. [Google Scholar] [CrossRef]
  79. Rios, F.; Rizzo, P.; Puddu, F.; Romeo, F.; Lentini, A.; Asaro, G.; Rescalli, F.; Bolchini, C.; Cremonesi, P. Recommending Relevant Papers to Conference Participants: A Deep Learning Driven Content-based Approach. In Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization, Adjunct Proceedings, Barcelona, Spain, 4–7 July 2022; pp. 52–57. [Google Scholar]
  80. Ferrara, F.; Pudota, N.; Tasso, C. A keyphrase-based paper recommender system. In Proceedings of the Digital Libraries and Archives: 7th Italian Research Conference, IRCDL 2011, Pisa, Italy, 20–21 January 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 14–25. [Google Scholar]
  81. Chaudhuri, A.; Sinhababu, N.; Sarma, M.; Samanta, D. Hidden features identification for designing an efficient research article recommendation system. Int. J. Digit. Libr. 2021, 22, 233–249. [Google Scholar] [CrossRef]
  82. Hong, K.; Jeon, H.; Jeon, C. UserProfile-based personalized research paper recommendation system. In Proceedings of the 2012 8th International Conference on Computing and Networking Technology (INC, ICCIS and ICMIC), Shanghai, China, 21–23 September 2012; pp. 134–138. [Google Scholar]
  83. Hong, K.; Jeon, H.; Jeon, C. Personalized research paper recommendation system using keyword extraction based on UserProfile. J. Converg. Inf. Technol. 2013, 8, 106. [Google Scholar]
  84. Gautam, J.; Kumar, E. An improved framework for tag-based academic information sharing and recommendation system. In Proceedings of the World Congress on Engineering, London, UK, 4–6 July 2012; Volume 2, pp. 1–6. [Google Scholar]
  85. Beel, J.; Langer, S.; Genzmehr, M.; Nürnberger, A. Introducing Docear’s research paper recommender system. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, Indianapolis, IN, USA, 22–26 July 2013; pp. 459–460. [Google Scholar]
  86. Beel, J.; Langer, S.; Gipp, B.; Nürnberger, A. The Architecture and Datasets of Docear’s Research Paper Recommender System. D-Lib Mag. 2014, 20, 1045. [Google Scholar] [CrossRef]
  87. Jomsri, P.; Sanguansintukul, S.; Choochaiwattana, W. A framework for tag-based research paper recommender system: An IR approach. In Proceedings of the 2010 IEEE 24th International Conference on Advanced Information Networking and Applications Workshops, Perth, WA, Australia, 20–23 April 2010; pp. 103–108. [Google Scholar]
  88. Al Alshaikh, M.; Uchyigit, G.; Evans, R. A research paper recommender system using a Dynamic Normalized Tree of Concepts model for user modelling. In Proceedings of the 2017 11th International Conference on Research Challenges in Information Science (RCIS), Brighton, UK, 10–12 May 2017; pp. 200–210. [Google Scholar]
  89. Lee, J.; Lee, K.; Kim, J.G. Personalized academic research paper recommendation system. arXiv 2013, arXiv:1304.5457. [Google Scholar]
  90. Sugiyama, K.; Kan, M.Y. Scholarly paper recommendation via user’s recent research interests. In Proceedings of the 10th Annual Joint Conference on Digital Libraries, Gold Coast, QLD, Australia, 21–25 June 2010; pp. 29–38. [Google Scholar]
  91. Nishioka, C.; Hauke, J.; Scherp, A. Influence of tweets and diversification on serendipitous research paper recommender systems. PeerJ Comput. Sci. 2020, 6, e273. [Google Scholar] [CrossRef] [PubMed]
  92. Nishioka, C.; Hauke, J.; Scherp, A. Research paper recommender system with serendipity using tweets vs. diversification. In Proceedings of the Digital Libraries at the Crossroads of Digital Information for the Future: 21st International Conference on Asia-Pacific Digital Libraries, ICADL 2019, Kuala Lumpur, Malaysia, 4–7 November 2019; Springer: Cham, Switzerland, 2019; Volume 21, pp. 63–70. [Google Scholar]
  93. Nishioka, C.; Hauke, J.; Scherp, A. Towards serendipitous research paper recommender using tweets and diversification. In Proceedings of the Digital Libraries for Open Knowledge: 23rd International Conference on Theory and Practice of Digital Libraries, TPDL 2019, Oslo, Norway, 9–12 September 2019; Springer: Cham, Switzerland, 2019; Volume 23, pp. 339–343. [Google Scholar]
  94. Bulut, B.; Kaya, B.; Alhajj, R.; Kaya, M. A paper recommendation system based on user’s research interests. In Proceedings of the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Barcelona, Spain, 28–31 August 2018; pp. 911–915. [Google Scholar]
  95. Bulut, B.; Kaya, B.; Kaya, M. A Paper Recommendation System Based on User Interest and Citations. In Proceedings of the 2019 1st International Informatics and Software Engineering Conference (UBMYK), Ankara, Turkey, 6–7 November 2019; pp. 1–5. [Google Scholar] [CrossRef]
  96. Chen, J.; Ban, Z. Academic paper recommendation based on clustering and pattern matching. In Proceedings of the Artificial Intelligence: Second CCF International Conference, ICAI 2019, Xuzhou, China, 22–23 August 2019; Springer: Singapore, 2019; Volume 2, pp. 171–182. [Google Scholar]
  97. Amami, M.; Pasi, G.; Stella, F.; Faiz, R. An LDA-Based Approach to Scientific Paper Recommendation. In Proceedings of the International Conference on Applications of Natural Language to Data Bases, Boston, MA, USA, 11–14 December 2016. [Google Scholar]
  98. Lin, S.J.; Lee, G.; Peng, S.L. Academic article recommendation by considering the research field trajectory. In Proceedings of the International Conference on Innovative Computing and Cutting-Edge Technologies, Uttarakhand, India, 14–16 March 2020; Springer: Cham, Switzerland, 2020; pp. 447–454. [Google Scholar]
  99. Philip, S.; Shola, P.; Ovye, A. Application of content-based approach in research paper recommendation system for a digital library. Int. J. Adv. Comput. Sci. Appl. 2014, 5, 051006. [Google Scholar] [CrossRef]
  100. Nascimento, C.; Laender, A.H.; da Silva, A.S.; Gonçalves, M.A. A source independent framework for research paper recommendation. In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, Ottawa, ON, Canada, 13–17 June 2011; pp. 297–306. [Google Scholar]
  101. Hanyurwimfura, D.; Bo, L.; Havyarimana, V.; Njagi, D.; Kagorora, F. An effective academic research papers recommendation for non-profiled users. Int. J. Hybrid Inf. Technol. 2015, 8, 255–272. [Google Scholar] [CrossRef]
  102. Guesmi, M.; Chatti, M.A.; Sun, Y.; Zumor, S.; Ji, F.; Muslim, A.; Vorgerd, L.; Joarder, S.A. Open, Scrutable and Explainable Interest Models for Transparent Recommendation. In Proceedings of the IUI Workshops, College Station, TX, USA, 13–17 April 2021. [Google Scholar]
  103. Guesmi, M.; Chatti, M.A.; Vorgerd, L.; Joarder, S.; Zumor, S.; Sun, Y.; Ji, F.; Muslim, A. On-demand personalized explanation for transparent recommendation. In Proceedings of the Adjunct Proceedings of the 29th ACM Conference on User Modeling, Adaptation and Personalization, Utrecht, The Netherlands, 21–25 June 2021; pp. 246–252. [Google Scholar]
  104. Guesmi, M.; Chatti, M.A.; Vorgerd, L.; Joarder, S.A.; Ain, Q.U.; Ngo, T.; Zumor, S.; Sun, Y.; Ji, F.; Muslim, A. Input or Output: Effects of Explanation Focus on the Perception of Explainable Recommendation with Varying Level of Details. In Proceedings of the IntRS@ RecSys, Amsterdam, The Netherlands, 27 September–1 October 2021; pp. 55–72. [Google Scholar]
  105. Guesmi, M.; Chatti, M.A.; Ghorbani-Bavani, J.; Joarder, S.; Ain, Q.U.; Alatrash, R. What if Interactive Explanation in a Scientific Literature Recommender System. arXiv 2022. [Google Scholar] [CrossRef]
  106. Chatti, M.A.; Guesmi, M.; Vorgerd, L.; Ngo, T.; Joarder, S.; Ain, Q.U.; Muslim, A. Is More Always Better? The Effects of Personal Characteristics and Level of Detail on the Perception of Explanations in a Recommender System. In Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization, Barcelona, Spain, 4–7 July 2022; pp. 254–264. [Google Scholar]
  107. Guesmi, M.; Chatti, M.A.; Vorgerd, L.; Ngo, T.; Joarder, S.; Ain, Q.U.; Muslim, A. Explaining User Models with Different Levels of Detail for Transparent Recommendation: A User Study. In Proceedings of the Adjunct Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization, Barcelona, Spain, 4–7 July 2022; pp. 175–183. [Google Scholar]
  108. Guesmi, M.; Chatti, M.A.; Tayyar, A.; Ain, Q.U.; Joarder, S. Interactive visualizations of transparent user models for self-actualization: A human-centered design approach. Multimodal Technol. Interact. 2022, 6, 42. [Google Scholar] [CrossRef]
  109. Guesmi, M.; Chatti, M.A.; Joarder, S.; Ain, Q.U.; Siepmann, C.; Ghanbarzadeh, H.; Alatrash, R. Justification vs. Transparency: Why and How Visual Explanations in a Scientific Literature Recommender System. Information 2023, 14, 401. [Google Scholar] [CrossRef]
  110. Guesmi, M.; Siepmann, C.; Chatti, M.A.; Joarder, S.; Ain, Q.U.; Alatrash, R. Validation of the EDUSS Framework for Self-Actualization Based on Transparent User Models: A Qualitative Study. In Proceedings of the Adjunct Proceedings of the 31st ACM Conference on User Modeling, Adaptation and Personalization, Limassol, Cyprus, 26–29 June 2023; pp. 229–238. [Google Scholar]
  111. Bougouin, A.; Boudin, F.; Daille, B. TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction. In Proceedings of the IJCNLP, Nagoya, Japan, 14–18 October 2013. [Google Scholar]
  112. Jardine, J.G.; Teufel, S. Topical PageRank: A Model of Scientific Expertise for Bibliographic Search. In Proceedings of the EACL, Gothenburg, Sweden, 26–30 April 2014. [Google Scholar]
  113. Boudin, F. Unsupervised Keyphrase Extraction with Multipartite Graphs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; pp. 667–672. [Google Scholar] [CrossRef]
  114. Campos, R.; Mangaravite, V.; Pasquali, A.; Jorge, A.M.; Nunes, C.; Jatowt, A. YAKE! Keyword extraction from single documents using multiple local features. Inf. Sci. 2020, 509, 257–289. [Google Scholar] [CrossRef]
  115. Hulth, A. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Sapporo, Japan, 11–12 July 2003; pp. 216–223. [Google Scholar]
  116. Yu, P.; Wang, X. BERT-Based Named Entity Recognition in Chinese Twenty-Four Histories. In Proceedings of the International Conference on Web Information Systems and Applications, Cham, Switzerland, 16 August 2020; pp. 289–301. [Google Scholar] [CrossRef]
  117. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  118. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst. 2019, 32, 866. [Google Scholar]
  119. Mendes, P.; Jakob, M.; García-Silva, A.; Bizer, C. DBpedia spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, Graz, Austria, 7–9 September 2011; pp. 1–8. [Google Scholar] [CrossRef]
  120. Cheng, Y.; Qiu, G.; Bu, J.; Liu, K.; Han, Y.; Wang, C.; Chen, C. Model bloggers’ interests based on forgetting mechanism. In Proceedings of the 17th International Conference on World Wide Web, Beijing, China, 21–25 April 2008; pp. 1129–1130. [Google Scholar]
  121. Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; Bordes, A. Supervised learning of universal sentence representations from natural language inference data. arXiv 2017, arXiv:1705.02364. [Google Scholar]
  122. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676. [Google Scholar]
  123. Hill, F.; Reichart, R.; Korhonen, A. SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Comput. Linguist. 2015, 41, 665–695. [Google Scholar] [CrossRef]
  124. Asaadi, S.; Mohammad, S.; Kiritchenko, S. Big BiRD: A large, fine-grained, bigram relatedness dataset for examining semantic composition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 505–516. [Google Scholar]
  125. Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; Specia, L. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv 2017, arXiv:1708.00055. [Google Scholar]
Figure 1. Interest model generation pipeline.
Figure 2. Interest model generation pipeline.
Figure 3. Recommendation generation pipeline.
Figure 4. Recommendation generation pipeline.
Figure 5. Interest modeling evaluation results using subjective opinions of users.
Figure 6. Semantic similarity score distributions with SingleRank.
Figure 7. Semantic similarity score distributions with SIFRank_SqueezeBERT.
Table 1. Keyword extraction algorithm performance measures on the Inspec dataset.

| Algorithm | P@5 | R@5 | F@5 | P@10 | R@10 | F@10 | P@15 | R@15 | F@15 |
|---|---|---|---|---|---|---|---|---|---|
| TextRank | 18.15 | 7.10 | 9.79 | 16.15 | 9.58 | 11.51 | 14.88 | 10.15 | 11.48 |
| SingleRank | 30.96 | 13.60 | 17.99 | 26.95 | 22.04 | 23.02 | 23.57 | 27.01 | 24.03 |
| TopicRank | 26.97 | 11.52 | 15.38 | 21.86 | 17.31 | 18.41 | 19.53 | 21.24 | 19.51 |
| TopicalPageRank | 30.36 | 13.37 | 17.67 | 26.31 | 21.44 | 22.47 | 23.34 | 26.43 | 23.69 |
| PositionRank | 32.12 | 13.82 | 18.38 | 25.45 | 20.79 | 21.77 | 22.79 | 25.80 | 23.15 |
| MultipartiteRank | 28.60 | 12.11 | 16.20 | 21.99 | 17.83 | 18.70 | 19.76 | 22.75 | 20.20 |
| RAKE | 20.02 | 9.13 | 11.87 | 21.54 | 18.27 | 18.75 | 18.42 | 21.57 | 18.97 |
| YAKE! | 24.80 | 11.14 | 14.59 | 20.32 | 17.70 | 17.88 | 17.86 | 22.78 | 18.96 |
| Percentage Improvement (%) | - | - | - | 2.4 | 2.7 | 2.4 | 0.9 | 2.1 | 1.4 |

P—precision, R—recall, F—F-measure, @K—at K extracted keywords.
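To make the reported measures concrete, the following minimal Python sketch computes exact-match precision, recall, and F-measure at K for a single document. The predicted and gold keyphrases are hypothetical; this illustrates the metric definitions, not the evaluation code used in the study.

```python
# Minimal sketch (not the study's evaluation code): exact-match
# precision/recall/F-measure at K, the measures reported in Table 1.

def prf_at_k(predicted, gold, k):
    """Score the top-k predicted keyphrases against the gold keyphrases."""
    top_k = [p.lower() for p in predicted[:k]]
    gold_set = {g.lower() for g in gold}
    hits = sum(1 for p in top_k if p in gold_set)
    precision = hits / k
    recall = hits / len(gold_set) if gold_set else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Hypothetical document from an Inspec-style corpus:
predicted = ["neural networks", "keyphrase extraction", "graph ranking",
             "word embeddings", "text mining"]
gold = ["keyphrase extraction", "word embeddings", "graph ranking"]
print(prf_at_k(predicted, gold, k=5))  # (0.6, 1.0, 0.75)
```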
Table 2. Interest modeling evaluation results using statistical metrics.

| Interest Model | Precision@k | MRR | MAP |
|---|---|---|---|
| Interest model 1 (SingleRank + Wikipedia) | 0.62 | 0.64 | 0.65 |
| Interest model 2 (SIFRank_SqueezeBERT + Wikipedia) | 0.69 | 0.86 | 0.78 |
| Interest model 3 (SIFRank_SqueezeBERT + DBpedia) | 0.73 | 0.81 | 0.78 |
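As a reference for how the three ranking metrics in Table 2 (and later in Table 6) are defined, the sketch below computes Precision@k, mean reciprocal rank (MRR), and mean average precision (MAP) from binary relevance judgments over ranked lists. The per-user judgments shown are hypothetical.

```python
# Minimal sketch of Precision@k, MRR, and MAP over ranked lists with
# binary relevance judgments (1 = relevant, 0 = not relevant).

def precision_at_k(rels, k):
    return sum(rels[:k]) / k

def reciprocal_rank(rels):
    # Reciprocal of the rank of the first relevant item; 0 if none.
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def average_precision(rels):
    # Mean of the precision values at each relevant position.
    hits, score = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0

# Hypothetical top-5 relevance judgments for three users:
users = [[1, 1, 0, 1, 0], [0, 1, 1, 0, 0], [1, 0, 0, 0, 1]]
p_at_5 = sum(precision_at_k(u, 5) for u in users) / len(users)
mrr = sum(reciprocal_rank(u) for u in users) / len(users)
mean_ap = sum(average_precision(u) for u in users) / len(users)
print(round(p_at_5, 2), round(mrr, 2), round(mean_ap, 2))  # 0.47 0.83 0.73
```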
Table 3. Comparison of the selected embedding techniques using Pearson correlation.

| Model | SimLex999 Pearson Correlation | SimLex999 Time | BiRD Pearson Correlation | BiRD Time | STS Pearson Correlation | STS Time |
|---|---|---|---|---|---|---|
| USE | 0.51 | 396 ms | 0.61 | 2.27 s | 0.78 | 1.12 s |
| SciBERT | 0.07 | 33.7 s | 0.45 | 2 min 10 s | 0.44 | 2 min 59 s |
| all-mpnet-base-v2 | 0.54 | 34.1 s | 0.67 | 1 min 52 s | 0.84 | 2 min 52 s |
| all-distilroberta-v1 | 0.31 | 26.1 s | 0.63 | 1 min 9 s | 0.83 | 1 min 23 s |
| all-MiniLM-L12-v2 | 0.51 | 14.9 s | 0.64 | 36.8 s | 0.83 | 51.3 s |
| msmarco-distilbert-base-tas-b | 0.55 | 25.1 s | 0.59 | 1 min 16 s | 0.79 | 1 min 23 s |
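The comparison in Table 3 follows a standard recipe: encode both items of each benchmark pair, score the pair by cosine similarity, and correlate the model scores with the human ratings. The sketch below uses the public sentence-transformers and SciPy APIs; the word pairs and ratings are illustrative stand-ins for datasets such as SimLex-999.

```python
# Minimal sketch of the Table 3 evaluation, assuming a benchmark of
# (term1, term2, human_score) triples such as SimLex-999. The pairs
# below are illustrative, not taken from any benchmark.
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("msmarco-distilbert-base-tas-b")

pairs = [("car", "automobile", 9.2), ("book", "paper", 6.5),
         ("smart", "intelligent", 9.0), ("cup", "forest", 0.5)]

left = model.encode([a for a, _, _ in pairs])
right = model.encode([b for _, b, _ in pairs])
model_scores = [float(util.cos_sim(l, r)) for l, r in zip(left, right)]
human_scores = [score for _, _, score in pairs]

r, _ = pearsonr(model_scores, human_scores)
print(f"Pearson correlation: {r:.2f}")
```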
Table 4. Comparison of embedding techniques at the keyphrase level.

| Model | Time | Similarity Score Range |
|---|---|---|
| USE | 24 s | 62–53% |
| SciBERT | 1 min 8 s | 95–80% |
| all-mpnet-base-v2 | 1 min 11 s | 76–40% |
| all-distilroberta-v1 | 56 s | 80–41% |
| all-MiniLM-L12-v2 | 45 s | 81–41% |
| msmarco-distilbert-base-tas-b | 47 s | 95–81% |
Table 5. Comparison of embedding techniques at the document level.

| Model | Time | Similarity Score Range |
|---|---|---|
| USE | 3 s | 59–53% |
| SciBERT | 24 s | 72–53% |
| all-mpnet-base-v2 | 22 s | 70–41% |
| all-distilroberta-v1 | 13 s | 66–40% |
| all-MiniLM-L12-v2 | 6 s | 71–41% |
| msmarco-distilbert-base-tas-b | 12 s | 89–70% |
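Tables 4 and 5 differ only in the granularity at which similarity is computed: between each interest keyphrase and the paper, or between the user model as a whole and the paper. The sketch below shows both variants under the same sentence encoder; the interest list and paper text are illustrative and do not reproduce RIMA's actual pipeline.

```python
# Minimal sketch contrasting keyphrase-level (Table 4) and document-level
# (Table 5) similarity. Interests and paper text are hypothetical.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("msmarco-distilbert-base-tas-b")

interests = ["recommender systems", "user modeling", "word embeddings"]
paper = ("Semantic user interest modeling for content-based scientific "
         "publication recommendation using sentence encoders.")
paper_emb = model.encode(paper)

# Keyphrase level: score every interest separately, then average.
keyphrase_scores = [float(util.cos_sim(model.encode(kp), paper_emb))
                    for kp in interests]
keyphrase_level = sum(keyphrase_scores) / len(keyphrase_scores)

# Document level: treat the whole interest model as one pseudo-document.
document_level = float(util.cos_sim(model.encode(", ".join(interests)),
                                    paper_emb))

print(f"keyphrase level: {keyphrase_level:.2f}, "
      f"document level: {document_level:.2f}")
```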
Table 6. Evaluation result for recommendation generation.

| Recommendation List | Precision@k | MRR | MAP | Voting on the Better List |
|---|---|---|---|---|
| Recommendation list 1 (Our approach) | 0.42 | 0.72 | 0.60 | 63% |
| Recommendation list 2 (Semantic Scholar) | 0.39 | 0.63 | 0.58 | 38% |
