RecSys Pertaining to Research Information with Collaborative Filtering Methods: Characteristics and Challenges

Abstract: Recommendation (recommender) systems have played an increasingly important role in both research and industry in recent years. In the area of publication data, for example, there is a strong need to help people find the right research information through recommendations and scientific reports. The difference between search engines and recommendation systems is that search engines help us find something we already know, while recommendation systems are more likely to help us find new items. An essential function of recommendation systems is to support users in their decision making. Recommendation systems are information systems that can be categorized into decision support systems, as long as they are used for decision making and are intended to support people instead of replacing them. This paper deals with recommendation systems for research information, especially publication data. We discuss and analyze the challenges and peculiarities of implementing recommender systems for the scientific exchange of research information. For this purpose, data mining techniques are examined and a concept for a recommendation system for research information is developed. Our aim is to investigate to what extent a recommendation system based on a collaborative filtering approach with cookies is possible. The data source is publication data extracted from cookies in the Web of Science database. The results of our investigation show that a collaborative filtering process is suitable for publication data and that recommendations can be generated with user information. In addition, we have seen that collaborative filtering is an important element that can solve a practical problem by sifting through large amounts of dynamically generated information to provide users with personalized content and services.


Introduction
In the course of digitization, humanity has entered an information age. The Internet has become an important part of society and offers the possibility of accessing information at almost any time. Search engines help users to filter or narrow down information from the Internet that exactly or roughly corresponds to their search. Institutions and libraries are faced with the challenge of managing their large range of publications and delivering them as quickly as possible in a way that satisfies the needs of users or arouses their interest. Recommender systems (RecSys) help institutions to provide personalized and individually adapted publications on the basis of the analysis of large amounts of data that, for example, a researcher may leave on a website. They are thus becoming a more and more important part of marketing strategies. Since the mid-1990s, recommendation systems have developed into an independent research area. In recent years, interest in recommendation systems has increased further and research has been carried out in the area of publication data in order to recommend relevant articles to users as references [1,2].

This paper addresses the following research question:
• To what extent is making a recommendation possible if it is based on a collaborative filtering approach directed at research information, in particular, publication data?
The implemented collaborative filtering method is carried out as clustering on N-dimensional vectors, where N is the number of publications; these vectors define the feature space for the k-means algorithm.
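This construction can be sketched in Python as follows. This is a minimal illustration with invented toy data; the function names and the plain k-means implementation are ours, not taken from the KNIME workflow used later in the paper:

```python
import random

def make_feature_vectors(reads, publications):
    """One N-dimensional 0/1 vector per user (N = number of publications);
    component j is 1 if the user has accessed publication j."""
    return {user: [1 if p in seen else 0 for p in publications]
            for user, seen in reads.items()}

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means over the user vectors; returns user -> cluster index."""
    random.seed(seed)
    users = list(vectors)
    centroids = [list(vectors[u]) for u in random.sample(users, k)]
    assignment = {}
    for _ in range(iterations):
        for u in users:  # assignment step: nearest centroid (squared Euclidean)
            assignment[u] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(vectors[u], centroids[c])))
        for c in range(k):  # update step: mean of the assigned vectors
            members = [vectors[u] for u in users if assignment[u] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical toy data: four users, three publications.
reads = {"u1": {"p1", "p2"}, "u2": {"p1", "p2"}, "u3": {"p3"}, "u4": {"p3"}}
vectors = make_feature_vectors(reads, ["p1", "p2", "p3"])
clusters = kmeans(vectors, k=2)
```

With these toy vectors, the two users who read the same publications end up in the same cluster, which is the grouping the collaborative filtering step later exploits.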
Research information is metadata that describes in a structured way which research activities take place at a research institution. This can be information about ongoing and completed research projects, publications, or other research activities. The data source used is data extracted from cookies from the publication data of the Web of Science database, which we have access to at the German Center for Higher Education Research and Science Studies (DZHW). The publication data consist of 60 columns and 1 million lines. This paper is divided into six sections. Section 1 is the introduction. Section 2 provides an overview of recommendation systems and the state of the art in this area. Then, the types of recommendation systems and their advantages and disadvantages are presented. Section 3 deals with the challenges and special features that result from the application of the collaborative filtering approach. Section 4 describes the materials and methods used in this paper. In Section 5, the procedure, data processing techniques, and technical environment are detailed and a model is created to serve as an aid to research institutions. Finally, the implementation of this model using the KNIME software (https://www.knime.com/, accessed on 12 December 2021) is described and the results are presented. In Section 6, the important results are summarized and an outlook on future work is given.

State of the Art: RecSys and Collaborative Filtering Processes
Our aim is to analyze which recommendation systems exist in the literature and how this potential has been utilized by various authors working on this topic. Through our literature review, we found that many recommendation algorithms have been proposed and tested by authors in different contexts, but not in the context of research information. The use of recommendation systems in the field of research information processing is of great interest to the research information community as well as to research institutions and researchers. There is no uniform definition of recommendation systems used for research purposes; they take on different meanings depending on the requirements and applications involved. The origins of recommendation systems lie in the research areas of information retrieval, cognitive science, and machine learning [10]. Most of the studies on recommendation systems, such as [11][12][13], address the underlying procedures used. Since research in this area began, various recommendation algorithms have been proposed and tested. Based on [14,15], the following six categories of recommendation systems can be distinguished. Their functionality is shown in Figure 1.

1. Non-personalized filtering: The recommendations are the same for all customers. This type of filtering materializes, for example, in systems where the most popular products are recommended.
• Advantages: Non-personalized filtering is easy to implement, since the recommendations consist of popular or highly rated articles and the data required for these recommendations can therefore be easily captured [17].
• Disadvantages: As filtering is not personalized, every user receives the same recommendation; therefore, such recommendations may not apply to all users [17].

2. Collaborative filtering: Recommendations are based on historical reviews of user data. Traditionally, either items rated highly by similar users or articles that are similar to already highly rated items are recommended.
• Advantages: The greatest advantage of collaborative filtering systems is that they do not need any reference to the content of the article. This means that they are completely independent of any kind of content information and item descriptions. All that is required to recommend an item is a name and the associated rating.
• Disadvantages: Collaborative filtering is fundamentally based on statistical methods and requires a certain degree of consistency in user ratings. As a result, a very high density of ratings is required in order for it to function properly.

3. Content-based filtering: Recommendations are based on the content or properties of articles. Articles that are similar in content to preferred articles are recommended.
• Advantages: Content-based filtering uses article-to-article correlation [18]. The user is offered articles that would be suitable according to their user profile; this knowledge is derived from the profile of the individual user. With this method, the cold start problem is minimized, and since there is no new item problem, only the new user problem remains. There are no privacy problems, because one's own ratings and preferences are not visible to other users. Additionally, the detailed preferences of users can be taken into account.
• Disadvantages: Content-based filtering and its systems are limited by the content-descriptive characteristics of the items needing to be evaluated. For example, a content-based recommendation system for publications can only be based on article descriptions, such as the author's name, publication title, journal, volume, issue, year, etc., because the publication itself cannot be interpreted by the system. This means that only fully described publications (namely, all publications that are included in the calculation) can lead to a successful recommendation. One also needs enough descriptors to avoid meta-evaluations such as those seen in collaborative filtering. There is also the problem that not all aspects of an article can be formally described.

4. Knowledge-based filtering: Recommendations are made based on specialist knowledge of how certain properties of an item influence the satisfaction of user needs. For this purpose, similarity functions or knowledge databases with explicit rules are used.
• Advantages: With this method, users have the advantage of finding articles based on restrictions with which they are familiar, or articles that are similar and meet the criteria set [18].
• Disadvantages: The system must have access to a knowledge database from which the information can be easily identified or derived. In addition, in order to make a good recommendation, there is the challenge of investigating which article properties are decisive for the user [18].

5. Demographic filtering: Recommendations are based on sociodemographic data. This method only uses information about the user, such as their gender, age, and employment status. The system thus generates recommendations based on demographic similarities.
• Advantages: In contrast to other filtering methods, this method does not require historical data [19].

• Disadvantages: Collecting all of the demographic information can conflict with the requirements of data protection, as it involves sensitive and personal data. In addition, this is one of the traditional techniques used to create user profiles, and no advanced data mining techniques are used [10].

6. Hybrid filtering: A combination of the filter methods already presented is used to generate recommendations. Recommendations consist of combinations of several filter methods, such as collaborative filtering and content-based filtering.
• Advantages: The aim of hybrid filtering is to bring different models together in order to increase performance and minimize the system-specific and inherent problems of recommendation systems.
• Disadvantages: With hybrid filtering, the quality of the recommendations depends both on the quality of the item descriptions and on the number of ratings.
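Two of the categories above can be sketched compactly in Python. This is a toy illustration with an invented view log and invented article metadata; the simple bag-of-words cosine similarity is one common choice for content-based matching, not necessarily the one used by the systems cited above:

```python
import math
from collections import Counter

# Hypothetical view log of (user, article) pairs and article metadata.
VIEW_LOG = [("u1", "a1"), ("u2", "a1"), ("u3", "a1"),
            ("u1", "a2"), ("u2", "a2"), ("u3", "a3")]
METADATA = {
    "a1": "deep learning for recommender systems",
    "a2": "recommender systems survey",
    "a3": "soil chemistry methods",
}

def most_popular(log, n=2):
    """Category 1 (non-personalized): the same top-N most viewed
    articles are recommended to every user."""
    counts = Counter(article for _user, article in log)
    return [article for article, _ in counts.most_common(n)]

def bag_of_words(text):
    """Word-count vector over an article's textual descriptors."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(article):
    """Category 3 (content-based): recommend the article whose descriptors
    are closest to the given one (article-to-article correlation)."""
    v = bag_of_words(METADATA[article])
    others = [a for a in METADATA if a != article]
    return max(others, key=lambda a: cosine(v, bag_of_words(METADATA[a])))
```

On this toy data, `most_popular(VIEW_LOG)` recommends the two most viewed articles to everyone, while `most_similar("a1")` picks the article with overlapping descriptors rather than the unrelated one.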
Collaborative recommendation algorithms based on historical evaluation data are the most widely used techniques [20,21]. The literature research in the Web of Science showed that the topic of recommendation systems and the methodology of collaborative filtering in the field of research information processing receives too little attention nowadays and is therefore the focus of the research in this work. The high-quality theoretical elaboration of the related issues opens up a wide range of possibilities for applied research and experiments with research information collections. Collaborative filters can make use of the collective strength of publication data to generate recommendations. The idea behind such procedures is that unknown evaluations can be predicted because known evaluations from different users and/or publications often correlate strongly with one another. Consider two scientific users A and B who have very similar research areas. If their submitted ratings are similar, this will be recognized by the recommendation algorithm. In such cases, it is very likely that ratings given so far by only one of the two users would also be similar for the other. This presumption of similarity is used to draw conclusions about unknown evaluations. Figure 2 shows the schematic flow of a collaborative recommendation process.
On the basis of an m × n user-publication matrix, a collaborative filter algorithm generates either recommendations or forecasts, which are presented to the users in the form of a top-N recommendation list. Two types of procedures are used in collaborative recommendation systems [22,23]:
• Memory-based, where a user rating of an item is predicted on the basis of all previous ratings of the other users.

• Model-based, where stochastic procedures and a training data set are used in an attempt to find patterns in the evaluations of the various users and to create a model based on them. The learned model is then used to propose new articles. In the case of model-based methods, methods from the field of machine learning and data mining are used as forecasting models. Examples of such methods are decision trees, rule-based models, Bayesian methods, and latent factor models.
The collaborative filtering concept can be implemented with two different methods. In the case of the item-based neighborhood method, the similarity of the articles and the evaluation patterns of the users can be used to create recommendations. If two different items are rated good or bad by the same users, the items are considered similar, assuming that the users have identical preferences for the same items. This method considers the set of items evaluated by a user, calculates from this the usefulness f(u, i) of a target item, and decides whether it should be suggested to the user or not. Figure 3 shows a user × item matrix in which the users u1 and u2 rated two items ij and ik.

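The item-based neighborhood idea can be sketched as follows. The ratings are invented toy data in the spirit of Figure 3, and the cosine similarity over co-rating users is one common choice for the item-item similarity, not necessarily the one used in the paper:

```python
import math

# Hypothetical user x item ratings (1 = good, 0 = bad).
ratings = {
    "u1": {"i_j": 1, "i_k": 1, "i_t": 1},
    "u2": {"i_j": 1, "i_k": 1},
    "u3": {"i_j": 0, "i_k": 0, "i_t": 0},
}

def item_sim(a, b):
    """Cosine similarity between two items over the users who rated both."""
    common = [u for u in ratings if a in ratings[u] and b in ratings[u]]
    if not common:
        return 0.0
    dot = sum(ratings[u][a] * ratings[u][b] for u in common)
    na = math.sqrt(sum(ratings[u][a] ** 2 for u in common))
    nb = math.sqrt(sum(ratings[u][b] ** 2 for u in common))
    return dot / (na * nb) if na and nb else 0.0

def usefulness(user, target):
    """f(u, i): weighted average of the user's own ratings of the items
    most similar to the target item."""
    rated = [i for i in ratings[user] if i != target]
    weights = [(item_sim(target, i), ratings[user][i]) for i in rated]
    total = sum(w for w, _ in weights)
    return sum(w * r for w, r in weights) / total if total else 0.0
```

Here `usefulness("u2", "i_t")` is high because u2 rated i_j and i_k well and both are rated exactly like i_t by the other users, so i_t would be suggested to u2.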
In the user-based neighborhood method, similarities are defined on the basis of similar users. The similarities or neighborhoods between the users are determined by their evaluation behavior. Based on n users and m recommendation items, the matrix R = (rij) with i = 1...n and j = 1...m is generated. Here, the value rij represents the evaluation of item j by user i. The rating can be explicit (good/bad) or implicit (read/not read). The aim is thus to recommend an item Iy to a user Ux by using those users among U = {U1, ..., Un} who are most similar to Ux and have given good ratings to items that the user Ux has not yet rated. The Pearson correlation is often used in this method to calculate the similarity between two users U1 and Ux. Once the k nearest neighbors have been found, the rating is calculated as the weighted average of the ratings across the k nearest neighbors.
Figure 4 shows the (user × item) matrix that is used to calculate the similarity between users based on jointly rated items.
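A minimal sketch of this user-based procedure follows. The rating matrix is invented, and restricting the neighborhood to positively correlated users is our simplifying assumption:

```python
import math

# Hypothetical (user x item) rating matrix R; absent entries are unrated.
R = {
    "U1": {"I1": 5, "I2": 4, "I3": 1},
    "U2": {"I1": 4, "I2": 5, "I3": 1, "Iy": 5},
    "U3": {"I1": 1, "I2": 1, "I3": 5, "Iy": 1},
}

def pearson(u, v):
    """Pearson correlation between two users over their jointly rated items."""
    common = set(R[u]) & set(R[v])
    if len(common) < 2:
        return 0.0
    mu = sum(R[u][i] for i in common) / len(common)
    mv = sum(R[v][i] for i in common) / len(common)
    num = sum((R[u][i] - mu) * (R[v][i] - mv) for i in common)
    du = math.sqrt(sum((R[u][i] - mu) ** 2 for i in common))
    dv = math.sqrt(sum((R[v][i] - mv) ** 2 for i in common))
    return num / (du * dv) if du and dv else 0.0

def predict(user, item, k=2):
    """Rating forecast: weighted average over the k most similar
    (positively correlated) neighbours who have rated the item."""
    sims = [(pearson(user, v), v) for v in R if v != user and item in R[v]]
    sims = sorted((s, v) for s, v in sims if s > 0)[-k:]
    denom = sum(s for s, _ in sims)
    return sum(s * R[v][item] for s, v in sims) / denom if denom else 0.0
```

On this toy matrix, U1 correlates positively with U2 and negatively with U3, so the forecast for the unrated item Iy is driven by U2's rating.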

Challenges When Using RecSys
Recommendation systems often have to operate in environments that present a number of challenges. Nevertheless, in the interests of the scientific institutions, they should quickly generate precise recommendations. The extent to which collaborative recommendation processes are able to do this depends on how they deal with the following challenges, which are also part of their characteristics.


Insufficient data: When using the recommendation system, it is imperative that sufficient data are available. However, this does not depend on the pure amount of data. It is much more important that these meet a certain quality standard, because recommendations can only be made with sufficiently high-quality information. Therefore, the institutions should attach great importance to the right choice and the collection of suitable data from the start.
This gives rise to a problem that is well known in the application area of recommendation systems: the cold start problem [24], which occurs when a new user or item is added to the system. In such situations, there are no known likes or dislikes for a new user, nor has any user rated a new item. There is therefore no data basis on which recommendations for a user or an article can be calculated.
Ambiguity in the database: If the labeling is inaccurate, it can lead to the same items with different labels being incorrectly rated. As a result, publications may be suggested to a user that do not meet their needs.
Scalability: The huge volume of publications overwhelms many research institutions these days. In this respect, it is becoming necessary to use increasingly large data sets for recommendation systems, and collaborative filtering methods have scaling problems as the number of users and items grows. There are different ways to solve this challenge. On the one hand, there is incremental singular value decomposition [25], which calculates new forecasts after adding additional evaluations without recalculating the dimension-reduced model [26]. This means that the system is not forced to recalculate the entire algorithm after every update of the database. However, it should be noted that the expensive matrix factorization still has to be carried out initially.
Memory-based approaches, such as the item-based collaborative filtering method, can also achieve good scalability. Bayesian collaborative filtering methods address the scaling problem by calculating forecasts based on known evaluations [27,28]. Cluster methods, on the other hand, do not search for similar users within the entire database, but rather generate recommendations by searching within small clusters [29]. However, the increased scalability comes at the expense of forecast accuracy.
Gray Sheep: Gray sheep are users whose preferences consistently neither agree nor disagree with the ratings of other user groups and who therefore do not benefit from the collaborative recommendation process. In contrast, the group of black sheep includes users whose unique research areas make recommendations almost impossible. To solve the gray sheep problem, Claypool et al. [30] used a hybrid approach of content-based and collaborative methods that calculated a weighted average of both forecasts. The weights were set in order to generate an optimal mix for each user.
Shilling Attacks: In some cases, all visitors can leave reviews, regardless of whether they have actually selected the publication or not. This makes it possible to ensure that one's own articles are recommended and that competing articles receive bad reviews. This phenomenon is referred to in the literature as a shilling attack [31]. Bell and Koren [32] proposed a solution to this problem for memory-based methods. To do this, they removed global effects during data normalization and used the residuals to look for neighborhoods.
There are different properties that recommendation systems should have; these are discussed below:
• Relevance: The filtered set of articles should be relevant to the user.

• Novelty: The recommendations determined should be novel to the user, i.e., after a certain contact frequency without user interaction, a recommendation should be replaced by a new one.
• Discovery: Recommendation systems should deliver results that are unknown and surprising to the user, but at the same time interesting.
• Diversity: The recommendations from the recommendation system should have a certain diversity; i.e., even if the user has only looked at publications from the IT area so far, they should not only see publications from the IT area as recommendations.
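The diversity property, for example, can be quantified as the average pairwise dissimilarity of a recommendation list (a common intra-list diversity measure); the Jaccard similarity over category sets is our illustrative choice, not a measure prescribed by the paper:

```python
def jaccard(a, b):
    """Similarity of two publications via their category sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def intra_list_diversity(items):
    """Average pairwise dissimilarity of a recommendation list;
    higher values mean a more diverse list."""
    pairs = [(i, j) for n, i in enumerate(items) for j in items[n + 1:]]
    if not pairs:
        return 0.0
    return sum(1 - jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical category sets of recommended publications.
it_only = [{"IT"}, {"IT"}, {"IT"}]
mixed = [{"IT"}, {"medicine"}, {"energy"}]
```

A list of IT-only publications scores 0, while a list spanning IT, medicine, and energy scores 1, matching the intuition behind the diversity requirement above.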

Materials and Methods
As mentioned in the literature, recommendation systems are essential for different platforms and databases [21,33]. The publication database Web of Science, from which the cookies are taken, currently offers a filter function in which the user can filter articles by category. In the context of this database, it should be examined whether recommendations can be generated from the cookies collected. It should also be checked whether it is possible to make a recommendation using the collaborative filtering approach. In order to generate a recommendation, knowledge must be generated from the publication data. For the collaborative filtering approach, it is important to use the data to create profiles of users who are similar in terms of their interests and behavior. To achieve this, the publication data from the cookies must be processed and transformed. Distance functions are used to calculate the similarities of profiles. The profiles should then be clustered. For this purpose, it should be checked which algorithms from the data mining environment are suitable. Articles that have not yet been read should then be recommended to a user. The recommendation is thus based on the similar behaviors and interests of users. The procedure is based on this process, which is shown in Figure 5. Recommendation systems use data mining techniques with the aim of generating patterns from the data sources. The characteristics of these patterns are that they are valid, previously unknown, potentially useful, and easy to understand.
Data mining means digging through data with the aim of generating knowledge. Data and knowledge are regarded as gold by companies today, since sales and profits can be generated from them. The information overload and storage of large amounts of data can obscure interesting relationships between data. The entire process from data selection to knowledge generation is referred to as knowledge discovery in databases (KDD). Data mining is therefore a phase in the KDD process. The methods of data mining are divided into different categories in the literature. Some methods, techniques, and algorithms considered in our work (e.g., clustering methods such as k-means) are discussed in [34]. Therefore, we do not go into the details of how they work here. The publication data, which are used for the KDD process, are shown in Figure 6. This includes publication data from a Web of Science publication database, which is based on cookies. Cookies can be understood as a supplement to HTTP [35]. They contain text information that is generated when a client accesses a website. Cookies can be saved in the client's browser.
When a website is used again, the cookies can be read out directly by the web server or transmitted to the server via a script [35]. Their tasks include, for example, identifying the client and storing shopping carts and registrations [35].


Results
The data were made available by the Web of Science database as a table file. The raw file was 100 MB in size and contained 60 columns and 1 million lines. All publication data were of the data types string and integer. A higher number of cookie records leads to a more relevant analysis, because each line represents a session cookie in which a user accessed an article in the publication database; each article call is represented in one line. The cookies used in this work were made available on a hard drive. The complete raw data record and all subsequently processed data records are shown in Figure 6.
This relevant information was extracted from cookies and reduced to these dimensions.
First name; Last name; Gender; Date of birth; ORCID; DOI; Title.
This is the information that could be interpreted and was taken into account in the further course of this work. For the KDD process, we used KNIME, which is an open-source data mining software developed by the University of Konstanz. KNIME is compatible with all operating systems. More information about the KNIME tool can be found in [36].
The publication data were examined for missing, noisy, incorrect, and inconsistent data. A total of 153,065 null values were found; they were replaced by the string "unknown" in the device column and remained in the data record. Noisy, inconsistent, and incorrect data as well as outliers were corrected. Sufficient data are required to make a meaningful recommendation.
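The null-handling step above can be sketched as follows. This is a minimal illustration of replacing missing values with the string "unknown" in a device column; the row layout and column name are assumptions matching the description, not the actual KNIME workflow.

```python
# Replace missing values in the "device" column with "unknown",
# mirroring the cleaning step described in the text.
# The row layout is an illustrative assumption.
def clean_rows(rows, column="device", placeholder="unknown"):
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the raw record stays untouched
        if row.get(column) in (None, ""):
            row[column] = placeholder
        cleaned.append(row)
    return cleaned

rows = [{"AuthorID": "u1", "device": None},
        {"AuthorID": "u2", "device": "desktop"}]
print(clean_rows(rows)[0]["device"])  # unknown
```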
Each line of the current data record represents an article view by a user; the user can be identified by the AuthorID. In order to recognize which articles a user has read, the previous table was transformed, as can be seen in Figure 7. The columns of the new table contain all categories that indicate the content of the articles (e.g., in the field of medicine, IT, energy, etc.) and have a connection with the title of the article. The values represent the number of articles read in the respective category. In total, there are 10,467 users.
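The transformation into the per-user category table can be sketched as follows: per-view records (one row per article call) are aggregated into a user-by-category count profile, as in Figure 7. The field names are illustrative assumptions.

```python
from collections import defaultdict

# Aggregate per-view records into user profiles of category counts,
# as described for the transformed table in Figure 7.
# Field names (AuthorID, Category) are illustrative assumptions.
def build_profiles(views):
    profiles = defaultdict(lambda: defaultdict(int))
    for view in views:
        profiles[view["AuthorID"]][view["Category"]] += 1
    # convert nested defaultdicts to plain dicts for readability
    return {user: dict(cats) for user, cats in profiles.items()}

views = [
    {"AuthorID": "u1", "Category": "IT"},
    {"AuthorID": "u1", "Category": "IT"},
    {"AuthorID": "u1", "Category": "Medicine"},
]
print(build_profiles(views))  # {'u1': {'IT': 2, 'Medicine': 1}}
```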
The box plot analysis (see Figure 8) shows that the number of articles read varies greatly. New features were added to the table shown here. All users who had read fewer than 24 articles were assigned a reading record of 1, regardless of the number of documents read. Average readers were assigned reading record numbers between 24 and 75, and prolific readers were given reading record numbers over 75. A new column, reading activity, was added to the profile of the readers. The entire corpus was analyzed for new feature extractions from the data. The values of the available data are numerical, so the k-means algorithm as described above is suitable for the collaborative filtering approach; moreover, k-means is a widely used partitioning clustering method. The clusters found represent the basis for collaborative filtering. A well-known and established distance function in preparation for the k-means algorithm is the Euclidean distance. Therefore, the Euclidean distance is calculated and the k-means algorithm is then applied with k = 9, based on the 9 first-level categories. The number of iterations was varied; after 30 iterations the results no longer changed. Each cluster represents a specific user behavior pattern. As a result, a new cluster column is added to each profile, indicating which cluster the profile belongs to (see Figure 9).
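The clustering step can be sketched with a minimal k-means implementation using the Euclidean distance, under the same fixed-iteration scheme described above. This is an illustrative sketch (profiles are shown as plain numeric tuples), not the KNIME node actually used in the work.

```python
import math
import random

# Euclidean distance between two numeric profile vectors.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Minimal k-means: assign each profile to its nearest centroid,
# recompute centroids, repeat for a fixed number of iterations
# (the text reports convergence after 30 iterations).
def kmeans(points, k, iterations=30, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[idx].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep old centroid if a cluster emptied out
                centroids[i] = tuple(sum(d) / len(members)
                                     for d in zip(*members))
    labels = [min(range(k), key=lambda i: euclidean(p, centroids[i]))
              for p in points]
    return centroids, labels

# Two clearly separated groups of toy profiles end up
# in two different clusters.
profiles = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = kmeans(profiles, k=2)
```

In the paper's setting, k = 9 and each profile vector holds the per-category read counts; the resulting label becomes the new cluster column of the profile.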
The results of the k-means algorithm show which cluster each profile belongs to. Many different articles can be found in the corpus of the clusters, some of which overlap with other clusters. The question here is which articles exactly should be recommended. For this purpose, each cluster is reduced: within the cluster, the Top 10 most frequently read articles are selected. It is then checked which articles a user has and has not read, and the unread articles are recommended to the user. The recommendation is thus based on the respective profiles and their affiliation to a specific cluster profile, which is generated from their reading behavior.
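The Top-10 selection described above can be sketched as follows: rank the articles read within a user's cluster by frequency, then recommend the highest-ranked articles the user has not yet read. Article identifiers are illustrative assumptions.

```python
from collections import Counter

# Recommend the most frequently read articles in a user's cluster
# that the user has not yet read (the Top-10 scheme from the text).
def recommend(user_reads, cluster_reads, top_n=10):
    # cluster_reads: article IDs read by all members of the cluster,
    # one entry per read, so Counter ranks by read frequency
    ranked = [article for article, _ in Counter(cluster_reads).most_common()]
    return [a for a in ranked if a not in user_reads][:top_n]

cluster = ["a1", "a1", "a2", "a2", "a2", "a3"]
print(recommend({"a2"}, cluster, top_n=2))  # ['a1', 'a3']
```

Note that `most_common()` already yields a frequency ranking, which also supports the ranking among the Top 10 suggested later as an improvement.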
The data used in this paper were mainly based on cookies. Some information could be taken from the cookies, but the assumption made here is that an AuthorID is also assigned to a profile. Since cookies are usually deleted automatically when the browser is closed, a new AuthorID is assigned to the same user when the page is called up again. This assumption was sufficient for the prototypical approach, but for a productive system it would be better to differentiate between users who have created an account and those without one. The existing cookies show a total of 153,065 (15.30%) null values, because the categories could not be determined from the respective cookies. Data processing can be simplified considerably if there is access to the publication database, which provides the opportunity to work with structured data.
The evaluation of clustering processes is generally difficult. It is easier if example clusters are already given and the investigation examines how similar the found clusters are to them; with a clustering procedure, however, the target clusters are usually not known. The model can only be rated when the expected result is available for comparison. In the context of this paper, recommendations were generated through the KDD process and the use of the respective methods. The model can, e.g., be evaluated by the accuracy, such as [accuracy = number of successful recommendations/number of recommendations]. This calculation is used in many different KDD processes, including Burke [12]. The relative proportion of incorrectly classified data over all assignments is then calculated. However, the prerequisite for this method is that the expected result is available for comparison. In the case of a product recommendation, one could observe whether a user buys a publication from a scientific journal; in this case, one should observe whether a user reads or clicks on the recommended article on the publication database website.
For the evaluation of accuracy, the F1 score can also be determined as [F1 = 2 × (precision × recall)/(precision + recall)]. A comparison can then be made, as can be seen in Figure 10. Since these outcome data are not available in this database, the accuracy cannot be calculated; for this, the model would have to be implemented on the productive platform in order to find out to what extent the generated recommendations are accepted by the users. In the context of this work, the timestamps were not taken into account, although the topicality of an article can be recognized from them, and current publications are mostly relevant. In addition, the recommendation structure of the Top 10 articles could be improved: a ranking among the Top 10 could show which article within the Top 10 was read most often.
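The two evaluation measures discussed above can be computed as follows. The sample counts are illustrative, not results from the paper; the F1 formula is the standard harmonic mean of precision and recall.

```python
# Accuracy as defined in the text:
# successful recommendations / total recommendations.
def accuracy(successful, total):
    return successful / total

# Standard F1 score: harmonic mean of precision and recall.
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Illustrative numbers only (no such outcome data were available
# in the database used for this work).
print(accuracy(30, 40))               # 0.75
print(round(f1_score(0.6, 0.75), 3))  # 0.667
```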

Conclusions and Future Work
The paper gave an insight into the best-known concepts and procedures from the subject areas of recommendation systems and data mining in the field of research information and indicated which institutions should consider their use. In addition, a prototypical approach for a recommendation system using collaborative filtering based on cookies, developed on the publication data from Web of Science, was presented. With the dataset of cookies used for this paper, recommendations could be generated through a collaborative filtering approach. Similar users were identified by means of a clustering process, and their profiles were created based on information about the articles they read. Recommendations from their cluster were then made for the users. No evaluation was performed here because the developed model must be observed in the production environment. In addition, it can be said that, due to the null values, the data source provides little of the information relevant for profiling users. This study shows that a collaborative filtering process is suitable for publication data and that recommendations can be generated with little user information. Web mining and data mining in relation to research information remain exciting topics whose potential has not yet been fully exploited. From the selection of approaches to the analysis of user behavior and algorithms, new approaches are constantly being published. In addition, research institutions have to invest in these subject areas, because developments in publication databases and repositories show that flexible structures are becoming more and more important and that users expect corresponding publications and recommendations.
In future work, we will consider the use of artificial intelligence (AI) and machine learning to generate recommendations because they offer a comprehensive portfolio of algorithms that can be used in recommendation systems to predict ratings or interactions based on item and user attributes [37] in the context of publication data. Based on this premise, we will propose a concrete example of a recommendation system that is able to find and recommend relevant but not frequently accessed publications.
Author Contributions: O.A. and T.K. contributed to the design and implementation of the research, to the analysis of the results, and to the writing of the manuscript. All authors have read and agreed to the published version of the manuscript.