A Jaccard Similarity-Based Model to Match Stakeholders for Collaboration in an Industry-Driven Portal

Abstract: Industry–university collaboration has been emphasized for innovation and economic development in the Triple Helix Model. Facilitating this collaboration often necessitates implementing interfaces between stakeholders. EGEVASYON, an industry-driven platform whose name combines "innovation" and "Ege" (for the Aegean Region of Turkey), has been proposed to foster collaboration by involving researchers in industry projects. Moreover, the platform has a portal project under development, where researchers can receive recommendations among ongoing projects. Our study presents the use of the Jaccard similarity measure in this recommendation model. Moreover, recommendation selection is demonstrated using a sample dataset of EU projects.


Introduction
The collaboration between universities and industry is often emphasized as an essential factor that facilitates innovation. The triple helix model implies the necessity of communication and interaction across multiple parties, which paves the way to knowledge-intensive industries and innovation [1]. Moreover, universities need to understand the challenges and problems that the industry deals with and prepare graduates accordingly [2]. A substantial body of prior research covers the role of government in establishing convenient environments for collaboration between university and industry. Despite the opportunities, a variety of factors inhibit university-industry collaboration. In 2010, those barriers were extensively explored in [3], and one of the primary issues noted was the difference in focus among the stakeholders. The study argued that academic researchers are orientated towards pure science in the long term, while the industry expects more practical outputs in the short run. Furthermore, [4] pointed out significant differences between the university-driven and industry-driven innovation ecosystems. Either way, the outputs of collaboration are mostly complex and intangible, and their benefits are often observed in the long run [5], which complicates performance evaluation in university-industry collaboration.
A variety of actors take part in establishing relations across stakeholders to increase university-industry collaboration, including the Technology Transfer Offices. As another stakeholder in the ecosystem, the EGEVASYON Center has been planned as an industry-driven interface to foster innovation and improve the degree of collaboration between researchers and the industry in Turkey. The principal characteristic of the center is its industry-driven orientation, which aims to attract academic researchers to R&D projects conducted in the industry. Thereby, a priority for the center is to increase the degree of cooperation between the industry and researchers in developing solutions to the challenges in industrial projects, while facilitating the commercialization of solutions proposed by academics. Additionally, the EGEVASYON project is planned to host a web portal that aims to establish an interface among the stakeholders. When launched, researchers and businesses will be able to sign up and exhibit their expertise by filling out details in their profile, such as projects conducted or training programs attended. Moreover, the users will be able to search and browse other public profiles, contact other users, and announce calls for collaboration.
A feature of the web portal for researchers is to facilitate finding projects relevant to their expertise and background. Such a function is also beneficial for the industry, since R&D projects might attract attention from researchers among portal visitors who could provide valuable contributions. Such a feature might be described as a match-making process that uncovers mutually beneficial opportunities for cooperation. As preliminary work towards this functionality, this study adopts a method to match relevant projects with researchers represented as a list of keywords. The problem might be defined as follows: "Among a list of projects defined by a list of tokens, what is the most relevant one given a list of particular keywords (tokens)?" This study follows a similarity-based approach to determine which item is more relevant to a given set of terms describing researchers and research items/projects. The following section introduces the method employed for measuring similarity. Subsequently, the third section demonstrates which projects are more similar to each other, and provides an example where a set of keywords is matched to the most relevant project using the similarity measure. Finally, the potential benefits of this approach are discussed, along with its limitations.

Materials and Methods
As noted, our study addresses a requirement in a university-industry collaboration web portal project and deals with the problem in a similarity-based model. In particular, measuring the similarity between researchers and research items helps to generate recommendations and provide opportunities for collaboration. The "matchmaking" described in our context corresponds to the problem of picking the most relevant research items for a researcher, or vice versa.
Term-based similarity measures in text mining include the Euclidean distance, cosine similarity, Manhattan distance, Dice's coefficient, Jaccard similarity, and the simple matching coefficient [6]. Jaccard and cosine measures incorporate different approaches in measuring similarity [7], yet both provide similar results when employed [8]. Our case requires the calculation of similarity among keywords in a corpus and might be dealt with as a text-mining problem.
A collaborative filtering approach in recommender systems often computes the similarity of user ratings with measures/methods including the Jaccard coefficient, cosine similarity, Pearson's correlation coefficient, mean square distance, factored item-based similarity, and weighted regularized matrix factorization [9]. In data mining, grouping similar data points is an essential step for clustering that requires the use of similarity measures [10] (p. 116). Among the measures listed above, the Jaccard similarity is often used in document clustering for calculating similarity with respect to the words included [11]. Moreover, text documents represented as document matrices can be classified using the k-nearest neighbors algorithm, where the occurrences of words can be compared using the Jaccard coefficient [12] (p. 107).
The next subsection presents a brief overview of this similarity measure. Moreover, our study involves mapping project descriptions into a global list of tokens before the analysis. Accordingly, data collection, tokenization, and preprocessing will also be detailed separately in subsections before similarity calculation.

Jaccard Similarity Coefficient
The Jaccard coefficient was introduced in [13] and has been extensively used to measure the similarity of sets. The coefficient is often used as a similarity measure for sparse binary data sets since it takes into account the common and the distinct elements of two sets [14] (p. 76).
The Jaccard coefficient is used in a wide range of applications. The coefficient is occasionally used to measure the similarity of text, which corresponds to quantitative, multidimensional data when handled as a bag of words [14] (p. 75). In addition, the measure might be used in social network analyses [15] to find out the similarity of authors in terms of focal research areas, through a comparison of keywords in publications. Moreover, the measure was used to group machines regarding their components in cellular manufacturing [16]. Another notable use of the Jaccard coefficient was to evaluate the lesion detection performance of a computer-aided diagnosis system, in comparison with the manual detections from radiologists [17]. Furthermore, a study by Lu et al. [18] improved the scalability of a news recommendation system by integrating the coefficient into k-means clustering when measuring the similarity (distance) between users.
Given its frequent use for similar tasks in prior research and considering its ease of implementation, the Jaccard coefficient was found suitable to calculate project-researcher similarity in this preliminary study. The formal definition of the Jaccard similarity measure (or Jaccard's coefficient) is given below. For sets $A$ and $B$ with binary representations $a$ and $b$, Jaccard's coefficient is a symmetric overlap measure [19] which equals:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (1)$$

The formula in (1) might also be generalized for multiple sets as in the following:

$$J(A_1, \ldots, A_n) = \frac{|A_1 \cap A_2 \cap \cdots \cap A_n|}{|A_1 \cup A_2 \cup \cdots \cup A_n|}$$

Calculating the Jaccard index for disjoint sets results in 0 since the count of common elements is zero. In contrast, the calculation of this measure for two sets involving the same members would result in 1. The Jaccard coefficient for any two sets is within the range [0, 1]. A higher score indicates a large number of common items in comparison with the total count of elements in the sets. Accordingly, high scores for the Jaccard coefficient signify a high degree of similarity. However, specifying thresholds might depend on the problem domain.
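The properties stated above (0 for disjoint sets, 1 for identical sets, values within [0, 1]) can be checked with a minimal Python sketch. The function names `jaccard` and `jaccard_multi` are illustrative and do not come from the study; treating two empty sets as having similarity 0 is an assumption made here to avoid division by zero.

```python
from typing import Iterable, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = a | b
    if not union:
        # Assumption: two empty sets are given similarity 0.0.
        return 0.0
    return len(a & b) / len(union)


def jaccard_multi(sets: Iterable[Set[str]]) -> float:
    """Generalized form: elements common to all sets over elements in any set."""
    sets = list(sets)
    if not sets:
        return 0.0
    union = set().union(*sets)
    if not union:
        return 0.0
    common = set.intersection(*sets)
    return len(common) / len(union)


print(jaccard({"a", "b"}, {"c", "d"}))            # disjoint sets -> 0.0
print(jaccard({"a", "b"}, {"a", "b"}))            # identical sets -> 1.0
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 2 common of 4 distinct -> 0.5
```

Because the measure depends only on set membership, it is symmetric: swapping the two arguments does not change the score.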
Prior research involves various types of problems where the coefficient has been in use. The Jaccard coefficient is applicable to measure the similarity of words in search engines by comparing letters to handle mistyped phrases [20]. Park and Kim [21] utilized the measure for comparing keywords obtained from travelers with those already present in websites. Accordingly, the authors calculated dissimilarities between both lists and utilized the differences to improve recommendations. Additionally, Yu et al. [22] proposed a novel method to generate recommendations in both a content-based and collaborative-filtering-based approach, where the Jaccard distance was utilized to compare recommendations represented by sets including both items and target users. The measure has also been adopted by Fletcher and Islam [23] to measure the similarity of patterns obtained through data mining. Moreover, the authors demonstrated how the interpretability and usability of association rules might be improved through this approach.
The Jaccard coefficient might also be used in data mining when measuring the similarity of transactions in market basket analysis, since such data involves Boolean attributes [24]. Furthermore, Singh et al. [25] proposed the similarity calculation for encrypted data points using Jaccard distance and demonstrated the use of the measure in privacy-preserving data mining.
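For Boolean data such as market baskets, the coefficient can be computed directly on 0/1 vectors. The sketch below is illustrative only: the baskets are hypothetical, and the simple matching coefficient is included merely to show that the Jaccard coefficient ignores positions where both vectors are 0, which is why it suits sparse binary data.

```python
from typing import Sequence


def jaccard_binary(x: Sequence[int], y: Sequence[int]) -> float:
    """Jaccard coefficient for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    both = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    either = sum(1 for p, q in zip(x, y) if p == 1 or q == 1)
    return both / either if either else 0.0


def simple_matching(x: Sequence[int], y: Sequence[int]) -> float:
    """Simple matching coefficient: share of positions with equal values."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    matches = sum(1 for p, q in zip(x, y) if p == q)
    return matches / len(x) if len(x) else 0.0


# Hypothetical baskets over eight catalogue items (1 = item purchased).
basket_1 = [1, 1, 0, 0, 0, 0, 0, 0]
basket_2 = [1, 0, 1, 0, 0, 0, 0, 0]

similarity = jaccard_binary(basket_1, basket_2)   # 1 shared of 3 purchased items
distance = 1.0 - similarity                       # Jaccard distance
print(similarity, distance)
print(simple_matching(basket_1, basket_2))        # 6 of 8 positions agree -> 0.75
```

The five items that neither basket contains raise the simple matching score but leave the Jaccard score unchanged.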

Dataset
Our analysis utilizes a list of Horizon 2020 projects included in a dataset [26] publicly shared over the Kaggle platform. Although there is a topic-based classification of projects, such categorization was made based on the predefined programs related to the call. Moreover, the dataset did not include a selection of keywords that describe the projects. Even so, a variety of attributes were present in the dataset. Despite the availability of attributes that describe projects in various aspects, virtually none of them summarizes what the project is about since there were no keywords provided. On the other hand, project details are covered extensively in the objective attribute, and partially in the title. Among 30,084 research projects in the dataset, the average length of values for the "objective" attribute is 1832 characters, while the same average is 80 for project titles. With such consideration, the objective statements were selected as the primary source to generate tokens.

Obtaining Tokens
The tokens in objective statements were generated in RStudio using the "tokenizers" package [27]. The R script executed in this step is below:

(1) > library(tokenizers)
(2) > filename <- "d:/projects.txt"
(3) > my_data <- readChar(filename, file.info(filename)$size)
(4) > words <- tokenize_words(my_data)

In our dataset, the script generated 2132 unique tokens. Lines 8-11 simply write the list into a text file. The next phase in our methodology involved the preprocessing of data before the analysis.
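For readers who do not use R, the tokenization step can be approximated in Python. This is a rough stand-in and not the behavior of the "tokenizers" package: the regular expression below is an assumption that keeps only runs of ASCII letters, so digits are dropped and hyphenated words are split. The function names and the sample sentence are hypothetical.

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    """Lowercase the text and return its alphabetic word tokens in order."""
    return re.findall(r"[a-z]+", text.lower())


def unique_tokens(text: str) -> List[str]:
    """Return the distinct tokens of the text in alphabetical order."""
    return sorted(set(simple_tokenize(text)))


# Hypothetical objective statement, used only for illustration.
sample = "The project will foster ageing research; the project sites are in Portugal."
print(unique_tokens(sample))
```

Repeated words such as "the" and "project" appear once in the result, which mirrors the study's use of unique tokens.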

Data Preprocessing
The list of tokens obtained was manually inspected to eliminate common words such as "a", "will", "be", "an", and "with". Moreover, a variety of irrelevant words that did not specify a particular term were also excluded. Additionally, plural forms of tokens were excluded in this phase. Initially, the dataset included 2110 tokens. The eliminated data involved 456 common or irrelevant words and 149 plural words. As a result, 1505 tokens remained in our dataset. The list of projects and the count of tokens are given in Table 1. At first glance, an average project description includes a high number of characters (1802 chars). However, the elimination of redundant words has led to lower token counts. Also, the number of tokens listed ignores repeated tokens in the text. Thereby, the projects listed in Table 1 have an average of 16.34 distinct tokens. Table 2 reports the most frequent tokens in project descriptions.

Table 2. Top 50 of most frequent tokens in project descriptions.

| Token | Frequency | Token | Frequency | Token | Frequency |
|---|---|---|---|---|---|
| disease | 13 | sector | 7 | initiative | 4 |
| stakeholder | 11 | data | 6 | observation | 4 |
| process | 10 | infrastructure | 6 | structure | 4 |
| system | 10 | medicine | 6 | academia | 3 |
| team | 10 | patient | 6 | action | 3 |
| material | 9 | programme | 6 | cell | 3 |
| mechanism | 9 | europe | 5 | chair | 3 |
| application | 8 | expert | 5 | collaboration | 3 |
| environment | 8 | innovation | 5 | communication | 3 |
| institution | 8 | institute | 5 | community | 3 |
| leader | 8 | management | 5 | competitive | 3 |
| society | 8 | network | 5 | complex | 3 |
| excellence | 7 | partnership | 5 | deep | 3 |
| goal | 7 | region | 5 | ecosystem | 3 |
| group | 7 | relation | 5 | effort | 3 |
| position | 7 | economic | 4 | gene | 3 |
| disease | 13 | industry | 4 | | |

An SQLite database has been created to store the entities (projects and tokens) and the relations between projects and tokens.
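The preprocessing described in this subsection was carried out by manual inspection. The following Python sketch only illustrates the two kinds of filtering involved. The exclusion set is hypothetical and far shorter than the 456 words removed in the study, and the plural rule is a naive assumption that handles regular "-s" plurals only. The last lines re-check the reported token counts.

```python
from typing import List, Set

# Hypothetical exclusion list; the study compiled its own by manual inspection.
COMMON_WORDS: Set[str] = {"a", "will", "be", "an", "with"}


def drop_common(tokens: List[str], common: Set[str]) -> List[str]:
    """Remove tokens that appear in the exclusion list."""
    return [t for t in tokens if t not in common]


def drop_plurals(tokens: List[str]) -> List[str]:
    """Drop a token ending in 's' when its singular form is also present.

    Naive heuristic (assumption): irregular plurals are not detected.
    """
    vocabulary = set(tokens)
    return [t for t in tokens if not (t.endswith("s") and t[:-1] in vocabulary)]


raw = ["a", "disease", "diseases", "will", "patient", "patients", "process", "be"]
cleaned = drop_plurals(drop_common(raw, COMMON_WORDS))
print(cleaned)  # ['disease', 'patient', 'process']

# Token counts reported in the text: 2110 initial, 456 common/irrelevant, 149 plural.
remaining = 2110 - 456 - 149
print(remaining)  # 1505
```

Note that "process" is kept even though it ends in "s", because "proces" is not in the vocabulary. A rule of this kind would still need human review, in line with the manual inspection reported above.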

Results
The calculation of the Jaccard similarity coefficient essentially requires a comparison of two sets of tokens in our case. In this way, seeking the most relevant project for a set of keywords requires calculating the measure using Formula (1) iteratively.
Our first scenario for recommendation selection assumes the availability of projects before the membership of researchers. The motive for such an inquiry is to find similar R&D projects and to provide opportunities for cooperation across organizations. Although this objective seems to be unrelated to industry-university collaboration, facilitating cooperation within an industry is a positive action that complies with the industry-driven aspect of the center. In this case, the most similar projects might be found through pairwise comparison based on the Jaccard coefficient. As listed in Table 3, Project #29 and Project #35 had six common items in a total of 33 tokens. Accordingly, the most similar project pair had an 18.18% similarity.
The second type of similarity is between researchers and projects in our case study. Considering a project description filled in by a researcher, or a set of terms obtained from a researcher's profile, it is possible to recommend ongoing industry projects by performing simple set calculations. Accordingly, a number of projects having high similarity scores might be selected for recommendation to an individual.
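A pairwise comparison of this kind can be expressed as an SQL query with joins. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table name `project_token`, its columns, and the three toy projects are hypothetical, since the study's schema and query are not given in the text. The union size is derived as the sum of both set sizes minus the shared count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical relation table: one row per (project, distinct token).
conn.execute(
    "CREATE TABLE project_token ("
    " project_id INTEGER NOT NULL,"
    " token TEXT NOT NULL,"
    " PRIMARY KEY (project_id, token))"
)
toy_rows = [
    (1, "disease"), (1, "patient"), (1, "medicine"), (1, "gene"),
    (2, "disease"), (2, "patient"), (2, "cell"),
    (3, "sector"), (3, "industry"), (3, "patient"),
]
conn.executemany("INSERT INTO project_token VALUES (?, ?)", toy_rows)

QUERY = """
WITH sizes AS (
    SELECT project_id, COUNT(*) AS n
    FROM project_token
    GROUP BY project_id
),
common AS (
    SELECT a.project_id AS p1, b.project_id AS p2, COUNT(*) AS shared
    FROM project_token AS a
    JOIN project_token AS b
      ON a.token = b.token AND a.project_id < b.project_id
    GROUP BY a.project_id, b.project_id
)
SELECT c.p1, c.p2, c.shared,
       s1.n + s2.n - c.shared AS total,
       1.0 * c.shared / (s1.n + s2.n - c.shared) AS jaccard
FROM common AS c
JOIN sizes AS s1 ON s1.project_id = c.p1
JOIN sizes AS s2 ON s2.project_id = c.p2
ORDER BY jaccard DESC
"""
pairs = conn.execute(QUERY).fetchall()
for p1, p2, shared, total, score in pairs:
    print(f"Project #{p1} vs Project #{p2}: {shared}/{total} = {score:.4f}")

# Arithmetic check for the pair reported in the text: 6 common of 33 tokens.
print(round(6 / 33 * 100, 2))  # 18.18
```

Pairs without any common token do not appear in the result of this query; their similarity is 0 by definition.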
As an example, the following set of tokens (Si) was assumed to represent the profile of a researcher (i): Si = {academia, activities, ageing, agriculture, application, change, climate, conduct, cooperation, ecosystem, expectancy, group, initiative, member, mia, opportunities, sites, transition}.
Considering this set (Si), Table 4 lists the most relevant projects selected based on Jaccard similarity.
The high similarity scores in the table signify higher degrees of relevancy. Consequently, the most relevant project for the given list of tokens was Project #5, which includes the following tokens: {ageing, ccdrc, disease, excellence, expectancy, gender, independence, intervention, job, living, member, mia, opportunities, portugal, services, sites, stepping, team}. Notably, the six common tokens (ageing, expectancy, member, mia, opportunities, sites) between Project #5 and the exemplary researcher profile result in a similarity score of 0.2: among a total of 30 distinct tokens in both sets, six (20%) were common.
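The score for Project #5 can be reproduced from the two token sets given in the text. In the Python sketch below, the researcher profile and the Project #5 tokens are copied from this section, whereas the two other projects are hypothetical placeholders added only to show how a ranking could be produced. The function names are illustrative.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_projects(profile: Set[str], projects: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Return (project, score) pairs sorted from most to least similar."""
    scores = [(name, jaccard(profile, tokens)) for name, tokens in projects.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


profile = {
    "academia", "activities", "ageing", "agriculture", "application", "change",
    "climate", "conduct", "cooperation", "ecosystem", "expectancy", "group",
    "initiative", "member", "mia", "opportunities", "sites", "transition",
}
projects = {
    "Project #5": {
        "ageing", "ccdrc", "disease", "excellence", "expectancy", "gender",
        "independence", "intervention", "job", "living", "member", "mia",
        "opportunities", "portugal", "services", "sites", "stepping", "team",
    },
    # Hypothetical projects, not taken from the dataset.
    "hypothetical A": {"climate", "change", "agriculture", "soil", "water"},
    "hypothetical B": {"battery", "material", "storage"},
}

common = profile & projects["Project #5"]
print(sorted(common))                          # the six shared tokens
print(len(profile | projects["Project #5"]))   # 30 distinct tokens in total
for name, score in rank_projects(profile, projects):
    print(name, round(score, 4))
```

With 18 tokens in each set and six in common, the union holds 18 + 18 - 6 = 30 tokens, which gives the reported score of 6/30 = 0.2.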

Discussion
Our study demonstrates the use of the Jaccard similarity measure to match a list of projects with a relevant subset of keywords. The motive for that process is to match relevant projects and researchers in a web portal, and thereby uncover opportunities for cooperation that facilitate university-industry collaboration. The portal has been conceptually designed and will be developed as an essential part of the EGEVASYON Project, which has recently been under preparation in coordination with the Aegean Region Chamber of Industry.
Finding similar subsets of keywords has various use-cases in accordance with the requirements of the web portal. In particular, it is required to process a given set of keywords that might describe another project, a researcher's fields of interest, or simply a user-provided query. In all three scenarios mentioned, the problem is analogous to selecting more relevant sets based on an input set. As mentioned in [28], Jaccard's index is among the easiest methods (p. 172) with which to measure similarity. Intuitively, recommendations should originate from the projects with the highest similarity to the visitor's profile.
A limitation of our study is that data could not yet be collected on the web portal, which is to be implemented after the kick-off of the EGEVASYON Project. Instead, a small dataset of EU Horizon Projects has been used for obtaining tokens and measuring similarity. The initial steps required the extraction of tokens and the elimination of redundant words before proceeding to match relevant items. An alternative approach is to start with a predefined list of terms. Even so, the maintenance of the list would still occasionally require human intervention. Another limitation of the study is the use of monolingual project descriptions, all of which are in English. However, a platform for university-industry collaboration might host researchers from various countries and necessitate multilingual statements.

Conclusions
Our study demonstrates the use of a well-known similarity measure in a portal that has been designed as a part of the EGEVASYON project. As the primary interface of the project, the portal will link the researchers with research projects and facilitate industry-university collaboration. Our methodology in recommendation selection utilizes a similarity-based minimization approach that makes use of the Jaccard similarity measure.
The use of the Jaccard similarity measure in our case helped to select relevant projects with an input of 18 tokens, considering project descriptions. Moreover, a pairwise comparison of all projects was easily implemented through an SQL (Structured Query Language) query with a few join operations. However, as mentioned in data preprocessing, human intervention was required to eliminate redundant tokens. Except for this phase, our study demonstrated that using the Jaccard similarity might be a practical idea for matching researchers with projects in the portal. Additionally, the similarity scores might be used to select appropriate recommendations for members, including both businesses and researchers. However, the use of multilingual project descriptions and keywords is beyond the scope of this study.