Towards Potential Content-Based Features Evaluation to Tackle Meaningful Citations

Abstract: The scientific community has presented various citation classification models to refute the concept of purely quantitative citation analysis systems wherein all citations are treated equally. However, only a small number of benchmark datasets exist, which makes asymmetric citation data-driven modeling quite complex. These models classify citations for varying reasons, mostly harnessing metadata and content-based features derived from research papers. Presently, researchers are more inclined toward binary citation classification, with the belief that exploiting datasets of an incomplete nature in the best possible way is adequate to address the issue. We argue that contemporary ML citation classification models overlook essential aspects when selecting the appropriate features, which hinders elutriating the asymmetric citation data. This study presents a novel binary citation classification model exploiting a list of potential natural language processing (NLP) based features. Machine learning classifiers, including SVM, KLR, and RF, are harnessed to classify citations into important and non-important classes. The evaluation is performed using two benchmark data sets containing a corpus of around 953 paper-citation pairs annotated by the citing authors and domain experts. The study outcomes exhibit that the proposed model outperformed contemporary approaches by attaining a precision of 0.88.


Introduction
A scientific study is usually conducted by capitalizing on the earlier research of peers in a domain. It establishes a connection with precursory studies via "citation". A citation serves as an acknowledgment that a document receives from another paper in return for referring to the study [1,2]. Besides this, the citation has a crucial role in forming decisions for multifarious academic policies, such as research grant allocations [3], clustering of publications, peer judgment [4], author ranking [5][6][7][8], and assessing the academic influence of a country [9], in diversified disciplines ranging from machine learning [10] to the Internet of Things (IoT), networking, etc. [11][12][13].
These policies primarily utilize quantitative citation analysis-based measures wherein a mere count of a citation is considered. A high count of a citation is deemed as an indicator to correlate with the prestige of a publication, author, institute, etc. [8].
Each citation reason serves a different purpose and thus carries a different significance, which discourages treating all citations equally. The quantitative citation analysis-based approaches assign equal weight to all citations irrespective of the reason a particular citation has been made [14][15][16][17]. The scientific community argues against harnessing purely quantitative citation analysis-based measures and contends that the reason for a citation must be contemplated [16][17][18]. The majority of researchers sift out misleading citations prior to employing them in the policies mentioned above [17,18]. Back in the late 90s, citing authors were interviewed to provide the reason at the time of publication [19,20]. However, the method did not garner approval, as it involves a complex manual process. After that, researchers proposed different citation classification methodologies that manually scrutinize research papers' content to determine their classes [21,22]. Finney floated the idea that the process may be automated by capturing clues from research papers [23]. Her idea was substantiated by [24] in the form of the first fully automated citation classification technique, which considers cue terms and linguistic features for the classification. However, that study has been criticized due to its overlapping and numerous categories (i.e., 35). Afterward, various other approaches have been presented to classify citations into a varying number of reasons. There has been a continuous dispute regarding the number of citation classes sufficient to serve the objective of refining citation count-based measures [19,[24][25][26].
Moreover, one of the critical issues faced by the citation classification community is incomplete data. Typically, a data corpus of such a nature involves symmetric and asymmetric data. In the context of the scenario considered in this study, there are few benchmark datasets, making asymmetric citation data-driven modeling quite complex. The missing parts of data sets pose a challenge in exploiting the available incomplete data in the best possible way so that accurate information may be ascertained with significant accuracy. Presently, the scientific community is more inclined towards reducing the number of categories to two (i.e., important and incidental) to tackle only meaningful reasons by appropriately exploiting the contemporary incomplete data sets of unstructured or semi-structured nature to discover the hidden knowledge pertaining to the accurate class of a citation. This idea has been implemented in binary citation classification, wherein citations are classified into (1) important and (2) incidental categories [2,10,14,16,17]. Similar to the aforementioned studies, we consider that classifying citations into these two categories would play an immense role in finding meaningful citations. Valenzuela et al. are at the vanguard of classifying citations into important and incidental categories, using metadata and content-based features with SVM and Random Forest classifiers [16]. A question then arises: which citation reasons should be considered important or incidental? The existing classic binary citation classification considers those citations important which inspire the citing study to use or extend the cited work, whereas incidental citations contribute to the citing work in terms of explaining the background theme of a study [2,14,16,17,27].
Besides refining the quantitative citation analysis-based approaches, the binary citation classification can also help to find highly relevant research material for researchers. For instance, consider the following scenario: researchers pursuing research degrees in different disciplines, pose a query on the web to find closely relevant research documents against the research topic being focused on. Web sources return millions of records exhibiting them as relevant papers wherein only a few are actually relevant. On the other hand, if citations of the papers related to the focused topic are considered and further classified into important and incidental, then there is a high probability that the user will have a maximum number of relevant documents, unlike the existing web sources.
The contemporary classification studies exploit different features relating to the metadata or content of research papers [18,[24][25][26]. The content-based features are dominant among others due to being rich in meaningfulness. However, critical analysis of contemporary approaches reveals the need to incorporate some important aspects while selecting appropriate features. This study proposes a comprehensive methodology that exploits a list of novel content-based features to classify citations into important and incidental classes. The features include section-wise citation count, citation sentences, and content similarity, both overall and between the Introduction, Methods, Results, and Discussion (IMRaD) sections of research papers. Another contribution of this study is to assess the potential of different parts-of-speech (PoS) terms appearing in citation sentences and in the IMRaD sections of research papers. Two benchmark datasets have been employed to evaluate the proposed study. The binary citation classification is performed using support vector machine (SVM), random forest (RF), and kernel logistic regression (KLR) classifiers. The outcomes revealed that the proposed approach outperformed existing studies [2,16,17] by achieving precisions of 0.88 and 0.80 for Valenzuela's and Qayyum's data sets, respectively.
The rest of the paper is organized as follows: Section II presents related work; Section III deals with the proposed methodology. Finally, the study outcomes are presented in Section IV, and Section V concludes the paper.

Literature Review
The main idea of discovering citation reasons was presented by [15]. The author identified fifteen reasons for citations. This study opened new dimensions of research towards finding other possible reasons. Subsequently, [21] presented thirteen other reasons for citations. These reasons were identified by analyzing 66 articles from multiple disciplines. The specified reasons pulled the scientific community towards critical scrutiny of purely citation count-based approaches. Since then, various studies have contested the assumption that all citations are equally important. In 1975, Moravcsik and Murugesan presented the first manual technique by classifying citations into four categories [28].
Nanba and Okumura [29] classified citations into three types: (a) Type B, which states relevance in terms of explaining methods and theories of other studies; (b) Type C, which states relevance in terms of comparing related works or finding existing issues; and (c) Type O, which contains all those relations that do not fall in Types B and C. All of the schemes mentioned above classified citations by applying manual methodologies. The inclination towards automatic citation classification increased after the idea presented by Finney [23], who automatically classified citations into five categories by employing cue phrases. Subsequently, Garzone and Mercer [24] proposed the first complete automatic citation classification scheme. Their system takes different articles as input and produces the corresponding citation function as output. They presented 35 classes for citations, which were merged into ten categories. For classification, almost 200 linguistic rules were employed.
In 2003, Pham and Hoffman [24] harnessed cue phrases to develop a rule-based knowledge system that classifies citations into four categories. Teufel et al. [19] presented a citation classification model that segregates citations into 12 classes, generalized into four types. This scheme adopted rules from Spiegel's method [21]. That study was the first to utilize a machine learning algorithm for citation classification. The scheme attained an F-measure of 0.71 and concluded that the neutral category holds around 65% of the citations. Pride et al. [30] classified citations using the features of [16] by changing the model's configuration settings. The study was evaluated on the set of 465 paper-citation pairs collected by [16] and yielded a precision of 0.69. Another study, proposed by Tandon et al. [31], harnessed citation context from research articles to automatically produce their summaries. In this scheme, citations are classified into five categories. A language model approach was employed for this classification, in which language models were developed for all five classes. The language model stipulating the optimal probability of generating a certain citation context was deemed for the classification. The training set was formed using 500 citation contexts extracted from Microsoft Academic. The model resulted in 0.68 precision.
A binary citation classification model was presented by Valenzuela et al. [16] that classifies citations into two classes, i.e., important and incidental. In this scheme, a dataset comprising 465 (citing, cited) pairs was collected, and two domain experts performed the annotation of the pairs. This was the first work in which citations were segregated into two classes. They proposed a novel machine learning model to classify citations into the binary categories using twelve features. The system achieved its best single-feature result with the in-text citation count, obtaining an F-measure of 0.37.
Furthermore, while considering all twelve features, the system achieved 0.65 precision and 0.90 recall. In the same year, Zhu et al. [17] presented a binary (influential and non-influential) citation classification model. The authors used five types of features and found that the "in-text citation count" feature outperformed the others.
Another study, proposed by [27], performs binary citation classification by combining features of four state-of-the-art approaches [16,18]. The study reported 29 top-scored features with a precision of 0.89 for the data set of 465 pairs collected by Valenzuela et al. In another study, the authors of [14] classified citations into important and incidental categories using metadata-based features. This study was evaluated on the same two data sets employed in our proposed study and reported a precision of 0.72 attained using an RF classifier. Likewise, the study by [2] classified citations into important and incidental categories using features such as similarity score, IMRaD-based features, and overall citation count. Similar to our proposed technique, the study [14] uses two data sets: (1) Valenzuela et al.'s data set, which comprises 465 pairs, and (2) Qayyum et al.'s data set, which contains 488 pairs. The study [2] recently presented a binary citation classification model that uses KLR, SVM, and RF classifiers and formed features by computing sentiment analysis of in-text citations. That study used the same benchmark datasets as [14] and reported F-measures of 0.83 and 0.67 for the two datasets, respectively. Our proposed research presents a binary citation classification technique that primarily focuses on introducing a list of novel potential features that have not been given attention by the approaches stated above.

Methods
This section encompasses details about the systematic steps to classify citations into (1) important and (2) incidental classes. A detailed architecture of the proposed system is shown in Figure 1. As explained earlier, this study primarily focuses on discovering the best features from the content of a research paper to maximize the accuracy of binary citation classification. We devise a comprehensive methodology that exploits a list of potential content-based features for binary classification, as explained in the following sections. Two comprehensive data sets from [16] and [14] have been employed, and a list of potential features is extracted from them. These features are then pre-processed to prepare them for the experimentation phase. After that, N-gram, PoS tagging, and semantic-based methods are applied to the features for their score calculation. A detailed explanation of all the applied methods is given below.

Benchmark Dataset
Appropriate data plays a significant role in revealing various facts. Considering this aspect, we have employed two data sets that can help evaluate the proposed features in classifying citations into important and incidental categories.

Valenzuela's Dataset
The first data set was collected by Valenzuela et al. [16]. The authors acquired annotations of citations as important or incidental for paper-citation pairs taken from a collection of 20,527 papers published in the domain of Information Sciences in the ACL anthology. These papers contain around 106,509 citations. Valenzuela et al. [16] formed 465 paper-citation pairs and had them annotated as important or incidental by two domain experts. As this is the only freely available data set of the required nature, we have chosen it to apply the proposed methodology. Among the pairs of this data set, 14.6% are important and the remaining 85.4% are incidental.

Qayyum and Afzal's Dataset
Conclusions drawn from a single data set might not be adequate to assess overall results in the given scenario. For instance, 465 pairs are few in number, and only 14.6% of them are important citations. Therefore, another data set was formed by considering faculty members of the Capital University of Science and Technology as citing authors, forming 488 paper-citation pairs. The data was collected in one of our earlier studies [14]. These papers were published in the domains of Databases, Information Science, Software Testing, and Networks. The pairs were labeled as important or incidental by the citing authors themselves. The annotation yielded 18.4% of pairs in the important category.

PDF to Text Conversion
The authors of [16] only provided paper IDs of the annotated pairs published in the ACL anthology. We tracked those papers through their IDs and downloaded them. For the dataset by Qayyum et al. [14], we already had the portable document format (PDF) files, as those were required to provide relevant materials to annotators to recall the citing reason. Since PDF files are hard to process and we require the text of the papers to apply the proposed methodology, the PDF files were converted into extensible markup language (XML) using the PDFX tool. We extracted the required text using a script prepared for this purpose in Python.
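The extraction step can be sketched as follows; the XML sample and element names below are illustrative stand-ins, since the exact PDFX output schema is not reproduced here.

```python
# Minimal sketch of pulling plain text out of a PDFX-style XML file.
# The sample document below is hypothetical, not actual PDFX output.
import xml.etree.ElementTree as ET

def extract_text(xml_string):
    """Concatenate all non-empty text nodes of the XML document."""
    root = ET.fromstring(xml_string)
    return " ".join(t.strip() for t in root.itertext() if t.strip())

sample = ("<article><section><h1>Introduction</h1>"
          "<p>A citation links papers.</p></section></article>")
print(extract_text(sample))  # Introduction A citation links papers.
```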

Features
As explained earlier, the proposed study primarily focuses on identifying potential features that have an essential role in discovering important citations. The features are extracted from the plain text of the pairs. The list of extracted features is shown in Table 1.

No.  Description
1    Section-wise citation count
2    Citation context: bigram terms
3    Presence of noun in citation context
4    Presence of adjective in citation context
5    Presence of adverb in citation context
6    Presence of verb in citation context
7    Section-wise similarity
8    Section-wise existence of noun
9    Section-wise existence of adjective
10   Section-wise appearance of verb
11   Section-wise appearance of adverb

Citation Count
A citation serves as a helpful measure in the decision-making of academic policies such as researcher or institution ranking, fund allocation, finding cognoscenti in a domain, etc. A research paper typically contains an Abstract, Introduction, Related Work, Methodology, Results, and Conclusion. In this study, we analyze the potential of citation counts appearing in these different logical sections. This has been carried out based on the following assumptions:
• Introduction and related work/literature review sections contain a comparatively higher number of citations [32]. We believe that these sections present a brief overview of the background knowledge of the topic or explanation of the terminologies in the domain. Hence, an author cites those studies that can connect with the proposed research in terms of background knowledge (i.e., incidental citations).
• Methodology and results sections delineate information on the proposed methodology; therefore, they are highly likely to contain in-text citations of those papers that might have been extended or adopted by the proposed study (i.e., important citations).
Based on the assumptions stated above, this study exploits the existence of in-text citations in the IMRaD logical sections, Introduction, Methods, Results, and Discussion, using Equations (1), (2), (3), and (4), respectively. Each section-wise count is divided by the total count of in-text citations in the paper.
The following is the description of the formulas: Let Sections = {I, M, R, D, F} where I represents "Introduction", M represents "Methodology", R represents "Results," D represents "Discussion," and F represents "Full-content".

Consider the records shown in Figure 2 from D1. Each row represents a pair; thus, as per the Figure, there are 12 pairs (465 in actuality for D1). Let i index the citing papers (shown in column A), i = {1, 2, 3, . . . , n}, where n is the total number of citing papers, and let j index the cited papers (shown in column B), j = {1, 2, 3, . . . , m_i}. Since the number of cited papers differs for each citing paper, m_i denotes the total number of papers cited by the ith citing paper. In the context of Equations (1)-(4), let S be the citing paper and C be the cited paper, so that S_iC_j represents the ith citing paper and its corresponding jth cited paper. Equation (1) computes the number of times the cited paper appears in the "Introduction" section of the citing paper: its numerator is the count of in-text citations of the jth cited paper in the introduction section of the ith citing paper, and its denominator is the total number of in-text citations of the jth cited paper in the ith citing paper.
Let us assume a pair (A, B), where A is the citing paper and B is the cited paper. Since "A" has cited "B" in its references, "B" must have been cited within the body of paper "A"; these occurrences are termed "in-text citations". Suppose that "B" appears 8 times in the overall body of the paper and 4 times in the "Introduction" section of the citing paper; the formula in Equation (1) is then computed as the ratio 4/8. The same procedure is followed for the remaining sections: Equations (2)-(4) compute the citation count of the jth cited paper in the ith citing paper in the "Methods", "Results", and "Discussion" sections, respectively.
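Based on the description above, Equations (1)-(4) can be reconstructed in the following general form; the shorthand count_X is ours, not necessarily the paper's original notation.

```latex
% Reconstructed sketch of Equations (1)-(4): count_X(C_j, S_i) is the number
% of in-text citations of cited paper C_j in section X of citing paper S_i,
% and count_F(C_j, S_i) is the count over the full content.
CC_X(S_i C_j) \;=\; \frac{\mathrm{count}_X(C_j, S_i)}{\mathrm{count}_F(C_j, S_i)},
\qquad X \in \{I, M, R, D\}
```

For the worked example above, count_I = 4 and count_F = 8, giving an introduction-section score of 0.5.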

Citation Context
A sentence containing an in-text citation is known as a citation sentence [16]. While citing a study in text, authors mention a description that can provide a clue regarding the purpose of a citation. The description comprises words that can provide a vital indication of the reason for a citation. Consider the following two sentences as an example:
Sentence 1: "this study further investigates the problem addressed by [5]"
Sentence 2: "the study [6] also explains this theory"
The terms used in the first sentence, such as "further", "investigates", and "problem", hint that this citation belongs to the "important" category. On the contrary, terms appearing in sentence 2, such as "explains" and "theory", provide a clue that this citation is from the "incidental" category.
In this study, we have extracted such terms from citation sentences in two dimensions: (1) unigram and bigram terms and (2) PoS, including noun, verb, adjective, and adverb. The terms have been extracted from the 70% of pairs used for training. The terms are maintained in a lexicon, verified by a domain expert from Computer Science who has a strong command of the English language and can differentiate terms of the important and incidental categories. The following are the steps performed to extract the terms.
A. Pre-processing: This step is mandatory in any scenario wherein text is to be processed. The pre-processing phase removes noise and redundant information from the data. In this study, the stop words were removed, and the terms were converted into root terms via stemming.
The detail is given below.
• Stop Words Removal: Some English words fail to provide any clue regarding relevance to a particular class. These words, which include "is", "are", "am", "the", etc., are known as stop words. We have removed the stop words from the extracted citation sentences using the Onix Text Retrieval Toolkit.
• Stemming: Stemming converts terms into their base form so that there is no need to keep a separate record for semantically similar terms. We have used the Porter stemmer algorithm [33] to stem the terms of citation sentences. For example, stemming converts terms such as "computing", "computer", and "computes" into "comput".
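The two pre-processing steps can be sketched as below. The stop-word set is a small illustrative subset (the study uses the Onix Text Retrieval Toolkit list), and a toy suffix stripper stands in for the actual Porter stemmer [33].

```python
# Sketch of the pre-processing phase: stop-word removal followed by stemming.
STOP_WORDS = {"is", "are", "am", "the", "this", "by"}  # illustrative subset

def toy_stem(word):
    """Crude suffix stripping as a stand-in for the Porter stemmer."""
    for suffix in ("ing", "es", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(sentence):
    """Lowercase, drop stop words, and stem the remaining terms."""
    tokens = [t for t in sentence.lower().split() if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]

print(preprocess("the computing computer computes"))  # ['comput', 'comput', 'comput']
```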
B. Bigram Score: Analyzing a single term might not strongly determine the relevance of a citation in an important pair. Bigram terms have proven more helpful in citation classification systems [14]; therefore, in this study, we form a list of bigram terms extracted from the citation context (of the cited paper) in the citing paper. This is conducted based on the assumption that two consecutive terms can clearly depict their associated class. First of all, the bigram terms appearing in important pairs were extracted using an NLP library in Python. The next step involves preparing a list of all the bigram terms labeled as "important" terms by a domain expert who is an Associate Professor in the field of Computer Science; the terms not labeled as important were excluded from the list. The list was developed from the citation context of the 70% of pairs used for training (from both data sets) and was then tested on the remaining 30% of pairs using the algorithm below. Any bigram term of a test pair that matches a term from the list is assigned a weight of 1, and the value of 1 adds up for each matched bigram term. In simpler words, for a given score type, e.g., bigrams, and a given ML classifier, the scores for each citing/cited pair in the training set, together with the expert-based binary classification, are provided. The classifier trains on these and then predicts the classification for the 30% of citing papers that were held back; the quality is then assessed by comparing to the domain expert's classifications for that 30%. Algorithm 1 computes the bigram score of a pair. It takes a testing pair as input, computes its bigram term score, and returns it. The returned value is kept as the bigram score of the input pair, which is then given to the classifiers for binary classification.

Algorithm 1: Bigram Score Computation of Paper-Citation Pairs
Input:  P_test                          // a testing pair
Output: BTscore(P)                      // bigram-term score of pair P

Extract bigram terms from P_test
Initialization:
    BT_train = {BT_0, BT_1, BT_2, . . . , BT_m}   // bigram-term list annotated by domain experts
    BT(P_test) = {T_0, T_1, T_2, . . . , T_q}     // bigram terms of the testing pair
    BTscore(P) = 0

Loop i = 0 to n                         // iterate over n testing pairs
    Loop j = 0 to m                     // iterate over m bigram terms in the annotated list
        Loop k = 0 to s                 // iterate over s bigram terms of the testing pair
            if BT(P_test_i)(k) == BT_train(j)     // testing-pair bigram matches an annotated bigram
                BTscore(P_i) = BTscore(P_i) + 1
        End Loop
    End Loop
End Loop
return BTscore(P)
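Algorithm 1's matching can be sketched in Python as follows; the annotated bigram list and the citation sentence are hypothetical examples, not drawn from the actual data sets.

```python
# Sketch of the bigram-score computation for one paper-citation pair.
def extract_bigrams(text):
    """Return consecutive word pairs (bigrams) of the text."""
    tokens = text.lower().split()
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

def bigram_score(pair_text, annotated_bigrams):
    """Count bigrams of the testing pair that appear in the
    expert-annotated list of 'important' bigram terms."""
    score = 0
    for bigram in extract_bigrams(pair_text):
        if bigram in annotated_bigrams:
            score += 1  # add 1 per matched bigram, as in Algorithm 1
    return score

# Hypothetical expert-annotated "important" bigrams.
annotated = {("further", "investigates"), ("extends", "the")}
sentence = "this study further investigates the problem addressed by [5]"
print(bigram_score(sentence, annotated))  # 1
```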

PoS Score
Part-of-speech (PoS) tagging is performed to tag each word in a text with its corresponding PoS. To the best of our knowledge, none of the contemporary binary citation classification studies have assessed the potential of PoS in determining important citations. This study exploits the PoS noun, verb, adjective, and adverb appearing in a citation sentence. We believe that the mentioned PoS are sufficient to determine the importance of a citation; therefore, we have discarded all other PoS, such as pronouns, determiners, etc. The idea here is to pick 70% of the pairs, form a separate list for each PoS, i.e., noun, adjective, adverb, and verb, and obtain the lists labeled as "important" from the same domain expert who labeled the bigram terms explained in the above section. For PoS tagging, Stanford CoreNLP (shown in Figure 3) is utilized. Next, the four PoS extracted from a testing pair are matched with the corresponding PoS lists annotated by the domain expert. For instance, the verbs extracted from the citation context of the testing pair (i.e., from both citing and cited paper) are matched with the list of verbs, and the same process is performed for the other three PoS. The PoS found in the remaining 30% of pairs are matched to the list stored separately for each PoS, following the same methodology as adopted for bigram term matching. Algorithm 2 computes the PoS score of the pairs. It accepts a testing pair as input, extracts the four PoS from the citation sentences using Stanford CoreNLP, and prepares a separate list for each PoS. Next, similar lists are picked, i.e., the list of nouns from the testing pair and the list of annotated nouns are considered, and term-by-term matching is performed. On each match, the value of 1 is added to the score of the respective PoS.
The process continues until all four lists from the testing pairs have been term-by-term matched with the respective annotated lists. The algorithm returns the scores for all four PoS, which are given to the classifiers for binary classification.
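The matching step of Algorithm 2 can be sketched as below. The PoS lists are assumed to be already extracted (the study uses Stanford CoreNLP for tagging); only the term-by-term matching against the expert-annotated lists is shown, and all terms are hypothetical.

```python
# Sketch of the PoS-score matching for one testing pair.
def pos_scores(pair_pos, annotated_pos):
    """For each of the four PoS lists of a testing pair, count matches
    against the corresponding expert-annotated list."""
    scores = {}
    for pos in ("noun", "verb", "adjective", "adverb"):
        scores[pos] = sum(1 for term in pair_pos.get(pos, [])
                          if term in annotated_pos.get(pos, set()))
    return scores

# Hypothetical pre-extracted PoS terms of a testing pair.
pair = {"noun": ["problem", "theory"], "verb": ["investigates", "explains"],
        "adjective": ["novel"], "adverb": ["further"]}
annotated = {"noun": {"problem"}, "verb": {"investigates"},
             "adjective": set(), "adverb": {"further"}}
print(pos_scores(pair, annotated))  # {'noun': 1, 'verb': 1, 'adjective': 0, 'adverb': 1}
```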

Similarity Computation
In this module, the content similarity between paper-citation pairs is computed in two dimensions: (1) section-wise and (2) overall. The notion of this similarity computation is picked with an assumption that a high similarity count may determine important citation. There may be a probability that a high similarity score appears among certain logical sections of pairs, providing a solid clue regarding the citation class. Based on this assumption, we intend to scrutinize section-wise similarity behavior among pairs. In section-wise similarity computation, IMRaD sections of pairs are assessed based on their similarity score.
The similarity is computed for the combinations of sections mentioned above, using the cosine measure.

A. Cosine Similarity:
Cosine similarity is a metric that measures the similarity between two documents of variable size. It follows the notion that the smaller the angle between the document vectors, the higher the cosine value, which lies between 0 and 1. Equation (5) computes the cosine similarity between two documents:

cos(A, B) = (A · B) / (||A|| × ||B||)    (5)

where A · B = A_1B_1 + A_2B_2 + . . . + A_nB_n is the dot product of the two vectors, and ||A|| and ||B|| are their magnitudes.
In the proposed study, "A" represents the content of the citing paper, and "B" the cited paper's content. It is pertinent to mention here that cosine similarity is computed in five ways: (1) between the full content of the citing and cited paper, (2) between the "Introduction" sections of the citing and cited paper, (3) between the "Methodology" sections, (4) between the "Results" sections, and (5) between the "Discussion" sections.
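Equation (5) can be sketched over simple term-frequency vectors as follows; the term-frequency weighting is our assumption, as the study does not state how the section text is vectorized, and the sample texts are hypothetical.

```python
# Cosine similarity between two texts using term-frequency vectors.
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) | set(b))      # A . B
    norm_a = math.sqrt(sum(v * v for v in a.values()))   # ||A||
    norm_b = math.sqrt(sum(v * v for v in b.values()))   # ||B||
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Identical sections score 1.0; sections sharing no terms score 0.0.
print(round(cosine_similarity("citation analysis model", "citation analysis model"), 2))  # 1.0
```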

Section-Wise Part of Speech (PoS)
Typically, a research paper encompasses different sections, often referred to as IMRaD. In this study, we intend to find a high presence of a particular PoS in the sections mentioned above. This experiment analyzes whether the behavior of a specific PoS in a certain section helps determine the relationship between the pair. For this experiment, we consider four PoS: noun, verb, adverb, and adjective. The content of a section is pre-processed before PoS extraction. In the pre-processing step, all the stop words are removed from the text using the Onix Text Retrieval Toolkit (https://rdrr.io/cran/qdapDictionaries/man/OnixTxtRetToolkitSWL1.html (accessed on June 2021)). After that, PoS extraction is performed using Stanford CoreNLP. Figure 2 shows an example of a paper from Valenzuela's data set illustrating how Stanford CoreNLP labels terms with their corresponding PoS.
The objective of performing this experiment is to discover the patterns of PoS occurrence in important and incidental papers. It should be noted that all subtypes of the four PoS, as labeled by Stanford CoreNLP, have been merged into a single PoS. For instance, all types of nouns, such as proper nouns, abstract nouns, etc., have been combined into the category "noun". The same has been performed for the remaining PoS, i.e., verb, adjective, and adverb. It is pertinent to mention that authors do not strictly follow the same terminology for the logical sections of research papers. For instance, some use "related work", while others use "Literature review" to describe state-of-the-art studies. In this experiment, the papers containing different terminologies for section names were mapped to the particular section of IMRaD by reading the section's content. It should be noted that the purpose of utilizing this feature is to analyze the difference in PoS occurrence in the same sections of important and non-important pairs.
The following explains the notation of the equations used to calculate the PoS scores for the four logical sections of paper-citation pairs: I represents "Introduction", M represents "Methods", R represents "Results", D represents "Discussion", and P denotes a paper-citation "pair".
Let (S_i, C_ij) be a pair wherein S_i represents the i-th source (cited) paper and C_ij denotes the j-th citing paper of S_i, with i = {1, 2, 3, . . . , n} indexing the source papers from 1 to n and j = {1, 2, 3, . . . , m} indexing the citing papers from 1 to m. Thus, as explained earlier, this study determines the section-wise role of the four PoS in important and non-important pairs. Equations (6)-(9) compute the occurrence of "noun", "verb", "adjective", and "adverb", respectively, in the "Introduction" sections of the cited paper S_i and its citing paper C_ij.
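Since Equations (6)-(21) are not reproduced in this excerpt, the sketch below shows one plausible normalized-count formulation of a section-wise PoS score; it is an assumed illustration, not the paper's exact equation, and the `tagger` callable is a hypothetical stand-in for the PoS pipeline:

```python
def pos_section_score(section_citing, section_cited, pos, tagger):
    """Hypothetical section-wise PoS score: the relative frequency of `pos`
    in a section, averaged over the citing and cited paper. This is an
    assumed normalized-count form, not the paper's published equation."""
    def rel_freq(text):
        tags = tagger(text)  # -> list of (token, coarse_pos) pairs
        if not tags:
            return 0.0
        return sum(1 for _, p in tags if p == pos) / len(tags)
    return (rel_freq(section_citing) + rel_freq(section_cited)) / 2
```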
The following Equations (14)-(17) are used to compute the scores of noun, verb, adjective, and adverb, respectively, in the "Results" sections of the cited paper S_i and its citing paper C_ij.
Likewise, Equations (18)-(21) calculate the noun, verb, adjective, and adverb scores, respectively, in the "Discussion" sections of the cited paper S_i and its citing paper C_ij.

Results and Discussion
This section delineates the results achieved by applying the proposed methodology, along with their detailed analysis. Some of the research papers from the data set by Valenzuela et al. [16] were not found in the Association for Computational Linguistics (ACL) anthology; therefore, they have been discarded from the data set. The availability of both data sets is similar, as stated in [14].

Classification
Once the above-listed features have been calculated by applying the proposed methodology, their scores are passed to the machine learning tool Waikato Environment for Knowledge Analysis (WEKA) for classification. We have employed the SVM, RF, and KLR classifiers with 10-fold cross-validation using the WEKA tool. The configuration details of these classifiers are as follows: (1) SVM with a radial basis function (RBF) kernel of degree 2, (2) RF with 10 trees and no limit on maximum depth (a WEKA depth setting of 0), and (3) KLR with degree 2. The classification outcomes are evaluated using the standard measures of recall, precision, and F-measure. These measures were chosen because contemporary studies have evaluated their results with them, which makes it feasible to draw comparisons. The evaluation measures are reported as macro-averages; consequently, the macro F-measure does not necessarily lie between the macro precision and recall.
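To illustrate the macro-averaging point, the sketch below computes macro precision, recall, and F-measure for the two classes (the labels and toy predictions are illustrative, not from the study); note in the example that the macro F-measure can fall outside the interval between macro precision and macro recall:

```python
def macro_scores(y_true, y_pred, classes=("important", "non-important")):
    """Macro-averaged precision, recall, and F-measure over the given classes."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec); recalls.append(rec); f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Because each class's F-measure is computed first and then averaged, the macro F-measure is not the harmonic mean of the macro precision and macro recall.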

Features' Individual Performance
Firstly, we have scrutinized the individual potential of each feature and the best-performing classifier. Figure 4 shows the precision, recall, and F-measure values achieved for all the employed features. It is pertinent to mention that, since our focus is to find the best-performing binary classifier among the applied ones, we have only reported the values of those classifiers for which the highest precision, recall, and F-measure were attained.
The mentioned classifiers are the ones that outperformed the other classifiers used in this study. It can be seen that the highest F-measure value (i.e., 0.71) is achieved by the feature M_M (i.e., the content similarity between the methodology sections of a pair) from the section-wise similarity category, followed by the PoS-based feature noun with an F-measure of 0.63. The lowest F-measure score is 0.42, observed for the feature I_I (i.e., the content similarity between the Introduction sections of a pair). Figure 3 illustrates the outcomes attained using Valenzuela's data set. Figure 5 shows the precision, recall, and F-measure scores achieved by the harnessed features for Qayyum's data set. The highest F-measure value is 0.73, secured by CC_methodology with the random forest classifier. The second-highest-scoring feature is noun from the PoS category, with an F-measure of 0.71 for SVM, and the minimum F-measure of 0.49 is attained by the adverb feature from the PoS category.

Features' Combinations
To assess the collective contribution of features towards binary citation classification, we have formed every possible combination of features, ranging from pairs to the combination of all features. The results achieved by combining all the features are reported in the comparison section. In this section, we report the best combination among all the remaining combinations. Figures 6 and 7 visualize the results of the top-performing combinations for both data sets. For Valenzuela's data set, the combination "Section-wise similarity (M_M) + CC_Methods" scored the highest with an F-measure of 0.73. For Qayyum's data set, the best performance is observed for the combination "CC_Methods + Noun", which attained an F-measure of 0.76. The top-scored combinations for both data sets were classified with the RF classifier. In both data sets, the feature "CC_Methods" is present in the top-scored combination, as shown in Figures 6 and 7, which indicates that considering the count of a citation in the methodology section has a strong influence in determining an important citation.
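Enumerating the feature combinations described above, from pairs up to the full set, can be sketched with the standard library (the feature names below are illustrative placeholders, not the study's full feature list):

```python
from itertools import combinations

def all_feature_combinations(features):
    """Yield every subset of features, from pairs up to the full set."""
    for k in range(2, len(features) + 1):
        yield from combinations(features, k)

feats = ["M_M", "CC_Methods", "Noun", "Adverb"]  # illustrative names
combos = list(all_feature_combinations(feats))
# 4 features yield C(4,2) + C(4,3) + C(4,4) = 6 + 4 + 1 = 11 combinations
```

Each generated subset would then be scored with the classifiers above, and the best-performing subset per data set retained.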

Comparisons
The results achieved by the proposed methodology are compared with three state-of-the-art techniques in binary citation classification [2,14,16].
The reasons for drawing a comparison with these three approaches are as follows:
• Valenzuela et al. [16]: In this study, we have harnessed the data set accumulated by Valenzuela et al., who used content- and metadata-based features.
• Qayyum et al. [14]: This study employed the same data sets and reported binary citation classification results using metadata-based features on both of the data sets employed in our proposed study.
• Nazir et al. [2]: Nazir et al. performed binary citation classification harnessing the same two data sets used in our proposed study.
Since all of these studies have reported overall precision results, we have also drawn the comparisons using the precision score. The following table shows the precision scores achieved by combining all the features. Figure 8 shows that the proposed model achieves the highest precision value compared to existing studies for Valenzuela's data set: a precision of 0.88 is achieved by combining all the features, which is the highest of all the reported precision scores. Similarly, the proposed methodology achieved the highest precision value for Qayyum's data set, as shown in Figure 9. Another essential aspect to be noted here is that the results produced by the RF classifier remained consistent. The studies [2,14] have also reported that the RF classifier performed best in their proposed approaches.
The outcomes of the proposed study have revealed several insights into binary citation classification. Analysis of the performance, from the individual to the collective contribution of the employed features, shows significant potential in tackling important citations. Among the individual features, CC_method and the similarity between M_M sections outperformed the other features. These features were incorporated based on the assumption that the citation count of the cited paper in the methodology section of the citing paper may indicate an "important" relation between pairs, as the methodology section usually contains comparatively few citations, and the ones that appear there usually represent papers that are very close to the citing paper. Similarly, a high cosine similarity between the methodology sections of pairs has also proven quite helpful. This validates our assumption that cited and citing papers mostly use similar terms if they hold an "important" relation. Of these two features, CC_method could be considered more valuable, as it is present in the top-scored combinations of features for both employed data sets, as shown in Figures 6 and 7. Another important finding is the presence of the noun feature from the PoS group. To the best of our knowledge, no existing study has explored the potential of PoS in finding important citations. The outcomes here suggest that the presence of nouns in the citation sentences of important pairs should be given considerable importance.
In this study, we manually formed a list of four PoS from 70% of the pairs because we intended to find which of the four PoS has a stronger presence in the citing sentences of important pairs. The outcomes confirmed that "noun" occurs more frequently than the other three PoS, i.e., verb, adjective, and adverb. In the future, a high count of nouns in citation sentences alone could be deemed a clue for determining important citations. Having attained the highest precision value relative to existing studies, we claim that the identified list of features and the proposed methodology hold strong potential for finding important citations.

Conclusions
There has been a continuous debate in the scientific community regarding filtering out unimportant citations to refine the approaches wherein a mere count of citations is deemed the quintessential measure. Based on this argument, researchers have classified citations according to different citation reasons. Recently, the primary citation reasons have been consolidated into a small number of citation classes to identify only meaningful citations. Most of the schemes have preferred to exploit content-based features due to their diversity and richness. However, to the best of our knowledge, none of the existing studies achieves sufficient accuracy. This paper has presented a comprehensive list of content-based features identified by critically analyzing the current state of the art. The content of paper-citation pairs has been exploited to extract the required features, and the proposed methodology has then been applied to classify citations into important and incidental classes. The classification has been performed using SVM, RF, and KLR. The outcomes yielded precisions of 0.80 and 0.88 for the two data sets. We claim that the proposed methodology has significant potential for tackling important citations.