The Automatic Detection of Dataset Names in Scientific Articles

We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Noticing that available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, roughly half of them containing one or more named datasets. A distinguishing feature of this set is the many sentences using enumerations, conjunctions and ellipses, resulting in long BI+ tag sequences. On all measures, the SciBERT NER tagger performed best and most robustly. Our baseline rule-based tagger performed remarkably well, better than several state-of-the-art methods. The gold standard dataset, with links and offsets from each sentence to the (openly accessible) articles, together with the annotation guidelines and all code used in the experiments, is available on GitHub. Dataset: https://github.com/xjaeh/ner_dataset_recognition


Introduction
This paper contributes to the creation of a dataset citation network: a knowledge graph linking datasets to the scientific articles that use them. Unlike the citation network of papers, the dataset citation infrastructure is still primitive, due to the limited referencing of dataset usage in scientific articles [1][2][3][4]. The use and value of such a dataset citation network are similar to those of the ordinary scientific citation network: giving recognition to dataset providers by computing impact scores of datasets based on citations [2,4], ranking datasets in dataset search engines by impact [1], representing a dataset by its use instead of by its metadata and content [4,5], studying co-occurrences of datasets, etc. According to Kratz and Strasser [2], researchers believe that the citation count is the most valuable way to measure the impact of a dataset.
Creating the dataset citation network from a collection of articles involves three main steps: scientific PDF parsing, recognizing and extracting mentioned datasets, and cross-document coreference resolution ("dataset name de-duplication"). This paper is only concerned with the dataset extraction step, which we view as a Named Entity Recognition (NER) task. Looking at the articles that use a NER method for this task, it becomes clear that almost every one uses a different approach and a different dataset. Not only do these approaches differ, but they also deviate from the core dataset NER task, as every approach has something extra added onto it [3,6-14]. This makes it hard to compare which method or component fits the task best. According to Beltagy et al. [15], SciBERT has shown state-of-the-art results on one of the datasets (SciERC), while other methods have outperformed this score on a similar task and dataset [9,10]. To be able to fully compare the performance and annotation costs of each (basic) model, we train and test all of them on the same dataset. This results in the following research question:

RQ Which Named Entity Recognition model is best suited for the dataset name recognition task in scientific articles, considering both performance and annotation costs?
Comparing the performance of each method in a single run is not sufficient, as [16] showed that annotation choices have a considerable impact on a system's performance. This effect was neglected in the aforementioned papers on the dataset extraction task. These choices can impact not only the models' performance, but also the annotation costs. We consider a number of factors that could influence both. First, domain transfer is considered, as it has been shown to impact NER performance [17-19] and has become a trending topic in NER in the effort to reduce the amount of training data needed [20]. Another factor is the training set size, as multiple sources have shown that a small amount of training data can lead to performance problems [7,21]. Next to the size of the training set, the effect of the distribution of positive and negative samples is considered, as this, too, has been shown to influence performance [22-25]. These choices all influence the amount of training data needed to achieve the best performance, and thus the annotation costs, since adding 'real examples' is costly. To further reduce the annotation costs, we investigate the effect of adding weakly supervised examples to the training data for the best performing model. Summing up, we answer the following questions:

RQ1 What is the performance of rule-based, CRF, BiLSTM, BiLSTM-CRF [26-28], BERT [29] and SciBERT [15] models on the dataset name recognition task?
RQ2 How well do the models perform when tested on a scientific (sub)domain that is not included in the training data?

RQ3 How does the amount of training data impact the models' performance?

RQ4 How are the models' performances affected by the ratio of negative versus positive examples in the training and test data?

RQ5 Does adding weakly supervised examples further improve the scores of the best performing model? Additionally, how well does the best performing model perform without any manually annotated labels?

RQ6 Is there a difference in the performance of NER models when predicting easy (short) or hard (long) dataset mentions?
To answer these questions on realistic input data, we created a hand-annotated dataset of 6000 sentences based on four sets of conferences in the fields of neural machine learning, data mining, information retrieval and computer vision (see Section 3.1). NER can be evaluated in many ways. We mostly use the strictest and most realistic one: the exact match on a zero-shot test set. We note, however, that, due to enumerations and ellipses, many NER hits contain several datasets, which makes the partial and B-match also useful (as the found NER hits have to be post-processed anyway).
Our main findings are that SciBERT performs best, particularly on realistic test data (with >90% of sentences not mentioning a dataset). Surprisingly, our own rule-based system (using POS tags and keywords) performed almost as well, and all others, except BERT, perform (much) worse than this rule-based system. SciBERT was also robust in the other tests performed, regarding domain adaptability, the negative sample ratio and the train set size. However, nothing comes for free: we did not succeed in training SciBERT to outperform the rule-based system when we gave it only weakly supervised training examples (obtained without manual annotation).
All code and datasets used for this paper can be found at https://github.com/xjaeh/ner_dataset_recognition.

Related Work
The overwhelming volume of scientific papers has made extracting knowledge from them manually an unmanageable task [30], making automatic IE especially relevant for this domain [31]. Scientific IE has been of interest since the early 1990s [32]. Despite the growing interest in the automatic extraction of scientific information, research on this topic is still narrow even now [7]. The reason for the limited research in scientific IE, in comparison to the general domain, is the specific set of challenges associated with the scientific domain. The main challenge is the expertise needed to annotate data, making these data costly and hard to obtain, and resulting in very limited available data [7]. There is, however, a significant body of such research in the scientific sub-domains of medicine and biology [30].
Where early scientific IE mainly focused on citations and topic analyses [13], the focus has since become broader and shifted toward scientific fact extraction (for example, population statistics, variants of genomics, material properties, etc.) [30]. Research on the dataset name extraction task uses a great variety of methods from across the NER spectrum, including, but not limited to, rule-based, BiLSTM-CRF and BERT approaches [3,6-14].
For dataset extraction, it was found that verbs surrounding the dataset provide information about its role or function; for instance, the verbs 'use', 'apply' or 'adopt' indicate a 'use' function [10]. However, not only the surrounding verbs play an important role: dataset detection needs a wide range of context [6], indicating that a model's ability to grasp context could significantly affect its performance.
We briefly go through the NER models that we tested for dataset extraction. The rule-based approach was the most prominent one in the early stages of NER [33]. Although most state-of-the-art results are now achieved by machine learning methods, the rule-based model is still attractive due to its transparency [34]. The authors of [34] conclude that rule-based methods can achieve state-of-the-art extraction performance, but note that rule development is a very time-consuming, manual task. This method is not only used as a stand-alone classification method, but is also suitable as a form of weak supervision [35], providing training examples for the other methods as an alternative to the manual labeling of data [36,37].
Conditional Random Fields (CRF) is a probabilistic model for labeling sequential data, which has proven its effectiveness in NER, producing state-of-the-art results around the year 2007 [38]. A dataset extraction model solely based on CRF is missing, but CRF has been used for other tasks in scientific IE. A well-known example is the GROBID parser, which extracts bibliographic data from scientific texts (such as the title, headers, references, etc.) [39].
The BiLSTM-CRF is a hybrid model, combining LSTM layers with a CRF layer on top [40]. This combination joins the advantages of both models: BiLSTM is better at predicting long sequences, predicting every word individually [41], while CRF predicts based on the joint probability of the whole sentence, ensuring that the optimal sequence of tags is found [41-43]. To date, a BiLSTM-CRF based model produces the best performance on the dataset extraction task, with an F1 score of 0.85 [8].
BERT produces state-of-the-art results in a range of NLP tasks [29]. It is based on a transformer network, which is praised for its context-aware word representations, improving prediction ability [44]. BERT has revolutionized classical NLP. However, its performance as a base for the dataset extraction task varies greatly: one study found an F1 score of 0.68 [13], while another found 0.79 [10]. Beltagy et al. [15] developed the SciBERT model based on BERT. The only difference between the two models is the training corpus: unlike BERT, which is trained on general texts, SciBERT is trained on 1.14 M scientific papers from Semantic Scholar, 18% of them computer science papers and the remaining 82% papers from the biomedical domain. This model, specially created for knowledge extraction in the scientific domain, indeed achieves better performance than BERT in the computer science domain.

Description of the Data
We describe the manually annotated dataset we created. The annotation guidelines in Appendix B contain many illuminating examples. Here, we simply give two examples (annotated datasets are marked in gray):

• The second collection (called ClueWeb) that we used is ClueWeb09 Category B, a large-scale web collection . . .
• Tables 3 and 4 show the average precision of 20 categories and MAP on the PASCAL VOC 2007 and 2012 testing set, respectively.

Origins
The sentences in the dataset originate from published articles from four different corpora within the computer science domain: the Conference on Neural Information Processing Systems (NIPS, 2000-2019), the SIAM International Conference on Data Mining (SDM, 2000-2019), the ACM SIGIR conference (2007-2018), and papers from the main conferences (ICCV, CVPR) participating in the Computer Vision Foundation (VISION, 2017-2019). These conferences were chosen because they are top-tier A* venues that have existed for an extensive period; all of them, except SIGIR, are freely available; they cover a wide range of topics within the information sciences; and experimental evaluation is a key aspect in these venues. Thus, we expected most articles to contain references to datasets. Papers from these conferences were collected in PDF format and parsed using GROBID to extract the text [39]. Extracting sentences with GROBID made it possible to exclude references, titles and tables, so that only 'real' sentences from the main text were selected. From these sentences, we selected for manual annotation those likely to contain a reference to a dataset: each selected sentence had to contain one of the following phrases (including their plural forms): dataset, data set, database, data base, corpus, treebank, benchmark, test/validation/testing/training data or train/test/validation/testing/training set. The regular expression in Appendix A implements this selection.
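The trigger-phrase filter can be sketched as follows. The pattern below is an illustrative simplification; the exact regular expression used in the paper is the one given in Appendix A.

```python
import re

# Simplified trigger-phrase pattern (illustrative; the exact regex is in Appendix A).
# Covers the listed keywords and their plural forms, case-insensitively.
TRIGGER = re.compile(
    r"\b(data\s?sets?|data\s?bases?|corpus|corpora|treebanks?|benchmarks?|"
    r"(test|validation|testing|training|train)\s(data|sets?))\b",
    re.IGNORECASE,
)

def likely_mentions_dataset(sentence: str) -> bool:
    """Return True if the sentence contains one of the trigger phrases."""
    return TRIGGER.search(sentence) is not None

sentences = [
    "We evaluate on the PASCAL VOC 2007 dataset.",
    "The proof follows by induction on n.",
]
selected = [s for s in sentences if likely_mentions_dataset(s)]
```

Only sentences passing this filter were handed to the annotators; everything else was treated as a (likely) negative.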

Annotation
The annotation scheme used for the annotation task is based on the ACL RD-TEC Annotation Guidelines [45]. One example of a guideline is that generic nouns (e.g., dataset) accompanying a term should be annotated. Another is the 'ellipses rule', which states that when two noun phrases in a conjunction are linked through ellipses, the term needs to be annotated as one; for the dataset name annotation task, this means that the phrase PASCAL VOC 2006 and 2007 datasets is marked as one entity. Annotation was done by four persons, each annotating 1500 sentences plus a shared portion used for the kappa calculation. The resulting Fleiss kappa of 0.72, calculated on fifty sentences, represents substantial agreement [46]. The full annotation scheme is available in Appendix B.

Train and Test Sets
Each annotated sentence was given an ID, tokenized using the spaCy tokenizer [47], and given POS tags using the NLTK package [48]. The gold standard annotations themselves were transformed into the corresponding IOB tags for each token.
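The transformation from gold-standard character offsets to token-level IOB tags can be sketched as follows. For brevity this sketch uses whitespace tokenization; the paper uses the spaCy tokenizer.

```python
def to_iob(sentence: str, spans: list) -> list:
    """Map character-offset entity spans to IOB tags over whitespace tokens.

    `spans` is a list of (start, end) character offsets (end exclusive).
    Whitespace tokenization is a simplification; the paper uses spaCy.
    """
    tags = []
    pos = 0
    for token in sentence.split():
        start = sentence.index(token, pos)  # character offset of this token
        pos = start + len(token)
        tag = "O"
        for s, e in spans:
            if start == s:
                tag = "B"            # token opens an entity
            elif s < start < e:
                tag = "I"            # token continues an entity
        tags.append((token, tag))
    return tags
```

For the first example sentence above, the span covering 'ClueWeb09 Category B' yields the tag sequence O O O B I I O, one long entity as prescribed by the guidelines.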
The entire dataset contains a total of 6000 sentences, evenly distributed between the corpora, with 1500 sentences from each. Slightly under half of them contain a dataset mention. These 6000 sentences were split into a train set, a test set and a zero-shot test set. Sentences in the zero-shot test set do not contain dataset names that occur in the train set. All sets were created using stratified sampling, roughly preserving the equal distribution among the four conferences. The distribution of positive and negative samples in these subsets is shown in Table 1. The number of sentences containing a dataset name is not equal to the number of datasets being named, which is 4164. This amounts to an average of 1.43 dataset mentions per sentence containing at least one, and an average of 0.69 mentions per sentence overall.
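The two reported averages are mutually consistent, as a quick check shows; the implied positive-sentence count below is an inference from the rounded 1.43, not the exact value from Table 1.

```python
total_sentences = 6000
total_mentions = 4164

# 4164 / 6000 = 0.694, reported as 0.69 mentions per sentence overall.
overall_avg = total_mentions / total_sentences

# At 1.43 mentions per positive sentence, roughly 2900 sentences are positive,
# i.e., "slightly under half" of 6000 (the exact count is given in Table 1).
implied_positives = total_mentions / 1.43
```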
We ran the SciBERT tagger over the complete corpus of over 15,000 articles and observed that within the VISION papers, only 5% did not mention a dataset. For the other three conferences, this fraction was remarkably similar, between 20.2% and 22.4%. A full manual scan of 30 random NIPS papers produced a slightly higher fraction: nine papers without any dataset mention.

Experimental Setup
Full details of all experiments, including more detailed measurements, are available in the GitHub repository. All methods were evaluated using seqeval [49] for the B- and I-tags, and nervaluate [50] for the partial- and exact-match scores.
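The exact-match criterion applied by these libraries can be made concrete in a few lines. The following is a minimal reimplementation of the idea for illustration, not the evaluation code used in the paper: a predicted entity counts only if both of its boundaries match a gold entity.

```python
def entities(tags):
    """Extract (start, end) entity spans from a B/I/O tag sequence."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t == "B":                       # a new entity begins
            if start is not None:
                spans.append((start, i))
            start = i
        elif t == "O":                     # any open entity ends
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

def exact_match_f1(gold, pred):
    """Entity-level F1: a prediction counts only if both boundaries match."""
    g, p = set(entities(gold)), set(entities(pred))
    if not g or not p:
        return 0.0
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)
```

Under this criterion, truncating a long BI+ sequence by a single token already turns a hit into a miss, which is why partial-match scores are also reported.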
As a natural baseline, we created a rule-based system containing rules such as "if the word is a proper noun and is followed by one of the keywords, then mark the proper noun including its keyword as a dataset", developed through careful consideration, following the annotation guidelines. As developing a rule-based system takes time [34], rule development was an iterative process of trial and error, each time adding or adjusting rules as deemed necessary. The rules were made machine readable using spaCy's rule-based matching method [51]. This translated to 10 spaCy patterns; see the notebook Rule-based.ipynb in the dataset belonging to this paper.
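The quoted rule can be made concrete as follows. This is a hypothetical illustration of one such rule, not one of the ten actual patterns (those are in Rule-based.ipynb); to keep the example self-contained, a tiny matcher over (word, POS) pairs emulates what spaCy's Matcher would do with the equivalent declarative pattern [{"POS": "PROPN", "OP": "+"}, {"LOWER": {"IN": [...]}}].

```python
# Illustrative keyword list; the actual keywords follow the annotation guidelines.
KEYWORDS = {"dataset", "corpus", "benchmark", "collection"}

def match_dataset_mentions(tagged):
    """Find spans of one or more proper nouns followed by a keyword.

    `tagged` is a list of (word, pos) pairs; returns (start, end) token spans,
    end exclusive, with the keyword included in the span.
    """
    spans, i = [], 0
    while i < len(tagged):
        if tagged[i][1] == "PROPN":
            j = i
            while j < len(tagged) and tagged[j][1] == "PROPN":
                j += 1                      # extend over the proper-noun run
            if j < len(tagged) and tagged[j][0].lower() in KEYWORDS:
                spans.append((i, j + 1))    # include the keyword token
            i = j + 1
        else:
            i += 1
    return spans
```

Note that including the generic keyword in the span mirrors the annotation guideline that generic nouns accompanying a term are annotated as part of the entity.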
For the CRF, the sklearn_crfsuite package, which offers a scikit-learn compatible interface, was used [52]. Both BiLSTM methods are Keras based. No parameter optimization was performed; the parameters were taken from the "Depends on the definition" NER series [53].
Both BERT models are used through a scikit-learn wrapper [54], which provides a choice of pretrained models. Following [29], we chose the cased model for NER: while the uncased variant generally performs better, capitalization can indicate whether a phrase refers to a dataset, as dataset names are often capitalized. To compare both models on an equal footing, the BERT-base model was chosen, as in [15], since SciBERT only has a base model. For SciBERT, the scivocab vocabulary was chosen, as it represents the words frequently used in scientific papers. The model configuration and architecture are the same as in the SciBERT paper [15]. The following hyperparameters were used for training: a learning rate of 5 × 10−6 for the Adam optimizer, a batch size of 16, 15 training epochs with checkpoints saved every 300 training steps, and gradient clipping with a max gradient norm of 1.
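For reference, the stated hyperparameters in one place; the key names below are illustrative, not the wrapper's actual argument names.

```python
# Hyperparameters reported in the text; key names are illustrative.
TRAINING_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 5e-6,
    "batch_size": 16,
    "epochs": 15,
    "checkpoint_every_steps": 300,
    "max_grad_norm": 1.0,  # gradient clipping
}
```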

Results
We report the results grouped by subquestion. Appendix D contains additional results (e.g., precision and recall scores, and scores for the B(eginning) and I(nternal) tags).

RQ1, Overall Performances
Table 2 contains the F1 performance scores for the six different NER models we tested. This is the only experiment we conducted with (5-fold) cross-validation on the complete set of 6000 sentences. BERT and SciBERT perform almost the same on both scores and (much) better than all the others, except that the rule-based system performs equally well on the partial match score.
Notice that the partial and exact match scores are closest for SciBERT. Due to conjunctions, ellipses and the annotation guidelines used, NER phrases can be quite long, so a large difference between the two scoring methods could be expected. An error analysis shows that SciBERT is especially good at learning the beginning of a dataset mention.
The two most interesting systems seem to be SciBERT and the rule-based one, and thus we will mostly report results on the other subquestions for these two.

RQ2, Domain Adaptability
The models' ability to adapt to differences within the scientific domain is shown in Table 3. These scores are achieved using one corpus as a test set while training on the other three; the test corpus is given in the header. We expected the scores to be lower than the cross-validation scores, but we only found a small negative effect when testing on the VISION conferences. We note that the VISION set is different in that its sentences come from the last three years, while the others span the last two decades.

RQ3, Training Set Size

Here, we look at the major cost factor: the size of the training set. Figure 1 shows the exact match F1 score on the zero-shot set for varying amounts of training sentences, ranging from 500 to 4500. We see a clear difference between CRF and the two BERT models on the one hand and the two BiLSTM models on the other. We now zoom in on the most stably behaving models, CRF and SciBERT. Figure 2 zooms in on both precision and recall, also for the (supposedly easier) test set. Both models show remarkably robust behavior: only a slight influence of the amount of training examples and hardly any difference in performance between the test and zero-shot test sets. It is noticeable that CRF can be seen as a precision-oriented system, while for SciBERT, precision and recall are very similar. We see this as evidence that these two systems learn the structure of a dataset mention well and do not overfit on the dataset names themselves.

RQ4, Negative/Positive Ratio
Recall that about half of the sentences in our dataset do not mention a dataset while containing one of the trigger words, such as dataset, corpus, collection, etc. We can decide whether or not to use those in training. As noted in previous research, the ratio of positive and negative sentences is important for NER models trained on a dataset mention extraction task [6]. We see a slight improvement in F1 scores for all models when negative training examples are also added, but it is quite small.
What is more interesting is testing on a set in which sentences mentioning a dataset are very rare, just like in a real scientific article. Using the developed rule-based system, we added sentences that most probably do not mention a dataset (i.e., they did not contain any of the trigger words; more precisely, they did not match the regex in Appendix A) to the test set until we obtained a 1 in 100 ratio. Table 4 shows the results. We see that all F1 scores drop compared to those in Table 2. This is expected, as the task becomes harder. However, note that the recall remains very high for the two BERT models, indicating that the drop in F1 is caused mainly by extra false positives (of course, it might be that the SciBERT model discovered genuine dataset mentions not containing one of the trigger terms; we did not check for this).

RQ5, Weakly Supervised Training Examples

We now see how much SciBERT can learn from positive training examples discovered by the rule-based system. As these examples are not hand annotated, we call them weakly supervised. We created a weakly supervised training set, SSC (for Silver Standard Corpus), with the same number of positive and negative sentences as in the manually annotated train set. Table 5 shows that the performance of SciBERT is substantially lower when trained on these 'cost-free' training examples alone than when trained on the hand-annotated data (train set). A reason for this is that the SSC can contain false negatives and false positives, and learning from these faulty data impacts the model's prediction ability, and thus the scores. According to [55], weakly supervised negative examples harm performance. To test this effect, SciBERT was also trained on a combination of the manually labeled data and only the positive data from the SSC. The differences were very small: a 0.01 improvement for both partial and exact match on the zero-shot test, and, on the test set, no difference for the partial match and a 0.02 decrease for the exact match.

RQ6, Easy vs. Hard Sentences
Sentences enumerating a number of named datasets are common in scientific articles. According to the guidelines, these are tagged as one entity, leading to long BI+ tag sequences. We wanted to test whether SciBERT is able to learn these more complex long entities just as well as the easier ones. We therefore split both the train and the zero-shot test sets into a hard and an easy set, with a sentence counted as hard if it contained a BI+ tag sequence of length four or more. We then performed all four possible train-on-hard/easy, test-on-hard/easy experiments. Only with train on easy, test on hard did we see an expected but still remarkable difference in scores: a drop in F1 of 42% (we also saw a 6% drop in F1 when training on hard and testing on easy, but this may be due to the much smaller number of training sentences). This means that the network is also able to understand and interpret ellipses and enumerations: these more complex structures are not harder for the network to identify than simple one- or two-word dataset mentions. Such structures and patterns are difficult even for human annotators to consistently parse and classify correctly, making the network's ability to grasp the nuances of the labeling task significant.

Discussion
We have created a large and varied annotated set of sentences likely to contain a dataset name, with about half actually containing one or more datasets. We have shown that extracting these datasets using traditional NER techniques is feasible but clearly not straightforward or solved. We believe our results show that the created gold standard is a valuable asset for the scientific document parsing community. The set stands out because the sentences come from all sections of scientific articles and come with exact links to the articles. Except for those from SIGIR, all articles are openly available in PDF format.
Analysis of the errors of the NER systems and of the disagreements among the annotators revealed that dataset entity recognition in scientific articles is complicated by the use of enumerations, conjunctions and ellipses in sentences. For example, in the sentence 'We used the VOC 2007, 2008 and 2009 collections.', the phrase 'VOC 2007, 2008 and 2009 collections' is tagged as one dataset entity mention, as individual elements of the enumeration are nonsensical without the context provided by the other elements [7]. We think it is this aspect that makes the task exciting and different from standard NER. Post-processing the found mention, extracting all dataset entities, and completing the information hidden by the use of ellipses is an NLP task needed on top of dataset NER before we can create a dataset citation network. Of course, cross-document coreference resolution of the found dataset names is then needed for the obtained network to be useful [56]. Expanding the provided set of sentences with this extra information, linking every sentence to a set of unique dataset identifiers, is not that much work and would also make the dataset applicable for training the dataset reconciliation task.
We wanted to know which NER system performs well and at what cost. Not surprisingly, the best performing systems were BERT and SciBERT: unsupervised pretraining helps for this task too. Both systems (and CRF) already worked almost optimally with relatively few training examples. They were robust in our domain adaptation experiments, and kept a high recall at the cost of some loss in precision when we diluted the test set to a realistic 1 in 100 ratio of sentences with a dataset.
We found the performance of our quite simple rule-based system remarkable. In fact, this system can be seen as a formalization of the annotation guidelines, and having those carefully spelled out made it almost effortless to create; this is, in our opinion, the reason for its strong performance. The experiment in which we trained SciBERT with extra examples found by the rule-based system was inconclusive, in that we saw hardly any change in performance. However, there may be more clever ways to combine these two models.

Future Directions
We think a gold standard dataset representative of the end-to-end task of dataset mention extraction from scientific PDFs could lead to a big step forward in this field. In particular, we could then train and test end-to-end systems, which would link dataset DOIs to article DOIs.
Additionally, the articles from the four chosen ML/DM/IR/CV conferences are relatively easy for the dataset extraction task, as they do not contain that many named entities.The task is likely harder with papers from biological, chemical or medical domains.
A different approach to this task is to start with a knowledge base of existing research datasets containing their names and some metadata and then to use that in an informed dataset mention extraction system.

Figure 1. RQ3, the influence of the amount of training data for all models, tested on the zero-shot test set.

Figure 2. RQ3, the influence of the amount of training data for the CRF model (left) and for SciBERT (right).

Table 1. Distribution of positive and negative samples across the train, test and zero-shot sets.

Table 2. RQ1, partial and exact match mean F1 scores for various NER models, based on 5-fold cross-validation using the complete dataset of 6000 sentences (all standard deviations are between 0.01 and 0.03).

Table 4. RQ4, testing on a test set with a realistic positive vs. negative ratio.

Table 5. RQ5, differences between training on supervised or weakly supervised train data (F1 scores).