Article

PaperNet: A Dataset and Benchmark for Fine-Grained Paper Classification

School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(9), 4554; https://doi.org/10.3390/app12094554
Submission received: 10 April 2022 / Revised: 24 April 2022 / Accepted: 25 April 2022 / Published: 30 April 2022
(This article belongs to the Special Issue Big Data: Advanced Methods, Interdisciplinary Study and Applications)

Abstract

Document classification is an important area of Natural Language Processing (NLP). Scientific papers are being published at an accelerating rate, so intelligent paper classification, especially fine-grained classification, would greatly benefit researchers. However, a public scientific paper dataset for fine-grained classification is still lacking, so existing document classification methods have not been put to the test on this task. To fill this vacancy, we designed and collected the PaperNet-Dataset, which consists of multi-modal data (texts and figures). PaperNet version 1.0 contains hierarchical categories of papers in the fields of computer vision (CV) and NLP: 2 coarse-grained and 20 fine-grained classes (7 in CV and 13 in NLP). We ran current mainstream models on the PaperNet-Dataset, along with a multi-modal method that we propose. Interestingly, none of these methods reaches an accuracy of 80% in fine-grained classification, showing plenty of room for improvement. We hope that the PaperNet-Dataset will inspire more work in this challenging area.

1. Introduction

The number of scientific papers has been increasing ever more rapidly [1,2,3,4,5]. Researchers have to spend a great deal of time classifying papers relevant to their studies, especially into fine-grained sub-fields. When the number of papers in a field reaches the order of $10^4$, tracking them manually becomes very difficult. Therefore, intelligent fine-grained paper classification is highly desirable.
However, many existing paper classification models are coarse-grained [6,7]; i.e., they simply divide papers into a few broad fields such as “math”, “chemistry”, “physics”, and “biology”. In sub-fields such as “machine translation” or “text generation”, the cost of data labeling can be quite high because scientific expertise is usually needed. As a result, fine-grained paper classification datasets are lacking. Additionally, papers usually contain both text and figure information, so classifying them based on text alone may be insufficient [8].
To address this issue, we introduce the PaperNet-Dataset, which comprises 12 datasets with multi-modal data, and run experiments on it using current mainstream classification models. PaperNet contains 2 coarse-grained (CV and NLP) and 20 fine-grained (7 in CV and 13 in NLP) classes.
The main contributions of this paper are summarized as follows:
  • We introduce the PaperNet-Dataset, which contains multi-modal data (text and figures) for fine-grained paper classification. To the best of our knowledge, this is the first multi-modal fine-grained paper dataset. In addition, it has been pre-processed for convenience of use.
  • Extensive experiments with current mainstream models were conducted to evaluate PaperNet. None of them reached an accuracy of 80%, showing that fine-grained paper classification is a challenging task and that PaperNet can serve as a worthy benchmark.
  • Additionally, we propose a multi-modal paper classification method as a potential direction for better performance. The proposed method combines the strengths of MobileNetV3 and Albert for multi-modal representation fusion and shows promising results.

2. Background and Related Works

2.1. Related Datasets

The Reuters dataset [9] contains short news articles and their corresponding topics; it has 10,789 samples spanning 90 classes. Cifar-10 is an image classification dataset for identifying common objects. It contains 10 categories of RGB color images (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck); each image is 32 × 32 pixels, with 50,000 training images and 10,000 test images. CUB-200 is a multi-modal fine-grained classification dataset containing 11,788 images of birds across 200 classes, of which 5994 images are in the training set and 5794 in the test set. Each sample provides image tags, bird attributes, and an image description.
In terms of paper classification datasets, AAPD (Arxiv Academic Paper Dataset) [10] is a large dataset in the field of computer science for multi-label text classification. It contains 55,840 papers, including abstracts and corresponding topics, across 54 classes. The DocFigure dataset [11] consists of 33K annotated figures in 28 categories, drawn from scientific articles published in recent years at conferences such as CVPR, ECCV, and ICCV. The details of the related datasets, including the average word count per sample, are shown in Table 1.
Because scientific papers span a large number of categories, researchers increasingly need to locate papers within subdivisions of their research fields. Compared with coarse-grained classification, sample similarity within sub-fields is higher, making classification more difficult. Scientific papers also contain multi-modal information: figures are usually indispensable alongside text, and classifying papers based on text alone is insufficient. Therefore, multi-modal document classification is a promising way to increase accuracy in fine-grained paper classification.
However, because scientific expertise is usually needed, the cost of sample labeling can be quite high. As a result, fine-grained paper classification datasets are lacking. Here, we introduce PaperNet-Dataset, a multi-modal paper classification dataset.

2.2. Multi-Modal Learning

Modalities are the different channels through which people receive information. Researchers have achieved remarkable results in the field of multi-modal learning [12,13].
Multi-modal learning aims to integrate information from multiple modalities into a consistent, common model output. Fusing multi-modal information yields more comprehensive features, improves the robustness of the model, and allows it to work efficiently even when some modalities are missing.
Ref. [14] proposed a new multi-modal event topic model for social media documents. Ref. [15] proposed a new hashing algorithm that encodes weakly supervised multi-modal features into binary codes and then applies a kernel function and SVM for classification. Ref. [16] built its fusion layer with the outer product instead of simple concatenation to capture richer features.

2.3. Paper Document Classification

Paper classification is a document-level text classification task. Compared with other document-level tasks, it involves many levels of categories, from coarse-grained to fine-grained, and paper documents consist of both image and text information. On the text side, progressively richer semantic information is available from the title to the abstract to the full document. Ref. [17] proposed XMLCNN, which is based on the popular model of [18]. Another popular model, the Hierarchical Attention Network [19], models the hierarchical structure of documents to extract meaningful features, classifying documents with combined word-level and sentence-level encoders. Nguyen et al. [20] proposed an improved feature weighting technique for document representation. SGM [10] is a generative method for multi-label document classification that uses an encoder-decoder sequence generation model to produce labels for each document. Ref. [21] proposed a simple, properly regularized single-layer BiLSTM.
Recently, a large number of studies have shown that pre-trained models built on large corpora can learn general language representations, which benefits downstream text classification tasks and avoids training models from scratch. Some pre-trained models focus on learning context-sensitive word embeddings, such as ELMo [22], OpenAI GPT [23], and BERT [24]. These learned encoders are also used to represent words in downstream tasks. In addition, various pre-training tasks have been proposed to build pre-trained models for different purposes.
DocBERT [7] presents the first application of BERT to document classification. For multi-modal classification, popular visual question answering (VQA) and image captioning methods such as VisualBERT and UNITER cannot be directly applied. The method in [25] first studied the relation between the textual and visual aspects of multi-modal posts from major social media platforms.

3. Method

Inspired by [25] and DocBERT [7], we propose a multi-modal paper classification method that extracts more modal information from scientific paper documents. We combine the complementary strengths of MobileNetV3 and Albert for a better joint representation of multi-modal information.

3.1. Figure Feature Representation

The figures are fed into the pre-trained MobileNetV3 model for feature extraction; we fine-tune MobileNetV3 to achieve better results on this task. The extracted figure vector is subsequently fused with the text vector of the paper.
$$V_i^m = \mathrm{MobileNetV3}(I_i)$$

The figure features are $F_i^m = [I_{i1}^m, I_{i2}^m, I_{i3}^m, \ldots, I_{in}^m]$, $F_i^m \in V_i^m$, where $I_{ik}^m$ denotes a figure feature and $V_i^m$ is the feature vector of the encoded figure.
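As a concrete illustration, the sketch below shows how such a figure encoder could be built on torchvision's pretrained MobileNetV3; the checkpoint, input resolution, and 960-dimensional output are illustrative assumptions, not the exact fine-tuned model used here.

```python
# Minimal sketch of the figure-encoding step with a pretrained MobileNetV3.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load an ImageNet-pretrained MobileNetV3 and drop its classification head so
# that the network outputs a feature vector V_i^m instead of class logits.
backbone = models.mobilenet_v3_large(weights="IMAGENET1K_V1")
backbone.classifier = torch.nn.Identity()  # keep the pooled 960-d features
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode_figure(path: str) -> torch.Tensor:
    """Return the figure feature vector V_i^m for one extracted figure."""
    image = Image.open(path).convert("RGB")
    with torch.no_grad():
        return backbone(preprocess(image).unsqueeze(0)).squeeze(0)  # shape (960,)
```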

3.2. Text Feature Representation

For the text of the paper, we first vectorize the text information: the Albert pre-processing model is used for embedding, and then the pre-trained Albert model is used for encoding and feature extraction.
$$V_i^t = \mathrm{Albert}(T_i)$$

The text features are $F_i^t = [T_{i1}^t, T_{i2}^t, T_{i3}^t, \ldots, T_{in}^t]$, $F_i^t \in V_i^t$, where $T_{ik}^t$ denotes a text feature and $V_i^t$ is the text feature vector.
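A corresponding sketch of the text encoder using the Hugging Face transformers library is given below; the albert-base-v2 checkpoint and the use of the pooled output as the abstract-level vector are assumptions for illustration.

```python
# Minimal sketch of the text-encoding step with a pretrained ALBERT model.
import torch
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
encoder = AlbertModel.from_pretrained("albert-base-v2")
encoder.eval()

def encode_abstract(text: str) -> torch.Tensor:
    """Return the text feature vector V_i^t for one paper abstract."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Use the pooled representation as the abstract-level vector (768-d).
    return outputs.pooler_output.squeeze(0)
```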

3.3. Multi-Modal Feature Fusion

After Albert processing, the text vectors are concatenated with the weighted figure vectors.
$$V_i^f = V_i^t \oplus (V_i^m \times W)$$

where $\oplus$ is the concatenation operator and $W$ is the weighting matrix. After concatenation, a fully connected layer produces the joint multi-modal representation. We then combine the text vector, the figure vector, and the fusion vector for classification:

$$V_i^f = \mathrm{ReLU}(w \cdot V_i^f + b)$$

$$V = w_1 \cdot V_i^f + w_2 \cdot V_i^t + w_3 \cdot V_i^m$$

where $w_1$, $w_2$, and $w_3$ are the weighting coefficients of the respective vectors.
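To make the fusion concrete, here is a minimal PyTorch sketch of the equations above; the feature dimensions and the linear projections that bring $V_i^t$ and $V_i^m$ to the fused dimension for the final weighted sum are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the multi-modal fusion: concatenate, project, then weighted sum.
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, text_dim=768, img_dim=960, fused_dim=512):
        super().__init__()
        self.W = nn.Linear(img_dim, img_dim, bias=False)    # weighting matrix W
        self.fc = nn.Linear(text_dim + img_dim, fused_dim)  # joint representation
        self.proj_t = nn.Linear(text_dim, fused_dim)        # align V_t (assumption)
        self.proj_m = nn.Linear(img_dim, fused_dim)         # align V_m (assumption)
        self.w = nn.Parameter(torch.ones(3))                # coefficients w1, w2, w3

    def forward(self, v_t, v_m):
        v_f = torch.cat([v_t, self.W(v_m)], dim=-1)         # V_t concat (V_m x W)
        v_f = torch.relu(self.fc(v_f))                      # ReLU(w . V_f + b)
        # V = w1*V_f + w2*V_t + w3*V_m, with the single-modal vectors projected
        # to the fused dimension so the sum is well defined.
        return (self.w[0] * v_f
                + self.w[1] * self.proj_t(v_t)
                + self.w[2] * self.proj_m(v_m))
```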

3.4. Classification

We use a two-layer fully connected neural network as our classification layer. The activation functions of the hidden layer and the output layer are the element-wise ReLU and softmax functions, respectively. The loss function is cross-entropy.
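A minimal sketch of this classification head, assuming a 512-dimensional fused input and the 20 classes of PaperNet_20:

```python
# Two-layer classification head; the hidden size 256 is an illustrative assumption.
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),           # element-wise ReLU on the hidden layer
    nn.Linear(256, 20),  # class logits
)
# nn.CrossEntropyLoss applies log-softmax internally, which is equivalent to a
# softmax output layer trained with the cross-entropy loss.
loss_fn = nn.CrossEntropyLoss()
```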
We show that this design significantly helps the optimization of the model, which outperforms the other baselines in most cases.

4. Dataset

4.1. Papernet-Dataset

More than 38,000 samples were collected from Google Scholar and well-known NLP and CV conferences such as ACL, EMNLP, and CVPR. Because of the academic nature of the papers, we invited scholars and experts with years of research experience in the relevant fields to label the samples. In our experiments, 80% of the samples serve as the training and validation sets; these samples are shuffled and 10-fold cross validation is applied. The remaining 20% of the samples form the test set (a minimal sketch of this split protocol follows the subset list below). PaperNet contains the following subsets.
  • PaperNet_2. The PaperNet_2 dataset is a coarse-grained paper classification dataset containing 2 classes, CV and NLP.
  • PaperNet_20. The PaperNet_20 dataset is for fine-grained paper classification. It contains 20 classes, 7 in CV and 13 in NLP.
  • PaperNet_CV. The PaperNet_CV dataset is a subset of PaperNet_20 and includes the 7 CV classes.
  • PaperNet_NLP. The PaperNet_NLP dataset is the other subset of PaperNet_20 and includes the 13 NLP classes.
Each of the above 4 datasets contains text, figure, and multi-modal subsets, so PaperNet version 1.0 comprises 12 datasets in total. The details are shown in Table 2.
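As promised above, the following sketch shows one way to implement the split protocol with scikit-learn; the sample records and the random seed are placeholders, not the exact procedure used to build the released splits.

```python
from sklearn.model_selection import train_test_split, KFold

# Placeholder records standing in for the labeled (text, figure, label) samples.
samples = [("abstract text", "figure.png", 0)] * 100

# Hold out 20% of the shuffled samples as the test set.
train_val, test = train_test_split(samples, test_size=0.2, shuffle=True, random_state=42)

# 10-fold cross validation over the remaining 80%.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kfold.split(train_val)):
    train_fold = [train_val[i] for i in train_idx]
    val_fold = [train_val[i] for i in val_idx]
    # train on train_fold and validate on val_fold for this fold
```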

4.2. Data Pre-Processing and Feature Engineering

Because scientific expertise is required, the cost of sample labeling is extremely high when papers are classified in a fine-grained way. To work around the limited number of samples, we extract both figure and abstract information to expand the available information.
For ease of transmission and storage, most scientific papers exist in PDF format, containing both text content and figures. We used the PDFplumber framework to extract the text of the abstract from each PDF and the PIL framework to extract the figures. Figure extraction is challenging because of the varying sizes and formats of figures in PDF documents. Additionally, some figures, such as logos or biography photos, contain no useful features and need to be removed during extraction, so we modified the PIL-based extraction for better performance. We resize the images and use ResNet for feature extraction. For the text abstracts, we vectorize the text information using the TF-IDF and word2vec algorithms for encoding and feature extraction. The details of the dataset are shown in Table 2.
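A simplified sketch of the text side of this pipeline is shown below: pdfplumber pulls the raw page text, a naive heuristic isolates the abstract, and TF-IDF vectorizes it. The abstract-finding heuristic is an assumption for illustration; the actual extraction rules are more involved.

```python
# Sketch: extract abstracts from paper PDFs and vectorize them with TF-IDF.
import pdfplumber
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_abstract(pdf_path: str) -> str:
    """Return a rough abstract string from the first page of a paper PDF."""
    with pdfplumber.open(pdf_path) as pdf:
        first_page = pdf.pages[0].extract_text() or ""
    # Naive heuristic: take the text between "Abstract" and "Introduction".
    lower = first_page.lower()
    start, end = lower.find("abstract"), lower.find("introduction")
    return first_page[start:end] if start != -1 and end > start else first_page

vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
# abstracts = [extract_abstract(p) for p in pdf_paths]   # pdf_paths: your PDFs
# X = vectorizer.fit_transform(abstracts)                # TF-IDF feature matrix
```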

5. Experiment

In this section, we evaluate our PaperNet-Dataset with a series of experimental tasks.

5.1. Algorithms

We compare our proposed multi-modal fusion method with multiple well-known classification and embedding methods such as:
  • ResNet50: The Residual Network [26] is widely used in image classification and as the backbone of neural networks in many computer vision tasks.
  • DenseNet121: The DenseNet model [27] alleviates the vanishing-gradient problem, strengthens feature propagation, and reduces the number of parameters.
  • MobileNetV3: MobileNetV3 [28] is a light-weight model that combines manual network design with neural architecture search (NAS) to produce a new generation of MobileNets.
  • ULMFiT: Universal Language Model Fine-tuning (ULMFiT) [29] is a fine-tuning-based language model applicable to a variety of NLP tasks.
  • Albert: The Albert model [30]. Compared to BERT, Albert achieves better results with fewer parameters; we use it for text encoding in our proposed model.
  • Concat: Previous work [25] concatenates the feature vectors of different modalities as the input to the classification layer. We implement this concatenation model with our feature vectors and apply it to classification.

5.2. Settings

For the baseline models, we used the default parameter settings from their original papers or implementations and added a dropout layer with a dropout rate of 0.3 to the pre-trained models. The learning rate was $1 \times 10^{-4}$. We selected 10% of the training set as the validation set and applied 10-fold cross validation. Following [31], we trained the models using Adam [32] and stopped training if the validation loss did not decrease for 10 consecutive epochs; a sketch of this training loop follows.
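The sketch below illustrates this protocol; `model`, `train_loader`, `val_loader`, `loss_fn`, and `evaluate` are hypothetical placeholders for components defined elsewhere, and the epoch cap is an assumption.

```python
# Sketch of the training loop: Adam at lr 1e-4 with early stopping (patience 10).
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
max_epochs, patience = 100, 10  # epoch cap is an illustrative assumption
best_val_loss, bad_epochs = float("inf"), 0

for epoch in range(max_epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()

    val_loss = evaluate(model, val_loader)  # hypothetical helper: mean val loss
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation loss has not decreased for 10 consecutive epochs
```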

5.3. Main Results

We aimed to use popular models to evaluate the PaperNet-Dataset and set up benchmarks for paper classification. The results are shown in Figure 1 and Figure 2.

5.3.1. Text Classification

In the experiments, we first evaluated the text classification models. The ULMFiT and Albert models perform well on the coarse-grained PaperNet_2 dataset. For fine-grained classification, however, the accuracy of both models drops significantly.

5.3.2. Image Classification

Next, we used popular image classification models to conduct experiments on the four datasets. To our surprise, none of the models performed well on the fine-grained datasets. Fine-grained classification is more difficult because of the high similarity among samples in the subdivided fields. Details of our further analysis can be found in Figure 3 and Figure 4, and typical figures from the datasets are shown in Figure 5 and Figure 6.

5.3.3. Multi-Modal Classification

Since single-modal classification methods cannot achieve satisfactory results, we consider multi-modal classification. Popular VQA and image captioning methods such as VisualBERT or UNITER are not suitable for multi-modal fine-grained paper classification, so we used the model proposed by [25] for multi-modal paper classification. To make the model more suitable for paper classification, we improved it and propose the method introduced in Section 3. As shown in Table 3, the proposed method achieves the highest accuracy on three of the four datasets.

5.4. Other Machine Learning Algorithms

As shown in Table 4, compared with the popular pre-trained models, the Random Forest, SVM, and Naive Bayes algorithms surprisingly achieve higher accuracy on the coarse-grained PaperNet_2 dataset while consuming very little time and few computing resources. In fine-grained classification, however, their performance also drops significantly. More detailed results are shown in Appendix A, and a sketch of such baselines follows.
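The following sketch shows how such classical baselines can be run on TF-IDF features with scikit-learn; the default hyperparameters and the placeholder data names are assumptions rather than the exact configuration used here.

```python
# Sketch: classical baselines on TF-IDF text features.
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

# train_texts/train_labels and test_texts/test_labels are placeholders for the
# abstracts and class labels of a PaperNet split.
for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("SVM", LinearSVC()),
                  ("Random Forest", RandomForestClassifier())]:
    pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    pipeline.fit(train_texts, train_labels)
    print(f"{name}: accuracy = {pipeline.score(test_texts, test_labels):.4f}")
```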

6. Conclusions and Future Work

To facilitate research on fine-grained paper classification, we introduce the PaperNet-Dataset version 1.0, which consists of multi-modal data (texts and figures). It contains hierarchical categories: 2 coarse-grained and 20 fine-grained (7 in CV and 13 in NLP). We ran multiple well-known mainstream models on the PaperNet-Dataset; they performed poorly on the fine-grained tasks, never reaching an accuracy of 80%. In addition, we propose a multi-modal fusion method that increases accuracy but is still not satisfactory. The results show that there is plenty of room for improvement in fine-grained classification and that PaperNet can serve as a benchmark dataset. We plan to expand PaperNet in future versions and hope that it will inspire more work in this challenging area of fine-grained paper classification.

Author Contributions

Conceptualization, T.Y.; Data curation, X.S. and J.Q.; Formal analysis, T.Y.; Funding acquisition, Z.H.; Investigation, T.Y. and Z.F.; Methodology, T.Y.; Project administration, T.Y.; Resources, Y.L. and Z.H.; Software, T.Y.; Supervision, Y.L. and Z.H.; Validation, T.Y.; Visualization, T.Y.; Writing—original draft preparation, T.Y.; Writing—review and editing, Y.L. and Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

The work described in this paper was supported by the BUPT innovation and entrepreneurship support program (2022-YC-S002) and the Beijing Key Laboratory of Work Safety and Intelligent Monitoring Foundation.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CV    Computer Vision
NLP   Natural Language Processing
KNN   K-Nearest Neighbor
SVM   Support Vector Machine

Appendix A

Detailed experimental results are provided in the following tables.
Table A1. Detailed results of the Naive Bayes algorithm.

| Dataset | Class | Precision | Recall | F1-Score |
|---|---|---|---|---|
| PaperNet_2 | CV | 0.99 | 0.94 | 0.97 |
| | NLP | 0.96 | 0.99 | 0.98 |
| PaperNet_20 | CV_attention | 0.60 | 0.15 | 0.55 |
| | CV_classification | 0.52 | 0.47 | 0.50 |
| | CV_detection | 0.42 | 0.93 | 0.58 |
| | CV_GAN | 0.68 | 0.47 | 0.55 |
| | CV_recognition | 0.82 | 0.38 | 0.52 |
| | CV_retrieval | 0.99 | 0.05 | 0.10 |
| | CV_segmentation | 0.61 | 0.21 | 0.31 |
| | NLP_bert | 0.46 | 0.66 | 0.54 |
| | NLP_conversation | 0.67 | 0.81 | 0.73 |
| | NLP_cross | 0.43 | 0.45 | 0.44 |
| | NLP_extraction | 0.49 | 0.62 | 0.55 |
| | NLP_Few_shot | 0.83 | 0.21 | 0.33 |
| | NLP_knowledge_graph | 0.99 | 0.40 | 0.57 |
| | NLP_machine_reading | 0.99 | 0.05 | 0.10 |
| | NLP_machine_translation | 0.67 | 0.88 | 0.76 |
| | NLP_multilingual | 0.65 | 0.52 | 0.58 |
| | NLP_multimodal | 0.67 | 0.53 | 0.59 |
| | NLP_named_entity_recognition | 0.99 | 0.20 | 0.33 |
| | NLP_sentiment_analysis | 0.70 | 0.40 | 0.51 |
| | NLP_text_generation | 0.89 | 0.40 | 0.55 |
| PaperNet_CV | CV_attention | 0.67 | 0.15 | 0.24 |
| | CV_classification | 0.52 | 0.41 | 0.46 |
| | CV_detection | 0.40 | 0.95 | 0.56 |
| | CV_GAN | 0.41 | 0.20 | 0.27 |
| | CV_recognition | 0.70 | 0.23 | 0.35 |
| | CV_retrieval | 0.41 | 0.21 | 0.27 |
| | CV_segmentation | 0.50 | 0.15 | 0.24 |
| PaperNet_NLP | NLP_bert | 0.42 | 0.65 | 0.51 |
| | NLP_conversation | 0.64 | 0.79 | 0.71 |
| | NLP_cross | 0.44 | 0.41 | 0.43 |
| | NLP_extraction | 0.46 | 0.62 | 0.52 |
| | NLP_Few_shot | 0.75 | 0.12 | 0.21 |
| | NLP_knowledge_graph | 0.99 | 0.25 | 0.40 |
| | NLP_machine_reading | 0.51 | 0.12 | 0.20 |
| | NLP_machine_translation | 0.57 | 0.86 | 0.69 |
| | NLP_multilingual | 0.58 | 0.42 | 0.49 |
| | NLP_multimodal | 0.71 | 0.50 | 0.59 |
| | NLP_named_entity_recognition | 0.80 | 0.20 | 0.32 |
| | NLP_sentiment_analysis | 0.69 | 0.31 | 0.43 |
| | NLP_text_generation | 0.50 | 0.13 | 0.20 |
Table A2. Detailed results of the Adaboost algorithm.

| Dataset | Class | Precision | Recall | F1-Score |
|---|---|---|---|---|
| PaperNet_2 | CV | 0.96 | 0.96 | 0.96 |
| | NLP | 0.97 | 0.97 | 0.97 |
| PaperNet_20 | CV_attention | 0.83 | 0.85 | 0.84 |
| | CV_classification | 0.93 | 0.84 | 0.88 |
| | CV_detection | 0.93 | 0.97 | 0.95 |
| | CV_GAN | 0.73 | 0.60 | 0.66 |
| | CV_recognition | 0.95 | 0.90 | 0.92 |
| | CV_retrieval | 0.95 | 0.90 | 0.92 |
| | CV_segmentation | 0.92 | 0.87 | 0.89 |
| | NLP_bert | 0.82 | 0.80 | 0.81 |
| | NLP_conversation | 0.81 | 0.85 | 0.83 |
| | NLP_cross | 0.76 | 0.79 | 0.77 |
| | NLP_extraction | 0.78 | 0.82 | 0.80 |
| | NLP_Few_shot | 0.59 | 0.67 | 0.63 |
| | NLP_knowledge_graph | 0.54 | 0.70 | 0.61 |
| | NLP_machine_reading | 0.91 | 0.50 | 0.65 |
| | NLP_machine_translation | 0.86 | 0.86 | 0.86 |
| | NLP_multilingual | 0.80 | 0.82 | 0.81 |
| | NLP_multimodal | 0.75 | 0.70 | 0.72 |
| | NLP_named_entity_recognition | 0.74 | 0.85 | 0.79 |
| | NLP_sentiment_analysis | 0.59 | 0.74 | 0.66 |
| | NLP_text_generation | 0.68 | 0.65 | 0.67 |
| PaperNet_CV | CV_attention | 0.52 | 0.99 | 0.68 |
| | CV_classification | 0.99 | 0.81 | 0.90 |
| | CV_detection | 0.92 | 0.98 | 0.95 |
| | CV_GAN | 0.50 | 0.49 | 0.50 |
| | CV_recognition | 0.99 | 0.05 | 0.10 |
| | CV_retrieval | 0.51 | 0.50 | 0.50 |
| | CV_segmentation | 0.88 | 0.99 | 0.94 |
| PaperNet_NLP | NLP_bert | 0.87 | 0.85 | 0.86 |
| | NLP_conversation | 0.96 | 0.83 | 0.89 |
| | NLP_cross | 0.87 | 0.93 | 0.90 |
| | NLP_extraction | 0.83 | 0.85 | 0.84 |
| | NLP_Few_shot | 0.69 | 0.75 | 0.72 |
| | NLP_knowledge_graph | 0.68 | 0.75 | 0.71 |
| | NLP_machine_reading | 0.56 | 0.45 | 0.50 |
| | NLP_machine_translation | 0.92 | 0.90 | 0.91 |
| | NLP_multilingual | 0.84 | 0.84 | 0.84 |
| | NLP_multimodal | 0.93 | 0.93 | 0.93 |
| | NLP_named_entity_recognition | 0.75 | 0.90 | 0.82 |
| | NLP_sentiment_analysis | 0.76 | 0.89 | 0.82 |
| | NLP_text_generation | 0.93 | 0.70 | 0.80 |
Table A3. Detailed results of the KNN algorithm.

| Dataset | Class | Precision | Recall | F1-Score |
|---|---|---|---|---|
| PaperNet_2 | CV | 0.98 | 0.94 | 0.96 |
| | NLP | 0.96 | 0.98 | 0.97 |
| PaperNet_20 | CV_attention | 0.24 | 0.23 | 0.23 |
| | CV_classification | 0.44 | 0.44 | 0.44 |
| | CV_detection | 0.57 | 0.75 | 0.65 |
| | CV_GAN | 0.44 | 0.60 | 0.50 |
| | CV_recognition | 0.62 | 0.55 | 0.58 |
| | CV_retrieval | 0.38 | 0.15 | 0.21 |
| | CV_segmentation | 0.54 | 0.40 | 0.46 |
| | NLP_bert | 0.62 | 0.64 | 0.63 |
| | NLP_conversation | 0.71 | 0.79 | 0.75 |
| | NLP_cross | 0.49 | 0.52 | 0.50 |
| | NLP_extraction | 0.68 | 0.60 | 0.64 |
| | NLP_Few_shot | 0.48 | 0.46 | 0.47 |
| | NLP_knowledge_graph | 0.88 | 0.75 | 0.81 |
| | NLP_machine_reading | 0.99 | 0.50 | 0.67 |
| | NLP_machine_translation | 0.73 | 0.85 | 0.79 |
| | NLP_multilingual | 0.66 | 0.54 | 0.59 |
| | NLP_multimodal | 0.77 | 0.67 | 0.71 |
| | NLP_named_entity_recognition | 0.80 | 0.60 | 0.69 |
| | NLP_sentiment_analysis | 0.57 | 0.60 | 0.58 |
| | NLP_text_generation | 0.92 | 0.55 | 0.69 |
| PaperNet_CV | CV_attention | 0.45 | 0.38 | 0.41 |
| | CV_classification | 0.53 | 0.47 | 0.50 |
| | CV_detection | 0.56 | 0.75 | 0.64 |
| | CV_GAN | 0.48 | 0.58 | 0.53 |
| | CV_recognition | 0.64 | 0.48 | 0.55 |
| | CV_retrieval | 0.70 | 0.35 | 0.47 |
| | CV_segmentation | 0.55 | 0.44 | 0.49 |
| PaperNet_NLP | NLP_bert | 0.60 | 0.62 | 0.61 |
| | NLP_conversation | 0.67 | 0.73 | 0.70 |
| | NLP_cross | 0.53 | 0.55 | 0.54 |
| | NLP_extraction | 0.71 | 0.66 | 0.69 |
| | NLP_Few_shot | 0.45 | 0.58 | 0.51 |
| | NLP_knowledge_graph | 0.89 | 0.80 | 0.84 |
| | NLP_machine_reading | 0.99 | 0.50 | 0.67 |
| | NLP_machine_translation | 0.72 | 0.83 | 0.77 |
| | NLP_multilingual | 0.57 | 0.50 | 0.53 |
| | NLP_multimodal | 0.82 | 0.90 | 0.86 |
| | NLP_named_entity_recognition | 0.80 | 0.60 | 0.69 |
| | NLP_sentiment_analysis | 0.64 | 0.66 | 0.65 |
| | NLP_text_generation | 0.90 | 0.45 | 0.60 |
Table A4. Detailed results of the SVM algorithm.

| Dataset | Class | Precision | Recall | F1-Score |
|---|---|---|---|---|
| PaperNet_2 | CV | 0.97 | 0.99 | 0.98 |
| | NLP | 0.98 | 0.99 | 0.98 |
| PaperNet_20 | CV_attention | 0.63 | 0.30 | 0.41 |
| | CV_classification | 0.63 | 0.80 | 0.70 |
| | CV_detection | 0.68 | 0.96 | 0.79 |
| | CV_GAN | 0.82 | 0.62 | 0.71 |
| | CV_recognition | 0.86 | 0.73 | 0.79 |
| | CV_retrieval | 0.99 | 0.05 | 0.10 |
| | CV_segmentation | 0.87 | 0.87 | 0.87 |
| | NLP_bert | 0.62 | 0.93 | 0.74 |
| | NLP_conversation | 0.93 | 0.79 | 0.85 |
| | NLP_cross | 0.75 | 0.77 | 0.76 |
| | NLP_extraction | 0.75 | 0.81 | 0.78 |
| | NLP_Few_shot | 0.88 | 0.29 | 0.44 |
| | NLP_knowledge_graph | 0.99 | 0.50 | 0.67 |
| | NLP_machine_reading | 0.99 | 0.30 | 0.46 |
| | NLP_machine_translation | 0.83 | 0.95 | 0.88 |
| | NLP_multilingual | 0.91 | 0.78 | 0.84 |
| | NLP_multimodal | 0.88 | 0.77 | 0.82 |
| | NLP_named_entity_recognition | 0.94 | 0.75 | 0.83 |
| | NLP_sentiment_analysis | 0.77 | 0.69 | 0.73 |
| | NLP_text_generation | 0.99 | 0.70 | 0.82 |
| PaperNet_CV | CV_attention | 0.75 | 0.30 | 0.43 |
| | CV_classification | 0.68 | 0.79 | 0.73 |
| | CV_detection | 0.62 | 0.96 | 0.75 |
| | CV_GAN | 0.87 | 0.58 | 0.69 |
| | CV_recognition | 0.84 | 0.68 | 0.75 |
| | CV_retrieval | 0.99 | 0.05 | 0.10 |
| | CV_segmentation | 0.89 | 0.75 | 0.81 |
| PaperNet_NLP | NLP_bert | 0.57 | 0.94 | 0.71 |
| | NLP_conversation | 0.93 | 0.79 | 0.85 |
| | NLP_cross | 0.76 | 0.73 | 0.75 |
| | NLP_extraction | 0.70 | 0.82 | 0.76 |
| | NLP_Few_shot | 0.90 | 0.38 | 0.53 |
| | NLP_knowledge_graph | 0.99 | 0.50 | 0.67 |
| | NLP_machine_reading | 0.99 | 0.10 | 0.18 |
| | NLP_machine_translation | 0.84 | 0.97 | 0.90 |
| | NLP_multilingual | 0.93 | 0.76 | 0.84 |
| | NLP_multimodal | 0.89 | 0.80 | 0.84 |
| | NLP_named_entity_recognition | 0.94 | 0.85 | 0.89 |
| | NLP_sentiment_analysis | 0.86 | 0.71 | 0.78 |
| | NLP_text_generation | 0.99 | 0.60 | 0.75 |

References

  1. Zyuzin, V.; Ronkin, M.; Porshnev, S.; Kalmykov, A. Automatic Asbestos Control Using Deep Learning Based Computer Vision System. Appl. Sci. 2021, 11, 532.
  2. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
  3. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30.
  4. Dhaliwal, S.S.; Nahid, A.A.; Abbas, R. Effective Intrusion Detection System Using XGBoost. Information 2018, 9, 149.
  5. Mukhamediev, R.I.; Symagulov, A.; Kuchin, Y.; Yakunin, K.; Yelis, M. From Classical Machine Learning to Deep Neural Networks: A Simplified Scientometric Review. Appl. Sci. 2021, 11, 5541.
  6. Ma, X.; Wang, R. Personalized Scientific Paper Recommendation Based on Heterogeneous Graph Representation. IEEE Access 2019, 7, 79887–79894.
  7. Adhikari, A.; Ram, A.; Tang, R.; Lin, J. DocBERT: BERT for Document Classification. arXiv 2019, arXiv:1904.08398.
  8. Quan, J.; Li, Q.; Li, M. Computer Science Paper Classification for CSAR. In New Horizons in Web Based Learning; Cao, Y., Väljataga, T., Tang, J.K., Leung, H., Laanpere, M., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 34–43.
  9. Apté, C.; Damerau, F.; Weiss, S.M. Automated Learning of Decision Rules for Text Categorization. ACM Trans. Inf. Syst. 1994, 12, 233–251.
  10. Yang, P.; Sun, X.; Li, W.; Ma, S.; Wu, W.; Wang, H. SGM: Sequence Generation Model for Multi-Label Classification. arXiv 2018, arXiv:1806.04822.
  11. Jobin, K.; Mondal, A.; Jawahar, C. DocFigure: A Dataset for Scientific Document Figure Classification. In Proceedings of the 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Sydney, NSW, Australia, 22–25 September 2019; Volume 1, pp. 74–79.
  12. Cadene, R.; Ben-younes, H.; Cord, M.; Thome, N. MUREL: Multimodal Relational Reasoning for Visual Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
  13. Zhu, J.; Zhou, Y.; Zhang, J.; Li, H.; Zong, C.; Li, C. Multimodal Summarization with Guidance of Multimodal Reference. Proc. AAAI Conf. Artif. Intell. 2020, 34, 9749–9756.
  14. Qian, S.; Zhang, T.; Xu, C.; Shao, J. Multi-Modal Event Topic Model for Social Event Analysis. IEEE Trans. Multimed. 2016, 18, 233–246.
  15. Xia, Y.; Zhang, L.; Liu, Z.; Nie, L.; Li, X. Weakly Supervised Multimodal Kernel for Categorizing Aerial Photographs. IEEE Trans. Image Process. 2017, 26, 3748–3758.
  16. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 7–11 September 2017.
  17. Liu, J.; Chang, W.C.; Wu, Y.; Yang, Y. Deep Learning for Extreme Multi-Label Text Classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; pp. 115–124.
  18. Kim, Y. Convolutional Neural Networks for Sentence Classification. arXiv 2014, arXiv:1408.5882.
  19. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489.
  20. Nguyen, D.B.; Shenify, M.; Al-Mubaid, H. Biomedical Text Classification with Improved Feature Weighting Method. In Proceedings of the International Conference on Bioinformatics and Computational Biology, Las Vegas, NV, USA, 4–6 April 2016.
  21. Adhikari, A.; Ram, A.; Tang, R.; Lin, J. Rethinking Complex Neural Network Architectures for Document Classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4046–4051.
  22. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. arXiv 2018, arXiv:1802.05365.
  23. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165.
  24. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
  25. Schifanella, R.; de Juan, P.; Tetreault, J.; Cao, L. Detecting Sarcasm in Multimodal Social Platforms. In Proceedings of the 24th ACM International Conference on Multimedia (MM ’16); Association for Computing Machinery: New York, NY, USA, 2016; pp. 1136–1145.
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  27. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
  28. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1314–1324.
  29. Howard, J.; Ruder, S. Universal Language Model Fine-Tuning for Text Classification. arXiv 2018, arXiv:1801.06146.
  30. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. arXiv 2020, arXiv:1909.11942.
  31. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
  32. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980.
Figure 1. The machine learning algorithm experiments.
Figure 2. Popular pre-trained model accuracy experiments.
Figure 3. Visualization of the classification results on the PaperNet_2 dataset. (Red: NLP; Blue: CV.)
Figure 4. Visualization of the classification results on the PaperNet_CV dataset. (Red: CV_attention; Yellow: CV_classification; Green: CV_segmentation; Brown: CV_retrieval; Purple: CV_recognition; Blue: CV_detection; Orange: CV_GAN.)
Figure 5. Examples of the PaperNet_CV dataset figures.
Figure 6. Examples of the PaperNet_NLP dataset figures.
Table 1. The details of related datasets (Word denotes the average word count per sample).

| Modality | Dataset | Classes | Samples | Word |
|---|---|---|---|---|
| Text | Reuters | 90 | 10,789 | 144.3 |
| | AAPD | 54 | 55,840 | 167.3 |
| | IMDB | 10 | 135,669 | 393.8 |
| Figure | Cifar-10 | 10 | 60,000 | - |
| | DocFigure | 28 | 33,000 | - |
| | Deepchart | 5 | 5000 | - |
| Multi-modal | CUB | 200 | 11,788 | - |
| | Food-101 | 101 | 90,704 | - |
| | PaperNet v1.0 | 20 | 38,608 | 150.43 |
Table 2. Details of PaperNet datasets (Multi. means the pre-processed multi-modal data).

| Dataset | Subset | Class | Text | Figure | Multi. |
|---|---|---|---|---|---|
| PaperNet_2 | | CV | 1863 | 12,774 | 25,548 |
| | | NLP | 2538 | 6609 | 13,060 |
| | | Average coarse-grained class | 2200.5 | 9691.5 | 19,304 |
| PaperNet_20 | PaperNet_CV | CV_attention | 201 | 1157 | 2314 |
| | | CV_classification | 387 | 1951 | 3902 |
| | | CV_detection | 621 | 4280 | 8560 |
| | | CV_GAN | 228 | 1932 | 3864 |
| | | CV_recognition | 284 | 1432 | 2864 |
| | | CV_retrieval | 82 | 270 | 540 |
| | | CV_segmentation | 260 | 1752 | 3504 |
| | PaperNet_NLP | NLP_bert | 372 | 1215 | 2430 |
| | | NLP_conversation | 262 | 698 | 1396 |
| | | NLP_cross | 277 | 576 | 1152 |
| | | NLP_extraction | 340 | 681 | 1362 |
| | | NLP_Few_shot | 119 | 309 | 614 |
| | | NLP_knowledge_graph | 82 | 244 | 488 |
| | | NLP_machine_reading | 59 | 104 | 208 |
| | | NLP_machine_translation | 490 | 832 | 1664 |
| | | NLP_multilingual | 253 | 557 | 1114 |
| | | NLP_multimodal | 145 | 829 | 1656 |
| | | NLP_named_entity_recognition | 105 | 189 | 378 |
| | | NLP_sentiment_analysis | 172 | 179 | 358 |
| | | NLP_text_generation | 100 | 121 | 240 |
| | | Average fine-grained class | 241.95 | 965.4 | 1930.4 |
Table 3. Comparison of accuracy (%).

| Modality | Algorithm | PaperNet_2 | PaperNet_20 | PaperNet_CV | PaperNet_NLP |
|---|---|---|---|---|---|
| Image | ResNet50 | 83.94 ± 0.55 | 50.30 ± 0.66 | 60.23 ± 0.61 | 45.33 ± 0.50 |
| | DenseNet121 | 82.34 ± 0.96 | 45.16 ± 0.46 | 58.80 ± 0.40 | 46.21 ± 0.46 |
| | MobileNetV3 | 83.66 ± 0.86 | 48.38 ± 0.56 | 56.45 ± 0.40 | 40.08 ± 0.54 |
| Text | ULMFiT | 96.26 ± 0.44 | 71.30 ± 1.12 | 75.30 ± 1.06 | 73.33 ± 0.18 |
| | Albert | 96.31 ± 0.09 | 73.32 ± 0.24 | 73.18 ± 0.12 | 74.23 ± 0.36 |
| Multi-modal | Concat | 96.27 ± 0.11 | 73.45 ± 0.69 | 72.43 ± 0.27 | 75.36 ± 0.51 |
| | Our method | 97.05 ± 0.05 | 73.85 ± 0.39 | 74.26 ± 0.32 | 79.27 ± 0.37 |
Table 4. Performance comparison of machine learning algorithms (%).

| Dataset | Metric | Naive Bayes | Adaboost | KNN | SVM | Random Forest |
|---|---|---|---|---|---|---|
| PaperNet_2 | Precision | 97.91 ± 0.65 | 96.51 ± 0.56 | 96.87 ± 0.63 | 99.12 ± 0.62 | 97.23 ± 0.12 |
| | Recall | 97.20 ± 0.46 | 96.67 ± 0.32 | 96.42 ± 0.26 | 98.94 ± 0.33 | 97.58 ± 0.26 |
| | F1-score | 97.56 ± 0.23 | 96.66 ± 0.25 | 96.64 ± 0.21 | 98.96 ± 0.16 | 97.42 ± 0.22 |
| | Accuracy | 96.44 ± 0.36 | 94.08 ± 0.23 | 93.56 ± 0.43 | 96.98 ± 0.26 | 96.32 ± 0.16 |
| PaperNet_20 | Precision | 70.75 ± 1.31 | 79.15 ± 0.39 | 62.64 ± 1.31 | 83.79 ± 1.21 | 81.32 ± 0.17 |
| | Recall | 43.98 ± 0.63 | 78.36 ± 0.32 | 55.89 ± 0.75 | 66.72 ± 0.69 | 72.15 ± 0.08 |
| | F1-score | 46.92 ± 0.39 | 78.46 ± 0.16 | 58.01 ± 0.87 | 69.97 ± 0.52 | 76.46 ± 0.09 |
| | Accuracy | 50.22 ± 0.59 | 71.76 ± 0.26 | 53.01 ± 0.68 | 69.43 ± 0.31 | 72.89 ± 0.21 |
| PaperNet_CV | Precision | 50.89 ± 1.29 | 76.81 ± 0.89 | 55.85 ± 0.85 | 80.67 ± 1.12 | 82.12 ± 0.06 |
| | Recall | 32.85 ± 0.64 | 70.02 ± 0.49 | 49.33 ± 0.64 | 58.67 ± 0.62 | 75.43 ± 0.15 |
| | F1-score | 34.11 ± 0.67 | 65.33 ± 0.28 | 51.22 ± 0.47 | 60.98 ± 0.21 | 78.64 ± 0.08 |
| | Accuracy | 44.96 ± 0.91 | 72.40 ± 0.63 | 50.67 ± 0.56 | 64.34 ± 0.49 | 73.86 ± 0.36 |
| PaperNet_NLP | Precision | 61.58 ± 1.16 | 80.68 ± 0.67 | 70.65 ± 0.83 | 86.56 ± 1.24 | 82.89 ± 0.18 |
| | Recall | 41.43 ± 0.87 | 81.29 ± 0.56 | 64.55 ± 0.65 | 68.85 ± 0.72 | 80.25 ± 0.16 |
| | F1-score | 43.81 ± 0.64 | 81.04 ± 0.51 | 66.51 ± 0.39 | 72.64 ± 0.51 | 81.55 ± 0.11 |
| | Accuracy | 48.16 ± 0.89 | 79.01 ± 0.32 | 56.97 ± 0.41 | 68.92 ± 0.55 | 78.89 ± 0.26 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
