# AlgoLabel: A Large Dataset for Multi-Label Classification of Algorithmic Challenges


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Task Definition

## 4. Dataset

#### 4.1. Data Collection

#### 4.2. Text Dataset

#### 4.3. Code Dataset

## 5. Methodology

#### 5.1. Text Preprocessing

#### 5.2. Code Preprocessing

#### 5.3. Models

#### 5.4. Metrics

## 6. Experimental Results

#### 6.1. Text Classification

#### 6.2. Code Classification

#### 6.3. Dual Text-Code Classification

## 7. Discussion

#### 7.1. Text Classification

#### 7.2. Code Classification

#### 7.3. Computational Efficiency

#### 7.4. Study Limitations

## 8. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Data Availability

## References

- Yin, P.; Neubig, G. TRANX: A Transition-based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, 31 October–4 November 2018; Blanco, E., Lu, W., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 7–12. [Google Scholar] [CrossRef]
- Polosukhin, I.; Skidanov, A. Neural Program Search: Solving Programming Tasks from Description and Examples. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Yin, P.; Deng, B.; Chen, E.; Vasilescu, B.; Neubig, G. Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow. In Proceedings of the International Conference on Mining Software Repositories, Gothenburg, Sweden, 28–29 May 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 476–486. [Google Scholar] [CrossRef] [Green Version]
- Oda, Y.; Fudaba, H.; Neubig, G.; Hata, H.; Sakti, S.; Toda, T.; Nakamura, S. Learning to Generate Pseudo-code from Source Code Using Statistical Machine Translation. In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; IEEE Computer Society: Lincoln, NE, USA, 2015; pp. 574–584. [Google Scholar] [CrossRef]
- Athavale, V.; Naik, A.; Vanjape, R.; Shrivastava, M. Predicting Algorithm Classes for Programming Word Problems. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), Hong Kong, China, 4 November 2019; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 84–93. [Google Scholar] [CrossRef] [Green Version]
- Sechidis, K.; Tsoumakas, G.; Vlahavas, I. On the stratification of multi-label data. In Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2011; pp. 145–158. [Google Scholar]
- Zavershynskyi, M.; Skidanov, A.; Polosukhin, I. NAPS: Natural Program Synthesis Dataset. arXiv **2018**, arXiv:1807.03168. [Google Scholar]
- Miceli Barone, A.V.; Sennrich, R. A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Taipei, Taiwan, 1 December 2017; Asian Federation of Natural Language Processing: Taipei, Taiwan, 2017; pp. 314–319. [Google Scholar]
- Agashe, R.; Iyer, S.; Zettlemoyer, L. JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 5436–5446. [Google Scholar] [CrossRef]
- Alon, U.; Zilberstein, M.; Levy, O.; Yahav, E. code2vec: Learning distributed representations of code. Proc. ACM Program. Lang. **2019**, 3, 1–29. [Google Scholar] [CrossRef] [Green Version]
- Massarelli, L.; Di Luna, G.A.; Petroni, F.; Querzoni, L.; Baldoni, R. SAFE: Self-Attentive Function Embeddings for Binary Similarity. In Proceedings of the 16th Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA), Gothenburg, Sweden, 19–20 June 2019. [Google Scholar]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv **2013**, arXiv:1301.3781. [Google Scholar]
- Lin, Z.; Feng, M.; dos Santos, C.N.; Yu, M.; Xiang, B.; Zhou, B.; Bengio, Y. A Structured Self-Attentive Sentence Embedding. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017. [Google Scholar]
- Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. arXiv **2020**, arXiv:2002.08155. [Google Scholar]
- Karampatsis, R.; Sutton, C.A. SCELMo: Source Code Embeddings from Language Models. arXiv **2020**, arXiv:2004.13214. [Google Scholar]
- Lample, G.; Conneau, A. Cross-lingual Language Model Pretraining. In Proceedings of the 2019 Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 7059–7069. [Google Scholar]
- Lachaux, M.; Rozière, B.; Chanussot, L.; Lample, G. Unsupervised Translation of Programming Languages. arXiv **2020**, arXiv:2006.03511. [Google Scholar]
- Allamanis, M.; Tarlow, D.; Gordon, A.; Wei, Y. Bimodal modelling of source code and natural language. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 7–9 July 2015; JMLR.org: Cambridge, MA, USA, 2015; pp. 2123–2132. [Google Scholar] [CrossRef]
- Wan, Y.; Shu, J.; Sui, Y.; Xu, G.; Zhao, Z.; Wu, J.; Yu, P. Multi-modal attention network learning for semantic source code retrieval. In Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), San Diego, CA, USA, 11–15 November 2019; IEEE: New York, NY, USA, 2019; pp. 13–25. [Google Scholar]
- Ye, H.; Li, W.; Wang, L. Jointly Learning Semantic Parser and Natural Language Generator via Dual Information Maximization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Florence, Italy, 2019; pp. 2090–2101. [Google Scholar] [CrossRef]
- Forišek, M. The difficulty of programming contests increases. In International Conference on Informatics in Secondary Schools-Evolution and Perspectives; Springer: Berlin/Heidelberg, Germany, 2010; pp. 72–85. [Google Scholar]
- de Boer, R. Finding Winning Patterns in ICPC Data. Master’s Thesis, Faculty of Science, Utrecht University, Utrecht, The Netherlands, 2019. [Google Scholar]
- Iancu, B.; Mazzola, G.; Psarakis, K.; Soilis, P. Multi-label Classification for Automatic Tag Prediction in the Context of Programming Challenges. arXiv **2019**, arXiv:1911.12224. [Google Scholar]
- Katakis, I.; Tsoumakas, G.; Vlahavas, I. Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD, Antwerp, Belgium, 15–19 September 2008; Volume 18, p. 5. [Google Scholar]
- Nam, J.; Kim, J.; Mencía, E.L.; Gurevych, I.; Fürnkranz, J. Large-scale multi-label text classification—Revisiting neural networks. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Nancy, France, 14–18 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 437–452. [Google Scholar]
- Huang, W.; Chen, E.; Liu, Q.; Chen, Y.; Huang, Z.; Liu, Y.; Zhao, Z.; Zhang, D.; Wang, S. Hierarchical Multi-Label Text Classification: An Attention-Based Recurrent Network Approach. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1051–1060. [Google Scholar] [CrossRef] [Green Version]
- Peng, H.; Li, J.; Wang, S.; Wang, L.; Gong, Q.; Yang, R.; Li, B.; Yu, P.; He, L. Hierarchical Taxonomy-Aware and Attentional Graph Capsule RCNNs for Large-Scale Multi-Label Text Classification. IEEE Trans. Knowl. Data Eng. **2019**. [Google Scholar] [CrossRef] [Green Version]
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Cambridge, MA, USA, 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar] [CrossRef]
- Adhikari, A.; Ram, A.; Tang, R.; Lin, J. DocBERT: BERT for Document Classification. arXiv **2019**, arXiv:1904.08398. [Google Scholar]
- Halim, S.; Halim, F.; Skiena, S.S.; Revilla, M.A. Competitive Programming 3; Lulu Independent Publish: Morrisville, NC, USA, 2013. [Google Scholar]
- Szymański, P.; Kajdanowicz, T. A Network Perspective on Stratification of Multi-Label Data. In Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, ECML-PKDD, Skopje, Macedonia, 22 September 2017; Torgo, L., Krawczyk, B., Branco, P., Moniz, N., Eds.; PMLR: Skopje, Macedonia, 2017; Volume 74, pp. 22–35. [Google Scholar]
- Bird, S.; Loper, E. NLTK: The Natural Language Toolkit. In Proceedings of the ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17–21 July 2006; pp. 69–72. [Google Scholar] [CrossRef]
- Jasper, D. Clang-Format: Automatic Formatting for C++. 2013. Available online: http://llvm.org/devmtg/2013-04/jasper-slides.pdf (accessed on 5 November 2020).
- Marjamäki, D. Cppcheck: A Tool for Static C/C++ Code Analysis. 2013. Available online: http://cppcheck.sourceforge.net/ (accessed on 5 November 2020).
- Spinellis, D. Dspinellis/Tokenizer: Version 1.1. 2019. Available online: https://github.com/dspinellis/tokenizer/ (accessed on 5 November 2020). [CrossRef]
- Kovalenko, V.; Bogomolov, E.; Bryksin, T.; Bacchelli, A. PathMiner: A library for mining of path-based representations of code. In Proceedings of the 16th International Conference on Mining Software Repositories, Montreal, QC, Canada, 25–31 May 2019; IEEE Computer Society: Los Alamitos, CA, USA, 2019; pp. 13–17. [Google Scholar] [CrossRef] [Green Version]
- Breiman, L. Random Forests. Mach. Learn. **2001**, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
- Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) **2011**, 2, 1–27. [Google Scholar] [CrossRef]
- Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y. Xgboost: Extreme gradient boosting. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, New York, NY, USA, 24–27 August 2015; pp. 785–794. [Google Scholar] [CrossRef] [Green Version]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. **2011**, 12, 2825–2830. [Google Scholar]
- Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv **2014**, arXiv:1412.6980. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. **1997**, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489. [Google Scholar]
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv **2019**, arXiv:1910.03771. [Google Scholar]
- Tsoumakas, G.; Katakis, I. Multi-label classification: An overview. Int. J. Data Warehous. Min. (IJDWM) **2007**, 3, 1–13. [Google Scholar] [CrossRef] [Green Version]
- Wang, K.; Su, Z. Blended, Precise Semantic Program Embeddings. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, London, UK, 15–20 June 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 121–134. [Google Scholar] [CrossRef]

**Figure 1.** An example of a classic algorithmic challenge. The large upper bound for N, which represents the maximum input size, suggests that an efficient solution needs to employ a dynamic programming technique.

**Figure 2.** Attention scores for the statement of a graph problem that was misclassified as dp & greedy. The stopwords ‘the’, ‘of’, and ‘to’ are filtered out, and every word is converted to lowercase prior to training the model. Remaining words without an associated embedding are marked with brackets.

Dataset | CF | Kattis | OJ | Total
---|---|---|---|---
Labeled | 5785 | 771 | 1965 | 8521
Unlabeled | 584 | 1495 | 2935 | 5019
Avg. Num. Labels | 3.19 | 1.72 | 1.78 |

Dataset | Train | Dev | Test
---|---|---|---
dp & greedy | 1866 | 338 | 338
implementation | 1607 | 261 | 261
graphs | 1311 | 208 | 165
math | 1457 | 266 | 266
Unrated Difficulty | 1375 | 0 | 0
Easy Difficulty | 1321 | 243 | 265
Medium Difficulty | 1465 | 358 | 320
Hard Difficulty | 645 | 132 | 155
Size | 4806 | 733 | 740
Avg. Statement Length | 187.3 | 179.2 | 184.2
Avg. Input Length | 75.1 | 63.5 | 66.1
Avg. Output Length | 41.4 | 37.6 | 37.3
Vocabulary Size | 69k | 17k | 17k

dp & Greedy | Implementation | Graphs | Math
---|---|---|---
maximum | cards | graph | coordinates
a_i | letter | edges | modulo
ways | word | roads | point
modulo | lowercase | vertex | permutation
polycarp | name | edge | answer
strings | team | tree | …
${10}^{5}$ | letters | connected | considered
total | characters | u | points
array | quotes | road | a_1
choose | table | vertices | f

**Table 5.** Statistics for solutions from Codeforces (CF), Infoarena (IA), and university assignments (Uni).

Dataset | CF | IA | Uni
---|---|---|---
Labeled | 15651 | 6912 | 3860
Unlabeled | 171 | 15743 | 0
Avg. Num. Labels | 2.51 | 1.37 | 1.74
Num. Problems | 6374 | 2321 | 31
Avg. Num. Solutions | 2.48 | 9.76 | 124.51

Dataset | Train | Dev | Test
---|---|---|---
dp & greedy | 7224 | 676 | 654
implementation | 2564 | 528 | 550
graphs | 2873 | 348 | 342
math | 3303 | 576 | 593
Avg. Tokens | 643.5 | 659.4 | 670.8
Avg. Num. AST Paths | 369.4 | 345.5 | 358.0
Avg. AST Path Height | 10.28 | 10.29 | 10.28
Avg. Functions | 3.9 | 3.1 | 3.2
Num. Problems | 3562 | 633 | 633
Size | 14,314 | 1513 | 1513

Original Sequence | Normalized Surface Form
---|---
$1\le i,j\le n$ | $range(i,n),range(j,n)$
0 < n < 200,000 | $range(n,{10}^{5})$
a_0, a_1, a_2, ..., a_n | $sequence(a,n)$
$p_2, \ldots, p_n$ | $sequence(p,n)$
x_1, x_2, ..., x_{n - 1} | $sequence(x,n-1)$
$(x_i, y_i)$ | $pair({x}_{i},{y}_{i})$
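The normalization examples above can be approximated with pattern matching. The sketch below collapses indexed sequences such as "a_0, a_1, ..., a_n" into the canonical form "sequence(name,bound)"; the regular expression and the function name `normalize_sequences` are illustrative assumptions, since the paper's actual preprocessing pipeline is not shown in this excerpt.

```python
import re

# Matches indexed sequences such as "a_0, a_1, a_2, ..., a_n" or
# "x_1, x_2, ..., x_{n - 1}". The exact patterns used by the paper's
# pipeline are an assumption here.
SEQ = re.compile(
    r"([a-zA-Z])_\{?\w+\}?"              # first term, e.g. a_0
    r"(?:\s*,\s*\1_\{?\w+\}?)*"          # optional middle terms, e.g. a_1, a_2
    r"\s*,\s*(?:\.\.\.|\\ldots)"         # the ellipsis
    r"\s*,\s*\1_(?:\{([^}]*)\}|(\w+))"   # final term, e.g. a_n or x_{n - 1}
)

def normalize_sequences(text: str) -> str:
    """Replace each matched sequence with 'sequence(name,bound)'."""
    def repl(m):
        bound = (m.group(2) or m.group(3)).replace(" ", "")
        return f"sequence({m.group(1)},{bound})"
    return SEQ.sub(repl, text)
```

For example, `normalize_sequences("x_1, x_2, ..., x_{n - 1}")` yields `"sequence(x,n-1)"`, matching the table's last-but-one row.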

**Table 8.** Performance on the text classification task, measured using Hamming loss (lower is better) and F1 score (higher is better).

Model | Hamming | F1
---|---|---
Random Forest | 0.25 | 0.54
SVM | 0.27 | 0.55
XGBoost | 0.25 | 0.60
BERT | 0.29 | 0.40
CNN [5] | 0.28 | 0.56
AlgoLabelNet (Ours) | 0.27 | 0.62
*Ablation study* | |
- statement | 0.34 | 0.57
- input/output | 0.37 | 0.51
- shared encoder | 0.36 | 0.52
- pretrained embeddings | 0.33 | 0.55
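For reference, the two reported metrics can be computed from binary label-indicator vectors as sketched below. This assumes the standard multi-label definitions (one 0/1 slot per label per sample); the paper's exact averaging scheme is not shown in this excerpt.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots predicted incorrectly (lower is better)."""
    slots = sum(len(t) for t in y_true)
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / slots

def label_f1(y_true, y_pred, label):
    """F1 score for a single label column (higher is better)."""
    tp = sum(t[label] and p[label] for t, p in zip(y_true, y_pred))
    fp = sum(not t[label] and p[label] for t, p in zip(y_true, y_pred))
    fn = sum(t[label] and not p[label] for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

With four labels per sample (dp & greedy, implementation, graphs, math), a single wrongly predicted label on one of two samples contributes 1/8 to the Hamming loss.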

Label | XGBoost Precision | XGBoost Recall | XGBoost F1 | AlgoLabelNet Precision | AlgoLabelNet Recall | AlgoLabelNet F1
---|---|---|---|---|---|---
dp & greedy | 0.61 | 0.61 | 0.61 | 0.58 | 0.74 | 0.65
implementation | 0.62 | 0.36 | 0.46 | 0.47 | 0.67 | 0.55
graphs | 0.84 | 0.64 | 0.71 | 0.86 | 0.60 | 0.71
math | 0.72 | 0.53 | 0.61 | 0.59 | 0.62 | 0.61

**Table 10.** Performance on the code classification task, highlighting the different input features: byte-code (BC), AST paths, source code tokens, and anonymised code tokens.

Model | Input Features | Hamming | F1
---|---|---|---
SAFE | BC | 0.31 | 0.46
SAFE | BC (attention) | 0.35 | 0.50
Code2Vec | AST (300 paths) | 0.35 | 0.51
Code2Vec | AST (500 paths) | 0.27 | 0.55
BiLSTM | Code Tok. | 0.33 | 0.55
BiLSTM | Anon. Code Tok. | 0.39 | 0.48
AlgoCode | BC + AST (300) + Tok. | 0.31 | 0.55
AlgoCode | BC + AST (500) | 0.30 | 0.56
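The "anonymised code tokens" feature replaces user-defined identifiers with placeholder tokens so that a model cannot rely on suggestive variable names. A rough illustration of the idea follows; the keyword allow-list, the regex tokenizer, and the `var{i}` naming are assumptions, not the paper's actual anonymisation scheme.

```python
import re

# Small keyword allow-list for illustration only; a real pipeline would
# cover the full C++ keyword set and common standard-library names.
KEYWORDS = {"int", "long", "double", "char", "void", "bool",
            "for", "while", "if", "else", "return",
            "using", "namespace", "std"}

def anonymize(code: str) -> list:
    """Map each user-defined identifier to var0, var1, ... in order of first use."""
    table = {}  # identifier -> placeholder token
    out = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code):
        if tok in KEYWORDS or not re.match(r"[A-Za-z_]", tok):
            out.append(tok)  # keep keywords, number literals, and operators
        else:
            out.append(table.setdefault(tok, f"var{len(table)}"))
    return out
```

Repeated occurrences of the same identifier map to the same placeholder, which preserves data-flow structure while discarding naming cues.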

Labels | Precision | Recall | F1
---|---|---|---
dp & greedy | 0.52 | 0.61 | 0.57
implementation | 0.53 | 0.53 | 0.53
graphs | 0.68 | 0.60 | 0.64
math | 0.60 | 0.53 | 0.56

Model | Input Features | Hamming | F1
---|---|---|---
AlgoLabelNet | Text + Code | 0.26 | 0.65

Labels | Precision | Recall | F1
---|---|---|---
dp & greedy | 0.55 | 0.81 | 0.65
implementation | 0.58 | 0.56 | 0.57
graphs | 0.72 | 0.75 | 0.74
math | 0.55 | 0.81 | 0.65

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Iacob, R.C.A.; Monea, V.C.; Rădulescu, D.; Ceapă, A.-F.; Rebedea, T.; Trăușan-Matu, Ș.
AlgoLabel: A Large Dataset for Multi-Label Classification of Algorithmic Challenges. *Mathematics* **2020**, *8*, 1995.
https://doi.org/10.3390/math8111995

**AMA Style**

Iacob RCA, Monea VC, Rădulescu D, Ceapă A-F, Rebedea T, Trăușan-Matu Ș.
AlgoLabel: A Large Dataset for Multi-Label Classification of Algorithmic Challenges. *Mathematics*. 2020; 8(11):1995.
https://doi.org/10.3390/math8111995

**Chicago/Turabian Style**

Iacob, Radu Cristian Alexandru, Vlad Cristian Monea, Dan Rădulescu, Andrei-Florin Ceapă, Traian Rebedea, and Ștefan Trăușan-Matu.
2020. "AlgoLabel: A Large Dataset for Multi-Label Classification of Algorithmic Challenges" *Mathematics* 8, no. 11: 1995.
https://doi.org/10.3390/math8111995