Multi-Task Romanian Email Classification in a Business Context
- Phishing Emails
- These are emails that cybercriminals use to trick the receiver into disclosing sensitive personal information (e.g., credit card details, login credentials). They are very dangerous because they mislead the receiver into thinking that following the attacker’s requests is the best course of action;
- A common phishing technique is email spoofing. These emails mimic an email from a legitimate sender (e.g., Google, Apple, PayPal) and ask the receiver to take action (e.g., payment of an invoice, request to reset a password, request for updated billing information).
- Scam Emails
- These emails usually trick the user into thinking they will receive a substantial reward for completing an action. We give a few examples of the types of scams that a receiver could be subjected to;
- Current events scams are emails that use trending topics to get attention. For example, during the COVID-19 pandemic, scammers sent spam messages offering remote jobs that paid in Bitcoin or messages offering financial relief for small businesses. Ultimately, to complete the scam, the scammers would ask for bank account details;
- Another type of scam is advance-fee scam emails. These scam emails promise a big financial reward; however, the receiver must first pay a comparatively small advance to be eligible to obtain the reward. Usually, the advance is a processing fee needed to unlock a larger amount of money. A small variation on this idea involves the scammer pretending to be a family member of the receiver who is in trouble and in need of money.
- Malspam Emails
- These are emails used to infect the receiver’s device with malware. Usually, the infection occurs when the receiver clicks a link to a malicious website or opens an attachment that contains malware (e.g., ransomware, trojans, bots, crypto-miners, spyware, keyloggers). A common technique is to use a PDF, Word, or PowerPoint file with a malicious script that runs when the attachment is opened.
1.2. Business Context in Romania
1.3. Related Work
1.3.1. Public Datasets for Email Classification
1.3.2. Existing Approaches in Email Classification
1.4. Research Objectives
- Publish an anonymized version of a dataset of 1447 manually labeled emails on multiple business-related criteria. The dataset is available at: https://huggingface.co/datasets/readerbench/ro-business-emails (accessed on 12 May 2023);
- Introduce a strong baseline for the curation model that supports follow-up in-depth classifications;
- Develop a strong baseline for multi-task email classification consisting of the identification of “Personal Identifiable Information” and the five previously mentioned text classification categories. The entire codebase was open-sourced at: https://github.com/research-technology-ai/ro-business-emails (accessed on 12 May 2023).
2.1.1. Acquisition and Preprocessing
- Simple separators
- These are separators that do not contain any information about the message. A few examples of this type of separator are the following:
- “-Mesaj redirecționat-”
- “-Mesaj original-”
- “Begin forwarded message:”
- Header-aware separators
- These separators include information from the header of that specific message. The following are sample separators included in this category:
- “În <date> <user name> <user email> a scris:”
- “La <date> <user name> <user email> a scris:”
- “On <date> <user name> <user email> wrote:”
- “<date> <user name> <user email>:”
- “Quoting <user name> <user email>:”
- Header-only separators
- These separators include portions from the header of that specific message, thus being similar to the header of the original email. However, there is variability in the name of the fields and which field is present in the separator.
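The first two separator families can be approximated with regular expressions. The following is a minimal sketch, not the preprocessing code used in this work: the email pattern, the tolerance for repeated dashes, and the function name `split_thread` are assumptions made for illustration. The bare “<date> <user name> <user email>:” form and the header-only separators are left out because their variability calls for more careful handling.

```python
import re

# Simple separators: fixed marker lines taken from the examples above.
SIMPLE = [
    r"-+\s*Mesaj redirecționat\s*-+",
    r"-+\s*Mesaj original\s*-+",
    r"Begin forwarded message:",
]

# Loose e-mail address pattern (an assumption of this sketch, not a full RFC 5322 parser).
EMAIL = r"<?[\w.+-]+@[\w.-]+\.\w+>?"

# Header-aware separators: "<date> <user name> <user email>" wrapped in a fixed phrase.
HEADER_AWARE = [
    rf"(?:În|La|On)\s.+?{EMAIL}\s*(?:a scris|wrote):",
    rf"Quoting\s.+?{EMAIL}:",
]

# A separator must occupy a whole line; only non-capturing groups are used so that
# re.split() returns the message bodies without the separators themselves.
SEPARATOR_RE = re.compile(
    r"^\s*(?:" + "|".join(SIMPLE + HEADER_AWARE) + r")\s*$",
    flags=re.MULTILINE | re.IGNORECASE,
)


def split_thread(body: str) -> list[str]:
    """Split an email body into individual messages at each recognized separator line."""
    parts = SEPARATOR_RE.split(body)
    return [part.strip() for part in parts if part.strip()]
```

Messages are returned in the order in which they appear in the body, so the most recent reply usually comes first.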
- The “from” field in the header is missing or null
- The “date” field in the header is missing or null
- The email’s body and attachment list are both empty
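The three conditions above can be expressed as a single predicate. This sketch assumes that an email meeting any one of them is treated as unusable and discarded; the `RawEmail` container and its field names are hypothetical and do not come from the published codebase.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RawEmail:
    # Hypothetical container for a parsed message; field names are assumptions.
    sender: Optional[str]
    date: Optional[str]
    body: str = ""
    attachments: list[str] = field(default_factory=list)


def is_unusable(email: RawEmail) -> bool:
    """Return True if the email meets at least one of the three listed conditions."""
    missing_from = not email.sender          # "from" field missing or null
    missing_date = not email.date            # "date" field missing or null
    empty_content = not email.body.strip() and not email.attachments
    return missing_from or missing_date or empty_content
```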
2.1.2. Proposed Dataset
- We randomly sampled 2000 of the emails from the original collection;
- We manually checked all emails and removed those written in a language other than Romanian; emails written in multiple languages, one of which was Romanian, were kept. In the case of soft duplicates, only one message was kept;
- Lastly, we also added a few selected harmful spam emails retrieved from our mail inboxes to the dataset in order to ensure a higher frequency of these messages.
- Is Automatically Generated:
- This class has boolean values (i.e., True or False);
- The label is True when it can be deduced that the email was generated by an application (e.g., generated by a language model or generated by filling in personalized data on a predefined template) and False when it was composed by a human user.
- Needs Action from the User:
- This class has boolean values (i.e., True or False);
- This label is True if the email specifies the need for the person who received it to perform a specific action; otherwise, the label is False.
- Is SPAM:
- This class has boolean values (i.e., True or False);
- This label is True if the email can be considered relatively harmless or harmful spam according to the classification made in Section 1. For this annotation task, we considered the following as relatively harmless spam: marketing emails (e.g., promotional offers, ads for products, surveys) and newsletters (i.e., news from organizations about their products, services, and offers). Harmful spam emails include phishing, scam, or malspam emails; we opted to combine these two sub-categories since harmful spam had a low frequency in our dataset.
- Is Business-Related:
- This class has boolean values (i.e., True or False);
- This label is True if the email is about a subject that can be related to the company’s activity (e.g., tasks, notifications, equipment, financial transactions, legal matters); otherwise, it is False.
- Type of Writing Style:
- This class has one of the following three values: “Formal”, “Neutral”, and “Informal”;
- This class specifies the type of language used in writing the email.
- Initial Addressing:
- The selected text fragment represents an initial addressing formulation (e.g., “Salut”, “Stimate client”; in English, “Hello”, “Esteemed client”). This is generally at the beginning of the message and has the role of opening the communication act;
- Final Addressing:
- The selected text fragment represents a final addressing formulation (e.g., “O zi buna!”, “Mulțumesc!”; in English, “Have a good day!”, “Thank you!”). This is generally at the end of the message and has the role of closing the communication act. Any text that follows it is usually the signature of the sender or a disclaimer;
- Signature:
- The selected text fragment represents the signature of the sender. A signature is a fragment of text at the end of the communication, generally between the final addressing and the disclaimer (if both exist), that contains the contact information of the writer. Common information here includes the writer’s name, their position in the company, the name of the company, the address of the company, phone number, fax, emails, and other similar information;
- Disclaimer:
- The selected text fragment represents the disclaimer, a text which usually contains clauses and legal considerations in relation to the email and its content. Usually, only companies and institutions have a disclaimer for the emails sent by their employees. When it appears, the disclaimer is generally the last part of the email body;
- Personal Identifiable Information:
- The text fragment contains personal information like name, surname, bank accounts, address, and car registration identifiers;
- Information can only be considered personal if it relates to a person; in other words, a business address is not PII, but the address of “John Doe” is PII.
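The annotation scheme described above combines email-level labels with labeled text fragments, which maps naturally onto a small record type. The sketch below is one possible in-memory representation; the class and field names are assumptions and do not necessarily match the columns of the published dataset.

```python
from dataclasses import dataclass, field
from enum import Enum


class WritingStyle(Enum):
    FORMAL = "Formal"
    NEUTRAL = "Neutral"
    INFORMAL = "Informal"


class SpanLabel(Enum):
    INITIAL_ADDRESSING = "Initial Addressing"
    FINAL_ADDRESSING = "Final Addressing"
    SIGNATURE = "Signature"
    DISCLAIMER = "Disclaimer"
    PII = "Personal Identifiable Information"


@dataclass
class Span:
    start: int        # character offset where the fragment begins (inclusive)
    end: int          # character offset where the fragment ends (exclusive)
    label: SpanLabel


@dataclass
class AnnotatedEmail:
    text: str
    is_automatically_generated: bool
    needs_action: bool
    is_spam: bool
    is_business_related: bool
    writing_style: WritingStyle
    spans: list[Span] = field(default_factory=list)

    def fragments(self, label: SpanLabel) -> list[str]:
        """Return the text of every annotated fragment that carries the given label."""
        return [self.text[s.start:s.end] for s in self.spans if s.label is label]


# Illustrative example (invented text, not taken from the dataset).
text = "Salut,\nTrimit raportul.\nMulțumesc!"
example = AnnotatedEmail(
    text=text,
    is_automatically_generated=False,
    needs_action=False,
    is_spam=False,
    is_business_related=True,
    writing_style=WritingStyle.INFORMAL,
    spans=[
        Span(0, 5, SpanLabel.INITIAL_ADDRESSING),
        Span(text.index("Mulțumesc!"), len(text), SpanLabel.FINAL_ADDRESSING),
    ],
)
```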
2.2. Email Classification
- The text is split into words (which shall be referred to as tokens).
- Each word has a corresponding tag.
- There are two important tags for each annotated label: the “B” tag marks the first token of a labeled sequence, while the “I” tag marks every subsequent token inside the same labeled sequence.
- All tokens which do not correspond to a label receive the “O” tag.
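The tagging scheme above can be illustrated with a short function that converts labeled token ranges into one tag per token. This is a minimal sketch: whitespace tokenization, the short label names `INITIAL` and `PII`, and the example sentence are simplifications chosen for illustration.

```python
def bio_tags(tokens: list[str], spans: list[tuple[int, int, str]]) -> list[str]:
    """Assign a B-/I-/O tag to every token.

    Each span is (first_token_index, end_token_index_exclusive, label).
    """
    tags = ["O"] * len(tokens)               # tokens outside any label keep "O"
    for start, end, label in spans:
        tags[start] = f"B-{label}"           # first token of the labeled sequence
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"           # remaining tokens of the same sequence
    return tags


tokens = "Bună ziua , mă numesc Ion Popescu .".split()
spans = [(0, 2, "INITIAL"), (5, 7, "PII")]
tags = bio_tags(tokens, spans)
# tags == ["B-INITIAL", "I-INITIAL", "O", "O", "O", "B-PII", "I-PII", "O"]
```

When a subword tokenizer is used, the word-level tags still have to be aligned with the subword pieces; that step is outside the scope of this sketch.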
2.2.2. Multi-Head Classification
- Two BERT models pre-trained for the Romanian language, RoBERT in its base and large variants (https://huggingface.co/readerbench/RoBERT-large; https://huggingface.co/readerbench/RoBERT-base; accessed on 12 May 2023). These were chosen since an encoder pre-trained on Romanian should process Romanian text adequately.
- Two multilingual RoBERTa models (xlm-roberta-base/large; https://huggingface.co/xlm-roberta-base and https://huggingface.co/xlm-roberta-large; accessed on 12 May 2023). These were chosen since some emails can be written in multiple languages (including Romanian) and because disclaimers and signatures can also be written in English; thus, an encoder that can process multiple languages should help.
- A multilingual Longformer model (https://huggingface.co/markussagen/xlm-roberta-longformer-base-4096; accessed on 12 May 2023). Since emails, especially with the signature and disclaimers included, can exceed the normal 512-token limit of encoders, we hypothesized that a model that can process the entire context could improve results compared to the others.
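The practical effect of the context limit can be sketched by checking which encoders would see an entire email without truncation. The model identifiers come from the list above; the 4096-token limit is read off the Longformer checkpoint name and should be treated as an assumption, and the function name is hypothetical. In practice, the token count would come from each model’s own tokenizer.

```python
# Maximum input length per encoder, keyed by Hugging Face model identifier.
# 512 is the limit stated in the text; 4096 is inferred from the checkpoint name.
MAX_TOKENS = {
    "readerbench/RoBERT-base": 512,
    "readerbench/RoBERT-large": 512,
    "xlm-roberta-base": 512,
    "xlm-roberta-large": 512,
    "markussagen/xlm-roberta-longformer-base-4096": 4096,
}


def encoders_without_truncation(num_tokens: int) -> list[str]:
    """Return the encoders whose input limit covers an email of `num_tokens` tokens."""
    return [name for name, limit in MAX_TOKENS.items() if num_tokens <= limit]
```

For an email of 900 tokens, only the Longformer checkpoint is returned; every other encoder would process a truncated input.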
5. Conclusions and Future Work
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
| Abbreviation | Definition |
|---|---|
| CNN | Convolutional Neural Network |
| CRF | Conditional Random Field |
| CSS | Cascading Style Sheets |
| DOM | Document Object Model |
| GCNs | Graph Convolutional Networks |
| GDPR | General Data Protection Regulation |
| GRU | Gated Recurrent Unit |
| HTML | HyperText Markup Language |
| PII | Personal Identifiable Information |
| LDA | Latent Dirichlet Allocation |
| LSTM | Long Short-Term Memory |
| MCD | Manually Curated Dataset |
| NER | Named Entity Recognition |
| NMF | Non-negative Matrix Factorization |
| NSA | Negative Selection Algorithm |
| PSO | Particle Swarm Optimization |
| RCNN | Region-Based Convolutional Neural Network |
| SVD | Singular Value Decomposition |
| SVM | Support Vector Machine |
| TF-IDF | Term Frequency-Inverse Document Frequency |

| Category | Fleiss’ Kappa | Krippendorff’s Alpha |
|---|---|---|
| Is Automatically Generated | 0.736 | 0.737 |
| Needs Action from User | 0.383 | 0.385 |
| How is the Writing Style | 0.367 | 0.586 |

| Value range | Level of agreement |
|---|---|
| 0.00 to 0.20 | Slight agreement |
| 0.21 to 0.40 | Fair agreement |
| 0.41 to 0.60 | Moderate agreement |
| 0.61 to 0.80 | Substantial agreement |
| 0.81 to 1.00 | Almost perfect agreement |
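
Fleiss’ kappa and the agreement bands above can be reproduced with a few lines of code. The following is a minimal sketch of the standard formula, κ = (P̄ − Pₑ) / (1 − Pₑ); the rating matrix is invented for illustration and is not drawn from the annotation data, and the number of raters per item is assumed to be constant.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for counts[i][j] = number of raters assigning item i to category j.

    Every item must be rated by the same number of raters. The result is undefined
    (division by zero) when all ratings fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all assignments that went to each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement for each item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


def interpret(value: float) -> str:
    """Map a coefficient to the agreement bands listed in the table above."""
    bands = [
        (0.20, "Slight agreement"),
        (0.40, "Fair agreement"),
        (0.60, "Moderate agreement"),
        (0.80, "Substantial agreement"),
        (1.00, "Almost perfect agreement"),
    ]
    if value < 0:
        return "Below the tabulated range"
    for upper, label in bands:
        if value <= upper:
            return label
    return "Almost perfect agreement"


# Hypothetical ratings: 4 emails, 3 annotators, 2 categories (True / False).
ratings = [[3, 0], [2, 1], [1, 2], [0, 3]]
kappa = fleiss_kappa(ratings)
```

Applied to the reported coefficients, `interpret(0.736)` falls in the substantial band and `interpret(0.383)` in the fair band.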

| Label | Overall F1 Score |
|---|---|
| Personal Identifiable Info | 0.737 |

| Annotation Type | Unpartitioned Dataset | Train | Validation | Test |
|---|---|---|---|---|
| Is Automatically Generated | 21.35% | 21.42% | 20.06% | 22.41% |
| Needs Action from User | 40.98% | 40.09% | 40.83% | 43.79% |
| How is the Writing Style (Formal) | 25.29% | 26.03% | 25.95% | 22.41% |
| How is the Writing Style (Neutral) | 47.47% | 48.61% | 44.98% | 46.55% |
| How is the Writing Style (Informal) | 27.22% | 25.34% | 29.06% | 31.03% |
| Personal Identifiable Info | 77.47% | 73.04% | 94.11% | 74.13% |

| Model | F1-Score Train | F1-Score Validation |
|---|---|---|
| XLM RoBERTa Base | 0.777 | 0.704 |
| XLM RoBERTa Large | 0.786 | 0.739 |
| XLM Longformer Base | 0.547 | 0.687 |
| XLM RoBERTa Large with CRF(sum) | 0.557 | 0.618 |
| XLM RoBERTa Large with CRF(mean) | 0.704 | 0.735 |
| XLM RoBERTa Large with CRF(token_mean) | 0.909 | 0.765 |

| Model | F1 Train | F1 Validation MCD | F1 Validation ACD |
|---|---|---|---|
| XLM RoBERTa Base | 0.805 | 0.785 | 0.753 |
| XLM RoBERTa Large | 0.798 | 0.740 | - |
| XLM Longformer Base | 0.900 | 0.750 | - |
| XLM RoBERTa Base | 0.826 | 0.746 | 0.743 |

| Task | F1 Train | F1 Validation ACD | F1 Test ACD |
|---|---|---|---|
| Personal Identifiable Information | 0.353 | 0.185 | 0.237 |
| Is Automatically Generated | 0.861 | 0.807 | 0.783 |
| Needs Action from User | 0.787 | 0.740 | 0.685 |
| How is the Writing Style | 0.529 | 0.467 | 0.449 |

| Model | F1 Train | F1 Validation | F1 Test |
|---|---|---|---|
| XLM RoBERTa Base | 0.809 | 0.770 | 0.764 |
| XLM RoBERTa Large | 0.685 | 0.658 | - |
| XLM Longformer Base | 0.900 | 0.750 | - |

| Task | F1 Train | F1 Validation | F1 Test |
|---|---|---|---|
| Is Automatically Generated | 0.817 | 0.785 | 0.820 |
| Needs Action from User | 0.811 | 0.796 | 0.739 |
| How is the Writing Style | 0.561 | 0.524 | 0.462 |

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Dima, A.; Ruseti, S.; Iorga, D.; Banica, C.K.; Dascalu, M. Multi-Task Romanian Email Classification in a Business Context. Information 2023, 14, 321. https://doi.org/10.3390/info14060321