Detection of Korean Phishing Messages Using Biased Discriminant Analysis under Extreme Class Imbalance Problem

In South Korea, the rapid proliferation of smartphones has led to an uptick in messenger phishing attacks associated with electronic communication financial scams. In response, various phishing detection algorithms have been proposed. However, collecting messenger phishing data is difficult because of concerns about its potential use in criminal activities. Consequently, a Korean phishing dataset tends to be imbalanced, with general messages far outnumbering phishing ones. This class imbalance problem, together with data scarcity, can lead to overfitting and makes it difficult to achieve high performance. To solve this problem, this paper proposes a phishing messages classification method using Biased Discriminant Analysis (BDA) without resorting to data augmentation techniques. By optimizing the parameters of BDA, we achieved 95.45% Recall and 96.85% Balanced Accuracy (BA) in the phishing messages classification experiment. Moreover, compared with other algorithms, the proposed method was robust against overfitting caused by the class imbalance problem and exhibited minimal performance disparity between the training and testing datasets.


Introduction
Financial fraud criminals access their victims through mobile devices, such as smartphones, which are widely used by many people [1,2]. Specifically, scams that deceive victims and exploit them for personal gain through messages or messenger conversations are commonly referred to as messenger phishing or messenger phishing attacks [3]. Recent observations indicate a substantial rise in global messenger usage, from 2.56 billion users in 2019 to 2.91 billion users in 2020, with a projected increase to approximately 3.3 billion users by 2023 [4]. Consequently, the prevalence of phishing attacks through Social Network Services (SNS) has escalated sharply. In South Korea, which boasts the highest smartphone penetration rate, damages caused by messenger phishing reached 57.64 billion KRW (12,402 cases) in 2020, an increase of approximately 201.6% compared to the previous year. As the global adoption rate of smartphones, which serve as a medium for messenger phishing crimes, continues to increase, the incidence and impact of phishing attacks are expected to grow persistently [5,6].
Messenger phishing criminals utilize phishing messages to target their victims. Phishing messages can be defined as web links, promotional messages, or unrelated text messages that are regularly sent to a large number of recipients for advertising purposes [7]. Because they can be sent indiscriminately to a broad audience from predefined templates, they require minimal effort in comparison to voice phishing crimes and are therefore actively exploited in criminal activities [1]. Proactive classification of phishing messages by telecommunications providers can serve as an effective preventive measure against phishing attempts. However, this approach may raise concerns regarding privacy invasion and the potential for creating a 'Big Brother' problem. Therefore, a practical alternative is post-delivery classification of phishing messages at the recipient's end, such as on mobile devices, as a means of filtering phishing content.
While these methods demonstrate relatively high classification performance, several limitations remain. First, there are morphological challenges in language processing. Many non-English languages, such as Korean, Japanese, Chinese, German, Russian, and Spanish, exhibit diverse ways of expressing messages, and similar words can collide with each other, resulting in lower performance in morphological analysis [25][26][27]. In particular, Korean, as an agglutinative language, combines nouns and verbs with particles and suffixes, which significantly increases the number of derived word units and, in turn, drastically increases the number of features [28]. Therefore, alternative approaches are required when dealing with agglutinative languages. Secondly, in the training process for phishing messages classification, non-phishing messages are generally much more abundant than phishing messages. Outside of legal authorities, the collection of phishing messages is highly restricted. Consequently, there is an imbalanced data problem in which the non-target class (non-phishing), which is not the focus of classification, has a large number of samples, while the target class (phishing) has far fewer collected samples [29,30]. The imbalanced data problem becomes more severe when the class of interest is relatively rare and has a small number of samples compared to the non-target class [31]. In machine learning modeling, when the non-target class greatly outweighs the target class, learning can become biased towards the non-target class, so that the model fails to handle the variety of target-class instances that exist in real-world scenarios [30,32]. Moreover, the cost of misclassifying the target class is much higher than the cost of misclassifying the non-target class.
In this paper, we propose a method for classifying phishing messages among messages written in Korean. In the data collection phase, we assume an extreme class imbalance problem and collect data such that the non-phishing class is more than 70 times larger than the phishing class. From the text-based phishing data, we use KoNLPy's MeCab, a Korean morphological analyzer, to extract lemma keywords for verbs and nouns through lemmatization, which are then used as features. Based on the extracted features, we define the data structure by creating a Bag of Words (BoW) for the entire dataset, including phishing and non-phishing messages. To address the class imbalance problem, we employ Biased Discriminant Analysis (BDA) [33]. The primary focus of biased learning is to distinguish a specific class of interest (e.g., phishing) from other classes (e.g., non-phishing). BDA is designed to resolve the asymmetry between these target and non-target classes, and it enhances robustness, especially when dealing with small training samples [33]. In this process, the optimal parameters are selected to resolve the asymmetry between classes. Experimental results show high classification performance under the class imbalance problem, with Recall and Balanced Accuracy (BA) reaching 95.45% and 96.85%, respectively. This paper is structured as follows. In Section 2, we review related work on spam and phishing detection. In Section 3, we construct a BDA feature space for classification and propose a data classification method. In Section 4, we analyze the proposed method by selecting optimal parameters to address the class imbalance problem and evaluating performance through comparison experiments with various models. In Section 5, we conclude the paper.

Spam Detection in Balanced Dataset
Ref. [20] experimented with balanced and imbalanced class datasets. They compared traditional machine learning techniques with deep learning methods for phishing messages detection. Traditional machine learning techniques included SVM, NB, DT, LR, RF, and AdaBoost, while deep learning methods employed ANN and CNN. The experiments compared results on an imbalanced dataset (747 phishing and 4827 non-phishing messages) and a balanced dataset (1000 spam and 1000 non-spam messages). CNN performed best on both datasets, with results of 96.4% and 97.5%, respectively. Consequently, detection performance on the imbalanced dataset proved to be relatively lower than on the balanced dataset.
Ref. [18] generated embedding vectors using the TF-IDF method and detected phishing classes through SGD. The dataset consisted of 1143 entries, with 574 phishing and 569 non-phishing instances. The phishing detection performance using SGD was reported as 97.2%. Ref. [16] enhanced the performance of a spam messages detection model by employing XGBoost. The dataset was collected directly and consisted of 550 entries, with 300 spam and 250 non-spam instances. In their experiments, XGBoost showed the highest result at 82.6%.
However, phishing messages detection often involves class imbalance issues, which makes it challenging to apply these studies in real-world settings.

Traditional Methods
In [10], methods such as NB, SVM, and Maximum Entropy classifier were employed to perform smishing classification. The dataset consisted of 4827 non-spam messages and 747 spam messages. The experimental results showed classification accuracies of 90.9%, 96.4%, and 85.9%, respectively. Similarly, ref. [9] conducted a comparative analysis of various machine learning algorithms to find a suitable spam classification model for biased datasets. Five machine learning classifiers, namely kNN, Linear Support Vector Machine, RBF Support Vector Machine, RF, and DT, were applied to classify spam SMS messages. The dataset comprised 4827 non-spam messages and 747 spam messages. The experimental results indicated that Linear Support Vector Machine achieved the highest accuracy of 92.3% on the imbalanced dataset based on Hashing.
In [34], machine learning classifiers such as NB, SVM, LR, k-Nearest Neighbor (kNN), DT, and AdaBoost, as well as hybrid models like k-means+NB, k-means+SVM, and k-means+LR, were used to classify spam messages. This study combined the unsupervised k-means clustering algorithm with classification models to enhance performance. The dataset comprised 4825 non-spam messages and 747 spam messages. The experimental results showed that k-means+SVM achieved the highest classification accuracy of 92%. Ref. [8] proposed an SMS spam detection and classification model using the NB machine learning method. The dataset contains 747 spam messages and 4778 non-spam messages. The NB classification methodology achieved a performance of 97.3% on this dataset.

Deep Learning-Based Methods
In [21], the BiLSTM model was employed for phishing detection. The training dataset consisted of 6792 non-spam messages and 3200 spam messages. Using Word to Vector (Word2Vec) as the embedding model, the proposed method achieved a phishing detection performance of 91.7%. Furthermore, ref. [23] introduced a phishing-detection approach based on a hybrid model that combines CNN and GRU. The dataset consists of 5572 text messages, including 747 phishing messages and 4825 non-phishing messages. When compared to CNN, Gated Recurrent Unit (GRU), Multi-Layer Perceptron (MLP), SVM, and XGBoost, the proposed hybrid model exhibited the highest performance at 96.5%.

Gradient Boosting Methods
In [17], four rank correlation algorithms, namely Pearson, Spearman's, Kendall rank, and Point biserial, were used to determine the most suitable feature set for phishing SMS detection. The dataset consisted of 4831 non-phishing messages and 747 phishing messages. For performance evaluation, classifiers including RF, DT, AdaBoost, and SVM were compared, and the AdaBoost classifier achieved the highest accuracy of 98.7%. The phishing SMS detection results using Pearson, Spearman's, Kendall rank, and Point biserial were 90.2%, 91.0%, 91.4%, and 90.2%, respectively. Among the correlation algorithms, Kendall rank therefore showed the highest accuracy at 91.4%. In [15], spam SMS messages were detected using XGBoost, LGBM, and Bernoulli Naive Bayes. The dataset consisted of 5574 text messages, including 747 spam messages and 4827 non-spam messages. To address the class imbalance issue, down sampling was employed to equalize the number of spam and non-spam messages. LGBM exhibited a classification delay of 1.703 s and achieved high performance with an accuracy of 95.6%.

Non-Parametric Supervised Learning Methods
In [12], the performance of phishing detection was compared using machine learning algorithms kNN, and DT. The dataset consisted of 747 phishing data and 4827 non-phishing data. Among the three algorithms, DT-based phishing detection showed the highest performance at 93.1%. Ref. [13] analyzed better vectorization methods for feature extraction to detect phishing via SMS. They applied vectorization methods such as Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Word2Vec to preprocessed data. The dataset contained 638 phishing messages and 5333 non-phishing messages. Performance evaluation was conducted using RF, LR, and Gaussian Naïve Bayes classifiers.
The experimental results showed that the combination of TF-IDF vectorization and RF Classifier achieved the highest classification performance at 85.0%. Ref. [14] compared and analyzed machine learning classification algorithms for detecting spam SMS. The dataset contained 429 spam messages and 2179 non-spam messages. The algorithms used for analysis were NB, LR, DT, and RF, with RF showing the highest performance at 96.5%.
For most languages other than English, which is the language most commonly used in phishing detection research, directly collecting a phishing messages dataset is challenging. This difficulty in data collection can lead to extremely imbalanced datasets, which in turn can cause algorithmic overfitting.

Spam Detection in Extremely Imbalanced Datasets
Several studies have performed spam detection experiments in environments with extreme imbalance problems [11,19,24].In this paper, an extremely imbalanced dataset is defined as one where non-phishing messages outnumber phishing messages by a ratio of more than 10 to 1.
In [11], phishing messages written in Turkish were converted into embedding vectors from BoW and TF-IDF, and phishing detection was conducted using machine learning algorithms RF, LR, AdaBoost, and SVM. The dataset consisted of 119 phishing messages and 3526 non-phishing messages. In the TF-IDF-based dataset, RF and LR had the highest performance with 92.5%. Meanwhile, in the frequency-based dataset, RF, LR, and SVM delivered a performance of 90.0%. In [19], an efficient smishing detection system was developed using an Artificial Neural Network (ANN). The dataset comprised 5858 text messages, including 538 phishing and 5320 non-phishing messages. The detection performance of smishing using ANN was reported as 92.4%. In [24], a hybrid model combining CNN and LSTM was employed for classifying phishing messages in Arabic. This model achieved a classification accuracy of 87.9% on a dataset that included 7579 non-phishing and 785 phishing messages (Table 1).

Proposed Method
In this paper, we propose a method for classifying phishing messages in a Korean dataset with a class imbalance problem. The proposed method consists of three stages: data conversion, feature engineering, and decision. In the data conversion stage, phishing messages ($M_s$) and non-phishing messages ($M_{ns}$) are assigned as the target class and the non-target class, respectively. We then extract verb and noun keywords from the collected phishing messages using a morphological analyzer and create a numerical BoW composed of the frequency of each keyword. In the feature engineering stage, we generate the BDA feature space ($W_{BDA}$) from the training dataset with the optimal parameters, namely the regularization parameter, the number of BDA feature vectors, and the threshold needed to construct the space. Finally, in the decision stage, we measure the distance between the projected test data and the mean vector of the training data that belongs to the phishing class within the BDA feature space. Based on this distance and a specified threshold, we classify the messages as either phishing or non-phishing. The overall procedure of the proposed method is shown in Figure 1. To classify phishing messages written in Korean, the following considerations need to be taken into account.

• Data Conversion: Typically, messages are in the form of text, and they need to be converted into a format that can be understood by machines.
• Curse of Dimensionality: As in any language, using all morphemes can lead to excessively high dimensionality in the data.
• Morphology: Korean, an agglutinative language, combines nouns and verbs with particles, suffixes, and endings, resulting in a large number of derived word units and a significant increase in the number of features.
• Intention of Writing: Since phishing messages are written with similar intentions, the text often includes a multitude of similar keywords.
• Class Imbalance Problem: The number of phishing messages is extremely small compared to non-phishing messages. Similar to previous studies [10,11,24,34], we assume the class imbalance problem.

Data Conversion
A message, which includes letters and symbols in text format, needs to be converted into a numerical format so that it can be understood by machines. Generally, text data is converted and used in the form of a BoW or through TF-IDF conversion [10,11,24,34]. To generate the BoW ($X \in \mathbb{R}^{d \times N}$), we define the $d$ keywords extracted from messages written in Korean through morphological analysis as features and set the frequency of each keyword as an attribute.
Generally, texts are composed of various parts of speech, so if all the morphemes included in the collected data are used, the number of features ($d$) could become excessively large. Moreover, when analyzing messages written in Korean, an agglutinative language, the language's morphological elements must be considered. Since nouns and verbs in Korean often combine with particles and endings, the number of derived word units increases significantly, thereby also drastically increasing the number of features [28]. To address this issue, only verbs and nouns were targeted for morpheme extraction during the feature creation phase, and all words were converted to their lemmas through lemmatization. Ultimately, this approach leads to a reduction in data dimensionality.
In the feature selection stage, when non-phishing messages are defined as "everyday conversations", the freedom of the text data increases, resulting in a larger number of features. Fortunately, phishing messages tend to be written with similar content and sent to others at random, regardless of the author. Therefore, when focusing solely on phishing messages, the frequency of specific words can appear high. Figure 2 presents the major keywords with high proportions in phishing messages and the frequency with which these keywords appear in $M_{ns}$. Figure 2a visualizes the distribution of major keywords (verbs or nouns) that appear in both $M_s$ and $M_{ns}$ using a word cloud. Overall, keywords such as "application", "goods", "consulting", and "repayment" account for a significant portion. Figure 2b shows a histogram of the top 20 keywords with the highest frequency in $M_s$. For the same keywords, the frequency in $M_{ns}$ differs significantly from that in $M_s$. Specifically, despite the approximately 70-fold volume difference between the non-phishing and phishing classes, there is a distinct difference in the frequency of specific keywords between the two. Therefore, we construct the features of the BoW using only the features extracted from $M_s$.
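The BoW construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `build_vocabulary` and `bag_of_words` and the toy keyword lists are hypothetical, and the messages are assumed to be already reduced to lemmatized verb and noun keywords (in practice by a morphological analyzer such as MeCab).

```python
from collections import Counter

import numpy as np


def build_vocabulary(phishing_docs):
    """Collect the distinct keywords seen in phishing messages only (sorted)."""
    return sorted({kw for doc in phishing_docs for kw in doc})


def bag_of_words(docs, vocabulary):
    """Return a d x N count matrix: rows are keywords, columns are messages."""
    index = {kw: i for i, kw in enumerate(vocabulary)}
    X = np.zeros((len(vocabulary), len(docs)), dtype=np.int64)
    for j, doc in enumerate(docs):
        for kw, count in Counter(doc).items():
            if kw in index:  # keywords never seen in phishing data are ignored
                X[index[kw], j] = count
    return X


# Hypothetical, already-lemmatized keyword lists (toy data, not from the paper).
phishing = [["대출", "상담", "신청"], ["대출", "상환", "대출"]]
non_phishing = [["점심", "먹다"], ["상담", "가다"]]

vocab = build_vocabulary(phishing)                # features come from M_s only
X = bag_of_words(phishing + non_phishing, vocab)  # shape (d, N)
```

Restricting the vocabulary to phishing keywords keeps $d$ small even though the non-phishing corpus is far larger; a non-phishing message that shares no keyword with the phishing data simply becomes a zero column.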

Feature Engineering and Decision Making
In phishing messages classification, there is a significant class imbalance problem where the target class, which consists of $M_s$, has a considerably lower number of instances compared to the non-target class of $M_{ns}$. Fisher's Discriminant Analysis (FDA) [35] is one of the widely used methods in classification problems. FDA aims to create a feature space that maximizes the separability between classes. However, when a specific class has a significantly larger number of instances, it becomes challenging to reflect the distribution of data from the target class. As a result, FDA generally exhibits poor performance in class imbalance problems [36].
In this paper, BDA, a generalization of FDA, is employed to address the class imbalance problem. BDA can effectively handle data from both the non-target class and the target class, which exhibit asymmetric nonlinear densities [33]. In other words, it focuses on enhancing the robustness of the distribution represented by the narrow target class data, avoiding bias towards the non-target class that contains a larger amount of data. Additionally, BDA aims to find a linear transformation that minimizes the scatter of phishing data while maximizing the separation between non-phishing data and phishing data.
The data matrix $X = [X_s, X_{ns}]$ consists of the phishing dataset $X_s = \{x_i^s\}_{i=1}^{N_s}$ and the non-phishing dataset $X_{ns} = \{x_i^{ns}\}_{i=1}^{N_{ns}}$. Accordingly, the BDA objective function $W_{BDA}$ is defined as follows.
$$W_{BDA} = \arg\max_{W} \frac{\left|W^T C_{ns} W\right|}{\left|W^T C_s W\right|} \tag{1}$$
In Equation (1), the matrices $C_s$ and $C_{ns}$ represent the scatter matrices of the phishing data and non-phishing data, respectively, and can be defined as follows.
$$C_s = \sum_{i=1}^{N_s} (x_i^s - m_s)(x_i^s - m_s)^T, \qquad C_{ns} = \sum_{i=1}^{N_{ns}} (x_i^{ns} - m_s)(x_i^{ns} - m_s)^T$$
where $m_s = \frac{1}{N_s}\sum_{i=1}^{N_s} x_i^s$ is the mean of all the samples belonging to the phishing class. $W_{BDA}$ is the transformation that maximizes the variance of $W^T C_{ns} W$ and minimizes the variance of $W^T C_s W$, resulting in the maximum ratio. As a result, BDA extracts features that densely represent $M_s$ close to $m_s$ and, at the same time, separate $M_{ns}$ far from $m_s$, according to the objective function. Additionally, the BDA feature space has $\gamma = \min(N_s, N_{ns})$ effective dimensions and therefore a higher information capacity than FDA, which has only one effective dimension [33]. In the context of phishing messages classification, where $N_{ns} \gg N_s$, the effective dimensions correspond to $C_s$. The column vectors of $W = [w_1, \ldots, w_{N'_s}]$ are the generalized eigenvectors associated with the generalized eigenvalues $\lambda_i$, satisfying
$$C_{ns} w_i = \lambda_i C_s w_i.$$
They can be obtained by the simultaneous diagonalization of $C_{ns}$ and $C_s$ if $C_s$ is nonsingular. However, because $N_s$ is significantly lower than $d$ in phishing messages classification models, $C_s$ becomes singular, leading to the Small Sample Size Problem (SSSP) [37]. To address this problem, Principal Component Analysis (PCA) [38] can be employed. PCA generates $N_s + N_{ns} - 1$ feature dimensions that maximize the variance of the data based on the covariance of $X$. By selecting only $N'_s$ or fewer eigenvectors with the largest eigenvalues, the SSSP can be resolved. Consequently, Equation (1) is redefined as follows.
$$W_{BDA} = \arg\max_{W} \frac{\left|W^T \tilde{C}_{ns} W\right|}{\left|W^T \tilde{C}_s W\right|} \tag{4}$$
Here, $\tilde{C}_s$ and $\tilde{C}_{ns}$ denote the scatter matrices computed in the PCA-reduced space.
In the PCA-reduced space, the phishing scatter matrix $\tilde{C}_s$ becomes a full-rank nonsingular matrix, and as a result, diagonalization can be performed. We applied whitening to ensure that $W^T \tilde{C}_s W$ in Equation (4) becomes the identity matrix $I$. Consequently, we select the γ eigenvectors of $\tilde{C}_{ns}$ that maximize $W^T \tilde{C}_{ns} W$ when $W^T \tilde{C}_s W = I$. Nevertheless, the class imbalance problem still remains unresolved. Reflecting the distribution of phishing data in the phishing messages classification model is challenging due to the very limited number of collectible $M_s$. Regularization is one method that can augment the distribution of a class with constrained data. In this paper, we addressed the class imbalance problem by balancing the asymmetrical scales between classes, achieved by increasing the variance of $\tilde{C}_s$ through the addition of a small value µ.
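The regularized scatter matrix itself is not reproduced in this text. One form that is consistent with the description (a term added to the diagonal, no change at µ = 0, and loss of the class structure at µ = 1) is the shrinkage regularization used in the BDA literature [33]. The expression below is therefore an assumption about the intended formula, not a quotation of it; $n$ denotes the dimensionality of the PCA-reduced space.

```latex
\tilde{C}_s^{(\mu)} = (1 - \mu)\,\tilde{C}_s
  + \frac{\mu}{n}\,\operatorname{tr}\!\left(\tilde{C}_s\right) I,
\qquad 0 \le \mu \le 1
```

Under this form, µ = 0 returns $\tilde{C}_s$ unchanged, while µ = 1 replaces it with a scaled identity matrix, so every direction of the phishing class has the same variance and its original shape is lost.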
In Equation (5), µ is the parameter controlling the variance of $\tilde{C}_s$. µ serves to resolve the asymmetry between the target class and the non-target class. Generally, classification models are biased towards the distribution of the non-target class relative to the target class, which holds a small amount of data. To address this problem, regularization extends the scope of the target class by adding a small value to the diagonal elements (variances) of the covariance matrix. When µ is 0, there is no change in the distribution of the target class. As µ increases, the variance of the target class gradually expands, ultimately leading to a more robust distribution of the target class. In the extreme case of µ = 1, the target class loses the characteristics of its distribution in the BDA feature space. Figure 3 compares non-regularized BDA and regularized BDA. In Figure 3a, the conventional BDA without regularization recognizes the distribution of the phishing class (target class) as part of the non-phishing class (non-target class). On the other hand, in Figure 3b, the asymmetrical structure of a small-scale target class and a large-scale non-target class has been mitigated when µ = 0.6.
The final phase of phishing messages classification is determining whether the query data ($x_q$), converted into a vector, is phishing or not. $x_q$ is projected onto the BDA feature space, and then the Euclidean distance is measured between the projected data $W_{BDA}^T x_q$ and the mean vector of phishing data $W_{BDA}^T m_s$ from the training data. The phishing status is determined based on the calculated distance.
$$\mathrm{class}(x_q) = \begin{cases} \text{phishing}, & \left\lVert W_{BDA}^T x_q - W_{BDA}^T m_s \right\rVert_2^2 < \theta \\ \text{non-phishing}, & \text{otherwise} \end{cases} \tag{6}$$
In Equation (6), θ is the distance threshold: if $\lVert W_{BDA}^T x_q - W_{BDA}^T m_s \rVert_2^2$ is smaller than θ, the model classifies the query as phishing; otherwise, it is classified as non-phishing (Figure 4).
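To make the feature engineering and decision stages concrete, the following NumPy sketch strings together PCA, the scatter matrices centred on the phishing mean, regularization, whitening, and the distance rule. It is a sketch under stated assumptions, not the authors' code: `fit_bda` and `classify` are hypothetical names, PCA is done through an SVD, the regularizer is assumed to be the shrinkage form $(1-\mu)\tilde{C}_s + \frac{\mu}{n}\operatorname{tr}(\tilde{C}_s) I$, and the data are synthetic.

```python
import numpy as np


def fit_bda(X_s, X_ns, mu=0.65, gamma=2, n_pca=None):
    """Fit a regularized BDA projection.

    X_s: (d, N_s) phishing samples as columns; X_ns: (d, N_ns) non-phishing.
    Requires mu > 0 so that the regularized phishing scatter is invertible.
    Returns the projection W of shape (d, gamma) and the phishing mean m_s.
    """
    X = np.hstack([X_s, X_ns]).astype(float)
    # PCA via SVD of the centred data; keep only non-degenerate directions.
    U, S, _ = np.linalg.svd(X - X.mean(axis=1, keepdims=True), full_matrices=False)
    rank = int(np.sum(S > 1e-10 * S[0]))
    n = rank if n_pca is None else min(n_pca, rank)
    P = U[:, :n]

    Z_s, Z_ns = P.T @ X_s, P.T @ X_ns
    z_mean = Z_s.mean(axis=1, keepdims=True)
    C_s = (Z_s - z_mean) @ (Z_s - z_mean).T     # scatter of phishing data
    C_ns = (Z_ns - z_mean) @ (Z_ns - z_mean).T  # non-phishing scatter around m_s

    # Assumed shrinkage regularization of the phishing scatter.
    C_s = (1.0 - mu) * C_s + (mu / n) * np.trace(C_s) * np.eye(n)

    # Whitening: Wh.T @ C_s @ Wh equals the identity matrix.
    evals, evecs = np.linalg.eigh(C_s)
    Wh = evecs / np.sqrt(evals)

    # Leading eigenvectors of the whitened non-phishing scatter.
    lam, V = np.linalg.eigh(Wh.T @ C_ns @ Wh)
    top = np.argsort(lam)[::-1][:gamma]
    W = P @ Wh @ V[:, top]
    return W, X_s.mean(axis=1, keepdims=True)


def classify(W, m_s, X_q, theta):
    """Label each column of X_q as phishing if its squared distance is below theta."""
    diff = W.T @ (X_q - m_s)
    dist = np.sum(diff ** 2, axis=0)
    return dist < theta, dist


# Synthetic illustration: a tight target class inside a broad non-target class.
rng = np.random.default_rng(0)
X_s = 3.0 + 0.1 * rng.normal(size=(5, 10))
X_ns = 2.0 * rng.normal(size=(5, 500))

W, m_s = fit_bda(X_s, X_ns, mu=0.65, gamma=2)
_, d_s = classify(W, m_s, X_s, theta=1.0)
_, d_ns = classify(W, m_s, X_ns, theta=1.0)
```

In the projected space the phishing samples sit close to their own mean while the non-phishing samples are pushed away from it, which is what makes a single distance threshold θ workable.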

Experimental Results
We configured the experimental environment with an NVIDIA GeForce RTX 4080 GPU, Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz, AMD Ryzen 9 5950X 16-Core CPU, and 32 GB DDR4 RAM. For data conversion, we utilized Python version 3.7.6 and Java version 1.8.0. Additionally, we employed KoNLPy version 0.5.2 for morphological analysis of the Korean language. Specifically, we selected MeCab from the morphological analyzers available in KoNLPy, such as HanNanum, Kkma, KOMORAN, and OKT, based on both computational speed and performance.
In the experiment, we directly collected phishing messages from the web and gathered non-phishing messages from open datasets featuring everyday conversation patterns. We explored combinations of optimal parameters that exhibit high classification performance in the BDA feature space. Finally, in order to objectively evaluate the performance of the proposed method, we conducted comparative experiments with multiple machine learning-based algorithms using Distribution-Optimally-Balanced Stratified Cross-Validation (DOB-SCV) [39]. Regarding performance metrics for classification, Recall and BA were employed.
In the class imbalance problem, the cost of misclassifying the target class is significantly higher than that of misclassifying the non-target class [31]. At the same time, even a classifier that focuses solely on the non-phishing class and ignores the phishing class can reach an accuracy of nearly 100%. Recall, on the other hand, emphasizes how effectively phishing messages are correctly identified as phishing. Therefore, as the classification performance improves, Recall increases accordingly. Additionally, BA, which represents the average accuracy obtained from both classes [40], provides a comprehensive evaluation of correctly identifying actual phishing data as phishing and actual non-phishing data as non-phishing. Consequently, if an algorithm demonstrates high results in both metrics, it can be interpreted as having good phishing message classification performance.
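DOB-SCV is not described further in this text. As the procedure is commonly described, it picks a random unassigned sample of a class, finds its k − 1 nearest unassigned neighbours of the same class, and places each of these k samples in a different fold, so that every fold covers similar regions of each class. The sketch below follows that description; the function name `dob_scv_folds`, the use of Euclidean distance, and the toy data are assumptions.

```python
import numpy as np


def dob_scv_folds(X, y, k=5, seed=0):
    """Assign a fold index in [0, k) to every sample.

    X: (N, d) samples as rows; y: (N,) class labels.
    """
    rng = np.random.default_rng(seed)
    folds = np.full(len(y), -1)
    for label in np.unique(y):
        remaining = [int(i) for i in np.flatnonzero(y == label)]
        while remaining:
            # Pick a random unassigned sample of this class.
            e0 = remaining.pop(int(rng.integers(len(remaining))))
            group = [e0]
            if remaining:
                # Its k-1 nearest unassigned neighbours of the same class.
                dist = np.linalg.norm(X[remaining] - X[e0], axis=1)
                nearest = [int(i) for i in np.argsort(dist)[: k - 1]]
                group += [remaining[i] for i in nearest]
                for i in sorted(nearest, reverse=True):
                    remaining.pop(i)
            # Spread the neighbouring samples over different folds.
            for fold, idx in enumerate(group):
                folds[idx] = fold
    return folds


# Toy data: 20 non-target samples and 3 target samples.
X_demo = np.random.default_rng(1).normal(size=(23, 2))
y_demo = np.array([0] * 20 + [1] * 3)
folds_demo = dob_scv_folds(X_demo, y_demo, k=5)
```

With only three target samples and five folds, two folds necessarily contain no target sample, which illustrates why the number of folds has to be chosen with the size of the minority class in mind.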
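The contrast between accuracy, Recall, and BA can be checked with a small calculation. The sketch below uses the class sizes reported for this dataset (615 phishing and 42,594 non-phishing messages) and a hypothetical classifier that labels every message as non-phishing; the function names are illustrative.

```python
def accuracy(tp, fn, tn, fp):
    """Fraction of all messages that are classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def recall(tp, fn):
    """Fraction of actual phishing messages that are detected."""
    return tp / (tp + fn)


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of the per-class accuracies (recall and specificity)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Hypothetical classifier that predicts "non-phishing" for every message.
tp, fn, tn, fp = 0, 615, 42594, 0

acc = accuracy(tp, fn, tn, fp)           # about 0.986
rec = recall(tp, fn)                     # 0.0
ba = balanced_accuracy(tp, fn, tn, fp)   # 0.5
```

Accuracy stays close to 99% even though no phishing message is detected, whereas Recall drops to 0 and BA to 0.5, which is why the latter two are the more informative metrics here.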

Dataset
We collected 615 phishing messages and 42,594 non-phishing messages for phishing messages classification. Phishing messages associated with actual messenger phishing crimes, which involve personal information and social issues, are not disclosed due to privacy and societal concerns. In this paper, we collected a total of 615 images directly uploaded by messenger phishing victims from 2013 to 2021, converted them into text, and used them as a phishing dataset. The collected phishing dataset includes various types of crimes, such as cryptocurrency scams, advertising scams, loan scams, impersonation of public institutions, and impersonation of acquaintances. Figure 5 presents examples of the directly collected dataset. Figure 5a illustrates an example related to cryptocurrency scams, where criminals masquerade as cryptocurrency exchanges and send scam messages claiming that the victims' assets are at risk. Figure 5b showcases an example of an advertising scam, where phishing attempts are made through advertisement-like messages, such as job postings or delivery notifications. Furthermore, Figure 5c depicts a form of loan scam where criminals impersonate loan providers and deceive victims in need of money. Figure 5d demonstrates a type of fraud where criminals impersonate public institutions and induce victims to click on specific URLs, often involving scams related to COVID-19 relief funds. Lastly, Figure 5e presents an example of impersonating an acquaintance, where criminals pose as the victim's acquaintances and exploit them for financial gain. Non-phishing messages include a Twitter conversation-based dataset [41] with everyday conversational patterns, a one-shot conversation dataset [42], and a chatbot conversation dataset (2021) [43]. The Twitter conversation-based dataset includes everyday conversations between two or more speakers, with 2000 messages ranging from a minimum of 1 to a maximum of 17 turns. The one-shot conversation dataset is comprised of 38,594 messages collected through web crawling of SNS posts and online comments. Lastly, the chatbot dataset is divided into three classes: general conversations, farewells, and love-related conversations. From the 3040 conversation data in the 'general conversation' class, we selected 2000 messages for the experiments, excluding duplicates.
To transform the collected messages into a machine-understandable format, we created a BoW by tallying the frequencies of words extracted through morphological analysis. Since BoW can exponentially increase data dimensionality by including all words as features, we removed symbols, numbers, and words with fewer than two characters irrelevant to messenger phishing crimes during the preprocessing stage. Additionally, we controlled data dimensionality by employing lemmatization to use the base form of all morphemes, extracting 1533 keywords from the dataset.
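The token filter described in the preprocessing step can be illustrated in a few lines. The rule below is an assumed reading of that step, and `keep_keyword` is a hypothetical helper: a token is kept only if it has at least two characters and consists entirely of letters, so mixed tokens containing digits or symbols are dropped as a whole.

```python
def keep_keyword(token):
    """Keep tokens of two or more characters made up of letters only."""
    return len(token) >= 2 and token.isalpha()


# Toy tokens: keywords, a number, a URL fragment, a one-character word, symbols.
tokens = ["대출", "3000", "http://", "원", "상담", "!!"]
filtered = [t for t in tokens if keep_keyword(t)]
```

Here only "대출" and "상담" survive; Hangul syllables count as letters for `str.isalpha`, so Korean keywords pass the check.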

Parameters Estimation
By evaluating the performance in terms of Recall and BA for various combinations of parameters on the training dataset, the optimal parameters µ, γ, and θ for phishing messages classification can be estimated. First, selecting an appropriate µ addresses the issue of overfitting in machine learning modeling while preserving the characteristics related to the distribution of the phishing class. Increasing µ increases the variance of the relatively small phishing class, ultimately making its distribution more robust. Figure 6 illustrates the distribution of data in each BDA feature space according to different µ values. In Figure 6a-c, we can observe the improvement in the asymmetric structure between the phishing and non-phishing classes as µ increases. In Figure 6d, however, when µ approaches its maximum value of 1.0, the distribution of the phishing class expands beyond the non-phishing class and loses its distinctive characteristics. In [44], optimal values of µ between 0.1 and 0.2 were chosen for datasets that did not exhibit a relatively symmetric structure in a class imbalance problem. However, in this paper, we are dealing with an extreme class imbalance problem, so we set µ to be at least 0.3.
Secondly, it is necessary to consider the number of BDA feature vectors, denoted as γ. Generally, eigenvalues close to 0 correspond to noise and should be excluded from selection. On the other hand, selecting eigenvectors corresponding to higher eigenvalues ensures higher classification performance. In this paper, to address the SSSP, we set $N'_s = 881$ in the PCA step and choose, from at most 881 feature vectors, the γ that exhibits the best performance. Lastly, given µ and γ, we determine the threshold θ to classify phishing and non-phishing within the BDA feature space. θ is set to the value in the BDA feature space that achieves the highest classification performance. The optimal parameter combination (µ, γ, θ) for phishing messages classification was set to (0.65, 2, 2.4130). Figure 7 represents the results of estimating the optimal parameters. Figure 7a illustrates the difference in classification performance according to γ in the BDA feature space. In extreme class imbalance problems, a relatively small γ can exhibit higher performance by focusing on the structure of the non-target class. Figure 7b depicts the distribution of data in the BDA feature space when applying the optimal parameters. The phishing dataset, which contains significantly less data, retains its distinctive characteristics in the BDA feature space. The classification is well-executed based on θ, indicating the appropriate selection of parameters.
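Once µ and γ are fixed, choosing θ reduces to a one-dimensional search over the training distances. The sketch below assumes that BA is the selection criterion and that the candidate thresholds are the midpoints between consecutive distinct distances; both choices, the function name `best_threshold`, and the toy distances are assumptions made for illustration.

```python
import numpy as np


def best_threshold(dist, is_phishing):
    """Return (theta, BA) maximizing balanced accuracy for 'phishing if dist < theta'."""
    d = np.unique(dist)
    # Midpoints between consecutive distances, plus one value above the maximum.
    candidates = np.concatenate([(d[:-1] + d[1:]) / 2.0, [d[-1] + 1.0]])
    best_theta, best_ba = None, -1.0
    for theta in candidates:
        pred = dist < theta
        rec = np.mean(pred[is_phishing])        # recall on phishing samples
        spec = np.mean(~pred[~is_phishing])     # specificity on non-phishing samples
        ba = 0.5 * (rec + spec)
        if ba > best_ba:
            best_theta, best_ba = float(theta), float(ba)
    return best_theta, best_ba


# Toy squared distances to the phishing mean in the BDA feature space.
dist = np.array([0.5, 1.0, 1.2, 2.0, 1.5, 4.0, 5.0])
is_phishing = np.array([True, True, True, True, False, False, False])

theta, ba = best_threshold(dist, is_phishing)
```

For these toy values the search settles on a threshold between 1.2 and 1.5: it gives up one phishing sample at distance 2.0 in exchange for rejecting every non-phishing sample.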

Phishing Messages Classification Results
In this paper, we conducted a comprehensive evaluation by comparing the objective performance of our proposed method with specific machine learning-based algorithms. The algorithms used in this paper are as follows:
• Stochastic Gradient Descent (SGD) [ [58]
This paper applied a range of machine learning algorithms to address class imbalance problems and assess their phishing message classification performance. We evaluated the phishing message classification performance at the algorithm level by utilizing a 5-fold DOB-SCV approach suitable for class-imbalanced datasets. Table 2 presents the results comparing the classification performance of the proposed method with the methods utilized in previous studies using the same BoW generated from the dataset. Regarding the performance metric Recall in the training phase, the classification performance ranked as follows Additionally, for the BA performance metric, the classification performance ranked as follows: DT
In the testing phase, using the optimal parameters, the data that did not overlap with the training dataset was classified to determine whether it was phishing. When considering the Recall metric, the classification performance for the test dataset ranked as follows: Similarly, when considering the BA metric, the classification performance ranked as follows: The proposed method exhibited high classification performance in the dataset with a class imbalance problem, achieving 95.45% in Recall and 96.85% in BA metrics.
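To make the two reported metrics concrete, the following minimal sketch computes Recall and BA from confusion-matrix counts. It assumes BA is the standard balanced accuracy, that is, the mean of recall and specificity. The example evaluates a hypothetical trivial classifier that labels every message as non-phishing, using the class sizes of the dataset (615 phishing and 42,594 non-phishing messages), to show why plain accuracy is uninformative under this imbalance.

```python
def recall(tp, fn):
    """Fraction of phishing messages that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-phishing messages that were correctly passed."""
    return tn / (tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of recall and specificity (assumed definition of BA)."""
    return 0.5 * (recall(tp, fn) + specificity(tn, fp))


def accuracy(tp, fn, tn, fp):
    """Fraction of all messages that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical baseline: predict "non-phishing" for every message.
tp, fn = 0, 615        # all phishing messages are missed
tn, fp = 42594, 0      # all non-phishing messages are correct

baseline_accuracy = accuracy(tp, fn, tn, fp)
baseline_recall = recall(tp, fn)
baseline_ba = balanced_accuracy(tp, fn, tn, fp)
print(baseline_accuracy, baseline_recall, baseline_ba)
```

The baseline reaches an accuracy above 98% while detecting no phishing message at all, whereas its Recall is 0 and its BA is 0.5, which is why Recall and BA are the more suitable metrics here.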
The detailed analysis of these results is as follows. Traditional methods, including the probability-based NB and regression-based LR, generally lacked robustness against class imbalance. For kNN, the parameter choice greatly influenced classification performance, making the number of neighbors a critical determinant of classification efficacy. Both DT and RF demonstrated high classification performance during the training phase. However, their performance decreased significantly during the testing phase, indicating a potential overfitting issue in non-parametric supervised learning methods. SVM was effective in the binary classification problem, exhibiting strong performance in both the training and testing phases. Nevertheless, OCSVM showed inferior performance compared to SVM in both the Recall and BA metrics.
Among gradient boosting methods, LGBM displayed relatively high classification performance, yet there was a noticeable gap between its training and testing performance. Conversely, AdaBoost, which combines several weak learners to create a strong learner, showed weak classification performance on unseen data and under class imbalance. XGBoost, although designed to prevent overfitting, did not solve the class imbalance problem effectively. Regarding deep learning-based methods, traditional neural networks trained with SGD did not exhibit good classification performance in class imbalance problems. RUSBoost, which employs the Random Under-Sampling Boosting technique to tackle class imbalance, still demonstrated low classification performance throughout both the training and testing phases.
The CNN series generally showed low classification performance, but they had the advantage of producing generalized results, as indicated by the small gap in classification performance between training and testing. CNN combined with GRU showed low classification performance in the training phase but outperformed other algorithms in the testing phase, indicating its potential for practical application. LSTM is commonly employed in language recognition tasks. However, under conditions of extreme class imbalance, models combining LSTM with CNN performed worse in classification than BiLSTM models.
During the training phase, the oversampling technique SMOTE was employed to adjust the ratio of phishing to non-phishing data to 1:1. Using SVM as the classifier, this method achieved nearly perfect classification performance on both metrics, approaching 100%. However, in the testing phase, the model experienced a significant decrease in classification performance due to overfitting.
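The core idea of SMOTE-style oversampling can be sketched as follows. This is a simplified illustration rather than the configuration used in the experiments: it assumes that synthetic minority samples are generated by linear interpolation between a minority sample and one of its k nearest minority neighbors, and the function name, the value of k, and the toy data are all hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + t * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Toy data: 4 minority points to be balanced against 10 majority points.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
majority_count = 10

new_points = smote_like(minority, majority_count - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority), majority_count)  # class sizes after oversampling
```

Because every synthetic point lies on a segment between existing minority samples, the oversampled class carries no information beyond the original samples, which is consistent with the overfitting observed in the testing phase.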
Finally, the method proposed in this study demonstrated lower classification performance during the training phase compared to other algorithms. However, it exhibited high classification performance in the testing phase and produced a more generalized model, as shown by the small gap between training and testing results.

Conclusions
Globally, there is a growing trend of messenger phishing crimes [5,6]. Particularly in South Korea, with its notably high smartphone penetration rate, messenger phishing is emerging as a significant societal issue [59]. These crimes efficiently exploit phishing messages, allowing culprits to target a broad, unspecified group with minimal effort [1]. To reduce the potential damage from these phishing endeavors, proactive detection and filtering of phishing messages are crucial.
In this paper, we studied the classification of phishing in Korean messages received on mobile devices. During the data conversion phase, morphological analysis (using MeCab) was carried out on all collected messages to extract features based on verbs and nouns. By measuring the frequency of each feature across all messages, a BoW of numerical data was generated. In the feature engineering phase, we employed the BDA technique, a robust biased learning method that functions effectively under severe class imbalance conditions. In this process, we estimated BDA parameters such as the regularization parameter (µ = 0.65) and the number of BDA feature vectors (γ = 2). Importantly, the regularization parameter mitigates the asymmetrical structure between the target and non-target classes and concurrently prevents overfitting, thereby addressing the class imbalance problem. Lastly, in the decision phase, we measured the Euclidean distance between an arbitrary data point and the average vector of the phishing data, classifying the message as phishing or non-phishing based on the threshold. For the experiment, we constructed a dataset comprising 615 phishing messages and 42,594 non-phishing messages.
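The decision phase summarized above can be sketched as follows. The sketch assumes that messages have already been projected into a two-dimensional BDA feature space (γ = 2), and that a message is labelled phishing when its Euclidean distance to the mean phishing vector does not exceed θ. The direction of the comparison, the function names, and the sample coordinates are assumptions made for illustration; only the value θ = 2.4130 is taken from the text.

```python
import math


def mean_vector(points):
    """Component-wise average of a list of equally sized vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def classify(point, phishing_mean, theta):
    """Label a projected message by its distance to the phishing mean."""
    distance = math.dist(point, phishing_mean)
    return "phishing" if distance <= theta else "non-phishing"


THETA = 2.4130  # threshold reported in the paper

# Made-up projections of training phishing messages (gamma = 2 dimensions).
phishing_train = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
center = mean_vector(phishing_train)

near = classify([1.5, 1.5], center, THETA)
far = classify([6.0, 5.0], center, THETA)
print(center, near, far)
```

Only the phishing class contributes a reference point here, which reflects the biased nature of BDA: the non-phishing class is treated as everything that lies far from the target class.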
In an experiment on the classification of Korean phishing messages, characterized by a data scale difference of over tenfold (commonly referred to as the class imbalance problem), our proposed method exhibited performance improvements of at least 0.49% in Recall and 0.19% in BA when compared to machine learning algorithms used in prior studies, such as traditional methods, deep learning-based methods, gradient boosting methods, and non-parametric supervised learning methods. The proposed method effectively utilized the BDA algorithm to classify phishing by analyzing the linguistic differences between phishing and non-phishing messages. In particular, we addressed the class imbalance issue through a normalization strategy, since phishing messages occur far less frequently than non-phishing messages. When utilized for crime prevention, the proposed phishing message filtering method is expected to reduce the damages caused by crime. Furthermore, we anticipate that this approach could be applied to text-transcribed voice phishing-related messages, potentially enhancing its effectiveness in combating voice phishing crimes.
However, the method proposed in this paper presents several limitations. Firstly, finding an optimal combination of parameters for the objective function is challenging. While the appropriate parameter settings can lead to enhanced performance, they also increase the complexity of the algorithm, potentially hindering its practical application. Secondly, the BoW utilized to convert text data into numeric data has its drawbacks. Since the BoW employs every word in the training data as a dimension, it results in a high-dimensional dataset. Moreover, the BoW forms the dataset based solely on word frequencies, neglecting the relative importance of words. Lastly, although our approach demonstrates robustness in the face of the class imbalance problem, it does not guarantee the highest performance under general conditions. Consequently, when a sufficient amount of data is secured, traditional algorithms might achieve superior performance.
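The two BoW limitations noted above can be seen in a minimal sketch. It assumes that messages have already been tokenized (for instance, by a morphological analyzer), and the tokens and helper names below are made up for illustration. Each distinct token becomes one dimension, and each vector entry is a raw count with no weighting.

```python
from collections import Counter


def build_vocabulary(tokenized_messages):
    """Map every distinct token in the training data to a column index."""
    tokens = sorted({tok for msg in tokenized_messages for tok in msg})
    return {tok: index for index, tok in enumerate(tokens)}


def bow_vector(tokens, vocabulary):
    """Count how often each vocabulary token occurs in one message."""
    counts = Counter(tokens)
    # dict preserves insertion order, so columns follow the sorted tokens
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical pre-tokenized messages (English placeholders for readability).
messages = [
    ["account", "transfer", "urgent", "transfer"],
    ["lunch", "tomorrow"],
    ["account", "check"],
]

vocabulary = build_vocabulary(messages)
vectors = [bow_vector(msg, vocabulary) for msg in messages]
print(len(vocabulary), vectors[0])
```

Even this three-message example yields six dimensions, and the dimensionality grows with every new word in the training data. The entries are plain frequencies, so a common word and a highly indicative word with the same count are treated identically.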
In future work, we aim to explore classification algorithms that can be utilized in extreme class imbalance situations without the need for manual parameter tuning. In addition, in this paper, we chose the BoW approach to validate the phishing message classification performance at the algorithmic level. For the conversion of textual data into numerical data, however, we plan to incorporate state-of-the-art NLP techniques such as BERT [60] and self-attention [61], so that the context of sentences is considered in addition to word frequency. Additionally, we anticipate achieving higher classification performance by combining sampling strategies like the Synthetic Minority Over-sampling TEchnique (SMOTE) [58] with classification algorithms. If possible, with the cooperation of law enforcement agencies, we intend to collect data originating from actual crimes, as opposed to data gathered directly for phishing message classification, to validate our classification approach.

Figure 1. Overall procedure of the proposed method.

Figure 2. Visualization of common words in phishing or non-phishing messages.

Figure 3. Distribution difference of scatter matrix due to regularization.

Figure 4. Distribution of ideal data in the proposed method (red circle: phishing; blue circle: non-phishing).

Figure 6. Change in the distribution of the scatter matrix by µ.

Figure 7. Evaluation of the BA metric based on parameter settings. (a) Mean of BA by γ in the training phase. (b) Optimal BDA feature space projected with training data using BA (µ = 0.65; γ = 2; θ = 2.4130).

Table 1. Overview of previous studies on the classification performance of phishing messages.

Table 2. Experimental results of phishing messages classification performance compared with existing methods.
Summary of parameters used during the training phase for each algorithm.