1. Introduction
In call centers, conversation data exchanged between customers and agents are routinely analyzed to support customer request analysis, complaint detection, purchase behavior prediction, and the assessment of customer sentiment and opinions. Consequently, these data, referred to as ‘call center conversation data’ throughout this paper, represent a critical asset within call center operations, and classifying them accurately is vital for the analyses mentioned above. In practice, agents manually assign a category to each conversation at the end of the call. However, inaccurate classifications or bulk assignments to an ‘other’ category frequently occur due to factors such as overwork, fatigue, lack of experience, or carelessness, which diminishes the reliability of analyses for call center operations [
1]. As a result, automated conversation classification has remained a long-standing challenge. With the advancement of Speech-to-Text (STT) technology and the subsequent accumulation of call center conversation records, this task has become increasingly important across various research domains, including Natural Language Processing (NLP) [
2].
Also, the Korean call center industry has witnessed an increasing trend in both in-house and outsourced operations over the last five years [
3]. Not only private companies but also local governments are establishing call centers to improve citizen convenience; nevertheless, limited operational expertise and budgets make outsourcing the preferred option. As of 2024, among organizations operating call centers, 660 outsource their operations, 385 run them in-house, and 84 employ a hybrid of the two. The outsourced call center market is characterized by intense competition, leading to widespread low-price contract acquisition and strong cost sensitivity. In this environment, investment in call center IT infrastructure is a considerable challenge. Consequently, there is a pressing need to develop a lightweight and cost-effective classification system for call center conversation data.
Call center conversation data exhibit several distinctions compared to general text data [
4]. First, the informality of spoken language is prominent, characterized by frequent grammatical errors, conversational interruptions, exclamations, and abbreviations. Second, domain-specific vocabulary, technical terminology, and brand names from industries such as finance, telecommunications, healthcare, and retail are interspersed throughout the data. Third, linguistic diversity exists, encompassing the morphological complexity of Korean along with the occurrence of dialects and multilingual expressions.
A considerably more significant technical challenge lies in the severe imbalance in data distribution. The call center conversation data used in this study also showed extreme disparities in data volume across conversation categories. While some categories contained substantial amounts of data, others possessed extremely limited samples. This imbalance leads to models becoming biased toward data-rich categories during training, thereby degrading overall classification performance [
5,
6].
Previous studies on call center conversation data classification primarily sought to automate this process using traditional Natural Language Processing (NLP) techniques such as rule-based regular expressions or TF-IDF (Term Frequency-Inverse Document Frequency) weighting. However, these approaches proved insufficient for the unique characteristics of call center conversation data, particularly the data imbalance problem. Furthermore, while some foundational work applied a Convolutional Neural Network (CNN) model to classify these data, the resulting automated classifications were not fully trusted by management, largely because of the inherent characteristics of call center data [
2,
4].
To overcome these technical challenges, this study proposes the following approaches: First, after selecting validated baseline models, we apply data augmentation techniques, including Easy Data Augmentation (EDA), to mitigate the imbalance between categories [
7]. Building upon this, we aim to overcome the limitations of existing language model-based classification performance by ensembling meta-information derived from Named Entity Recognition (NER) results [
8]. Through these multifaceted approaches, this study aims to extract the unique characteristics of call center conversations and develop classification models optimized for them, thereby achieving superior classification performance even in imbalanced and unstructured data environments.
To this end, we conduct nine experiments across three stages: first, establishing baseline models using LSTM and BERT; second, fine-tuning these baseline models with performance enhancements achieved through various data augmentation techniques; and third, ensembling meta-information. Through this process, we ultimately aim to achieve improved performance compared to previous models, while ensuring efficient operation on inexpensive, lightweight computing infrastructure.
The remainder of this paper is organized as follows.
Section 2 surveys pertinent literature on call centers, automated call analysis, and text classification methodologies, encompassing foundational techniques such as LSTM, Transformer, BERT, EDA, and NER.
Section 3 defines the challenge of classifying imbalanced Korean conversational data from call centers. It further presents the overarching three-stage framework adopted in this research and elaborates on the nine distinct models specifically engineered to enhance classification efficacy.
Section 4 provides a detailed exposition of the proposed methodology and models. This includes the data preprocessing pipeline, the architectural configurations of the baseline LSTM and BERT models, the applied data augmentation strategies (i.e., under-sampling, over-sampling, and EDA), and the architecture of the ensemble model, which integrates NER-derived meta-information via CatBoost.
Section 5 showcases the empirical results obtained from all nine developed models, featuring a thorough performance analysis based on key metrics such as precision, recall, and F1 score. The discussion further investigates the influence of data augmentation and meta-information on classification accuracy, with a specific focus on bolstering performance for minority classes. Finally,
Section 6 offers concluding remarks, summarizing the principal findings and their implications. It underscores the efficacy of the proposed ensemble methodology in alleviating data imbalance issues, acknowledges this study’s inherent limitations, and proposes avenues for future research aimed at further advancing Korean text classification within call center applications.
4. Methodology
4.1. Data Preprocessing
Call center conversation data undergo a multi-step preprocessing procedure to ensure data quality and analytical relevance. First, records from the Speech-To-Text (STT) output with excessively short call durations are excluded. Calls lasting less than one minute are removed because the category labels entered by agents in such cases often do not reflect the actual conversation content and are frequently entered as a default click, resulting in low data reliability. Similarly, transcribed records with fewer than a set number of tokens (e.g., fewer than 20) are filtered out to eliminate insufficiently informative data.
Second, portions of the conversation that are not pertinent to the analysis are removed. In call center conversations, fixed expressions such as greetings and automated guidance scripts frequently occur and are considered noise. These are eliminated by discarding the initial segment of each STT transcript, typically of a predetermined length.
Finally, a curated list of removable strings is maintained and applied in batch processing. Utterances that indicate active listening—such as backchanneling responses—are largely removed, as they contribute little to accurate category classification. However, expressions tied to specific customer issues, such as complaints, are retained for their potential informative value. For example, phrases like “I understand” or “Yes, yes” are excluded, whereas expressions such as “I’m sorry” or “My apologies,” which are indicative of customer dissatisfaction, are preserved.
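A minimal sketch of this pipeline is shown below, assuming the STT output has been loaded into a pandas DataFrame with hypothetical columns `duration_sec` and `text`; the opening-segment length and the removable phrases are illustrative, while the one-minute and 20-token thresholds follow the rules described above.

```python
import pandas as pd

MIN_DURATION_SEC = 60     # calls shorter than one minute are dropped
MIN_TOKENS = 20           # transcripts with fewer tokens are dropped
OPENING_CHARS = 100       # assumed length of the fixed greeting segment to discard
REMOVABLE = ["네 네", "알겠습니다"]   # backchanneling phrases ("yes, yes", "I understand")
# Complaint-related expressions such as "죄송합니다" ("I'm sorry") are deliberately kept.

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # 1) Drop unreliable records: very short calls and very short transcripts.
    df = df[(df["duration_sec"] >= MIN_DURATION_SEC)
            & (df["text"].str.split().str.len() >= MIN_TOKENS)].copy()

    # 2) Discard the fixed opening segment (greetings, automated guidance).
    df["text"] = df["text"].str[OPENING_CHARS:]

    # 3) Batch-remove curated backchanneling strings.
    for phrase in REMOVABLE:
        df["text"] = df["text"].str.replace(phrase, "", regex=False)

    return df.reset_index(drop=True)
```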
4.2. Base Model
4.2.1. Model 1. [L] Long Short-Term Memory (LSTM) Model
In Model 1. [L], a fundamental LSTM-based model was used. The output activation is Softmax, and CrossEntropyLoss is used as the loss function. The batch size was set to 64, as shown in
Table 3.
CrossEntropyLoss was selected as the loss function due to its well-established capability to deliver high accuracy in multi-class classification tasks and its prevalence as a default choice in numerous deep learning models. It exhibits stable performance across diverse datasets and models, frequently yielding favorable outcomes without requiring specific hyperparameter optimization [
22]. It is acknowledged that CrossEntropyLoss can exhibit diminished performance on minority classes when addressing imbalanced datasets. While alternative loss functions such as Weighted Cross-Entropy Loss and Focal Loss can enhance classification performance by assigning weights to imbalanced data, they were excluded to mitigate potential confounding factors in this research. The present study aims to enhance performance through data augmentation and ensembles leveraging meta-information. To minimize the introduction of additional variable factors, the category-neutral CrossEntropyLoss was retained.
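As a point of reference, the baseline can be sketched as follows; the vocabulary, embedding, and hidden sizes are illustrative assumptions, while the Softmax output, CrossEntropyLoss, and batch size of 64 follow the configuration described above. Note that PyTorch's CrossEntropyLoss applies the softmax internally, so the model returns raw logits.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    # Illustrative sizes; only the loss function, output activation,
    # and batch size are taken from the configuration in the text.
    def __init__(self, vocab_size=32000, embed_dim=128, hidden_dim=256, num_classes=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)         # last hidden state per sequence
        return self.fc(h_n[-1])                   # logits; softmax applied by the loss

model = LSTMClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

token_ids = torch.randint(1, 32000, (64, 128))   # dummy batch of size 64
labels = torch.randint(0, 6, (64,))
loss = criterion(model(token_ids), labels)
loss.backward()
optimizer.step()
```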
4.2.2. Model 2. [B] Bidirectional Encoder Representations from Transformers (BERT) Model
In Model 2. [B], a fundamental BERT-based model was used. For the BERT model, KoBERT provided by SK T-Brain [
23] served as the pre-trained model, and hyperparameters such as the learning rate, optimization algorithm, warmup ratio (the proportion of total training steps during which the learning rate is gradually increased before decaying), and dropout rate were tuned. Training took approximately six hours, which limited the exploration of extensive configuration variations. Training was initiated with a maximum sequence length of 128 and a batch size of 32, as detailed in
Table 4.
The KoBERT model employs a tokenizer trained on Korean Wiki and news datasets [
23]. The pooled output vector from the BERT architecture feeds into a classifier, which consists of a drop-out layer and a fully connected layer, to output the probability of belonging to each category [
5]. Although more recent BERT variants such as ALBERT and RoBERTa are available, a model pre-trained on Korean is crucial for processing Korean call center conversations, as multilingual models have not yet demonstrated sufficient performance for practical application in this context.
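A sketch of this classifier head is shown below; the Hugging Face checkpoint name and dropout rate are assumptions made for illustration (the experiments used SK T-Brain's KoBERT with a maximum sequence length of 128 and a batch size of 32, per Table 4).

```python
import torch.nn as nn
from transformers import AutoModel

class KoBERTClassifier(nn.Module):
    """Drop-out + fully connected layer on top of BERT's pooled output."""
    def __init__(self, pretrained="skt/kobert-base-v1",   # assumed checkpoint identifier
                 num_classes=6, dropout=0.1):
        super().__init__()
        self.bert = AutoModel.from_pretrained(pretrained)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.pooler_output                 # [CLS]-based pooled vector
        return self.classifier(self.dropout(pooled))   # per-category logits
```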
Similar to our use of KoBERT, we also evaluated KoGPT2, distributed by SK Telecom (
https://github.com/SKT-AI/KoGPT2 (accessed on 15 May 2025)). However, its unidirectional decoder architecture inherent to the GPT model resulted in lower performance, leading to its exclusion from this study.
4.3. Data Augmentation
Call center conversation data characteristically exhibit significant differences in the number of instances between categories. This data imbalance generally has a negative impact on category classification performance. We utilized under-sampling and over-sampling, which are applicable to text data, along with Easy Data Augmentation (EDA), a collection of text augmentation techniques.
4.3.1. Model 3. [BU] Under-Sampling
Model 3. [BU] is a BERT model trained on under-sampled data. Category 4 (Service request), which accounts for 25% of the total data, has 5388 records, while the mid-size groups, categories 1 through 3, range from 1500 to 2500 records. During under-sampling, records in categories 0 and 4 were randomly removed until each was reduced to the mid-group size of 2500 records. The results are shown in
Table 5.
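A minimal sketch of this step is given below, assuming the preprocessed conversations sit in a pandas DataFrame with a `category` column; only the 2500-record target for categories 0 and 4 comes from the text.

```python
import pandas as pd

def undersample(df: pd.DataFrame, targets: dict, seed: int = 42) -> pd.DataFrame:
    """Randomly trim over-represented categories down to the given sizes."""
    parts = []
    for cat, group in df.groupby("category"):
        n = min(targets.get(cat, len(group)), len(group))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)

# Model 3 [BU]: reduce categories 0 and 4 to the mid-group size of 2500 records.
balanced = undersample(df, targets={0: 2500, 4: 2500})
```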
4.3.2. Model 4-1. [BO1] to 4-3. [BO3]—Over-Sampling (A~C)
Models 4-1. [BO1] to 4-3. [BO3] are BERT models trained on over-sampled data. For data augmentation, an over-sampling method based on random sampling and simple replication of records was employed. To increase the data volume, we implemented the following three configurations (a replication-based sketch follows the list below):
Model 4-1. [BO1]: All categories were balanced by standardizing their total number of records to 5000 each (
Table 6).
Model 4-2. [BO2]: For categories 1 through 3, which represent mid-size groups, the number of samples was doubled. The number of records in the smallest category (category 5) was increased by a factor of five (
Table 7).
Model 4-3. [BO3]: The number of records in categories 1 through 3 and category 5 was doubled (
Table 8).
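The replication-based sketch referenced above covers all three configurations; it assumes the same DataFrame layout as the under-sampling sketch, and only the multipliers and the 5000-record target come from the text.

```python
import pandas as pd

def oversample(df, target_counts=None, multipliers=None, seed=42):
    """Grow categories by randomly replicating their existing records."""
    parts = []
    for cat, group in df.groupby("category"):
        if target_counts and cat in target_counts:        # grow to a fixed size
            extra = max(target_counts[cat] - len(group), 0)
        elif multipliers and cat in multipliers:           # grow by a factor
            extra = len(group) * (multipliers[cat] - 1)
        else:
            extra = 0
        parts.append(group)
        if extra:
            parts.append(group.sample(n=extra, replace=True, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)

bo1 = oversample(df, target_counts={c: 5000 for c in range(6)})   # Model 4-1 [BO1]
bo2 = oversample(df, multipliers={1: 2, 2: 2, 3: 2, 5: 5})        # Model 4-2 [BO2]
bo3 = oversample(df, multipliers={1: 2, 2: 2, 3: 2, 5: 2})        # Model 4-3 [BO3]
```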
4.3.3. Model 5. [BE] Easy Data Augmentation (EDA)
Model 5. [BE] is a BERT model trained on data augmented using EDA. EDA provides four distinct methods for text augmentation. Synonym Replacement (SR) substitutes specific words with others of similar meaning. Random Insertion (RI) adds arbitrary words to sentences to test model robustness or expand datasets. Random Swap (RS) exchanges the positions of two words within a sentence to evaluate sensitivity to word order. Random Deletion (RD) removes random words from sentences to verify model robustness and increase data diversity [
7]. In this study, the four methods were applied with random frequencies and in random quantities, as shown in
Table 9.
The implementation of EDA for Korean texts, known as KorEDA, utilizes the Korean WordNet developed by KAIST Semantic Web Research Center [
24]. Initially, following KorEDA's recommendation (motivated by the WordNet approach's inability to account for contextual nuances), only Random Deletion (RD) and Random Swap (RS) were used to keep the augmentation safe. However, because the resulting augmented data lacked diversity, Synonym Replacement (SR) and Random Insertion (RI) were subsequently included as well. These four KorEDA techniques were applied to call center conversations with random weights.
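A simplified sketch of how the four operations can be combined with random weights is shown below; it assumes a KorEDA-style synonym dictionary (`wordnet`) mapping a word to a list of candidate replacements, and the operation choice and strength are illustrative.

```python
import random

def synonym_replacement(words, wordnet, n=1):
    words = words[:]
    candidates = [i for i, w in enumerate(words) if w in wordnet]
    for i in random.sample(candidates, min(n, len(candidates))):
        words[i] = random.choice(wordnet[words[i]])       # SR: swap in a synonym
    return words

def random_insertion(words, wordnet, n=1):
    words = words[:]
    keys = [w for w in words if w in wordnet]
    for _ in range(n if keys else 0):                      # RI: insert a related word
        words.insert(random.randrange(len(words) + 1),
                     random.choice(wordnet[random.choice(keys)]))
    return words

def random_swap(words, n=1):
    words = words[:]
    for _ in range(n if len(words) > 1 else 0):            # RS: exchange two positions
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    kept = [w for w in words if random.random() > p]       # RD: drop words with prob. p
    return kept or words[:1]

def eda_augment(sentence, wordnet):
    """Pick one of SR / RI / RS / RD at random with a random strength."""
    words = sentence.split()
    n = random.randint(1, max(1, len(words) // 10))
    op = random.choice(["sr", "ri", "rs", "rd"])
    if op == "sr":
        words = synonym_replacement(words, wordnet, n)
    elif op == "ri":
        words = random_insertion(words, wordnet, n)
    elif op == "rs":
        words = random_swap(words, n)
    else:
        words = random_deletion(words)
    return " ".join(words)
```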
The effectiveness of EDA depends on the coverage of the underlying WordNet; we therefore augmented it with vocabulary relevant to the call center domain. To do so, we employed Google's gemma3:4b model via ollama (
https://ollama.com (accessed on 15 May 2025)), a solution for running large language models in a local environment. Based on six categories of call center conversation data, we generated 5374 call center-related vocabulary items as keys for the Korean WordNet. For each vocabulary item (key), we added up to 20 similar or related words.
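The lexicon-expansion step can be sketched as follows, assuming the `ollama` Python client and a locally pulled `gemma3:4b` model; the prompt wording, the JSON-only output format, and the parsing are illustrative assumptions rather than the exact procedure used.

```python
import json
import ollama   # assumes `pip install ollama` and a running local ollama server

def related_terms(keyword: str, category: str, k: int = 20) -> list:
    """Ask gemma3:4b for up to k Korean words related to a call center keyword."""
    prompt = (f"List up to {k} Korean words that are synonyms of, or closely related to, "
              f"'{keyword}' in the context of '{category}' call center conversations. "
              f"Answer with a JSON array of strings only.")
    reply = ollama.chat(model="gemma3:4b",
                        messages=[{"role": "user", "content": prompt}])
    try:
        return json.loads(reply["message"]["content"])
    except json.JSONDecodeError:
        return []

# Each generated keyword becomes a key of the Korean WordNet used by KorEDA, e.g.
# wordnet["배송"] = related_terms("배송", "delivery inquiry")   # "배송" = delivery (illustrative)
```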
4.4. Ensemble of BERT Inference Results and NER Results
Prior research has explored the integration of meta-information for text classification. These studies concatenated the main text of technical documents with Korean project titles, research objective summaries, research content summaries, and Korean keywords into a single input sequence for training [
5]. However, call center conversations often lack such readily available meta-information.
This study proposes a novel method to enhance the performance of call center conversation category classification by ensembling the classification predictions from a machine learning model with named entities extracted from the conversation sentences, effectively utilizing these extracted entities as meta-information. This approach aims to improve classification accuracy without relying on the computationally expensive inference of LLMs, offering a more practical and cost-effective solution for real-world call center applications.
Google’s BERT model leverages bidirectional Transformer modules, effectively capturing contextual information from both preceding and succeeding tokens, which has been shown to improve the accuracy of Named Entity Recognition (NER) and demonstrates high reliability across various NER tasks [
25]. Conversely, OpenAI’s GPT employs a unidirectional approach, interpreting sentences based solely on prior tokens. While capable of handling complex contexts and lengthy texts, GPT may exhibit lower reliability in Korean language processing compared to BERT, which simultaneously considers both forward and backward contextual cues.
4.4.1. Meta-Information Generation
The Named Entity Recognition (NER) system [
8] was developed based on the pre-trained KoBERT model provided by SK T-Brain [
23]. The entity counts for each tag as shown in
Table 1 were calculated for each conversation. It was expected that tag distributions would vary by category, with tags like DT (date), PS (person), OG (organization), and LC (location) being more frequent in categories such as delivery or reservation inquiries.
A composite dataset was created by integrating the BERT model inference results with the NER summary statistics, using the conversation index as the joining key. Columns 0–5 contained the probability scores predicted by the BERT model, columns 6–19 comprised the NER tag frequency counts, and column 20 stored the ground truth category labels, as illustrated in
Table 10. This enriched dataset was subsequently used to train the CatBoost model, enabling it to leverage both semantic features from BERT and named entity information for improved classification performance.
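A sketch of how the composite table in Table 10 can be assembled is given below; it assumes `bert_probs` holds the six per-category probabilities from the BERT model, `ner_entities` maps each conversation index to its list of (span, tag) pairs, and `labels` holds the ground-truth categories, all keyed by conversation index, with illustrative column names.

```python
from collections import Counter
import pandas as pd

NER_TAGS = ["DT", "PS", "OG", "LC", "QT"]   # subset shown; the study uses 14 tags

def tag_counts(entities, tagset=NER_TAGS):
    """Count how often each NER tag occurs in one conversation."""
    counts = Counter(tag for _, tag in entities)
    return {tag: counts.get(tag, 0) for tag in tagset}

def build_ensemble_features(bert_probs: pd.DataFrame,
                            ner_entities: dict,
                            labels: pd.Series) -> pd.DataFrame:
    """Join BERT probabilities, NER tag frequencies, and the label on the conversation index."""
    ner_counts = pd.DataFrame({idx: tag_counts(ents)
                               for idx, ents in ner_entities.items()}).T
    features = bert_probs.join(ner_counts, how="inner")
    features["label"] = labels
    return features.dropna()
```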
4.4.2. Models 6-1. [BO3E] Ensemble (A) and 6-2. [BEE] Ensemble (B)
Models 6-1. [BO3E] and 6-2. [BEE] are ensemble classifiers designed to combine category weight information derived from BERT-based conversation classifiers with tag subtotal features obtained through Named Entity Recognition (NER), utilizing the CatBoost algorithm for final classification. Both models integrate two types of meta-information: (1) the category probability distributions inferred from the best-performing BERT models and (2) the frequency counts of named entity tags per conversation. Specifically, Model 6-1 employs the BERT model (Model 4-3. [BO3]) trained on resampling-augmented data, whereas Model 6-2 utilizes the BERT model (Model 5. [BE]) trained on data augmented using Easy Data Augmentation (EDA).
CatBoost was chosen as the ensemble classifier due to its strong performance in multi-class classification tasks and its ability to handle both numerical and categorical features effectively. Unlike traditional Gradient Boosting Machine (GBM)-based algorithms, CatBoost mitigates issues such as target leakage and the complexity of processing categorical variables by introducing an innovative ordering principle and a novel technique for encoding categorical features [
26]. These capabilities enable CatBoost to process categorical values automatically without requiring extensive preprocessing, resulting in consistently robust performance compared to other Gradient Boosted Decision Tree (GBDT) implementations. Leveraging these strengths, our ensemble models achieved substantial performance gains over the baseline BERT models within a relatively short development time.
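A minimal CatBoost sketch over the composite table built in the previous subsection follows; the hyperparameter values are illustrative defaults, not the tuned settings used in the experiments.

```python
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# `features` is the composite table: BERT probabilities, NER tag counts, and the label.
X = features.drop(columns=["label"])
y = features["label"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

ensemble = CatBoostClassifier(loss_function="MultiClass",
                              iterations=500, learning_rate=0.1,
                              depth=6, random_seed=42, verbose=100)
ensemble.fit(X_train, y_train, eval_set=(X_val, y_val))
val_pred = ensemble.predict(X_val)
```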
5. Experimental Results
5.1. Model 1. [L] Long Short-Term Memory (LSTM)
The final hyperparameter tuning results are presented in
Table 11. For Model 1. [L], an LSTM-based model, as referenced in
Figure 2, these tuned parameters led to a weighted F1 score of 0.5065 (
Table 12).
As shown in
Table 12, Model 1. [L] exhibits low overall accuracy. Performance is highly imbalanced across classes, with notably lower precision for every class other than class 0. Recall is generally high, which suggests the model predicts positive instances liberally and likely produces many false positives. Data imbalance appears to have a significant impact on the model's performance: the majority class (Class 0, Payment Inquiry) shows relatively reasonable performance, whereas the minority classes perform poorly.
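The per-class and weighted figures reported in Tables 12–21 are standard precision, recall, and F1 metrics; they can be reproduced with scikit-learn as sketched below, where `y_true` and `y_pred` stand for the true and predicted category indices of a test set (the values shown are placeholders).

```python
from sklearn.metrics import classification_report, f1_score

y_true = [0, 0, 1, 2, 3, 4, 5, 5]   # placeholder ground-truth categories
y_pred = [0, 0, 1, 2, 3, 4, 5, 4]   # placeholder model predictions

print(classification_report(y_true, y_pred, digits=4))            # per-class precision/recall/F1
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```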
The LSTM model demonstrated insufficient classification performance; therefore, only the BERT model was utilized in subsequent experiments. While there remains potential for further tuning, this approach was deemed adequate given the paper’s primary focus on data augmentation and ensemble methods with metadata rather than exhaustive model optimization.
5.2. Model 2. [B] Bidirectional Encoder Representations from Transformers (BERT)
The final hyperparameter tuning results are shown in
Table 13. For Model 2. [B], a BERT-based model, as referenced in
Figure 2, these tuned parameters led to a weighted F1 score of 0.9117 (
Table 14).
Overall, the model demonstrates high accuracy and performs well in the text classification task, with every class achieving an F1 score of 0.8 or higher. However, the F1 scores for Return Request (2) and After-Sales Service Inquiry (5) are slightly lower than those of the other classes, so these two classes are candidates for further improvement. In particular, the low recall for After-Sales Service Inquiry is likely related to data imbalance, suggesting that acquiring more data or adjusting the learning method for minority classes could help.
5.3. Model 3. [BU] Under-Sampling
For Model 3. [BU], an under-sampling model, as referenced in
Figure 2, these tuned parameters led to a weighted F1 score of 0.9067 (
Table 15). From Model 3. [BU], the under-sampling model, onward, the hyperparameter settings of Model 2. [B], the baseline BERT model, are reused without modification.
Overall, all classes exhibit decent performance, with F1 scores above 0.85. Under-sampling resulted in a slight increase in the F1 score for the categories with fewer instances. However, the overall performance decreased somewhat compared to the base model (F1 score decreased by 0.0050). It is necessary to explore methods to improve the classification performance across all classes.
5.4. Over-Sampling
5.4.1. Model 4-1. [BO1] Over-Sampling (A)
The first over-sampling method balanced all categories at 5000 records each, replicating records in the smaller categories until they reached that size. For Model 4-1. [BO1], an over-sampling (A) model, as referenced in
Figure 2, these tuned parameters led to a weighted F1 score of 0.9100 (
Table 16). This performance was lower than that of the base model.
Overall, the model exhibits high accuracy (0.9112) and performs decently across most classes, with F1 scores exceeding 0.8. However, the persistently low Precision for After-Sales Service Inquiry (5) indicates that this weakness cannot be resolved by simple replication-based over-sampling alone.
5.4.2. Model 4-2. [BO2] Over-Sampling (B)
To balance the number of instances across categories, the second over-sampling method doubled the number of instances for categories 1 through 3 (mid-size groups) and increased the number of instances in the smallest category (category 5) by a factor of five. For Model 4-2. [BO2], an over-sampling (B) model, as referenced in
Figure 2, these tuned parameters led to a weighted F1 score of 0.9200 (
Table 17).
Up to this point, this model demonstrated the highest overall accuracy (Accuracy: 0.9263); however, this increase was primarily driven by improvements in classes with a large number of instances, and the Precision for the AS inquiry class (5) dropped significantly. This indicates that the model frequently misclassifies inquiries that are not AS inquiries as AS inquiries. Even augmenting the AS inquiry class (which has the fewest instances) by five times its original size was ineffective. This low Precision suggests that simply replicating instances, even extensively, does not lead to a proper understanding of the characteristics of that class.
5.4.3. Model 4-3. [BO3] Over-Sampling (C)
For the third over-sampling method, the data for categories 1-3 and the data for category 5 (which had a very small number of instances) were doubled through duplication. For Model 4-3. [BO3], an over-sampling (C) model, as referenced in
Figure 2, these tuned parameters led to a weighted F1 score of 0.9220 (
Table 18).
Overall, the model demonstrates very high accuracy and exhibits decent-to-high performance across all classes. Notably, the Precision for the AS inquiry class (5), which was a major issue in previous models, improved significantly, enhancing the model's reliability. Despite the data imbalance issue, performance across classes remained consistent. These results suggest that, with simple replication-based over-sampling, replicating a class beyond twofold introduces confusion during model training, whereas doubling does not.
5.5. Model 5. [BE] Easy Data Augmentation (EDA)
The doubling scheme used in the most successful over-sampling configuration (C) was also applied when augmenting the underrepresented categories with EDA. For Model 5. [BE], an EDA model, as referenced in
Figure 2, these tuned parameters led to a weighted F1 score of 0.9250 (
Table 19).
The F1 score and Accuracy generally increased. Although the AS inquiry (5)’s F1 score slightly decreased compared to Model 4-3. [BO3], it maintained its performance, seemingly resolving the issues of previous models. Despite the data imbalance problem, the performance across different classes remained relatively balanced. However, the Recall value for Delivery inquiry (3) continued to show less improvement compared to other major classes, even when compared to the AS inquiry class.
5.6. Ensemble Model
Simply expanding the key entries of the Korean WordNet by 5734 items diversified the vocabulary enough to improve performance over earlier attempts, which had shown no meaningful difference from plain over-sampling. The diversity of NER tags increased as a consequence, which in turn improved the ensembled result. This outcome can also be attributed to the adoption of the CatBoost algorithm, which is known to perform better in multi-class classification tasks than traditional Gradient Boosting Machine (GBM)-based algorithms.
5.6.1. Model 6-1 [BO3E] Ensemble Model (A) Trained on Over-Sampled (C) Data
The level of text diversity appears to strongly influence the diversity of NER tags. When the model trained on replication-based over-sampling (doubling categories 1 through 3 and 5), which had performed well on its own, was used in the ensemble, performance actually declined.
For Model 6-1. [BO3E], an ensemble (A) model, as referenced in
Figure 2, these tuned parameters led to a weighted F1 score of 0.8969 (
Table 20).
Overall, the model’s accuracy slightly decreased, and in particular, the Precision and F1 score for AS inquiry (class 5) significantly dropped, leading to a serious issue with the model’s reliability. This indicates that the model struggles to correctly identify instances of AS inquiry (class 5) and frequently misclassifies other inquiries as AS inquiries. The performance for delivery inquiry (class 3) also still requires improvement. Further investigation is needed to determine if simple replication-based over-sampling is incompatible with CatBoost.
5.6.2. Model 6-2 [BEE] Ensemble Model (B) Trained on Data Augmented with EDA
For Model 6-2. [BEE], an ensemble (B) model, as referenced in
Figure 2, these tuned parameters led to a weighted F1 score of 0.9331 (
Table 21).
Despite the significantly smaller support for AS inquiry (5) compared to other classes, the high Precision, Recall, and F1 score indicate that the model learned the limited AS inquiry data effectively. The weighted avg. value does not show a substantial difference from the macro avg. value, suggesting that data imbalance did not significantly impact the overall performance of the model. However, due to the small number of AS inquiry data points, acquiring more data to train the model could potentially lead to further performance improvements. In conclusion, the model demonstrates an overall high performance and exhibits respectable predictive capabilities across all classes.
According to the feature importance shown in
Figure 3, the category predictions inferred by the BERT model were the most significant features. Among the NER tags, DT (date), PS (person), OG (organization), LC (location), and QT (quantity) were the most important in that order.
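The ranking in Figure 3 can be read directly off the trained CatBoost model, as sketched below; `ensemble` and `X_train` follow the earlier CatBoost sketch.

```python
import pandas as pd

# Feature importance scores aligned with the composite feature columns.
importance = pd.Series(ensemble.get_feature_importance(), index=X_train.columns)
print(importance.sort_values(ascending=False).head(10))   # BERT probabilities and top NER tags
```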
These results align with the retail industry call center domain, demonstrating the usefulness of NER tag information in category classification. Furthermore, we enhanced NER performance by expanding the Korean lexicon with keywords derived from the current conversation categories. We could further diversify the Korean WordNet by extracting keywords and contexts from existing conversations using LLMs. There is potential to further improve the performance of the ensemble model, and if we focus on a specific business domain rather than a general-purpose one, we could expand the WordNet more cost-effectively to enhance classification performance.
5.7. Summary of Experimental Results
The results of the nine experiments outlined in
Figure 2 are summarized in
Table 22. The experiment with the best performance was the ensemble model using the CatBoost algorithm, which utilized the subtotal of NER tags as meta-information for the BERT model trained on data augmented with EDA.
6. Conclusions
Call center consultation conversation data exhibit an imbalanced distribution across classification categories, presenting challenges in developing accurate classification models. Additionally, Korean, as an agglutinative language with complex morphological structures, poses unique difficulties in text data augmentation, further complicating the development of classification models. In this study, we developed methods for accurate classification of call center consultation conversation data with these characteristics.
Previous research on text category classification has demonstrated an improved performance of classification algorithms through the utilization of meta-information. In this study, we introduced the concept of utilizing Named Entity Recognition (NER) tag information as meta-information for call center consultation conversations. Based on this approach, we selected appropriately tuned baseline models, augmented the data, and proposed two novel ensemble models that additionally incorporate meta-information utilizing NER tag information according to the augmentation method.
To evaluate their effectiveness, we compared our models with seven existing models. The comparative results revealed that our proposed method achieved an F1 score of 0.9331, a 2.3% improvement over the baseline model, while accuracy similarly improved by about 2% to 0.9337. Although the final gain might appear incremental at first glance, it is noteworthy given the data imbalance. The model trained on EDA-augmented data achieved only 0.8837 Precision, 0.8246 Recall, and a 0.8531 F1 score for the AS Service (5) category, which constitutes merely 2% of the entire dataset. In the ensemble model incorporating NER tag meta-information, the classification performance for the AS Service (5) category improved markedly to 0.9245 Precision (+4.6%), 0.9800 Recall (+18.8%), and a 0.9509 F1 score (+11.5%). The final ensemble model thus effectively mitigated the data imbalance problem, thereby enhancing the accuracy of the classification model.
While this research has reached a useful conclusion, significant room for improvement remains. In Korean, simple data duplication for over-sampling is unlikely to yield significant gains, and the Korean language resources available for EDA and NER are not yet sufficient: the available Korean WordNet lacks vocabulary, EDA-based augmentation lacks diversity, and the extracted NER entities do not perfectly align with the call center domain. Given these Korean WordNet limitations, LLMs offer a quick alternative. In this study, enhancing the Korean WordNet with an LLM significantly improved the diversity of EDA-based augmentation, which in turn produced richer NER tags; ensembling the BERT model with this enhanced NER tag information substantially improved the performance of the target model.
Building upon these findings, future work on Korean text classification should prioritize developing high-quality, domain-specific Korean lexical resources using LLMs; creating NER models that leverage LLMs for more accurate entity extraction; researching efficient model lightweighting for CPU-based inference in call center environments; and exploring further ensemble techniques and meta-information strategies optimized for call center classification.
In conclusion, using LLMs to enhance Korean lexical resources, and improving EDA and NER on that basis, shows significant potential for markedly improving the performance of call center conversation classification models. We anticipate that continued research and development will contribute to more effective and cost-efficient AI-based conversation services in real-world operational environments. As each technological component used in this study continues to advance, cost-effective and high-performance text classification can be achieved by selectively incorporating the most rapidly evolving components, much like assembling Lego blocks.