1. Introduction
In the banking sector, timely repayment of loans granted to customers is critical to managing financial risks. The likelihood of loan repayment is one of the key factors directly impacting banks’ liquidity and overall financial health. The increasing number of individual customers, in particular, has made it impossible to assess each customer’s credit risk using traditional methods. Therefore, banks are increasingly turning to artificial intelligence and machine learning techniques to assess and manage credit risk [1].
Effective collection processes enable banks to meet their operational needs by maintaining cash flow and reducing non-performing loan rates. This reduces risks in the financial system, supporting overall economic stability. This, in turn, increases banks’ profitability and positively impacts their capital adequacy ratios. Furthermore, compliance with regulatory requirements (BRSA rules) minimizes legal risks and strengthens customer confidence [2].
In recent years, credit default rates in Europe have increased due to economic recessions and various macroeconomic factors. A study by Barbaglia et al. [3] examined credit default behavior in Europe, showing that borrower characteristics, loan-specific variables, and local economic conditions have a significant impact on default. This study highlights the importance of regional risk assessment policies.
In particular, the aftermath of COVID-19 has led to a noticeable shift in customer payment behavior due to rising global inflation and interest rates. Combined with the decline in bank profitability, this has made data-driven analyses and forecasting processes even more important.
This study’s literature review examined the use of machine learning and deep learning models in predicting credit risk using information obtained from tabular and textual data.
Machine learning is a powerful tool used to create predictive models by extracting meaningful information from large data sets. Deep learning methods and recurrent neural networks, such as LSTM (Long Short-Term Memory), are particularly capable of effectively analyzing time series data and natural language processing problems [4,5].
The importance of machine learning in collection processes has been investigated for various loan types, such as mortgages and agricultural loans, with successful results. A study by Sirignano et al. [6] highlighted the use of deep learning models in mortgage risk analysis, achieving high accuracy in predicting borrower behavior and loan performance. This study explored the dynamics of credit default risk using large datasets.
In response to the economic downturn in the agricultural sector and concerns about farmers’ repayment capacity, the use of machine learning methods in agricultural loan delinquency estimation has made significant contributions to more accurately predicting financial stress. In this context, several studies have demonstrated the effectiveness of machine learning algorithms in predicting financial stress [7].
The LSTM algorithm demonstrates superior performance in credit risk prediction by effectively analyzing time series data [8]. Gao et al. achieved high accuracy in predicting credit card defaults using the XGBoost-LSTM model, which is a combination of XGBoost and LSTM models. This model achieved high classification accuracy without requiring feature engineering [9].
In a 2011 study, Loughran and McDonald suggested that the language used in financial reports can help make better predictions about companies’ financial situations [10].
In their research on Lending Club data, Kriebel and Stitz showed that even short user-generated texts significantly improved loan default predictions. Deep learning methods were found to be more effective at extracting valuable information from texts than traditional machine learning approaches [11].
Fu et al. [12] demonstrated the effectiveness of using deep learning methods to predict the default risk of online P2P platforms. Specifically, they found that BiLSTM-based models can successfully predict the default risk of platforms by extracting and using keywords from investor reviews.
Additionally, techniques incorporating deep learning architectures such as convolutional neural networks (CNNs) and transformer models like BERT and RoBERTa have been shown to be highly effective at processing and analyzing unstructured text data. These models demonstrate superior performance in loan default prediction tasks, providing banks with robust tools for managing the challenges of large datasets [11].
Beyond structured financial data, there is growing interest in using text data to more accurately predict credit risk. Various studies in recent years have demonstrated that meaningful features can be extracted from text using natural language processing (NLP) and deep learning techniques. The findings of some prominent research in this area are summarized below.
A study by Wang, Qi, Fu, and Liu investigated how textual descriptions found in loan applications could be integrated with traditional financial data to predict default. They concluded that unstructured textual data contains clues about borrowers’ intentions and behaviors, and that this information can improve model performance. This study, as an early example, highlights the direct contribution of text-based descriptions to credit risk assessment [13].
The 2019 article “When Words Sweat: Identifying Signals for Loan Default in the Text of Loan Applications” by Netzer and colleagues explores how application letters, which specify the applicant’s purpose in applying for a loan, can be used to predict loan default. Using text mining and machine learning tools, the authors analyzed more than 120,000 loan applications and found that textual information is important in predicting loan default. The paper concludes by highlighting the value of combining textual data with machine learning in improving the accuracy of credit scoring models [8].
By analyzing loan evaluation notes written by credit experts on small business loans, Stevenson, Mues, and Bravo showed how text data can add value to structured data in predicting default. Using deep learning-based models, the study found that text input significantly contributes to the model and improves predictive performance compared to models relying solely on numerical data, demonstrating that text data can be an important signal carrier in credit decision-making processes [14].
Lis, Kubkowski, Borkowska, Serwa, and Kurpanik analyzed the text contained in bank reports prepared for the audit and validation of credit risk models. This analysis, conducted using embedding and clustering methods, automatically extracted model errors and structural weaknesses from the text. This approach demonstrates that text data can be effectively used not only in the forecasting process but also in processes such as model validation and auditing [15].
Sanz Guerrero and Arroyo analyzed the descriptions written by borrowers on peer-to-peer lending platforms and investigated how large language models (LLMs) can be used to predict credit risk. Developed with the ChatGPT architecture, the model identified content related to default risk by evaluating linguistic patterns in the text and automatically generated a risk indicator based on this information. This study is noteworthy because it demonstrates that unstructured text can be used not only as a complementary but also as a standalone risk indicator. Furthermore, it provides a concrete application example of how LLMs can be used to replace or complement traditional statistical models [16].
Alamsyah, Hafidh, and Mulya developed an alternative credit scoring method using text data obtained from social media interactions of fintech users in Indonesia. The model, created by analyzing social media text, was able to predict the risk profile of individuals without traditional credit history, thus contributing to increased financial inclusion. This study demonstrates that unstructured data can be a valuable resource, especially for individuals lacking a traditional financial history [17].
In a study published in 2025, Wu, Dong, Li, and Shi compared the default prediction performance of user disclosures in loan applications, both in their original form and in versions refined using ChatGPT (OpenAI, available at https://chat.openai.com, accessed on 15 July 2025). They found that the AI-rewritten disclosures were clearer, more consistent, and more learnable by models, resulting in higher predictive accuracy. This study demonstrates the direct impact of text quality on model accuracy and highlights the criticality of text processing in credit risk analysis [18].
In recent years, large language models (LLMs) and Transformer-based frameworks have made significant progress in text processing. Although these models demonstrate high performance on textual tasks, the focus of the current study was to compare the performance of different embedding methods in the Turkish language, so the more standardized LSTM was chosen. This allowed us to clearly observe the effects of individual embedding methods, while controlling for variations in model architecture.
The Transformer architecture was first introduced by Vaswani et al. in 2017 [19], and multimodal models like GPT-4 have been built with scaled-up versions of this architecture [20]. Features of LLMs such as output quality and structure awareness have been evaluated in studies such as MDEval [21]. Quantization techniques have also been proposed to enable LLMs in resource-constrained environments [22]. Although these studies reflect the methodological evolution of the field, a more controlled structure was preferred for the embedding comparisons that were the aim of the current study.
The data analyzed in this study were obtained from audio recordings of phone calls between a customer representative and a borrower. Automatic speech recognition, which is used to convert such data into text, is a developing field of research, particularly for low-resource languages.
A study conducted by Mussakhojayeva, Dauletbek, Yeshpanov, and Varol developed multilingual speech recognition models for the Turkic languages. Monolingual and multilingual approaches were compared in models developed for ten different Turkic languages, and it was found that multilingual models, in particular, improved the character error rate (CER) by up to 50%. Furthermore, by providing an open-source corpus containing approximately 218 h of audio data for Turkish, this study made a significant contribution to speech recognition research in low-resource languages. Because Turkic languages share similar morphological structures, the data sharing and transfer learning strategies proposed in the study offer potential for analysis of Turkish speech data. In this context, the study serves as an important reference for researchers seeking to improve the quality of text data obtained from Turkish speech [23].
Kheddar, Hemis, and Himeur’s study, “Automatic Speech Recognition Using Advanced Deep Learning Approaches: A Survey,” provides a detailed review of how deep learning-based techniques (e.g., deep transfer learning, federated learning, reinforcement learning, and Transformer architectures) have been used in automatic speech recognition (ASR) in recent years. This study addresses current challenges such as data scarcity and computational cost for low-resource languages and proposes advanced methodologies to overcome these hurdles. Its analyses of how model architectures, data preprocessing strategies, and domain mismatch affect ASR performance relate directly to the present study’s approach of isolating and measuring the impact of embedding methods on Turkish speech text data [24].
Gormez developed a specially designed deep learning-based speech recognition system, taking into account the challenges specific to the Turkish language. In this study, acoustic models created using CNN, GRU, and LSTM layers were integrated with a language model supported by the Zemberek library. Experiments showed that the system, supported by a language model using the Turkish Speech Corpus, provides significant improvements in metrics such as Word Error Rate (WER) and Character Error Rate (CER). Developed specifically to account for the agglutinative structure and phonetic features of Turkish, the system provides an important example of both infrastructure design and language model use in Turkish ASR studies. This work can contribute to reducing errors in the speech-to-text conversion process and guide the production of high-quality Turkish transcripts [25].
In a study published in 2025, Dhahbi, Saleem, Bourouis, Berrima, and Verdú developed an end-to-end (E2E) automatic speech recognition system for low-resource languages. The study demonstrated the feasibility of training a model directly from raw data, eliminating the need for speech feature extraction, traditional language models, and labor-intensive preprocessing steps. Synthetic speech data generation and data augmentation techniques resulted in lower Word Error Rate (WER) and Character Error Rate (CER); this approach is promising for resource-poor languages like Turkish. In this context, it provides methodological support for the present work as a reference in which the effect of embedding methods can be observed more clearly when the model architecture is kept constant [26].
In an article published in 2025, Höhne, Lenzner, and Claassen compared the accuracy of ASR systems’ text output when open-ended questions were answered via voice in a survey conducted via smartphones in Germany. A comparison of two prominent ASR tools, the Google Cloud Speech-to-Text API and OpenAI Whisper, found that Whisper produced more accurate transcriptions, while the Google API had a speed advantage. Furthermore, factors such as audio quality and background noise were observed to affect the error rate in both systems. This study provides a good example in the literature that concretely demonstrates the types of errors that can occur during transcription and how system choices affect the results [27].
Despite the increasing number of studies in recent years, the literature lacks sufficient work on text-based deep learning methods in the collections field. As the number of studies grows, collection processes will be further optimized. The high inflation and payment difficulties experienced worldwide in the post-COVID era further increase the importance of this topic. The data sources and algorithms used will directly affect the performance of future studies in this area.
This study aims to improve credit risk management in the banking sector by analyzing customer conversations recorded in Turkish and predicting whether customers who promised to pay will fulfill their commitments. The study compared different embedding methods to create meaningful representations from text data and used LSTM-based deep learning models to capture long-term dependencies. The findings can help banks improve their risk management processes and make their credit assessment mechanisms more effective.
In the following sections of the study, the dataset and data processing steps will be explained, followed by a detailed description of the structure and performance analyses of the proposed model. Finally, the findings will be discussed, and the contributions made within the scope of the study and recommendations for future research will be presented.
2. Dataset
The dataset used in this study was obtained from telephone conversations with customers in default. According to the bank’s strategic decisions, customers are called after a certain number of days of overdue payment to obtain payment commitments. The audio recordings from these conversations were converted to text data using speech-to-text technology and stored in an MS SQL database. The database records each speaker’s statements sequentially and chronologically.
Data collection was conducted through the bank’s call center systems. The interviews were conducted between bank representatives and customers under natural conditions. The transcribed text data was systematically stored, and each speaker’s statements were arranged in a way that preserved the flow of the conversation. Additionally, the database includes metadata such as the date and duration of the call and customer account information.
This dataset consists of anonymized transcripts of customer service call recordings within the framework of relevant financial and legal regulations. The data has been processed in accordance with legal regulations, corporate privacy rights, and data protection legislation. No personal information is included in the data pool. Data processing is conducted in accordance with confidentiality and security policies; access to data is provided only to authorized individuals via secure storage environments. Therefore, individual data separation is not required.
Interviews typically include three main types of interactions: payment commitments received from customers, information about the product in arrears, and details of the conversation. These interactions provide critical information for understanding customers’ payment behavior and communication patterns.
Such data provides valuable insights for developing payment prediction models and optimizing customer communication strategies.
This dataset aims to yield more realistic and applicable conclusions about customer behavior and communication patterns because it was collected under real-world conditions rather than in controlled laboratory environments. Storing data sequentially and in detail offers significant advantages for analyzing conversation dynamics and temporal patterns. This study aims to enhance a bank’s ability to predict payment behavior and optimize customer communication strategies through detailed analysis of past interactions.
In the raw dataset, each conversation is captured sequentially between the customer contact center employee and the customer, as shown in Table 1.
Descriptive statistics for the raw data are given in Table 2.
The data presented in the table provide important clues for assessing the effectiveness of customer conversations and customer adherence to payment commitments. Furthermore, the broad scope of the dataset demonstrates its richness and diversity.
These descriptive statistics shed light on the bank’s efforts to optimize its customer communication strategies regarding overdue payments. Data such as the frequency and duration of conversations, word distribution, and payment commitment rates provide important information for evaluating the effectiveness of customer interactions and the performance of representatives. This data can be used to develop future communication strategies and customer management policies.
In the next section, the raw dataset will be combined and cleaned, keeping each conversation in a single line for use in analysis and modeling.
3. Materials and Methods
3.1. Overview
The research presents a framework that includes data collection, data cleaning, data digitization, modeling, and model evaluation, as shown in Figure 1. Data collection is explained in Section 2. During the data cleaning phase, errors in the data were corrected and the data was standardized. Data digitization was performed using BERT, FastText, and Self-Embedding methods. The LSTM (Long Short-Term Memory) algorithm, one of the most successful deep learning methods in text mining, was used in the modeling, with the aim of establishing an appropriate architecture for estimating the target variable. Finally, the model outputs were analyzed, and the differences between customers who fulfilled their payment promises and those who did not were examined to infer information about customer payment performance.
3.2. Data Cleaning
Data cleaning is crucial in text mining processes [13]. In his study on Turkish text classification, Borandağ revealed that simplification, spelling corrections, and special character cleaning during preprocessing directly affect model performance [28]. Similarly, Zümberoğlu and Eren reported that linguistic cleaning and structural adjustments to the dataset they developed for Turkish sentiment analysis significantly increased model accuracy. Both studies demonstrate that models suffer semantic losses when the language structure and morphological richness specific to Turkish are not carefully processed. In this context, a comprehensive data cleaning process was applied to the Turkish speech transcripts used in the current study, taking into account language distortions, repetitive patterns, indirect expressions, and systematic errors originating from the audio data [29].
In this study, unnecessary spaces and special characters (periods, commas, etc.) were first removed from the text to transform the raw data into an analyzable format, as shown in Figure 2. This step reduced noise in the data and standardized the texts. Second, all words in the dataset were converted to lowercase. Because uppercase and lowercase letters can cause the same word to be perceived differently in text mining processes, this conversion increased the accuracy of word frequencies and the consistency of the dataset. Finally, the longest and most detailed step of the data cleaning process consisted of manual corrections, including unigram and bigram checks, to fix spelling and grammatical errors. Records with a high number of incomprehensible words were deleted.
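The automated portion of these steps can be sketched as follows; the DataFrame, column names, and example utterances are illustrative rather than taken from the study:

```python
import re

import pandas as pd


def clean_transcript(text: str) -> str:
    """Remove special characters, collapse whitespace, and lowercase."""
    text = re.sub(r"[^\w\s]", " ", text)      # strip periods, commas, and other special characters
    text = re.sub(r"\s+", " ", text).strip()  # normalize repeated spaces
    # Note: Python's default lower() maps 'I' to 'i'; a Turkish-aware
    # lowercasing step would map 'I' to 'ı' instead.
    return text.lower()


# Hypothetical DataFrame of raw transcript lines
df = pd.DataFrame({"utterance": ["Merhaba,  ödeme sözü veriyorum!!", "Bugün ÖDEME yapacağım."]})
df["utterance_clean"] = df["utterance"].map(clean_transcript)
```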
During the unigram and bigram checks, the dataset was examined separately for customers who kept their payment promises and those who did not, as illustrated in Table 3. This allowed for both a more detailed understanding of the dataset and a more careful manual data cleaning process.
The shortest and longest conversations in the data were examined, and it was determined that these conversations frequently involved issues such as calls not being completed due to technical problems or being dropped quickly. To ensure that the model learned from complete, more accurate data, outliers were identified, and conversations shorter than the 0.1 quantile or longer than the 0.9 quantile of conversation length were removed from the dataset.
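A minimal sketch of this quantile-based filter, assuming a per-conversation word-count column with a hypothetical name, might look like this:

```python
import pandas as pd


def drop_length_outliers(df: pd.DataFrame, col: str = "word_count") -> pd.DataFrame:
    """Keep only conversations between the 0.1 and 0.9 quantiles of length."""
    lower, upper = df[col].quantile(0.1), df[col].quantile(0.9)
    return df[df[col].between(lower, upper)]
```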
This process increased consistency in the text, creating a more robust analysis environment and positively impacting model performance.
Finally, the data was sorted according to the conversation order. Because customer contact center employees follow standard scripts, it was assumed that removing their utterances would cause no information loss, so only the customer side of each conversation was retained to reduce the size and complexity of the dataset. Each conversation was grouped into a single line. An example phrase can be viewed in Table 4.
After this data cleaning process, the dataset contained 1,701,782 lines. The average word count was 47.65, the minimum was 13, the maximum was 127, and the median was 40. The target variable ratio in the dataset was 57%.
Considering the complexity and size of the dataset, we decided to work with a specific sample size, and a random sample of 250,000 lines, representing approximately 15% of the total data, was selected. The dataset was divided into training and test sets of 80% and 20%, respectively: the model learns on 80% of the data, and performance is tested on the remaining 20%.
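Assuming the cleaned data lives in a pandas DataFrame with hypothetical column names, the sampling and split step can be sketched as follows (the paper does not specify a random seed or stratification, so a plain random split is shown):

```python
from sklearn.model_selection import train_test_split


def make_splits(df, text_col="utterance_clean", target_col="kept_promise"):
    """Draw the 250,000-line random sample (~15%) and split it 80/20."""
    sample = df.sample(n=250_000, random_state=42)  # seed is illustrative
    return train_test_split(
        sample[text_col],
        sample[target_col],
        test_size=0.20,
        random_state=42,
    )
```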
One of the most crucial elements during data cleaning is the proper operation of the speech-to-text converter. The quality of this tool directly impacts the accuracy of the text data and, consequently, the model’s performance. Furthermore, bank call center employees must guide customers correctly. This study can contribute to improving call center processes by enabling inferences about how customers are guided.
3.3. Embedding
Embedding is a method that represents words or text as numerical vectors. This method allows natural language processing models to process data more effectively by preserving the semantic and contextual relationships between words. This study evaluated three different word embedding methods: Self-Embedding, FastText, and BERT (Bidirectional Encoder Representations from Transformers). This section discusses the operating principles and performance results of each method.
Figure 3 illustrates an example of word embeddings, highlighting semantic relationships between words.
3.3.1. Self-Embedding Method
Self-Embedding is a method that performs word embedding using the model’s internal features (self-representation) to create word and sentence representations. This approach typically aims to better capture the meaning of texts by leveraging the internal structure and parameters of deep learning models. Unlike traditional word embedding approaches, Self-Embedding methods, instead of working in a fixed word vector space, provide context sensitivity, producing representations that vary depending on the context of word usage.
One of the main advantages of this method is that it creates more consistent and meaningful representations, especially in texts with long contexts. Thanks to context dependency, Self-Embedding can distinguish different meanings of the same word, thus achieving higher accuracy in language processing tasks. Furthermore, by leveraging the models’ self-teaching capacity, it can reduce reliance on large amounts of pre-processed data [30].
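One common way to realize such data-specific embeddings is an embedding layer whose weights are trained jointly with the downstream model; whether this exactly matches the study’s implementation is an assumption. A minimal Keras sketch, with illustrative hyperparameters, follows:

```python
import tensorflow as tf

VOCAB_SIZE = 20_000  # illustrative vocabulary size
EMBED_DIM = 128      # illustrative embedding dimension
MAX_LEN = 127        # longest conversation in the cleaned data (Section 3.2)

inputs = tf.keras.layers.Input(shape=(MAX_LEN,))
# The embedding weights are learned from the task data itself rather than
# loaded from a generic pre-trained corpus, so the vectors adapt to this dataset.
embedded = tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM)(inputs)
```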
3.3.2. BERT (Bidirectional Encoder Representations from Transformers)
BERT stands out as a language model that captures the context-dependent meanings of words. Based on the Transformer architecture, this model analyzes the relationships between a word and both the preceding and following words in a bidirectional manner. In this way, BERT generates context-sensitive embedding vectors and captures the meaning of words in a sentence more deeply [31]. However, BERT’s large size and high computational requirements made the models difficult to run, which prevented comprehensive optimization of BERT in this study.
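As an illustration, context-sensitive sentence vectors can be extracted from a pre-trained BERT model with the Hugging Face transformers library. The study does not name a specific checkpoint; the Turkish BERTurk model used below is an assumption:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint choice is an assumption: BERTurk is a widely used Turkish BERT.
MODEL_NAME = "dbmdz/bert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)


def bert_sentence_vector(text: str) -> torch.Tensor:
    """Mean-pooled last-hidden-state embedding for one transcript."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)  # shape: (768,)


vec = bert_sentence_vector("ödeme sözü veriyorum")
```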
3.3.3. FastText
FastText is a method developed by Facebook AI Research (FAIR) that enables richer text representation by using subword information in word embedding [32]. Unlike traditional word embedding methods, FastText not only considers words as a whole but also learns the meaning of their subunits by breaking words down into character n-grams. This feature strengthens the representation of words that are rarely or never encountered. For example, the word “playing” can be broken down into the root “play” and the suffix “-ing” to better model its meaning.
Another key advantage of FastText is its speed and efficiency. It can operate on large datasets using highly effective algorithms in both the training and inference processes. Furthermore, it provides pre-trained models for a large number of languages, facilitating implementation in different languages. This method is widely used in various natural language processing tasks such as language modeling, text classification, and semantic search. FastText stands out among word embedding methods thanks to its innovative approach to understanding the structural properties of language.
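A minimal training sketch with the gensim implementation of FastText is shown below; the hyperparameters and the tokenized_texts input are illustrative assumptions, with min_n and max_n controlling the character n-gram range described above:

```python
from gensim.models import FastText

# Hypothetical tokenized transcripts (one token list per conversation)
tokenized_texts = [["ödeme", "sözü", "veriyorum"], ["bugün", "ödeme", "yapacağım"]]

model = FastText(
    sentences=tokenized_texts,
    vector_size=100,    # embedding dimension (illustrative)
    window=5,
    min_count=1,
    min_n=3, max_n=6,   # character n-gram range: the subword units described above
    epochs=10,
)

# Out-of-vocabulary words still receive vectors built from their character n-grams.
oov_vector = model.wv["ödemeler"]
```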
In this study, Self-Embedding, FastText, and BERT methods were evaluated for the representation of text data. The models were compared, considering the advantages and disadvantages of each method.
The Self-Embedding method, thanks to its ability to create data-specific embeddings, has demonstrated effective performance in low-resource, morphologically rich languages like Turkish. Its greatest advantage is its ability to capture nuances specific to the dataset used in this study. However, training the model requires significant time and resources, and its generalization ability to small datasets is limited.
FastText demonstrated robustness to spelling errors and morphological variation by considering word subunits. While this method appears advantageous given the structural characteristics of Turkish, it is insufficient to capture the meaning differences of words within sentences because it operates context-independently.
The BERT method stands out for its ability to learn context-specific meanings and demonstrated superior performance, particularly on sentence-level semantic relationships. However, the method’s high computational cost and the lack of sufficient pre-trained data for Turkish have led to some limitations in its application.
3.4. Modeling
Because the text data used in this study may contain temporal dependencies, the LSTM (Long Short-Term Memory) model was used, as it can capture such dependencies more effectively than traditional deep learning models.
Recurrent Neural Networks (RNNs) are built on a cyclic structure that passes the output of the previous step to the next step to model time dependencies in sequential data. This allows for the learning of sequential structures such as word sequences, speech streams, or time series. However, Pascanu, Mikolov, and Bengio have shown that training classical RNNs presents significant challenges, especially when trying to learn long dependencies. These challenges stem from the fact that, during error backpropagation, gradients can become excessively small (vanishing gradient) or large (exploding gradient) over time, hindering the model’s learning. Therefore, the need to develop more stable and efficient architectures for applications working with long sequences has arisen [33].
LSTM is a type of recurrent neural network (RNN) widely used in sequential data and time series analysis. It was developed to overcome the difficulties that traditional RNNs have in learning long-term dependencies. LSTM uses three main gate mechanisms to preserve past information for longer periods: a forget gate (which controls how much past information is remembered), an input gate (which determines how new information is incorporated into the cell), and an output gate (which ensures that information in the cell is passed on to the next layer). Thanks to these gates, LSTM effectively learns both short- and long-term dependencies, making it powerful for language modeling and prediction tasks [5].
In this study, three different models were constructed to measure the performance of three different embedding methods, all using two-layer LSTM architecture. Maintaining the model architecture constant allowed for direct comparison of the performance of different embedding methods and ensured that the results reflected only the differences arising from the embedding techniques used. This allowed for a more robust and reliable analysis of the impact of text representations on credit risk estimation.
Forget Gate (f_t): Determines how much of the previous cell state (C_{t−1}) will be forgotten.
Input Gate (i_t): Determines how much new information will be added to the cell state (C_t).
Update Gate (C_t): Updates the memory unit based on information received from the forget gate and the input gate.
Output Gate (o_t): Determines how much of the new cell state (C_t) will be transferred to the output (h_t).
Activation Functions: The sigmoid (σ) and tanh functions process the information in the cell and determine the output [34].
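For reference, these gate mechanisms correspond to the standard LSTM cell equations; the notation below is the conventional formulation rather than one reproduced from the study’s own figures:

```latex
\begin{align}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{C}_t &= \tanh\left(W_C [h_{t-1}, x_t] + b_C\right) && \text{(candidate state)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell-state update)}\\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) && \text{(output)}
\end{align}
```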
In this study, a two-layer LSTM architecture was used for credit risk prediction. In an LSTM, each layer consists of computational units that enable the model to learn specific correlations by processing input data. Each layer captures sequential dependencies in time series data and transmits this information to the next layer.
A two-layer LSTM allows the model to learn deeper and more complex patterns. The first layer extracts underlying patterns from the input data, while the second layer transforms this underlying information into higher-level representations. This structure provides a significant advantage, particularly in capturing sequential dependencies in linguistic data. Thus, the model achieved higher accuracy rates in credit risk prediction and increased generalization ability.
The complete hyperparameter configuration of the LSTM model is presented in Table 5. In all experiments, the architecture was kept constant, and only the embedding methods were changed for comparison purposes.
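As a minimal illustration of this fixed setup, a two-layer LSTM classifier can be sketched in Keras as follows; the layer sizes, vocabulary size, and sequence length are illustrative placeholders rather than the values reported in Table 5:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(127,)),                          # padded word-index sequences
    tf.keras.layers.Embedding(input_dim=20_000, output_dim=128),  # swapped per embedding experiment
    tf.keras.layers.LSTM(64, return_sequences=True),              # first layer: low-level sequential patterns
    tf.keras.layers.LSTM(64),                                     # second layer: higher-level representations
    tf.keras.layers.Dense(1, activation="sigmoid"),               # probability that the payment promise is kept
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```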
The configuration, preprocessing steps, and hyperparameters of the LSTM model used in this study are detailed in this section. To increase reproducibility, the code and model configuration files, which have undergone the necessary privacy checks, will be made available to researchers upon reasonable request. The raw call recording data, however, cannot be shared publicly due to privacy and regulatory restrictions; nevertheless, all preprocessing steps and the statistical properties of the dataset are documented in detail in the article.
3.5. Performance Measurement
This study demonstrates that whether customers will keep their payment promises can be predicted from the phrases they use in their phone calls. The model’s classification performance was evaluated using standard metrics, including accuracy, precision, recall, and F1 score, taking into account the uneven distribution of the target variable. These metrics provide a comprehensive understanding of the model’s effectiveness in making accurate predictions. Accuracy measures the model’s overall correctness; precision indicates the proportion of true positives among all positive predictions; and recall assesses the model’s ability to identify true positives. The F1 score, the harmonic mean of precision and recall, provides a balanced evaluation measure [35]. The relevant metrics are calculated as specified in (1)–(4):

$$\text{Accuracy} = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} \left( TP_i + FN_i \right)} \tag{1}$$

$$\text{Precision} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i} \tag{2}$$

$$\text{Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i} \tag{3}$$

$$F1 = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot PR_i \cdot RE_i}{PR_i + RE_i} \tag{4}$$

where N is the number of classes, TPi is the number of true positives for class i, FPi is the number of false positives for class i, FNi is the number of false negatives for class i, and PRi and REi are the precision and recall for class i, respectively.
Additionally, confusion matrices were created to display the model’s classification results and provide a detailed view of how well the classifier distinguishes between the classes. This visual representation is crucial for understanding the classifier’s strengths and areas for improvement.
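As an illustration, these metrics and the confusion matrix can be computed with scikit-learn; the label arrays below are toy placeholders, and the macro averaging mirrors Equations (2)–(4):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical true and predicted labels (1 = promise kept, 0 = not kept)
y_test = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro")  # macro average over the N classes
print(f"accuracy={accuracy_score(y_test, y_pred):.3f}, "
      f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted class
```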
5. Discussion
The nature of the text data obtained from phone interviews is also a significant factor. Compared to written text, this type of data is more disorganized, incomplete, and prone to colloquialisms. This negatively impacts the performance of models optimized for more formal language structures, such as BERT, while providing advantages for simpler, more data-driven methods like Self-Embedding and FastText.
Furthermore, the Self-Embedding method was optimized with parameters specific to the text data used in the study. This enabled the model to better capture the meaning of the texts and achieve more balanced performance in class prediction.
As a result, the Self-Embedding method used in the study outperformed BERT and FastText in terms of generalization capacity, thanks to its suitability for the nature of the dataset and its low-parameter structure. These findings demonstrate that, when selecting models for Turkish text data in the banking sector, the suitability of the embedding method to the data is crucially important.
While modeling studies often focus on achieving a high accuracy rate, another contribution of such studies is what they reveal about causality. Therefore, customers who kept and did not keep their payment promises were examined in detail in light of the model results, and insights that could contribute to improving call center processes were thoroughly explored.
Analyzing the most significant words in the model revealed that customers who did not keep their payment promises used specific phrases and words. These expressions can be categorized to better understand the reasons why customers fail to fulfill their payment commitments.
First, a significant portion of these expressions refer to personal and family problems. Words like “mother, father, sibling, patient, hospital, health, accident, death” indicate that the customer is experiencing health problems or family crises. Such difficulties can lead to unexpected expenses, negatively impacting the customer’s ability to pay. Healthcare expenses, in particular, stand out as a significant factor straining the budget.
A second group of statements relates to emotional and religious connections. These statements appear to emphasize the accuracy of what the customer says and build trust. Such statements may indicate their continued intention to make payments despite financial difficulties.
A third group of statements involves customers expressing financial difficulties. This is a significant factor that can disrupt the customer’s payment plans. Changes in payday schedules or unemployment can particularly disrupt income flow.
Another category relates to legal issues and fraud. These situations can create both financial and psychological pressure, leading to payment difficulties.
Finally, statements like “How much do I owe? There must be a mistake. Have I not paid?” indicate that the customer lacks knowledge about their debt or believes they are facing an unexpected debt. These statements often arise due to reasons such as inadequate debt tracking, payment errors, or forgotten due dates.
This classification can help better understand the reasons why customers fail to pay and use this data to improve call center processes. In particular, considering factors such as personal circumstances, financial difficulties, and legal issues will enable the development of more effective and personalized strategies.
This model’s output can help banks more accurately predict credit risk using data from customer conversations. Furthermore, call center conversation scripts can be updated to reflect whether customers who use certain patterns are more likely to keep their promises, and alternative communication approaches and offers can be designed to encourage payment from customers who are predicted not to pay. One of the project’s biggest gains is demonstrating the potential for such linguistic analysis to be used in customer service and risk management processes.
This study is significant because it is one of the pioneering studies conducted to predict credit risk by processing Turkish text data. In the banking sector, understanding customer payment habits and developing accurate prediction models is a cornerstone of financial risk management. The findings revealed that the two-layer LSTM architecture trained using the Self-Embedding method outperformed other methods. This was made possible by effectively modeling the contextual information specific to the dataset.
The analyzed data demonstrates significant relationships between customer payment behavior and the language they use. Customers with a high probability of making payments generally used positive, definitive, and confident expressions, while customers with a low probability of making payments cited personal problems and financial difficulties as reasons for nonpayment, and included religious expressions to establish credibility. This highlights the potential and importance of using language analysis in credit risk prediction.
The data cleaning process, one of the key stages of the project, played a critical role in making the raw data analyzable. In a morphologically rich language like Turkish, the cleaning and structuring of text obtained from call center conversations directly impacted the model’s accuracy. Furthermore, the accuracy of the program used to convert audio data to text was a key success factor for the study. Because low-quality transcripts can limit the model’s performance, each step in these processes was carefully planned.
The study’s limitations should be carefully considered regarding the generalizability of the findings. Because the dataset used was obtained from call center conversations from only one bank, the results may not be directly applicable to other banks or sectors. Furthermore, the model’s performance was tested only on a Turkish-specific dataset. Conducting similar studies in different languages or sectors could provide a broader perspective on the generalizability and accuracy of the methods.
For future studies, automating data cleaning processes and using more advanced tools for converting audio data to text is recommended. Furthermore, combining different embedding methods or developing hybrid models has the potential to improve model performance.
Additionally, the fixed LSTM architecture used in this study was intentionally chosen to allow a clearer comparison of the effects of different embedding approaches. Future studies could also test the effects of Transformer-based models on embedding performance, and output from large language models like GPT-4 could be analyzed more thoroughly for both explainability and formal consistency using evaluation frameworks like MDEval.
6. Conclusions
This study developed a model to predict customer loan payment behavior using Turkish text data, pioneering research in this area. The results demonstrated that the Self-Embedding method successfully learned linguistic and contextual details specific to the dataset and achieved high performance with the LSTM architecture.
The research findings reveal how text analytics can create value in credit risk prediction in the banking sector. By applying this type of model, banks can better identify customers who are unlikely to pay and plan their risk management strategies more effectively. It can also contribute to the development of new routing strategies that call center employees can use when communicating with customers. This model can also be used to identify strategic approaches that can positively influence payment behavior.
The study also provides a foundation for more comprehensive future research in credit risk prediction. Studies conducted with datasets from different sectors, countries, or languages could increase the general validity of this field. Furthermore, the scope of the study could be expanded by using larger datasets and more advanced language models. Such research is critical for furthering the applicability of text analytics, especially in low-resource languages like Turkish.
This study has several limitations that should be considered. First, the dataset was obtained from a single financial institution; therefore, the generalizability of the findings to other markets, institutions, or languages may be limited. Second, the analysis is based on a standard LSTM architecture. Although more recent transformer-based approaches (e.g., BERT, GPT-based models) could potentially yield higher performance, their application was not feasible in the present study due to computational cost, institutional restrictions on processing data outside the bank environment, and the limited availability of sufficiently large transcribed conversational datasets. Nevertheless, with access to GPU resources and larger-scale datasets, we plan to explore transformer-based architectures as part of future research. Third, the experiments were conducted without GPU acceleration using only 40 CPU cores, which limited the speed and scale of the training process. Future research could utilize GPU-based implementations to perform more comprehensive hyperparameter optimization and larger-scale experiments. Finally, while the evaluation is based on widely accepted metrics, real-world business impacts (e.g., reducing default rates or improving credit decision processes) have not yet been tested and are left for future research.
In conclusion, this study highlights the importance of call center data and language analytics in credit risk prediction and offers a pioneering approach in this area. This approach should be considered a significant step forward in advancing risk management in the financial sector.