Article

Turkish Telephone Conversations in Credit Risk Management: Natural Language Processing and LSTM Approach

1 Department of Statistics, Faculty of Arts & Sciences, Yildiz Technical University, Istanbul 34220, Türkiye
2 Department of Management Information Systems, Faculty of Business Administration, Marmara University, Istanbul 34180, Türkiye
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 108; https://doi.org/10.3390/app16010108
Submission received: 31 August 2025 / Revised: 26 September 2025 / Accepted: 18 December 2025 / Published: 22 December 2025

Abstract

This study aims to analyze text data obtained from Turkish phone calls to manage credit risk in the banking sector and predict whether customers will fulfill their payment promises. Data cleaning was identified as a critical step to improve the quality of the texts, and various natural language processing (NLP) techniques were used. The model was built using a two-layer LSTM architecture, starting with a Self-Embedding layer, and achieved approximately 80% accuracy on the test data. The findings indicate that customers who break their payment promises often cite personal life issues, such as health problems, family issues, and financial difficulties, and appeal to religious beliefs to appear credible. These results demonstrate the importance of text data in the banking sector, the applicability of different embedding methods to Turkish datasets, and their advantages and disadvantages. Furthermore, the model built using data obtained from customer conversations can help predict credit risk more accurately and contribute to improving call center processes. Automating data cleaning processes and developing speech-to-text translation tools are recommended for future studies.

1. Introduction

In the banking sector, timely repayment of loans granted to customers is critical to managing financial risks. The likelihood of loan repayment is one of the key factors directly impacting banks’ liquidity and overall financial health. The increasing number of individual customers, in particular, has made it impossible to assess each customer’s credit risk using traditional methods. Therefore, banks are increasingly turning to artificial intelligence and machine learning techniques to assess and manage credit risk [1].
Effective collection processes enable banks to meet their operational needs by maintaining cash flow and reducing non-performing loan rates. This reduces risks in the financial system, supporting overall economic stability. This, in turn, increases banks’ profitability and positively impacts their capital adequacy ratios. Furthermore, compliance with regulatory requirements (BRSA rules) minimizes legal risks and strengthens customer confidence [2].
In recent years, credit default rates in Europe have increased due to economic recessions and various macroeconomic factors. A study by Barbaglia et al. [3] examined credit default behavior in Europe, showing that borrower characteristics, loan-specific variables, and local economic conditions have a significant impact on default. This study highlights the importance of regional risk assessment policies.
In particular, the aftermath of COVID-19 has led to a noticeable shift in customer payment behavior due to rising global inflation and interest rates. For the same reason, considering the decline in bank profitability, data-driven analyses and forecasting processes have become even more important.
This study’s literature review examined the use of machine learning and deep learning models in predicting credit risk using information obtained from tabular and textual data.
Machine learning is a powerful tool used to create predictive models by extracting meaningful information from large data sets. Deep learning methods and recurrent neural networks, such as LSTM (Long Short-Term Memory), are particularly capable of effectively analyzing time series data and natural language processing problems [4,5].
The importance of machine learning in collection processes has been investigated for various loan types, such as mortgages and agricultural loans, with successful results. A study by Sirignano et al. [6] highlighted the use of deep learning models in mortgage risk analysis, achieving high accuracy in predicting borrower behavior and loan performance. This study explored the dynamics of credit default risk using large datasets.
In response to the economic downturn in the agricultural sector and concerns about farmers’ repayment capacity, the use of machine learning methods in agricultural loan delinquency estimation has made significant contributions to more accurately predicting financial stress. In this context, several studies have demonstrated the effectiveness of machine learning algorithms in predicting financial stress [7].
The LSTM algorithm demonstrates superior performance in credit risk prediction by effectively analyzing time series data [8]. Gao et al. achieved high accuracy in predicting credit card defaults using the XGBoost-LSTM model, which is a combination of XGBoost and LSTM models. This model achieved high classification accuracy without requiring feature engineering [9].
In a 2011 study, Loughran and McDonald suggested that the language used in financial reports can help make better predictions about companies’ financial situations [10].
In their research on Lending Club data, Kriebel and Stitz showed that even short user-generated texts significantly improved loan default predictions. Deep learning methods were found to be more effective at extracting valuable information from texts than traditional machine learning approaches [11].
Fu et al. [12] demonstrated the effectiveness of using deep learning methods to predict the default risk of online P2P platforms. Specifically, they found that BiLSTM-based models can successfully predict the default risk of platforms by extracting and using keywords from investor reviews.
Additionally, techniques incorporating deep learning architectures such as convolutional neural networks (CNNs) and transformer models like BERT and RoBERTa have been shown to be highly effective at processing and analyzing unstructured text data. These models demonstrate superior performance in loan default prediction tasks, providing banks with robust tools for managing the challenges of large datasets [11].
Beyond structured financial data, there is growing interest in using text data to more accurately predict credit risk. Various studies in recent years have demonstrated that meaningful features can be extracted from text using natural language processing (NLP) and deep learning techniques. The findings of some prominent research in this area are summarized below.
A study by Wang, Qi, Fu, and Liu investigated how textual descriptions found in loan applications could be integrated with traditional financial data to predict default. They concluded that unstructured textual data contains clues about borrowers’ intentions and behaviors, and that this information can improve model performance. This study, as an early example, highlights the direct contribution of text-based descriptions to credit risk assessment [13].
The 2019 article “When Words Sweat: Identifying Signals for Loan Default in the Text of Loan Applications” by Netzer and colleagues explores how application letters, which specify the applicant’s purpose in applying for a loan, can be used to predict loan default. Using text mining and machine learning tools, the authors analyzed more than 120,000 loan applications and found that textual information is important in predicting loan default. The paper concludes by highlighting the value of combining textual data with machine learning in improving the accuracy of credit scoring models [8].
By analyzing loan evaluation notes written by credit experts on small business loans, Stevenson, Mues, and Bravo demonstrated how text data can add value to structured data in predicting default. Using deep learning-based models, this study demonstrated that text input significantly contributes to the model and improves predictive performance compared to models relying solely on numerical data. This study demonstrates that text data can be an important signal carrier to consider in credit decision-making processes [14].
Lis, Kubkowski, Borkowska, Serwa, and Kurpanik analyzed the text contained in bank reports prepared for the audit and validation of credit risk models. This analysis, conducted using embedding and clustering methods, automatically extracted model errors and structural weaknesses from the text. This approach demonstrates that text data can be effectively used not only in the forecasting process but also in processes such as model validation and auditing [15].
Sanz Guerrero and Arroyo analyzed the descriptions written by borrowers on peer-to-peer lending platforms and investigated how large language models (LLMs) can be used to predict credit risk. Developed with the ChatGPT architecture, the model identified content related to default risk by evaluating linguistic patterns in the text and automatically generated a risk indicator based on this information. This study is noteworthy because it demonstrates that unstructured text can be used not only as a complementary but also as a standalone risk indicator. Furthermore, it provides a concrete application example of how LLMs can be used to replace or complement traditional statistical models [16].
Alamsyah, Hafidh, and Mulya developed an alternative credit scoring method using text data obtained from social media interactions of fintech users in Indonesia. The model, created by analyzing social media text, was able to predict the risk profile of individuals without traditional credit history, thus contributing to increased financial inclusion. This study demonstrates that unstructured data can be a valuable resource, especially for individuals lacking traditional financial knowledge [17].
In a study published in 2025, Wu, Dong, Li, and Shi compared the default prediction performance of user disclosures in loan applications, both in their original form and in versions refined using ChatGPT (OpenAI, available at https://chat.openai.com, accessed on 15 July 2025). They found that the AI-rewritten disclosures were clearer, more consistent, and more learnable by models, resulting in higher predictive accuracy. This study demonstrates the direct impact of text quality on model accuracy and highlights the criticality of text processing in credit risk analysis [18].
In recent years, large language models (LLMs) and Transformer-based frameworks have made significant progress in text processing. Although these models demonstrate high performance on textual tasks, the focus of the current study was to compare the performance of different embedding methods in the Turkish language, so the more standardized LSTM was chosen. This allowed us to clearly observe the effects of individual embedding methods, while controlling for variations in model architecture.
Transformer architecture was first introduced by Vaswani et al. in 2017 [19] and multimodal models like GPT-4 have been built with scaled-down versions of this architecture [20]. Features of LLMs such as output quality and structure awareness have been evaluated in studies such as MDEval [21]. Quantization techniques have also been proposed to enable LLMs in resource-constrained environments [22]. Although these studies reflect the methodological evolution in the field, a more controlled structure was preferred for the embedding comparisons that were the aim of the current study.
The data analyzed in this study was obtained from audio recordings of phone calls between a customer representative and a borrower. Automatic speech recognition systems, which are used to analyze such data, are a developing field of research, particularly for low-resource languages.
A study conducted by Mussakhojayeva, Dauletbek, Yeshpanov, and Varol developed multilingual speech recognition models for the Turkic languages. Monolingual and multilingual approaches were compared in models developed for ten different Turkic languages, and it was found that multilingual models, in particular, improved the character error rate (CER) by up to 50%. Furthermore, by providing an open-source corpus containing approximately 218 h of audio data for Turkish, this study made a significant contribution to speech recognition research in low-resource languages. Because Turkic languages share similar morphological structures, the data sharing and transfer learning strategies proposed in the study offer potential for analysis of Turkish speech data. In this context, the study serves as an important reference for researchers seeking to improve the quality of text data obtained from Turkish speech [23].
Kheddar, Hemis, and Himeur’s study, “Automatic Speech Recognition Using Advanced Deep Learning Approaches: A Survey,” provides a detailed review of how deep learning-based techniques (e.g., deep transfer learning, federated learning, reinforcement learning, and Transformer architectures) have been used in automatic speech recognition (ASR) in recent years. This study addresses current challenges such as data scarcity and computational cost for low-resource languages and proposes advanced methodologies to overcome these hurdles. Analyses of the effects of model architectures, data preprocessing strategies, and domain mismatches on ASR performance provide a direct link to the current study’s approach of isolating and measuring the impact of embedding methods on Turkish speech-to-text data [24].
Gormez developed a specially designed deep learning-based speech recognition system, taking into account the challenges specific to the Turkish language. In this study, acoustic models created using CNN, GRU, and LSTM layers were integrated with a language model supported by the Zemberek library. Experiments have shown that the system, supported by a language model using the Turkish Speech Corpus, provides significant improvements in metrics such as Word Error Rate (WER) and Character Error Rate (CER). Developed specifically to consider the agglutinative structure and phonetic features of Turkish, the system provides an important example in terms of both infrastructure design and language model utilization in Turkish ASR studies. This study can contribute to reducing errors that may occur in the speech-to-text conversion process and guide the production of high-quality Turkish transcripts [25].
In a study published in 2025, Dhahbi, Nasir Saleem, Sami Bourouis, Mouhebeddine Berrima, and Elena Verdú developed an end-to-end (E2E) automatic speech recognition system for low-resource languages. The study demonstrated the feasibility of training a model directly from raw data, eliminating the need for speech feature extraction, traditional language models, and labor-intensive preprocessing steps. Synthetic speech data generation and data augmentation techniques resulted in lower Word Error Rate (WER) and Character Error Rate (CER); this approach is promising for resource-poor languages like Turkish. In this context, it provides methodological support for the current study, serving as a reference in which the effect of embedding methods can be observed more clearly when the model architecture is kept constant [26].
An article published in 2025 by Jan Karem Höhne, Timo Lenzner, and Joshua Claassen compared the accuracy of ASR systems’ text output when open-ended questions were answered via voice in a survey conducted via smartphones in Germany. A comparison of two prominent ASR tools, the Google Cloud Speech-to-Text API and OpenAI Whisper, found that Whisper provided more accurate transcriptions, while the Google API had a speed advantage. Furthermore, factors such as audio quality and background noise were observed to affect the error rate in both systems. This study provides a good example in the literature that concretely demonstrates the types of errors that can occur during transcription and how system choice affects the results [27].
Despite the increasing number of studies in recent years, the literature lacks sufficient work on text-based deep learning methods in the collections field. As the number of studies grows, collection processes will be further optimized. The high inflation and payment difficulties experienced worldwide in the post-COVID era further increase the importance of this topic. The data sources and algorithms used will directly affect the performance of future studies.
This study aims to improve credit risk management in the banking sector by analyzing customer conversations recorded in Turkish and predicting whether customers who promised to pay will fulfill their commitments. The study compared different embedding methods to create meaningful representations from text data and used LSTM-based deep learning models to capture long-term dependencies. The findings can help banks improve their risk management processes and make their credit assessment mechanisms more effective.
In the following sections of the study, the dataset and data processing steps will be explained, followed by a detailed description of the structure and performance analyses of the proposed model. Finally, the findings will be discussed, and the contributions made within the scope of the study and recommendations for future research will be presented.

2. Dataset

The dataset used in this study was obtained from telephone conversations with customers in default. According to the bank’s strategic decisions, customers are called after a certain number of days of overdue payment to obtain payment commitments. The audio recordings from these conversations were converted to text data using speech-to-text technology and stored in an MS SQL database. The database records each speaker’s statements sequentially and chronologically.
Data collection was conducted through the bank’s call center systems. The interviews were conducted between bank representatives and customers under natural conditions. The transcribed text data was systematically stored, and each speaker’s statements were arranged in a way that preserved the flow of the conversation. Additionally, the database includes metadata such as the date and duration of the call and customer account information.
This dataset consists of anonymized transcripts of customer service call recordings within the framework of relevant financial and legal regulations. The data has been processed in accordance with legal regulations, corporate privacy rights, and data protection legislation. No personal information is included in the data pool. Data processing is conducted in accordance with confidentiality and security policies; access to data is provided only to authorized individuals via secure storage environments. Therefore, individual data separation is not required.
Interviews typically include three main types of interactions: payment commitments received from customers, information about the product in arrears, and details of the conversation. These interactions provide critical information for understanding customers’ payment behavior and communication patterns.
Such data provides valuable insights for developing payment prediction models and optimizing customer communication strategies.
This dataset aims to yield more realistic and applicable conclusions about customer behavior and communication patterns because it was collected under real-world conditions rather than in controlled laboratory environments. Storing data sequentially and in detail offers significant advantages for analyzing conversation dynamics and temporal patterns. This study aims to enhance a bank’s ability to predict payment behavior and optimize customer communication strategies through detailed analysis of past interactions.
In the raw dataset, each conversation is captured sequentially between the customer contact center employee and the customer, as shown in Table 1.
Descriptive statistics for the raw data are given in Table 2.
The data presented in the table provide important clues for assessing the effectiveness of customer conversations and customer adherence to payment commitments. Furthermore, the dataset’s broad scope demonstrates its richness and diversity.
These descriptive statistics shed light on the bank’s efforts to optimize its customer communication strategies regarding overdue payments. Data such as the frequency and duration of conversations, word distribution, and payment commitment rates provide important information for evaluating the effectiveness of customer interactions and the performance of representatives. This data can be used to develop future communication strategies and customer management policies.
In the next section, the raw dataset will be combined and cleaned, keeping each conversation in a single line for use in analysis and modeling.

3. Materials and Methods

3.1. Overview

The research presents a framework that includes data collection, data cleaning, data digitization, modeling, and model evaluation, as shown in Figure 1. Data collection is explained in Section 2. During the data cleaning phase, errors in the data were corrected and the data was standardized. Data digitization was performed using BERT, FastText, and Self-Embedding methods. The LSTM (Long Short-Term Memory) algorithm, one of the most successful deep learning methods in text mining, was used in the modeling, aiming to establish an appropriate architectural structure for estimating the target variable. Finally, the model outputs were analyzed, and the differences between customers who fulfilled their payment promises and those who did not were analyzed to infer information about customer payment performance.

3.2. Data Cleaning

Data cleaning is crucial in text mining processes [13]. In his study on Turkish text classification, Borandağ showed that simplification, spelling correction, and special-character cleaning during preprocessing directly affect model performance [28]. Similarly, Zümberoğlu and Eren reported that linguistic cleaning and structural adjustments to the dataset they developed for Turkish sentiment analysis significantly increased model accuracy. Both studies demonstrate that models suffer semantic losses when the language structure and morphological richness specific to Turkish are not carefully processed. In this context, a comprehensive data cleaning process was applied to the Turkish speech transcripts used in the current study, taking into account language distortions, repetitive patterns, indirect expressions, and systematic errors originating from the audio data [29].
In this study, unnecessary spaces and special characters (periods, commas, etc.) were first removed from the text to transform the raw data into an analyzable format as shown in Figure 2. This step reduced noise in the data and standardized the texts. Second, all words in the data set were converted to lowercase. Because uppercase and lowercase letters can cause the same word to be perceived differently in text mining processes, this conversion process increased the accuracy of word frequencies and consistency in the data set. Finally, the longest and most detailed step of the data cleansing process was manual corrections, including monogram and bigram checks, to correct spelling errors and grammatical errors. Data with a high number of incomprehensible words were deleted.
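As a concrete illustration, the character-level cleaning steps described above (special-character removal, whitespace normalization, and lowercasing, with the Turkish dotted/dotless I handled explicitly) might be sketched in Python as follows. The function name and the example sentence are illustrative assumptions, not the production pipeline:

```python
import re

def clean_transcript(text: str) -> str:
    """Apply the basic cleaning steps described above to one transcript."""
    # Turkish-aware lowercasing: map the dotted/dotless I pairs explicitly,
    # because str.lower() alone turns "I" into "i" rather than Turkish "ı"
    text = text.replace("I", "ı").replace("İ", "i").lower()
    # Remove punctuation and other special characters, keeping Unicode
    # letters, digits, and spaces (so Turkish characters survive)
    text = re.sub(r"[^\w\s]", " ", text, flags=re.UNICODE)
    # Collapse the repeated whitespace left over from the removals
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_transcript("Merhaba, ÖDEME sözü İÇİN arıyorum!!"))
# → merhaba ödeme sözü için arıyorum
```

The monogram/bigram checks and manual corrections described in the text remain a separate, largely human-driven step and are not reproduced here.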
During monogram and bigram checks, the dataset was examined separately for those who kept their payment promises and those who did not as illustrated in Table 3. This allowed for both a more detailed understanding of the dataset and a more detailed study of the manual data cleaning process.
The shortest and longest conversations in the data were examined, and it was determined that these conversations frequently involved issues such as calls not being completed due to technical problems or dropping out quickly. To ensure that the model learned from complete, more accurate data, outliers were identified, and conversations shorter than the 0.1 quantile (10th percentile) or longer than the 0.9 quantile (90th percentile) of conversation length were removed from the dataset.
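The quantile-based trimming can be sketched with the standard library alone. The helper below is a hypothetical illustration that uses word count as the length measure; the exact length metric and quantile estimator used in the study are not specified in the text:

```python
import statistics

def trim_by_length(conversations):
    """Drop conversations whose word count falls outside the 0.1-0.9 quantiles."""
    lengths = [len(c.split()) for c in conversations]
    # statistics.quantiles with n=10 returns the nine decile cut points,
    # so the first and last entries estimate the 0.1 and 0.9 quantiles
    deciles = statistics.quantiles(lengths, n=10)
    lo, hi = deciles[0], deciles[-1]
    return [c for c, n in zip(conversations, lengths) if lo <= n <= hi]

# Synthetic demo: conversations of 1..100 words; the shortest and longest
# roughly ten percent are dropped
demo = [" ".join(["kelime"] * i) for i in range(1, 101)]
trimmed = trim_by_length(demo)
```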
This process increased consistency in the text, creating a more robust analysis environment and positively impacting model performance.
Finally, the data was sorted according to the conversation order. Assuming no information loss would occur due to the customer contact center employees using standard text, the dataset was cleaned to retain only customer conversations to reduce the size and complexity of the dataset. Each conversation was grouped into a single line. An example phrase can be viewed in Table 4.
After this data cleaning process, the dataset contained 1,701,782 lines. The average word count was 47.65, the minimum was 13, the maximum was 127, and the median was 40. The target variable ratio in the dataset was 57%.
Considering the complexity and size of the dataset, we decided to work with a specific sample size, and a random sample of 250,000 lines, representing approximately 15% of the total data, was selected. The dataset was divided into train and test sets of 80% and 20%, respectively. The model learns from 80% of the data, and performance tests are conducted on the remaining 20%.
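The sampling and 80/20 split might look like the following sketch. The function name, the fixed seed, and the use of plain Python lists are illustrative assumptions; the row count matches the cleaned dataset size reported above:

```python
import random

def sample_and_split(rows, sample_size, train_frac=0.8, seed=42):
    """Randomly sample rows, then split the sample into train and test sets."""
    rng = random.Random(seed)               # fixed seed for reproducibility
    sample = rng.sample(rows, sample_size)  # sampling without replacement
    cut = int(len(sample) * train_frac)
    return sample[:cut], sample[cut:]       # 80% train, 20% test

# Illustrative numbers from the text: 1,701,782 cleaned lines,
# a 250,000-line sample, split 80/20
rows = list(range(1_701_782))
train, test = sample_and_split(rows, 250_000)
```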
One of the most crucial elements during data cleansing is the proper operation of the voice-to-text converter. The quality of this tool will directly impact the accuracy of the text data and, consequently, the model’s performance. Furthermore, bank call center employees must also guide customers correctly. This study can contribute to improving call center processes by enabling inferences about customer referrals.

3.3. Embedding

Embedding is a method that represents words or text as numerical vectors. This method allows natural language processing models to process data more effectively by preserving the semantic and contextual relationships between words. This study evaluated three different word embedding methods: Self-Embedding, FastText, and BERT (Bidirectional Encoder Representations from Transformers). This section discusses the operating principles and performance results of each method. Figure 3 illustrates an example of word embeddings, highlighting semantic relationships between words.

3.3.1. Self-Embedding Method

Self-Embedding is a method that performs word embedding using the model’s internal features (self-representation) to create word and sentence representations. This approach typically aims to better capture the meaning of texts by leveraging the internal structure and parameters of deep learning models. Unlike traditional word embedding approaches, Self-Embedding methods, instead of working in a fixed word vector space, provide context sensitivity, producing representations that vary depending on the context of word usage.
One of the main advantages of this method is that it creates more consistent and meaningful representations, especially in texts with long contexts. Thanks to context dependency, Self-Embedding can distinguish different meanings of the same word, thus achieving higher accuracy in language processing tasks. Furthermore, by leveraging the models’ self-teaching capacity, it can reduce reliance on large amounts of pre-processed data [30].
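In its simplest form, a self-trained embedding is a lookup table of trainable vectors, one row per vocabulary entry. The NumPy sketch below illustrates only the lookup; the toy vocabulary, dimensions, and random initialization are assumptions, and in a real model the matrix would be updated by backpropagation together with the rest of the network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; in practice it is built from the corpus
vocab = {"<pad>": 0, "ödeme": 1, "söz": 2, "yarın": 3}
embedding_dim = 8

# The embedding layer is simply a matrix with one row per vocabulary
# entry; during training these weights are learned like any other layer
embeddings = rng.normal(0.0, 0.1, size=(len(vocab), embedding_dim))

def embed(tokens):
    """Look up the vector for each token (unknown tokens map to <pad>)."""
    idx = [vocab.get(t, 0) for t in tokens]
    return embeddings[idx]  # shape: (len(tokens), embedding_dim)

vectors = embed(["ödeme", "söz", "bilinmeyen"])
```

The resulting sequence of vectors is what feeds the first LSTM layer in architectures like the one described later in this paper.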

3.3.2. BERT (Bidirectional Encoder Representations from Transformers)

BERT stands out as a language model that captures the context-dependent meanings of words. Based on a transformer architecture, this model analyzes the relationships between words and both preceding and following words in a bidirectional manner. In this context, BERT generates context-sensitive embedding vectors and captures the meaning of words in a sentence more deeply [31]. However, BERT’s large size and high computational requirements made the models difficult to run, which prevented comprehensive optimization of BERT in this study.

3.3.3. FastText

FastText is a method developed by Facebook AI Research (FAIR) that enables richer text representation by using subword information in the field of word embedding [32]. Unlike traditional word embedding methods, FastText not only considers words as a whole but also learns the meaning of these subunits by breaking them down into substrings called n-grams. This feature strengthens the representation of words that are rarely or never encountered. For example, the word “playing” is broken down into the root “play” and the suffix “-ing” to better model its meaning.
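The subword decomposition can be illustrated in a few lines. The sketch below reproduces FastText-style boundary-marked character n-grams; the Turkish word “oyun” and the n-gram range shown in the call are illustrative:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, as used by FastText."""
    marked = f"<{word}>"  # '<' and '>' mark the word boundaries
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

# A FastText word vector is the sum of the vectors of these n-grams,
# which is why unseen or misspelled words still get useful representations
print(char_ngrams("oyun", 3, 4))
# → ['<oy', 'oyu', 'yun', 'un>', '<oyu', 'oyun', 'yun>']
```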
Another key advantage of FastText is its speed and efficiency. It can operate on large datasets using highly effective algorithms in both the training and inference processes. Furthermore, it provides pre-trained models for a large number of languages, facilitating implementation in different languages. This method is widely used in various natural language processing tasks such as language modeling, text classification, and semantic search. FastText stands out among word embedding methods thanks to its innovative approach to understanding the structural properties of language.
In this study, Self-Embedding, FastText, and BERT methods were evaluated for the representation of text data. The models were compared, considering the advantages and disadvantages of each method.
The Self-Embedding method, thanks to its ability to create data-specific embeddings, has demonstrated effective performance in low-resource, morphologically rich languages like Turkish. Its greatest advantage is its ability to capture nuances specific to the dataset used in this study. However, training the model requires significant time and resources, and its generalization ability to small datasets is limited.
FastText demonstrated robustness to spelling errors and morphological variation by considering word subunits. While this method appears advantageous given the structural characteristics of Turkish, it is insufficient to capture the meaning differences of words within sentences because it operates context-independently.
The BERT method stands out for its ability to learn context-specific meanings and demonstrated superior performance, particularly on sentence-level semantic relationships. However, the method’s high computational cost and the lack of sufficient pre-trained data for Turkish have led to some limitations in its application.

3.4. Modeling

Because the text data used in this study may contain temporal dependencies, the LSTM (Long Short-Term Memory) model was chosen, as it captures such correlations more effectively than traditional deep learning models.
Recurrent Neural Networks (RNNs) are built on a cyclic structure that passes the output of the previous step to the next step to model time dependencies in sequential data. This allows for the learning of sequential structures such as word sequences, speech streams, or time series. However, Pascanu, Mikolov, and Bengio have shown that training classical RNNs presents significant challenges, especially when trying to learn long dependencies. These challenges stem from the fact that, during error backpropagation, gradients can become excessively small (vanishing gradient) or large (exploding gradient) over time, hindering the model’s learning. Therefore, the need to develop more stable and efficient architectures for applications working with long sequences has arisen [33].
LSTM is a type of recurrent neural network (RNN) widely used in sequential data and time series analysis. It was developed to overcome the difficulties that traditional RNNs have in learning long-term dependencies. LSTM uses three main gate mechanisms to preserve past information for longer periods: a forget gate (which controls how much past information is remembered), an input gate (which determines how new information is incorporated into the cell), and an output gate (which ensures that information in the cell is passed on to the next layer). Thanks to these gates, LSTM effectively learns both short- and long-term dependencies, making it powerful for language modeling and prediction tasks [5].
In this study, three models were constructed to measure the performance of three different embedding methods, all using the same two-layer LSTM architecture. Keeping the model architecture constant allowed a direct comparison of the embedding methods and ensured that the results reflected only the differences arising from the embedding techniques used. This enabled a more robust and reliable analysis of the impact of text representations on credit risk estimation.
Key components in Figure 4:
  • Forget Gate (f_t): determines how much of the previous cell state (C_{t−1}) is forgotten.
  • Input Gate (i_t): determines how much new information is added to the cell state (C_t).
  • Cell-State Update (C_t): updates the memory cell by combining the outputs of the forget gate and the input gate.
  • Output Gate (O_t): determines how much of the new cell state (C_t) is transferred to the output (H_t).
  • Activation Functions: the sigmoid (σ) and tanh functions process the information in the cell and determine the output [34].
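The gate interactions listed above can be written out directly as one time step. A minimal NumPy sketch of a single LSTM cell update (the weight shapes and random initialization are illustrative, not the paper’s trained parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: forget, input, and output gates plus the candidate state."""
    Wf, Wi, Wc, Wo, bf, bi, bc, bo = params
    z = np.concatenate([h_prev, x_t])      # concatenated input [h_{t-1}; x_t]
    f_t = sigmoid(Wf @ z + bf)             # forget gate: how much of C_{t-1} to keep
    i_t = sigmoid(Wi @ z + bi)             # input gate: how much new information to admit
    c_tilde = np.tanh(Wc @ z + bc)         # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # updated cell state C_t
    o_t = sigmoid(Wo @ z + bo)             # output gate
    h_t = o_t * np.tanh(c_t)               # hidden state H_t passed to the next layer/step
    return h_t, c_t

rng = np.random.default_rng(42)
n_in, n_hid = 4, 3
params = tuple(rng.standard_normal((n_hid, n_hid + n_in)) * 0.1 for _ in range(4)) \
       + tuple(np.zeros(n_hid) for _ in range(4))
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, params)
```

Because the output gate and tanh both saturate, the hidden state stays bounded while the cell state C_t can carry information across many steps, which is the mechanism behind the long-term memory described above.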
In this study, a two-layer LSTM architecture was used for credit risk prediction. In an LSTM, each layer consists of computational units that enable the model to learn specific correlations by processing input data. Each layer captures sequential dependencies in time series data and transmits this information to the next layer.
A two-layer LSTM allows the model to learn deeper and more complex patterns. The first layer extracts underlying patterns from the input data, while the second layer transforms this underlying information into higher-level representations. This structure provides a significant advantage, particularly in capturing sequential dependencies in linguistic data. Thus, the model achieved higher accuracy rates in credit risk prediction and increased generalization ability.
The complete hyperparameter configuration of the LSTM model is presented in Table 5. In all experiments, the architecture was kept constant, and only the embedding methods were changed for comparison purposes.
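The tokenization and padding settings in Table 5 (word-level vocabulary fitted on the training set, sequences post-padded and post-truncated to 127 tokens) can be mimicked without the Keras `Tokenizer`. A minimal pure-Python sketch, assuming whitespace-delimited, already-cleaned text; the sample strings are illustrative:

```python
from collections import Counter

def fit_word_index(texts):
    """Word-level vocabulary: the most frequent word gets index 1
    (0 is reserved for padding), mirroring Keras fit_on_texts."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def texts_to_padded(texts, word_index, maxlen=127):
    """Map words to indices, then post-truncate and post-pad with zeros."""
    out = []
    for t in texts:
        seq = [word_index[w] for w in t.split() if w in word_index][:maxlen]
        out.append(seq + [0] * (maxlen - len(seq)))
    return out

train = ["buyrun benim bugun maas yatarsa yarin hepsini hallederim", "evet evet tamam"]
wi = fit_word_index(train)
X = texts_to_padded(train, wi)
```

The resulting integer matrix is what the trainable embedding layer in Table 5 consumes, one 300-dimensional vector per non-zero index.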
The configuration, preprocessing steps, and hyperparameters of the LSTM model used in this study are detailed in this section. To support reproducibility, the code and model configuration files, after the necessary privacy checks, will be made available to researchers upon reasonable request. The raw call recording data, however, cannot be shared publicly due to privacy and regulatory restrictions; all preprocessing steps and statistical properties of the dataset are nevertheless documented in detail in the article.

3.5. Performance Measurement

This study demonstrates that whether customers will keep their payment promises can be predicted from the phrases they use in their phone calls. The model’s classification performance was evaluated using standard metrics, including accuracy, precision, recall, and F1 score, taking into account the uneven distribution of the target variable. These metrics provide a comprehensive picture of the model’s effectiveness in making accurate predictions. Accuracy measures the model’s overall correctness; precision indicates the proportion of true positives among all positive predictions; and recall assesses the model’s ability to identify true positives. The F1 score, the harmonic mean of precision and recall, provides a balanced evaluation measure [35]. The relevant metrics are calculated as specified in (1)–(4).
Accuracy = Number of correct predictions / Total number of predictions (1)

Precision (PR) = (1/N) · Σ_{i=1}^{N} TP_i / (TP_i + FP_i) (2)

where N is the number of classes, TP_i is the number of true positives for class i, and FP_i is the number of false positives for class i.

Recall (RE) = (1/N) · Σ_{i=1}^{N} TP_i / (TP_i + FN_i) (3)

where FN_i is the number of false negatives for class i.

F1-Score = (1/N) · Σ_{i=1}^{N} (2 · PR_i · RE_i) / (PR_i + RE_i) (4)

where PR_i and RE_i are the precision and recall for class i, respectively.
Additionally, confusion matrices were created to display the model’s classification results and provide a detailed view of how well the classifier distinguishes between the classes. This visual representation is crucial for understanding the classifier’s strengths and areas for improvement.
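The confusion matrix and the macro-averaged metrics in (1)–(4) can be computed with a few lines of NumPy. A minimal sketch (libraries such as scikit-learn provide equivalent routines):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=2):
    """cm[i, j] = number of samples with actual class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def macro_metrics(cm):
    """Accuracy plus macro-averaged precision, recall, and F1, as in (1)-(4)."""
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # TP / (TP + FP) per class
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # TP / (TP + FN) per class
    f1 = np.where(precision + recall > 0,
                  2 * precision * recall / np.maximum(precision + recall, 1e-12), 0.0)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision.mean(), recall.mean(), f1.mean()

cm = confusion_matrix([0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0])
acc, pr, re, f1 = macro_metrics(cm)
```

Reading the per-class rows and columns of the matrix this way is exactly how the balanced versus unbalanced prediction patterns reported in Tables 6 and 7 can be diagnosed.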

4. Results

The model’s performance evaluation yielded an accuracy rate of 79.54% on the test data. Detailed performance analysis is shown in Table 6 and Table 7.
To more comprehensively evaluate the discriminatory power of the models, ROC Curves and corresponding AUC scores are presented for the two best-performing architectures (Self-Embedding LSTM and FastText-based LSTM). These results provide a more holistic assessment of the models’ performance beyond the basic classification metrics presented above.
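The AUC reported alongside the ROC curves equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch of this rank-based computation (equivalent to scikit-learn’s `roc_auc_score`, with ties counted as half-correct; the sample values are illustrative):

```python
def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct ranking."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

An AUC of 0.5 corresponds to random ranking, which is why a model that collapses to a single predicted class (as BERT does in Table 6) adds no discriminatory power regardless of its apparent recall.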
Examining Tables 6 and 7 and Figures 5–8, the Self-Embedding-based two-layer LSTM model yielded better results than the FastText- and BERT-based models, as well as more balanced results for both target classes (1 and 0).
In the BERT-based model, the prediction distribution was unbalanced, and the model consistently tended to predict the target class as “1.”
Several factors may explain this. First, BERT’s large pre-trained language model may have prevented it from addressing the class imbalance problem: by learning more of the features of the more prevalent class during training, the model developed a bias toward that class.

5. Discussion

The nature of the text data obtained from phone interviews is also a significant factor. Compared to written text, this type of data is more disorganized, incomplete, and prone to colloquialisms. This negatively impacts the performance of models optimized for more formal language structures, such as BERT, while providing advantages for simpler, more data-driven methods like Self-Embedding and FastText.
Furthermore, the Self-Embedding method was optimized with parameters specific to the text data used in the study. This enabled the model to better capture the meaning of the texts and achieve more balanced performance in class prediction.
As a result, the Self-Embedding method used in the study outperformed BERT and FastText in terms of generalization capacity, thanks to its suitability for the nature of the dataset and its low-parameter structure. These findings demonstrate the crucial importance of the suitability of embedding methods used in the banking sector when selecting models working with Turkish text data.
While modeling studies often focus on achieving a high accuracy rate, another contribution of such studies is what they reveal about causality. Therefore, customers who kept their payment promises and those who did not were examined in detail in light of the model results, and insights that could help improve call center processes were thoroughly explored.
Analyzing the most significant words in the model revealed that customers who did not keep their payment promises used specific phrases and words. These expressions can be categorized to better understand the reasons why customers fail to fulfill their payment commitments.
  • First, a significant portion of these expressions refer to personal and family problems. Words like “mother, father, sibling, patient, hospital, health, accident, death” indicate that the customer is experiencing health problems or family crises. Such difficulties can lead to unexpected expenses, negatively impacting the customer’s ability to pay. Healthcare expenses, in particular, stand out as a significant factor straining the budget.
  • A second group of statements relates to emotional and religious connections. These statements appear to emphasize the accuracy of what the customer says and build trust. Such statements may indicate their continued intention to make payments despite financial difficulties.
  • A third group of statements involves customers expressing financial difficulties. This is a significant factor that can disrupt the customer’s payment plans. Changes in payday schedules or unemployment can particularly disrupt income flow.
  • Another category relates to legal issues and fraud. These situations can create both financial and psychological pressure, leading to payment difficulties.
  • Finally, statements like “How much do I owe? There must be a mistake. Have I not paid?” indicate that the customer lacks knowledge about their debt or believes they are facing an unexpected debt. These statements often arise due to reasons such as inadequate debt tracking, payment errors, or forgotten due dates.
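The qualitative categories above lend themselves to a simple keyword-to-category lookup that call-center tooling could apply to transcripts. A minimal sketch with hypothetical keyword lists (the study’s model relies on learned embeddings, not on such a lookup; the keywords shown are illustrative, not the model’s actual feature set):

```python
# Excuse categories from the analysis above, with illustrative (not exhaustive)
# English keyword sets; a production version would use the Turkish vocabulary.
CATEGORIES = {
    "personal/family": {"mother", "father", "sibling", "patient", "hospital",
                        "health", "accident", "death"},
    "financial": {"salary", "unemployed", "payday"},
    "legal/fraud": {"lawsuit", "fraud", "lawyer"},
    "debt awareness": {"mistake", "owe", "paid"},
}

def categorize(transcript: str):
    """Return the set of excuse categories whose keywords occur in a transcript."""
    words = set(transcript.lower().split())
    return {cat for cat, kws in CATEGORIES.items() if words & kws}

hits = categorize("my mother is in the hospital I will pay after my salary")
```

Even such a crude tagger can route flagged calls to tailored scripts, while the LSTM model supplies the actual payment-probability estimate.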
This classification can help better understand the reasons why customers fail to pay and use this data to improve call center processes. In particular, considering factors such as personal circumstances, financial difficulties, and legal issues will enable the development of more effective and personalized strategies.
This model’s output can help banks more accurately predict credit risk using data from customer conversations. Furthermore, call center conversation scripts can be updated to reflect whether customers who use certain patterns are more likely to keep their promises, and alternative communication approaches and offers can be designed to encourage payment from customers who are predicted not to pay. One of the project’s biggest gains is demonstrating the potential of such linguistic analysis in customer service and risk management processes.
This study is significant because it is one of the pioneering studies conducted to predict credit risk by processing Turkish text data. In the banking sector, understanding customer payment habits and developing accurate prediction models is a cornerstone of financial risk management. The findings revealed that the two-layer LSTM architecture trained using the Self-Embedding method outperformed other methods. This was made possible by effectively modeling the contextual information specific to the dataset.
The analyzed data demonstrates significant relationships between customer payment behavior and the language they use. Customers with a high probability of making payments generally used positive, definitive, and confident expressions, while customers with a low probability of making payments cited personal problems and financial difficulties as reasons for nonpayment, and included religious expressions to establish credibility. This highlights the potential and importance of using language analysis in credit risk prediction.
The data cleaning process, one of the key stages of the project, played a critical role in making the raw data analyzable. In a morphologically rich language like Turkish, the cleaning and structuring of text obtained from call center conversations directly impacted the model’s accuracy. Furthermore, the accuracy of the program used to convert audio data to text was a key success factor for the study. Because low-quality transcripts can limit the model’s performance, each step in these processes was carefully planned.
The study’s limitations should be carefully considered regarding the generalizability of the findings. Because the dataset used was obtained from call center conversations from only one bank, the results may not be directly applicable to other banks or sectors. Furthermore, the model’s performance was tested only on a Turkish-specific dataset. Conducting similar studies in different languages or sectors could provide a broader perspective on the generalizability and accuracy of the methods.
For future studies, automating data cleaning processes and using more advanced tools for converting audio data to text is recommended. Furthermore, combining different embedding methods or developing hybrid models has the potential to improve model performance.
Additionally, the fixed LSTM architecture used in this study was intentionally chosen to compare the effects of different embedding approaches more clearly. Future studies could test the effects of Transformer-based models on embedding performance. In addition, output from large language models like GPT-4 could be analyzed more thoroughly for both explainability and formal consistency using evaluation frameworks like MDEval.

6. Conclusions

This study developed a model to predict customer loan payment behavior using Turkish text data, pioneering research in this area. The results demonstrated that the Self-Embedding method successfully learned linguistic and contextual details specific to the dataset and achieved high performance with the LSTM architecture.
The research findings reveal how text analytics can create value in credit risk prediction in the banking sector. By applying this type of model, banks can better identify customers who are unlikely to pay and plan their risk management strategies more effectively. It can also contribute to the development of new routing strategies that call center employees can use when communicating with customers. This model can also be used to identify strategic approaches that can positively influence payment behavior.
The study also provides a foundation for more comprehensive future research in credit risk prediction. Studies conducted with datasets from different sectors, countries, or languages could increase the general validity of this field. Furthermore, the scope of the study could be expanded by using larger datasets and more advanced language models. Such research is critical for furthering the applicability of text analytics, especially in low-resource languages like Turkish.
This study has several limitations that should be considered. First, the dataset was obtained from a single financial institution; therefore, the generalizability of the findings to other markets, institutions, or languages may be limited. Second, the analysis is based on a standard LSTM architecture. Although more recent transformer-based approaches (e.g., BERT, GPT-based models) could potentially yield higher performance, their application was not feasible in the present study due to computational cost, institutional restrictions on processing data outside the bank environment, and the limited availability of sufficiently large transcribed conversational datasets. Nevertheless, with access to GPU resources and larger-scale datasets, we plan to explore transformer-based architectures as part of future research. Third, the experiments were conducted without GPU acceleration using only 40 CPU cores, which limited the speed and scale of the training process. Future research could utilize GPU-based implementations to perform more comprehensive hyperparameter optimization and larger-scale experiments. Finally, while the evaluation is based on widely accepted metrics, real-world business impacts (e.g., reducing default rates or improving credit decision processes) have not yet been tested and are left for future research.
In conclusion, this study highlights the importance of call center data and language analytics in credit risk prediction and offers a pioneering approach in this area. This approach should be considered a significant step forward in advancing risk management in the financial sector.

Author Contributions

Conceptualization, E.R.M.; Methodology, E.R.M. and D.Y.; Software, E.R.M.; Validation, E.R.M.; Formal analysis, E.R.M.; Investigation, E.R.M.; Data curation, E.R.M.; Writing—original draft, E.R.M.; Writing—review and editing, E.R.M., D.Y. and E.U.; Visualization, E.R.M.; Supervision, D.Y. and E.U. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

In accordance with the Higher Education Council (YÖK) Directive on Scientific Research and Publication Ethics (dated 29 August 2012, https://www.yok.gov.tr/documents/documents/68fb63b8b2511.pdf), provided that the original style and expression of another person are not used verbatim, the use of anonymous information, fundamental knowledge of scientific fields, and propositions such as mathematical theorems and proofs in studies shall not be considered as an ethical violation.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset consists of confidential customer service conversations obtained from a bank. Due to privacy, confidentiality, and ethical restrictions, the data are not publicly available.

Acknowledgments

The authors would like to thank their institution for supporting this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Brown, I.; Mues, C. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Syst. Appl. 2012, 39, 3446–3453. [Google Scholar] [CrossRef]
  2. Poledna, S.; Hinteregger, A.; Thurner, S. Identifying Systemically Important Companies by Using the Credit Network of an Entire Nation. Entropy 2018, 20, 792. [Google Scholar] [CrossRef]
  3. Barbaglia, L.; Manzan, S.; Tosetti, E. Forecasting Loan Default in Europe with Machine Learning. J. Financ. Econom. 2023, 21, 569–596. [Google Scholar] [CrossRef]
  4. Graves, A. Generating Sequences with Recurrent Neural Networks. arXiv 2014, arXiv:1308.0850. [Google Scholar] [CrossRef]
  5. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  6. Sirignano, J.; Sadhwani, A.; Giesecke, K. Deep Learning for Mortgage Risk. arXiv 2018, arXiv:1607.02470. [Google Scholar] [CrossRef]
  7. Chen, J.; Katchova, A.L.; Zhou, C. Agricultural loan delinquency prediction using machine learning methods. Int. Food Agribus. Manag. Rev. 2021, 24, 797–812. [Google Scholar] [CrossRef]
  8. Netzer, O.; Lemaire, A.; Herzenstein, M. When Words Sweat: Identifying Signals for Loan Default in the Text of Loan Applications. J. Mark. Res. 2019, 56, 960–980. [Google Scholar] [CrossRef]
  9. Gao, J.; Sun, W.; Sui, X. Research on Default Prediction for Credit Card Users Based on XGBoost-LSTM Model. Discret. Dyn. Nat. Soc. 2021, 2021, 5080472. [Google Scholar] [CrossRef]
  10. Loughran, T.; Mcdonald, B. When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. J. Financ. 2011, 66, 35–65. [Google Scholar] [CrossRef]
  11. Kriebel, J.; Stitz, L. Credit default prediction from user-generated text in peer-to-peer lending using deep learning. Eur. J. Oper. Res. 2022, 302, 309–323. [Google Scholar] [CrossRef]
  12. Fu, X.; Ouyang, T.; Chen, J.; Luo, X. Listening to the investors: A novel framework for online lending default prediction using deep learning neural networks. Inf. Process. Manag. 2020, 57, 102236. [Google Scholar] [CrossRef]
  13. Wang, S.; Qi, Y.; Fu, B.; Liu, H. Credit Risk Evaluation Based on Text Analysis. Int. J. Cogn. Inform. Nat. Intell. 2016, 10, 11. [Google Scholar] [CrossRef]
  14. Stevenson, M.; Mues, C.; Bravo, C. The value of text for small business default prediction: A deep learning approach. Eur. J. Oper. Res. 2021, 295, 758–771. [Google Scholar] [CrossRef]
  15. Lis, S.; Kubkowski, M.; Borkowska, O.; Serwa, D.; Kurpanik, J. Analyzing Credit Risk Model Problems through NLP-Based Clustering and Machine Learning: Insights from Validation Reports. arXiv 2023, arXiv:2306.01618. [Google Scholar] [CrossRef]
  16. Sanz-Guerrero, M.; Arroyo, J. Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending. Intel. Artif. 2025, 28, 220–247. [Google Scholar] [CrossRef]
  17. Alamsyah, A.; Hafidh, A.A.; Mulya, A.D. Innovative Credit Risk Assessment: Leveraging Social Media Data for Inclusive Credit Scoring in Indonesia’s Fintech Sector. J. Risk Financ. Manag. 2025, 18, 74. [Google Scholar] [CrossRef]
  18. Wu, Z.; Dong, Y.; Li, Y.; Shi, B. Unleashing the power of text for credit default prediction: Comparing human-written and generative AI-refined texts. Eur. J. Oper. Res. 2025, 326, 691–706. [Google Scholar] [CrossRef]
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar] [CrossRef]
  20. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar] [CrossRef]
  21. Chen, Z.; Liu, Y.; Shi, L.; Chen, X.; Zhao, Y.; Ren, F. MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models. arXiv 2025, arXiv:2501.15000. [Google Scholar] [CrossRef]
  22. Chen, J.; Li, J.; Peng, Z.; Wang, W.; Ren, Y.; Shi, L.; Hu, X. LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning. arXiv 2025, arXiv:2505.18724. [Google Scholar] [CrossRef]
  23. Mussakhojayeva, S.; Dauletbek, K.; Yeshpanov, R.; Varol, H.A. Multilingual Speech Recognition for Turkic Languages. Information 2023, 14, 74. [Google Scholar] [CrossRef]
  24. Kheddar, H.; Hemis, M.; Himeur, Y. Automatic Speech Recognition using Advanced Deep Learning Approaches: A survey. Inf. Fusion 2024, 109, 102422. [Google Scholar] [CrossRef]
  25. Görmez, Y. Customized deep learning based Turkish automatic speech recognition system supported by language model. PeerJ Comput. Sci. 2024, 10, e1981. [Google Scholar] [CrossRef]
  26. Dhahbi, S.; Saleem, N.; Bourouis, S.; Berrima, M.; Verdú, E. End-to-end neural automatic speech recognition system for low resource languages. Egypt. Inform. J. 2025, 29, 100615. [Google Scholar] [CrossRef]
  27. Höhne, J.K.; Lenzner, T.; Claassen, J. Automatic speech-to-text transcription: Evidence from a smartphone survey with voice answers. Int. J. Soc. Res. Methodol. 2025, 28, 625–632. [Google Scholar] [CrossRef]
  28. Borandağ, E. LSRM: A New Method for Turkish Text Classification. Appl. Sci. 2024, 14, 11143. [Google Scholar] [CrossRef]
  29. Zümberoğlu, K.B.; Dik, S.Z.; Karadeniz, B.S.; Sahmoud, S. Towards Better Sentiment Analysis in the Turkish Language: Dataset Improvements and Model Innovations. Appl. Sci. 2025, 15, 2062. [Google Scholar] [CrossRef]
  30. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 2227–2237. [Google Scholar] [CrossRef]
  31. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar] [CrossRef]
  32. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. arXiv 2017, arXiv:1607.04606. [Google Scholar] [CrossRef]
  33. Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training Recurrent Neural Networks. arXiv 2013, arXiv:1211.5063. [Google Scholar] [CrossRef]
  34. Li, K.; Chen, X. Machine Learning-Based Lithium Battery State of Health Prediction Research. Appl. Sci. 2025, 15, 516. [Google Scholar] [CrossRef]
  35. Górny, K.; Kuwałek, P.; Pietrowski, W. Increasing Electric Vehicles Reliability by Non-Invasive Diagnosis of Motor Winding Faults. Energies 2021, 14, 2510. [Google Scholar] [CrossRef]
Figure 1. Visualization of the process from the data collection stage to the model performance measurement stage.
Figure 2. Example of manual data correction code.
Figure 3. Example of word embeddings illustrating semantic relationships.
Figure 4. Illustration of the LSTM algorithm.
Figure 5. ROC Curve of the LSTM model trained with Self-Embedding on the train dataset.
Figure 6. ROC Curve of the LSTM model trained with Self-Embedding on the test dataset.
Figure 7. ROC Curve of the LSTM model trained with FastText on the train dataset.
Figure 8. ROC Curve of the LSTM model trained with FastText on the test dataset.
Table 1. A raw data representation of a sample interview.
| Customer No | Party | Start Time | Speech ID | Original Phrase | Phrase in English |
|---|---|---|---|---|---|
| 12345 | Agent | 1117 | 6460 | TEŞEKKÜR EDERİM GÖRÜŞMELER KAYIR ALTINDA … KULLANDIĞINI KREDİ KATRI ON … GÜN KREDİLER ON BİR ON BİR VE 4 GÜN GECİKMİŞ … TANE BAKIYORUM HEMEN … KREDİ KARTI ASGARİSİ ON BİN ALTMIŞ | THANK YOU, THE INTERVIEWS ARE UNDER RECORD … THE CREDIT YOU USE IS TEN … DAYS, THE CREDITS ARE ELEVEN AND 4 DAYS DELAYED … I’M LOOKING AT ONE … THE CREDIT CARD MINIMUM IS TEN THOUSAND SIXTY |
| 12345 | Customer | 5450 | 6460 | HIHI | HIHI |
| 12345 | Agent | 4355 | 6460 | TAMAM | OKAY |
| 12345 | Customer | 3727 | 6460 | BUGÜN MAAŞ YATARSA YARIN HEPSİNİ HALLEDER | IF THE SALARY IS DEPOSITED TODAY, I WILL TAKE CARE OF EVERYTHING TOMORROW |
| 12345 | Agent | 48 | 6460 | İYİ GÜNLER | GOOD DAY |
| 12345 | Customer | 383 | 6460 | BUYRUN BENİM | IT’S ME |
| 12345 | Agent | 4643 | 6460 | TAMAM O TAMAM BİRİ ZATEN HEPSİNİ SALI DİYORUM ON BİRİ O ZAMAN YARIN … ON ÜÇ BİN ALTI YÜZ ON ÜÇ … BİR KREDİ KARTI … YARIN ÖDEMEYİ ONAYLIYOR MUSUNUZ | OKAY, OKAY, ONE, ALL OF THEM, I’M SAYING TUESDAY, ELEVEN, THEN TOMORROW … THIRTEEN THOUSAND SIX HUNDRED THIRTEEN … ONE CREDIT CARD … DO YOU CONFIRM THE PAYMENT TOMORROW? |
| 12345 | Customer | 4250 | 6460 | HA … YARIN DİYELİM GÜNCELLEYELİM ORADAKİ TALİMAT YANLIŞ OLMASIN | HA … LET’S UPDATE IT TOMORROW SO THAT THE INSTRUCTIONS THERE ARE NOT WRONG |
| 12345 | Agent | 5750 | 6460 | GECİKME FAİZİ GÜNLÜK … ÖZELLİKLER KREDİLER EKSTRENİZ TAKAN … GECİKME … FİNANSAL İŞLEMİNİZE KARŞINIZA ÇIKAR … YARIN DÖRT ÖDEMEYİ … OLACAĞIZ | DELAY INTEREST IS DAILY … FEATURES YOUR LOAN STATEMENT WILL BE CHARGED … DELAY … WILL BE HARMFUL TO YOUR FINANCIAL TRANSACTION … TOMORROW WE WILL MAKE FOUR PAYMENTS … |
| 12345 | Customer | 5680 | 6460 | EVET | YES |
| 12345 | Agent | 4021 | 6460 | TAMAM YARIN MI DİYELİM | OK, SHOULD WE SAY TOMORROW? |
| 12345 | Customer | 7142 | 6460 | İYİ GÜNLER | GOOD DAY |
| 12345 | Agent | 452 | 6460 | ÖNCE BABA İSMİNİZİ ÖĞRENEBİLİRMİYİM | CAN I KNOW YOUR FATHER’S NAME FIRST? |
Table 2. Descriptive Statistics of the data.
| Metric | Value |
|---|---|
| Data collection period | 1.12.2022–16.12.2023 |
| Total number of phone calls | 3,884,198 |
| Number of customer representatives participating in the calls | 407 |
| Total number of customers contacted | 1,153,903 |
| Average call duration | 2 min 52 s |
| Maximum call duration | 14 min ¹ |
| Average customer delay days | 33 days |
| Time distribution of calls | 97% between 09:00 and 17:00 on weekdays |
| Distribution of calls by day | Uniform |
| Average number of daily calls per customer representative | 109 |
| Rate of calls where payment promises were received | 55% |
| Modeling population | Calls where payment promises were received (55%) |
| Realized payment rate | 50% (half of the promises made were fulfilled) |

¹ Values above the 99th percentile were considered outliers.
Table 3. Example of bigram review table.
| Bigram | Bigram in English | Count |
|---|---|---|
| iyi gunler | Good day | 62,724 |
| tamam tamam | Okay, okay | 61,115 |
| evet evet | Yes, yes | 49,181 |
| tesekkur ederim | Thank you | 29,072 |
| ne kadar | How much? | 21,324 |
| evet benim | Yes, it’s me | 11,676 |
| su anda | Right now | 11,594 |
| buyrun benim | Yes, it’s me | 10,577 |
| değil mi | Isn’t it? | 9,267 |
| kredi karti | Credit card | 8,232 |
| tamam onayliyorum | Okay, I confirm | 7,934 |
| ayin on | Ten of the month | 7,549 |
| cuma gunu | Friday | 6,188 |
| uc yuz | Three hundred | 6,128 |
| yuz lira | One hundred lira | 6,010 |
Table 4. Sample data prepared for modeling after data cleaning and grouping processes.
| Customer No | Target | Speech ID | Phrase | Phrase in English |
|---|---|---|---|---|
| 12345 | 1 | 6460 | buyrun benim bugun maas yatarsa yarin hepsini hallederim yarın diyelim guncelleyelim ordaki talimati yanlis olmasin evet iyi gunler | it’s me if the salary is deposited today i will take care of everything tomorrow let’s update it tomorrow so that the instructions there are not wrong yes good day |
Table 5. LSTM model configuration and training setup.
| Component | Value/Setting |
|---|---|
| Tokenizer | Word-level, trained on training set (fit_on_texts) |
| Vocabulary size | Based on training data (not explicitly limited) |
| Sequence length | Max length = 127 tokens (padding = post, truncation = post) |
| Embedding layer | Embedding(input_dim = len(tokenizer.word_index) + 1, output_dim = 300, input_length = 127, trainable = True) |
| Recurrent layers | 1st: Bidirectional LSTM, 64 units, return_sequences = True; 2nd: LSTM, 64 units |
| Dropout | 0.5 (applied after first LSTM) |
| Dense layer | 1 unit, activation = sigmoid |
| Loss function | Binary cross-entropy |
| Optimizer | Adam |
| Evaluation metrics | Accuracy, Precision, Recall, F1, ROC Curve, AUC Score |
| Batch size | 32 |
| Epochs (max) | 25 (with early stopping) |
| Early stopping | monitor = ‘val_loss’, patience = 5, restore_best_weights = True |
| Validation setup | validation_data = (test_x, test_y) used during training |
| Hardware | 40 CPU cores, no GPU acceleration |
| Software | Python 3.11 |
Table 6. Confusion matrix showing test data performance.
| Actual \ Predicted | Self-Embedding: Negative | Self-Embedding: Positive | FastText: Negative | FastText: Positive | BERT: Negative | BERT: Positive |
|---|---|---|---|---|---|---|
| Negative | 42.69% | 7.37% | 43.95% | 5.95% | 0.00% | 49.90% |
| Positive | 13.09% | 36.85% | 19.88% | 30.23% | 0.00% | 50.10% |
Table 7. Performance table prepared to compare train and test data.
| Metric | Self-Embedding: Train | Self-Embedding: Test | FastText: Train | FastText: Test | BERT: Train | BERT: Test |
|---|---|---|---|---|---|---|
| Accuracy | 87.62% | 79.54% | 73.42% | 74.17% | 49.93% | 50.10% |
| Precision | 89.24% | 83.33% | 86.11% | 83.55% | 49.93% | 50.10% |
| Recall | 85.57% | 73.79% | 55.76% | 60.33% | 100.00% | 100.00% |
| F1 Score | 87.37% | 78.27% | 67.69% | 70.07% | 66.60% | 66.76% |

Share and Cite

MDPI and ACS Style

Muratlar, E.R.; Yildiz, D.; Ustaoglu, E. Turkish Telephone Conversations in Credit Risk Management: Natural Language Processing and LSTM Approach. Appl. Sci. 2026, 16, 108. https://doi.org/10.3390/app16010108

