Voice-Controlled Intelligent Personal Assistant for Call-Center Automation in the Uzbek Language

Abstract: The demand for customer support call centers has surged across various sectors due to the pandemic. Yet, the constraints of round-the-clock human services and fluctuating wait times pose challenges in fully meeting customer needs. In response, there is a growing need for automated customer service systems that can provide responses tailored to specific domains and in the native languages of customers, particularly in developing nations like Uzbekistan where call center usage is on the rise. Our system, "UzAssistant," is designed to recognize user voices and accurately present customer issues in standardized Uzbek, as well as vocalize the responses to voice queries. It employs feature extraction and recurrent neural network (RNN)-based models for effective automatic speech recognition, achieving 96.4% accuracy in real-time tests with 56 participants. Additionally, the system incorporates a sentence similarity assessment method and a text-to-speech (TTS) synthesis feature specifically for the Uzbek language. The TTS component utilizes the WaveNet architecture to convert text into speech in Uzbek.


Introduction

Research Context and Motivation
The growing popularity of artificial intelligence (AI) has been notably marked by the rise of voice assistants (VAs). Prominent examples include Amazon Echo, Google Assistant, Microsoft Cortana, and Apple Siri. These AI-driven voice assistants are revolutionizing how people engage with technology. McCue's research indicates that 27% of internet users globally utilize voice search [1,2], and forecasts suggest a doubling in the use of in-home voice assistants from 2018 to 2023 [3]. Many specialists believe that voice assistants will complement traditional computing devices like PCs and laptops, especially for practical shopping tasks [4]. Despite ongoing concerns about privacy and security, the use of voice assistants is on the rise. Thanks to advancements in natural language processing and machine learning, these assistants are capable of conducting complex conversations and handling multiple tasks simultaneously [5]. The continued development of this technology is likely to bring significant changes in the way humans interact with machines.
As technology continues to advance, an increasing number of people are turning to virtual voice assistants to help them with simple tasks and answer basic questions. With the integration and development of voice recognition and natural language algorithms, these systems have become increasingly automated and efficient. The ability to control multiple functionalities of smart devices through voice commands has made voice assistants such as Siri, Alexa, Cortana, and Google convenient and essential parts of our daily lives.
Automatic speech recognition and synthesis systems are widely used in various call center applications such as automatic call distribution (ACD), interactive voice response (IVR) systems, and personnel management programs. These systems enable the use of intelligent voice menus to handle incoming calls, automatic telephone calls to customers with a voice interface, conversion of phone conversations to text, real-time recommendations to call center employees during conversations, recognition of customer emotions during phone conversations, and improvements in employee productivity. With the help of these systems, call centers can efficiently handle tasks such as evaluating customer satisfaction with the quality of the service provided.

Research Aims and Contributions
Our research indicates that combining various neural network architectures can significantly enhance the precision of automated customer service systems for the Uzbek language and its various dialects. By employing a blend of RNN encoder-decoder, DNN-CTC, E2E-transformer, and E2E-conformer models, we are able to develop both statistical and neural network-driven language models within automatic speech recognition (ASR) frameworks. These models are adept at accurately identifying customer vocal requests and delivering responses tailored to the Uzbek context [6-12]. Such an approach is poised not only to boost customer satisfaction but also to enable call centers to manage a larger influx of customer queries more effectively.
In the era of AI-driven automation, many tasks that were once difficult have become straightforward. Voice programs are useful for determining a destination, translating text, and finding the necessary information through a simple voice message. However, users are typically required to speak either English or Russian [13-16].
Live conversations are the norm when communicating with clients in call centers. This means that speech recognition technology needs to work in quasi-real time or even in real time. Some of the main benefits of using speech recognition systems in call center services include improving call center efficiency by reducing the time required to handle calls, enhancing customer service by providing more accurate information and faster responses, lowering the risk of errors and misunderstandings during conversations, and allowing call center operators to focus more on conversations and less on taking notes [17-20].
The main purposes of speech recognition systems in call center services [21-25] include the following:
- A significant reduction in waiting (handling) time, which reduces labor costs.
- A 1.5-2 times reduction in call time, achieved by shortening the time the operator spends entering information.
- A reduction in operators' working time on complex calls, owing to the ability to answer simple questions automatically.
- The ability to serve customers 24/7 (even on holidays).
- Verification of customers' voices by asking one or two simple questions, which is especially important in the banking sector to protect against theft of personal cards and confidential documents.
- The ability to handle a large number of short calls (e.g., in bookmakers' call centers).
- The ability to replace a complex and error-prone IVR system operating in tone mode.
- The ability to use speech recognition as a source of additional information, not only during conversations but also during subsequent call analysis. In particular, such analysis helps increase first call resolution (FCR), i.e., resolving the problem in a single call, which reduces callbacks, increases customer satisfaction, and in turn lowers operating costs.

Through these articulated aims, this study addresses the gaps in the existing literature, offers efficient solutions for present challenges, and paves the way for future advancements in the field of natural language processing (NLP). The key contributions of this study are as follows:

1. Speaker recognition in varied environments: We tested a speaker recognition module in different environments, observing its accuracy under varying conditions; for example, it achieved 96.4% accuracy in real-time tests with 56 participants.

2. Use of the Deep Speech 2 model: We used the Deep Speech 2 model for extracting MFCC features from utterances.

3. Automatic speech recognition (ASR) model: The ASR model used in the study converts consumer speech into text. It was trained using an RNN-based end-to-end speech recognition architecture on a large Uzbek automatic speech recognition training dataset (USC).

4. Sentence summarization: The system includes a sentence summarization component that uses the BERT sentence transformer for embedding sentences and search queries, achieving an average accuracy of 85.27%.

5. Database management: The system incorporates three types of databases, a personal information database (PID), a generic information database (GID), and a credential information database (CID), which play crucial roles in managing user data and queries.

6. Development and implementation of an Uzbek speech synthesizer rooted in natural voice for call centers:
Objective: To seamlessly integrate a speech synthesizer calibrated to the phonetic intricacies of the Uzbek language into the telephonic interfaces of call centers.
Operational mechanism: Upon receiving a call, the synthesized voice mechanism initiates a dialogue with the caller, efficiently garnering the requisite information, and thus minimizing the preliminary conversational stages traditionally facilitated by human operators.

Anticipated impact:
The incorporation of this synthesizer is projected to considerably alleviate the operational burdens shouldered by call center representatives, rendering the process more streamlined and expeditious.

Challenges and practical implications of speech recognition in public service call centers:
Context: The burgeoning integration of speech synthesizers into public service-oriented call centers is challenging. Their primary function often pivots on vocalizing the results stemming from voice-initiated queries, such as ascertaining the status of administrative applications.
Technical nuances: While the conceptual frameworks of these systems are undeniably innovative, it is imperative to comprehend the intricacies associated with their seamless operation, ranging from linguistic variations to background noise interference.
Operational benefits: Despite the potential obstacles, the judicious deployment of such systems can increase the efficacy of voice response mechanisms, thereby ensuring that callers receive precise and timely information.

Enhancing call center efficiency through automated speech recognition and synthesis:
Premise: The swift and accurate resolution of client inquiries is at the heart of contemporary call center dynamics. Therefore, automated speech recognition and synthesis are of paramount importance.
Research scope: This study delves into the ramifications of implementing these speech technologies, particularly in contexts that require simple procedural updates, such as tracking the status of an application.
Projected outcomes: Preliminary data suggest that astute deployment of these systems could reduce manual operator involvement by a substantial 20-25%. Furthermore, it paves the way for uninterrupted 24/7 customer service, bolstering operational efficiency and augmenting customer satisfaction.

Structure of the Paper
The structure of this paper is laid out in the following manner: Section 2 provides an overview of the current prevalent methods. In Section 3, we delve into the specific methodology employed by the Uzbek voice-controlled intelligent personal assistant. Section 4 focuses on the deployment and evaluation of our proposed system, offering a comparative analysis with existing methods. The paper concludes with Section 5, where we encapsulate the main findings and summarize the key points of our discussion.

Related Work
Recently, the use of AI-driven solutions in business operations has skyrocketed [26]. This infusion of AI introduces advanced cognitive features similar to human abilities, including automation, image recognition, problem solving, and informed decision making [27]. These features are brought to life through tools such as chatbots, intelligent virtual interfaces, robotic machinery, and other digital aids [28]. These tools serve a dual purpose. They can elevate individual productivity by replacing human components in certain tasks. This makes them invaluable in areas such as education, healthcare, management, and industrial production [27]. For example, AI-powered data management systems can supersede conventional record management, aiding healthcare professionals in organizing and analyzing patient data for better decision making. In healthcare, robots can help with surgical procedures, attend to elderly patients, and oversee medication regimes [29]. Within the industrial realm, AI adoption can streamline production processes, leading to higher output [30]. In terms of handling data, AI's capacity to swiftly process and depict intricate data boosts organizational efficacy and simplifies decision-making processes [31]. The swift processing ability of AI systems surpasses human limitations [32]. However, the growing dependency on machines and AI has ushered in debates about their ethical and moral consequences [33].
Schwenk et al. [34] pioneered the application of artificial neural networks (ANNs) in language modeling, contrasting an ANN-driven n-gram model with a refined Kneser-Ney smoothed approach, informed by a corpus exceeding 550 million words. Instead of utilizing the entire vocabulary, they focused on the most frequently used words for their ANN-based language model (LM). Their approach involved training a neural network on a large dataset by randomly selecting text segments for each training iteration. For speech recognition, an n-gram LM was employed, while a neural network-based LM was used for reevaluating word sequences, achieving a 0.5% decrease in word misrecognition. Mikolov et al. [35] introduced an RNN-based language modeling approach to streamline training. They categorized less common words into unique groups based on frequency of occurrence. In their speech recognition experiments, they utilized a 5-gram LM with Kneser-Ney smoothing as a baseline, then reevaluated the top 100 predictions using an RNN-based LM. This RNN implementation resulted in an 18% reduction in word error rate (WER) compared to the 5-gram LM, while simultaneously reducing the model's complexity.
Huang et al. [36] developed a recurrent neural network (RNN) based language model (LM) for the initial decoding stage in Bing's voice search. They recommended employing the RNN-based LM particularly when the n-gram LM's predictions were significantly accurate. To boost processing efficiency, they incorporated a key-value hash table cache. This approach lowered the word error rate (WER) from 25.3% to 23.2%. Additionally, they enhanced the system by reweighting recognition lattices using the RNN-based LM. Optimal results were achieved by combining the RNN-based LM with a foundational 4-gram model for lattice generation, followed by rescoring with a similar model, which brought the WER down to 22.7% with an interpolation coefficient of 0.3. Sundermeyer et al. [37] investigated the efficacy differences between LMs using feedforward artificial neural networks (ANNs) and RNNs. They tested three neural network LM setups: (1) a feedforward ANN built with LIMSI-2013 software, focusing on commonly used words; (2) a clustering approach with a feedforward ANN using the complete word pair; and (3) clustering with an RNN. These LMs were trained on a 27-million-word corpus, forming 200 classes for ANN clustering based on word frequency, with hidden layer sizes ranging from 300 to 500 units, adjusted according to validation data results. They used an n-gram model for deriving the LM from an ANN system, achieving a WER reduction of 1.5% in training and 1.4% in testing. In these evaluations, feedforward ANNs were less effective than RNNs, with the RNN showing a 0.4% enhancement over the feedforward ANN in test scenarios. Morioka et al. [38] introduced an LM that utilizes variable-length contexts. In speech recognition tests with an extensive dictionary, this model demonstrated a reduction in both perplexity and WER.
In a seminal research endeavor, Hilda et al. [39] proposed a dialogue management system characterized by its ability to facilitate spoken language interactions with users. This system, which seamlessly integrates automatic speech recognition, text-to-speech synthesis, a sophisticated dialogue manager, and an expansive information database, holds promise for revolutionizing telephone-based automated customer service paradigms. In a subsequent scholarly investigation, Zweig et al. [40] introduced an avant-garde quality monitoring apparatus tailored for call centers. This innovative mechanism amalgamates the advantages of a speaker recognition module, maximal entropy classification, state-of-the-art pattern recognition technology, and automatic speech recognition, thereby promising unparalleled robustness. Venturing further into the realm of customer experience, Mclean et al. [41] embarked on an empirical study that leveraged a web-based survey methodology, garnering insights from 302 participants. Their study sought to unravel the determinants underpinning customer satisfaction in real-time chat service encounters. In another academically rigorous study, Warnapura et al. [42] proposed an AI-infused architecture engineered to deliver diverse information modalities to customers, spanning texts, voice outputs, and emails. Harnessing the power of sentiment analysis for user response classification, in conjunction with the capabilities of natural language processing (NLP) and automatic speech recognition, they conceived an automated system of remarkable resilience and efficacy.
Mansurov and his team [43] recently introduced UzBERT, a BERT-based model tailored for the Uzbek language. They developed this model using a specially compiled news corpus of over 142 million words. The model's effectiveness in masked language modeling was benchmarked against the multilingual BERT (mBERT). UzBERT was trained with objectives like masked language modeling (MLM) and next sentence prediction (NSP), and it incorporated hyperparameters such as a dropout probability of 0.1 and a GeLU activation function. The model's architecture mirrored the original BERT design, featuring 12 layers, 768 hidden units, 12 attention heads, and a total of 110 million parameters, with a vocabulary size of 30,000 tokens. Of the 142 million words in the dataset, 140 million were allocated for training, while the remaining 2 million were reserved for validation.
Despite the growing popularity and advancements of hybrid CTC/attention ASR systems, particularly in low-resource languages, their application to Central Asian languages like Turkish and Uzbek remains limited. Ren et al. [44] introduced a novel feature extraction method using CNNs, termed multiscale parallel convolution (MSPC). This technique utilizes convolution kernels of varying sizes to capture features at different scales, combined with a bidirectional long short-term memory (Bi-LSTM) network to boost the accuracy and stability of the end-to-end model. They also incorporated a fine-tuned BERT model to initialize their RNN language model, integrating it during the decoding phase.
Further exploring the End2End approach for the Uzbek language, Mamatov et al. [45] developed an Uzbek speech recognition system. Their approach involved evaluating existing speech recognition methods to identify the most effective one. They trained various models using a diverse dataset, including 432 h of audiobook recordings and 72 h of audio clips featuring sayings and maxims, voiced by a total of 174 speakers, to create an extensive database.
Several methods, such as Doc2Vec and Word2Vec, have been used by scholars to gauge the similarity between sentences.Doc2Vec, built on the foundation of the Word2Vec model, is adept at encapsulating the semantic essence of sentences or paragraphs [46].
Similarly, Word2Vec is a neural network-driven model that is proficient in depicting words within a high-dimensional space, thus encapsulating their semantic nuances and contextual significance [47].
In conclusion, despite the substantial academic attention garnered by natural language processing (NLP) and speech recognition, their empirical integration into call-center settings remains conspicuously under-examined. However, it is plausible to posit that research findings from cognate domains, when judiciously adapted to the nuances of call center operations, could offer valuable insights.

Workflow
Our methodology was meticulously designed to optimize results and achieve our set goals. The project was divided into four principal segments: speech recognition, text summarization, sentence similarity analysis, and text-to-speech (TTS) conversion. For the speech recognition part, we utilized an advanced deep learning framework suitable for this purpose (Deep Speech 2). Text summarization was handled using the seq2seq model, while the doc2vec model was instrumental in assessing sentence similarity. Lastly, for converting text into spoken words, we employed TTS technology, specifically the WaveNet deep learning framework, which is known for its high-quality speech synthesis capabilities (as shown in Figure 1).

Text-to-speech (TTS) technology simulates human-like speech by transforming written text into audible sound through advanced machine learning methods. This technology is particularly useful for developing voice-operated robots and interactive voice response (IVR) systems, offering businesses a cost-effective and efficient solution by automating sound generation and eliminating the need for manual audio recording and editing.
The quality of TTS-generated speech has significantly improved, achieving a natural sound through meticulous refinement of various elements such as tone, smoothness, accent placement, pauses, and intonation. There are two primary methods for achieving this: concatenative TTS, which stitches together pre-recorded audio snippets, and parametric TTS, which uses a probabilistic model to determine the acoustic characteristics of a sound signal based on the input text. Concatenative TTS is known for its high-quality output but requires extensive data for training the machine learning models. Parametric TTS, in contrast, can produce speech that closely resembles human speech with a smaller data requirement [48].

Recognition of Speakers
In our speaker recognition approach, we utilized the Mel-frequency cepstral coefficients (MFCC) method [49] for extracting features from audio signals, focusing primarily on low-frequency components. Additionally, we developed a feature-matching technique that works in tandem with an expectation-maximization algorithm based on the Gaussian mixture model (GMM) [50]. For the purpose of speaker adaptation, our system collects a 20-s voice sample from users. It then employs feature extraction techniques to create a unique GMM profile for each individual, which aids in making precise similarity assessments. Furthermore, we implemented a threshold mechanism to identify and differentiate unregistered speakers. This significantly improves the system's accuracy by reducing the chances of misidentification. The functioning of the speaker recognition module is depicted in Figure 2.
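As an illustration, the following is a minimal sketch of the MFCC-plus-GMM enrollment and identification flow described above, assuming librosa and scikit-learn are available; the function names, the 16-component mixture, and the rejection threshold are illustrative choices rather than the paper's actual implementation.

```python
# Hedged sketch: MFCC + GMM speaker enrollment and identification.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Load audio and return frame-level MFCC features (frames x coefficients)."""
    signal, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # one row per frame

def enroll_speaker(wav_path, n_components=16):
    """Fit a per-speaker GMM on a short enrollment sample (~20 s), as in the text."""
    feats = extract_mfcc(wav_path)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(feats)  # EM fitting of the speaker-tailored mixture
    return gmm

def identify(wav_path, speaker_models, threshold=-50.0):
    """Score an utterance against all enrolled GMMs; reject unregistered speakers."""
    feats = extract_mfcc(wav_path)
    scores = {name: gmm.score(feats) for name, gmm in speaker_models.items()}
    best = max(scores, key=scores.get)
    # The threshold mirrors the paper's rejection mechanism; its value is an assumption.
    return best if scores[best] > threshold else "unknown"
```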

Automatic Speech Recognition
We utilized cutting-edge technology to ensure the accurate and efficient transcription of consumer speech. We employed an ASR model trained on a publicly available annotated Uzbek voice corpus dataset using a state-of-the-art architecture called Deep Speech 2 [51]. This architecture utilizes HPC techniques and batch normalization to achieve 7× faster training compared to its predecessor, while employing a unique optimization curriculum known as SortaGrad [50].
To synchronize the text transcription with the input frames, we employ the function Align(x,y), which maps every potential pairing of characters from the transcription y with frames in the input x.
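As a sketch of how such an alignment function is typically used in training, the connectionist temporal classification (CTC) objective below sums the probability of a transcription over all frame-level alignments that collapse to it; this formalization is assumed to be consistent with Deep Speech 2-style training rather than quoted from the paper.

```latex
% Hedged sketch: the CTC objective induced by Align(x, y).
% a ranges over frame-level label sequences (characters plus blanks)
% that collapse to the transcription y; T is the number of input frames.
\[
  p(y \mid x) = \sum_{a \,\in\, \mathrm{Align}(x,\,y)} \;\prod_{t=1}^{T} p(a_t \mid x),
  \qquad
  \mathcal{L}_{\mathrm{CTC}} = -\log p(y \mid x)
\]
```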

Summarization of Sentences
We developed a system that applies an Uzbek sentence summarization method, aimed at shortening sentences and distilling key information from customer replies. This system harnesses the Seq2Seq [52] summarization method, enhanced with an attention mechanism, which successfully brought the training loss down to 0.001. The efficacy of this method was confirmed through our experimental evaluations, showing promising outcomes. The architecture of our sentence summarization system is depicted in Figure 3 [52].

(1) Data Collection and Processing Techniques

The Uzbek Bank speech corpus (UBSC v1.0) is a comprehensive dataset used for speech-to-text conversion. It includes 108 h of recorded Uzbek speech data in .wav format from 863 speakers of different ages, genders, dialects, education levels, and accents. This dataset was also used for text summarization and includes 35 k short articles and 24 k short summaries for evaluation purposes. Additionally, 322 questions were generated from the City Bank website and social network pages to measure sentence similarity.
After collecting the data, we processed them in four steps.First, the sentences were broken down into a series of tokens.Next, we expanded the contractions to their full forms and eliminated all stop words and punctuation marks.Following this, we implemented lemmatization to transform the words to their base forms.Subsequently, we categorized the words according to their parts of speech.These steps enhanced our ability to analyze the data thoroughly and extract significant insights.
(2) Model Architecture

During the training process, we employed an RNN encoder-decoder in tandem with the Seq2Seq approach, augmented by an attention mechanism, to effectively condense the articles. This architectural design comprises three primary segments: the encoder, attention, and decoder modules. The function of the encoder is to transform a sequence into a consistent context vector, capturing the semantic abstraction of the entire article. This context vector acts as the foundational state for the decoder, interfacing with its hidden layers despite disparities in the timestamps of the encoder and decoder. Given that 'a' signifies the target set of sentences and 'b' the source set, the highest likelihood for the word vector sequence is given by arg max_b p(a|b). A sequence-to-sequence framework fortified with the Bahdanau attention mechanism was applied to align the fixed-length output. For the embedding process, we harness a pre-established Uzbek word vector, named the "uz w2c model", which transmutes words into their numeric counterparts. The terminal representation of each word is vectorized, which is pivotal for model training.
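For concreteness, the following is a minimal PyTorch sketch of the Bahdanau (additive) attention module described above; the class name, layer names, and dimensions are illustrative assumptions, not the authors' code.

```python
# Hedged sketch: Bahdanau (additive) attention for a Seq2Seq summarizer.
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim)   # projects encoder states
        self.W_dec = nn.Linear(dec_dim, attn_dim)   # projects decoder state
        self.v = nn.Linear(attn_dim, 1)             # scores each source position

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim); enc_outputs: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.W_enc(enc_outputs) + self.W_dec(dec_state).unsqueeze(1)
        )).squeeze(-1)                               # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)      # alignment over source words
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights                      # context vector feeds the decoder
```

At each decoding step, the context vector is concatenated with the decoder input, which is how the fixed-length bottleneck of a plain encoder-decoder is avoided.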

Text-to-Speech Synthesis
Our proposed model was designed to interact with customers by responding to a range of verbal inputs in an audio format. Given our focus on the state services center sector in Uzbekistan, it is noteworthy that the majority of users communicate in the Uzbek language. To convert text into audible speech, we employed TTS. Central to its efficacy is the use of DeepMind's WaveNet [53], renowned for its optimal accuracy. The architecture of the WaveNet model is shown in Figure 4. In addition, the model can generate speech responses in real time, making it extremely useful for real-world applications.
WaveNet is an advanced neural network designed to produce raw audio and was developed by DeepMind, an AI company based in London. Presented in a September 2016 paper, WaveNet can create realistic, human-like voice sounds by using a neural network trained on actual spoken voice data. When tested with American English and Mandarin, it surpassed Google's top text-to-speech systems at the time. However, as of 2016, the synthesized speech from WaveNet was still not as authentic sounding as genuine human speech. The capability of WaveNet to generate raw audio waveforms allows it to replicate various types of sounds, encompassing both speech and music.
Creating a top-tier synthetic TTS database necessitates ensuring that the speech output from the source TTS model aligns precisely with phonetic pronunciations. To achieve this, we implemented a Tacotron 2 decoder equipped with a phoneme alignment method [54]. This approach is adept at precisely synchronizing phoneme sequences with their corresponding acoustic features. In this setup, an external model dedicated to duration prediction determines the length of each phoneme based on linguistic attributes. Following this, the Tacotron 2 decoder is responsible for producing the relevant acoustic features. Subsequently, these features are transformed into speech signals by a WaveNet-based neural excitation vocoder. Within this vocoder, a WaveNet-based mixture density network [55] operates, adhering to the principles of human speech production mechanisms [56]. This results in the stable and accurate generation of speech signals [53,57].
In setting up WaveNet, the dilation factors were arranged in a sequence from [2^0, 2^1, ..., 2^9], and this sequence was repeated thrice. This configuration led to the formation of 30 layers of residual blocks and a receptive field comprising 3067 samples. Within each of these residual blocks, convolution layers with 128 channels were utilized. The system was designed to output two dimensions, specifically to calculate the mean and the standard deviation for a Gaussian distribution. Additionally, a weight normalization method was employed, ensuring that all weight vectors were normalized to a standard length [5].
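The dilation schedule above can be written down directly; this short sketch only reconstructs the stated configuration, and the receptive-field remark in the comment is an approximation that depends on the (unstated) kernel size.

```python
# Hedged sketch: dilation schedule [2^0 .. 2^9] repeated three times,
# giving the 30 residual layers described above.
dilations = [2 ** i for i in range(10)] * 3
assert len(dilations) == 30
# With kernel size 2 (an assumption), each stack widens the receptive field by
# sum(2^0..2^9) = 1023 samples, so three stacks give a receptive field on the
# order of the 3067 samples quoted above.
print(dilations)
```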
Enhancing the spectral definition of the synthesized speech involved applying a spectral-domain sharpening filter, set with a coefficient value of 0.95, as a post-processing measure. Furthermore, for producing clearer speech audio, the scale parameter generated by WaveNet in the voiced sections was decreased by a ratio of 0.87. Within the TTS framework, the advanced time-frequency trajectory excitation vocoder functions by capturing a range of features every 5 milliseconds [58]. This includes a diverse array of elements: line spectral frequencies spanning 40 dimensions, along with the fundamental frequency, energy levels, a voicing indicator, a 32-dimensional slowly evolving waveform, and a rapidly altering 4-dimensional waveform. Collectively, these components constitute a detailed 79-dimensional feature vector.
The source TTS's acoustic model is composed of three distinct parts: a context analyzer, a context encoder, and a Tacotron decoder. Initially, the context analyzer processes the input text to extract 354-dimensional phoneme-level linguistic feature vectors, which include 330 categorical and 24 numerical contexts. Following this, a duration predictor, comprising three fully connected layers with unit counts of 1024, 512, and 256, and an LSTM layer with 128 memory blocks, calculates the duration of each phoneme. These phoneme-level features are then scaled up to match frame-level dimensions. The context encoder further refines these features by passing the frame-level linguistic features through three convolution layers with 10 × 1 kernels and 512 channels, a bidirectional LSTM with 512 memory blocks, and fully connected layers with 512 units each. Subsequently, the Tacotron decoder, which includes a PreNet, a PostNet, and a primary unidirectional LSTM, takes over to produce the acoustic features. The PreNet, consisting of two fully connected layers with 256 units each, processes the previously generated acoustic features. These outputs, along with those from a context-embedding module, are then routed through two unidirectional LSTM layers with 1024 memory blocks, followed by two projection layers with 79 units, to create the acoustic features. Lastly, the PostNet, which is made up of five convolution layers with 5 × 1 kernels and 512 channels, incorporates residual elements into the acoustic features to enhance the precision of generation.
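The duration predictor described above can be sketched as follows in PyTorch; the 354-dimensional input and the 1024/512/256/128 layer sizes follow the text, while everything else (names, activation choices, the scalar output head) is an assumption.

```python
# Hedged sketch of the phoneme duration predictor described in the text.
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, in_dim=354):  # 354-dim phoneme-level linguistic features
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.lstm = nn.LSTM(256, 128, batch_first=True)  # 128 memory blocks
        self.out = nn.Linear(128, 1)  # predicted duration (frames) per phoneme

    def forward(self, ling_feats):
        # ling_feats: (batch, n_phonemes, 354)
        h = self.fc(ling_feats)
        h, _ = self.lstm(h)
        return self.out(h).squeeze(-1)  # (batch, n_phonemes)
```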

Database
The system utilized three distinct databases. The first, known as the personal information database (PID), gathers various user details such as name, national identity (NID) number, mobile number, date of birth, recorded response, speech-tailored GMM-based model, and the time of the call. This collected data is subsequently utilized for verification, where the system cross-references it with the personal information submitted by the user. Table 1 illustrates a sample layout of a PID. Another component of the system is the generic information database (GID), which houses a collection of frequently asked questions (FAQs) pertinent to the e-commerce sector. New users, who do not yet have access to the credential information database (CID) that safeguards sensitive information, primarily interact with the GID. Tables 2 and 3 display examples of the GID's contents and the layout of the CID, respectively.
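As one possible concrete rendering of these three databases, the following SQLite sketch (expressed in Python) mirrors the fields listed above; the table and column names are illustrative, not the system's actual schema.

```python
# Hedged sketch: PID, GID, and CID as SQLite tables with the fields named above.
import sqlite3

conn = sqlite3.connect("uzassistant.db")  # illustrative database name
conn.executescript("""
CREATE TABLE IF NOT EXISTS pid (          -- personal information database
    nid TEXT PRIMARY KEY,                 -- national identity (NID) number
    name TEXT, mobile TEXT, date_of_birth TEXT,
    recorded_response TEXT,               -- path to the enrollment audio
    gmm_model BLOB,                       -- serialized speech-tailored GMM
    call_time TEXT
);
CREATE TABLE IF NOT EXISTS gid (          -- generic information database (FAQs)
    question TEXT, answer TEXT
);
CREATE TABLE IF NOT EXISTS cid (          -- credential information database
    nid TEXT REFERENCES pid(nid),
    credential_key TEXT, credential_value TEXT
);
""")
conn.commit()
```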

User response: Karta PIN kodini unutdim? (Forgot your card PIN?)
Solution: Karta ochilgan bank filialiga shaxsingizni tasdiqlovchi hujjat bilan murojaat qilishingiz lozim. (You should apply to the branch of the bank where the card was opened with an identity document.)

WaveNet Model Accuracy
In this study, we employed the character error rate (CER) as a benchmark for evaluation. The CER gauges the efficacy of an automatic speech recognition (ASR) system, reflecting the proportion of characters inaccurately identified. A lower CER signifies superior performance, with a rate of zero indicating flawless results. Following a 24 h training period of the WaveNet model on Colab Pro+ using an Nvidia A100 GPU, the best CER was 0.064907, reached at the 16,400th step. The corresponding training and validation losses were 0.105120 and 0.332718, respectively (Table 4).

CER = (S + D + I) / N, where S, D, and I denote the numbers of substituted, deleted, and inserted characters, respectively, and N is the total number of characters in the reference transcript.
Constraints related to GPU capabilities limited us to a 50 h session on Colab Pro+. To compensate for this, we augmented our training sets to include 10,000, 15,000, and 25,000 sample inputs, respectively (Table 5). The intent was to demonstrate the correlation between an increased number of sample inputs and improvements in training loss, validation loss, and CER. The premise was that longer training durations lead to better outcomes across all metrics.
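The CER defined above can be computed with a standard Levenshtein dynamic program; this sketch is a generic implementation of the metric, not the evaluation code used in the study.

```python
# Hedged sketch: CER = (S + D + I) / N via Levenshtein edit distance.
def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = minimum edits to turn reference[:i] into hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n] / max(m, 1)  # (S + D + I) / N

print(cer("salom dunyo", "salom dunya"))  # 1 edit over 11 chars, about 0.09
```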

Seq2Seq Model-Based Summary Prediction
We sourced our sample data from the Uzbek dataset for Uzbek text summarization. This visualization presents the word count distribution for both the articles and their summaries. Articles peaked at sixty words, whereas summaries typically ranged between five and ten words. During the training period of the seq2seq model, the step loss was 3.3613 and the value loss was 2.9232 (Figure 5).

Our developed speaker recognition system underwent testing in two distinct settings: one with background noise and another with studio-level sound quality. We conducted tests using 56 samples across eight sequential stages for each setting, monitoring the system's accuracy as the number of samples grew. In both scenarios, the system successfully identified all seven individuals when the sample count was limited to seven. However, with an increase in sample size to 14, the system encountered a recognition issue with one individual in the noisy setting, leading to a slight drop in accuracy to 96.4%. While we recorded a 96.4% accuracy rate in real-world conditions, it is important to note that the system's effectiveness might diminish in larger-scale, real-time environments (Figure 6).

Text Summarization Using the Seq2Seq Model
Through our evaluation employing the cosine similarity algorithm, we assessed the degree of resemblance between the customer's inquiry and our existing dataset. We compared each question in the dataset with a customer query to determine the most similar questions and answers. This process takes approximately one to two minutes to complete. Our system was designed to handle different ways of asking questions; therefore, we tested it using various question formats to ensure its accuracy. However, we acknowledge that there is always room for improvement, and we are continually working to enhance the system with the available resources. We are confident that our system generates accurate output and provides answers that are relevant to the questions asked. Tables 6 and 7 show some examples of how it works in practice.
The sentence summarization model was trained with an RNN size of 256, a batch size of two, a learning rate of 0.001, and a probability rate ranging from 0.65 to 0.75 with the Adam optimizer. The model was tested on the Uzbek dataset, achieving an average loss of 0.004. These parameters were chosen by the main author to ensure optimal performance and accurate summarization of sentences. Overall, the sentence summarization model is a highly effective tool for condensing large amounts of text into concise and informative summaries.
According to Table 8, we utilized the weights of the BERT sentence transformer called paraphrase-mpnet-base-v2, which was trained on the USC, CC100-uzbek, voice-recognition-Uzbek, and xls-r-uzbek-cv8 datasets with an average accuracy of 85.27%. To assess the precision of our system, we randomly selected ten questions and constructed a minimum of four variants for each question to evaluate the system responses (Figure 7). Our observations revealed that the system performed exceptionally well for certain queries and provided appropriate responses. However, it failed to deliver accurate answers to the others. The outcomes are summarized in the table below. Based on our evaluation, we estimated the system accuracy as 82.5%. We are confident that increasing the number of variable questions will further improve the accuracy of our system.
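As an illustration of this retrieval step, the following sketch scores a customer query against stored FAQ questions with the paraphrase-mpnet-base-v2 sentence transformer named above; the example questions and query are invented for demonstration.

```python
# Hedged sketch: cosine similarity between a query and FAQ embeddings,
# assuming the sentence-transformers package is installed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-mpnet-base-v2")
faq_questions = ["Karta PIN kodini unutdim?", "Siz kredit berasizmi?"]  # illustrative
faq_embeddings = model.encode(faq_questions, convert_to_tensor=True)

query = "PIN kodimni eslay olmayapman"   # illustrative customer query
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, faq_embeddings)[0]  # similarity to every FAQ
best = int(scores.argmax())
print(faq_questions[best], float(scores[best]))      # most similar stored question
```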
In our experiments, we used various techniques to expand the corpus artificially. These techniques include adding noise (AN), changing the audio reading speed (SP), and masking (SA) in the spectral domain along the time and frequency axes. Additionally, we evaluated the effectiveness of using language models (LM) at the decoding stage. For adding noise, we selected Gaussian noise with a noise amplitude of σ = 0.01. We also varied the audio playback speed to 0.9, 1.0, and 1.1 times the original. For masking, we used two masks with a maximum width of T = 40 along the time axis and two masks with a maximum width of F = 30 along the frequency axis. The results of these experiments are presented in Table 9.
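A minimal sketch of these three augmentations, under the stated settings (σ = 0.01, speeds 0.9/1.0/1.1, two masks with T = 40 and F = 30), might look as follows; the implementation details are assumptions, not the authors' code.

```python
# Hedged sketch: noise addition, speed perturbation, and SpecAugment-style masking.
import numpy as np
import librosa

def add_noise(signal, sigma=0.01):
    """Add zero-mean Gaussian noise with amplitude sigma = 0.01."""
    return signal + np.random.normal(0.0, sigma, size=signal.shape)

def change_speed(signal, rate):
    """Time-stretch the waveform; rate is drawn from {0.9, 1.0, 1.1}."""
    return librosa.effects.time_stretch(signal, rate=rate)

def spec_augment(spec, n_masks=2, T=40, F=30):
    """Zero out random time and frequency bands of a spectrogram (freq x time)."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(n_masks):
        t0 = np.random.randint(0, max(n_time - T, 1))
        spec[:, t0:t0 + np.random.randint(1, T + 1)] = 0.0   # time mask (T = 40)
        f0 = np.random.randint(0, max(n_freq - F, 1))
        spec[f0:f0 + np.random.randint(1, F + 1), :] = 0.0   # frequency mask (F = 30)
    return spec
```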
During the experiments, the Deep Speech 2 model demonstrated superior performance in terms of word error rate (WER) and character error rate (CER). On a test set comprising 5474 samples, the Deep Speech 2 model achieved a WER of 13.8% and a CER of 5.22%, indicating its effectiveness in accurately processing and transcribing speech.

Conclusions and Future Work
UzAssistant represents a groundbreaking step in Uzbekistan's banking sector. It offers a transformative approach to how banking services are accessed and utilized, enhancing efficiency, convenience, and inclusivity. This automated voice chat system stands to elevate the customer experience, broaden financial inclusion, and reduce operational expenses for banks by offering round-the-clock client services without the need for additional staff.
A key benefit of UzAssistant is its ability to facilitate customer interactions in their native Uzbek language, greatly enhancing user satisfaction. However, automated voice chat systems do face certain challenges that could affect the user experience, including a limited range of vocabulary, challenges in speech recognition accuracy, constraints in multi-modal interaction, a somewhat impersonal or mechanical tone, and potential technical issues.
Despite these challenges, the potential for further advancement is significant, particularly in the Uzbek market. There are numerous avenues for future research, such as incorporating more multi-modal interaction capabilities, employing advanced machine learning methods and larger datasets for training, enabling real-time response generation, adding additional functionalities and features, and customizing experiences based on users' past interactions and behaviors.
Recent progress in text-to-speech (TTS) technologies has made it possible to produce more lifelike automated voices in Uzbek. By integrating additional functionalities and features, such as managing intricate transactions and requests or offering tailored recommendations based on a customer's banking history, this technology can be significantly enhanced. Implementing feedback mechanisms within the system to gather client opinions on its precision and effectiveness can provide valuable insights for further improvements. Automated Uzbek voice chat systems in banking can address these challenges and explore these future research directions to create a more immersive and effective customer experience, ultimately benefiting both the customers and the banking institutions.

Figure 2. Procedure for the speaker recognition system.

Figure 3. Structure of the sentence summarization system.

Figure 5. Training and testing value loss.

Figure 6. Comparison of the speaker recognition module's accuracy in two distinct environments.

Table 5. Comparison between various sample inputs.

Table 6. Variations in questions and corresponding system responses.

Question: Siz Kredit Berasizmi? (Do You Give Credit?)
Correct: Kreditingiz haqida bilmoqchimisiz? (Want to know about your credit?)
Incorrect: Agar bizning bankda hisob raqamingiz bo'lsa, siz debit karta ochishga ariza berishingiz mumkin. (If you have an account number in our bank, you can apply for opening a debit card.)

Table 7. Variations in questions and corresponding system responses.

Table 8. The proficiency of the "paraphrase-mpnet-base-v2" sentence transformer in creating top-tier sentence embeddings and its accuracy in embedding paragraphs for search queries.