Automatic Translation between Mixtec to Spanish Languages Using Neural Networks

This paper introduces a novel method for collecting and translating texts from the Mixtec to the Spanish language. The method comprises four primary steps. First, we collected a Mixtec–Spanish corpus that includes 4568 sentences from educational and religious domain texts. To enhance the parallel corpus, we generated synthetic data with GPT-3.5. Second, we cleaned the data with a semi-automatic approach followed by preprocessing and tokenization. In preprocessing, we removed stop words, duplicated sentences, special characters, and numbers and converted the text to lowercase. Third, we performed semi-automatic alignment to find the correspondence of Mixtec–Spanish sentences to generate sentence-level aligned texts necessary for translation. Finally, we trained automatic translation models based on recurrent neural networks, bidirectional recurrent neural networks, and Transformers. Our system achieved a BLEU score of 95.66 for Mixtec-to-Spanish translation and 99.87 for Spanish-to-Mixtec translation. We also obtained a translation edit rate (TER) of 0.5 for Spanish-to-Mixtec and a TER of 16.5 for Mixtec-to-Spanish. Our research stands out as a pioneering effort in the field of automatic Mixtec-to-Spanish translation in Mexico, filling a gap identified in the current literature.


Introduction
Automatic translation involves finding the equivalent of a text in a source language in a target language. The text generated by this process is called the target text. It is a valuable resource for supporting human translators [1], linguistic studies [2], automatic dictionary construction [3], automatic language analyzers [4], information retrieval systems [5], and resources for multilingual translation support in other languages [6]. The purpose of generating parallel corpora, automatic tokenization, and the development of the Mixtec-Spanish translation methodology is to contribute to the preservation of the Mixtec language, an indigenous language spoken by the Mixtec people in southern Mexico. The main challenge in this work is finding translations of texts written in Mixtec into Spanish, coupled with the difficulties arising from corpus scarcity, as the number of collected texts is limited. These texts are often available only physically and are not accessible online in digital format. The complexity of Mixtec grammar means that manually constructing and implementing rule-based techniques can take much work. Further challenges include orthographic normalization and dealing with dialectal variants in Mixtec, in addition to the need for lemmatization tools, grammatical taggers, and tokenizers.
According to the catalog of Mexico's National Institute of Indigenous Languages (INALI) [7], there are 68 linguistic groups and 11 linguistic families. Linguistic groups form a set of linguistic variants related to an indigenous community. A linguistic family comprises languages that share a similar structure and lexicon, as well as a common historical origin. Mixtec belongs to the Oto-Manguean family, sharing characteristics with languages such as Triqui, Mixe-Zoque, Seri, Pame, Tlapaneco, Zapotec, and Mazahua, among others.
As per the census by the National Institute of Statistics and Geography (INEGI) [8], in Mexico, 7,364,645 people aged three years or older speak an indigenous language, which is equivalent to 6% of the population. The languages with the highest number of speakers are Nahuatl, Maya, Tseltal, Tsotsil, and Mixtec. Mixtec has a total of 526,593 speakers and is the fifth most spoken language in Mexico. In accordance with the National Institute of Indigenous Peoples (INPI) [9], Mixtec communities are located in the states of Oaxaca, Puebla, and Guerrero.
As per the INALI catalog [7], the Mixtec language has 81 variants in Mexico. There could be even more variants, as each town may have its own Mixtec variant.
In this work, we address the translation of texts from the Mixtec variants of the states of Guerrero and Oaxaca since the majority of the texts in the collected corpus are from these variants.
The proposed methodology is applicable to any pair of languages with low digital resources. However, in this project, we focused on Mixtec-Spanish.
A survey on the topic found no scientific publications on Mixtec-Spanish translation, which points to the need for a methodology for this language pair. The objectives of this study are as follows:
• Propose a methodology for Mixtec-Spanish language translation.
1. Corpus compilation. We searched for parallel corpora, manually corrected them, and generated synthetic data.
2. Corpus preprocessing. This step involved removing special characters, numbers, duplicated sentences, and empty sentences.
3. Semi-automatic alignment. The texts are aligned semi-automatically to generate paired parallel corpora at the sentence level.
4. Translation training. We trained recurrent neural networks (RNNs), bidirectional recurrent neural networks (BRNNs), and Transformers.
• Generate translation memories.
• Create a tokenized corpus at the word level.
• Generate synthetic Mixtec data with GPT-3.5.
• Evaluate the translation using the BLEU and TER metrics.
• Identify challenges in Mixtec-Spanish translation.
This work focuses on the translation of educational and religious texts, as these constitute the majority of our current corpus domain. Our proposal represents the first methodology for Mixtec-Spanish translation in Mexico.
This paper is structured as follows. Section 1 presents the introduction. Section 2 outlines the theoretical foundations of neural networks. Section 3 describes relevant works regarding translation. Section 4 describes our methodology for Mixtec-Spanish translation. Section 5 presents the results, followed by a discussion and conclusion in Sections 6 and 7.

Theoretical Foundation
Before describing the proposed methodology, it is essential to address the fundamentals of recurrent neural networks, Transformer networks, the BLEU evaluation metric, and Translation Edit Rate (TER).

Recurrent Neural Network
Recurrent Neural Networks (RNNs) are networks that process and generate sequences, such as texts and videos. RNNs carry the context of earlier elements of a sequence forward to the elements that follow, making them well suited to text analysis. Additionally, RNNs can be built with long short-term memory (LSTM) units, which aid in handling long sequences.
There are three types of recurrent neural networks: One-to-Many, Many-to-One, and Many-to-Many. The Many-to-Many type allows for handling sequences of inputs and outputs, enabling tasks such as automatic translation, text-to-speech conversion, or speech-to-text.
RNNs are suitable for generating short sequences with a limited number of characters, which is known as short-term memory. This poses a challenge when dealing with long sequences, as RNNs may not effectively manage large amounts of data. To address this issue, LSTM [10] was developed to overcome the limitations of recurrent networks. Figure 1 explains the architecture of an LSTM neural network.
Figure 1. Architecture of an LSTM neural network [10]. The diagram illustrates the data flow through an LSTM unit for sequence learning tasks. Input gates (g) control the flow of input activations into the memory cell. Output gates (h) manage the output flow of cell activations into the rest of the network. W terms represent the weight matrices associated with each gate and cell state ($s_{c_j}$) for current (c) and previous time steps (in). Net input and output activations ($net_{in_j}$ and $net_{out_j}$) are also depicted, along with the cell output ($y^{c_j}$). This structure enables the LSTM to capture long-range dependencies in data. It is widely used in complex sequence modeling tasks such as natural language processing, speech recognition, and time-series prediction.
The architecture allows a constant error flow through the units. It includes a multiplicative input unit to preserve the memory data stored in j from alterations. Another multiplicative output unit protects other units from irrelevant data stored in j.

Transformer Network
Vaswani et al. [11] proposed the Transformer network, a new approach that includes attention mechanisms and positional encoding. The Transformer network is based on an encoder-decoder architecture with attention mechanisms. Figure 2 illustrates the general architecture of a Transformer. The components of a Transformer network are detailed below.
• Encoder: The encoder consists of a multi-head attention mechanism followed by a feed-forward neural network. Each sublayer within the encoder has a residual connection around it, followed by layer normalization. The term "flows" refers to the data flowing through these residual connections.
• Decoder: The decoder's architecture mirrors that of the encoder, with the addition of masking to the input. This masking prevents the decoder from attending to subsequent positions, ensuring that predictions for a given position can only depend on known outputs at previous positions.
Figure 2. General architecture of a Transformer [11]. The encoder (left) transforms inputs into embeddings, augmented with positional encodings, and processes them through $N\times$ layers of Multi-Head Attention and feed-forward networks. The decoder (right) uses these encodings, applying Masked Multi-Head Attention to prevent future position information leakage and generating output through subsequent $N\times$ layers. The process culminates in a linear and softmax layer that produces the sequence's output probabilities. This design allows for efficient parallel processing and is fundamental for tasks like machine translation.

Bilingual Evaluation Understudy (BLEU)
As stated by Papineni et al. [12], BLEU is an automatic metric for evaluating the quality of a translation in comparison to a reference. The metric is language-independent and fast.
The reference is the correct translation proposed by a human. BLEU scores range from 0 to 1, where a score of 0 indicates no match with the reference, and 1 indicates an exact match. In the evaluation, two components are proposed, as outlined in Equations (1) and (2):
• Modified Precision. The BLEU metric is implemented at the phrase level to obtain precision using n-grams of the candidate translation and the reference translation. However, precision has some drawbacks, hence the proposal of a modified precision $p_n$.
• Brevity Penalty. In the evaluation of the candidate translation compared to the reference translation, there can be variations in length. This impacts the precision obtained. The candidate translation may have more n-grams that are not present in the reference translation, resulting in lower precision. Conversely, the candidate translation may be shorter while the reference translation is longer. To address this, the brevity penalty BP is proposed in Equation (2).
In Equation (1), the geometric mean of the modified n-gram precisions $p_n$ is obtained using n-grams up to length $N$ and positive weights $w_n$ that sum to one.

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \qquad (1)$$

Equation (2) is shown below, where $c$ is the length of the proposed translation, and $r$ is the length of the reference translation.

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} \qquad (2)$$
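To make the two components concrete, the following is a minimal Python sketch of sentence-level BLEU for a single candidate and a single reference. It assumes whitespace tokenization and uniform weights $w_n = 1/N$; the function names are illustrative and are not taken from the evaluation tooling used in this work.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision p_n for one candidate and one reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(c, r):
    """BP = 1 if c > r, otherwise exp(1 - r / c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1.0 - r / c)


def sentence_bleu(candidate, reference, max_n=4):
    """BLEU with uniform weights w_n = 1 / max_n; returns 0 if any p_n is 0."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)
```

For the candidate "the the the the the the the" and the reference "the cat is on the mat", the modified unigram precision is 2/7, because "the" is credited only twice. Corpus-level BLEU aggregates the clipped counts over all sentences before taking the ratio, which this sentence-level sketch does not do.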

Translation Edit Rate
Snover et al. [13] define TER as an automatic evaluation metric for machine translation that counts the number of edits needed to modify a hypothesis so that it matches the reference. Only the number of edits to the closest reference is measured. Equation (3) gives the TER metric.

$$\mathrm{TER} = \frac{\#\ \text{of edits}}{\text{average}\ \#\ \text{of reference words}} \qquad (3)$$

Edits consist of the insertion, deletion, and substitution of individual words and shifts of word sequences. A shift moves a contiguous sequence of words within the hypothesis to another location.
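As a rough illustration of Equation (3), the sketch below computes a simplified TER in Python. It is an assumption-laden approximation: it counts only insertions, deletions, and substitutions with a word-level edit distance and omits the shift operation, so it can overestimate the true TER. The function names are hypothetical.

```python
def word_edit_distance(hypothesis, reference):
    """Minimum insertions, deletions, and substitutions between two token lists."""
    previous = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_word in enumerate(reference, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(previous[j] + 1,          # delete hyp_word
                               current[j - 1] + 1,       # insert ref_word
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def ter_without_shifts(hypothesis, references):
    """Edits to the closest reference divided by the average reference length."""
    edits = min(word_edit_distance(hypothesis, ref) for ref in references)
    average_length = sum(len(ref) for ref in references) / len(references)
    return edits / average_length
```

For the hypothesis "a las diez de la noche" and the single reference "a las tres de la mañana", two substitutions are needed over six reference words, giving a value of about 0.33.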

Related Work
As part of our research into the latest developments in the field, we came across some noteworthy studies on the automatic translation of languages with limited digital resources. We examined these studies and grouped them into three distinct categories based on the types of corpora they used for translation. These categories are parallel corpora, pseudo-parallel corpora, and corpus combination.
Parallel corpora aid in translation by providing equivalent translations between a source and target language. The advantages include error correction and contextual usage of words in both languages. Disadvantages include unavailability for low-digital-resource languages, noise in parallel texts, and linguistic variations causing translation difficulties.
Rule-based translation refers to the use of a set of grammatical rules to translate content from one language to another. These rules are created by language experts and are used to develop the translation system. One advantage of this type of translation is that it does not require a corpus collection for the translation process. Additionally, users can add their own terminology to a dictionary to improve the quality of the system. However, there are some disadvantages to rule-based translation, such as the difficulty in defining grammatical rules, especially in languages with many linguistic variants. Another disadvantage is the lack of a bilingual dictionary of considerable size, which can affect the quality of translations generated by the system.
Translation based on pseudo-parallel corpora involves using a monolingual corpus in the opposite direction of the intended translation. The advantage of this type of translation is the availability of large quantities of data in either the source or target language, allowing for subsampling of the majority class samples. However, a disadvantage is that data are only available in a single language, meaning that synthetic data must be generated at either the source or destination in a translation scenario with parallel data.
Combining parallel, rule-based, and pseudo-parallel corpora presents advantages, since the use of a combination of these approaches strengthens the training data and improves the quality of the results. The use of parallel corpora in a pseudo-parallel combination helps to improve the accuracy of the results and generate synthetic data in both the source and destination directions.
Further details on each of these categories can be found below.

Parallel Corpora
Table 1 shows studies in the field of translation using parallel corpora. The table includes studies such as [14], which explores Wixarika-Spanish translation using statistical methods and evaluates the translations with WER (Word Error Rate) and TER (Translation Edit Rate) metrics. In this work, the authors also incorporated techniques like morphological segmentation and tagging-based segmentation. Notably, Zacarías-Márquez and Meza-Ruiz [15] applied Transformer networks to Ayuuk-Spanish translation, evaluated using the Bilingual Evaluation Understudy (BLEU) metric [12]. Further, Mager et al. [16] conducted experiments across multiple languages, including Ashaninka, Aymara, Bribri, Guarani, Náhuatl, Otomí, Quechua, Rarámuri, Shipibo-Konibo, Wixarika, and Spanish, using BLEU and ChrF (Character Level F-score) for assessment. Another work by Mager et al. [17] is notable for its experiments in Mexicanero, Purépecha, Yorem Nokki, Wixarika, Náhuatl, and Spanish languages using statistical and neural translation techniques, with BLEU as the evaluation metric. Knowles and Littell [18] proposed neural translation with translation memories for digitally low-resource settings, covering German, Upper Sorbian, English, Inuktitut, Wixarika, Raramuri, Náhuatl, Guaraní, and Spanish, using ChrF and BLEU for evaluation. Additionally, Gezmu et al. [19] discussed translation involving preprocessing, segmentation, and alignment in an Amharic-English parallel corpus, evaluated using neural networks with BLEU, BEER [20], and characTER metrics [21]. Dione et al. [22] revealed approaches in neural translation with back-translation using French to Wolof and Wolof to English parallel corpora, employing BLEU for evaluation. Finally, Rubino et al. [23] presented neural translation in low-resource Asian languages, including English, Japanese, Lao, Malay, and Vietnamese, evaluated using the ChrF and BLEU metrics.
Table 1 (only the final rows survive).
References | Translation Techniques | Tools | Metric
(reference not recoverable) | Transformer | Not specified | BLEU 37.5
[23] | Transformer, SMT | Fairseq, Moses | BLEU 21.9, ChrF 0.484

Pseudo-Parallel Corpora
Table 2 outlines studies on translation with low-resource languages using pseudo-parallel corpora. In their research, Kumar et al. [24] explored semi-supervised learning for transfer translation with a pseudo-corpus. They delved into languages such as Bhojpuri, Magahi, Hindi, and Magah, utilizing BLEU, ChrF, and TER for performance assessment. On the other hand, Imankulova et al. [25] demonstrated that the application of filtering techniques to a pseudo-parallel corpus can enhance neural machine translation. This study concentrated on a diverse linguistic range, including Russian, Japanese, Malagasy, German, and English, with the evaluation metrics being BLEU, AAS, and MAS.

Combination of Corpora
Table 3 presents studies related to machine translation using combined corpora. Hlaing et al. [26] showed translation leveraging POS tagging in Thai, Myanmar, and English languages. They utilized BLEU, ChrF, and WER as evaluation metrics. Hujon et al. [27] explored transfer translation using Transformer networks with a shared, subword-tokenized vocabulary, focusing on English and Khasi languages. The evaluation employed BLEU, precision, recall, and F1 metrics. Meetei et al. [28] proposed a multimodal translation approach using neural networks. Their corpus included images with titles translated from English to Hindi, evaluated using BLEU and ChrF metrics. Meanwhile, Maimaiti et al. [29] emphasized the use of a shared lexicon and back-translation, introducing transfer translation with lexicon embedding pretraining for Azerbaijani and Uzbek. BLEU was the chosen evaluation metric. Sethi et al. [30] presented a hybrid translation method that integrated bilingual dictionaries, grammatical corpus, and syntactic rules between Sanskrit and Hindi. They evaluated their approach using the BLEU metric. Pirinen [31] discussed developing North Sámi-German translation utilizing morphosyntactic parsers and rules, with WER as the evaluation metric. Karakanta et al. [32] described a translation process through transliteration, transforming a high-resource language corpus into a low-resource language. This approach involved training transliteration models with Wikipedia texts and generating a parallel corpus through back-translation, using BLEU, BEER, and ChrF for evaluation. Tonja et al. [33] introduced a novel approach for neural translation in low-resource languages like Ethiopian, employing parallel texts and the BLEU metric for evaluation. Finally, Singh et al. [34] reported on improving translation quality in low-resource settings for English-Manipuri languages through semi-supervised neural machine translation, using BLEU and ChrF as metrics.
Table 3. Research summary on translation methodologies utilizing a combination of corpora, highlighting the integration of Statistical Machine Translation (SMT), Recurrent Neural Networks (RNNs), and other techniques. The table outlines the reference works, the translation techniques applied, the computational tools used, and a range of metrics for evaluating translation performance, including Precision, Recall, F-Measure, BLEU score, Word Error Rate (WER), and character F-measure (ChrF).
Columns: References | Translation Techniques | Tools | Metric

Translation Software Including Low-Resource Languages
Table 4 presents software that performs translation for low-digital-resource languages such as Maya, Wixarika, Otomi, Guarani, Aymara, and Indi, although the majority of the languages included are high-digital-resource languages.

Methodology
This paper introduces a methodology for the automatic translation of Mixtec-Spanish texts. The proposed approach is divided into four phases (shown in Figure 3), each addressing specific aspects of the translation process.

Phase 1: Collection of Parallel Texts
Based on the review of the construction of parallel corpora in low-resource languages [31,35,36], we performed corpora search, digitization, and manual correction. Given the low-resource nature of the problem, we implemented techniques to increase the corpora by generating synthetic data with back-translation. The following section describes the search for parallel corpora, generation of synthetic data in Spanish with back-translation, generation of synthetic data with GPT-3.5, digitization, and manual correction.

Corpus Search
We collected texts through in-person visits and via official indigenous language websites in Mexico.

• Searches for parallel texts in person. We carried out the in-person searches at the Faculty of Philosophy and Political Sciences of the Autonomous University of Querétaro. In the in-person searches, we obtained a total of 8 books.
• Searches for parallel texts on official websites. We also conducted web searches on the official website gob.mx, INPI, and the Federal Telecommunications Institute. We converted texts found on websites in PDF format into plain text format.
While parallel texts are available, there is a significant shortage of Mixtec language texts necessary for training translation models. With this in mind, we generated synthetic data, as described in the next section.

Generation of Synthetic Data in Spanish-Mixtec with GPT-3.5
We used GPT-3.5 to create a Mixtec language dataset, as shown in Figure 4.
Figure 4. Generation of synthetic Mixtec data: input, phases (GPT-3.5), and target text in Mixtec.
The Spanish corpus, obtained from back-translation, is fed into the Mixtec synthetic data generation architecture. GPT-3.5 then translates from Spanish to Mixtec, focusing on the Guerrero state variant. The main author, being a native Mixtec speaker, manually checked that GPT-3.5 results were correct. As a result of this procedure, we produced a total of 326 Spanish-Mixtec sentence pairs.
After the data collection from the web search, in-person searches, and synthetic data generation, we performed digitization and manual correction.

Digitization and Manual Correction
Digitization was performed only for the physical books. We scanned each page of the books and saved them in document image format. We then used the Abbyy FineReader software (https://finereader.en.softonic.com/) to convert the image documents into plain text. The document image reading in Abbyy FineReader was performed in Spanish, as the software does not include Mixtec in its language catalog. During the document reading, two main issues were identified: first, unreadable words in the Mixtec language, and second, confusion in selecting texts adjacent to images. The strategies implemented in text correction are detailed below.

• Identify and Modify Unreadable Words in Mixtec: During the document image reading, we encountered Mixtec words that were unreadable. In these cases, we manually modified the text to the correct word in Mixtec. Additionally, we added the characters identified as causing difficulties in reading the document image to Abbyy FineReader's alphabet. Then, we re-read the document with the added characters, allowing us to correct these issues. Below is the list of characters added to the Abbyy FineReader dictionary: A, a, E, e, I, i, O, o, U, u, Ā, ā, Ē, ē, Ī, ī, Ō, ō, Ū, ū. Table 5 gives the words that Abbyy FineReader could not recognize and their corrected forms.
• Confusion in Selecting Texts Adjacent to Images: Sometimes, during the digitization of documents, the software selected both the image and the text, making it difficult to separate the text from the image. In these cases, the solution was to move the selection area over the text and then remove the image from the text.
We compiled a total of 30 books (8 physical and 22 digital), identifying texts from different categories. Table 6 presents the categories and the total number of texts compiled in each category. Regarding the number of sentences, there is a total of 3424 pairs of Spanish-Mixtec sentences, of which the synthetic data with GPT-3.5 correspond to 326 pairs in total.
The categories listed in Table 6 are based on the texts that are most abundant in the corpus. In the case of parallel educational texts, these categories include common phrases, daily conversations, greetings, names of fruits and animals, cooking recipes, and grammar rules for verbs and nouns. The Mexican national anthem is also included in this corpus. We believe that generating translations in the educational domain can help Mixtec speakers to communicate with Spanish speakers. Additionally, by incorporating parallel corpora from other domains like medicine or law, the translation system can be expanded to allow for greater generalization. This method enables the use of various corpus domains to generate accurate translations.

Phase 2: Developing and Applying the Corpus Processing Module
The research conducted by Tran et al. [36] and Karunesh et al. [37] suggests that corpus processing should be executed in three critical stages: (1) removal of junk words, (2) exclusion of sentences that are not genuinely parallel, and (3) elimination of duplicated sentences. In our project, we focused on stages (1) and (3). We opted not to remove sentences that were not perfectly parallel due to the implementation of a semi-automatic alignment process. Additionally, we introduced a processing stage that involved the removal of special characters and numbers, as well as converting all text to lowercase.

1. Remove junk words. Parallel texts sourced from websites and digitized documents included irrelevant phrases or words, such as editorial information found at the end of texts and chapter titles from the Bible positioned at the beginning of sentences. These were removed from the texts. Additionally, proper names like "IGNACIO", "JUAN", and "TIBURCIO", typically found at the beginning of sentences within dialogues, were also deleted as they were not pertinent to the parallel corpus.
2. Suppress duplicated sentences. The parallel corpus, including synthetic data generated through back-translation and GPT-3.5, contained redundant sentences. These unnecessary repetitions were removed to improve the quality of the parallel corpus.
3. Remove special characters. The corpus featured various special characters, such as parentheses, square brackets, hyphens, and angular quotation marks. These characters were eliminated from the texts because they did not contribute meaningfully to the parallel corpus.
4. Delete numbers and convert to lowercase. Texts retrieved from web searches contained numbers, such as those in prayer lists and Bible verse numbers, which were irrelevant to the parallel corpus. To achieve text standardization, all texts were converted to lowercase.
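Steps 2 to 4 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the processing module used in this work: the regular expression `UNWANTED` and the function names are hypothetical, and the manual removal of junk words (step 1) is not covered. Apostrophes are deliberately kept, on the assumption that they may be part of Mixtec orthography.

```python
import re

# Hypothetical pattern: parentheses, square brackets, hyphens and dashes,
# quotation marks (including angular ones), digits, periods, and commas.
UNWANTED = re.compile(r'[()\[\]\-–«»"“”0-9.,]')


def clean_sentence(sentence):
    """Drop unwanted characters, collapse whitespace, and convert to lowercase."""
    without_symbols = UNWANTED.sub(" ", sentence)
    return " ".join(without_symbols.split()).lower()


def clean_parallel(mixtec_lines, spanish_lines):
    """Clean both sides together so that the sentence pairs stay aligned."""
    seen = set()
    cleaned = []
    for mixtec, spanish in zip(mixtec_lines, spanish_lines):
        pair = (clean_sentence(mixtec), clean_sentence(spanish))
        if not pair[0] or not pair[1]:
            continue  # skip pairs with an empty side
        if pair in seen:
            continue  # skip duplicated pairs
        seen.add(pair)
        cleaned.append(pair)
    return cleaned
```

Filtering pairs rather than individual lines is a design choice of this sketch: dropping a duplicate or empty sentence on only one side would shift every later line and break the sentence-level correspondence.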

Automatic Module for Processing Parallel Mixtec-Spanish Corpus
Before processing the parallel corpus, we compiled it into three categories: training, validation, and testing. We generated a total of 6 text files, with each category having two files, one for the Mixtec language and one for Spanish. Figure 5 shows the total number of words (tokens) and sentences in the corpus. In the input file format for processing, the Mixtec language file contains one sentence per line, separated by line breaks. Similarly, the Spanish language text input follows the same format, with one sentence per line, separated by line breaks. Both text files are encoded in UTF-8 for better processing in the translation module. Table 7 provides an example of the Mixtec and Spanish texts contained in the corpus.

Table 7. Example of the Mixtec and Spanish texts contained in the corpus.
Spanish | Mixtec
Las cuatro de la tarde | Ka kumi xkuaà
A las diez de la noche | Ka uxi te ñuú
A las tres de la mañana | Ka uni te na a

After we had split the corpus into training, validation, and test sets, we developed and implemented a corpus processing module. This module performed several key activities, including the removal of punctuation marks (such as periods, commas, and accents) and converting all text to lowercase to facilitate more effective processing in translation tasks. We outline these activities in pseudocode within Algorithm 1. Our process began with a UTF-8 encoded text file as input and produced a cleaned text file as output. We utilized regular expressions, specified in lines 5 and 10 of the algorithm, for text cleaning. These expressions were designed to remove a variety of characters, including parentheses, brackets, dashes, quotes, numbers, periods, and commas. We applied this text cleaning approach consistently to both the Spanish parallel texts and the Mixtec texts, ensuring an equivalent standard of preprocessing was applied across languages.
Algorithm 1 (only the final steps survive): open output file S to write the content P; convert the text to lowercase.
Table 8 showcases examples of text that have been processed using Algorithm 1. The "Input" column contains the original, unprocessed texts, while the "Output" column displays the texts after they have undergone processing.

Phase 3: Semi-Automatic Alignment
Alignment entails linking a sentence in a Spanish text with its equivalent Mixtec translation, utilizing separators such as tabs, hashtags, or vertical bars, or by creating distinct files for the source and target languages.
In this project, our approach centered on aligning sentences by generating separate files for the source and target languages.Given the nature of the available texts, where a Mixtec sentence might be immediately followed by its Spanish translation within a single file, we devised an algorithm to extract these sentences.This process resulted in two distinct files: one for Mixtec and another for Spanish, with each line in the Mixtec file corresponding directly to a line in the Spanish file.
Following the alignment process, a meticulous manual review was performed by an expert in the Mixtec language to verify the accuracy of the sentence pairings.
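The extraction step described above can be illustrated with a short Python sketch. It assumes the simplest possible layout, in which non-empty lines strictly alternate between a Mixtec sentence and its Spanish translation; the actual extraction algorithm is not reproduced here, and the function names are hypothetical.

```python
def split_interleaved(lines):
    """Split alternating Mixtec/Spanish lines into two aligned lists."""
    sentences = [line.strip() for line in lines if line.strip()]
    if len(sentences) % 2 != 0:
        raise ValueError("odd number of sentences: a translation is missing")
    mixtec = sentences[0::2]   # sentences 1, 3, 5, ... are assumed to be Mixtec
    spanish = sentences[1::2]  # sentences 2, 4, 6, ... are assumed to be Spanish
    return mixtec, spanish


def write_aligned(mixtec, spanish, mixtec_path, spanish_path):
    """Write one sentence per line so that line i of each file forms a pair."""
    with open(mixtec_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(mixtec) + "\n")
    with open(spanish_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(spanish) + "\n")
```

Raising an error on an odd sentence count is one way to flag files that need the manual review mentioned above, since a single missing translation would misalign every pair that follows.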

Phase 4: Development and Application of the Translation Algorithm
In this phase, we developed and applied the algorithm for the automatic translation of texts written in Mixtec-Spanish.
According to Sennrich et al. [38] and Tonja et al. [39], the procedures to improve translation with monolingual data consist of the following steps:
1. The translation model was trained in the Spanish-to-Mixtec direction with the authentic parallel corpus to generate synthetic data in Mixtec.
2. After training the model with the authentic corpus, the model was then trained with synthetic data in Mixtec.
3. Finally, the synthetic and authentic data were combined to train the model in the Spanish-Mixtec and Mixtec-Spanish directions.
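The data flow of these steps can be sketched in Python as below. The sketch covers only how synthetic pairs are produced and merged with authentic ones; model training is out of scope. The `translate_es_to_mixtec` argument is a hypothetical callable standing in for a trained Spanish-to-Mixtec model, and both function names are illustrative.

```python
def back_translate(monolingual_spanish, translate_es_to_mixtec):
    """Pair each Spanish sentence with a model-generated Mixtec sentence."""
    return [(translate_es_to_mixtec(sentence), sentence)
            for sentence in monolingual_spanish]


def combine_corpora(authentic_pairs, synthetic_pairs):
    """Concatenate authentic and synthetic (Mixtec, Spanish) pairs without duplicates."""
    combined = []
    seen = set()
    for pair in list(authentic_pairs) + list(synthetic_pairs):
        if pair not in seen:
            seen.add(pair)
            combined.append(pair)
    return combined
```

The combined list can then be written out as one Mixtec file and one Spanish file, with either side serving as the source depending on the translation direction being trained.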

Vocabulary Generation
After cleansing the corpus, we proceeded with tokenization, a process that segments sentences into individual words or subwords.This study specifically concentrated on word-level tokenization, which, as our experiments demonstrated, outperforms other tokenization strategies, especially in languages with scarce digital resources.We implemented word-level tokenization across both recurrent neural network (RNN) and transformer network architectures.
Tokenization results in a collection of words, each associated with its respective tensors or vectors.This conversion of words in a sentence into vectors is referred to as word embedding.Word embedding facilitates the representation of tokens in a multidimensional space, allowing for the clustering of semantically similar terms.We denote the collection of segmented words and their corresponding tensors as vocabularies.In the following sections, we elaborate on the creation of vocabularies for both Mixtec and Spanish languages.
The vocabulary is composed of two files, one for Mixtec and another for Spanish. The format of the vocabulary file is as follows: Token+space+tensor identifier, where token corresponds to a word from the vocabulary followed by a blank space and the numerical value of the token. Table 9 shows examples of the vocabulary in Mixtec and Spanish. The format for generating vocabularies is applied in recurrent neural networks and Transformers. Table 10 shows the tokenization of words by language. Next, we detail the proposed architectures of the recurrent neural network and Transformer network, in addition to the tokenization and vectorization of words.
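A word-level vocabulary in the "token, space, identifier" format described above could be produced along the following lines. This Python sketch is illustrative only: assigning identifiers by descending frequency (with ties broken alphabetically) is an assumption, as are the function names, and it does not reproduce the vocabulary tooling of any particular library.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Map each word to an integer identifier, most frequent words first."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: index for index, word in enumerate(ordered)}


def format_vocabulary(vocabulary):
    """Render the vocabulary as 'token identifier' lines."""
    return [f"{word} {index}" for word, index in vocabulary.items()]
```

Applied to the cleaned Mixtec sentences "ka kumi xkuaà" and "ka uxi te ñuú", the word "ka" occurs twice and therefore receives identifier 0, so the first line of the vocabulary is "ka 0". One vocabulary would be built per language, matching the two-file layout described above.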

Recurrent Neural Network
Figure 6 shows the architecture of the proposed recurrent neural network. Table 11 shows the number of layers and hyperparameters. Each element X_i in the input layer denotes a sentence in Mixtec and Spanish. In the recurrent neural network, we assigned a word embedding length of 128.

Transformer Network
Figure 7 shows the proposed Transformer architecture. Each X_i in the input layer denotes a sentence in Mixtec and Spanish, and each word of the sentence is subsequently converted into an embedding. For the Transformer model, we used word embeddings of size 512.
We used the OpenNMT-py library because it supports sequence generation, Transformer models, recurrent neural networks, and automatic evaluation. Our work was conducted within the Google Colab environment. The parameters listed in Table 11 are specified in a YAML configuration file. This file includes the paths to the training and validation corpora, along with the paths to the logs that record the training process.
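To make the role of the configuration file concrete, the sketch below renders a minimal OpenNMT-py style YAML file from Python. All file paths are hypothetical. The layer count, embedding size, and dropout are the values reported for the best Transformer model, and any remaining hyperparameters from Table 11 would still need to be added.

```python
def render_config(data_dir: str, layers: int, emb_size: int, dropout: float) -> str:
    """Return the text of a minimal OpenNMT-py style YAML configuration.

    Every path below is a hypothetical placeholder, not the authors' layout.
    """
    return f"""\
# Vocabulary files written by onmt_build_vocab
save_data: {data_dir}/run
src_vocab: {data_dir}/vocab.src
tgt_vocab: {data_dir}/vocab.tgt

# Training and validation corpora
data:
    corpus_1:
        path_src: {data_dir}/train.mix
        path_tgt: {data_dir}/train.es
    valid:
        path_src: {data_dir}/valid.mix
        path_tgt: {data_dir}/valid.es

# Log and checkpoint locations
log_file: {data_dir}/train.log
save_model: {data_dir}/model

# Model settings
encoder_type: transformer
decoder_type: transformer
layers: {layers}
word_vec_size: {emb_size}
dropout: [{dropout}]
"""


config_text = render_config("data", layers=6, emb_size=512, dropout=0.1)
```

A usable configuration would also set options such as the hidden size, the number of attention heads, and the number of training steps, which are omitted here because their values are not stated in this section.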

Results
In this section, we present the results of the Mixtec-Spanish translation experiments. The evaluation is twofold: an automatic assessment using the BLEU and TER metrics and a manual review conducted by an expert in the Mixtec language.
As detailed by Papineni et al. [12], the BLEU metric automatically compares a translation system's output against a reference translation, ideally produced by a human expert. BLEU scores range from 0 to 1, with 0 indicating no resemblance between the candidate and reference translations and 1 signifying a perfect match; the scores in this paper are reported on the equivalent 0 to 100 scale. The translation edit rate (TER), by contrast, measures the edit distance between the translated text and the reference text. A lower TER score indicates a translation that is closer to the reference and thus generally of higher quality. Conversely, a higher TER score means that more edits are needed to match the reference translation, indicating a lower-quality translation.
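The idea behind TER can be illustrated with a simplified sketch. It counts word-level insertions, deletions, and substitutions and divides by the reference length. Unlike full TER, it ignores block shifts, and the percentage scaling is an assumption, so its values are not comparable to the scores reported in this paper. The example sentences are invented.

```python
from typing import List


def word_edit_distance(hypothesis: List[str], reference: List[str]) -> int:
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    previous = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_word in enumerate(reference, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,         # delete hyp_word
                current[j - 1] + 1,      # insert ref_word
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def simplified_ter(hypothesis: str, reference: str) -> float:
    """Edits per reference word, as a percentage; block shifts are ignored."""
    hyp_words, ref_words = hypothesis.split(), reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_edit_distance(hyp_words, ref_words) / len(ref_words)


# Two edits (insert "negro", delete "aqui") over four reference words.
score = simplified_ter("el gato duerme aqui", "el gato negro duerme")
```

For reported results, an established implementation of BLEU and TER should be used instead of this simplification.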
In the experiments conducted, a total of 4568 Mixtec-Spanish sentence pairs were used. The distribution was as follows: 70% (3197 pairs) were allocated for training, 15% (685 pairs) for validation, and the remaining 15% (685 pairs) for testing. The focus was specifically on educational texts, which included materials such as the Mexican national anthem, everyday dialogues, descriptions of objects, and sentences in the past tense.
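A 70/15/15 partition of this kind can be sketched as follows. The shuffling, the fixed seed, and the placeholder sentence pairs are assumptions. Because the remainder after rounding is assigned to the test set, the resulting sizes can differ by one pair from the counts reported above.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def split_corpus(
    pairs: Sequence[Pair],
    train_ratio: float = 0.70,
    valid_ratio: float = 0.15,
    seed: int = 0,
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Shuffle the pairs reproducibly and cut them into train/valid/test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_valid = int(len(shuffled) * valid_ratio)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains after rounding
    return train, valid, test


# Placeholder pairs standing in for the 4568 aligned sentences.
placeholder_pairs = [(f"mixtec sentence {i}", f"spanish sentence {i}") for i in range(4568)]
train, valid, test = split_corpus(placeholder_pairs)
```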
Table 12 shows the outcomes of the automatic evaluation, employing both the BLEU and TER metrics, for translations from Spanish to Mixtec. The highest scores obtained in the experiments are highlighted in bold. Table 13 shows the results of the automatic evaluation with the BLEU and TER metrics for translations from Mixtec to Spanish. Table 14 provides example predictions from Mixtec to Spanish. Each sentence in the first column is the input to the model, and the second column shows the predicted output. The best Spanish-to-Mixtec translation model was the Transformer with six layers and a Dropout of 0.1, achieving a BLEU score of 99.87 and a TER of 0.5.
For the Mixtec-to-Spanish translation, the best model was the same Transformer architecture with six layers and a Dropout of 0.1, achieving a BLEU score of 95.66 and TER of 16.5.
In terms of training durations, the RNN and BRNN models completed their training in an average of 30 min, whereas the training time for the Transformer models extended to four hours.
Table 15 offers a comparative analysis between our translation approach for Mixtec-Spanish and related low-resource language work highlighted in the literature review. These results demonstrate a significant improvement, showing a 28% increase in BLEU score with respect to scores obtained in related studies.

Work                   BLEU
[17]                   23.47
Knowles et al. [18]    16.2
Dione et al. [22]      37.5
Rubino et al. [23]     26.8
Our approach           52.39 *

* Average BLEU score between BRNN, RNN, and Transformer.

Discussion
Our review of the state of the art in machine translation revealed several challenges in translating low-resource languages, including the absence of natural language processing tools, limited text availability, and the complexity of tonal variations. The following points detail these challenges:
• Lack of Natural Language Processing Tools: Resources such as grammatical taggers, normalizers, and automatic tokenizers, crucial for processing these languages, are often unavailable.
• Restricted Corpus Size: Corpora in low-resource languages are typically small, leading to a lack of sufficient texts for experimentation and validation in translation processes.
• Tonal Complexity: The diversity of tonal variations in low-resource languages adds complexity to the application of natural language processing algorithms, including machine translation and grammatical tagging.
• Inconsistency in Texts: A notable challenge is the inconsistency in texts within parallel corpora used for training. Such inconsistencies can adversely affect machine translation efficiency, as highlighted by Mager et al. [40] and Matos Veliz et al. [41] in their research on neural network-based normalization for low-resource language environments.
• Lack of Keyboards: Most low-resource languages have no keyboard layout that defines the characters needed to write texts in them.
• Lack of Optical Character Recognition: Most languages with low digital resources are not included in the language catalogs of recognition tools, which would allow the characters of these languages to be identified and corrected.

Conclusions
In this study, we developed the first methodology for Mixtec-Spanish and Spanish-Mixtec translation in Mexico. Our approach achieved a 28% improvement in BLEU scores compared to existing works. This enhancement was possible through the integration of neural network techniques, corpus processing, synthetic data generation, and back-translation. We found that normalizing the corpus during preprocessing significantly enhanced translation accuracy and improved BLEU scores. During this phase, we encountered and overcame challenges with the Transformer network in decoding Mixtec texts. We devised an algorithm to remove or replace the special characters in Mixtec text that were problematic for translation.
Furthermore, the use of synthetic data generation strategies significantly expanded our parallel corpus. The methodology we developed holds promise for adaptation to text translation in other indigenous languages such as Amuzgo, Tlapaneco, and Zoque. We also created Mixtec-Spanish translation memories, which will be an asset for the creation of voice recognition, grammatical tagging, linguistic studies, multilingual translation involving other Mexican indigenous languages, automatic summary generation, speech-to-text conversion, sentiment analysis, and language detection.
In future work, we plan to expand the parallel corpus and refine the results further by incorporating synthetic data generation with more advanced models such as GPT-4 [42], Gemini Ultra [43], or Claude 3 Opus [44], as well as adding other augmentation techniques such as noise injection or transfer learning from high-resource language models.

Figure 1. Schematic representation of an LSTM cell based on [10]. The diagram illustrates the data flow through an LSTM unit for sequence learning tasks. Input gates (g) control the flow of input activations into the memory cell. Output gates (h) manage the output flow of cell activations into the rest of the network. W terms represent the weight matrices associated with each gate and cell state (S_cj) for current (c) and previous time steps (in). Net input and output activations (net_inj and net_outj) are also depicted, along with the cell output (y_cj). This structure enables the LSTM to capture long-range dependencies in data. It is widely used in complex sequence modeling tasks such as natural language processing, speech recognition, and time-series prediction.

Figure 2. Architecture of the Transformer model's encoder-decoder framework based on [11]. The encoder (left) transforms inputs into embeddings, augmented with positional encodings, and processes them through Nx layers of Multi-Head Attention and feed-forward networks. The decoder (right) uses these encodings, applying Masked Multi-Head Attention to prevent future position information leakage and generating output through subsequent Nx layers. The process culminates in a linear and softmax layer that produces the sequence's output probabilities. This design allows for efficient parallel processing and is fundamental for tasks like machine translation.

Figure 3. Overview of the automatic translation methodology. Phase 1 involves actively searching for parallel texts in both physical and digital formats as well as digitization and manual correction. Phase 2 encompasses the cleaning and normalization of the parallel corpus. Phase 3 performs a semi-automatic alignment of Mixtec-Spanish texts. Finally, Phase 4 consists of developing and applying the translation algorithm.

Figure 4. Synthetic data generation architecture. Input is marked in green, the translation phases with GPT-3.5 are in blue, and the final output is indicated in gold.

Figure 5. Comparative analysis of token and sentence counts in the processed corpus. The bar chart displays the number of tokens and sentences in the training, validation, and test subsets for both Spanish and Mixtec.

Figure 6. Architecture of the recurrent neural network. X denotes the input sentences in Mixtec and Spanish, which are followed by the LSTM cells of the recurrent neural network, while Y denotes the target or output text of the architecture.

Figure 7. Architecture of the Transformer network. X denotes the corpus in Mixtec and Spanish, which is passed to the Transformer with its encoder and decoder layers. Y denotes the output of the Transformer.

Algorithm 2 presents the pseudocode for the training process. The instructions on lines 5, 6, and 7 use commands provided by OpenNMT-py, with the input parameters specified in a configuration file.

Algorithm 2. Training algorithm
1: Start
2: Input: Text file in Mixtec and Spanish in UTF-8, YAML configuration file
3: Output: Meta text file
4: Specify the path of the YAML configuration file.
5: Call onmt_build_vocab to generate vocabularies in Mixtec and Spanish.
6: Call onmt_train to initiate the training process.
7: Call onmt_translate to conduct translation tests.
8: End
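The three OpenNMT-py calls in the training algorithm can be scripted as in the sketch below. The file names are hypothetical, and the option to sample the full corpus when building the vocabulary (`-n_sample -1`) is an assumption rather than a setting stated in the paper. By default the script only prints the commands, so it can be inspected without OpenNMT-py installed.

```python
import subprocess
from typing import List


def build_commands(config: str, model: str, src: str, output: str) -> List[List[str]]:
    """Assemble the vocabulary, training, and translation commands in order."""
    return [
        ["onmt_build_vocab", "-config", config, "-n_sample", "-1"],
        ["onmt_train", "-config", config],
        ["onmt_translate", "-model", model, "-src", src, "-output", output],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it when dry_run is False."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)  # stop at the first failure


# Hypothetical paths; the checkpoint name depends on the configured save_model.
commands = build_commands(
    config="mixtec_es.yaml",
    model="model_step_1000.pt",
    src="test.mix",
    output="predictions.es",
)
run_pipeline(commands)
```

Calling `run_pipeline(commands, dry_run=False)` would execute the three steps in sequence, provided OpenNMT-py is installed and the configuration file exists.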

Figure 8 shows in more detail the architecture of the translation training algorithm. The result is called the target text or translated text. The target text is generated in a plain text file.
• Attention Mechanism: The attention mechanism is built around a function that computes outputs based on four types of vectors: query, key, value, and output. The mechanism operates by calculating a weighted sum of the values, where the weight assigned to each value is determined by a compatibility function of the query with the corresponding key. This process involves a scaled dot-product attention function, which scales the dot products of the queries and keys by the inverse square root of the dimension of the keys to facilitate training stability. The scaled scores are then passed through a softmax function so that the resulting weights sum to one.
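Scaled dot-product attention, softmax(QKᵀ/√d_k)V, can be written out for small matrices as in the following sketch. It uses plain Python lists for readability and toy values chosen for illustration. It is not the batched, multi-head implementation used inside a Transformer library.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax; the outputs sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Compute softmax(Q K^T / sqrt(d_k)) V for small dense matrices."""
    d_k = len(keys[0])
    outputs = []
    for query in queries:
        # Compatibility of this query with every key, scaled by 1/sqrt(d_k).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * value[col] for w, value in zip(weights, values))
            for col in range(len(values[0]))
        ])
    return outputs


# Toy example: one query that matches the first key more closely.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
attended = scaled_dot_product_attention(queries, keys, values)
```

Because the query aligns with the first key, the first value vector receives the larger weight, so the first component of `attended[0]` is larger than the second.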

Table 1. Comparison of various translation studies utilizing parallel corpora within Statistical Machine Translation (SMT) and Transformer models. It details the references of the studies, the translation techniques employed, the specific tools used for the translation, and the metrics applied to evaluate translation accuracy, such as Word Error Rate (WER), BLEU score, and character n-gram F-score (ChrF).

Table 2. Evaluation of translation approaches using pseudo-parallel corpora.

Table 4. Translation tools incorporating low-digital-resource languages. We detail the names of the tools, examples of low-resource languages they support, and the total number of languages available in each tool.

Table 5. Words not recognized by Abbyy FineReader in Mixtec language.

Table 6. Categories of texts identified in the corpus.

Table 7. Comparison between Spanish and Mixtec parallel corpus.

Table 8. Examples of processed texts in Mixtec and Spanish.

Table 9. Examples of vocabularies in Mixtec and Spanish.
Table 11 shows the number of layers and hyperparameters.

Table 11. Input parameters of the neural network.

Table 12. Spanish-to-Mixtec evaluation using BLEU and TER metrics. The table compares different model architectures, such as RNN, BRNN with Dropout, and Transformer with Dropout, across varying numbers of layers. The highest-scoring results are highlighted in bold for easy identification.

Table 13. Mixtec-to-Spanish evaluation using BLEU and TER metrics. The table compares different model architectures, such as RNN, BRNN with Dropout, and Transformer with Dropout, across varying numbers of layers. The highest-scoring results are highlighted in bold for easy identification.

Table 15. Comparison with respect to related works in low-digital-resource languages.