Chinese-Uyghur Bilingual Lexicon Extraction Based on Weak Supervision
Abstract
1. Introduction
2. Related Research
3. Bilingual Dictionary Construction Method Based on Cross-Language Word Vectors
3.1. Cross-Language Word Embedding Text Representation
3.2. Bilingual Dictionary Construction Method
(1) Corpus preprocessing. The source- and target-language corpora are preprocessed. The Chinese text must be segmented into words, whereas Uyghur words are already delimited by spaces, so the Uyghur text needs no segmentation. The Chinese sentences are segmented with the Jieba word segmentation tool. From the preprocessed Chinese and Uyghur sentences, the word frequencies and vocabulary sizes are counted, where $n$ is the vocabulary size of the source language, $m$ is the vocabulary size of the target language, $\{x_i \mid i = 1, 2, \ldots, n\}$ is the source-language word set, and $\{y_j \mid j = 1, 2, \ldots, m\}$ is the target-language word set. A minimal preprocessing sketch follows.
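The paper specifies Jieba for Chinese segmentation but does not describe its data files; the sketch below assumes plain-text corpora `zh.txt` and `ug.txt` with one sentence per line, which are illustrative names rather than the authors' actual files.

```python
# Preprocessing sketch: segment Chinese with Jieba, split Uyghur on spaces,
# then count word frequencies and vocabulary sizes n and m.
from collections import Counter
import jieba

def load_chinese(path):
    # Chinese has no word delimiters; Jieba segments each sentence into words.
    with open(path, encoding="utf-8") as f:
        return [list(jieba.cut(line.strip())) for line in f if line.strip()]

def load_uyghur(path):
    # Uyghur words are already space-delimited, so whitespace splitting suffices.
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f if line.strip()]

zh_sents = load_chinese("zh.txt")   # hypothetical file name
ug_sents = load_uyghur("ug.txt")    # hypothetical file name
zh_freq = Counter(w for s in zh_sents for w in s)   # source word frequencies
ug_freq = Counter(w for s in ug_sents for w in s)   # target word frequencies
n, m = len(zh_freq), len(ug_freq)   # vocabulary sizes n (source) and m (target)
```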
(2) Monolingual embedding training. Word embeddings are trained on both corpora with the Skip-gram model using the hierarchical softmax algorithm, so that the Chinese and Uyghur words are represented as word vectors of dimension $d$. The vectors of the source-language words are written $\{v_{x_1}, v_{x_2}, \ldots, v_{x_n}\}$, where $v_{x_i} \in \mathbb{R}^d$, $i \in \{1, 2, \ldots, n\}$. Similarly, the vectors of the target-language words are written $\{v_{y_1}, v_{y_2}, \ldots, v_{y_m}\}$, where $v_{y_j} \in \mathbb{R}^d$, $j \in \{1, 2, \ldots, m\}$. A training sketch follows.
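The paper names the model (Skip-gram with hierarchical softmax) but not a toolkit; the sketch below uses gensim as one plausible choice, with $d = 300$ matching the best-performing dimension in Section 4.4.1.

```python
# Skip-gram training sketch with gensim (toolkit choice is an assumption).
from gensim.models import Word2Vec

def train_skipgram(sentences, dim=300, window=5):
    # sg=1 selects Skip-gram; hs=1 enables hierarchical softmax
    # (negative=0 disables the alternative negative-sampling objective).
    return Word2Vec(sentences, vector_size=dim, window=window,
                    sg=1, hs=1, negative=0, min_count=5, epochs=10)

zh_model = train_skipgram(zh_sents)   # v_x vectors, e.g., zh_model.wv["地球"]
ug_model = train_skipgram(ug_sents)   # v_y vectors, e.g., ug_model.wv["قۇش"]
```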
(3) Mapping learning from a seed dictionary. A seed-dictionary-based method is used to learn the mapping matrix $W$ that takes the Chinese (source) embedding space into the Uyghur (target) embedding space. This construction method presupposes a small-scale bilingual seed dictionary between the source and target languages: mutually translated seed entries are extracted from the bilingual text, and the mapping between the word vectors of the two languages is learned from these seed pairs. If the number of seed word pairs is $k$, the seed set is written $\{(ws_i, wt_i)\}$, where $ws_i$ is a source-language word, $wt_i$ is its translation in the target language, and $i \in \{1, 2, \ldots, k\}$ is the index into the seed dictionary. With $W$ constrained to be orthogonal (as in step (4) below), minimizing $\lVert W X_s - Z_s \rVert_F$ over the stacked seed-pair embeddings $X_s$ and $Z_s$ admits the closed-form Procrustes solution $W = UV^{\top}$, where $U \Sigma V^{\top}$ is the singular value decomposition of $Z_s X_s^{\top}$; a sketch follows.
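A minimal sketch of the orthogonal Procrustes solution, assuming the seed-pair embeddings are stacked row-wise with aligned rows; the function name and array layout are illustrative.

```python
# Orthogonal Procrustes sketch: learn the mapping W from seed-pair embeddings.
import numpy as np

def learn_mapping(X_seed, Z_seed):
    """X_seed, Z_seed: (k, d) arrays; row i holds the embeddings of seed
    pair (ws_i, wt_i). Returns the orthogonal W minimizing
    ||X_seed @ W.T - Z_seed||_F (i.e., ||W X_s - Z_s||_F in column form)."""
    U, _, Vt = np.linalg.svd(Z_seed.T @ X_seed)   # SVD of Z_s X_s^T
    return U @ Vt                                 # (d, d) orthogonal matrix

# Usage: map every source word vector into the target space before retrieval.
# X_mapped = X_all @ W.T
```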
(4) Dictionary extraction with cross-language word vectors. Two extraction methods are used for the Chinese-Uyghur bilingual dictionary: the nearest-neighbor method and the CSLS method. After the source- and target-language word vectors are mapped into the same space through the orthogonal matrix $W$, the target-language translation of each source-language word is found by nearest-neighbor search in the shared space: the cosine similarity between $Wx$ and $z$ is computed, and the greater the cosine value, the more likely $z$ is the correct target-language translation of the source word, as shown by Equation (4):

$$\cos(Wx, z) = \frac{(Wx)^{\top} z}{\lVert Wx \rVert \, \lVert z \rVert} \tag{4}$$

However, nearest-neighbor search suffers from the hubness problem in high-dimensional spaces [33]: some points become the nearest neighbors of a large fraction of all points, so the search cannot reliably return the semantically closest word for each query. The CSLS method penalizes the similarity scores of such hub points. For two word vectors $x$ and $z$ mapped into the same space, the CSLS score between them is taken as the final similarity between the two words, as shown by Equation (7):

$$\mathrm{CSLS}(Wx, z) = 2\cos(Wx, z) - r_T(Wx) - r_S(z) \tag{7}$$

where $r_T(Wx)$ is the mean cosine similarity between $Wx$ and its $K$ nearest neighbors in the target space, and $r_S(z)$ is defined symmetrically for $z$ over the mapped source vectors. A retrieval sketch follows.
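A NumPy-only sketch of CSLS retrieval as in Equation (7), following the formulation of Conneau et al. (2017); the default $K = 10$ and the assumption that all vectors are L2-normalized are illustrative choices, not taken from the paper.

```python
# CSLS retrieval sketch: score every (source, target) pair, penalizing hubs.
import numpy as np

def csls_translate(X_mapped, Z, k=10):
    """X_mapped: (n, d) mapped source vectors; Z: (m, d) target vectors.
    Both are assumed L2-normalized, so dot products are cosine similarities.
    Returns the index of the best target translation for each source word.
    Note: the full (n, m) matrix is fine for a sketch but should be computed
    in batches for vocabularies of the size used in this paper."""
    sims = X_mapped @ Z.T                                # (n, m) cosines
    # r_T(Wx): mean similarity of each mapped source word to its k nearest targets
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # (n,)
    # r_S(z): mean similarity of each target word to its k nearest mapped sources
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # (m,)
    csls = 2 * sims - r_src[:, None] - r_tgt[None, :]    # Equation (7)
    return csls.argmax(axis=1)

# Plain nearest-neighbor retrieval (Equation (4)) is simply sims.argmax(axis=1).
```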
4. Experiments
4.1. Experimental Corpus
4.2. Evaluation Indicators
4.3. Experiment Settings
4.4. Experimental Results and Analysis
4.4.1. The Influence of Different Cross-Language Word Embedding Vector Mapping Dimensions
4.4.2. The Influence of the Number of Candidate Words
4.4.3. The Influence of the Context Window Size W
4.4.4. The Influence of the Size of the Training Seed Dictionary
4.4.5. The Effect of Word Frequency
4.4.6. Comparison of Our Method with Other Methods for Building Bilingual Dictionaries
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Ballesteros, L.A. Cross-language retrieval via transitive translation. In Advances in Information Retrieval; Springer: Cham, Switzerland, 2002; pp. 203–234.
- Zou, W.Y.; Socher, R.; Cer, D.; Manning, C.D. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1393–1398.
- Klementiev, A.; Titov, I.; Bhattarai, B. Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012, Mumbai, India, 8–15 December 2012; pp. 1459–1474.
- Zhang, M.; Liu, Y.; Luan, H.; Sun, M. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1959–1970.
- Lauly, S.; Larochelle, H.; Khapra, M.M.; Ravindran, B.; Raykar, V.; Saha, A. An autoencoder approach to learning bilingual word representations. arXiv 2014, arXiv:1402.1454.
- Nassirudin, M.; Purwarianti, A. Indonesian-Japanese term extraction from bilingual corpora using machine learning. In Proceedings of the 2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS), Depok, Indonesia, 10–11 October 2015; pp. 111–116.
- Liang, J.; Jiang, M.-J.; Wu, F.-S.; Deng, X. Neural Network Technology Application and Progress for the Field of Medicine. J. Liaoning Univ. Tradit. Chin. Med. 2011, 34, 89–93.
- Ruder, S.; Vulić, I.; Søgaard, A. A survey of cross-lingual word embedding models. J. Artif. Intell. Res. 2019, 65, 569–631.
- Rapp, R. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, MD, USA, 20–26 June 1999; pp. 519–526.
- Sun, L.; Jin, Y.; Du, L.; Sun, Y. Automatic extraction of bilingual term lexicon from parallel corpora. J. Chin. Inf. Process. 2000, 14, 33–39.
- Mo, Y.; Guo, J.; Mao, C.; Yu, Z.; Niu, Y. A bilingual word alignment method of Vietnamese-Chinese based on deep neural network. J. Shandong Univ. Nat. Sci. 2016, 51, 78–82.
- Luong, M.-T.; Pham, H.; Manning, C.D. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Denver, CO, USA, 31 May–5 June 2015; pp. 151–159.
- Morin, E.; Prochasson, E. Bilingual lexicon extraction from comparable corpora enhanced with parallel corpora. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, Portland, OR, USA, 24 June 2011; pp. 27–34.
- Gouws, S.; Søgaard, A. Simple task-specific bilingual word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA, 31 May–5 June 2015; pp. 1386–1390.
- Mikolov, T.; Le, Q.V.; Sutskever, I. Exploiting similarities among languages for machine translation. arXiv 2013, arXiv:1309.4168.
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26, 3111–3119.
- Wick, M.; Kanani, P.; Pocock, A. Minimally-constrained multilingual embeddings via artificial code-switching. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
- Conneau, A.; Lample, G.; Ranzato, M.A.; Denoyer, L.; Jégou, H. Word translation without parallel data. arXiv 2017, arXiv:1710.04087.
- Barone, A.V.M. Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. arXiv 2016, arXiv:1608.02996.
- Cao, H.; Zhao, T.; Zhang, S.; Meng, Y. A distribution-based model to learn bilingual word embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–17 December 2016; pp. 1818–1827.
- Yu, Q.; Chang, L.; Xu, J.; Liu, T.Y. Research on bilingual term extraction based on Chinese-Uygur medical parallel corpus. J. Inn. Mong. Univ. 2018, 49, 528–533.
- Silva, V.S.; Freitas, A.; Handschuh, S. XTE: Explainable text entailment. arXiv 2020, arXiv:2009.12431.
- Bengio, Y.; Ducharme, R.; Vincent, P.; Jauvin, C. A neural probabilistic language model. J. Mach. Learn. Res. 2003, 3, 1137–1155.
- Mnih, A.; Hinton, G. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; pp. 641–648.
- Lai, S.; Xu, L.; Liu, K.; Zhao, J. Recurrent convolutional neural networks for text classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015.
- Goldberg, Y.; Levy, O. word2vec Explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv 2014, arXiv:1402.3722.
- Chen, Y.Q.; Nixon, M.S.; Damper, R.I. Implementing the k-nearest neighbour rule via a neural network. In Proceedings of the ICNN'95 International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; pp. 136–140.
- Levy, O.; Goldberg, Y.; Dagan, I. Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 2015, 3, 211–225.
- Alipour, G.; Bagherzadeh Mohasefi, J.; Feizi-Derakhshi, M.-R. Learning bilingual word embedding mappings with similar words in related languages using GAN. Appl. Artif. Intell. 2022, 10, 1–20.
- Hossny, A.H.; Mitchell, L.; Lothian, N.; Osborne, G. Feature selection methods for event detection in Twitter: A text mining approach. Soc. Netw. Anal. Min. 2020, 10, 61.
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
- Artetxe, M.; Labaka, G.; Agirre, E. Bilingual lexicon induction through unsupervised machine translation. arXiv 2019, arXiv:1907.10761.
- Shigeto, Y.; Suzuki, I.; Hara, K.; Shimbo, M.; Matsumoto, Y. Ridge regression, hubness, and zero-shot learning. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Porto, Portugal, 7–11 September 2015; pp. 135–151.
- Joulin, A.; Bojanowski, P.; Mikolov, T.; Jégou, H.; Grave, E. Loss in translation: Learning bilingual word mapping with a retrieval criterion. arXiv 2018, arXiv:1804.07745.
- Zhang, M.; Xu, K.; Kawarabayashi, K.-I.; Jegelka, S.; Boyd-Graber, J. Are girls Neko or Shōjo? Cross-lingual alignment of non-isomorphic embeddings with iterative normalization. arXiv 2019, arXiv:1906.01622.
| Related Words (Chinese 地球, Earth) | Cosine Distance | Related Words (Uyghur ھايۋان, Animal) | Cosine Distance |
|---|---|---|---|
| 月球 (Moon) | 0.8317 | دىنازاۋۇر (Dinosaur) | 0.8912 |
| 行星 (Planet) | 0.8098 | قۇش (Bird) | 0.8768 |
| 水星 (Mercury) | 0.7975 | بېلىق (Fish) | 0.8519 |
| 陆地 (Land) | 0.7653 | ئۆسۈملۈك (Plant) | 0.8348 |
| 太阳 (Sun) | 0.7571 | پىل (Elephant) | 0.8301 |
| Language Type | Chinese | Uyghur |
|---|---|---|
| Parallel documents | 4000 | 4000 |
| Sentences | 55,457 | 58,124 |
| Vocabulary | 59,482 | 85,815 |
| Chinese | Uyghur | Chinese | Uyghur |
|---|---|---|---|
| 矩阵 (matrix) | ماترىتسا | 讲台 (platform) | سەھنە |
| 数学家 (mathematician) | ماتېماتىك | 尖锐 (sharp) | ئۆتكۈر |
| 空气 (air) | ھاۋا | 淋巴 (lymph) | لىمفا |
| 复习 (review) | تەكرار | 淀粉 (starch) | كراخمال |
| 先天性 (congenital) | تۇغما | 放大器 (amplifier) | كۈچەيتكۈچ |
| 芬兰 (Finland) | فىنلاندىيە | 流通 (circulation) | ئوبوروت |
| 好吃 (delicious) | يېيىشلىك | 行政 (administration) | مەمۇرىي |
| 药片 (tablet) | تابلېتكا | 做法 (practice) | ئۇسۇل |
| 合作者 (collaborator) | ھەمكارلاشقۇچى | 指导员 (instructor) | يېتەكچى |
| 流畅 (fluent) | راۋان | 衣服 (clothes) | كىيىم |
| Vector Dimension | Nearest Neighbor (%) | CSLS (%) |
|---|---|---|
| 100 | 25.78 | 28.64 |
| 200 | 27.38 | 31.72 |
| 300 | 31.45 | 36.97 |
| 400 | 30.92 | 33.18 |
| 500 | 27.71 | 29.15 |
| Extraction Method | P@1 | P@5 | P@10 |
|---|---|---|---|
| Nearest neighbor (%) | 31.45 | 43.55 | 51.17 |
| CSLS (%) | 36.97 | 47.27 | 65.06 |
| Source Word (Correct) | Target Word | Reference Target Word | Source Word (Incorrect) | Target Word | Reference Target Word |
|---|---|---|---|---|---|
| 古代 | قەدىمكى | قەدىمكى | 山坡 | ئېگىز | قىيالىق |
| 敏感 | سەزگۈر | سەزگۈر | 包装 | تاۋار | ئورام |
| 高兴 | خۇشاللىق | خۇشاللىق | 空气 | ھاۋادىن | ھاۋا |
| 感染 | يۇقۇملىنىش | يۇقۇملىنىش | 生物 | جانلىقلار | جانلىق |
| 地质学家 | گېئولوگ | گېئولوگ | 方法 | تەدبىر | ئۇسۇل |
| Source Word (Correct) | Candidate Target Words (Top 5) | Reference Target Word | Source Word (Incorrect) | Candidate Target Words (Top 5) | Reference Target Word |
|---|---|---|---|---|---|
| 乐观 | كەيپىياتى ئۈمىدۋار ئۆگىنىش ئۆگىنىشكە بالىلارنى | ئۈمىدۋار | 态度 | خۇلق پىسخولوگلار ئەمەلىيەتچانلىق پوزىتسىيەسى ئەمەلىيەتتىن | پوزىتسىيە |
| 单位 | بىرلىك بىرلىكى بىرلىكىنى نىسبەتلا بىرلىكتىن | بىرلىك | 空白 | رېئاللىقتا كۆرۈنۈشنى ئۇنتۇلغۇسىز رېئاللىق تۇيغۇدا | بوشلۇق |
| 山坡 | ئېگىز تاغدىكى قىيالىق تۇپرىقى قۇملۇق | قىيالىق | 鸟儿 | قۇشقاچ شىر يىرتقۇچ شۈيخۇا ئۆردەك | قۇش |
| 农耕 | تېرىقچىلىق مۇداپىئەسى ئەمەلىيىتى شۇغۇللىنىش كەسىپلەر | تېرىقچىلىق | 尖锐 | تىرناقلىرى قوللىرىغا قۇشلارغا جاغ تاجىسى | ئۆتكۈر |
| 司机 | يېنىڭىزدىن شوپۇر بولسىڭىز ئولتۇرۇش تاكسىغا | شوپۇر | 器官 | يېتىلمىگەن ھۈجەيرىلىرىنىڭ ئەزالىرىنىڭ ھۈجەيرىلىرى نېرۋا | ئەزالارى |
| Source Word (Correct) | Candidate Target Words (Top 10) | Reference Target Word | Source Word (Incorrect) | Candidate Target Words (Top 10) | Reference Target Word |
|---|---|---|---|---|---|
| 态度 | پوزىتسىيە پىسخولوگلار ئەمەلىيەتچانلىق پوزىتسىيەسى ئەمەلىيەتتىن قابىلىيەت كەيپىياتى پوزىتسىيە يېزىقچىلىق ئۆگىنىشىگە | پوزىتسىيە | 人类 | ئىنسانلار ئىنسانلارنىڭ تەرەققىياتنى تەرەققىياتىدىكى جەمئىيەت مۇساپىسىدە ئىزدىنىشى مەدەنىيىتىنىڭ تەرەققىياتى يارىتىشى | ئىنسان |
| 尖锐 | تىرناقلىرى قوللىرىغا قۇشلارغا جاغ تاجىسى بۈرگىنىڭ ئەگمە ئەگمىسى ئۆتكۈر تىرنىقى | ئۆتكۈر | 米饭 | شامنىڭ توڭلاتقۇدىن سېتىۋالغاندىن دۇخوپكا ئائىلىلەردە توڭلاتقۇ ئىسسىتقىلى ساقلىغىلى قازاندىكى سۇنىلا | كۈچ |
| 巧克力 | ئارىلاشتۇرۇپ شاكىلات پاراشوكى سېرىقمايدىن ئىچىملىك بولكا ماروژنى سوپۇننىڭ پاختا شاكىلاتنىڭ | شاكىلات | 老虎 | تۈلكە شىر يىرتقۇچ قۇش تۆگىقۇش بۈركۈت قۇشلار جىگدىچى چۈمۈلىخور بۈركۈتنى | يولۋاس |
| 渔民 | بېلىقچىلار ئېلىمىز تاشپاقىسى ئىزدىشى بورانقۇش بېلىق قۇشى جەلپ دېڭىزدا دېلفىنى | بېلىقچىلار | 墨鱼 | ئويستېر بېلىقلاردۇر سەكسەنپۇت بېلىقلارمۇ پالانچە بۈركۈتنى ئۈزمىگەچكە قىسقۇچپاقا تالمۇ ئانىسىغا | سىياھبېلىق |
| 工业 | ئېتانولنى سانائەتنىڭ ئىشلەپچىقىرىشىدا سانائەت سانائىتىدە كەسپىن تۇغۇندىسى مەھسۇلاتلىرىنىڭ تەرەققىيات سانائىتىنىڭ | سانائەت | 波士顿 | كولۇمبىيە فاكۇلتېتى 309 زۇڭتۇڭى مۇھەررىرى فېدراتسىيە گىرېس ئىمزالاپ كونۋېي داۋىس | بوستون |
| Word Frequency | Nearest Neighbor (%) | CSLS (%) |
|---|---|---|
| Low frequency | 16.23 | 21.89 |
| Intermediate frequency | 23.21 | 34.89 |
| High frequency | 34.43 | 57.38 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).