Recognizing Textual Inference in Mongolian Bar Exam Questions

Abstract: This paper examines how to apply deep learning techniques to Mongolian bar exam questions. Several approaches that utilize eight different fine-tuned transformer models were demonstrated for recognizing textual inference in Mongolian bar exam questions. Among the eight models, the fine-tuned bert-base-multilingual-cased obtained the best accuracy of 0.7619. It recognized “contradiction” with a precision of 0.7500, a recall of 0.7857, and an F1 score of 0.7674, and it recognized “entailment” with a precision of 0.7750, a recall of 0.7381, and an F1 score of 0.7561. Moreover, the fine-tuned bert-large-mongolian-uncased showed balanced performance in recognizing textual inference in Mongolian bar exam questions, achieving a precision of 0.7561, a recall of 0.7381, and an F1 score of 0.7470 for recognizing “contradiction”.


Introduction
According to the World Bank's Worldwide Governance Indicators, Mongolia has a relatively low quality of governance. Its government effectiveness score is 34.91%, regulatory quality is 42.45%, and rule of law is 45.75% [1]. Additionally, during a speech at the intermediate evaluation discussion panel of the "New Development Mid-term Action Plan", the Chief Cabinet Secretary of the Mongolian government stated that "Over the past 20 years in Mongolia, the government, ministries, and their agencies have issued 517 short or long-term development plans and strategy papers. Currently, 203 of these documents are effective, though many of them overlap or contradict each other significantly. Only 132 of them are enforced, which has led to less than 26% efficiency" [2]. Given these contradictions and overlaps, using artificial intelligence (AI) to analyze Mongolian government documents is critical.
On the other hand, at the organizational level, analyzing contracts or legal documents and making decisions at the management level is an important task. The demand is increasing as researchers, lawyers, executives, and managers examine more legal documents and expect faster and more accurate results. Furthermore, expert decision-making computer systems are in demand among professionals and managers who want to avoid problems and disputes. It is vital to make professional and accurate decisions backed by achievements in AI. Thus, our goal is to construct a decision support system (DSS) to help managers solve complex problems in Mongolia. Such systems have rarely been implemented in Mongolia. As far as we know, no mature deep learning method yet exists for Mongolian legal documents.
Moreover, a comprehensive analysis should be undertaken to deal with Mongolian legal documents instead of deploying a language model as it exists in English. In 1992, Mongolia adopted civil law, thus discarding the socialist legal system, which had been in place for the past sixty-eight years. However, English-speaking countries, including Canada and the United States of America, follow case law. Depending on the legal system, jurisdiction procedures vary. Therefore, language models developed for case law cannot be immediately deployed to the Mongolian legal domain, which is based on civil law.

Motivations of This Research
The increasing demands in Mongolia and the current situation mentioned in Section 1 have prompted us to undertake extensive research to develop deep learning methods to analyze Mongolian legal documents. However, as described in Section 2, the existing research in Mongolian natural language processing (NLP) has fallen behind the state of the art and leaves notable gaps in existing knowledge. Although there is a lack of adequate NLP tools for the Mongolian language, analyzing Mongolian legal documents with rapidly emerging deep learning methods is an opportunity to establish a starting point for the development of advanced systems in the Mongolian legal domain. Therefore, this research is vital to solving real-world problems in Mongolia, and it has a potential impact on the Mongolian legal domain.

Scope of the Present Paper
This paper focuses on modern Mongolian legal documents written in the Cyrillic script. It does not cover legal documents from the Inner Mongolia Autonomous Region written in traditional Mongolian script.
The Mongolian language is spoken by people in Mongolia, as well as by ethnic Mongols living in China and Russia. Throughout history, Mongols have created and used various writing scripts such as traditional Mongolian, Phags-pa, Horizontal square, Soyombo, Todo, Latin script, and even phonetic writing with Chinese characters [3]. In 1946, a language reform took place, and the Cyrillic script was adopted as the official script for the Mongolian language. This adaptation included two additional characters.
The spelling of modern Mongolian in Cyrillic script was based on the pronunciation of the Khalkha dialect of the largest Mongolian ethnic group [4]. This change was significant because the traditional script preserved the old Mongolian language, while the modern Mongolian in Cyrillic script reflected the pronunciations in modern dialects. Although the spoken language changed as the Mongolian language evolved, the spelling remained unchanged in the traditional Mongolian script. As a result, there are notable differences between the traditional script and the Cyrillic script documents. The Mongolian language is agglutinative, meaning that inflectional suffixes such as plural, case, reflexive, voice, tense, aspect, and mood suffixes are concatenated with the stem. Therefore, stemming is necessary for Cyrillic script. It is important to note that modern Mongolian in Cyrillic script is case-sensitive, while the traditional script is not. Such distinctions affect the use of NLP tools and language resources, as they are not the same for modern and traditional Mongolian.

Contributions of the Present Paper
To achieve the research goal, this paper examines deep learning techniques for Mongolian legal documents. Particularly, this paper discusses the authors' achievements in recognizing textual inference in Mongolian bar exam questions. We believe that recognizing textual inference is one of the important tasks in our system.
The contributions of this paper can be summarized as follows:
• The creation of a textual inference dataset from Mongolian bar exam questions;
• A pioneering trial to demonstrate fine-tuned transformer models for recognizing textual inference in Mongolian bar exam questions;
• The development of a demo system that can be used to recognize textual inference in Mongolian legal documents by utilizing the above contributions.
Section 2 introduces related work. Existing deep learning tasks and language models for the Mongolian language are also briefly reviewed in Section 2. However, research related to (1) the NLP of texts in the traditional Mongolian script and (2) the analysis of legal documents written in the traditional Mongolian script is not included there. The task of recognizing textual inference for the Mongolian language is then explained in Section 3. The experimental results are explained in Section 4. Finally, concluding remarks are given in Section 5.

Related Work
Deep learning has proven to deliver high performance in a variety of fields [5] and has already achieved near-human performance in various NLP tasks in English [6,7]. In recent years, pretrained language models such as BERT [8] and the text-to-text transfer transformer (T5) [9] have provided notable achievements on large-scale general natural language inference (NLI) datasets such as the Stanford Natural Language Inference corpus [10] and the Multi-Genre Natural Language Inference corpus [11]. However, existing Mongolian legal datasets are scarce, so the performance drops when directly adapting the existing pretrained models. OpenAI introduced ChatGPT models, which communicate conversationally to answer questions [12]. However, as far as we know, no research has been conducted to evaluate the performance of ChatGPT in the Mongolian language. In general, deep learning approaches to the Mongolian language lag behind due to a lack of research in Mongolian NLP. AI implementation in the Mongolian legal domain is very limited. The following sections summarize the relevant work in recognizing textual inference in the legal domain and Mongolian NLP.

Recognizing Textual Inference in the Legal Domain
Under the umbrella of the NLI field, various approaches to recognizing textual inference have been applied to the legal domain. Recognizing textual inference in the legal domain is the task of predicting entailment between a given premise, i.e., a law or an article from a law, and a given hypothesis, which is a statement of a legal question.
At the Legal Textual Entailment task of the Competition on Legal Information Extraction/Entailment (COLIEE) [13], various methods, including an ensemble of several BERT fine-tuning runs [14][15][16], an ensemble with data augmentation [14][15][16], positive and negative sampling [14,15], an ensemble of predicate-argument structures, and a rule-based method [16], were demonstrated for recognizing inference between a legal question and Japanese civil law articles. Bui et al. [17] utilized Fine-tuned LAnguage Net (FLAN) large language models (LLMs) such as FLAN-T5 [18], FLAN-UL2 (unifying language learning) [19], and FLAN-Alpaca-XXL [20] along with data augmentation. They manually selected 56 prompts from the PromptSource library and used them in zero-shot prompting on the above LLMs for predicting the inference of given Japanese civil law problem-article pairs. Onaga et al. [21] extended their previous model [16] and integrated a RoBERTa-based model, LUKE-Japanese [22], as a Japanese named entity recognition (NER) model for dealing with anonymized personal names. The COLIEE-2023 training data contain 996 pairs of a legal question and Japanese civil law articles. The performance of each submitted method was evaluated on test data that contained 101 pairs of a legal question from the most recent Japanese bar exam. However, these COLIEE methods rely on approaches specific to the Japanese language, which cannot be directly applied to recognizing textual inference in the Mongolian language.
Moreover, some methods have been proposed to identify the paragraph(s) of existing cases in Canadian case law that entail a given new case. Nguyen et al. [23] utilized the sequence-to-sequence model MonoT5 [24] and fine-tuned it with hard negation mining and ensemble methods, which search hyperparameters to find the optimal weight for each checkpoint. However, such models developed for the case law system cannot be immediately applied to recognizing textual inference in Mongolian bar exam questions due to jurisdictional differences.

Mongolian NLP
This section summarizes research on NLP in the Mongolian language. Although deep learning has become attractive in recent years, little research in deep learning has been conducted on the Mongolian language. Ariunaa and Munkhjargal utilized a recurrent neural network for the sentiment analysis of Mongolian tweets [25]. Battumur et al. trained BERT to correct Mongolian spelling errors [26]. Dashdorj et al. utilized BiLSTM and a convolutional neural network to classify public complaints addressed to government agencies [27].
The following is a brief overview of traditional NLP approaches for the Mongolian language. Choi and Tsend determined the appropriate size of a Mongolian general corpus using Heaps' law and the type-token ratio. They concluded that an appropriate size for a Mongolian general corpus is 39-42 million tokens [28]. To tag Mongolian parts of speech (POS), Lkhagvasuren et al. utilized a neural network model with a multilayer perceptron [29]. Jaimai and Chimeddorj utilized a hidden Markov model with a bigram [30]. Ivanov et al. utilized a minimum edit distance and a word n-gram with a back-off to identify Mongolian spelling errors and suggest alternative correct words [31]. Batsuren et al. built a Mongolian WordNet [32]. For Mongolian word sense disambiguation, Bataa and Altangerel utilized a "one sense per collocation" algorithm [33]. Jaimai et al. built a Mongolian morphological analyzer [34] utilizing the program PC-KIMMO. Dulamragchaa et al. [35] built Mongolian phonological rules and a lexicon for PC-KIMMO. Chagnaa and Adiyatseren improved the Mongolian two-level rules for PC-KIMMO [36]. Munkhjargal et al. built a finite state Mongolian morphological transducer [37]. Enkhbayar et al. developed a stemming method for Mongolian nouns and verbs [38]. Khaltar and Fujii developed a Mongolian lemmatization method [39]. Nyandag et al. tested a keyword extraction method utilizing cosine similarity and TF-IDF [40]. Munkhjargal et al. created a Mongolian NER system [41]. Khaltar et al. [42,43] created a method for extracting loan words from Mongolian corpora. Lkhagvasuren and Rentsendorj built an open information extraction system that extracts relation tuples from Mongolian text [44]. Damiran and Altangerel tested (1) a decision tree algorithm [45] to classify Mongolian novels according to their authors and (2) a naive Bayesian classifier to classify Mongolian news articles [46]. Ehara et al. developed a transfer-based Mongolian-to-Japanese machine translation system [47] using ChaSen, which is a Japanese morphological analyzer. Enkhbayar et al. examined the degree of ambiguity in the Japanese-to-Mongolian translation of functional expressions [48].
Mongolian NLP research is limited to the work mentioned above, and none of it has considered the analysis of Mongolian legal documents. We believe that analyzing the Mongolian bar exam questions is a good opportunity to test deep learning methods and to establish a starting point for the development of advanced systems in the Mongolian legal domain.

Existing Deep Learning-Based Tasks and Language Models for the Mongolian Language
Since BERT has been shown to perform well, Erdene-Ochir released Mongolian BERT models [49] trained on approximately 500 million words extracted from Mongolian media and a Mongolian Wikipedia dump. Yadamsuren released Mongolian RoBERTa base [50], Mongolian RoBERTa large [51], Mongolian ALBERT [52], and Mongolian GPT2 [53], which were trained on the same dataset as the Mongolian BERT models.
Gunchinish [54] presented text classification tasks that classify Mongolian news articles by fine-tuning the bert-base-mongolian-cased [55] and bert-large-mongolian-uncased [56] models. Bataa [57] fine-tuned the bert-base-mongolian-cased [55] model for the Mongolian NER task. Conneau et al. trained a transformer-based masked language model on one hundred languages and included Mongolian CommonCrawl data [58]. However, they did not disclose the results for the Mongolian language. Google released an un-normalized multilingual model named BERT-base-multilingual-cased, which does not perform any normalization on the input and which also covers the Mongolian language [59].
Few NLP tasks in the Mongolian language have utilized masked language models such as BERT. Thus, this paper aims to demonstrate an NLI task for recognizing textual inference in Mongolian bar exam questions.

Mongolian Legal Documents
Because our target is the Mongolian legal domain, this section summarizes the existing Mongolian legal documents. Among the Mongolian 75K news articles described in Section 2.4, 8285 news articles are in the category "legal", which are 10.9% of the news articles. In addition, many Mongolian legal documents have been made publicly available in human-readable digital formats such as HTML, PDF, or Microsoft Word documents. (13) decisions by the heads of the organizations appointed by the Parliament; (14) the mayor of Ulaanbaatar's and provinces governors' orders; and (15) the General Judge Council's resolutions, which can be found on the public domain website "Unified Legal Information System" [74] of the National Legal Institute of Mongolia. As of 12 January 2024, there were 12,571 Mongolian legal documents. Moreover, (1) 150,375 criminal court orders, including 121,624 orders of a first instance, 24,142 orders of an appellate stage, and 4632 orders of a control stage; (2) 289,162 civil court orders, including 242,461 orders of a first instance, 33,660 orders of an appellate stage, and 13,041 orders of a control stage; and (3) 31,345 administrative court orders, including 18,174 orders of a first instance, 9150 orders of an appellate stage, and 4025 orders of a control stage, can be found in the database of Mongolian court orders [75].
However, these legal documents have not been analyzed, which is mainly due to the lack of NLP tools that can handle Mongolian legal documents. Further computational analysis is therefore needed. It is a labor-intensive task to convert the above data to a machine-readable format and build gold standard corpora. On the other hand, each year the Mongolian Bar Association [76] publishes a book that contains Mongolian bar exam questions and the corresponding correct answers. The number of questions was 4406 in 2016, 4174 in 2019, 4500 in 2021, and 4598 in 2022. Because no properly labeled NLI dataset is available in the Mongolian legal domain, the authors' first contribution in this paper is the creation of an NLI dataset from Mongolian bar exam questions, which will be explained in Section 3.1.

NLI of Mongolian Bar Exam Questions
This section discusses the authors' approach to recognizing textual inference in Mongolian bar exam questions. The objective of recognizing textual inference in Mongolian bar exam questions is similar to NLI: it predicts entailment between a given premise, i.e., a law or an article from a law, and a given hypothesis, which is a statement of a legal question. However, to respect the exactness required in the legal domain, we do not use a "neutral" label. Thus, if the premise entails the hypothesis, the label is "True" (entailment); otherwise, the label is "False" (contradiction).

An NLI Dataset of Mongolian Bar Exam Questions
In this research, an NLI dataset from Mongolian bar exam questions was prepared for recognizing textual inference in the Mongolian language. The books of the Mongolian bar exam questions have an average of 4500 questions with an enormous amount of content that requires expert knowledge. Although the task was labor-intensive and time-consuming, an NLI dataset was compiled manually by selecting 829 questions related to Mongolian civil law from the Mongolian bar exam questions. To align with COLIEE, a legal entailment competition that has been held for over ten years, all 829 questions were chosen from civil law rather than from the other categories of the Mongolian bar exams, such as the constitution, human rights, criminal code, general administration, education, higher education, the central bank, health, health insurance, and taxation. The training data of the latest COLIEE-2023 contain 996 pairs of legal questions and Japanese civil law articles. A Mongolian bar exam question is a multiple choice test with four answers, including one correct answer and three incorrect answers. Each question was substituted with the corresponding law article by human experts in the Mongolian legal domain, and it was utilized as a "premise". As the "hypothesis", the 415 correct answers were utilized with the label "True", and the 414 incorrect answers were utilized with the label "False". After checking and understanding the contents of the four answers, human experts selected the corresponding articles in the Mongolian laws. The Civil Code articles are usually very detailed and need to be read carefully, one by one. Examples of an answer (hypothesis) to a Mongolian bar exam question and the corresponding article (premise) from Mongolian civil law are shown in Table 1. Please refer to Table A1 for more examples. This dataset was used in all experiments.
Table 1 (surviving excerpt). (243.2. The seller shall be obligated to provide the buyer with accurate and complete information about the designation, usage characteristics, storing, using, and transporting conditions and procedures, warranty and guarantee period, and the manufacturer of the goods sold.*) Label: False. * Unofficial English translation by the authors.
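The pair structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `NLIPair`, the helper `make_pair`, and the placeholder strings are hypothetical and do not come from the authors' data files.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NLIPair:
    """One premise-hypothesis pair (hypothetical container, not the authors' format)."""
    premise: str      # law article selected by a legal expert
    hypothesis: str   # one answer option of a bar exam question
    label: str        # "True" (entailment) or "False" (contradiction)


def make_pair(article: str, answer: str, is_correct: bool) -> NLIPair:
    # A correct answer is labeled "True"; an incorrect answer is labeled "False".
    return NLIPair(article, answer, "True" if is_correct else "False")


def label_counts(pairs):
    # Count how many pairs carry each label, e.g. to check class balance.
    return Counter(p.label for p in pairs)


# Placeholder content only; the real texts are Mongolian civil law articles and answers.
pairs = [
    make_pair("<article text>", "<correct answer>", True),
    make_pair("<article text>", "<incorrect answer>", False),
]
print(label_counts(pairs))
```

Applied to the full dataset, a count of this kind should reproduce the near-even balance reported above (415 "True" and 414 "False").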

Language Modeling for Predicting NLI Labels in Mongolian Bar Exam Questions
Transformer-based language models were utilized for recognizing textual inference in Mongolian bar exam questions and predicting NLI labels. In other words, pretrained transformer models were fine-tuned in the Mongolian legal domain. The NLI was treated as a classification problem, which aims to recognize textual inference of hypothesis-premise pairs and label them as "entailment" or "contradiction". We followed the standard practice for sentence pair tasks as in Devlin et al. [8]. Thus, the "premise" and the "hypothesis" were concatenated with a separator token [SEP] between them, the "classification" token [CLS] was prepended, and the resulting sequence was input into the transformer models.
First, experiments were conducted to predict textual inference in Mongolian bar exam questions using the existing pretrained models, which are explained in Section 3.3. Then, the existing pretrained models were fine-tuned for recognizing textual inference in Mongolian bar exam questions, and the achievements are discussed in Section 3.4. Detailed fine-tuning settings and hyperparameters are introduced in Section 3.4.1.
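The sentence pair layout can be illustrated with a minimal sketch. It assumes the usual BERT convention of a trailing [SEP] and segment ids of 0 for the premise and 1 for the hypothesis. The function name `build_pair_input` and the pre-split token lists are hypothetical, and real models would apply their own subword tokenizer first.

```python
def build_pair_input(premise_tokens, hypothesis_tokens):
    """Lay out a pair as [CLS] premise [SEP] hypothesis [SEP] with segment ids."""
    tokens = ["[CLS]"] + list(premise_tokens) + ["[SEP]"] + list(hypothesis_tokens) + ["[SEP]"]
    # Segment 0 covers [CLS], the premise, and the first [SEP];
    # segment 1 covers the hypothesis and the final [SEP].
    segment_ids = [0] * (len(premise_tokens) + 2) + [1] * (len(hypothesis_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input(["article", "text"], ["answer"])
print(tokens)
print(segment_ids)
```

The classifier then reads the final hidden state at the [CLS] position to decide between "entailment" and "contradiction".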

The Performances of the Existing Pretrained Models in Recognizing Textual Inference in Mongolian Bar Exam Questions
Experiments were conducted to predict textual inference in the NLI dataset of Mongolian bar exam questions (explained in Section 3.1) using the existing pretrained models, including (1) mongolian-roberta-base [50]; (2) mongolian-roberta-large [51]; (3) albert-mongolian [52]; (4) bert-base-mongolian-cased [55]; (5) bert-large-mongolian-uncased [56]; (6) bert-base-multilingual-cased [59]; (7) bert-large-mongolian-cased [77]; and (8) bert-base-mongolian-uncased [78]. Training was run only for the top layers so that the representations learned by the existing pretrained models could be used to extract features from new samples, i.e., the NLI dataset of Mongolian bar exam questions. A new classifier was added to label pairs of Mongolian bar exam questions as "entailment" or "contradiction". A total of 829 pairs of Mongolian bar exam questions were split through random shuffling with a training-validation-test split ratio of 80:10:10. During these experiments, the above pretrained models were used with the default settings, a training batch size of 16, and five epochs. The performance outcomes of the textual inference tasks utilizing the existing pretrained models are shown in Table 2. The best results are shown in bold text.
The bert-large-mongolian-uncased [56] obtained the highest average accuracy of 0.7381, whereas the mongolian-roberta-large [51] had the lowest average accuracy of 0.5357 on the unseen test data. Although the bert-base-mongolian-uncased [78] obtained an F1 score of 0.7294 in recognizing "entailment", the bert-large-mongolian-uncased [56] obtained an F1 score of 0.7250 in recognizing "entailment" and an F1 score of 0.7500 in recognizing "contradiction". In general, the existing pretrained bert-large-mongolian-uncased [56] model achieved the best performance in recognizing textual inference on the unseen test data.
As illustrated in Figure 1a-h, the training and validation accuracy improved incrementally after each epoch. In most cases, the validation accuracy was lower than the training accuracy.
The performance outcomes of the fine-tuned models for recognizing textual inference in Mongolian bar exam questions are described in the next section. Detailed fine-tuning settings and hyperparameters are also introduced there.
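A shuffled 80:10:10 split can be sketched as follows. The rounding rule (truncate the training and validation sizes, give the remainder to the test set) and the seed value are assumptions made for this illustration; the paper does not state either, and the function name `split_80_10_10` is hypothetical.

```python
import random


def split_80_10_10(items, seed=42):
    """Shuffle and split into training, validation, and test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded shuffle for reproducibility
    n_train = int(0.8 * len(items))      # truncation is an assumed rounding rule
    n_val = int(0.1 * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # the remainder goes to the test set
    return train, val, test


train, val, test = split_80_10_10(range(829))
print(len(train), len(val), len(test))  # 663 82 84
```

Under a different rounding rule the subset sizes would shift by one or two pairs, so these sizes should not be read as the authors' exact distribution.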

Fine-Tuning Pretrained Transformer Models in Recognizing Textual Inference in Mongolian Bar Exam Questions
The existing pretrained models introduced in Section 3.3 were fine-tuned for recognizing textual inference in Mongolian bar exam questions by unfreezing and retraining. Legal texts have a specific vocabulary and specific characteristics. Thus, as discussed in Section 3.3, the performance of the existing pretrained models on Mongolian legal texts was limited. Fine-tuning allows us to adapt the feature representations in the existing pretrained models to the new samples, i.e., the NLI dataset of Mongolian bar exam questions, making the existing pretrained models more applicable to the Mongolian legal NLI task. The setup and dataset are described below.

Setup
The existing pretrained models were unfrozen and retrained with the following hyperparameters: a batch size of 16, a learning rate of 1 × 10⁻⁵, a dropout rate of 0.3, a "softmax" activation function, and an Adam optimizer. Other experimental settings were the same as in the experiments in Section 3.3. The classifier was also the same as in Section 3.3 and labeled pairs of Mongolian bar exam questions as "entailment" or "contradiction". All training was run for five epochs. Please refer to Table 3 for more details about each model.
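For reference, the stated hyperparameters can be gathered in one place, together with the softmax function that turns the two classifier outputs into class probabilities. This is a framework-free sketch: the names `FineTuneConfig`, `softmax`, and `steps_per_epoch` are hypothetical, and the training-set size of 663 used in the example is an assumption rather than a figure from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as stated in the text (field names are illustrative)."""
    batch_size: int = 16
    learning_rate: float = 1e-5
    dropout: float = 0.3
    epochs: int = 5
    optimizer: str = "adam"
    activation: str = "softmax"


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steps_per_epoch(n_train, batch_size):
    # Number of optimizer steps needed to see every training pair once.
    return math.ceil(n_train / batch_size)


cfg = FineTuneConfig()
probs = softmax([2.0, -1.0])               # two logits: one per class
n_steps = steps_per_epoch(663, cfg.batch_size)
print(cfg, probs, n_steps)
```

With two output classes, the predicted label is simply the class with the larger softmax probability.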

Datasets
The same training, validation, and test data of 829 pairs of Mongolian bar exam questions, split with a training-validation-test ratio of 80:10:10, were used in all experiments. The experimental data distribution of the Mongolian bar exam questions is shown in Table 4. The maximum number of tokens in the pairs of Mongolian bar exam questions, as determined by the tokenizer of each pretrained model, is shown in Table 5.
Table 5. The length of the maximum token sequence.

Experimental Results of Recognizing Textual Inference in Mongolian Bar Exam Questions
The performance outcomes of the textual inference tasks utilizing the fine-tuned models are shown in Figures 2-9. Among the eight different models, the fine-tuned bert-base-multilingual-cased achieved the highest average accuracy of 0.7619, the best F1 score of 0.7561 in recognizing "entailment", and the best F1 score of 0.7674 in recognizing "contradiction" on the unseen test data. The highest recall of 0.8571 (see Figure 8a) was obtained in recognizing "entailment" using the fine-tuned bert-large-mongolian-cased. Moreover, the highest precision of 0.7857 (see Figure 8a) was obtained in recognizing "contradiction" using the fine-tuned bert-large-mongolian-cased. By comparison, as shown in Figure 6a, the fine-tuned bert-large-mongolian-uncased demonstrated balanced performance in recognizing textual inference in Mongolian bar exam questions, achieving a precision of 0.7561, a recall of 0.7381, and an F1 score of 0.7470 for recognizing "contradiction". It also achieved a precision of 0.7442, a recall of 0.7619, and an F1 score of 0.7529 for recognizing "entailment". In contrast, the fine-tuned mongolian-roberta-large was the least successful, with the lowest average accuracy of 0.5833 on the unseen test data. As shown in Figure 3a, it obtained the lowest precision, recall, and F1 score in both the "contradiction" and "entailment" categories: a precision of 0.5854, a recall of 0.5714, and an F1 score of 0.5783 for "contradiction", as well as a precision of 0.5814, a recall of 0.5952, and an F1 score of 0.5882 for "entailment".
An overall comparison of the different models in recognizing textual inference in Mongolian bar exam questions is shown in Table 6 in terms of accuracy, macro average F1 score, and weighted average F1 score. Table 6 also compares the performance of the fine-tuned models against the existing pretrained models. The fine-tuned bert-base-multilingual-cased [59] model showed an average accuracy of 0.7619, a macro average F1 score of 0.7618, and a weighted average F1 score of 0.7618. In general, the fine-tuned bert-base-multilingual-cased model achieved the best performance in recognizing textual inference in Mongolian bar exam questions. As illustrated in Figures 2b, 3b, 4b, 5b, 6b, 7b, 8b, and 9b, the training and validation accuracy improved incrementally after each epoch. The training accuracy ranged from 0.7363 to 0.9832, while the validation accuracy ranged from 0.5000 to 0.8125.
The confusion matrices of the fine-tuned models on unseen test data are shown in Figure 10. The numbers inside brackets represent the percentage within the total test data. As illustrated in Figure 10f, the fine-tuned bert-base-multilingual-cased incorrectly labeled 21.42% of the "contradiction" pairs in the test data as "entailment". Also, as shown in Figure 10g, the fine-tuned bert-large-mongolian-cased incorrectly labeled 14.28% of the "entailment" pairs as "contradiction". Overall, the fine-tuned bert-base-multilingual-cased showed the best performance in recognizing textual inference in Mongolian bar exam questions.
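For readers less familiar with these metrics, the standard per-class definitions are restated below. The symbols TP_c, FP_c, and FN_c (true positives, false positives, and false negatives for class c) and N (the number of test pairs) are introduced here for illustration and do not appear in the original text.

```latex
\begin{align}
\mathrm{Precision}_c &= \frac{TP_c}{TP_c + FP_c}, &
\mathrm{Recall}_c &= \frac{TP_c}{TP_c + FN_c}, \\
F1_c &= \frac{2\,\mathrm{Precision}_c\,\mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}, &
\mathrm{Accuracy} &= \frac{\sum_c TP_c}{N}.
\end{align}
```

As a quick check against the reported figures, the fine-tuned bert-base-multilingual-cased has a "contradiction" recall of 0.7857 and, as stated in the Conclusions, a precision of 0.7500, which gives F1 = 2(0.7500)(0.7857)/(0.7500 + 0.7857) ≈ 0.7674, matching the reported F1 score.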
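The reported metrics of the fine-tuned bert-base-multilingual-cased can be recomputed from a confusion matrix, as sketched below. The counts are not taken from Figure 10; they are inferred under the assumption that the test set holds 42 pairs per class, which is consistent with the reported 21.42% (about 9 of 42) and with the reported recall values. The variable and function names are illustrative.

```python
def per_class_metrics(tp, fp, fn):
    """Return precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Inferred counts (assumption: 42 test pairs per class).
cc, ce = 33, 9    # true "contradiction" predicted as contradiction / entailment
ec, ee = 11, 31   # true "entailment" predicted as contradiction / entailment

p_c, r_c, f_c = per_class_metrics(tp=cc, fp=ec, fn=ce)   # "contradiction"
p_e, r_e, f_e = per_class_metrics(tp=ee, fp=ce, fn=ec)   # "entailment"
accuracy = (cc + ee) / (cc + ce + ec + ee)
macro_f1 = (f_c + f_e) / 2   # unweighted mean over the two classes

print(round(p_c, 4), round(r_c, 4), round(f_c, 4))
print(round(p_e, 4), round(r_e, 4), round(f_e, 4))
print(round(accuracy, 4), round(macro_f1, 4))
```

Because the two classes are equally sized under this assumption, the weighted average F1 score coincides with the macro average, which agrees with the identical values of 0.7618 reported for both.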

Conclusions
In this paper, the existing deep learning models were examined for recognizing textual inference in Mongolian bar exam questions. Several fine-tuned transformer-based models were investigated, which are important for the DSS that we aim to develop. The fine-tuned models were evaluated in recognizing textual inference in Mongolian bar exam questions. Overall, as shown in Table 6, the fine-tuned bert-base-multilingual-cased [59] model showed the best results in recognizing textual inference in Mongolian bar exam questions. It was capable of recognizing "contradiction" with a precision of 0.7500, a recall of 0.7857, and an F1 score of 0.7674, as well as recognizing "entailment" with a precision of 0.7750, a recall of 0.7381, and an F1 score of 0.7561. The demo system has been developed, and it can be accessed online at https://www.dl.is.ritsumei.ac.jp/legal_analysis/NLI.html (accessed on 18 January 2024).
In future work, distinctive features need to be investigated to distinguish "contradiction" from "entailment" more accurately. The Mongolian bar exam questions may contain many common or similar sentences. Positive and negative sampling methods or data augmentation need to be considered for further improvements. Our further research will apply LLMs to identify conflicting Mongolian legal texts.

Figure 1. Training and validation accuracy for each epoch.

Figure 2. Performance outcomes of the fine-tuned mongolian-roberta-base. (a) Evaluation metrics: Precision, Recall, F1 score and Accuracy. (b) Training and validation accuracy for each epoch.

Figure 3. Performance outcomes of the fine-tuned mongolian-roberta-large. (a) Evaluation metrics: Precision, Recall, F1 score and Accuracy. (b) Training and validation accuracy for each epoch.

Figure 4. Performance outcomes of the fine-tuned albert-mongolian. (a) Evaluation metrics: Precision, Recall, F1 score and Accuracy. (b) Training and validation accuracy for each epoch.

Table 1. Examples of NLI dataset of Mongolian bar exam questions.

Table 3. Settings of the existing pretrained models.

Table 6. Performance comparison in recognizing textual inference in Mongolian bar exam questions using different models.
* The performance of the existing pretrained models. ** The performance of the fine-tuned models.