Prediction of Arabic Legal Rulings using Large Language Models

In the intricate field of legal studies, the analysis of court decisions is a cornerstone for the effective functioning of the judicial system. The ability to predict court outcomes helps judges during the decision-making process and equips lawyers with invaluable insights, enhancing their strategic approaches to cases. Despite its significance, the domain of Arabic court analysis remains under-explored. This paper pioneers a comprehensive predictive analysis of Arabic court decisions on a dataset of 10,813 real commercial court cases, leveraging the advanced capabilities of current state-of-the-art large language models. Through a systematic exploration, we evaluate three prevalent foundational models (LLaMA-7b, JAIS-13b, and GPT-3.5-turbo) and three training paradigms: zero-shot, one-shot, and tailored fine-tuning. In addition, we assess the benefit of summarizing and/or translating the original Arabic input texts. This leads to a spectrum of 14 model variants, for which we offer a granular performance assessment with a series of different metrics (human assessment, GPT evaluation, ROUGE, and BLEU scores). We show that all variants of LLaMA models yield limited performance, whereas GPT-3.5-based models outperform all other models by a wide margin, surpassing the average score of the dedicated Arabic-centric JAIS model by 50%. Furthermore, we show that all scores except human evaluation are inconsistent and unreliable for assessing the performance of large language models on court decision predictions. This study paves the way for future research, bridging the gap between computational linguistics and Arabic legal analytics.


Introduction
The fusion of law, artificial intelligence (AI), and natural language processing (NLP) stands as a groundbreaking frontier in contemporary research. The legal domain, with its intricate statutes, precedents, and interpretations, offers a unique challenge for computational models. Yet the potential implications of successfully navigating this domain are profound. If legal decisions can be predicted with high precision using machine learning models, the judicial system gains a set of invaluable insights. Such advancements could accelerate legal research and case preparation, and provide judges and lawyers with deeper insights that they might otherwise overlook during case analysis.
Predicting court decisions is challenging, especially for languages under-represented in NLP research, such as Arabic. The inherent complexity of case description texts, combined with the nuances of the Arabic language, compounds the difficulty. Arabic, with its rich morphological structure and myriad dialects, has long been a challenging landscape for NLP tasks [1]. Moreover, case description texts in Arabic are characterized by their detailed rhetoric, extensive use of precedents, and domain-specific terminology [2].

Context
Language model pre-training has proven effective at elevating performance across a wide range of tasks related to processing and understanding human language [3,4].
In the area of AI applied to the legal domain, there have been significant advancements. The rise of machine learning ushered in a new wave of research, with scholars exploring the potential of statistical models for legal prediction [5]. Recent advances in large language models, especially transformers, have further expanded the horizons of this domain [6]. These models have demonstrated exceptional capabilities in a range of NLP tasks, from machine translation [7,8] to sentiment analysis [9,10,11], making their application to the legal domain an exciting avenue of exploration.
This paper embarks on an exploration of predicting Arabic court decisions using large language models (LLMs). By leveraging the latest advances in NLP and deep learning, we aim to test different approaches to using LLMs to maximize predictive capability.

Related Works
Language models (LMs) serve as the basis for various language technologies, but the understanding of their capabilities, limitations, and risks is still lacking. Several benchmarks were built to bridge this gap. The objective of a benchmark is to set a standard by which the performance of systems can be evaluated across a variety of tasks. Kumar et al. [12] introduced the Holistic Evaluation of Language Models (HELM) to enhance the transparency of language models. Initially, they created a taxonomy to categorize the wide range of possible scenarios and metrics relevant to language models. Subsequently, a comprehensive subset of scenarios and metrics was selected based on coverage and feasibility, while also identifying any gaps or underrepresentation. Finally, a multi-metric approach was adopted to evaluate language models. The Beyond the Imitation Game benchmark (BIG-bench) was introduced by Srivastava et al. [13], featuring 204 tasks contributed by 444 authors from 132 institutions. These tasks covered diverse topics and aimed to test the limits of current language models. The performance of various model architectures, including OpenAI's GPT models and Google's dense and sparse transformers, was evaluated on BIG-bench across a wide range of model sizes. Human expert raters also participated to establish a strong baseline. The findings revealed that model performance and calibration improved with larger model sizes, but still fell short of human performance. Interestingly, performance was similar across different model classes, with some advantages observed for sparse transformers. Tasks that showed gradual improvement often required extensive knowledge or memorization, while tasks with breakthrough behavior involved multiple steps or components. In settings with ambiguous context, social bias tended to increase with scale, but it could be mitigated through prompting techniques.
Elmadany et al. [14] presented ORCA, an openly accessible benchmark aimed at evaluating Arabic language comprehension. ORCA was meticulously developed to encompass various Arabic dialects and a wide range of complex comprehension tasks. It leveraged 60 distinct datasets across seven clusters of Natural Language Understanding (NLU) tasks. To assess the current advancements in Arabic NLU, ORCA was employed to conduct a thorough comparison of 18 multilingual and Arabic language models. Furthermore, a public leaderboard was provided, featuring a unified evaluation metric (the ORCA score), defined as the macro-average of the individual scores across all tasks and task clusters.
Abdelali et al. [15] evaluated the performance of Foundation Models (FMs) in various text and speech tasks related to Modern Standard Arabic (MSA) and Dialectal Arabic (DA), including sequence tagging and content classification, across different domains. ChatGPT (OpenAI's GPT-3.5-turbo), Whisper (OpenAI) [16], and USM (Google) [17] were used to conduct zero-shot learning on 33 distinct tasks over 59 publicly available datasets, resulting in 96 test setups. They found that LLMs performed worse than state-of-the-art (SOTA) models across most tasks, dialects, and domains, although they achieved comparable or superior performance on a few specific tasks. The study emphasized the importance of prompt strategies and post-processing for enhancing the performance of FMs and provided in-depth insights and findings.
In parallel, the field of prompt engineering [18] has gained prominence in developing and refining inputs for language models. It provides a user-friendly and intuitive interface for human interaction with LLMs. Given the sensitivity of models to even minor changes in input, prompt engineering focuses on creating tools and techniques to identify robust prompts that yield high-performance outcomes. Various automatic optimization approaches [19,20] have been suggested to determine the prompt that yields the best performance on a particular task or range of tasks.
More specifically, numerous studies have ventured into predicting court decisions across different jurisdictions. In the US, machine learning has been used to anticipate the outcomes of Supreme Court decisions [21]. In Europe, deep learning models have been employed to predict decisions of the European Court of Human Rights [22]. In the Arabic legal domain, a pioneering model named AraLegal-BERT [23] was proposed: a bidirectional encoder Transformer-based model (BERT [24]) fine-tuned for the Arabic legal domain. The model was evaluated against three BERT variations for Arabic across three Natural Language Understanding (NLU) tasks, showcasing superior accuracy over the general and original BERT models on legal text. This work exemplifies how domain-specific customization can significantly improve language model performance in narrow domains, advancing the field's understanding of model adaptation for specialized use cases. However, the tasks targeted in that study were legal text classification, named entity recognition, and keyword extraction. These tasks differ from the scope of our paper, which targets the prediction of Arabic legal rulings, a more challenging and complex problem. Moreover, AraLegal-BERT was trained from scratch on specific Arabic datasets. Our approach differs: we first leverage the advanced linguistic capabilities of general-purpose LLMs, and then try to enhance their performance on Arabic legal ruling prediction using zero-shot and few-shot learning. To the best of our knowledge, the application of this approach to the Arabic legal system remains a new field with attractive potential. This paper aims to bridge this gap by presenting a systematic investigation into predictive analysis of Arabic court decisions via an array of cutting-edge large language models tested on a dataset of real commercial cases.

Contribution
Given the aforementioned context and the gap identified in Arabic legal system analysis, our research offers the following novel contributions:

• Comprehensive Model Evaluation: We conducted a predictive analysis of Arabic court decisions by leveraging three prominent large language models (LLaMA-7b, JAIS-13b, and GPT-3.5-turbo), applied to a dataset comprising 10,813 real commercial court cases.

• Significance of Text Preprocessing: The study thoroughly investigated the potential benefits of summarizing and translating the original Arabic input texts, culminating in the creation of 14 distinct model variants.

• Highlighting LLaMA's Limitations: LLaMA models have been touted as almost equivalent to GPT models [25]. Nevertheless, our findings reveal the markedly reduced performance of all LLaMA model variants compared to JAIS and GPT-3.5 on our dataset of Arabic court decisions.

• Insights into Evaluation Metrics: The paper offers a detailed evaluation of model performance using diverse metrics, namely human assessment, GPT evaluation, ROUGE (1, 2, and L), and BLEU scores. Importantly, the research underscores the unreliability of all metrics barring human assessment.

• Bridging Research Domains: This study bridges the gap between computational linguistics and Arabic legal analytics, laying a foundation for future scholarly endeavors in this interdisciplinary realm.
2 Materials and Methods

Base Large Language Models
LLaMA-7b [25] (designed by Meta AI), JAIS-13b-chat [26] (MBZUAI University), and GPT-3.5-turbo [27,28,29,30] (OpenAI) are three recent representatives of the frontier of large language model (LLM) technology, each hailing from a different origin with distinct architectural innovations. LLaMA-7b, an open-source LLM from Meta AI, showcases a unique architectural approach with a range of models tailored for various applications. JAIS-13b-chat, with its focus on bilingual (Arabic and English) capabilities, offers a novel solution to Arabic-centric language processing tasks. GPT-3.5-turbo, a product of OpenAI, stands out for its optimization for chat-based applications, demonstrating a balance between performance and cost-effectiveness. Table 1 summarizes the main characteristics of these three models, providing a comparative glimpse into their architectural underpinnings, language and domain proficiency, training data, and use cases. Only JAIS was trained on a sizeable proportion (29%) of Arabic text. By contrast, Arabic represented 0.03% of GPT-3's training dataset by word and character count, and 0.01% by document count [31]; similar figures are assumed for GPT-3.5-turbo. Meta AI did not disclose the proportion of tokens per language in the LLaMA training dataset, but the description of its pre-training sources reveals that it is overwhelmingly in English [25].
Another important element of a large language model is the tokenizer. Tokenization consists in subdividing words into sub-word tokens in order to learn a vocabulary that encompasses sub-word units such as prefixes, suffixes, and root components, enabling effective handling of diverse word morphologies. Each of the three base models considered uses a tailored pre-trained tokenizer based on Byte-Pair Encoding (BPE). BPE is a data compression algorithm initially designed to reduce the size of files by replacing frequent sequences of bytes with shorter representations [32]. In recent years, it has been adopted in NLP to tokenize text into subwords or characters in a way that strikes a balance between the flexibility of character-level representations and the efficiency of word-level representations [33]. Nevertheless, we noticed that most common tokenizers used in LLMs are not adapted to the Arabic language, as illustrated in Figure 1. In this example, the LLaMA tokenizer segments a word into individual characters that carry no independent meaning; the same occurs with GPT's Tiktoken tokenizer. By contrast, JAIS tokenizes the same word into a single token, which preserves the meaning.
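To make the BPE mechanism concrete, the following toy sketch learns merge rules from a tiny corpus of symbol sequences. It is purely illustrative and is not the actual LLaMA, Tiktoken, or JAIS tokenizer:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a toy word-frequency dictionary.

    `words` maps a word (as a tuple of symbols) to its corpus frequency.
    """
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Re-segment every word with the new merge applied.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus: "low" occurs 5 times, "lower" twice.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}
merges, vocab = learn_bpe_merges(corpus, 2)
# After two merges, the frequent word "low" becomes a single token.
```

A tokenizer whose merges were learned mostly on English text will, by the same mechanism, leave an Arabic word fragmented into single characters, which is exactly the behavior observed in Figure 1.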

[Table 1 residue: comparative characteristics of the three base models, covering language and domain proficiency (e.g., strong performance on reasoning, coding, proficiency, and knowledge benchmarks; GPT-3.5-turbo optimized for chat and capable of understanding and generating natural language or code), training data and open-source availability (LLaMA trained on 1.4 trillion tokens from publicly available datasets, overwhelmingly in English), and use cases (Llama-2-Chat versions fine-tuned for dialogue; GPT-3.5-turbo optimized for chat-based applications with human-like conversational responses).]

In contrast to LLaMA and GPT-3.5, the JAIS model initially refused to generate predictions concerning court decisions. This refusal reveals the type of precautionary measures incorporated into the model during its Reinforcement Learning from Human Feedback (RLHF) phase. Nevertheless, we successfully elicited predictions from the model by including an explicit instruction stating that these are experiments intended solely for educational and research purposes.
We employed multiple configurations of the three aforementioned foundational models. These configurations encompass zero-shot, single-shot, and fine-tuning training paradigms. Furthermore, they are applied either to the original Arabic dataset or to pre-processed texts that have undergone summarization and/or translation. Cumulatively, these diverse configurations result in 14 distinctive model variants. A comprehensive description of these variants is provided in Section 2.4.
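As a bookkeeping sketch, the variant names used in the evaluation (prefix for the base model; F for fine-tuned, S for summarized prompts, T for translated prompts; a trailing 0/1 for zero-shot vs. single-shot) can be enumerated to verify the count of 14:

```python
# Variant names as used later in the evaluation:
# prefix: L = LLaMA-7b, J = JAIS-13b-chat, G = GPT-3.5-turbo;
# F = fine-tuned, S = summarized prompts, T = translated prompts;
# trailing 0/1 = zero-shot / single-shot prompting.
variants = {
    "LLaMA-7b":      ["L0", "L1", "LT0", "LT1", "LF", "LFT", "LFS", "LFST"],
    "JAIS-13b-chat": ["J0", "J1"],
    "GPT-3.5-turbo": ["G0", "G1", "GT0", "GT1"],
}
total = sum(len(v) for v in variants.values())  # 14 model variants in total
```

Not every combination is instantiated: JAIS has no translated variants and only LLaMA is fine-tuned, for the resource and design reasons discussed below.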

Fine-tuning using LLM-Adapters
Complete fine-tuning has the potential to cause catastrophic forgetting, since it alters all parameters within the model. In contrast, Parameter-Efficient Fine-Tuning (PEFT), by modifying only a limited subset of parameters, demonstrates greater resilience against the detrimental effects of catastrophic forgetting [35]. In this context, LLM adapters offer a simple and efficient approach to PEFT in large language models [36]. LoRA (Low-Rank Adaptation) is a method that can significantly reduce the number of trainable parameters required for fine-tuning large language models. It is a type of LLM adapter integrated into the LLM-Adapters framework, which supports fine-tuning of LLaMA models among others [36]. Since LoRA succeeds in lowering the number of trainable parameters without sacrificing performance, applying it to the LLaMA model aims to achieve high performance while minimizing computational cost.
With this approach, LoRA reduces the number of parameters to be trained during fine-tuning by freezing all of the original model parameters and inserting a pair of rank-decomposition matrices alongside the original weights. Additionally, LoRA applies the adapter method by adding a small subset of parameters, enabling a few low-intrinsic-rank adapters in parallel with the attention module without increasing inference latency. In this work, we fine-tune the LLaMA-7b base model with the LoRA approach on Arabic texts, following the implementation of [37]. LoRA's design allows for flexibility in adding adapters [38], making it efficient for scaling up to large language models for improved performance on custom datasets and tasks. Nevertheless, we did not manage to fine-tune the larger JAIS-13b and GPT-3.5-turbo base models due to resource constraints.
Figure 2 illustrates the integration of the LoRA adapter within the transformer's LLM module, highlighting the modified forward pass in the network and the weight-adjustment mechanism. The LoRA method enhances the fine-tuning of LLMs by decomposing the weight-update matrix into a lower-rank representation instead of updating the original weight matrix directly, leading to fewer parameters during adaptation. This results in faster training and potentially reduced computational needs without losing vital information. In conventional fine-tuning, weight changes are computed via backpropagation based on the loss gradient. LoRA, instead, decomposes these changes into two smaller, lower-dimensional matrices and trains only them, enabling effective representation in a lower-dimensional space and reducing the parameter space.
In the LoRA method, the decomposition of the weight update matrix ∆W into two matrices W_A and W_B is given by:

∆W = W_A W_B    (1)

Assuming W_A and W_B are of dimensions m × r and r × n respectively, where r is the rank and m and n are the dimensions of the original matrix ∆W, the total number of parameters to be learned reduces from m × n to m × r + r × n.
Further, if X is the input to a layer and Y is the output, the modified forward pass in LoRA can be represented as:

Y = XW + X W_A W_B + b    (2)

where W is the frozen pre-trained weight matrix and b is the bias vector. The error E introduced by approximating a full-rank update ∆W with the low-rank product can also be analyzed. It is given by:

E = ||∆W − W_A W_B||_F    (3)

where ||·||_F denotes the Frobenius norm. This mathematical formulation elucidates the reduction in computational complexity and the preservation of the essential information required for task adaptation, showcasing a trade-off between model complexity and adaptation capacity.
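The forward pass and the parameter-count reduction above can be sketched in a few lines. The dimensions and rank below are illustrative toy values, not those of LLaMA-7b:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 64, 4                        # layer dims and LoRA rank (toy values)

W = rng.normal(size=(m, n))                # frozen pre-trained weight matrix
b = np.zeros(n)                            # bias vector
W_A = rng.normal(scale=0.01, size=(m, r))  # trainable low-rank factor (m x r)
W_B = np.zeros((r, n))                     # trainable factor (r x n), zero-init so dW starts at 0

def lora_forward(X):
    # Eq. (2): Y = X W + X W_A W_B + b; only W_A and W_B would be trained.
    return X @ W + (X @ W_A) @ W_B + b

full_params = m * n          # parameters a full-rank update would touch: 4096
lora_params = m * r + r * n  # parameters LoRA actually trains: 512
```

With r = 4 the trainable parameter count drops by a factor of 8, and because W_B starts at zero the adapted layer initially reproduces the frozen model exactly.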
The implementation of LoRA is relatively straightforward, as seen in Figure 2: a modified forward pass is applied in the network, and the magnitude of the weight updates is adjusted to balance pre-trained knowledge with the new task-specific adaptation.

Dataset
We built the Saudi Ministry of Justice dataset (SMOJ) by web scraping the Saudi Justice Portal (SJP) website [39], focusing on the category of commercial courts, which contains a series of court decisions about financial and commercial disputes, all in the Arabic language. To facilitate the data retrieval, we used the Selenium Python library [40], which enables programmatic interaction with web pages, essentially simulating user actions to access and gather data.
The data collection process for SMOJ was structured and systematic, starting with iteration through a range of page numbers from 1 to 60,000, a scope determined based on the expected volume of data available on the website. Before any data extraction occurs, each page's availability is verified by checking for the presence of the text 'Page not found.' This precautionary measure ensures that only existing pages are processed, minimizing potential errors and preventing unnecessary resource consumption.
Once page availability is confirmed, data extraction proceeds with the Beautiful Soup Python package [41], which is tailored for HTML and XML parsing and is employed to dissect the HTML structure of the web pages. This allows for the extraction of specific elements, focusing on the critical legal information contained within the SJP website. The extraction targets three primary categories: case description, justification, and court decision. We used the case description as input (prompt) to the LLM models and the court decision as output (completion). There is no strictly pre-defined format or ordering for the case description and court decision. We decided to exclude the justification field from the input, since it often reveals the inclination of the court decision.
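The per-page extraction step can be sketched as below. The HTML fragment and the class names (`description`, `justification`, `decision`) are hypothetical stand-ins; the real SJP markup differs:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the real SJP markup and class names differ.
html = """
<div class="case">
  <div class="description">Case description text</div>
  <div class="justification">Justification text</div>
  <div class="decision">Court decision text</div>
</div>
"""

def extract_case(page_html):
    soup = BeautifulSoup(page_html, "html.parser")
    if "Page not found" in soup.get_text():
        return None                          # skip non-existent pages
    get = lambda cls: soup.find("div", class_=cls).get_text(strip=True)
    return {
        "description": get("description"),   # used as the LLM prompt
        "decision": get("decision"),         # used as the completion
        # the justification field is deliberately dropped: it leaks the ruling
    }

case = extract_case(html)
```

In the actual pipeline, Selenium fetches each rendered page and the resulting HTML string is handed to a parser of this shape.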
After removing duplicates and excessively long cases (more than 4,096 tokens), we randomly subdivided the SMOJ dataset into a training set of 10,713 cases and a testing set of 100 cases. We opted for a small testing set so that the outputs of each LLM model could be evaluated manually. In fact, we will show in section 3 that all the automatic evaluations were unreliable and inconsistent.
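The preprocessing steps above (deduplication, length filtering, random split) can be sketched as follows; the token count is approximated here by a whitespace word count for illustration, whereas the actual filtering used model tokenizers:

```python
import random

def prepare_splits(cases, max_tokens=4096, test_size=100, seed=42):
    """Dedup, length-filter, and randomly split a list of (prompt, completion) pairs."""
    unique = list(dict.fromkeys(cases))            # drop exact duplicates, keep order
    kept = [c for c in unique
            if len((c[0] + " " + c[1]).split()) <= max_tokens]
    random.Random(seed).shuffle(kept)              # reproducible random split
    return kept[test_size:], kept[:test_size]      # (train, test)

# Synthetic stand-in data, including one duplicate that gets removed.
cases = [(f"case {i}", f"ruling {i}") for i in range(500)] + [("case 0", "ruling 0")]
train, test = prepare_splits(cases, test_size=100)
```

Applied to the full scrape, the same procedure yields the 10,713/100 train/test partition reported above.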
Figure 3 depicts the histogram of the number of words in the prompts (case descriptions) and completions (court decisions) in the SMOJ training dataset.The total number of words in the training set is 5M, and the average number of words in the prompts and completions is 422 and 52, respectively.The large size of the prompts is a real challenge, which motivated us to test LLM models on summarized prompts, as will be detailed in section 2.4.

LLM model variants
For each of the three base pre-trained models described in section 2.1 (LLaMA-7b, GPT-3.5-turbo, and JAIS-13b-chat), we implemented different variants. Figure 4 illustrates the main steps for evaluating the 8 LLaMA variant models on the SMOJ dataset. The base pre-trained LLaMA-7b model is the core of all these models; they differ in the inclusion or not of single-shot or fine-tuning learning, and in the addition or not of summarization and/or translation steps. Translation is expected to enhance the predictions of the LLaMA and GPT-3.5 models, since they are overwhelmingly pre-trained on English texts, as explained in section 2.1. The usefulness of this pre-processing step will be discussed in section 3. Table 2 showcases the instructions fed to the GPT-3.5-turbo API to summarize and/or translate the original SMOJ dataset for fine-tuning the LFS and LFST models.
Table 2 Instructions used to summarize and translate the SMOJ dataset through the GPT-3.5-turbo API.
Due to limited resources, we could not fine-tune the GPT and JAIS models in the same way as the LLaMA models. Besides, it would be pointless to apply JAIS to translated text, since it was pre-trained with a special focus on the Arabic language. Furthermore, multi-shot variants were not examined in our research, due to the large input sizes in our dataset and the limited context length of the models (4,096 tokens).
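The summarization/translation pre-processing can be wrapped as a chat request to the GPT-3.5-turbo API. The instruction wording below is a hypothetical stand-in; the exact instructions used are those listed in Table 2:

```python
def build_messages(task, arabic_text):
    """Build a chat-completions request body for GPT-3.5-turbo pre-processing.

    The instruction strings are illustrative placeholders, not the
    exact Table 2 instructions.
    """
    instructions = {
        "summarize": ("Summarize the following Arabic legal case description, "
                      "keeping all facts relevant to the ruling."),
        "translate": ("Translate the following Arabic legal case description "
                      "into English."),
        "summarize_translate": ("Summarize the following Arabic legal case "
                                "description and translate the summary into English."),
    }
    return [
        {"role": "system", "content": instructions[task]},
        {"role": "user", "content": arabic_text},
    ]

messages = build_messages("summarize_translate", "نص وصف القضية")
# `messages` would then be sent via the OpenAI chat-completions endpoint.
```

One such call per case produces the summarized and/or translated prompts on which the LFS and LFST variants are fine-tuned.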
For each model, we employ a suite of metrics, as detailed in section 2.5, to evaluate performance by comparing the predicted rulings to the appropriate version of the ground-truth (GT) rulings from the test dataset. Specifically:
• Models L0, L1, LF, J0, J1, G0, and G1 are evaluated against the original Arabic version of the GT rulings.
• Model LFS is gauged against the summarized Arabic version.
• Models LT0, LT1, LFT, GT0, and GT1 are assessed against the translated form of the GT rulings.
• Model LFST is measured against GT rulings that have been both summarized and translated.

Metrics
The following metrics were applied to evaluate each of the LLM models described in section 2.4:

• Human score: A human evaluator was tasked with assessing the accuracy of the predicted rulings generated by each model against the ground-truth rulings of the test dataset, on a scale from 0 to 5. A score of 0 indicated that the predicted ruling was either nonsensical or wholly incorrect, while a score of 5 signified a flawless prediction, mirroring the decisions encapsulated in the ground-truth ruling regardless of the actual wording. To ensure a uniform evaluation standard and minimize variability in scoring, all model outputs were reviewed by the same evaluator.

• GPT score: We used the GPT-3.5-turbo API to automatically and systematically compare the predicted rulings generated by each model to the ground-truth rulings of the test dataset. To guide this assessment, we provided the GPT model with the following instruction: "Compare the following two court decisions (predicted: 'Decision (predicted)' and ground-truth: 'Decision (GT)') and assign a score from 0 to 5 to the predicted decision. 0 means: Nonsense. 5 means: perfect answer. Format the response as: Score; Justification. For example: 0; Nonsense."

• BLEU score: BLEU [43,44], an acronym for Bilingual Evaluation Understudy, was designed as a metric for assessing the quality of machine-translated text between two natural languages. The BLEU score is computed as a weighted geometric mean of modified n-gram precisions, further adjusted by a brevity penalty, which diminishes the score if the machine translation is notably shorter than the reference. The weighted geometric mean ensures a preference for translations that perform consistently well across the different n-gram precision levels. More specifically, the BLEU score is given by:

BLEU = BP · exp( Σ_{n=1}^{N} w_n log p_n )

where:
- p_n is the modified n-gram precision;
- w_n are the weights for each precision (typically w_1 = w_2 = w_3 = w_4 = 0.25 for BLEU-4);
- BP is the brevity penalty: BP = 1 if c > r, and BP = exp(1 − r/c) otherwise, with c the predicted output length and r the ground-truth length.
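A minimal self-contained sketch of this computation (uniform weights, no smoothing), rather than a standard library implementation:

```python
import math
from collections import Counter

def modified_precision(pred, ref, n):
    """Clipped n-gram precision p_n between token lists pred and ref."""
    pred_ngrams = Counter(zip(*[pred[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
    # Each predicted n-gram is credited at most as often as it occurs in ref.
    clipped = sum(min(c, ref_ngrams[g]) for g, c in pred_ngrams.items())
    return clipped / max(sum(pred_ngrams.values()), 1)

def bleu(pred, ref, max_n=4):
    """BLEU = BP * exp(sum_n w_n log p_n) with uniform weights w_n = 1/max_n."""
    c, r = len(pred), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))     # brevity penalty
    precisions = [modified_precision(pred, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0                                         # no smoothing in this sketch
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

pred = "the court orders the defendant to pay".split()
score = bleu(pred, pred)   # identical texts give a BLEU of 1.0
```

Production evaluations typically add smoothing for the zero-precision case; the experiments here used standard library implementations [43,44].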
• ROUGE score: ROUGE [46] is an acronym for Recall-Oriented Understudy for Gisting Evaluation. Its primary purpose is to evaluate the performance of automatic summarization tools and machine translation systems within the realm of natural language processing (NLP). The fundamental idea behind ROUGE is to juxtapose an algorithmically generated summary or translation with one or multiple human-crafted reference summaries or translations, to determine how well the machine-generated output aligns with the human standard. We apply it here to the comparison between predicted and GT rulings in the SMOJ testing dataset. We specifically used three variants of the ROUGE score:
-ROUGE-1: This metric gauges the overlap of unigrams (individual words) between the predicted output and the GT ruling. By examining the matching single words between both texts, ROUGE-1 provides insight into basic lexical similarity.
-ROUGE-2: Stepping beyond individual words, ROUGE-2 considers bigrams (pairs of adjacent words). By comparing the overlap of these word pairs between the predicted and GT outputs, ROUGE-2 offers a deeper understanding of phrasal and structural alignment. The general formula for the ROUGE-N score is:

ROUGE-N = Σ Count_match(N-gram) / Σ Count(N-gram)

where:
* Count_match(N-gram) is the maximum number of times an N-gram occurs in both the predicted and GT outputs;
* Count(N-gram) is the count of the N-gram in the GT output.
-ROUGE-L: This metric employs the concept of the Longest Common Subsequence (LCS), the longest sequence of tokens that appears in both the machine-produced and reference texts. It offers a more holistic perspective on similarity, as it naturally considers sentence-level structure and automatically identifies the longest co-occurring in-sequence n-grams.
For each of these three metrics, we compute the precision (P), recall (R), and F1 score (F). We calculated the ROUGE metrics using the rouge 1.0.1 Python library [47].
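For illustration, ROUGE-N precision, recall, and F1 can be computed directly from n-gram counts. This is a simplified sketch, not the rouge 1.0.1 implementation used in the experiments:

```python
from collections import Counter

def rouge_n(pred_tokens, gt_tokens, n=1):
    """ROUGE-N precision, recall, and F1 between predicted and GT token lists."""
    pred_ngrams = Counter(zip(*[pred_tokens[i:] for i in range(n)]))
    gt_ngrams = Counter(zip(*[gt_tokens[i:] for i in range(n)]))
    # Clipped overlap: each n-gram counted at most as often as in either text.
    overlap = sum(min(c, gt_ngrams[g]) for g, c in pred_ngrams.items())
    p = overlap / max(sum(pred_ngrams.values()), 1)   # precision
    r = overlap / max(sum(gt_ngrams.values()), 1)     # recall
    f = 2 * p * r / (p + r) if p + r else 0.0         # F1
    return p, r, f

# Toy predicted vs. ground-truth rulings (illustrative tokens only).
p, r, f = rouge_n("pay the plaintiff 5000".split(),
                  "the defendant shall pay the plaintiff 5000".split(), n=1)
```

Here every predicted unigram appears in the GT ruling (precision 1.0), but the shorter prediction misses part of the reference, which the recall term penalizes.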
While the BLEU and ROUGE metrics were primarily conceived for tasks related to translation and summarization, they can potentially serve as indicative tools for evaluating the alignment between predicted and GT rulings in the SMOJ dataset.We will assess the correctness of this hypothesis in section 3.

Results
Within this section, we undertake a comprehensive evaluation of the implemented models, encompassing both qualitative and quantitative assessments.In subsection 3.1, we present a qualitative comparison between human and GPT scores on a small sample, exemplifying scenarios where predictions align with or deviate from expectations.We also discuss the challenges and nuances of employing GPT-3.5 as an evaluation metric.In subsection 3.2, we delve into the performance evaluation of the 14 models using diverse metrics, shedding light on the impact of zero-shot, single-shot, and fine-tuning approaches, as well as the prompt summarization and/or translation pre-processing steps.We further discuss the reliability of GPT, BLEU, and ROUGE scores.This holistic evaluation provides insights into the strengths and limitations of LLMs for the prediction of court decisions.

Qualitative evaluation
Table 3 provides a qualitative comparison between human and GPT scores on a small sample of predicted and GT rulings from the testing dataset.This sample is representative of most of the encountered cases.The first row shows an example in Arabic.
The predicted output contains a correct decision briefly expressed with implicit reference to the amount mentioned in the input (case description), while the GT ruling explicitly mentions the names of the plaintiff and defendant and the amount of money that the latter should pay to the former.Because of the difference in formulation, the GPT API gave the prediction a score of only 2/5.Whereas, the human evaluator took into account the semantic matching and assigned a higher score of 4/5.
In the second example, the LLM model issues a perfect ruling matching the same amount to be paid by the plaintiff as in the GT decision.Even though the identities of the plaintiff and defendant are not explicitly mentioned, this is not important, since they are already mentioned in the case description.In this case, both the human evaluator and GPT-3.5 assign a perfect score of 5/5.
In the third example, the predicted output is a series of nonsensical words and symbols.This happens often with LLaMA models, especially when the input size is large.As expected, the human score in this case is 0. However, GPT-3.5 oddly assigns a score of 2/5 to this prediction.This example also reveals the poor Google translation in the GT output, especially for the last sentence where the Arabic word Al-hādī ('guide') was mistaken for its homonym: 'pacific'.Such translation shortcomings can affect the quality of LLM training.
The fourth prediction example in Table 3 is similar, in terms of meaningless predicted output and poor GT translation, but in this case, both the human and GPT scores are rightly equal to 0.
In the fifth and last example, the LLM model merely rehashed the instruction and part of the input that was fed to it, without adding any prediction. This also often happens with LLaMA models. As expected, the human score in this case is 0. However, GPT-3.5 surprisingly assigns a score of 4/5 to this prediction, which showcases its unreliability as an evaluation metric. This is analyzed more precisely using quantitative analysis in the next section.
Table 3 Human and GPT assigned scores on sample predicted and GT rulings from the testing dataset.

Quantitative evaluation
Table 4 presents the performance of the eight LLaMA-7b variant models on the test datasets, measured with the different metrics. The LFST model, which underwent both summarization and translation through GPT-3.5-turbo, consistently delivered the top performance across nearly all evaluation metrics. In stark contrast, the other LLaMA variants displayed significantly subpar results: their human evaluation scores were under 0.5 out of 5, their GPT scores did not surpass 2 out of 5, and both their ROUGE and BLEU scores were notably low. This underperformance was particularly pronounced for the Arabic language models. L1 receives a score of 0 on all metrics except the GPT score. This degradation compared to L0 can be explained by the larger input size that results from adding an example prompt/completion pair to the instruction, which often pushes the total input beyond the model's maximum context length. The primary reason for summarizing prompts in the LFS and LFST models stems from our observation that the LLaMA models frequently produce low-quality responses to longer prompts. This observation finds some validation in Figure 5, which shows scatter plots correlating input size (measured in words) with human evaluation scores for the LT1 and LFST models. Notably, for the LT1 model, prompts exceeding 1000 words invariably receive a score of zero, although the overall correlation remains relatively weak, at -0.3. In contrast, once the prompts are summarized for the LFST model, the correlation between input size and evaluation score vanishes, suggesting that this variant handles inputs of moderate size equally well regardless of length.
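The input-size/score correlation described above is a standard Pearson coefficient between prompt word counts and per-example evaluation scores. A minimal sketch is given below; the prompt/score pairs are hypothetical placeholders, not the actual evaluation data.

```python
import numpy as np

def length_score_correlation(prompts, scores):
    """Pearson correlation between prompt length (in words) and evaluation score."""
    word_counts = np.array([len(p.split()) for p in prompts], dtype=float)
    scores = np.array(scores, dtype=float)
    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
    return float(np.corrcoef(word_counts, scores)[0, 1])

# Hypothetical examples in which longer prompts receive lower scores,
# mimicking the LT1 behavior reported above
prompts = ["word " * n for n in (200, 600, 1000, 1400, 1800)]
scores = [4, 3, 2, 1, 0]
r = length_score_correlation(prompts, scores)  # strongly negative for this toy data
```

A value of r near zero after summarization, as observed for LFST, indicates that prompt length no longer predicts output quality.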
Table 5 presents the outcomes of applying the same metrics to the four GPT-3.5-turbo variant models. Both G0 and GT1 demonstrate closely aligned performance under human scoring, suggesting that the integration of translation and single-shot training did not significantly enhance performance for GPT-based models. However, under the BLEU and ROUGE scores, the translated models, GT0 and GT1, consistently outperform their counterparts. Interestingly, there is a noticeable discrepancy between the GPT score and human judgment; a closer examination of specific prediction instances confirms that the GPT score can be unreliable in several scenarios.
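Part of this discrepancy stems from the surface-overlap nature of BLEU and ROUGE, which reward shared n-grams rather than shared meaning. The simplified unigram sketch below (not the full metrics, and using hypothetical rulings) illustrates how two semantically equivalent rulings with different wording still score low.

```python
from collections import Counter

def rouge1_recall(reference, candidate):
    """ROUGE-1 recall: fraction of reference unigrams covered by the candidate."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

def bleu1_precision(reference, candidate):
    """Unigram (BLEU-1) precision, without the brevity penalty of full BLEU."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

# Hypothetical rulings: same verdict, different surface form
gt = "the defendant shall pay the plaintiff 5000 riyals"
pred = "ordered payment of 5000 riyals to the claimant"
low_overlap = rouge1_recall(gt, pred)  # low score despite semantic equivalence
```

Since a correct legal ruling can be phrased in many valid ways, such overlap-based scores penalize legitimate paraphrases, which is consistent with their weak alignment with human judgment reported here.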
Table 6 shows the performance of the two JAIS-13b-chat models. Only Arabic-based models were tested in this scenario, since the JAIS base model is specifically tailored to the Arabic language. We observe a slight improvement when moving from zero-shot (J0) to single-shot (J1) according to all metrics except the GPT score, which further highlights the unreliability of the GPT score for this task. Even though JAIS was pre-trained with a special focus on Arabic, it falls short of all GPT-based models (Table 5). This confirms the superiority of GPT-based models across a wide range of tasks, even for languages under-represented in their training data, such as Arabic.

Fig. 5 Scatter plot of input size (number of words) versus human score, for the LT1 (left) and LFST (right) models.

Table 5 Results of the evaluation of the four GPT-3.5-turbo variant models on the testing datasets, using various metrics.
Figure 6 maps the 14 implemented models onto the (human score, GPT score) plane. This visualization underscores the dominance of the GPT-based models and the underperformance of their LLaMA-based counterparts. Among the LLaMA models, only the LFST variant approaches the performance of the JAIS and GPT models in terms of human evaluation; notably, LFST is not a pure LLaMA model, as it leverages the summarizing and translation capabilities of GPT-3.5. On the other hand, while the JAIS models outpace the LLaMA models, they lag behind the GPT models. A striking feature of Figure 6 is the evident discrepancy between GPT and human scores. For instance, despite LFST achieving the highest GPT score of all models, it secures a merely moderate human score. In a similar vein, LFT shows a higher GPT score than both LT0 and LT1, even though the latter pair surpass it in human evaluation. This incongruence is especially pronounced for English-based models, as confirmed in Figure 7, where the correlation between the human score and the GPT score is much higher for Arabic-based models (0.92) than for English-based models (0.60). A plausible explanation for this divergence is that translating from Arabic to English may introduce errors, omitting or misrepresenting key details, which makes score evaluation by GPT-3.5 more challenging. This observation extends to the BLEU and ROUGE scores, which consistently align less well with human scores for English models. Most notably, ROUGE-1 precision and ROUGE-L precision exhibit negative correlations with the human score, at -0.79 and -0.77 respectively. All these results suggest that the GPT, BLEU, and ROUGE scores are unreliable for performance evaluation on the considered task.
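The metric-agreement analysis underlying the Figure 7 heatmap amounts to a pairwise correlation matrix over per-example score vectors. A minimal sketch, with hypothetical score vectors standing in for the real evaluation tables:

```python
import numpy as np

def metric_correlations(score_table):
    """Pairwise Pearson correlations between metric columns.

    score_table: dict mapping metric name -> list of per-example scores.
    Returns the metric names and the corresponding correlation matrix.
    """
    names = list(score_table)
    matrix = np.corrcoef(np.array([score_table[n] for n in names], dtype=float))
    return names, matrix

# Hypothetical per-example scores for five predictions
scores = {
    "human": [0, 1, 2, 4, 5],
    "gpt":   [2, 1, 3, 4, 4],            # roughly tracks human judgment
    "bleu":  [0.9, 0.7, 0.5, 0.3, 0.1],  # anti-correlated with human here
}
names, corr = metric_correlations(scores)
```

A negative entry in the human row, as illustrated with the toy "bleu" column above, corresponds to the kind of inversion observed for ROUGE-1 and ROUGE-L precision (-0.79 and -0.77) in the actual results.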

Discussion
The results presented in section 3 provide several important insights in the context of legal ruling prediction using large language models:
• Performance of GPT-3.5-based models: The GPT-3.5-based models outperform all other models by a wide margin, surpassing even the average score of the dedicated Arabic-centric JAIS model by 50%. This is surprising, since the proportion of Arabic in JAIS's pre-training dataset is around 1000 times larger than in GPT-3's pre-training dataset, and JAIS's tokenizer is a priori better adapted to Arabic than GPT's (see section 2.1).
• Reliability of the human score: The human score serves as the gold standard, highlighting the nuanced comprehension that humans bring, beyond automated metrics, in assessing the quality of legal ruling predictions.
• GPT score limitations: The GPT score, though indicative, shows its limitations in several instances, rendering it potentially misleading. Moreover, the significant divergence between GPT scores and human evaluations, especially on translated datasets, underscores potential translation errors or inherent metric limitations.
• Inefficiency of ROUGE and BLEU: The ROUGE and BLEU scores, originally designed for translation and summarization tasks, prove unsuitable for the task at hand.

Conclusions
This study represents a pioneering effort in the realm of Arabic court decision analysis, shedding light on the efficacy of advanced language models in predicting legal outcomes. The findings underscore the remarkable outperformance of GPT-3.5-based models, which surpass even domain-specific models tailored to the Arabic language. This unexpected outcome challenges conventional assumptions about the importance of domain specificity and dataset size for model performance. Nevertheless, in spite of the relative superiority of GPT-3.5-based models, their absolute performance on predicting Arabic legal rulings remains unsatisfactory, with an average human score of 2.4 out of 5. Better models, fine-tuned on larger Arabic legal datasets, need to be developed before LLMs can act as useful legal assistants. Furthermore, the study emphasizes the indispensable role of human evaluation as the gold standard for assessing the quality of legal ruling predictions. While automated metrics like GPT scores, ROUGE, and BLEU can provide valuable indications in some cases, they exhibit limitations in capturing the nuanced, context-dependent nature of legal language. The inefficacy of ROUGE and BLEU scores in this context underscores the need for tailored evaluation metrics when applying advanced language models to legal analysis tasks. Future research in this domain should focus on developing more contextually relevant evaluation measures that better reflect the accuracy and relevance of predictions in the legal context.
Overall, this study serves as a foundation for future research at the intersection of computational linguistics and Arabic legal analytics.It encourages further exploration into the potential of large language models in assisting legal professionals and policymakers, ultimately contributing to the effective functioning of the judicial system and the enhancement of legal decision-making processes.

Fig. 2
Fig. 2 Operational schematic of LoRA adapters within the transformer.

Fig. 3
Fig. 3 Histogram of the number of words in the prompts (case descriptions) and completions (court decisions) in the SMOJ training dataset.

Fig. 4
Fig. 4 Diagram of the main steps for evaluating the 8 LLaMA variant models on the SMOJ dataset.
• Model LT1 is a single-shot model applied to the translated English testing dataset. It includes in the instruction a single translated prompt/completion pair from the training dataset.
• Model LF is obtained by fine-tuning the base model on the original Arabic training dataset for 200 epochs.
• Model LFT is obtained by fine-tuning the base model on the translated English training dataset for 200 epochs.
• Model LFS is obtained by fine-tuning the base model on a subset of the Arabic training dataset after summarizing the prompts through the GPT-3.5-turbo API. We selected only a subset of 1000 prompt/completion pairs due to budget limitations, since requests to the GPT API are costly.
• Model LFST is obtained by fine-tuning the base model on a subset of 1000 prompt/completion pairs from the Arabic training dataset after summarizing and translating them through the GPT-3.5-turbo API.

Fig. 6
Fig. 6 Human score versus GPT score for each tested large language model. Arabic-based models are represented as circles, and English-based models as squares, with a different color code for each base model (LLaMA, JAIS, GPT-3.5).

Fig. 7
Fig. 7 Heatmap of the correlation coefficient between the values of the metrics used for evaluating Arabic (top) and translated (bottom) language models.

Table 4
Results of the evaluation of the 8 LLaMA-7b variant models on the testing datasets, using various metrics.

Table 6
Results of the evaluation of the two JAIS variant models on the testing datasets, using various metrics.