Research on a Decision Prediction Method Based on Causal Inference and a Multi-Expert FTOPJUDGE Mechanism

Legal judgment prediction (LJP) is a crucial part of legal AI, and its goal is to predict the outcome of a case based on the information in the description of criminal facts. This paper proposes a decision prediction method based on causal inference and a multi-expert FTOPJUDGE mechanism. First, a causal inference algorithm was adopted to process unstructured text, mining the information in the text with little manual intervention. Then, a neural network dedicated to each task was set up, together with a neural network that served multiple tasks simultaneously. Finally, the pre-trained language model Lawformer was used to provide knowledge for the downstream tasks. Experiments on the public data set CAIL2018, in comparison with current mainstream decision prediction models, showed that the model significantly improved the performance of the downstream tasks and achieved improvements in multiple indicators. Ablation experiments verified the effectiveness and rationality of each module of the proposed model. The method proposed in this study achieved reasonably good performance and provides a promising solution for legal judgment prediction.


Introduction
Legal judgment prediction (LJP) is a crucial part of legal artificial intelligence (AI), and its goal is to predict the outcome of a case based on the information included in the description of criminal facts. Legal judgment prediction can not only provide judicial personnel with accurate judgment results, assisting them in making judgments and improving their work efficiency, but also help people who are unfamiliar with the law and require legal advice. For example, it can provide a general understanding of a crime committed by oneself or by a loved one.
In the past, legal judgment prediction was often regarded as a text classification problem [1]. For example, Liu et al. refined cases by automatically generating and refining the descriptions of the criminal facts of real criminal cases, then merging similar cases and removing relatively irrelevant information, which manipulated the textual features only to a small extent [2]. Although such methods have made great achievements, they still rely on intuitive data processing and ignore how judges actually reach judgments, so they deviate from real practice and lack a mature understanding of the law and of the fact descriptions of cases. When these models are applied to other scenarios, the outcomes are often worse than expected. Subsequently, Zhong et al. pointed out that, unlike jurisdictions in Europe and the United States, China is a civil law country based on legal provisions, so the prediction of legal provisions should be the most basic of the three subtasks of judgment prediction. In fact, the subtasks follow a strict order that corresponds to how judges decide cases in the real world [3]. Later, Yang et al. argued that, in addition to the strict order of tasks, there is also a mechanism of mutual feedback between the results. They proposed a multi-view network combining the attention mechanism with a bidirectional feedback neural network, which could effectively complete the three subtasks, with the judgment prediction carried out depending on these outcomes [4]. In addition, researchers have leveraged other techniques to improve the interpretability and generalization capabilities of these models. Jiang et al. used deep reinforcement learning to obtain simple document features from factual descriptions to predict crimes [5]. Chen et al. proposed a legal graph network (LGN) to achieve high-accuracy crime prediction [6].
In recent years, causal inference has been widely used in the field of machine learning and has been effectively combined with deep learning. Liu et al. proposed a graph-based causal inference framework and applied it to the field of legal AI. They built a causal graph from the factual descriptions of cases and injected the causal knowledge contained in the graph into a neural network in the form of an auxiliary loss function, achieving better performance and interpretability [7]. Building a causal graph from data and then injecting causal knowledge into a neural network is the mainstream way of applying causal theory in artificial intelligence. There are also ways to design encoders and decoders directly using the principles of causality. For example, in the field of legal AI, the generation of court opinions is an important task, which is critical for judges to understand the case information and make judgments. When Wu et al. dealt with this problem, they found that, since most of the cases brought to trial were decided in favor of the plaintiff, documents generated using only these data tended to be in the plaintiff's favor, which is obviously unreasonable. Therefore, they used the counterfactual principles of causality to design an attentional and counterfactual-based natural language generation mechanism (AC-NLG). It consisted of an attention encoder and a counterfactual decoder. The encoder took the plaintiff's claim and the factual description of the case as the input and calculated weights for perceiving the relevant information in the factual description and the claim. By using the counterfactual decoder combined with a collaborative decision prediction model, the biases in the data could be removed and decision-discriminative opinions (both supporting and non-supporting opinions) could be produced. Good results were achieved in both quantitative and qualitative evaluations [8].
Before the era of deep learning, researchers tried to model the information shared among multiple tasks, hoping to obtain a better generalization ability through joint task learning. This is the goal of multi-task learning (MTL) as summarized by Caruana in 1997: MTL improves the main task by exploiting the domain-specific information contained in the training signals of related tasks [9]. Multi-task learning has been successfully used across many applications of machine learning, from natural language processing [10] and speech recognition [11] to computer vision [12] and drug discovery [13]. It is also known by other names: joint learning, learning to learn, and learning with auxiliary tasks. In general, once a process requires the optimization of more than one loss function, it is effectively multi-task learning. In these scenarios, it is helpful to think explicitly about what the model is doing in terms of MTL in order to gain insights from it. Furthermore, because multiple task networks are combined, some network layers share parameters, which not only reduces the memory usage but also avoids the repeated calculation of the shared layers and improves the speed of model inference. More importantly, if multiple tasks can complement or regularize each other, the model performance may improve [9,14,15].
Current work on judgment prediction suffers from insufficient mining of unstructured text (such as case fact descriptions), an insufficient understanding of the relationship between the three tasks, a lack of model structure adjustment according to the task relationship, and a lack of pre-trained language models as upstream tasks. This paper proposes a causal inference and multi-expert FTOPJUDGE decision prediction model, which includes the pre-trained language model Lawformer, a causal inference mechanism, and a multi-task FTOPJUDGE classifier. The superiority of the model was verified on the public data set CAIL2018 by comparing its results with those of current mainstream decision prediction models. Ablation experiments verified the effectiveness and rationality of each module of the proposed model.
The rest of the paper unfolds as follows: Section 2 presents causal inference and the multi-expert FTOPJUDGE. Section 3 contains the experimental results and discussion. Section 4 presents a summary of the full text.

Data Set Introduction
The data set used in this experiment was China's first large-scale legal data set for judgment prediction, the China AI Legal Challenge data set (CAIL2018). It was released at the "2018 China Legal Research Cup Smart Challenge" jointly held by Tsinghua University, the China Judicial Big Data Research Institute, and other institutions. CAIL2018 collected 2.68 million criminal case judgment documents published by the China Judgment Document Network (http://wenshu.court.gov.cn/, accessed on 10 October 2020), involving a total of 202 crimes and 183 articles of law, with sentences covering prison terms of 0-25 years, life imprisonment, and the death penalty. These documents provide references and standards for researchers in the field of legal AI and save them a lot of time. They greatly promote the development of judgment prediction in China and play a positive role in research on legal intelligence.
Compared with other LJP data sets, CAIL2018 is larger in scale and is divided into three parts, namely practice data, competition data, and data not used in the competition. In current LJP research, the practice data are often called CAIL-small, and the competition data are called CAIL-big. Researchers generally conduct experiments on these two data sets to verify the effectiveness of a model. Each document in CAIL2018 is stored in JSON format and contains two parts: a description of the case facts and the results of the judgment.
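To make the two-part document structure concrete, the following is a minimal sketch of reading one record and separating the fact description from the three judgment labels. The field names ("fact", "meta", "relevant_articles", "accusation", "term_of_imprisonment") and the unit of the prison term are assumptions that should be checked against the released files, and the sample record is invented for illustration.

```python
import json

# One hypothetical record in the style of CAIL2018. Field names are
# assumptions to verify against the released data; values are invented.
SAMPLE_LINE = json.dumps({
    "fact": "The defendant took a mobile phone from the victim's bag ...",
    "meta": {
        "relevant_articles": [264],
        "accusation": ["theft"],
        "term_of_imprisonment": {
            "death_penalty": False,
            "life_imprisonment": False,
            "imprisonment": 6,  # assumed to be in months
        },
    },
})


def parse_record(line):
    """Split one JSON line into the fact text and the three LJP labels."""
    record = json.loads(line)
    meta = record["meta"]
    term = meta["term_of_imprisonment"]
    # Collapse the three sentence fields into a single label.
    if term["death_penalty"]:
        sentence = "death"
    elif term["life_imprisonment"]:
        sentence = "life"
    else:
        sentence = term["imprisonment"]
    return {
        "fact": record["fact"],
        "articles": meta["relevant_articles"],
        "charges": meta["accusation"],
        "sentence": sentence,
    }
```

Calling `parse_record(SAMPLE_LINE)` returns a dictionary whose "fact" entry is the model input and whose remaining entries are the targets of the law, charge, and sentence subtasks.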

The Overall Framework of the Model
The causal inference and multi-expert FTOPJUDGE judgment prediction model was mainly composed of two parts: the causal inference model and the text-processing model. These two parts ran in parallel and were fused at the final loss calculation. The text-processing model consisted of three components: the pre-trained language model Lawformer, the text-encoding model BiLSTM-Att, and the multi-expert FTOPJUDGE classifier. The overall model framework is shown in Figure 1.
The causal inference and multi-expert FTOPJUDGE decision prediction model is also called the Causal-Lawformer-BiLSTM-Att-Multi-Experts-FTOPJUDGE (CBMF). The causal inference part used the fact descriptions in the judgment documents and obtained the causal strength by extracting keywords and establishing a causal graph. In the text-processing stage, the pre-trained language model Lawformer was used to process the case fact description so that the input word vectors carried rich prior knowledge. Then, the text encoder Bi-LSTM was used to process the word vectors, and the attention mechanism was used to obtain the text vector. Through the multi-expert mechanism, the text vector provided exclusive information for each subtask, balancing the tasks against each other while retaining the performance gain from the shared part. In addition, information other than the case fact description in the judgment documents was introduced as additional information and input into the FTOPJUDGE classifier to complete the prediction of the laws, charges, and sentences.
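As an illustration of the step in which the attention mechanism turns the encoder outputs into a single text vector, the following is a minimal sketch of attention pooling in plain Python. The dot-product scoring and the `query` vector are assumptions made for the example; the exact attention form used in BiLSTM-Att is not specified in this section.

```python
import math


def attention_pool(hidden_states, query):
    """Collapse a sequence of hidden vectors into one text vector.

    hidden_states: list of T vectors (e.g. BiLSTM outputs), each of length d.
    query: a length-d vector; in a trained model this would be learned.
    """
    # Relevance score of each time step: dot product with the query.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    # Softmax over time steps, shifted by the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum of the hidden states is the text vector.
    dim = len(hidden_states[0])
    text_vector = [
        sum(w * h[j] for w, h in zip(weights, hidden_states)) for j in range(dim)
    ]
    return text_vector, weights
```

The returned weights sum to one, so the text vector is a convex combination of the hidden states, with more weight on the time steps that score higher against the query.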

Causal Inference
Causal inference is the process of obtaining the causal relationships between variables. Most existing studies have focused on structured data, while few have mined the causal relationships between factors from unstructured data such as text. However, this is a critical component of legal AI. This paper adopts the graph-based causal inference (GCI) framework [7], which constructs causal graphs from fact descriptions without much human intervention and helps legal AI make correct decisions. GCI consists of three parts: constructing a causal graph, assessing the causal strength, and making decisions. The specific process is shown in Figure 2.

Figure 2. The overall process of GCI.

Constructing Cause and Effect Diagrams
In the first step, a modified YAKE algorithm was used to extract, without supervision, the $p$ most important keywords for each statute $l_i$ from the descriptions of the facts of cases, where $l_i \in L$, $L = \{l_1, l_2, \ldots, l_M\}$, and $M$ is the number of statute types. The reasoning behind this choice was that, from the perspective of judgment prediction, predicting the statutes applicable to a case is the most basic task and therefore, to a certain extent, the most important one: whether the statutes are predicted well largely determines whether the subsequent tasks can achieve excellent performance. For Chinese text, the improved YAKE algorithm considered the importance of a word from four perspectives:

• The position of the sentence in which the word was located: the earlier the sentence appeared in the text, the more important the word was. The score was computed from Median(Sent), the median position in the text of all sentences containing the word.

• The word frequency-inverse text frequency: a high-frequency word was not necessarily the most important, so the inverse text frequency was used to measure the true importance of a word. The score combined the word frequency TF and the inverse text frequency IDF, where $TF(x)$ is the frequency of the word $x$ in the text, $MeanTF$ is the average of all word frequencies, and $\sigma$ is the standard deviation of the word frequencies; the word frequency was normalized to avoid the problem of excessive word frequencies in long texts. $N$ is the total number of texts in the corpus, and $N(x)$ is the number of texts containing $x$. The original YAKE used only the word frequency; introducing the inverse text frequency enabled the algorithm to find more keywords.

• Context relation: the more irrelevant words a word co-occurred with, the lower its importance. The score was computed over sliding windows, where DL means that the window slid from left to right and DR means the opposite, $|A_{t,w}|$ is the number of different words that appeared in the window, $MaxTF$ is the maximum frequency of all words, and $CoOccur_{x,k}$ is the number of co-occurrences of $x$ and $k$.

• The frequency of words appearing in sentences: the more sentences a word appeared in, the more important it was.
$$T_{\mathrm{Sentence}} = \frac{SF(x)}{\mathrm{Sentence}_{\mathrm{all}}} \tag{7}$$
where $SF(x)$ is the number of sentences containing the word $x$, and $\mathrm{Sentence}_{\mathrm{all}}$ is the number of all sentences.
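Three of the four features above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the modified algorithm itself: the position term and the TF normalization follow the original YAKE formulation, the $\log(N/N(x))$ factor is an assumed form of the inverse text frequency, only the sentence-frequency term is taken directly from Equation (7), and the context-relation term is omitted for brevity. The function name and its arguments are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, median, pstdev


def keyword_features(sentences, doc_freq, n_docs):
    """Per-word features for one document.

    sentences: list of token lists (the text is assumed to be segmented).
    doc_freq:  mapping word -> number of corpus texts containing it, N(x).
    n_docs:    total number of texts in the corpus, N.
    """
    tf = Counter(word for sent in sentences for word in sent)
    mean_tf = mean(tf.values())
    sigma = pstdev(tf.values())

    features = {}
    for word, freq in tf.items():
        # 1-based indices of the sentences that contain the word.
        positions = [i for i, sent in enumerate(sentences, start=1) if word in sent]
        features[word] = {
            # Original YAKE position term: grows slowly with the median position.
            "position": math.log(math.log(3 + median(positions))),
            # TF normalised by MeanTF + sigma, times an assumed log(N / N(x)) factor.
            "tf_idf": freq / (mean_tf + sigma) * math.log(n_docs / doc_freq[word]),
            # Equation (7): share of sentences that contain the word.
            "sentence": len(positions) / len(sentences),
        }
    return features
```

The sketch stops at the individual features and deliberately does not combine them into a single score.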
Based on these four considerations, each word $x$ was assigned a score $S(x)$. The smaller $S(x)$ was, the more important the word $x$ was. The original YAKE also considered whether a word was capitalized. Since we were dealing with Chinese text, this part was discarded.
The second step was to select the $p$ keywords that were most important to the law and use the K-means algorithm to cluster them into $q$ classes of keywords.
The data $X = \{x_1, x_2, \ldots, x_p\}$ were divided into $q$ groups, namely $C = \{C_1, C_2, \ldots, C_q\}$, where $x_1, x_2, \ldots, x_p$ were the $p$ keywords most important to the law. First, $q$ objects were randomly selected from $X$ as the initial cluster centers $u_1, u_2, \ldots, u_q$, and the distance between each keyword $x_i$ and each cluster center $u_j$ was calculated. Each keyword was assigned to its nearest cluster center $u_j$. A cluster center and the keywords assigned to it represented a cluster. Every time a keyword was allocated, the cluster center was recalculated as the mean of the keywords currently in the cluster. This process was repeated until no keywords were reassigned to other clusters, no cluster centers changed, or the sum of the squared errors reached a local minimum. The $q$ keyword clusters and all the statutes were called the elements of the causal graph, denoted $factor$.
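The clustering loop described above can be sketched as follows. This is a minimal illustration that assumes each keyword is already represented as a numeric vector, uses the Euclidean distance, and updates the centers once per pass over the data (the batch variant of K-means) rather than after every single assignment; the function name and parameters are hypothetical.

```python
import math
import random


def kmeans(points, q, seed=0, max_iter=100):
    """Cluster points (lists of floats) into q groups with batch K-means."""
    rng = random.Random(seed)
    # Randomly pick q of the points as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, q)]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest center (Euclidean distance).
        new_assignment = [
            min(range(q), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the procedure has converged
        assignment = new_assignment
        # Recompute each center as the mean of its current members.
        for j in range(q):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers
```

With well-separated inputs such as `[[0, 0], [0, 1], [10, 10], [10, 11]]` and `q=2`, the two nearby pairs end up in different clusters regardless of which points are drawn as initial centers.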
The third step was to use the greedy fast causal inference (GFCI) algorithm for causal discovery to establish the edges of the causal graph. All elements, $factor$, were treated as nodes of the causal graph, and the algorithm determined whether there was a causal relationship between each pair of nodes. If there was a causal relationship, an edge was established.
GFCI combines a score-based algorithm with a constraint-based algorithm, taking the best of both worlds, and performs as well as the score-based approach. Specifically, GFCI does not rely on the assumption that there are no latent confounders and was therefore suitable for our task. GFCI establishes edges between nodes with causal relationships in a causal graph, with different types of edges for different causal relationships. There are four types of edges, as shown in Table 1 [7].
Table 1. Types of edges in causal graphs and their meanings.

Edge (Type)    Meaning
A → B          A causes B
A ↔ B          There is an unobserved confounding factor between A and B
A o→ B         Either A causes B or there is a confounding factor
A o–o B        Either A causes B, or B causes A, or there is a confounding factor

In addition, we also needed to consider some special cases to prune the edges. First, the identification of the statute was based on the description of the facts, and the statute was the result of the final determination, so it was impossible to have an edge from a statute to other nodes. Second, time was also considered. Due to the causality constraint, a cause must occur before its result. The factual descriptions in a judgment document are usually written in chronological order, so the chronological order could be used to constrain the edges.
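The two pruning rules can be illustrated with a short sketch. It is a simplification under stated assumptions: every edge is treated as a candidate cause-to-effect pair, the order of first mention in the fact description stands in for chronological order, and the handling of the individual edge types in Table 1 is not modeled. The function name, arguments, and node names are hypothetical.

```python
def prune_edges(edges, is_statute, first_position):
    """Drop candidate cause -> effect edges that violate the two constraints.

    edges:          iterable of (cause, effect) node names.
    is_statute:     mapping node -> True if the node is a statute.
    first_position: mapping node -> index of its first mention in the fact
                    description (only needed for non-statute nodes).
    """
    kept = []
    for cause, effect in edges:
        # A statute is the final determination, so it cannot cause anything.
        if is_statute[cause]:
            continue
        # Among factual nodes, a cause must be mentioned before its effect.
        if not is_statute[effect] and first_position[cause] > first_position[effect]:
            continue
        kept.append((cause, effect))
    return kept
```

For example, with factual nodes "break_in" (mentioned first) and "steal" (mentioned second) and one statute node, the edges from "steal" to "break_in" and from the statute to "steal" are removed, while the edges from "break_in" to "steal" and from "steal" to the statute are kept.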
The fourth step was to sample the causal graph to obtain the causal subgraph.Due to the uncertainty of the causal relationship, the causal graph also had uncertainty, so it was necessary to sample the causal graph and determine whether the causal subgraph conformed to the real causal relationship.There are different sampling methods for where () is the score of the word .The smaller () was, the more important the word  was.The original YAKE also considered whether a word was capitalized.Since we were dealing with Chinese text, this part was discarded.
The second step was to select  key words that are most important to the law, and use the -means algorithm to cluster them into  class keywords.
The data  =  ,  , … ,  were randomly divided into  groups, namely  =  ,  , … ,  , where  ,  , … ,  were the  key words were those most important to the law.A number of objects, , was randomly selected from  as the initial cluster center  ,  , … ,  , and the distance between keyword  and each cluster center  was calculated as: Each keyword was assigned to its nearest cluster center  .The cluster centers and the keywords assigned to them represented a cluster.Every time a keyword was allocated, the cluster center was recalculated according to the existing keywords in the cluster.The calculation formula was as follows: This process was repeated until no keywords were reassigned to other clusters, no cluster centers changed, or the sum of the squared errors was locally minimized.The clustered  class key and all the statutes were called the element of the causal graph .
The third step was to use the greedy fast causal inference algorithm (greedy fast causal inference, GFCI) for the causal discovery to establish edges in the causal graph, to treat all elements, , as nodes of the causal graph, and to determine whether there was a causal relationship between the nodes.If there was a causal relationship, then an edge was established.
GFCI is a combination of a score-based and constraint-based algorithm that combines the best of both worlds and performs as well as the score-based approach.Specifically, GFCI does not rely on the assumption that there are no potential confounders, and was therefore suitable for our task.GFCI establishes edges for nodes with causal relationships in a causal graph, and establishes different types of edges for different causal relationships.There are four types of edges, as shown in Table 1 [7].
Table 1.Types of edges in causal graphs and their meanings.

Edge Meaning (Type) A B A makes B A B
There is an unobserved confounding factor between A and B A B Either A makes B or there is a confounding factor A B Either A makes B, or B makes A, or there is a confounding factor In addition, we also needed to consider some special cases to prune the edges.First of all, the identification of the statute was based on the description of the facts, and the statute was the result of the final determination, so it was impossible to have an edge from the statute to other nodes.Meanwhile, the time was also considered.Due to the causality constraint, a cause must occur before the result.The factual descriptions in a judgment document are usually written in chronological order, so the chronological order could be used to constrain the edge.
The fourth step was to sample the causal graph to obtain the causal subgraph.Due to the uncertainty of the causal relationship, the causal graph also had uncertainty, so it was necessary to sample the causal graph and determine whether the causal subgraph conformed to the real causal relationship.There are different sampling methods for

B
There is an unobserved confounding factor between A and B A word  was.The original YAKE also considered whether a word was capitalized.Since we were dealing with Chinese text, this part was discarded.
The second step was to select  key words that are most important to the law, and use the -means algorithm to cluster them into  class keywords.
The data  =  ,  , … ,  were randomly divided into  groups, namely  =  ,  , … ,  , where  ,  , … ,  were the  key words were those most important to the law.A number of objects, , was randomly selected from  as the initial cluster center  ,  , … ,  , and the distance between keyword  and each cluster center  was calculated as: Each keyword was assigned to its nearest cluster center  .The cluster centers and the keywords assigned to them represented a cluster.Every time a keyword was allocated, the cluster center was recalculated according to the existing keywords in the cluster.The calculation formula was as follows: This process was repeated until no keywords were reassigned to other clusters, no cluster centers changed, or the sum of the squared errors was locally minimized.The clustered  class key and all the statutes were called the element of the causal graph .
The third step was to use the greedy fast causal inference algorithm (greedy fast causal inference, GFCI) for the causal discovery to establish edges in the causal graph, to treat all elements, , as nodes of the causal graph, and to determine whether there was a causal relationship between the nodes.If there was a causal relationship, then an edge was established.
GFCI is a combination of a score-based and constraint-based algorithm that combines the best of both worlds and performs as well as the score-based approach.Specifically, GFCI does not rely on the assumption that there are no potential confounders, and was therefore suitable for our task.GFCI establishes edges for nodes with causal relationships in a causal graph, and establishes different types of edges for different causal relationships.There are four types of edges, as shown in Table 1 [7].
Table 1.Types of edges in causal graphs and their meanings.

Edge Meaning (Type) A B A makes B A B
There is an unobserved confounding factor between A and B A B Either A makes B or there is a confounding factor A B Either A makes B, or B makes A, or there is a confounding factor In addition, we also needed to consider some special cases to prune the edges.First of all, the identification of the statute was based on the description of the facts, and the statute was the result of the final determination, so it was impossible to have an edge from the statute to other nodes.Meanwhile, the time was also considered.Due to the causality constraint, a cause must occur before the result.The factual descriptions in a judgment document are usually written in chronological order, so the chronological order could be used to constrain the edge.
The fourth step was to sample the causal graph to obtain the causal subgraph.Due to the uncertainty of the causal relationship, the causal graph also had uncertainty, so it was necessary to sample the causal graph and determine whether the causal subgraph conformed to the real causal relationship.There are different sampling methods for B Either A makes B or there is a confounding factor A word  was.The original YAKE also considered whether a word was capitalized.Since we were dealing with Chinese text, this part was discarded.
The second step was to select  key words that are most important to the law, and use the -means algorithm to cluster them into  class keywords.
The data  =  ,  , … ,  were randomly divided into  groups, namely  =  ,  , … ,  , where  ,  , … ,  were the  key words were those most important to the law.A number of objects, , was randomly selected from  as the initial cluster center  ,  , … ,  , and the distance between keyword  and each cluster center  was calculated as: Each keyword was assigned to its nearest cluster center  .The cluster centers and the keywords assigned to them represented a cluster.Every time a keyword was allocated, the cluster center was recalculated according to the existing keywords in the cluster.The calculation formula was as follows: This process was repeated until no keywords were reassigned to other clusters, no cluster centers changed, or the sum of the squared errors was locally minimized.The clustered  class key and all the statutes were called the element of the causal graph .
The third step was to use the greedy fast causal inference algorithm (greedy fast causal inference, GFCI) for the causal discovery to establish edges in the causal graph, to treat all elements, , as nodes of the causal graph, and to determine whether there was a causal relationship between the nodes.If there was a causal relationship, then an edge was established.
GFCI is a combination of a score-based and constraint-based algorithm that combines the best of both worlds and performs as well as the score-based approach.Specifically, GFCI does not rely on the assumption that there are no potential confounders, and was therefore suitable for our task.GFCI establishes edges for nodes with causal relationships in a causal graph, and establishes different types of edges for different causal relationships.There are four types of edges, as shown in Table 1 [7].

Edge Meaning (Type) A B A makes B A B
There is an unobserved confounding factor between A and B A B Either A makes B or there is a confounding factor A B Either A makes B, or B makes A, or there is a confounding factor In addition, we also needed to consider some special cases to prune the edges.First of all, the identification of the statute was based on the description of the facts, and the statute was the result of the final determination, so it was impossible to have an edge from the statute to other nodes.Meanwhile, the time was also considered.Due to the causality constraint, a cause must occur before the result.The factual descriptions in a judgment document are usually written in chronological order, so the chronological order could be used to constrain the edge.
The fourth step was to sample the causal graph to obtain causal subgraphs. Because the causal relationships were uncertain, the causal graph was uncertain as well, so it was necessary to sample the causal graph and determine whether each causal subgraph conformed to the real causal relationships. Different types of edges were sampled differently. Among the four types of edges, A → B means the edge will be retained; A ↔ B means the edge will be deleted, because it indicates an unobserved confounding factor rather than a causal relationship between A and B; A o→ B has two possible cases, each with a probability of 1/2, so when sampling, the edge is retained with probability 1/2 and discarded with probability 1/2; similarly, for A o-o B, there is a 1/3 chance of each case.
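The sampling rule for the four edge types can be sketched as follows. This is a minimal illustration, not the authors' implementation: the string codes `'->'`, `'<->'`, `'o->'` and `'o-o'` and the function name `sample_subgraph` are assumptions, and the pruning rules for statutes and chronological order are not applied here.

```python
import random

def sample_subgraph(edges, rng):
    """Sample one causal subgraph.

    edges: list of (a, b, kind) with kind in {'->', '<->', 'o->', 'o-o'}.
    Returns a list of directed (cause, effect) pairs.
    """
    sub = []
    for a, b, kind in edges:
        if kind == "->":
            sub.append((a, b))          # definite cause: always retained
        elif kind == "<->":
            continue                    # confounder only: always deleted
        elif kind == "o->":
            if rng.random() < 0.5:      # "A causes B" vs. confounder
                sub.append((a, b))
        elif kind == "o-o":
            case = rng.randrange(3)     # A->B, B->A, or confounder
            if case == 0:
                sub.append((a, b))
            elif case == 1:
                sub.append((b, a))
        else:
            raise ValueError(f"unknown edge type: {kind}")
    return sub
```

Calling the function repeatedly with the same `random.Random` instance yields a set of subgraphs whose edge frequencies approach the probabilities stated above.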

Assessing Causal Strength
Since all the resulting causal subgraphs were still inherently noisy, we refined them by estimating the strength of the causal relationships. We assigned high values to edges with strong causality, and values close to 0 to edges with weak or no causality. Specifically, for an edge in the causal subgraph $G$, the average treatment effect (ATE) $\psi^{G}_{T,Y}$ was used as the strength of the edge from node $T$ to node $Y$ in the graph $G$, and propensity score matching (PSM) was used to estimate it. The principles of ATE and PSM are introduced below.
ATE measures the average effect of an intervention, that is, the difference between the observed result of individual $i$ under the intervention and its counterfactual. For an edge $T \to Y$, if the intervention $T$ is changed from 0 to 1, the expected change of the result $Y$ is as follows:

$$\psi_{T,Y} = E[Y \mid do(T = 1)] - E[Y \mid do(T = 0)]$$

Here, $E$ is the expectation and $do(T = 1)$ means setting the intervention $T$ to 1. Propensity score matching (PSM) is a statistical method used to reduce the influence of data bias and confounding variables, so that the compared groups start from the same baseline. Combining the two methods gives an assessment of the causal strength; the formula is as follows:

$$\psi^{G}_{T,Y} = \frac{1}{N}\left[\sum_{i:\, t_i = 1} (y_i - y_j) + \sum_{i:\, t_i = 0} (y_j - y_i)\right]$$

where $j = \arg\min_{k:\, t_k \neq t_i} |L(z_i) - L(z_k)|$ is the most similar instance of the opposite group of $i$, $N$ is the number of instances, $L$ is the likelihood function, and $t_i$, $y_i$, and $z_i$ are the values of the intervention, outcome, and confounding factor for $i$, respectively.
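To make the matching idea concrete, the sketch below estimates an ATE by pairing each instance with the instance from the opposite group whose propensity score is closest. It is an illustration under assumptions: the propensity scores are taken as given (how they are fitted is outside this sketch), both groups are assumed non-empty, and the name `matched_ate` is hypothetical.

```python
import numpy as np

def matched_ate(t, y, propensity):
    """Estimate the ATE by 1-nearest-neighbour matching on propensity scores.

    t: 0/1 intervention flags, y: outcomes, propensity: one score per instance.
    """
    t, y, p = (np.asarray(a, dtype=float) for a in (t, y, propensity))
    effects = []
    for i in range(len(t)):
        opposite = np.flatnonzero(t != t[i])   # instances in the other group
        j = opposite[np.abs(p[opposite] - p[i]).argmin()]
        # Always "treated minus control", whichever side instance i is on.
        effects.append(y[i] - y[j] if t[i] == 1 else y[j] - y[i])
    return float(np.mean(effects))
```

For example, with interventions `[1, 1, 0, 0]`, outcomes `[3, 5, 1, 2]` and scores `[0.8, 0.55, 0.7, 0.5]`, the matched differences are 2, 3, 2 and 3, so the estimate is 2.5.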

Making Decisions
For each causal subgraph $G_q$, we obtained its causal strengths and then calculated the quality $\mathrm{BIC}(G_q, X)$ of the subgraph by evaluating how well it fitted the data $X$.
Here, we used the Bayesian information criterion (BIC) for the calculation, which measures how well the subgraph fits the data. Then, for the edge $T_j \to Y_i$, the causal strengths in the individual subgraphs were summed, weighted by the quality $\mathrm{BIC}(G_q, X)$ of each subgraph, to obtain the causal strength $\Psi_{T_j, Y_i}$ of the edge $T_j \to Y_i$ in the general graph. The specific formulas for the calculation were as follows:

$$\mathrm{BIC}(G_q, X) = K_{G_q} \ln N_X - 2 \ln L$$

$$\Psi_{T_j, Y_i} = \sum_{q} \mathrm{BIC}(G_q, X)\, \psi^{G_q}_{T_j, Y_i}$$

where $K_{G_q}$ is the number of parameters in the graph $G_q$, $N_X$ is the number of samples in $X$, $L$ is the likelihood function, and $Y_i$ represents the legal clause $l_i$. If the edge $T_j \to Y_i$ does not exist in the graph $G_q$, it contributes nothing to the sum.
Finally, for each case, we treated the factual description as $doc$, combined it with the causal graph, and calculated a score for each statute. The formula was as follows:

$$\mathrm{score}(doc, Y_i) = \sum_{j} \tau(T_j)\, \Psi_{T_j, Y_i}$$

where $\tau(T_j)$ is 1 if $T_j$ is in the fact description, and 0 if it is not. The obtained scores were input into the random forest classifier [16], and the corresponding law was obtained.
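The two quantities used in this step can be sketched in a few lines. The `bic` function is the standard Bayesian information criterion; `statute_scores` assumes that a statute's score is the sum of the overall causal strengths of the factors that occur in the fact description, which is a reading of the text rather than a confirmed formula. Both function names and the dictionary layout are illustrative.

```python
import math

def bic(num_params, num_samples, log_likelihood):
    """Standard BIC: K * ln(N) - 2 * ln(L), with ln(L) passed in directly."""
    return num_params * math.log(num_samples) - 2.0 * log_likelihood

def statute_scores(doc_factors, strength):
    """Score every statute for one case.

    doc_factors: set of factors found in the fact description.
    strength: dict mapping (factor, statute) to the overall causal strength.
    """
    scores = {}
    for (factor, statute), value in strength.items():
        # tau is 1 when the factor occurs in the description, otherwise 0.
        tau = 1.0 if factor in doc_factors else 0.0
        scores[statute] = scores.get(statute, 0.0) + tau * value
    return scores
```

The resulting per-statute scores would then serve as the feature vector passed to the downstream classifier.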

Text Pre-Training
Over the past few years, a variety of pre-trained language models (PLMs) have flourished and demonstrated their ability to effectively extract rich language knowledge from unlabeled corpora, and to achieve significant performance improvements in a variety of downstream tasks. Whereas the traditional BERT is trained on a wide range of texts covering all walks of life, some researchers have added a pre-training stage on text from a specific domain, and have shown that continued pre-training on the target domain corpus yields further performance improvements [17]. At the same time, a judgment document usually consists of thousands of words, but the mainstream PLMs are Transformer-based, so the length of the input text is often limited to 512 tokens, which does not meet our requirements for processing judgment documents. In response to these problems, Xiao et al. proposed a Longformer-based pre-trained language model, Lawformer, in 2021 [18].
As the basic encoder of Lawformer, Longformer does not use a complete self-attention mechanism, but integrates sliding window attention, dilated sliding window attention, and global attention to encode text sequences. The reason for this is that, when the length of the input sequence is $n$, the time complexity and memory complexity of the complete self-attention mechanism are both $O(n^2)$, and an excessively long text length $n$ would inevitably lead to an excessively long training time and consume a huge amount of computing resources. In Longformer, the full self-attention matrix is made sparse by specifying an "attention pattern" of pairs of input positions that attend to each other, resulting in a linear relationship between the complexity and $n$. An example of the combination of the three attention mechanisms is shown in Figure 3.
For the text, we treated each word as a token, and a piece of text was represented as $T = (t_1, t_2, t_3, \ldots, t_n)$, where $t_i$ represents a word and $n$ is the text length.
Sliding Window Attention: For this attention, we only calculated the attention scores between neighboring tokens. Specifically, given a sliding window of size $w$, each token only attended to $\frac{1}{2}w$ tokens on each side. Although a token only gathered information near it in each layer, global information could still be integrated into the hidden representation of each token as the number of layers increased.

Dilated Sliding Window Attention:
In order to further increase the field of view without increasing the amount of computation, the sliding window could be dilated, which is similar to dilated convolution in CNNs [19]. In this attention mechanism, the window was not contiguous; instead, there was a gap of length $l$ between consecutive attended tokens. Since we used a multi-head attention mechanism, different heads could use different gap lengths $l$, which enabled the attention to obtain information at different levels of the text and improved the performance of the model.

Global Attention:
In some specific tasks, some tokens needed to attend to the whole sequence to obtain enough information. For example, in text classification, the special token "CLS" should attend to the entire text. Therefore, we applied global attention to some pre-selected tokens for specific tasks. The chosen tokens attended to the entire sequence to generate a hidden representation, instead of attending only to the surrounding tokens. It is worth noting that the global attention and the sliding window attention used different parameters.
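The three attention patterns can be pictured as a single boolean mask over token pairs. The sketch below builds such a mask with NumPy; it is only a dense illustration (an efficient implementation would never materialise the full $n \times n$ matrix), and the function name `attention_mask` and its parameters are assumptions. Global tokens are treated symmetrically, as in Longformer: they attend to every position and every position attends to them.

```python
import numpy as np

def attention_mask(n, window, dilation=1, global_idx=()):
    """Boolean (n, n) mask; mask[i, j] is True when token i attends to token j."""
    idx = np.arange(n)
    offset = idx[None, :] - idx[:, None]
    half = window // 2
    # Local band: up to `half` neighbours per side, spaced `dilation` apart.
    mask = (np.abs(offset) <= half * dilation) & (offset % dilation == 0)
    for g in global_idx:
        mask[g, :] = True   # a global token attends to every position
        mask[:, g] = True   # and every position attends to it
    return mask
```

With `dilation=1` the number of True entries grows linearly in `n` for a fixed window, which is the source of the linear complexity mentioned above; a larger `dilation` widens the reach of each token without adding entries.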

Text Encoder BiLSTM-Att
The pre-trained embedding obtained from Lawformer above was at the sentence level, so the obtained text sequence representation was $S = (s_1, s_2, \ldots, s_m)$, $s_i \in \mathbb{R}^{k}$, where $k$ is the dimension of the vector and $m$ is the number of sentences in the text.
We processed $S$ using a Bi-LSTM, which applies a forward LSTM and a backward LSTM to the text sequence to obtain two separate hidden states. At time $t$, the hidden state $h_t$ is given by the following:

$$\overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}(s_t, \overrightarrow{h_{t-1}})$$

$$\overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}(s_t, \overleftarrow{h_{t+1}})$$

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

where $\overrightarrow{h_t}$ is the output of the forward LSTM hidden layer at time $t$, $\overleftarrow{h_t}$ is the output of the backward LSTM hidden layer at time $t$, and the two are concatenated to form $h_t$. Finally, the output is $H = (h_1, h_2, \ldots, h_m)$, which contains the contextual and positional information of the text.
After that, the attention mechanism was applied to $H$ to obtain the output $Out_{Att}$, which enabled the model to retain more useful information and, to a certain extent, alleviated the long-distance dependency problem in the Bi-LSTM. The specific formulas were as follows:

$$u = \tanh(W_w H + b_w)$$

$$\alpha = \mathrm{softmax}(u)$$

$$Out_{Att} = H\alpha^{T} \quad (22)$$

where $W_w$ is the parameter matrix to be trained and $b_w$ is the bias term.
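The attention pooling over the Bi-LSTM outputs can be sketched with NumPy. The scoring function used here, a `tanh` projection followed by a softmax, is an assumption consistent with the parameters $W_w$ and $b_w$ named in the text; the function name `attention_pool` and the column layout of `H` are illustrative.

```python
import numpy as np

def attention_pool(H, W_w, b_w):
    """Pool hidden states into one vector.

    H: (d, m) matrix whose columns are the hidden states h_1..h_m.
    W_w: (1, d) projection, b_w: scalar bias.
    Returns (Out_Att of shape (d, 1), alpha of shape (1, m)).
    """
    u = np.tanh(W_w @ H + b_w)       # (1, m) unnormalised scores
    e = np.exp(u - u.max())          # numerically stable softmax
    alpha = e / e.sum()              # weights sum to 1
    return H @ alpha.T, alpha        # weighted sum of the columns of H
```

When all scores are equal, the weights are uniform and the output reduces to the mean of the hidden states.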

Multi-Expert FTOPJUDGE Classifier
Zhong et al. argued that the three tasks in judgment prediction are sequential. They pointed out that, unlike the case law systems of Britain and the United States, China belongs to the civil law system; that is, legal judgments in China are based on the law. Therefore, the judge should first determine the law articles involved in a case, then determine the charge on the basis of the relevant articles, and finally decide the sentence of the defendant on this basis. In actual legal judgment, the article of law, the charge, and the term of the sentence are closely related and progressively supplement each other. Tang et al. found through experiments on large-scale public data sets that, in multi-task learning (MTL) models with complex task correlations, the performance of some tasks was improved at the expense of the performance of other tasks. Like the long-distance dependency problem in NLP, this is hard to avoid, and it is called the seesaw phenomenon [20]. To solve this problem, we used the expert mechanism of the multi-gate mixture-of-experts (MMoE) model. Unlike the MMoE, where multiple experts function similarly, this paper designed two kinds of expert groups with different functions to balance the three tasks.

Information Stripping Using the Multi-Expert Mechanism
In the MMoE mechanism, although a separate gating mechanism is configured for each task, some tasks can still preempt the experts that serve other tasks, mainly because all experts in the mechanism are shared by all tasks; this is also the root cause of the seesaw phenomenon. To address this issue, this article introduces experts that work exclusively on individual tasks to ensure that each task is sufficiently developed. In addition, the function of the multi-expert mechanism is to extract the most appropriate information for the three tasks from the text embedding $Out_{Att}$.
As shown in Figure 4, we set up an exclusive expert group for each task and a shared expert group to realize the information exchange between multiple tasks. Each expert group was composed of multiple expert networks. Exclusive expert groups were responsible for providing information for their own tasks, and the shared expert group was responsible for learning and sharing information to facilitate multi-tasking. In other words, the shared expert group was affected by all tasks, while the exclusive expert groups were affected only by the tasks to which they belonged, and the two kinds of groups were selectively fused through a gating mechanism. Taking task $k$ as an example, the input of the multi-expert mechanism is $Out_{Att} = \{o_1, o_2, \ldots, o_m\}$. The specific calculation process is as follows:

$$S^{k}(o_i) = \left[E_{(k,1)}^{T}, \ldots, E_{(k,m_k)}^{T}, E_{(s,1)}^{T}, \ldots, E_{(s,m_s)}^{T}\right]^{T}$$

$$w^{k}(o_i) = \mathrm{softmax}(W_g^{k} o_i)$$

$$g^{k}(o_i) = w^{k}(o_i)\, S^{k}(o_i)$$

where $o_i$ is the input, $E$ is an expert network (with $E_{(k,\cdot)}$ and $E_{(s,\cdot)}$ denoting the outputs of the exclusive and shared experts for $o_i$), $m_k$ is the number of expert networks in the exclusive expert group, $m_s$ is the number of expert networks in the shared expert group, $W_g^{k} \in \mathbb{R}^{(m_k + m_s) \times d}$ is the matrix of trainable parameters, and $d$ is the dimension of $o_i$. The weighted summation of the results of the different expert networks constitutes the output, $g^{k}$, of our multi-task mechanism.
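The gated fusion of exclusive and shared experts can be sketched as below. This is a minimal illustration under assumptions: each expert is any callable that maps the input vector to a vector of a common size, the gate is a single softmax over a linear projection, and the names `gated_experts` and `softmax` are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_experts(o, task_experts, shared_experts, W_g):
    """Fuse the exclusive and shared expert outputs for one task.

    o: (d,) input vector.
    task_experts / shared_experts: lists of callables, each returning (h,).
    W_g: (m_k + m_s, d) gate matrix for this task.
    """
    experts = list(task_experts) + list(shared_experts)
    S = np.stack([E(o) for E in experts])   # (m_k + m_s, h) expert outputs
    w = softmax(W_g @ o)                    # one gate weight per expert
    return w @ S                            # (h,) weighted sum
```

Because each task has its own gate matrix and its own exclusive experts, gradients from one task reach another task's parameters only through the shared experts.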

Introducing Additional Knowledge
A lot of knowledge is included in a written judgment in addition to the description of the facts of the crime. This includes basic information about the defendant, the court opinion, etc. All of this information can have an impact on the verdict [21]. The additional information is denoted $X_e = \{x_1, x_2, \ldots, x_{e_k}\}$, where $e_k$ is the number of types of additional information. We first normalized it using the following formula:

$$x' = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean of $x$ and $\sigma$ is the standard deviation of $x$. The purpose was to speed up convergence during gradient descent; because the transformation is linear, it preserves the structure of the data while improving its representation in the model. The result was $X_e' = \{x_1', x_2', \ldots, x_{e_k}'\}$. To make these data work better, we designed an additional knowledge encoder, which consisted of two fully connected layers. The specific formula was as follows:

$$T_e = \mathrm{FC}_2(\mathrm{FC}_1(X_e')), \qquad \mathrm{FC}_i(x) = W_e^{i} x + b_e^{i}$$

where $W_e^{i}$ and $b_e^{i}$ are the trainable parameters of the fully connected layer $i$. Then, the obtained result $T_e$ was concatenated with the output of the multi-expert mechanism. The task $k$ was taken as an example in the following equation:

$$T_k = g^{k} \oplus T_e$$

where $T_k$ is the input of the task $k$ in the FTOPJUDGE classifier and $\oplus$ denotes concatenation. The reason why additional information was introduced here, rather than before the input of the multi-expert mechanism, is that the information would have been lost to a certain extent during propagation through the neural network, especially through the complex structure of the multi-expert network [22]. Therefore, we chose to introduce the additional information at the point closest to the classifier.
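The handling of the additional information can be sketched in three small functions. Several points are assumptions: the normalisation is applied per feature across cases, every feature is assumed to have non-zero variance, and no activation is placed between the two fully connected layers because none is specified here (in practice a nonlinearity would usually sit between them). All function names are illustrative.

```python
import numpy as np

def zscore(X):
    """Standardise each column of X (cases x features) to mean 0 and std 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def encode_extra(X_norm, layers):
    """Apply fully connected layers, given as (W, b) pairs, to every row."""
    out = X_norm
    for W, b in layers:
        out = out @ W.T + b
    return out

def task_input(g_k, t_e):
    """Concatenate the multi-expert output with the encoded extra knowledge."""
    return np.concatenate([g_k, t_e])
```

Concatenating after the expert stage keeps the extra features one step away from the classifier, which matches the motivation given above.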

FTOPJUDGE Classifier
In this section, we introduce the FTOPJUDGE classifier, that is, the fully connected TOPJUDGE classifier, in which the structure of each module and its operating principle were adapted to its task.
Unlike Zhong et al., who used an LSTM to build a topological classifier, we used a fully connected network to build the topological structure. We used FTOPJUDGE because we stripped the information through a multi-expert mechanism rather than through the LSTM that Zhong et al. used in their paper. Since our information was a single vector rather than a sequence, recurrent neural networks such as the LSTM did not bring much performance improvement to the model. FTOPJUDGE proved to be far superior to the LSTM-based classifier.
We used a fully connected network as the basic component of classification. The specific structure is shown in Figure 5. The second task of predicting the charge was taken as an example in the following equations:

$$T_2^{in} = T_1^{2} \oplus T_2$$

$$T_2^{j} = \mathrm{FC}_2^{j}(T_2^{j-1}), \quad j = 1, 2, 3, \quad T_2^{0} = T_2^{in}$$

$$\hat{y}_2 = \mathrm{softmax}(T_2^{3})$$

where $T_i^{j}$ is the output of layer $j$ in the three-layer fully connected network corresponding to task $i$, $T_i^{in}$ is the input of task $i$, and $\hat{y}_i$ is the output of task $i$. Taking the second task as an example, the input was the concatenation of the vector $T_1^{2}$ and the input provided to the FTOPJUDGE classifier for the second task, $T_2$. The calculation process of law prediction and sentence prediction is similar to that of charge prediction, with the difference being that only $T_1$ is required for the input of law prediction, while $T_1^{2}$, $T_2^{2}$, and $T_3$ are required for the input of sentence prediction. The different inputs of each task realize the same topological order structure as TOPJUDGE. Finally, we obtained $T_1 \in \mathbb{R}^{l}$, $T_2 \in \mathbb{R}^{ch}$, and $T_3 \in \mathbb{R}^{im}$, where $l$, $ch$, and $im$ are the number of label categories for the articles, charges, and imprisonments, respectively.
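The topological wiring of the three tasks can be sketched with NumPy as below. This is only an illustration of the data flow: the ReLU activations, the parameter layout (a list of three `(W, b)` pairs per task), and the names `run_task` and `ftopjudge_forward` are assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def run_task(x, layers):
    """Three fully connected layers; returns (layer-2 output, probabilities)."""
    h1 = np.maximum(layers[0][0] @ x + layers[0][1], 0.0)   # ReLU: assumption
    h2 = np.maximum(layers[1][0] @ h1 + layers[1][1], 0.0)
    logits = layers[2][0] @ h2 + layers[2][1]
    return h2, softmax(logits)

def ftopjudge_forward(T1, T2, T3, params):
    """Law -> charge -> term, each later task seeing earlier layer-2 outputs."""
    h_law, y_law = run_task(T1, params["law"])
    h_chg, y_chg = run_task(np.concatenate([h_law, T2]), params["charge"])
    _, y_term = run_task(np.concatenate([h_law, h_chg, T3]), params["term"])
    return y_law, y_chg, y_term
```

The law task sees only its own input, the charge task also sees the law task's intermediate representation, and the term task sees both, which reproduces the dependency order of the three subtasks.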


Integration of Causal Inference and Neural Networks
At this stage, compared with causal inference, neural networks still have a clear advantage in processing large amounts of text data. We also observed that the causal knowledge contained in the GCI could be effectively injected into powerful neural networks to give the model better performance and interpretability. This motivated us to combine causal inference with neural networks, so that the neural networks could obtain real causal information and benefit from it. Therefore, we used a fusion method, as shown in Figure 6.

Integration of Causal Inference and Neural Networks
At this stage, neural networks still have a considerable advantage over causal inference in processing large amounts of text data. We also observed that the causal knowledge contained in the GCI could be effectively injected into powerful neural networks to give the model better performance and interpretability. This motivated us to combine causal inference with neural networks, so that the neural networks could obtain real causal information and benefit from it. Therefore, we used a fusion method, as shown in Figure 6.
We injected the evaluated causal strengths into the text encoder BiLSTM-Att. The case fact description was passed through the BiLSTM to obtain sentence embeddings $H = (h_1, h_2, \ldots, h_m)$ with contextual information. After that, the attention mechanism, which uses a learnable query vector $q$, assigned different weights $\{a_1, a_2, \ldots, a_m\}$ to the sentences and summed the sentence embeddings using these weights to construct the text embedding $Out_{Att}$.
For the three tasks, we used the cross-entropy loss function to calculate each task's own loss separately. For task $k$, the loss was computed between $\hat{y}_k$, the result we predicted, and $y_k$, the real result. The three task losses were then combined in a weighted sum, with the weights set manually.
Afterwards, an auxiliary loss, $L_{cons}$, was introduced. It utilized the causal strength learned through the GCI to guide the attention mechanism so that it learned causal knowledge about the statutes, as the decision statutes are the basis for decision prediction. Embedding the legal causal knowledge and text information into the text embedding greatly assisted the subsequent judgment prediction. The specific process was as follows. First, for each element $w_i$ belonging to factor $f$, with corresponding causal strength $\psi_{T_f, Y_j}$, a normalized strength $g_i$ was computed over the entire sequence of causal strengths. Afterwards, $L_{cons}$ was set to make the attention weights close to the normalized causal strengths (Equation (39)). The task loss and the auxiliary loss were added to obtain the total loss. Finally, we used the Adam [23] optimization algorithm to optimize the task.
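
The fusion step described above can be illustrated with a minimal sketch. The display equations are not reproduced in this text, so several choices here are assumptions rather than the authors' exact formulation: dot-product scoring against the query vector, a softmax over sentences, a squared-error form for the auxiliary loss, a hypothetical coefficient `alpha` on that loss, and non-negative causal strengths (with 0 for elements that belong to no factor). All function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(sentence_embeddings, query):
    """Return attention weights a_i and the weighted sum of the embeddings.

    Assumption: the score of sentence i is the dot product q . h_i.
    """
    scores = [sum(q * h for q, h in zip(query, emb)) for emb in sentence_embeddings]
    weights = softmax(scores)
    pooled = [
        sum(w * emb[d] for w, emb in zip(weights, sentence_embeddings))
        for d in range(len(query))
    ]
    return weights, pooled


def normalize_strengths(strengths):
    """Scale non-negative causal strengths so that they sum to 1 (the g_i)."""
    total = sum(strengths)
    if total == 0:
        return [0.0] * len(strengths)
    return [s / total for s in strengths]


def consistency_loss(weights, normalized_strengths):
    """Assumed squared-error form: pulls each a_i toward its g_i."""
    return sum((a - g) ** 2 for a, g in zip(weights, normalized_strengths))


def total_loss(task_losses, task_weights, l_cons, alpha=1.0):
    """Weighted sum of task losses plus the auxiliary loss (alpha is hypothetical)."""
    l_task = sum(w * l for w, l in zip(task_weights, task_losses))
    return l_task + alpha * l_cons


# Toy example: three sentence embeddings of dimension 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q = [0.5, 0.5]
a, out_att = attention_pool(H, q)
g = normalize_strengths([0.2, 0.0, 0.6])
loss = total_loss([0.9, 1.1, 1.4], [1.0, 1.0, 1.0], consistency_loss(a, g))
```

In this toy example the third sentence receives the largest attention weight, and the auxiliary term is zero only when the attention weights coincide with the normalized causal strengths.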

Data Preprocessing
By analyzing the crimes in CAIL2018, we found that the distribution of different crimes was highly uneven. In terms of case counts, the ten most frequent crimes accounted for 79% of the cases, whereas the ten least frequent crimes accounted for only 0.12%. The same situation existed for the statutes in CAIL2018. Therefore, CAIL2018 had an extremely serious data imbalance problem, which created challenges for the subsequent crime prediction and law prediction. The lengths of the case fact description texts were also seriously imbalanced. Taking CAIL-small as an example, the longest text was 56,226 words, the shortest was 6, and the average length was 350.6, as shown in Figure 7. Only 1.9% of the texts had a length between 0 and 100 words, 2.4% of the texts were longer than 1000 words, and 95.8% of the texts were between 100 and 1000 words in length.
In response to these situations, we first sorted the crimes and laws by frequency so that the selected cases involved the more common types of laws and crimes, thereby reducing the occurrence of small sample problems. At the same time, in order to make the comparative experiments fair, we drew on Zhong et al.'s work on judgment prediction in 2018. A piece of data in CAIL-small was selected only if it simultaneously satisfied three conditions: a description text length between 100 and 1000 words, a crime category belonging to the top 119 crimes, and a law category belonging to the top 103 categories. A similar filter was applied to CAIL-big, but the numbers of crimes and statutes were expanded to the top 130 and the top 118, respectively. We collected all the filtered data and used random sampling to divide the data into a training set, a validation set, and a test set in a ratio of 8:1:1. The details are shown in Table 2. In addition, since the sentence length was a continuous variable and also suffered from data imbalance, it was discretized (following Zhong et al.'s previous work) and the labels were converted according to Table 3. The purpose of this was to make the distribution of the number of cases in each interval relatively uniform while remaining reasonable, and to prevent problems such as poor model generalization.
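
The selection and splitting procedure above can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing script: the field names `fact`, `charge` and `article` are hypothetical, each case is assumed to carry a single charge and a single statute, label frequencies are counted before the length filter, and both length bounds are treated as inclusive.

```python
import random
from collections import Counter


def top_labels(cases, key, k):
    """Return the k most frequent values of case[key]."""
    counts = Counter(case[key] for case in cases)
    return {label for label, _ in counts.most_common(k)}


def filter_cases(cases, n_charges, n_articles, min_len=100, max_len=1000):
    """Keep cases with common labels and a fact length within the bounds.

    For CAIL-small the text gives n_charges=119 and n_articles=103;
    for CAIL-big it gives 130 and 118.
    """
    keep_charges = top_labels(cases, "charge", n_charges)
    keep_articles = top_labels(cases, "article", n_articles)
    return [
        case
        for case in cases
        if min_len <= len(case["fact"]) <= max_len  # inclusive bounds are assumed
        and case["charge"] in keep_charges
        and case["article"] in keep_articles
    ]


def split_8_1_1(cases, seed=0):
    """Randomly split into training, validation and test sets in a ratio of 8:1:1."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

A fixed seed is used only so that the sketch is reproducible; the text does not state how the random sampling was seeded.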

Evaluation Indicators
To facilitate the comparison of the benchmarks, the ablation models, and our model, we adopted four evaluation metrics that are widely used in multi-classification tasks: accuracy (Acc.), macro-average precision (macro-precision, MP), macro-average recall (macro-recall, MR), and macro-average F1 value (macro-F1, F1). The specific calculation formulas were as follows:
$$\mathrm{Acc} = \frac{S_{right}}{S_{all}}, \quad \mathrm{MP} = \frac{1}{n}\sum_{i=1}^{n} P_i, \quad \mathrm{MR} = \frac{1}{n}\sum_{i=1}^{n} R_i, \quad \mathrm{F1} = \frac{1}{n}\sum_{i=1}^{n} F_i$$
where $S_{right}$ represents the number of correctly classified samples, $S_{all}$ represents the number of all samples, $n$ represents the number of categories in the data, $P_i$ represents the precision of class $i$, $R_i$ represents the recall of class $i$, and $F_i$ represents the F1 value of class $i$. The formulas for $P_i$, $R_i$, and $F_i$ were as follows:
$$P_i = \frac{TP_i}{TP_i + FP_i}, \quad R_i = \frac{TP_i}{TP_i + FN_i}, \quad F_i = \frac{2 P_i R_i}{P_i + R_i}$$
where $TP_i$ represents the number of samples in category $i$ that were correctly predicted, $FP_i$ represents the number of samples that were incorrectly predicted to be in class $i$, and $FN_i$ represents the number of samples in category $i$ that were incorrectly predicted to be in another class.
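
The four metrics can be computed directly from lists of true and predicted labels. The sketch below follows the definitions given above. Two conventions are assumptions not stated in the text: the set of categories is taken as the union of true and predicted labels, and a ratio with a zero denominator is set to 0.

```python
def macro_metrics(y_true, y_pred):
    """Return Acc., macro-precision, macro-recall and macro-F1 as a dict."""
    labels = sorted(set(y_true) | set(y_pred))  # assumed definition of the n categories
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        p_i = tp / (tp + fp) if tp + fp else 0.0  # zero-denominator convention is assumed
        r_i = tp / (tp + fn) if tp + fn else 0.0
        f_i = 2 * p_i * r_i / (p_i + r_i) if p_i + r_i else 0.0
        precisions.append(p_i)
        recalls.append(r_i)
        f1s.append(f_i)

    n = len(labels)
    return {
        "Acc": acc,
        "MP": sum(precisions) / n,
        "MR": sum(recalls) / n,
        "F1": sum(f1s) / n,
    }


# Toy example with three classes.
scores = macro_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

Because every class contributes equally to MP, MR and F1, rare crimes and statutes weigh as much as frequent ones, which is why these macro-averaged values are informative on an imbalanced data set such as CAIL2018.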

Experimental Design
We set the length of each fact description text to 600, truncating longer texts and padding shorter ones, so that each text contained 30 sentences, each with a length of 20. For each sentence, the embedding dimension of the sentence vector after pre-training was 768, which was the fixed dimension output by the pre-training model, and the number of expert networks in each expert group was 16. The Adam optimizer was used for model optimization; the initial learning rate was 0.001, the batch size was set to 256, and a total of 40 rounds of training were performed. If the loss did not drop over 10,000 batches, the model was considered to be overfitting, and we terminated the training early. In addition, in order to prevent overfitting, we used the dropout mechanism [24], with the retention rate set to 0.5. We used the PyTorch deep learning framework for the experiments, and the experimental environment used is shown in Table 4.
As shown in Table 6, in which some numbers in bold represent the optimal results in the experiment, all models performed better on CAIL-big than on CAIL-small, the reason being that CAIL-big provided more training data. According to the experimental results, our model still achieved a comprehensive improvement in terms of laws and charges. Compared with the current best-performing LADAN+TOPJUDGE, our model improved Acc. by 0.75% and 0.88% and the F1 value by 1.36% and 2.3% for laws and charges, respectively. However, the performance of the model on the sentence task was somewhat different from what was expected. Although the F1 value was slightly improved, the Acc. value lagged behind that of LADAN. Compared with LADAN, our model learned legal knowledge through causal inference, so that it could better handle the small sample problem of law prediction and easily confused laws. The performance gain observed in our experiment is beneficial to the task of crime prediction, whereas LADAN pays more attention to the case fact description itself. It learns 10 related features through an attention-based graph distillation operator to distinguish easily confused cases. Experiments have shown that this is of great help for sentence prediction, which also gives us a better understanding of which content is more helpful for each of the three tasks.
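
The input shaping and the early-stopping rule given under Experimental Design can be sketched as follows. Several details are assumptions, since the text does not specify them: the 600 tokens are cut into fixed 20-token chunks rather than at real sentence boundaries, the padding id is 0, and "the loss did not drop" is read as "no new minimum of the monitored loss". The names `to_sentence_grid` and `EarlyStopper` are illustrative.

```python
MAX_TOKENS = 600      # fixed length of each fact description
NUM_SENTENCES = 30    # sentences per text
SENTENCE_LEN = 20     # tokens per sentence
PAD_ID = 0            # assumed padding id


def to_sentence_grid(token_ids):
    """Truncate or pad to 600 tokens, then reshape into 30 rows of 20 tokens."""
    clipped = list(token_ids[:MAX_TOKENS])
    padded = clipped + [PAD_ID] * (MAX_TOKENS - len(clipped))
    return [padded[i:i + SENTENCE_LEN] for i in range(0, MAX_TOKENS, SENTENCE_LEN)]


class EarlyStopper:
    """Signal a stop once the loss has not improved for `patience` batches."""

    def __init__(self, patience=10_000):
        self.patience = patience
        self.best = float("inf")
        self.stale_batches = 0

    def step(self, loss):
        """Record one batch loss; return True when training should stop."""
        if loss < self.best:
            self.best = loss
            self.stale_batches = 0
        else:
            self.stale_batches += 1
        return self.stale_batches >= self.patience


grid = to_sentence_grid(list(range(1, 26)))  # a 25-token text
```

The remaining settings stated in the text (Adam, an initial learning rate of 0.001, a batch size of 256, 40 rounds, dropout with a retention rate of 0.5) belong in the training loop itself and are not reproduced in this sketch.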

Ablation Experiments
In order to verify the importance of each part of our model, we designed ablation experiments to delete or replace modules to verify the effectiveness of the modules, including:

• No Lawformer (NL): removing the PLM module to verify the effectiveness of the pre-trained model in improving the overall performance of the model.
• No causal inference (NCI): deleting the causal inference module to verify that the causal inference found the causal relationship of related laws and regulations in order to improve the three tasks of LJP.
• No multi-experts (NME): removing the multi-expert module to verify the superiority of the multi-expert mechanism for balancing the relationship between multi-tasks.
• No extra knowledge (CEK): omitting the introduced extra knowledge to verify that the introduction of extra knowledge is helpful for the LJP task.
• Change the location of extra knowledge (CLEK): changing the introduction location of extra information to verify that there is a certain loss in the transmission of information in the neural network.
• Change FTOPJUDGE to TOPJUDGE (CFTT): changing FTOPJUDGE to TOPJUDGE to verify that FTOPJUDGE is more suitable than TOPJUDGE for processing the information that is output by the multi-expert mechanism.
This ablation experiment was performed only on CAIL-small and focused only on two evaluation indicators, Acc. and the F1 value. Because the F1 value is the harmonic mean of precision and recall, it also reflects the quality of the MP and MR to a certain extent. The higher the value, the better the classification effect.
The experimental results are shown in Table 7. In order to verify the effectiveness of the pre-trained model Lawformer, we removed it for experiments. The results showed a significant decrease in the accuracy of predicting laws and charges, but only a slight decrease for sentences. Because these two tasks depend more on understanding the description of the facts of the case than the prediction of the prison term does, this proves that the pre-trained model does help to promote the model's understanding of the description of the facts of the case. In order to verify the validity of causal inference, we removed it and carried out experiments, and the results showed that there was indeed a significant decline in the ability to predict laws. At the same time, since the law task is the basis of all tasks, this also led to a decline in the prediction performance for prison terms and charges. Therefore, it was verified that causal inference does play an important role in the law prediction task. In order to verify the effectiveness of the multi-expert mechanism, we removed it for experiments. As a result, the model's performance dropped significantly for each task, which showed that there is indeed a competitive relationship between the tasks and that our multi-expert mechanism alleviates this problem. In order to explore the role of additional knowledge, we removed it and carried out experiments. The performance of the three tasks decreased, although less than when the multi-expert mechanism was removed, which indicated that the key information it contained was indeed conducive to the determination of the various tasks. As summarized above, the extra information experiences a certain degree of information loss after passing through the multi-expert mechanism. We therefore also moved the introduction position of the extra information to the input of the multi-expert mechanism. The experimental results showed that the effect was worse than in the situation without any additional information, indicating that the extra information then acted not as lost information but as disturbing noise. Finally, in order to verify that our proposed FTOPJUDGE module was more suitable for our model than TOPJUDGE, we replaced FTOPJUDGE with TOPJUDGE and conducted another experiment. The experimental results showed that the F1 values of all tasks except the sentence prediction task decreased significantly, while the Acc. was not affected very much. This also showed that TOPJUDGE's prediction of some small sample data is not ideal. The reason for this is that it uses LSTM as the basic classifier, which destroys the balance between tasks and also causes information loss.

Discussion
This paper focuses on legal judgment prediction technology in the field of legal AI. By using the multi-dimensional information in judgment documents, the relevant laws, charges, and the sentence of the defendant involved in the case can be predicted. This paper proposes a decision prediction model based on causal inference and multi-expert FTOPJUDGE, including the pre-trained language model Lawformer, a causal inference mechanism, and a multi-task FTOPJUDGE classifier. The superiority of the model was verified by using the public data set CAIL2018 and comparing it with the current mainstream decision prediction models. Through ablation experiments, the effectiveness and rationality of each module of the model were verified. Although the model proposed in this paper has made great progress, there is still a gap between the obtained and the ideal results, and the reasons can be traced back to the following points:
(1) Data imbalance. Data imbalance is a natural and unavoidable phenomenon, especially in the legal field, where some crimes are scarce and some crimes are numerous. This was obvious when we analyzed the data set. Therefore, in order to alleviate the impact of data imbalance on the model, we performed a series of processing steps on the data, such as omitting cases with laws and crimes that appear less frequently and converting the sentences to discrete data. However, the data imbalance still affected our model. The experimental results on CAIL-big provide an example, as shown in Figure 8. Some sentence labels had close to 8000 pieces of data, while others had fewer than 100 pieces. In order to solve this problem, the best solution at present is to introduce richer additional information and to mine the information in the case description more fully.
(2) Sentence issues. It can be seen from the results that, although our model significantly outperformed other models in sentence prediction, its rate of improvement relative to the other tasks did not meet our expectations, and its ability has not yet reached an applicable level. The reason for this is that, in addition to insufficient information mining for the description of the facts of the case, in real life the judge often determines the sentence of the defendant from multiple perspectives, and in many situations other factors have an impact on the sentence, such as whether the defendant has a criminal record, whether his attitude toward admitting guilt is good, whether he is a minor, etc. However, this information does not appear in the factual description of the case, and the only additional information available in CAIL2018 was the penalty. Therefore, this also presented difficulties for our judgment prediction. As can be seen in Figure 8, the highest error rates arose from cases with shorter sentences, and our model did not do a good job of distinguishing between cases with no sentence and those with sentences of 0-6 months.

Conclusions
This paper investigates legal judgment prediction technology in the field of legal AI. The relevant laws, the charges, and the sentence of the defendant were predicted by using multidimensional information in judgment documents. In existing judgment prediction studies, unstructured text information such as the case fact description is not sufficiently exploited, the understanding of the relationship between the three tasks is not sufficient, the model structures are not adjusted according to the relationship between the three tasks, and pre-trained language models are not used as the upstream task. In this paper, a causal inference and multi-expert FTOPJUDGE decision prediction model is proposed, including the pre-trained language model Lawformer, a causal inference mechanism, and a multi-task FTOPJUDGE classifier. By using the public data set CAIL2018 and comparing our model with the current mainstream decision prediction models, the superiority of the model was verified. The validity and rationality of each module of the model were verified by ablation experiments. The main contributions of this paper are as follows:
Firstly, this paper proposes a mechanism for processing unstructured text based on a causal algorithm. In this mechanism, the keywords and laws in the text are extracted as causal graph elements, and then the causal inference algorithm is used to discover the causal relationship between the elements so as to build a causal graph. Then, the causal graph is sampled, and the quality of each subgraph is evaluated to approximate the real causal relationship. Finally, the causal information is integrated into the neural network, which gives the neural network a stronger reasoning ability and improves the performance of the model. The experimental results show that this mechanism helps to solve the problem of small samples.
Secondly, this paper proposes the multi-expert FTOPJUDGE mechanism. This mechanism sets up an exclusive expert group for each task, and each expert group is composed of multiple expert networks, which alleviates the competition between tasks. At the same time, a shared expert network serving all tasks is set up to ensure information sharing and mutual promotion among the tasks. On this basis, TOPJUDGE was reformed, and the FTOPJUDGE classifier was constructed based on a fully connected neural network. The experiments showed that it was helpful for improving the performance of the model.
Finally, the pre-trained language model is applied to the decision prediction task. Because this model learned from tens of millions of Chinese legal documents as the upstream task of judgment prediction, it could provide abundant prior knowledge for judgment prediction. The experiments showed that it could significantly improve the performance of downstream tasks on several indicators.

Figure 1. Overall structure of the model.

Figure 2. The overall process of GCI.


Figure 3. The combination of the three attention mechanisms in Lawformer. Note: The meaning of the Chinese characters in the figure is "seriously injured by the hospital".

Figure 4. Structure diagram of the multi-expert mechanism.

Figure 5. The specific structure of FTOPJUDGE.

Figure 6. The use of causal strength to impose constraints.

Figure 7. Fact description text length analysis in CAIL2018.

Figure 8. A confusion matrix of the sentence prediction results of CAIL-big. Note: The rows represent predicted classifications and the columns describe true classifications.

Table 1. Types of edges in causal graphs and their meanings.

Table 2. Statistical analysis of the data set.

Table 5. Results of judgment prediction on CAIL-small in comparative experiments (%) (a) and (b). Note: Some numbers in bold in the table represent the optimal results in the experiment.

Table 6. Results of judgment prediction on CAIL-big in comparative experiments (%) (a) and (b).

Table 7. Results of decision prediction on CAIL-small in ablation experiments. Note: Some numbers in bold in the table represent the optimal results in the experiment.