An Approach Based on Cross-Attention Mechanism and Label-Enhancement Algorithm for Legal Judgment Prediction

Chen, Junyi; Zhang, Xuanqing; Zhou, Xiabing; Han, Yingjie; Zhou, Qinglei

doi:10.3390/math11092032


by Junyi Chen ¹, Xuanqing Zhang ², Xiabing Zhou ³,*, Yingjie Han ¹ and Qinglei Zhou ¹

¹ School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China

² School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo 454003, China

³ School of Computer Science and Technology, Soochow University, Suzhou 215006, China

* Author to whom correspondence should be addressed.

Mathematics 2023, 11(9), 2032; https://doi.org/10.3390/math11092032

Submission received: 3 March 2023 / Revised: 12 April 2023 / Accepted: 21 April 2023 / Published: 25 April 2023

(This article belongs to the Special Issue New Machine Learning and Deep Learning Techniques in Natural Language Processing)


Abstract: Legal Judgment Prediction aims to automatically predict judgment outcomes based on descriptions of legal cases and established law articles, and has received increasing attention. In previous work, several problems have not been adequately solved. The first is how to utilize limited but valuable label information: existing methods mostly ignore the gap between the descriptions of established articles and cases, and integrate them directly. The second is that most studies ignore the mutual constraints among the subtasks; for example, each charge is logically or semantically related to only some specific articles. To address these issues, we first construct a crime similarity graph and then perform a distillation operation to collect discriminative keywords for each charge. Furthermore, we fuse these discriminative keywords, instead of established article descriptions, into the case embedding with a cross-attention mechanism to obtain deep semantic representations of cases that incorporate label information. Finally, under a constraint among subtasks, we optimize the one-hot representation of ground-truth labels based on the label-enhancement algorithm to guarantee consistent results across the subtasks. To verify the effectiveness and robustness of our framework, we conduct extensive experiments on two public datasets. The experimental results show that the proposed method outperforms the state-of-the-art models by 3.89%/7.92% and 1.23%/2.50% in the average MF1-score of the subtasks on CAIL-Small/Big, respectively.

Keywords: legal judgment prediction; cross-attention; label-enhancement algorithm; multi-task learning; graph convolutional network

MSC: 68T01; 68T07; 68T50

1. Introduction

Legal Judgment Prediction (LJP) is a crucial task in legal artificial intelligence [1], especially in the civil law system, and its goal is to automatically predict judgment results according to factual descriptions of legal cases and established law articles. The civil law system, also known as the statutory law system, is a legal system that has been adopted by many countries including China, France, and Germany. In many jurisdictions, the excessive workload of a court can result in severe delays. Suitable LJP models can assist legal professionals by improving work efficiency. In addition, they can provide legal consulting to those who lack legal expertise. As a result, it has become a research hotspot. At present, LJP is generally regarded as a text classification task that comprises three subtasks, i.e., applicable article prediction, charge prediction, and the term-of-penalty prediction, as illustrated in Figure 1. Some studies attempted to solve several subtasks simultaneously [2], whereas others focused on solving only one [3]. They all proposed excellent LJP models. However, due to the complexity of the semantics of legal case descriptions and the characteristics of multiple subtasks, the existing methods still have several shortcomings.

How to utilize limited but valuable label information. Labels carry valuable information that can aid text classification [4], and it has become common to incorporate the semantic information of labels rather than treating them solely as indexes. Accordingly, some existing methods integrate the contents of established articles when encoding the factual descriptions of cases [5]. However, they all directly encode the entire established articles for fusion. As Figure 1 and Table A1 show, the text of a factual description is distinctly different from that of the established articles. Specifically, most words in a factual description describe the process of the event. In contrast, an established law article concisely describes the behavior using more number-related words such as the amount of money, percentage, multiple, term of penalty, etc. Directly encoding established articles for fusion ignores this gap. Zhong et al. [3] proposed that judging whether the law articles and cases are relevant mainly depends on whether the key elements are consistent with the law articles. They manually divided each established article into seven elements to bridge the gap. Inspired by their work, we automatically extract the keywords related to each charge from the factual descriptions of cases to form a keyword dictionary, which is used as the label information in place of the established articles. Since the keywords come from the factual descriptions themselves, there is no gap in the integration process. Table A1 shows the keywords associated with the crime of polluting the environment, which are closely related to the factual description in Figure 1. By comparison, the keywords have a more concentrated feature distribution and contain no number-related information, which the model is insensitive to and finds difficult to learn. Furthermore, due to its complexity, the factual description contains a lot of redundant information. Similarly, the crime-keywords dictionary, which is constructed with an unsupervised method, also contains noisy data. This motivated us to fuse them using a cross-attention mechanism [6] to capture the valuable information.

Subtask consistency constraint of labels in LJP. Most existing studies only consider the topological dependencies among the LJP subtasks [2] while ignoring the semantic or logical label relationships. Unlike topological dependencies, the label relationship exists naturally, and, as shown in Figure A1, each charge is only logically or semantically related to some fixed articles, and vice versa. LJP methods can predict unreliable results when ignoring this constraint. Recently, Dong and Niu [7] adopted relational learning to enforce a consistency constraint among the labels. However, there are confusing labels in LJP, and using a one-hot representation may cause the model to become over-confident. We use the label-enhancement (LE) [8] algorithm to modify the original one-hot label distribution based on the consistency constraints of a graph. By exploiting the network backpropagation algorithm, we could iteratively optimize the model parameters to gradually integrate the consistency constraint into the model.

In this paper, we propose an approach based on Cross-Attention Mechanism and Label-Enhancement Algorithm (CMLEA) for the task of LJP. Specifically, we first extract the keywords corresponding to each charge from the factual descriptions of cases to form a crime-keywords dictionary and then construct a charge similarity graph. Second, we use an improved cross-attention algorithm to help locate the key elements in the factual descriptions and discover the charge-related words from the graph. Finally, we add a subtask consistency constraint by modifying the original one-hot label distribution based on the LE algorithm. The contributions of this paper can be summarized as follows:

1. We propose a new LJP method that avoids the gap in information fusion by using keywords extracted from cases as label-enhanced information instead of established articles.
2. We propose a novel cross-attention fusion distillation mechanism to fuse keywords, which not only identifies the distinctive keywords of each case but also optimizes the representation of keywords into a distinguishable representation.
3. We use the LE algorithm to add a subtask consistency constraint to the one-hot distribution of labels, which not only improves the rationality and consistency of the prediction results but also alleviates the issue of the over-confidence of the model caused by confusing labels.
4. We conduct experiments on two real datasets and achieve excellent results, outperforming all baseline models.

The rest of this paper is organized as follows. Section 2 reviews the related research works. Section 3 shows some mathematical notations and formalizes the LJP task. Section 4 presents our framework and introduces each module step by step. Section 5 conducts experiments to evaluate the proposed model and analyzes the experimental results. Section 6 provides an ethical discussion of LJP. Finally, Section 7 concludes this paper.

2. Related Works

This section reviews the existing literature relevant to our research and points out the differences between previous works and our work.

2.1. Legal Judgment Prediction

LJP has been studied for decades. It is an essential and representative task in countries with civil law systems such as France, Germany, Japan, and China. The early works [9] adopted mathematical or statistical methods to analyze legal cases in specific circumstances. However, these methods were limited by the small amount of labeled data and required manually designed features. With the development of neural networks, many researchers began to explore LJP using deep learning methods. These works can be divided into two primary directions as described below.

The first is the adoption of a novel architecture to improve task performance. For example, Chen et al. [10] proposed a deep gating network using a gating mechanism to improve the performance of predicting the term of penalty. Pan et al. [11] proposed a multi-scale attention model to handle cases with multiple defendants. In addition, Wang et al. [12] proposed a unified dynamic pairwise attention model to predict multiple applicable articles. Li et al. [13] proposed an element-driven attentive neural network model to improve LJP.

The second is to leverage external knowledge or the inherent properties of LJP to improve performance. Hu et al. [14] incorporated ten discriminative attributes of charges to help predict low-frequency charges. Similarly, Guo et al. [15] constructed nine case auxiliary sentences to help predict low-frequency charges. Zhong et al. [16] integrated a real judgment process into the model by iteratively asking questions to detect elements in the factual description and predict the judgment results. Gan et al. [17] proposed integrating declarative legal knowledge as a set of first-order logic rules into a coattention network-based model. These methods have been proven to be effective for LJP. However, integrating external knowledge requires manual construction and annotation, which limits its transferability. Some researchers utilized the relationships among the subtasks. For example, Luo et al. [18] utilized the attention mechanism relationship between facts and law articles to help predict charges. Similarly, Bao et al. [19] used this feature to design a legal attention mechanism. Zhong et al. [2] formalized the dependencies among subtasks as a Directed Acyclic Graph (DAG) and proposed a topological multi-task learning framework. Based on the topology structure among subtasks, Yang et al. [20] proposed a multi-perspective bi-feedback network with a word collocation attention mechanism and Xu et al. [21] designed a novel attention mechanism to distinguish between confusing charges. Moreover, some researchers have explored how to incorporate established articles’ information into the model. Chen et al. [5] proposed an article-wise attention mechanism to enable the content of established articles to guide the prediction of multi-label legal judgment. Yue et al. [22] utilized the results of intermediate subtasks to separate the factual description into different circumstances, which they exploited to make predictions of other subtasks. 
Dong and Niu [7] formalized LJP as a graph-node classification problem. They utilized a masked transformer network to obtain case representations and achieved local consistency of each node’s label distribution through relational learning. However, these methods ignore the differences between the description of the established articles and the cases. To address this problem, Zhong et al. [3] manually divided each article into seven elements and labeled the various elements. In contrast, we use the keywords extracted from the factual descriptions as the label-enhanced information. Since the two have the same distribution, there are no gaps when they are fused. Furthermore, we add a consistency constraint to the subtasks to improve the rationality and consistency of the prediction results.

2.2. Attention Mechanism

Attention is a technique that focuses on different parts of an input vector to capture long-term dependencies. Mnih et al. [23] developed an attention-based RNN model for image classification, which popularized the attention mechanism. Subsequently, Bahdanau et al. [24] applied the attention mechanism to natural language processing for the first time. They added an attention mechanism to the encoder–decoder method for machine translation, which achieved good results. Yang et al. [25] proposed a hierarchical attention network for document classification, which has been used in LJP [18]. Xiong et al. [6] proposed a coattention mechanism that simultaneously focuses on both questions and documents for the task of question answering. Furthermore, Vaswani et al. [26] proposed a transformer based entirely on the attention mechanism and achieved excellent performance in machine translation tasks. Wu et al. [27] proposed a dual attention matching module to model the relationship between the global features of one modality and the local features of another modality. Similar to these methods, we propose a cross-attention mechanism that is suitable for LJP, which not only obtains the keywords related to each case but also optimizes the hidden layer representation of these keywords to make them distinguishable.

2.3. Label Enhancement

Representing truth labels with one-hot vectors is a common practice for text classification models. However, inadequate one-hot representations tend to train over-confident models, which can lead to arbitrary prediction and model overfitting, especially for confusing datasets (with very similar labels) or noisy datasets (with labeling errors). To address this problem, label-enhancement methods have been proposed.

Szegedy et al. [28] employed the label-smoothing (LS) mechanism, which calculates loss, not with a hard truth label (with one-hot representation) but with a weighted mixture of these targets by adding a noise distribution. Geng [29] proposed a new learning paradigm called label distribution learning (LDL), where the label distribution covers a certain number of labels, representing the degree each label describes the instance. Moreover, Xu et al. [8] and Gao et al. [30] proposed label enhancement (LE), which is a technique for recovering label distributions from a training set by leveraging the topological information in the feature space and the correlations among labels. However, a truth-label distribution is difficult to obtain for many existing classification tasks and we only have a unique label for each sample. Guo et al. [31] proposed a novel label confusion model as an enhancement component for existing popular text classification models, which inspired us to improve the label propagation-based LE algorithm by dynamically adding a consistency constraint for labels among subtasks to the one-hot representation of truth labels.

3. Problem Formalization

The corresponding notations are shown in Table 1. Similar to existing studies, we formalize LJP as a document classification problem. The training legal case set is denoted as $Γ = \{(X_{i}^{d}, Y_{i})\}_{i = 1}^{N}$, where N is the number of legal cases. In this paper, LJP aims to predict the applicable article, charge, and term of penalty for each case, so $Y_{i} = \{y_{i}^{a}, y_{i}^{c}, y_{i}^{t}\}$, where $y_{i}^{a} \in \{0, 1\}^{m}$, $y_{i}^{c} \in \{0, 1\}^{n}$, and $y_{i}^{t} \in \{0, 1\}^{p}$, and m, n, and p are the numbers of article labels, charge labels, and term-of-penalty labels, respectively. Our goal is to learn a classifier f from $Γ$ that can predict the judgment results on data with unknown labels, i.e., $f(X_{u}^{d}) \Rightarrow \{\hat{y}_{u}^{a}, \hat{y}_{u}^{c}, \hat{y}_{u}^{t}\}$, where $X_{u}^{d} \notin Γ$.

4. Methodology

This section introduces the framework of our proposed model and then presents each component.

4.1. Overview of the Proposed Framework

We propose an approach based on cross-attention mechanism and label-enhancement algorithm (CMLEA, see Figure 2) for LJP. Specifically, the CMLEA first builds a charge similarity graph as label-enhanced information based on the crime-keywords dictionary constructed from a set of cases. Then, a distillation operation is performed on the graph to extract a more discriminative node representation, $E_{c}$. Next, a cross-attention module is introduced to enable effective interaction between the factual description and label information. This module is followed by a fusion layer, which generates a case representation corresponding to each subtask. Finally, given the subtask consistency constraint, we use the labels of the subtasks as nodes to construct a consistency constraint graph and dynamically modify the one-hot representation of truth labels through label propagation to improve the rationality of the model’s prediction results.

4.2. Basic Encoder Module

We employ a basic encoder module to obtain the preliminary vector representations of the factual descriptions of the cases. Since Transformer [26] and its variants have achieved great success in many fields (e.g., Dong and Niu [7]), we first encode the word sequence $X^{d}$ using BERT [32] to obtain the hidden vector representations $H^{d}$ and $h^{d}$ of the factual description.

$$H^{d} = \mathrm{BERT}(x_{1}^{d}, x_{2}^{d}, \dots, x_{l}^{d}) \tag{1}$$

$$h^{d} = \mathrm{maxpooling}(H^{d}) \tag{2}$$

where $H^{d} = \{h_{1}^{d}, h_{2}^{d}, \dots, h_{l}^{d}\} \in \mathbb{R}^{l \times d_{h}}$, l is the length of the factual description, and $d_{h}$ is the dimension of the last hidden layer of BERT.
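As a shape-level illustration of Equation (2), the following sketch applies max pooling over the token axis. The random matrix is only a stand-in for real BERT output, and the sizes l = 6 and d_h = 4 are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
l, d_h = 6, 4                       # toy sizes chosen for readability (assumption)
H_d = rng.normal(size=(l, d_h))     # random stand-in for the BERT output H^d

# Equation (2): element-wise maximum over the l token positions
h_d = H_d.max(axis=0)               # shape (d_h,)
```

Each entry of `h_d` is the largest activation of that hidden dimension across all tokens, so the pooled vector has a fixed size regardless of the document length.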

4.3. Crime-Keywords Fusion Module Based on Cross-Attention Mechanism

4.3.1. Crime-Keywords Dictionary Construction

We calculate the importance $α$ of each word $x \in X^{d}$ for each charge $c \in Y^{c}$ using YAKE [33]. Furthermore, in order to extract more representative words for each charge, we use the same method as Liu et al. [34], i.e., we enhance YAKE with the inverse document frequency [35].

$$α = \frac{w}{\log((N + 1) / (1 + df))} \tag{3}$$

where w is the importance score of word x calculated by YAKE, N is the number of documents, and $df$ is the number of documents that contain word x.

According to the score $α$, we select the top k words of each charge to form a crime-keywords dictionary D with size $n \times k$. Unlike in Section 4.2, since keywords are discrete and have no contextual semantic information, we obtain the corresponding initial embeddings through Word2Vec [36].

$$E_{d_{i}} = \mathrm{Word2Vec}(kw_{1}, kw_{2}, \dots, kw_{k}) \tag{4}$$

where $E_{D} = \{E_{d_{i}}\}_{i = 1}^{n} \in \mathbb{R}^{n \times (k \times d_{w})}$ is the vector representation of D, $d_{w}$ is the dimension of the word vectors, and, as stated previously, n is the number of charge labels.
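The scoring and selection step can be sketched as follows. This is a minimal illustration under stated assumptions: the YAKE scores and document frequencies are hypothetical numbers rather than output of the YAKE library, the helper names `keyword_scores` and `top_k` are invented for this example, and ranking smaller $α$ first follows YAKE's convention that lower scores mark more important words, which the text above does not state explicitly.

```python
import math

def keyword_scores(yake_scores, doc_freq, n_docs):
    """Equation (3): alpha = w / log((N + 1) / (1 + df)) for each word."""
    alpha = {}
    for word, w in yake_scores.items():
        idf = math.log((n_docs + 1) / (1 + doc_freq[word]))
        alpha[word] = w / idf      # idf is 0 when df == N; not handled in this sketch
    return alpha

def top_k(alpha, k, lower_is_better=True):
    # Assumption: lower alpha means more important, as with raw YAKE scores.
    ranked = sorted(alpha, key=alpha.get, reverse=not lower_is_better)
    return ranked[:k]

# Hypothetical YAKE scores and document frequencies for one charge
w = {"sewage": 0.02, "discharge": 0.05, "defendant": 0.04}
df = {"sewage": 9, "discharge": 19, "defendant": 98}
alpha = keyword_scores(w, df, n_docs=99)
keywords = top_k(alpha, k=2)       # ["sewage", "discharge"]
```

In this toy input, "defendant" appears in almost every document, so its inverse document frequency is close to zero and its $α$ becomes large, which pushes it out of the selected keywords.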

4.3.2. Charge Similarity Graph Construction

As in Equation (4), we obtain the initial representations of the charge labels, $\{e_{c_{i}}\}_{i = 1}^{n} \in \mathbb{R}^{n \times d_{w}}$. Then, based on $E_{D}$, we improve the representation of each charge label, $e_{c_{i}}$, by incorporating the corresponding keyword representations $E_{d_{i}} = \{e_{kw_{j}}\}_{j = 1}^{k}$.

$$e_{c_{i}}^{*} = \frac{e_{c_{i}} + \sum_{j = 1}^{k} e_{kw_{j}}}{k + 1} \tag{5}$$

where $e_{kw_{j}}$ is the representation of the j-th keyword corresponding to charge $c_{i}$.

Based on the representation $e_{c_{i}}^{*}$, we employ the cosine similarity to calculate the weight of the edge between two charges. Furthermore, to focus on the most similar charges, we remove edges with weights less than a predefined threshold $μ$.

$$s_{ij} = \mathrm{sim}(i, j) = \frac{e_{c_{i}}^{*} \cdot e_{c_{j}}^{*}}{\lVert e_{c_{i}}^{*} \rVert \, \lVert e_{c_{j}}^{*} \rVert} \tag{6}$$

$$M_{s}(i, j) = \begin{cases} s_{ij}, & i \neq j,\ s_{ij} \geq μ \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

For better convergence, we normalize the adjacency matrix through $M_{s}^{*} = P_{s}^{-\frac{1}{2}} M_{s} P_{s}^{-\frac{1}{2}}$. Here, $P_{s}$ is a diagonal matrix with the elements $P_{s}(i, i) = \sum_{j = 1}^{n} M_{s}(i, j)$. Finally, we obtain the charge similarity graph $G_{s} = (V_{s}, E_{s})$, where $V_{s}$ is the set of nodes, whose feature matrix is $E^{*} = \{e_{c_{i}}^{*}\}_{i = 1}^{n} \in \mathbb{R}^{n \times d_{w}}$, and $E_{s}$ is the set of edges, associated with the matrix $M_{s}^{*} \in \mathbb{R}^{n \times n}$.

The construction procedure is provided in Algorithm 1.

Algorithm 1 Construction of crime-keywords dictionary and charge similarity graph.

Input:
The legal case set $Γ$.
The set of charge labels $Y^{c} = \{c_{1}, \dots, c_{n}\}$.
Output:
Crime-keywords dictionary D.
Charge similarity graph $G_{s}$.
1. Construct the crime-keywords dictionary D using Equation (3) based on $Y^{c}$ and $Γ$;
2. Obtain the vector representation $E_{D}$ corresponding to D using Equation (4);
3. According to the keywords of each crime in D, calculate the vector representation $e_{c_{i}}^{*}$ of each charge using Equation (5) and obtain $E^{*} = \{e_{c_{i}}^{*}\}_{i = 1}^{n}$;
4. Calculate the similarity $s_{ij}$ of each pair of charges using Equation (6) and obtain the similarity matrix $M_{s}$ using Equation (7). Then, normalize $M_{s}$ through $M_{s}^{*} = P_{s}^{-\frac{1}{2}} M_{s} P_{s}^{-\frac{1}{2}}$;
5. Obtain the charge similarity graph $G_{s} = (V_{s}, E_{s})$ based on $E^{*}$ and $M_{s}^{*}$;
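Steps 3 to 5 of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function name `charge_similarity_graph`, the random embeddings, and the toy threshold μ = 0.2 are assumptions, as is the choice to leave isolated nodes (degree 0) with all-zero rows during normalization. The sketch also assumes μ > 0 so that all degrees are non-negative.

```python
import numpy as np

def charge_similarity_graph(e_c, e_kw, mu):
    """e_c: (n, d_w) charge embeddings; e_kw: (n, k, d_w) keyword embeddings."""
    k = e_kw.shape[1]
    e_star = (e_c + e_kw.sum(axis=1)) / (k + 1)              # Equation (5)
    unit = e_star / np.linalg.norm(e_star, axis=1, keepdims=True)
    s = unit @ unit.T                                        # Equation (6)
    m_s = np.where(s >= mu, s, 0.0)                          # Equation (7)
    np.fill_diagonal(m_s, 0.0)                               # enforce i != j
    deg = m_s.sum(axis=1)                                    # diagonal of P_s
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5                 # isolated nodes stay at 0
    m_s_norm = inv_sqrt[:, None] * m_s * inv_sqrt[None, :]   # P^-1/2 M P^-1/2
    return e_star, m_s, m_s_norm

rng = np.random.default_rng(0)
n, k, d_w = 4, 3, 5                                          # toy sizes (assumption)
e_star, m_s, m_s_norm = charge_similarity_graph(
    rng.normal(size=(n, d_w)), rng.normal(size=(n, k, d_w)), mu=0.2)
```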

4.3.3. Distillation Operation

Due to the existence of confusing charges, we design a distillation operation, inspired by Xu et al. [21] and Yue et al. [22], to extract a more discriminative node representation. Unlike graph convolution [37], where the aggregated node representations can become indistinguishable, especially in similarity graphs, we remove similar features between nodes in $G_{s}$ through the distillation operation and then utilize the distinct ones to enrich the representation of the central node. The detailed calculation is as follows:

$$E_{c}^{(l + 1)} = E_{c}^{(l)} - \frac{M_{s}^{*} E_{c}^{(l)} W^{(l)}}{| N_{c} |} \tag{8}$$

where $E_{c}^{(l)} \in \mathbb{R}^{n \times d_{w}}$ is the representation of the nodes at the l-th layer, $E_{c}^{(0)} = E^{*}$, and $W^{(l)} \in \mathbb{R}^{d_{w} \times d_{w}}$ is a trainable parameter. $N_{c} = \{| E_{c_{i}} |\}_{i = 1}^{n}$ is the number of edges of each node $c_{i}$, that is, the number of non-zero values in each row of $M_{s}$.
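To make Equation (8) concrete, the sketch below applies one distillation layer to a three-node toy graph. The function name `distill`, the identity weight matrix (standing in for a trained $W^{(l)}$), and the guard that replaces a zero edge count with 1 for isolated nodes are assumptions made for illustration.

```python
import numpy as np

def distill(e_c, m_s, m_s_norm, weights):
    """Apply Equation (8) once per weight matrix in `weights`."""
    # |N_c|: number of non-zero entries in each row of M_s
    n_c = np.count_nonzero(m_s, axis=1).astype(float)
    n_c[n_c == 0] = 1.0     # assumption: avoid division by zero for isolated nodes
    for w in weights:
        e_c = e_c - (m_s_norm @ e_c @ w) / n_c[:, None]
    return e_c

# Nodes 0 and 1 are similar (one edge); node 2 is isolated.
m_s = np.array([[0.0, 0.9, 0.0], [0.9, 0.0, 0.0], [0.0, 0.0, 0.0]])
m_s_norm = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
e0 = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
out = distill(e0, m_s, m_s_norm, [np.eye(2)])
# out == [[0, -1], [0, 1], [0, 1]]
```

In this example, nodes 0 and 1 share the first feature, and after one layer that shared component is cancelled in both rows while the isolated node 2 is left unchanged.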

4.3.4. Cross-Attention Module

The keywords corresponding to different cases can vary, and even similar cases can have subtle differences in their keywords. Therefore, based on coattention [6], we propose a cross-attention module that integrates L2F attention and F2L attention. Specifically, L2F attention is the interaction between the discriminative charge label representation $E_{c}$ and the document representation $H^{d}$, which extracts the key elements in the case. F2L attention is the interaction between $h^{d}$ and the keyword representation $E_{D}$, which extracts the keywords associated with each case from the dictionary D.

$$A = \mathrm{softmax}\left(\frac{(E_{c} W_{q}) (H^{d} W_{k})^{T}}{\sqrt{d_{h}}}\right) \tag{9}$$

$$B = \mathrm{softmax}\left(\frac{(h^{d} W_{q}') (E_{D} W_{k}')^{T}}{\sqrt{d_{h}}}\right) \tag{10}$$

$$H^{lf} = λ A (H^{d} W_{v}) + (1 - λ) B (E_{D} W_{v}')^{T} \tag{11}$$

where $W_{q}, W_{k}', W_{v}' \in \mathbb{R}^{d_{w} \times d_{h}}$ and $W_{k}, W_{q}', W_{v} \in \mathbb{R}^{d_{h} \times d_{h}}$ are transformation matrices that are learned during training. As described above, $A \in \mathbb{R}^{n \times l}$ and $B \in \mathbb{R}^{n \times k}$, where n is the number of charge labels, l is the length of the factual description, and k is the number of keywords for each charge. The $\mathrm{softmax}$ is performed across the column. $λ \in (0, 1)$ is a balancing parameter that controls the fraction of the information inherited from the L2F attention and F2L attention mechanisms. $H^{lf} \in \mathbb{R}^{n \times d_{h}}$ is the output of the cross-attention module.
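One possible reading of Equations (9) to (11) is sketched below in NumPy. Several choices here are assumptions rather than details confirmed by the text: $E_{D}$ is stored as an (n, k, d_w) tensor so that each charge attends over its own k keywords, the softmax normalizes each row (over tokens for A, over keywords for B), and all parameters are random instead of trained. The parameter names `Wq2`, `Wk2`, and `Wv2` stand for the primed matrices.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(E_c, H_d, h_d, E_D, p, lam=0.5):
    """E_c: (n, d_w), H_d: (l, d_h), h_d: (d_h,), E_D: (n, k, d_w)."""
    d_h = H_d.shape[1]
    # Equation (9), L2F: each charge attends over the l tokens of the case
    A = softmax((E_c @ p["Wq"]) @ (H_d @ p["Wk"]).T / np.sqrt(d_h))   # (n, l)
    # Equation (10), F2L: the pooled case vector attends over each charge's keywords
    B = softmax((E_D @ p["Wk2"]) @ (h_d @ p["Wq2"]) / np.sqrt(d_h))   # (n, k)
    l2f = A @ (H_d @ p["Wv"])                                         # (n, d_h)
    f2l = np.einsum("nk,nkh->nh", B, E_D @ p["Wv2"])                  # (n, d_h)
    return lam * l2f + (1 - lam) * f2l, A, B                          # Equation (11)

rng = np.random.default_rng(0)
n, k, l, d_w, d_h = 3, 4, 6, 5, 8                                     # toy sizes
shapes = {"Wq": (d_w, d_h), "Wk": (d_h, d_h), "Wv": (d_h, d_h),
          "Wq2": (d_h, d_h), "Wk2": (d_w, d_h), "Wv2": (d_w, d_h)}
p = {name: rng.normal(size=shape) for name, shape in shapes.items()}
H_lf, A, B = cross_attention(rng.normal(size=(n, d_w)), rng.normal(size=(l, d_h)),
                             rng.normal(size=d_h), rng.normal(size=(n, k, d_w)), p)
```

With these shapes, both attention branches produce an (n, d_h) matrix, so the weighted sum in Equation (11) is well defined and yields one fused vector per charge label.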

4.4. Fusion Layer

Given $H^{lf}$, we fuse it with the factual description representation $h^{d}$ through residual links to obtain the case representations for the different subtasks, i.e., $H^{a}$, $H^{c}$, and $H^{t}$.

$$H^{c} = W_{c} h^{d} + H^{lf} W_{c}' + b_{c} \tag{12}$$

$$H^{a} = W_{a} h^{d} + W_{a}^{★} H^{lf} W_{a}' + b_{a} \tag{13}$$

$$H^{t} = W_{t} h^{d} + W_{t}^{★} H^{lf} W_{t}' + b_{t} \tag{14}$$

where $W_{c} \in \mathbb{R}^{n \times d_{h}}$, $W_{c}' \in \mathbb{R}^{d_{h}}$, $W_{a} \in \mathbb{R}^{m \times d_{h}}$, $W_{a}^{★} \in \mathbb{R}^{m \times n}$, $W_{a}' \in \mathbb{R}^{d_{h}}$, $W_{t} \in \mathbb{R}^{p \times d_{h}}$, $W_{t}^{★} \in \mathbb{R}^{p \times n}$, and $W_{t}' \in \mathbb{R}^{d_{h}}$ are the trainable weight matrices, and $b_{c} \in \mathbb{R}^{n}$, $b_{a} \in \mathbb{R}^{m}$, and $b_{t} \in \mathbb{R}^{p}$ are the biases.
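The dimensions in Equations (12) to (14) can be checked with a short NumPy sketch. The helper `fuse`, the toy sizes, and the random parameters are assumptions for illustration; only the shapes follow the definitions above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p, d_h = 4, 5, 3, 8           # toy label counts and hidden size (assumptions)
h_d = rng.normal(size=d_h)          # pooled case vector h^d
H_lf = rng.normal(size=(n, d_h))    # cross-attention output H^lf

def fuse(W, W_prime, b, W_star=None):
    label_term = H_lf @ W_prime                 # (n,): one score per charge
    if W_star is not None:                      # Equations (13) and (14)
        label_term = W_star @ label_term        # map charge scores to m or p labels
    return W @ h_d + label_term + b

H_c = fuse(rng.normal(size=(n, d_h)), rng.normal(size=d_h), np.zeros(n))
H_a = fuse(rng.normal(size=(m, d_h)), rng.normal(size=d_h), np.zeros(m),
           W_star=rng.normal(size=(m, n)))
H_t = fuse(rng.normal(size=(p, d_h)), rng.normal(size=d_h), np.zeros(p),
           W_star=rng.normal(size=(p, n)))
```

The charge representation uses $H^{lf}$ directly because it already has one row per charge, whereas the article and term-of-penalty representations need $W^{★}$ to project from n charges to m articles or p penalty intervals.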

4.5. Subtask Consistency Constraint Module

Naturally, in a training dataset, a subtask consistency graph $G_{co}$ takes all possible labels $Y = \{Y^{a}, Y^{c}, Y^{t}\}$ as graph nodes and connects two labels from different subtasks with an edge if they co-occur in no fewer than ten cases. The matrix associated with its edges is constructed as follows:

$$M_{co}(i, j) = \begin{cases} 1, & i < m < j \ \text{or}\ m < i < m + n < j < | L | \\ 0, & \text{otherwise} \end{cases} \tag{15}$$

where $| L | = m + n + p$ is the number of all labels and $M_{co} \in \mathbb{R}^{| L | \times | L |}$. Next, similar to the procedure of constructing the charge similarity graph, we normalize $M_{co}$ to obtain $M_{co}^{*} = P_{co}^{-\frac{1}{2}} M_{co} P_{co}^{-\frac{1}{2}}$ based on its degree matrix $P_{co}$.
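The co-occurrence rule can be sketched as follows. This is a minimal illustration under stated assumptions: each case is given as a hypothetical (article, charge, term) triple of label indices, nodes are numbered with articles first, then charges, then terms of penalty, and only entries with i < j are filled, following the index ordering in Equation (15). The normalization step is omitted.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def consistency_matrix(cases, m, n, p, min_count=10):
    """cases: iterable of (article, charge, term) label indices."""
    counts = Counter()
    for a, c, t in cases:
        nodes = (a, m + c, m + n + t)            # global node ids in [0, |L|)
        counts.update(combinations(nodes, 2))    # cross-subtask pairs with i < j
    size = m + n + p
    M = np.zeros((size, size))
    for (i, j), cnt in counts.items():
        if cnt >= min_count:                     # co-occur in no fewer than ten cases
            M[i, j] = 1.0
    return M

# Toy data: one label triple seen 10 times, another seen only 9 times
cases = [(0, 1, 0)] * 10 + [(1, 0, 1)] * 9
M_co = consistency_matrix(cases, m=2, n=2, p=2)
```

In this toy input, only the first triple reaches the threshold, so the matrix holds exactly three edges: article 0 to charge 1, article 0 to term 0, and charge 1 to term 0.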

Since the truth label is represented by a one-hot vector, it cannot fully represent the corresponding text features. Especially in LJP, the label relationships among the subtasks are complex (as shown in Appendix A.2) and there are noisy data. Inspired by Xu et al. [8], we convert hard labels represented by one-hot vectors to soft labels by adding a subtask consistency constraint. Unlike the graph-based LE algorithm [8], by adding the case representation to $L^{(0)}$, we can achieve dynamic changes in the label distribution for different cases of the same label.

$$L^{(t + 1)} = η L^{(t)} W^{(t)} M_{co}^{*} + (1 - η) L^{(0)} \tag{16}$$

$$L^{(0)} = Y + H M_{co} \tag{17}$$

where $Y \in \{0, 1\}^{| L |}$ is the hard truth label of the training dataset and $H = [H^{a}, H^{c}, H^{t}] \in \mathbb{R}^{| L |}$ is the case representation of the subtasks. $η \in (0, 1)$ is a balancing parameter that controls the fraction of the information inherited from the label propagation. $W^{(t)} \in \mathbb{R}^{| L | \times | L |}$ and $b^{(t)} \in \mathbb{R}^{| L |}$ are the trainable parameters and biases, respectively. Finally, $L^{(t + 1)}$ will converge to L, which is then viewed as the soft target to supervise the model training.
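The propagation in Equations (16) and (17) can be sketched on a three-label toy graph. The sketch makes several assumptions: the trainable $W^{(t)}$ is replaced by the identity, the bias is dropped, the toy graph is symmetric so that the degree normalization is well defined, and the function name `enhance_labels` is invented for this example.

```python
import numpy as np

def enhance_labels(Y, H, M_co, M_co_norm, eta=0.5, n_iter=50):
    """Iterate Equation (16) starting from the L^(0) of Equation (17)."""
    L0 = Y + H @ M_co                     # Equation (17)
    W = np.eye(len(Y))                    # assumption: identity instead of trained W^(t)
    L = L0
    for _ in range(n_iter):
        L = eta * (L @ W @ M_co_norm) + (1 - eta) * L0    # Equation (16)
    return L

M_co = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
deg = M_co.sum(axis=1)
M_co_norm = M_co / np.sqrt(np.outer(deg, deg))            # P^-1/2 M P^-1/2
Y = np.array([1.0, 0.0, 0.0])                             # one-hot truth label
H = np.array([0.2, 0.1, 0.0])                             # toy case representation
L = enhance_labels(Y, H, M_co, M_co_norm)
```

Under these assumptions the iteration is a contraction, so it converges to the closed form $(1 - η) L^{(0)} (I - η M_{co}^{*})^{-1}$, and the resulting soft label assigns non-zero mass to the labels connected to the true one.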

4.6. Prediction and Training

Here, LJP has three subtasks. For ease of description, we use $i = 1, 2, 3$ to index them, so that $H^{a}, H^{c}, H^{t}$ can be abbreviated to $H^{i}$. We input the case representation $H^{i}$ into the softmax classifier to obtain the prediction for each subtask.

$$\hat{y}^{i} = \mathrm{softmax}(H^{i}) \tag{18}$$

Algorithm 2 Optimization Algorithm.

Require:
Training set $Γ = \{(X_{i}, Y_{i})\}_{i = 1}^{N}$.
Crime-keywords dictionary $E_{D}$.
Charge similarity graph $G_{s}$.
Task consistency graph $G_{co}$.
T = 10;
t = 0;
% pretrain;
while $t < T$ and not converged do
    for each batch in $Γ$ do
        Compute $\hat{y}_{i}(x_{i}, θ, E_{D}, G_{s})$ using Equation (18);
        $θ = \arg\min_{θ} L_{p}(\hat{y}_{i}, y_{i})$, as in Equation (19);
        t++;
    end for
end while
% add subtask consistency constraint;
while $t < \mathrm{Epoch}$ do
    for each batch in $Γ$ do
        Compute soft targets $L(x_{i}, θ, E_{D}, G_{s}, G_{co})$ using Equation (16) and Equation (17);
        Compute $\hat{y}_{i}(x_{i}, θ, E_{D}, G_{s})$ using Equation (18);
        $θ = \arg\min_{θ} L_{p}(\hat{y}_{i}, l_{i})$, as in Equation (21);
        t++;
    end for
end while

For training, the loss function is divided into two parts: the prediction loss, calculated on the one-hot representation Y, and the consistency loss, calculated on the soft label L. Specifically, we calculate the cross-entropy loss for each subtask and take the sum of the losses over all subtasks as the corresponding loss.

$$L_{p} = - \sum_{i = 1}^{3} \sum_{j = 1}^{| Y^{i} |} y_{j}^{i} \log(\hat{y}_{j}^{i}) \tag{19}$$

$$L_{con} = - τ \sum_{i = 1}^{3} \sum_{j = 1}^{| Y^{i} |} l_{j}^{i} \log(\hat{y}_{j}^{i}) \tag{20}$$

$$L = L_{p} + L_{con} \tag{21}$$

where $| Y^{i} |$ denotes the number of labels of subtask i, $y^{i} \in \{0, 1\}^{| Y^{i} |}$ is the one-hot representation of the truth label, $l^{i} \in (0, 1)^{| Y^{i} |}$ is the soft label representation obtained in Section 4.5, and $τ$ is a predefined tunable parameter. The training process of the CMLEA is divided into two stages and the procedure is shown in Algorithm 2.
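Equations (19) to (21) can be sketched for a single case as follows. The function names, the small `eps` added inside the logarithm for numerical safety, and the toy vectors are assumptions of this sketch.

```python
import numpy as np

def cross_entropy(target, probs, eps=1e-12):
    # eps guards against log(0); it is not part of Equations (19) and (20)
    return -np.sum(target * np.log(probs + eps))

def total_loss(hard, soft, probs, tau=0.01):
    """hard, soft, probs: one vector per subtask (article, charge, term)."""
    l_p = sum(cross_entropy(y, q) for y, q in zip(hard, probs))           # Equation (19)
    l_con = tau * sum(cross_entropy(s, q) for s, q in zip(soft, probs))   # Equation (20)
    return l_p + l_con                                                    # Equation (21)

# Toy example: two classes per subtask and a uniform prediction
hard = [np.array([1.0, 0.0])] * 3
soft = [np.array([0.9, 0.1])] * 3
probs = [np.array([0.5, 0.5])] * 3
loss = total_loss(hard, soft, probs)
```

With the uniform prediction, each subtask contributes log 2 to the prediction loss and τ · log 2 to the consistency loss, so the total is 3 × 1.01 × log 2.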

5. Experiments

In this section, we conduct experiments on two benchmark LJP datasets to show the effectiveness of our approach and present a detailed analysis.

5.1. Datasets

We conducted experiments on the publicly available dataset CAIL2018, which consists of CAIL-Small and CAIL-Big [38]. Existing studies adopted different data preprocessing methods, leading to varying results. To the best of our knowledge, NeurJudge [22] and R-former [7] are the latest LJP methods. To conduct the performance comparison, we followed the data preprocessing pipeline of Yue et al. [22] and ran the open source code of R-former on the same dataset. The corresponding data preprocessing process is as follows: (1) Exclude the case samples corresponding to law articles and charges with a frequency of no greater than 100. (2) Divide the terms of penalty into non-overlapping intervals. (3) Filter out complex samples with multiple applicable articles and charges. The detailed statistics of the datasets are shown in Table 2.

5.2. Baseline Methods

SVM+word2vec: uses SVM [39] to classify the text represented by word2vec [36].
FLA [18]: a neural network with attention mechanisms for capturing the interaction between the factual description and applicable articles.
TopJudge [2]: a topological multi-task learning model that incorporates the DAG dependencies among multiple subtasks into LJP.
Few-Shot [14]: an attribute-attentive charge prediction model that can predict few-shot charges and alleviate confusing charge issues.
LADAN [21]: an attention-based model that employs a graph neural network to learn the distinction between confusing legal articles and further distinguishes between confusing charges.
NeurJudge+ [22]: incorporates the semantics of established articles into facts to help divide the factual description into various subtasks for LJP.
R-former [7]: utilizes a masked transformer network to obtain case-discriminative representations, and achieves local consistency of each node’s label distribution through relational learning.

5.3. Experimental Setup and Evaluation Metrics

For CNN- or RNN-based methods, we first used THULAC (https://github.com/thunlp/THULAC (accessed on 30 November 2021)) for word segmentation because the factual descriptions are written in Chinese without spaces, and then utilized word2vec [36] to learn 200-dimensional vector representations of words. We set the maximum document length to 350 and all hidden sizes to 150. For BERT-based methods, we used a pretrained Chinese model [32] and set the maximum document length to 512 tokens. For the proposed method, when constructing the crime-keywords dictionary, we exploited the word embeddings (https://ai.tencent.com/ailab/nlp/en/download.html (accessed on 24 December 2021)) trained by Tencent AI Lab, in which the embedding size of each word is 200. We set the variable $k$ to 20, the threshold $\mu$ to 0.8, the weights $\lambda$ and $\eta$ to 0.5, and $\tau$ to 0.01. During training, we utilized the Adam optimizer [40] with a learning rate of $10^{-5}$. We implemented the proposed model in PyTorch [41] and trained it on a server (the system environment is described in Appendix A.1). Moreover, we employed mini-batch training to reduce the memory overhead and trained each model for 16 epochs. For R-former, we ran the pretraining of $\theta$ for 10 epochs and the E-step for 6 epochs.
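For reproducibility, the settings reported above can be collected in a single configuration object. The sketch below only restates the values given in this section; the class and field names are hypothetical, and settings that the text does not report (such as the batch size) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters as reported in Section 5.3 (field names are illustrative)."""

    word_embedding_dim: int = 200   # word2vec / Tencent AI Lab embeddings
    max_doc_len_cnn_rnn: int = 350  # CNN- and RNN-based methods
    hidden_size: int = 150          # all hidden sizes
    max_doc_len_bert: int = 512     # BERT-based methods, in tokens
    k: int = 20                     # variable k
    mu: float = 0.8                 # threshold mu
    lam: float = 0.5                # weight lambda
    eta: float = 0.5                # weight eta
    tau: float = 0.01               # weight of L_con in Equation (20)
    learning_rate: float = 1e-5     # Adam optimizer
    epochs: int = 16                # epochs per model


config = ExperimentConfig()
print(config)
```

Freezing the dataclass prevents accidental changes to a setting between runs that are meant to be compared.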

Finally, we chose four metrics typically used in document classification tasks [42], i.e., Accuracy (ACC.), Macro Precision (MP), Macro Recall (MR), and Macro F1 (MF1), to evaluate the performance of the model. Since both CAIL-Small and CAIL-Big are imbalanced datasets, we mainly used MF1 for comparison with other methods.
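The four metrics can be sketched in pure Python as follows. This illustration assumes that MF1 is the unweighted mean of the per-class F1 scores (another convention takes the harmonic mean of MP and MR); the function name and the toy labels are not from the paper.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def classification_metrics(y_true: Sequence[Hashable],
                           y_pred: Sequence[Hashable]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "ACC": sum(tp.values()) / len(y_true),
        "MP": sum(precisions) / n,
        "MR": sum(recalls) / n,
        "MF1": sum(f1s) / n,
    }


# A majority-class predictor on imbalanced toy labels.
truth = ["theft"] * 9 + ["fraud"]
guess = ["theft"] * 10
print(classification_metrics(truth, guess))
```

In the toy example, always predicting the majority class yields a high accuracy but a much lower MF1, which is why MF1 is the more informative metric on imbalanced datasets.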

5.4. Experimental Results

Comparison against baselines. We first compared the CMLEA with the baselines; the results are shown in Table 3 and Table 4. Our method achieved the best results on nearly all metrics of the three subtasks; the only exception in Table 3 is the MP for law articles, where R-former is marginally higher (89.99% vs. 89.96%). Specifically, we observed the following: (1) SVM+word2vec did not perform as well as the deep learning-based models, which may be due to the limitations of representing text with static word vector features. (2) FLA and TopJudge both utilized the relationships among the subtasks. FLA achieved poor results because it is a pipeline model, which can easily lead to error propagation. TopJudge performed better than FLA because it utilized the topological relationship among the subtasks; however, because it did not utilize valuable label information, it was inferior to the other deep learning-based models. (3) Few-Shot and LADAN performed roughly equivalently, and both enriched the label information, albeit from different perspectives. However, Few-Shot did not consider the constraints among the subtasks, and LADAN treated the entire established article as label information, ignoring the gap between the established article and the factual description of the case. (4) The CMLEA outperformed the state-of-the-art methods NeurJudge+ and R-former. Specifically, in terms of the average MF1-score of the three subtasks, the CMLEA improved on NeurJudge+ by 3.89/7.92 percentage points and on R-former by 1.23/2.50 percentage points on CAIL-Small/Big, respectively. We believe that this is due to our incorporation of effective label information (i.e., crime keywords with a distillation operation) and the addition of a consistency constraint among the subtasks. (5) Furthermore, we made an interesting finding by comparing the results of the models on CAIL-Small and CAIL-Big. When predicting law articles and charges, the MF1-scores of all methods were lower on CAIL-Big than on CAIL-Small, whereas the accuracy was higher. This phenomenon appears to contradict the expectation that more data yields better model performance. However, after comparing the label frequencies of the two datasets, we found that the labels of the law articles and charges in CAIL-Big were highly imbalanced relative to those in CAIL-Small (as shown in Figure A2 and Figure A3). In contrast, the frequency distribution of the term-of-penalty labels was similar in the two datasets, and accordingly, the models performed better on CAIL-Big than on CAIL-Small.
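The reported gains on CAIL-Small can be checked directly against the MF1 columns of Table 3. The short sketch below averages the three subtask MF1-scores per method and subtracts the averages; the helper names are illustrative, and the CAIL-Big figures are not recomputed here.

```python
# MF1 (%) on CAIL-Small for law articles, charges, and term of penalty (Table 3).
MF1_CAIL_SMALL = {
    "NeurJudge+": (86.13, 86.96, 37.27),
    "R-former": (88.62, 90.88, 38.82),
    "CMLEA": (89.58, 91.71, 40.73),
}


def average(values):
    """Unweighted mean over the three subtasks."""
    return sum(values) / len(values)


def gain_over(baseline, table=MF1_CAIL_SMALL, ours="CMLEA"):
    """Difference of average MF1-scores, in percentage points."""
    return round(average(table[ours]) - average(table[baseline]), 2)


for name in ("NeurJudge+", "R-former"):
    print(name, gain_over(name))
```

The differences come out at 3.89 and 1.23, which matches the CAIL-Small figures quoted in the text and shows that they are absolute differences in percentage points rather than relative improvements.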

Comparison against BERT-based methods. We compared the CMLEA with several methods based on the BERT encoder and its variants, as shown in Table 5. BERT and BERT-Crime [43] performed well on the LJP tasks but worse than the CMLEA, which further validates the effectiveness of our proposed model. In addition, NeurJudge+ with BERT as the encoder outperformed NeurJudge+ with GRU as the encoder, which verifies that BERT can facilitate downstream tasks.

Ablation analysis of CMLEA. In addition, we analyzed the impact of the crime-keywords fusion module and the consistency constraint module by comparing the CMLEA with its variants on CAIL-Small. We ablated individual modules of the CMLEA to confirm their effectiveness (the performance is illustrated in Figure 3). Compared to the CMLEA, -kw removed the crime-keywords fusion module, so the subsequent modules operated directly on the factual description representation $h^{d}$. -con removed the subtask consistency constraint and the loss $L_{con}$ calculated on the soft label representations, so the three subtasks were predicted from $H^{a}$, $H^{c}$, and $H^{t}$, respectively. We observed that both -kw and -con performed worse than the CMLEA, indicating that both modules contribute to its performance. Furthermore, we replaced the subtask constraint of the CMLEA with the relational learning structure of R-former, denoted by -con+rl, which improved on -con but still performed slightly worse than the CMLEA.

5.5. Case Study

We performed a qualitative analysis of the CMLEA. The crime-keywords fusion module adopted by the CMLEA is divided into two parts: the distillation operation, which optimizes the keyword representations in the crime-keywords dictionary, and the cross-attention module, which ensures that the keywords interact with the factual description. To see intuitively how the vectors differ after these operations, we visualized the features in a two-dimensional space.

Specifically, we selected several crimes that are easy to confuse: “fraud”, “loan fraud”, “fraudulent loan, bill acceptance, financial documents”, “fundraising fraud”, and “illegally absorbing public deposits”. We used heat maps to show the changes in their correlations before and after the distillation operation, as shown in Figure 4: the left image is before the distillation operation and the right image is after it. After the operation, the correlations between the charges decreased and the distances between them became larger, which indicates that the procedure was effective.
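A before-and-after comparison of this kind can be sketched by computing a pairwise similarity matrix over the charge representations and summarizing its off-diagonal entries. The sketch assumes cosine similarity as the correlation measure, which this section does not specify, and the three-dimensional vectors are made up purely for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise similarities; this is the matrix a heat map would display."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def mean_off_diagonal(matrix: Sequence[Sequence[float]]) -> float:
    """Average similarity between different charges."""
    n = len(matrix)
    values = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(values) / len(values) if values else 0.0


# Made-up charge vectors: strongly overlapping before, well separated after.
before = [[1.0, 0.9, 0.8], [0.9, 1.0, 0.7], [0.8, 0.9, 1.0]]
after = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.2], [0.0, 0.2, 1.0]]
print(mean_off_diagonal(similarity_matrix(before)))
print(mean_off_diagonal(similarity_matrix(after)))
```

A lower mean off-diagonal value after the operation corresponds to the lighter off-diagonal cells that a heat map would show.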

Next, we visualized the changes in the factual description representations using t-SNE [44], as illustrated in Figure 5. The factual descriptions of the five crimes of “fraud”, “loan fraud”, “fraudulent loan, bill acceptance, financial documents”, “fundraising fraud”, and “illegally absorbing public deposits” are denoted by red, dark blue, green, yellow, and blue, respectively. The left-hand panel shows the word embeddings of the factual descriptions, which are easily confused. The middle and right-hand panels show the representations of these cases after CMLEA processing. Dark blue (loan fraud) and yellow (fundraising fraud) can be distinguished, although they remain close to other colors and the boundaries are not clear. Their corresponding keywords in the crime-keywords dictionary (Table A1) show that many of their keywords overlap. Furthermore, both have a small number of training samples. We speculate that, after the distillation operation, these factors left too few remaining keywords to represent the charge label adequately through the cross-attention mechanism.

In addition, we analyzed the role of the consistency constraint module. Specifically, we chose two representative cases to compare the prediction results before and after adding this module, as shown in Table 6. In the absence of the consistency constraint, the model correctly predicted the relevant article for both examples but incorrectly predicted the corresponding charge. After the constraint module was added, the behavior matched our expectation: correctly predicted law articles are associated with their corresponding crimes, which increases the probability of predicting the proper charge and term of penalty, and vice versa.

6. Ethical Discussion

Although LJP has the potential to improve efficiency and fairness in criminal justice, it is also a sensitive area and some ethical issues inevitably arise [45].

The first dimension to consider is data ethics [46]. We used the public dataset CAIL2018 collected from an official government website (http://wenshu.court.gov.cn/ (accessed on 16 October 2018)). The private information (e.g., name, ID number, plate number, etc.) of all cases was anonymized before the case details were published on the website. In addition, the purpose of the government’s disclosure of judgment documents is to promote judicial justice and enhance judicial credibility. Correspondingly, models that can learn from existing cases can help to maintain fairness.

The second is the robustness of the model. Models inevitably have drawbacks. For example, models tend to be biased toward classes that have large amounts of data and are easy to learn due to imbalanced data distribution and varying levels of difficulty. As shown in Table 3 and Table 4, the ACC-score and MF1-score for the subtask of the term-of-penalty prediction were only 43.45%/60.65% and 40.73%/47.44% on CAIL-Small/Big, respectively. This indicates that the results achieved by these methods are not reliable, especially in the legal field, where a legal judgment mistake could have serious consequences, for instance, a person remaining in prison as a convicted criminal or being acquitted as a free man. Therefore, the predictions made by the models cannot be directly adopted.

In summary, we should note that the proposed method is not intended to replace offline criminal proceedings, nor is it intended to replace the independent judgments of judicial officers. It is an aid for legal professionals, i.e., it helps judges to locate the applicable articles quickly and plays an important role in upholding the principle of “treating like cases alike”. The predictions made by these models should be used as references, and the actual decision-making process still needs to be conducted by the professionals themselves. Therefore, when it is used only in this assistive role, utilizing non-human intelligence in matters concerning the legal rights and obligations of human beings does not violate any ethical codes.

7. Conclusions

In this paper, we propose the CMLEA, an approach for LJP based on a cross-attention mechanism and a label-enhancement algorithm. We construct a crime-keywords dictionary as the label information and design a novel cross-attention algorithm fused with a distillation operation on a charge similarity graph to help locate the key elements in factual descriptions and discover the case-related words in the dictionary. Furthermore, we use the label-enhancement algorithm to add a consistency constraint to the subtasks to improve the prediction results. Our experimental results show that the CMLEA outperforms state-of-the-art LJP models: its average MF1-score over the three subtasks exceeds those of two state-of-the-art methods by 3.89/7.92 and 1.23/2.50 percentage points on the CAIL-Small/Big datasets, respectively.

However, we observed that the CMLEA performed poorly in the subtask of term-of-penalty prediction, with an MF1-score of only 40.73%/47.44%, whereas the MF1-scores of the other two subtasks were 89.58%/86.10% and 91.71%/89.99%, respectively. Meanwhile, we found that the CMLEA had some limitations in dealing with small sample sizes, that is, the cross-attention mechanism did not perform well in these cases and the frequency information of different labels was not taken into account. In addition, the interpretability of the prediction results was insufficient and needs improvement. In the future, we plan to explore (1) how to integrate numerical information from established articles and factual descriptions to improve the performance of the term-of-penalty prediction; (2) techniques for optimizing the fusion of label information to address the poor performance of LJP on labels with small sample sizes; and (3) methods to improve the interpretability of LJP prediction results.

Author Contributions

Conceptualization, J.C.; Data curation, X.Z. (Xuanqing Zhang); Formal analysis, J.C.; Funding acquisition, Q.Z.; Investigation, X.Z. (Xuanqing Zhang); Methodology, J.C.; Project administration, X.Z. (Xiabing Zhou); Resources, Q.Z.; Software, X.Z. (Xuanqing Zhang) and J.C.; Supervision, Y.H. and Q.Z.; Validation, X.Z. (Xuanqing Zhang); Visualization, J.C.; Writing—original draft preparation, J.C.; Writing—review and editing, Y.H. and X.Z. (Xiabing Zhou). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (grant no. 62176174).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used to support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Running Environment

We ran the models on an Ubuntu 22.04 system with one GPU (an NVIDIA Tesla V100) and 32 GB of memory. The CPU was an Intel(R) Xeon(R) Silver 4216 @ 2.1 GHz, with 16 cores and 3200 MHz. Additionally, we used version 1.10.1 of PyTorch and version 3.8.12 of Python.

Appendix A.2. Subtask Consistency Constraint and Data Statistics

Figure A1. Consistency constraint of subtasks’ labels.

Figure A2. Frequency statistics of three subtasks’ labels in CAIL-Small.

Figure A3. Frequency statistics of three subtasks’ labels in CAIL-Big.

Appendix A.3. Crime-Keywords Dictionary

Table A1. Examples of established articles and corresponding keywords.

Content of Established Articles	Crime-Keywords
Law Article 338—Crime of Polluting the Environment: Any person who, in violation of state regulations, discharges, dumps, or disposes of radioactive waste, waste containing infectious disease pathogens, toxic substances, or other harmful substances that seriously pollute the environment shall be sentenced to a fixed-term sentence of not more than three years imprisonment, criminal detention, or a fine. If the circumstances are serious, the sentence shall be fixed-term imprisonment of not less than three years but not more than seven years, and a fine shall be imposed. Under any of the following circumstances, the sentence shall be fixed-term imprisonment of not less than seven years and a fine.	Discharge, Wastewater, Chromium, Monitoring, Zinc, PH, Concentration, Environment, Pollutant, Environmental Protection Agency, Waste, Total Chromium, Department of Environment Protection, Standard, Exceeding Standard, Content, Monitoring Station
Law Article 266—Crime of Fraud: For defrauding public or private property, if the amount is relatively large, the sentence shall be fixed-term imprisonment of not more than three years, criminal detention, or public surveillance, and a fine. If the amount involved is large or there are other serious circumstances, the sentence shall be three years imprisonment. If the amount involved is substantially large or other extremely serious circumstances exist, the sentence shall be fixed-term imprisonment of not less than ten years or life imprisonment and a fine or confiscation of property. Where there are other provisions in this law, the provisions shall prevail.	Trust, Fake, Borrow Money, Bank Card, Cheat, Clothing Fee, Stolen Money, Tipping for Desk Fees, Hide, Fabricate, For the Reason, Squander, Take Possession of, Amount of Money, Real Situation, Lie, Mortgage, Bank, Defraud, Property
Law Article 193—Crime of Loan Fraud: Under any of the following circumstances, for the purpose of illegal possession or defrauding loans from banks or other financial institutions, if the amount is relatively large, the sentence shall be fixed-term imprisonment of not more than five years or criminal detention and a fine of not less than CNY 20,000 but not more than CNY 200,000. If the amount is substantially large or there are other extremely serious circumstances, the sentence shall be fixed-term imprisonment of not less than five years and not more than ten years and a fine of not less than CNY 50,000 but not more than CNY 500,000, or life imprisonment and a fine of not less than CNY 50,000 but not more than CNY 500,000.	Finance, Credit Union, Principal and Interest, Student, Borrow Money, Repayment, Loan, Credit, Take Possession of, Associated Agency, Fraud, Fake, Sub-branch, Bank, Mortgage, Cheat, Repay, Principal Money, Interest, Installments
Law Article 175—Crime of Fraudulent Loan, Bill Acceptance, Financial Documents: Any person who obtains loans, bill acceptances, letters of credit, letters of guarantee, etc., from banks or other financial institutions through fraudulent means and causes heavy losses to banks or other financial institutions shall be sentenced to fixed-term imprisonment of not more than three years or criminal detention and shall also be fined. Any person who causes substantially heavy losses to banks or other financial institutions or there are other extremely serious circumstances shall be sentenced to fixed-term imprisonment of not less than three years but not more than seven years and shall also be fined.	Finance, Credit Union, Company, Borrow Money, Loan, Acceptance Bill, Credit, Cheat, Fake, Associated Agency, Sub-branch, Bank, Maturity, Guarantee, Fraud, Principal Money, Interest, Exchange Bill, Repay
Law Article 192—Crime of Fundraising Fraud: Any person who illegally raises funds through fraudulent means for the purpose of illegal possession, if the amount is relatively large, shall be sentenced to fixed-term imprisonment of not less than three years but not more than seven years and shall also be fined. If the amount is substantially large or there are other extremely serious circumstances, the sentence shall be fixed-term imprisonment of not less than seven years or life imprisonment, a fine, or confiscation of property. Units that commit crimes in the preceding paragraph shall be fined, and the persons directly in charge and other directly responsible personnel shall be punished in accordance with the provisions of the preceding paragraph.	Borrow Money, Fundraising, Funds, High Amount of Money, Absorption, High Interest, Take Possession of, Amount, Monthly Interest, Fraud, Illegal, Cheat, Public, Investment, Repay, Principal Money, Interest, Bait, Raise Funds
Law Article 176—Crime of Illegally Absorbing Public Deposits: Any person who illegally absorbs public deposits or absorbs public deposits in a disguised form, thereby disturbing the financial order, shall be sentenced to fixed-term imprisonment of not more than three years or criminal detention and shall also be sentenced to a fine. If the amount is large or there are other serious circumstances, the sentence shall be fixed-term imprisonment of not less than three years but not more than ten years and a fine. If the amount is substantially large or there are other extremely serious circumstances, the sentence shall be fixed-term imprisonment of not less than ten years and a fine.	Finance, Member, Borrow Money, Disruption, Fundraising, Investors, Biao Hui, Funds, High Amount of Money, Absorption, Membership, Monthly Interest, Public, Repay, Investment, Bank Savings, Principal Money, Interest, Bait, Meeting Day

References

1. Medvedeva, M.; Wieling, M.; Vols, M. Rethinking the field of automatic prediction of court decisions. Artif. Intell. Law 2023, 31, 195–212.
2. Zhong, H.; Guo, Z.; Tu, C.; Xiao, C.; Liu, Z.; Sun, M. Legal Judgment Prediction via Topological Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 3540–3549.
3. Zhong, H.; Zhou, J.; Qu, W.; Long, Y.; Gu, Y. An Element-aware Multi-representation Model for Law Article Prediction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6663–6668.
4. Meng, Y.; Zhang, Y.; Huang, J.; Xiong, C.; Ji, H.; Zhang, C.; Han, J. Text Classification Using Label Names Only: A Language Model Self-Training Approach. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 9006–9017.
5. Chen, J.; Du, L.; Liu, M.; Zhou, X. Mulan: A Multiple Residual Article-Wise Attention Network for Legal Judgment Prediction. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 21, 1–15.
6. Xiong, C.; Zhong, V.; Socher, R. Dynamic coattention networks for question answering. In Proceedings of the 5th International Conference on Learning Representations (ICLR ’17), Toulon, France, 24–26 April 2017; pp. 15–28.
7. Dong, Q.; Niu, S. Legal Judgment Prediction via Relational Learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Online, 11–15 July 2021; pp. 983–992.
8. Xu, N.; Liu, Y.P.; Geng, X. Label enhancement for label distribution learning. IEEE Trans. Knowl. Data Eng. 2019, 33, 1632–1643.
9. Kort, F. Predicting Supreme Court decisions mathematically: A quantitative analysis of the “right to counsel” cases. Am. Political Sci. Rev. 1957, 51, 1–12.
10. Chen, H.; Cai, D.; Dai, W.; Dai, Z.; Ding, Y. Charge-Based Prison Term Prediction with Deep Gating Network. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 6362–6367.
11. Pan, S.; Lu, T.; Gu, N.; Zhang, H.; Xu, C. Charge Prediction for Multi-defendant Cases with Multi-scale Attention. In Proceedings of the CCF Conference on Computer Supported Cooperative Work and Social Computing, Kunming, China, 16–18 August 2019; Springer: Singapore, 2019; pp. 766–777.
12. Wang, P.; Yang, Z.; Niu, S.; Zhang, Y.; Zhang, L.; Niu, S. Modeling Dynamic Pairwise Attention for Crime Classification over Legal Articles. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 485–494.
13. Li, S.; Liu, B.; Ye, L.; Zhang, H.; Fang, B. Element-Aware Legal Judgment Prediction for Criminal Cases with Confusing Charges. In Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), Portland, OR, USA, 4–6 November 2019; pp. 660–667.
14. Hu, Z.; Li, X.; Tu, C.; Liu, Z.; Sun, M. Few-Shot Charge Prediction with Discriminative Legal Attributes. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 487–498.
15. Guo, J.; Liu, Z.; Yu, Z.; Huang, Y.; Xiang, Y. Few Shot and Confusing Charges Prediction with the Auxiliary Sentences of Case. J. Softw. 2021, 32, 3139–3150.
16. Zhong, H.; Wang, Y.; Tu, C.; Zhang, T.; Liu, Z.; Sun, M. Iteratively Questioning and Answering for Interpretable Legal Judgment Prediction. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 1250–1257.
17. Gan, L.; Kuang, K.; Yang, Y.; Wu, F. Judgment Prediction via Injecting Legal Knowledge into Neural Networks. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; pp. 12866–12874.
18. Luo, B.; Feng, Y.; Xu, J.; Zhang, X.; Zhao, D. Learning to Predict Charges for Criminal Cases with Legal Basis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 2727–2736.
19. Bao, Q.; Zan, H.; Gong, P.; Chen, J.; Xiao, Y. Charge prediction with legal attention. In Proceedings of the 8th CCF International Conference on Natural Language Processing and Chinese Computing, Dunhuang, China, 9–14 October 2019; pp. 447–458.
20. Yang, W.; Jia, W.; Zhou, X.; Luo, Y. Legal Judgment Prediction via Multi-Perspective Bi-Feedback Network. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 4085–4091.
21. Xu, N.; Wang, P.; Chen, L.; Pan, L.; Wang, X.; Zhao, J. Distinguish Confusing Law Articles for Legal Judgment Prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3086–3095.
22. Yue, L.; Liu, Q.; Jin, B.; Wu, H.; Zhang, K.; An, Y.; Cheng, M.; Yin, B.; Wu, D. NeurJudge: A Circumstance-aware Neural Framework for Legal Judgment Prediction. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Online, 11–15 July 2021; pp. 973–982.
23. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent Models of Visual Attention. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; MIT Press: Cambridge, MA, USA, 2014; pp. 2204–2212.
24. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
25. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489.
26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 3–9 December 2017; pp. 5999–6009.
27. Wu, Y.; Zhu, L.; Yan, Y.; Yang, Y. Dual Attention Matching for Audio-Visual Event Localization. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6291–6299.
28. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2818–2826.
29. Geng, X. Label distribution learning. IEEE Trans. Knowl. Data Eng. 2016, 28, 1734–1748.
30. Gao, Y.; Zhang, Y.; Geng, X. Label Enhancement for Label Distribution Learning via Prior Knowledge. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, Online, 7–15 January 2021; pp. 3223–3229.
31. Guo, B.; Han, S.; Han, X.; Huang, H.; Lu, T. Label confusion learning to enhance text classification models. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; pp. 12929–12936.
32. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3504–3514.
33. Campos, R.; Mangaravite, V.; Pasquali, A.; Jorge, A.; Nunes, C.; Jatowt, A. YAKE! Keyword extraction from single documents using multiple local features. Inf. Sci. 2020, 509, 257–289.
34. Liu, X.; Yin, D.; Feng, Y.; Wu, Y.; Zhao, D. Everything Has a Cause: Leveraging Causal Inference in Legal Text Analysis. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 1928–1941.
35. Jones, K.S. A statistical interpretation of term specificity and its application in retrieval. J. Doc. 1972, 28, 11–21.
36. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26, 3111–3119.
37. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR ’17), Toulon, France, 24–26 April 2017; pp. 1–14.
38. Xiao, C.; Zhong, H.; Guo, Z.; Tu, C.; Liu, Z.; Sun, M.; Feng, Y.; Han, X.; Hu, Z.; Wang, H.; et al. CAIL2018: A large-scale legal dataset for judgment prediction. arXiv 2018, arXiv:1807.02478.
39. Suykens, J.A.; Vandewalle, J. Least squares support vector machine classifiers. Neural Process. Lett. 1999, 9, 293–300.
40. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR ’15), San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
41. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 8026–8037.
42. Zhou, Z.H. Model Selection and Evaluation. In Machine Learning; Springer: Singapore, 2021; pp. 25–55.
43. Zhong, H.; Zhang, Z.; Liu, Z.; Sun, M. Open Chinese Language Pre-trained Model Zoo. Technical Report. 2019. Available online: https://github.com/thunlp/openclap (accessed on 2 March 2023).
44. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
45. Tsarapatsanis, D.; Aletras, N. On the Ethical Limits of Natural Language Processing on Legal Text. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 3590–3599.
46. Leins, K.; Lau, J.H.; Baldwin, T. Give Me Convenience and Give Her Death: Who Should Decide What Uses of NLP are Appropriate, and on What Basis? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 2908–2913.

Figure 1. An illustration of LJP, which predicts the applicable article, charge, and term of penalty according to the factual description of the case and established articles. There are complex dependencies among the three subtasks.

Figure 2. The framework of the proposed approach. In the similarity graph and constraint graph, the blue nodes represent the charge labels, the green and purple nodes represent the applicable article labels and terms-of-penalty labels, respectively.

Figure 3. Results of CMLEA and its variants on CAIL-Small.

Figure 4. Correlation heat map of the five charges. The darker the color, the stronger the correlation.

Figure 5. t-SNE visualizations of factual descriptions. Different colors represent factual descriptions related to different crimes.

Table 1. Main mathematical notations.

Notation	Description
$X^{d} = \{x_{1}^{d}, \dots, x_{l}^{d}\}$	a word sequence of the factual description of the case
$Y^{a} = \{a_{1}, \dots, a_{m}\}$	the set of applicable article labels
$Y^{c} = \{c_{1}, \dots, c_{n}\}$	the set of charge labels
$Y^{t} = \{t_{1}, \dots, t_{p}\}$	the set of term-of-penalty labels
$D = \{d_{i}\}_{i = 1}^{n}, \; d_{i} = \{kw_{1}, \dots, kw_{k}\}$	the dictionary of crime keywords

Table 2. Statistics of datasets.

Dataset	CAIL-Small	CAIL-Big
#Training Set Cases	108,619	1,593,982
#Test Set Cases	26,120	185,721
#Validation Set Cases	13,738	-
#Law Articles	99	118
#Charges	115	129
#Term of Penalty	11	11

Table 3. Comparison of results on CAIL-Small. The best results for the main evaluation indicator are in bold.

Methods	Law Articles (%)				Charges (%)				Term of Penalty (%)
Methods	ACC.	MP	MR	MF1	ACC.	MP	MR	MF1	ACC.	MP	MR	MF1
SVM+word2vec	84.17	80.74	75.96	77.09	83.37	80.78	77.30	78.25	33.00	25.56	25.11	22.50
FLA	85.63	83.46	73.83	74.92	84.72	83.71	73.75	75.04	35.04	33.91	27.14	24.79
TopJudge	87.28	85.81	76.25	78.24	86.48	84.23	78.39	80.15	38.43	35.67	32.15	31.31
Few-Shot	88.44	86.76	77.93	79.51	88.15	87.51	80.57	81.98	39.62	37.13	30.93	31.61
LADAN	88.78	85.15	79.45	80.97	88.28	86.36	80.54	82.11	38.13	34.04	31.22	30.20
NeurJudge+	90.37	87.22	85.82	86.13	89.92	87.76	86.75	86.96	41.65	40.44	37.20	37.27
R-former	92.55	89.99	88.18	88.62	92.87	91.07	90.92	90.88	42.94	41.15	38.97	38.82
CMLEA	93.19	89.96	89.37	89.58	93.40	91.59	91.99	91.71	43.45	42.23	40.11	40.73
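
The four columns reported per subtask (ACC., MP, MR, MF1) can be illustrated with a short sketch. It assumes the standard definitions: accuracy over all cases, and macro-averaged precision, recall, and F1 computed per class and then averaged with equal class weight. The evaluation code used for the tables is not given here, so the function name `macro_metrics` and the choice to average over every label seen in either list are assumptions made for illustration.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro-precision, macro-recall, macro-F1) as fractions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Assumption: average over every label that occurs in either list.
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with three classes and one misclassified case.
acc, mp, mr, mf1 = macro_metrics(["a", "a", "b", "c"], ["a", "b", "b", "c"])
print(f"ACC={acc:.4f} MP={mp:.4f} MR={mr:.4f} MF1={mf1:.4f}")
```

In the toy example, MF1 (the mean of per-class F1 scores) differs from the harmonic mean of MP and MR. The reported values appear consistent with the per-class averaging assumed here: for instance, CMLEA's law-article MP of 89.96 and MR of 89.37 have a harmonic mean of about 89.66, whereas the reported MF1 is 89.58.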

Table 4. Comparison of results on CAIL-Big. The best results for the main evaluation indicator are in bold.

Methods	Law Articles (%)				Charges (%)				Term of Penalty (%)
Methods	ACC.	MP	MR	MF1	ACC.	MP	MR	MF1	ACC.	MP	MR	MF1
SVM+word2vec	92.62	77.92	61.03	64.29	92.09	82.26	65.28	69.06	46.73	28.98	20.92	20.91
FLA	93.51	74.94	70.40	70.70	93.01	76.56	72.75	72.94	54.29	38.39	29.34	30.85
TopJudge	93.24	74.24	71.19	70.40	93.19	79.44	75.52	75.50	53.52	44.58	30.41	30.61
Few-Shot	93.74	78.51	73.79	74.18	93.24	80.59	76.62	76.89	54.54	39.09	33.36	33.48
LADAN	93.27	75.10	72.04	71.26	93.26	81.21	77.65	77.60	53.62	41.52	37.53	36.06
NeurJudge+	95.58	82.01	77.05	78.05	95.57	85.57	78.81	80.54	57.07	47.65	40.01	41.18
R-former	97.02	86.40	81.87	82.64	97.08	90.67	86.90	87.57	59.78	48.87	45.55	45.81
CMLEA	97.39	89.04	84.62	86.10	97.46	92.14	88.86	89.99	60.65	50.66	46.40	47.44
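
The abstract summarizes Tables 3 and 4 as gains of 3.89/7.92 and 1.23/2.50 in average MF1 on CAIL-Small/Big. As a worked check, the sketch below averages the three subtask MF1 scores copied from the tables and takes the difference between CMLEA and each of the two strongest baselines. The helper names `average_mf1` and `gain_over` are illustrative, and the gains are differences in percentage points.

```python
# MF1 scores (%) for law articles, charges, and term of penalty,
# copied from Table 3 (CAIL-Small) and Table 4 (CAIL-Big).
MF1 = {
    "CAIL-Small": {
        "NeurJudge+": (86.13, 86.96, 37.27),
        "R-former": (88.62, 90.88, 38.82),
        "CMLEA": (89.58, 91.71, 40.73),
    },
    "CAIL-Big": {
        "NeurJudge+": (78.05, 80.54, 41.18),
        "R-former": (82.64, 87.57, 45.81),
        "CMLEA": (86.10, 89.99, 47.44),
    },
}


def average_mf1(scores):
    """Unweighted mean of the three subtask MF1 scores."""
    return sum(scores) / len(scores)


def gain_over(dataset, baseline, model="CMLEA"):
    """Difference in average MF1 (percentage points), rounded to two decimals."""
    table = MF1[dataset]
    return round(average_mf1(table[model]) - average_mf1(table[baseline]), 2)


for dataset in MF1:
    for baseline in ("NeurJudge+", "R-former"):
        print(f"{dataset}: CMLEA vs {baseline}: +{gain_over(dataset, baseline):.2f}")
```

Under these definitions the differences come out to 3.89 and 1.23 on CAIL-Small and 7.92 and 2.50 on CAIL-Big, so the first pair of figures in the abstract corresponds to the gain over NeurJudge+ and the second pair to the gain over R-former.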

Table 5. Comparison of results on CAIL-Small (BERT-based methods). The best results for the main evaluation indicator are in bold.

Methods	Law Articles (%)		Charges (%)		Term of Penalty (%)
Methods	ACC.	MF1	ACC.	MF1	ACC.	MF1
BERT	90.81	86.06	90.68	87.69	40.37	34.09
BERT-Crime	91.30	85.70	91.26	87.81	40.90	34.65
R-former	92.55	88.62	92.87	90.88	42.94	38.82
NeurJudge+	92.64	88.75	92.91	90.89	43.81	39.76
CMLEA	93.19	89.58	93.40	91.71	43.45	40.73

Table 6. Comparison of prediction results before and after adding the consistency constraint. ✓ indicates a correct label and ✗ indicates an incorrect label.

Factual Description of the Case	Without Consistency Constraint	With Consistency Constraint
Defendant Chen invited more than 20 relatives and neighbors to place wreaths at the Boca Chemical Plant to obstruct production because of the death of her husband, an employee of the Chemical Plant, which seriously affected the normal production and operation of the Boca Chemical Plant.	290 ✓; Crime of sabotaging production and operation ✗; More than nine months in prison ✓	290 ✓; Crime of gathering crowds to disturb social order ✓; More than nine months in prison ✓
Defendant Lin organized Fang to appeal to the State Letters and Calls Bureau, causing a large number of petitioners to gather for a long time on the sidewalk and bicycle lane opposite the reception desk of the State Letters and Calls Bureau, seriously disrupting social order.	290 ✓; Crime of picking quarrels and provoking trouble ✗; More than one year in prison ✗	290 ✓; Crime of gathering crowds to disturb social order ✓; More than three years in prison ✓
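
The inconsistency that Table 6 illustrates, a predicted charge that does not belong to the predicted article, can be made concrete with a small check. This is only a post-hoc test of whether an (article, charge) pair is compatible. It is not the label-enhancement procedure of the proposed method, which applies the constraint during training. The mapping `ARTICLE_TO_CHARGES` is hypothetical and lists only the pair that Table 6 marks as correct.

```python
# Hypothetical constraint table: article number -> charges compatible with it.
# Only the pair marked correct in Table 6 is listed; a real table would
# cover every article in the dataset.
ARTICLE_TO_CHARGES = {
    290: {"gathering crowds to disturb social order"},
}


def is_consistent(article, charge, constraints=ARTICLE_TO_CHARGES):
    """Return True if the charge is allowed under the article.

    Articles missing from the constraint table are not checked and
    are treated as consistent.
    """
    allowed = constraints.get(article)
    return allowed is None or charge in allowed


# First row of Table 6, without and with the consistency constraint.
predictions = [
    (290, "sabotaging production and operation"),
    (290, "gathering crowds to disturb social order"),
]
for article, charge in predictions:
    status = "consistent" if is_consistent(article, charge) else "inconsistent"
    print(f"Article {article} + {charge!r}: {status}")
```

A check of this kind only detects contradictions after prediction. The table's right-hand column shows the effect of building the constraint into the model, where the charge prediction changes to one compatible with the article.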

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
