Identification of Judicial Outcomes in Judgments: A Generalized Gini-PLS Approach

This paper presents and compares several text classification models that can be used to extract the outcome of a judgment from justice decisions, i.e., legal documents summarizing the different rulings made by a judge. Such models can be used to gather important statistics about cases, e.g., success rates based on specific characteristics of the parties or of the jurisdiction, and are therefore important for the development of judicial prediction as well as for the study of law enforcement in general. We propose in particular the generalized Gini-PLS, which better considers the information in the distribution tails while attenuating, as in the simple Gini-PLS, the influence exerted by outliers. Modeling the studied task as a supervised binary classification, we also introduce the LOGIT-Gini-PLS, suited to the explanation of a binary target variable. In addition, various technical aspects of the evaluated text classification approaches, which consist of combinations of representations of judgments and classification algorithms, are studied using an annotated corpus of French justice decisions.


Introduction
Judicial prediction is the ability to predict what a judge will decide on a given case. Is it possible to develop efficient predictive models to automate such predictions? This question has long been driving several initiatives at the crossroads of Artificial Intelligence and Law, in particular through the development of predictive models based on the alignment of computable features of the case that were available to the judge prior to the judgment with computable features of the judge's decision on the case. In this line of work, this paper presents a study towards the development of such predictive models, taking advantage of Machine Learning and Natural Language Processing techniques. Legal vocabulary being notoriously ambiguous, we first detail important concepts that will be used thereafter.
A case begins with a complaint requesting a remedy for harm suffered due to the wrongdoing of the defendant. The features of the case are the circumstances existing prior to the filing of the complaint, that is, a set of facts sufficient to justify a right to file a complaint.
A claim is a request made by a plaintiff against a defendant, seeking legal remedy. Claims can be grouped into different categories, depending on the rule applicable and the type of remedy sought (e.g., injunctive relief, cease and desist order, damages).
A judgment summarizes the different rulings made by a judge about a certain case into a document. Judgments therefore contain many features that can be extracted (e.g., type of court, name of the parties, claims made by the parties, judges' decisions on the claims). A complaint can contain many different claims, seeking different types of remedy. Therefore, in general, a judgment concerns different types of claims.
The decision is a ruling made on a particular claim. We further consider that the judge's decision on a claim is either to accept it or to reject it. Note that a judgment must be distinguished from the judge's decision on a specific claim.
In recent years, the methodology of judicial prediction has been almost exclusively based on neural networks, which may be seen as the most flexible models for the classification and prediction of legal decisions when large datasets are available. Chalkidis and Androutsopoulos [1] use a Bi-LSTM network operating on words for a task of extracting contractual clauses. Wei et al. [2] have shown the superiority of convolutional networks over Support Vector Machines for the classification of texts on large specific datasets. The use of a Bi-GRU has become a standard approach, see [3]. A performance of 92% was obtained on the identification of criminal charges and of judicial outcomes from Chinese criminal decisions [4]. These types of approaches can also be used successfully on judgments in civil matters [5]. Bi-LSTM networks coupled with a representation of the judgment in the form of a tensor achieve a performance of around 93% on a corpus of 1.8 million Chinese criminal judgments [6]. This work has been successfully replicated on a body of judgments of the European Court of Human Rights in English, with an F-measure of 80% for Bi-GRU networks with attention and Hierarchical BERT [7]. On the same corpus, the development of a specific lexical embedding, ECHR2Vec, makes it possible to reach a performance of around 86% [8]. A similar performance of 79% is obtained with TF-IDF (Term Frequency-Inverse Document Frequency) representations in the Portuguese language [9]. Although neural networks achieve very good performance, we defend in this paper the use of compression-based machine learning models relying on word representations such as TF-IDF, with different variants corresponding to different weighting schemes. These approaches are particularly suited to small- to medium-size annotated datasets.
As we stressed, claims can be grouped into specific categories depending on their nature, e.g., several claims may refer to the notion of child care; such categories are defined a priori by jurists for the analysis of a corpus of judgments of interest. In addition, a judgment most of the time contains only a single claim of a given category (a corpus description and a descriptive analysis are provided in the next section). In this context, we are interested in the definition of predictive models able to predict the judge's decision expressed in a judgment for a specific category of claims. Stated otherwise, knowing that a judgment contains a single claim of a given category, the model will have to answer the following question by analyzing the judgment (a textual document): has the claim been accepted or rejected? Developing efficient predictors of the outcome of specific categories of claims is of major interest for the analysis of large corpora of judgments. It paves the way, for instance, for large-scale statistical analyses of correlations between aspects of the case (e.g., parties, location of the court) and outcomes for specific categories of claims. Such analyses are important for theoretical studies on law enforcement and for the future development of models able to predict the outcome of cases. Note that traditional text classification techniques obtain good performance when predicting whether a judgment contains a claim of a specific category, see [10]. Obtaining relevant statistics about judges' decisions on a given category of claim would therefore be based on (i) applying the aforementioned model to identify judgments containing a claim of the category of interest, and (ii) applying the type of models studied in this paper to determine the outcome in the judgments previously identified.
The methodology of judicial prediction therefore depends on the ability of a model to predict the judge's decision on a claim of a given category, without knowing the precise location of the statement of the judge's decision inside a judgment. In this context, extracting the result of a claim can be formulated as a binary text classification task. To tackle this task, we consider in this paper the supervised machine learning paradigm, assuming that a set of annotated judgments, i.e., a labeled dataset, is provided for each category of claims of interest. We therefore aim to use the labeled dataset to train an algorithm to recognize whether the request has been rejected or accepted. Considering this setting, the paper presents various models and empirically compares them on a corpus of French judgments. We propose a statistical analysis of the impact of various technical aspects generally involved in the classification of texts, which consist of combinations of representations of judgments and classification algorithms. This analysis sheds light on certain configurations that make it possible to determine judges' decisions on a claim. We also propose the generalized Gini-PLS algorithm, which is an extension of the simple Gini-PLS model [11]. This generalized Gini-PLS consists in adding a regularization parameter that makes it possible to better adapt the regression to the information in the distribution tails while attenuating, as in the simple Gini-PLS, the influence exerted by outliers. We also propose a new regression (LOGIT-Gini-PLS), which is better suited to the explanation of a target variable when the latter is binary. These two models have never been applied to text classification.
The paper is organized as follows: Section 2 presents characteristics of the corpus used for this study and motivates the modeling of the task adopted in this paper (i.e., decision outcome prediction as a binary text classification). Section 3 presents the different TF-IDF vectorizations of the judgments. Section 4 presents the proposed generalized (LOGIT) Gini-PLS algorithms for text classification. Section 5 presents our experiments and results. Section 6 concludes our study.

Datasets and Modeling Motivations
We assume in this paper that predicting judges' decisions may be studied through the lens of binary text classification models. This positioning is based on discussions with jurists and motivated by analyses performed on labeled datasets of French judgments. Six datasets built from a corpus of French judgments are considered in our study, one for each of the six categories of claims introduced in Table 1. Membership of a judgment in a category means that the judgment contains a claim of that category, e.g., all the judgments in the ACPA category contain a claim related to a civil fine for abuse of process. Table 2 presents sections of a judgment of that category (ACPA). The sections refer to the mention of the claim and to the corresponding decision, respectively. Figure 1 presents additional details about the datasets, in particular the number of claims of a category found in the judgments. On the one hand, the statistics on the labeled data show that the judgments contain for the most part a single claim of a category (or at least one claim of the category). The percentage of judgments having only one request of a category is 100% for ACPA, 63.33% for CONCDEL, 95.45% for DANAIS, 80.22% for DCPPC, and 76.21% for DORIS. The exception is the STYX category (damages under Article 700 CPC), where most of the judgments instead contain two claims. This exception can be explained by the fact that each party generally makes this type of request because it relates to the reimbursement of legal costs. On the other hand, few judgments with two or more claims exist. In this case, classifying any one claim becomes difficult, since vocabulary and sentences related to the other claims may appear in the judgment (although they are in the same category). This may show up as noise or outliers in the dataset of each claim category.
The use of Gini estimators is therefore welcome to handle outlying observations.
Observation 2. Most judges' decisions are binary: accept or reject.
Observation 3. The algorithm must be able to deal with judgments containing a large number of tokens. Figure 3 illustrates the distribution of the judgments' lengths (number of tokens, i.e., words). We note that the texts are long in comparison to those usually considered by state-of-the-art text classification approaches. As we will discuss later, this particularity will hamper the use of some efficient existing approaches such as PLS algorithms for compression.
Observation 4. In some claim categories, a strong imbalance may exist between the outcomes accept/reject. Table 3 presents the final statistics of the dataset used for both training and evaluating the predictive models considered in this study. As can be seen, four claim categories out of six exhibit strongly unbalanced decisions.

Text Classification
Text classification allows judgments to be organized into predefined groups. This technique has long received wide attention. Two technical choices mainly influence the performance of the classification: the representation of the texts and the choice of the classification algorithm. In the following, the predicted variable is denoted $y$, the predictors are denoted $x$, and the learning base including the observations of the sample is denoted $D$. Considering a vocabulary $V = \{t_1, t_2, \ldots, t_n\}$, we further assume that every judgment $d \in D$ is represented as a TF-IDF (Term Frequency-Inverse Document Frequency) vector embedding [12], in which $w(t_k, d)$ is the weight of $t_k$ in $d$, defined as the normalized product of a global weight $g(t_k)$ depending on the training corpus and a local weight $l(t_k, d)$ stressing the importance of $t_k$ in judgment $d$:

\[ w(t_k, d) = \frac{l(t_k, d) \times g(t_k)}{nf}, \]

with $nf$ a normalization factor. Table 4 summarizes the notations used in the paper. The global weight is computed following one of the methods presented in Table 5. The local weight is computed from the frequency of occurrences of the word in the judgment using one of the methods of Table 6.

Table 5. Global weighting metrics.

Table 6. Local weighting metrics.

Description | Formula
Raw term frequency [12] | tf(t, d) = number of occurrences of t in d
Presence of the word [12] | tp(t, d)
Augmented and normalized frequency of the word [12] | atf(t, d)
Normalization based on the average frequency of the word [23] | logave(t, d) (avg is the average)

The vector representation of texts generally results in high-dimensional vectors whose coordinates are mostly zero. Consequently, dimension reduction (compression) techniques, such as PLS regressions, make it possible to obtain vectors more relevant to classification tasks.
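To make the weighting scheme concrete, the following minimal sketch computes the weight of each term as a local weight times a global weight, divided by a normalization factor. It is only an illustration under stated assumptions: the local weight is the raw frequency tf, the global weight is the usual idf(t) = log(N / df(t)), and the normalization factor is the Euclidean norm of the document vector. The function name `tfidf_vectors` and the toy documents are hypothetical and are not taken from the paper's code.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Weight each term by local(t, d) * global(t), then normalize each document."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Global weight (assumed here): idf(t) = log(N / df(t)), computed on the corpus.
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Local weight (assumed here): raw term frequency tf(t, d).
        raw = [tf[t] * idf[t] for t in vocab]
        # Normalization factor nf (assumed here): Euclidean norm, 1.0 if all weights are zero.
        nf = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / nf for w in raw])
    return vocab, vectors


# Toy tokenized "judgments" (hypothetical).
docs = [
    ["claim", "rejected", "costs"],
    ["claim", "accepted"],
    ["claim", "rejected"],
]
vocab, vectors = tfidf_vectors(docs)
```

In this toy corpus the word "claim" occurs in every document, so its global weight is zero and it does not contribute to any vector, which illustrates why most coordinates of such representations carry little or no information.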

Generalized Gini-PLS Algorithms for Text Classification
The Gini-PLS regression was introduced by [11]. In what follows, we propose two Gini-PLS algorithms: a generalized Gini-PLS regression based on the generalized Gini covariance operator, and a combination of the latter with logistic regression. We first review the PLS algorithm.

PLS
The advantage of the Gini-PLS algorithm is to reduce the sensitivity to outliers. It is an extension of PLS (partial least squares) analysis [24]. PLS analysis explains the dependence between one or more predicted variables $y$ and predictors $x = (x_1, x_2, \ldots, x_m)$. It mainly consists in transforming the predictors into a reduced number $h$ of orthogonal principal components $t_1, \ldots, t_h$. It is therefore a method of dimension reduction in the same way as principal components analysis (PCA), linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA). The components $t_1, \ldots, t_h$ are built in successive steps by applying the PLS algorithm repeatedly. More precisely, at each iteration $i \in \{1, \ldots, h\}$, the component $t_i$ is calculated by the formula $t_i = x \cdot w_i$, and then the target $y$ is regressed by OLS on the components. PLS analysis has several advantages [25], including robustness to the high-dimensional problem and the ability to eliminate the multicollinearity problem [26]. These problems are likely to arise on small corpora of texts with a large number of words, as in our case. The PLS method has been extended and successfully applied to various regression problems [25], to the classification of data in general [27][28][29], and to the classification of texts in particular [30].
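The iterative construction of the components can be sketched as follows. This is a minimal illustration of the classical single-target PLS (PLS1) with deflation, written with NumPy, and not the implementation used in the experiments. The function name `pls1_components` and the synthetic data are hypothetical.

```python
import numpy as np


def pls1_components(x, y, h):
    """Extract h orthogonal PLS components for one target, then regress y on them."""
    x = x - x.mean(axis=0)           # center the predictors
    y = y - y.mean()                 # center the target
    residual_x = x.copy()
    components = []
    for _ in range(h):
        w = residual_x.T @ y         # weights proportional to cov(x_j, y)
        w = w / np.linalg.norm(w)
        t = residual_x @ w           # component t_i = x . w_i
        p = residual_x.T @ t / (t @ t)
        # Deflation: remove what t explains, so the next component is orthogonal to t.
        residual_x = residual_x - np.outer(t, p)
        components.append(t)
    t_matrix = np.column_stack(components)
    # OLS regression of the (centered) target on the components.
    coef, *_ = np.linalg.lstsq(t_matrix, y, rcond=None)
    return t_matrix, coef


rng = np.random.default_rng(0)
x_demo = rng.normal(size=(20, 5))
y_demo = x_demo[:, 0] + 0.1 * rng.normal(size=20)
t_demo, coef_demo = pls1_components(x_demo, y_demo, h=2)
```

Because each component is removed from the predictors before the next one is computed, the columns of `t_demo` are mutually orthogonal, which is what removes the multicollinearity problem in the final OLS step.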

The Gini Covariance Operator
Schechtman and Yitzhaki [31] have recently generalized the Gini covariance operator, i.e., the co-Gini, in order to put more or less weight on the tails of distributions. The Gini covariance operator is given by:

\[ \mathrm{cog}(x_\ell, x_k) := \mathrm{cov}(x_\ell, F(x_k)), \]

where $F(x_k)$ is the cdf of variable $x_k$. Let us denote by $r_k = (R_\downarrow(x_{1k}), \ldots, R_\downarrow(x_{Nk}))$ the decreasing rank vector of variable $x_k$, in other words, the vector which assigns the lowest rank (1) to the observation with the highest value $x_{dk}$, and so on. The generalized co-Gini operator is given by Schechtman and Yitzhaki [31]:

\[ \mathrm{cog}_\nu(x_\ell, x_k) := -\nu \, \mathrm{cov}(x_\ell, r_k^{\nu - 1}), \quad \nu > 1. \]

The role of the co-Gini operator can be explained as follows. When $\nu \to 1$, the variability of the variables is attenuated so that $\mathrm{cog}_\nu(x_k, x_\ell)$ tends to zero (even if the variables $x_k$ and $x_\ell$ are strongly correlated). On the contrary, if $\nu \to \infty$, then $\mathrm{cog}_\nu(x_k, x_\ell)$ allows one to focus on the distribution tails of $x_\ell$. The use of the co-Gini operator attenuates the influence of outliers because the rank vector acts as an instrument in the regression of $y$ on $x$ (regression by instrumental variables) [32].
Thus, by proposing a Gini-PLS regression based on the ν parameter, we can calibrate the coefficient ν of the co-Gini operator in order to dilute the influence of the outlying observations. This generalized Gini-PLS regression becomes a regularized Gini-PLS regression where the parameter ν plays the role of a regularization parameter.
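The behavior of the operator can be checked numerically. The sketch below assumes that the generalized co-Gini takes the form $-\nu \, \mathrm{cov}(x, r_z^{\nu-1})$, where $r_z$ is the decreasing rank vector of $z$, that tied values receive their average rank, and that the covariance is computed with a $1/N$ factor. These conventions and the function names `decreasing_ranks` and `co_gini` are assumptions made for the illustration.

```python
import numpy as np


def decreasing_ranks(v):
    """Rank 1 for the largest value; tied values share their average rank (assumption)."""
    uniq, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    # Number of observations strictly greater than each unique value.
    greater = counts[::-1].cumsum()[::-1] - counts
    average_rank = greater + (counts + 1) / 2.0
    return average_rank[inverse]


def co_gini(x, z, nu=2.0):
    """Generalized co-Gini of x with respect to the decreasing ranks of z."""
    r = decreasing_ranks(z) ** (nu - 1.0)
    return -nu * np.mean((x - x.mean()) * (r - r.mean()))


x_small = np.array([1.0, 2.0, 3.0])
value_nu2 = co_gini(x_small, x_small, nu=2.0)   # equals 4/3 for this toy vector
value_nu1 = co_gini(x_small, x_small, nu=1.0)   # ranks raised to power 0: no variability left

# An outlier in the second argument changes its value but not its rank.
x_five = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z_clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z_outlier = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
same_value = co_gini(x_five, z_clean) == co_gini(x_five, z_outlier)
```

The last comparison shows the robustness mechanism: replacing 5 by 100 leaves the rank vector unchanged, so the co-Gini is unaffected, whereas an ordinary covariance would be dominated by the outlying value.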

Generalized Gini-PLS Regressions
The first Gini-PLS algorithm was proposed by [11]. We describe below the new Gini-PLS algorithm based on the generalized co-Gini operator. The generalized Gini-PLS algorithm is depicted in Figure 4.
Step 1: A weight vector $w_1$ is first built to strengthen the link (in the co-Gini sense) between the predicted variable $y$ and the predictors $x$:

\[ \max \ \mathrm{cog}_\nu(y, x w_1), \quad \text{s.t. } \|w_1\|_1 = 1. \]

The solution of this program yields the weights $w_1$. As in the standard PLS case, the target $y$ is regressed by OLS on the first component $t_1 = x w_1$:

\[ y = \hat{c}_1 t_1 + \hat{\varepsilon}_1. \]

Step 2: The rank vector of each regressor $R_\downarrow(x_k)$ is regressed by OLS on $t_1$ (with residuals $\hat{u}_1$), from which the second component $t_2$ is obtained. Thereby, the components $t_1 \perp t_2$ allow a link to be established between $y$ and $x$ by OLS:

\[ y = \hat{c}_1 t_1 + \hat{c}_2 t_2 + \hat{\varepsilon}_2. \]

A cross-validation makes it possible to find the optimal number $h > 1$ of components to retain. To test for a component $t_h$, we compute the model prediction with $h$ components including document $d$, $\hat{y}_{hd}$, and then without document $d$, $\hat{y}_{h(-d)}$. The operation is iterated for all $d$ varying from 1 to $N$: each time we remove the observation $d$, we re-estimate the model. To measure the significance of the model, we compute the predicted residual sum of squares of the model with $h$ components:

\[ PRESS_h = \sum_{d=1}^{N} \left( y_d - \hat{y}_{h(-d)} \right)^2. \]

The sum of squared residuals of the model with $h - 1$ components is:

\[ RSS_{h-1} = \sum_{d=1}^{N} \left( y_d - \hat{y}_{(h-1)d} \right)^2. \]

The test statistic is:

\[ Q_h^2 = 1 - \frac{PRESS_h}{RSS_{h-1}}. \]

The component $t_h$ is retained in the analysis if $Q_h^2$ exceeds the chosen significance threshold, in which case $t_h$ is significant in the sense that it improves the predictive power of the model. In order to test for $t_1$, we use:

\[ RSS_0 = \sum_{d=1}^{N} \left( y_d - \bar{y} \right)^2. \]

As in the standard PLS regression, the $VIP_{hj}$ statistic is measured in order to select the words $x_j$ which have the most significant impact on the decision $y$. The most significant words are those for which $VIP_{hj} > 1$, with:

\[ VIP_{hj} = \sqrt{ \frac{ m \sum_{\ell=1}^{h} \mathrm{cor}^2(y, t_\ell) \, w_{\ell j}^2 }{ \sum_{\ell=1}^{h} \mathrm{cor}^2(y, t_\ell) } }, \]

with $\mathrm{cor}^2(y, t_\ell)$ being the squared Pearson correlation between $y$ and component $t_\ell$. This information is back-propagated into the model (only once) in order to obtain the optimal number of components (on training data). The target variable $y$ is then predicted as follows:

\[ \mathrm{category}(x) = \begin{cases} 0 & \text{if } \hat{y} < 0.5, \\ 1 & \text{otherwise.} \end{cases} \]
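The first step of the algorithm and the final decision rule can be illustrated with a short sketch. Several choices below are assumptions made only for the illustration and may differ from the authors' implementation: the weights are taken proportional to the co-Gini between $y$ and each predictor and scaled to unit Euclidean norm, the co-Gini is computed as $-\nu \, \mathrm{cov}(y, r^{\nu-1})$ with average ranks for ties, and the OLS step includes an intercept. All function names and the toy data are hypothetical.

```python
import numpy as np


def decreasing_ranks(v):
    """Rank 1 for the largest value; tied values share their average rank (assumption)."""
    uniq, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    greater = counts[::-1].cumsum()[::-1] - counts
    return (greater + (counts + 1) / 2.0)[inverse]


def co_gini(y, x_j, nu):
    """Assumed form of the generalized co-Gini: -nu * cov(y, rank(x_j)^(nu - 1))."""
    r = decreasing_ranks(x_j) ** (nu - 1.0)
    return -nu * np.mean((y - y.mean()) * (r - r.mean()))


def fit_first_component(x, y, nu=2.0):
    """Step 1: co-Gini weights, first component t1 = x w1, OLS of y on t1."""
    w = np.array([co_gini(y, x[:, j], nu) for j in range(x.shape[1])])
    w = w / np.linalg.norm(w)                 # ASSUMPTION: unit Euclidean norm
    t1 = x @ w
    design = np.column_stack([np.ones_like(t1), t1])   # ASSUMPTION: intercept included
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return w, coef


def predict_category(x, w, coef):
    """Decision rule: category 0 if y_hat < 0.5, category 1 otherwise."""
    y_hat = coef[0] + coef[1] * (x @ w)
    return (y_hat >= 0.5).astype(int)


# Toy data: the first column separates the classes, the second is uninformative.
x_toy = np.array([[0.1, 5.0], [0.2, 3.0], [0.3, 4.0],
                  [0.9, 4.0], [1.0, 5.0], [1.1, 3.0]])
y_toy = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w_toy, coef_toy = fit_first_component(x_toy, y_toy)
predictions = predict_category(x_toy, w_toy, coef_toy)
```

On this toy example almost all the weight goes to the first column, since the ranks of the second column carry no information about $y$.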

Generalized LOGIT-Gini-PLS
As can be seen in the generalized Gini-PLS algorithm, the weights $w_j$ come from the generalized co-Gini operator applied to a Boolean variable $y \in \{0, 1\}$. In order to find the weights $w_j$ which maximize the link between the words $x_j$ and the decision $y$, we propose to use the LOGIT regression, in other words, a sigmoid, which is better adapted to Boolean variables. Thus, in each step of the Gini-PLS regression, we replace the maximization of the co-Gini by the estimation of the following conditional probability:

\[ P(y_d = 1 \mid x = x_d) = \frac{\exp(x_d \beta)}{1 + \exp(x_d \beta)}, \]

where $x_d$ is the $d$-th row of the matrix $x$ of the predictors (being the words in judgment $d$). The estimation of the vector $\beta$ is done by maximum likelihood. Therefore, at each step $h$ of the PLS algorithm, the weights $w_h$ are derived from the estimated vector $\hat{\beta}$. The generalized LOGIT-Gini-PLS algorithm is depicted in Algorithm 1.
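The idea of deriving the weights from a logistic regression can be sketched as follows. This is an illustration only: the maximum likelihood estimate is obtained here by plain gradient ascent with an intercept, and the weights are assumed to be the slope coefficients scaled to unit Euclidean norm, which may differ from the exact normalization used by the authors. The function names and the toy data are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logit(x, y, learning_rate=0.1, n_iter=2000):
    """Logistic regression fitted by gradient ascent on the mean log-likelihood."""
    design = np.column_stack([np.ones(len(x)), x])   # intercept + predictors
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        gradient = design.T @ (y - sigmoid(design @ beta)) / len(y)
        beta = beta + learning_rate * gradient
    return beta


def logit_weights(x, y):
    """ASSUMPTION: weights are the slope coefficients, scaled to unit Euclidean norm."""
    slopes = fit_logit(x, y)[1:]                     # drop the intercept
    return slopes / np.linalg.norm(slopes)


# Toy data with overlapping classes, so that the likelihood has a finite maximum.
x_toy = np.array([[0.1, 5.0], [0.2, 3.0], [0.9, 4.0],
                  [0.3, 4.0], [1.0, 5.0], [1.1, 3.0]])
y_toy = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w_logit = logit_weights(x_toy, y_toy)
```

In practice a library estimator would replace the hand-written gradient ascent; it is spelled out here only to keep the sketch self-contained.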

Experiments and Results
We discuss the performance of various popular algorithms and the impact of data quantity and imbalance, heuristics, and explicit restriction of judgments to sections (regions) related to the claim category, as well as the ability of the algorithms to ignore other requests in the judgment. These experiments also aim to compare the effectiveness of LOGIT-Gini-PLS with other machine learning techniques. As in Im et al. [33], we compare different combinations of classification algorithms and term weighting methods (used for text representation). These combinations represent 600 experimental configurations including:
• 11 global weighting schemes (cf. Table 5): χ², dbidf, ΔDF, dsidf, gss, idf, ig, mar, ngl, rf, and avg_global (mean of the global metrics);
• 6 local weighting schemes (cf. Table 6): tf, tp, logtf, atf, logave, and avg_local (mean of the local metrics).
The Python code of the Gini-PLS algorithms is available at https://github.com/tagnyngompe/taj-ginipls.

Assessment Protocol
Two assessment metrics are used: precision and F1-measure. To take into account the imbalance between the classes, the macro-average is preferred (as suggested by a reviewer, the MCC could also be used to deal with data imbalance). It is the aggregation of the individual contributions of each class. The macro-averaged F1-measure is calculated from the macro-averages of the precision ($P_{macro}$) and of the recall ($R_{macro}$), which are calculated according to the average numbers of true positives ($\overline{TP}$), false positives ($\overline{FP}$), and false negatives ($\overline{FN}$) as follows [35]:

\[ P_{macro} = \frac{\overline{TP}}{\overline{TP} + \overline{FP}}, \qquad R_{macro} = \frac{\overline{TP}}{\overline{TP} + \overline{FN}}. \]

The efficiency of algorithms often depends on hyper-parameters for which optimal values must be determined. The scikit-learn [36] library implements two strategies for finding these values: RandomSearch and GridSearch. Despite its speed, the RandomSearch method is non-deterministic, and the values it finds give a less accurate prediction than the default values. The same holds for the GridSearch method, which is moreover very slow and therefore impractical in view of the large number of configurations to be evaluated. Consequently, the values used for the experiments are the default values (Table 7).

Table 7. Values of the hyperparameters of the algorithms.
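The effect of macro-averaging on an unbalanced test set can be illustrated with a small sketch. It assumes the common convention in which precision and recall are first computed for each class and then averaged over the classes, and in which the F1-measure is computed from these two macro-averages; a precision or recall with an empty denominator is set to zero. This convention and the function name `macro_scores` are assumptions made for the illustration.

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Macro-averaged precision, recall and F1 over the given class labels."""
    precisions, recalls = [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Convention (assumption): a ratio with an empty denominator counts as 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p_macro = sum(precisions) / len(labels)
    r_macro = sum(recalls) / len(labels)
    total = p_macro + r_macro
    f1_macro = 2 * p_macro * r_macro / total if total else 0.0
    return p_macro, r_macro, f1_macro


# A classifier that always predicts the majority class "reject" (0).
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]
p_macro, r_macro, f1_macro = macro_scores(y_true, y_pred)
```

Although three predictions out of four are correct, the macro-averaged scores stay low because the "accept" class is never detected, which is the behavior the macro-average is meant to expose.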

Classification on the Basis of the Whole Judgment
By representing the entire judgment using various vector representations, the algorithms are compared with the representations that are optimal for them. We note from the results of Table 8 that the trees are on average better for all the categories, even if on average the F1-measure is limited to 0.668. The results of the PLS extensions are not very far from those of trees, with differences in F1-measure of around 0.1 (if we choose the right representation scheme). The average F1 scores of the NBSVM and fastText algorithms generally do not exceed 0.5, despite these algorithms being specially designed for texts. They are very sensitive to the imbalance of data between the classes (more rejections than acceptances), and the acceptance of requests is more difficult to detect. Indeed, these algorithms tend to classify all the test data with the majority label, i.e., rejection, and therefore hardly detect any accepted request. The case of the categories doris and dcppc for the NBSVM ($F_{1,macro}$ = 0.834) tends to demonstrate the strong sensitivity of these algorithms to negative cases, since the F1-measure of "reject" is always higher than that of "accept" (Table 9).
PLS algorithms systematically exceed the performance (F1-measure) of fastText and NBSVM by 10 to 20 points. This tends to demonstrate the effectiveness of PLS techniques in their role of dimension reduction. Gini-PLS algorithms do not perform any better than conventional PLS algorithms. Presumably, the dimension reduction still retains too much noise in the data. This is confirmed by the results of the trees, which remain very mixed: their F1-measure (0.668) barely exceeds that of Logit-PLS (0.648). It therefore seems necessary to proceed with zoning in the judgment to better identify relevant information and thereby reduce the noise.

Classification Based on Sections of Judgments Including the Vocabulary of the Category
Since the judgments are related to several categories of claim, we experiment with restrictions of the textual content based on different types of regions in judgments. The first types of regions are sections of the judgment: the description of facts and proceedings (Facts and Proceedings), judges' reasoning to justify their decisions (Opinion), and the summary of judges' decisions (Holding). The sections are identified using a text labeling method [37]. Other types of regions are statements extracted from the sections related to the category of claim. They express either a claim, a result, a previous result (result_a), or judges' reasoning about the category (context). These statements are extracted using regular expressions and are used in the restriction only if they contain a key-phrase of the category. The combinations of region, vector representation, and algorithm are compared in Table 10. The F1-measure increases significantly with the reduction of the judgment to regions, except for the doris category. The best restriction combines regions including the vocabulary of the category in the Facts and Proceedings section (request and previous result), in the Opinion section (context), and in the Holding section (result).
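The restriction of a judgment to statements containing a key-phrase of the category can be sketched as follows. The key-phrases, the example statements, and the function name `restrict_to_category` are purely hypothetical; the regular expressions actually used to extract the statements are not described here and are not reproduced.

```python
import re


def restrict_to_category(statements, key_phrases):
    """Keep only the statements that contain at least one key-phrase of the category."""
    pattern = re.compile(
        "|".join(re.escape(phrase) for phrase in key_phrases),
        re.IGNORECASE,
    )
    return [s for s in statements if pattern.search(s)]


# Hypothetical key-phrases for a category about civil fines for abuse of process.
key_phrases = ["abuse of process", "civil fine"]
statements = [
    "The defendant requests a civil fine for abuse of process.",
    "The plaintiff requests reimbursement of legal costs.",
    "The court rejects the request for a Civil Fine.",
]
restricted = restrict_to_category(statements, key_phrases)
```

Only the first and third statements are kept, so the text passed to the classifier no longer contains the vocabulary of the unrelated request on legal costs.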
It is noteworthy that restricting the training of the model to the section facts and proceedings corresponds to the prediction of the judge's outcome. When additional information is employed to train the model, such as opinion and holding, the task is reduced to an identification or an extraction of the judge's outcome.
After reducing the size of the judgment, the trees provide excellent results, followed very closely by our GiniPLS and LogitGiniPLS algorithms. For example, in the dcppc category (see Table 5), the performance of trees (F1 = 0.985) slightly exceeds that of the LogitPLS (0.94) and standard PLS (0.934) algorithms. In the concdel category, the performance of trees (F1 = 0.798) is still followed by that of the LogitGiniPLS (0.703) and standard PLS (0.657) algorithms.
The most interesting case concerns neighborhood disturbances (doris category). These judgments often involve multiple pieces of information that are sometimes difficult to synthesize, even for humans. The argumentation set out in doris judgments relates to many different issues (problems of views, sunshine, trees, etc.), so that the factual elements conditioning the identification of the judicial outcomes are sometimes complex. This information can be either under-represented or over-represented depending on the vectorization scheme. Our GiniPLS algorithm (like our LogitGiniPLS) seems to be particularly suitable for this category of request. The F1-measures obtained in this category amount to 0.806 (for GiniPLS and LogitGiniPLS) and 0.772 for StandardPLS, while decision trees are not among the three best algorithms for this category of request. This result reinforces the idea that our GiniPLS algorithms can sometimes compete with decision trees, which act as a benchmark in the literature dealing with small datasets. It would make it possible in the future to consider including our GiniPLS algorithms in ensemble methods, to broaden the spectrum of algorithms that are robust to outliers and at the same time play a role of data compression.

Table 10. Accuracy of the classification with restriction of judgments to specific regions.
Category | Region | Representation | Algorithm

Conclusions
This article attempted to simplify the extraction of the outcome (accept or reject) of the decision rendered by the judges on a request of a given category of claims. It consisted in formulating the problem as a binary classification of judgments. Ten classification algorithms were tested over 55 methods of vector embeddings. We observed that the classification results were mainly influenced by three characteristics of our data. First, the very small number of training examples disadvantaged certain algorithms (sensitivity to outliers), such as fastText, which requires several thousand examples to update its parameters. Second, the strong imbalance between the classes ("accept" vs. "reject") made it difficult to recognize the minority class, which is generally the "accept" class. This was shown by the large gap between the errors on "reject" and those on "accept", as well as by the good results obtained on dcppc. Finally, the presence of other claim categories in the judgment degraded the efficiency of the classification because the algorithms did not manage on their own to find the elements directly related to the analyzed category. This was demonstrated by the positive impact of restricting the content to be classified to certain regions of the decision, even if the appropriate restrictions depended on the category.
Decision trees were suitable for the classification task, but the use of Gini-PLS and Gini-Logit-PLS made it possible to obtain performances fairly close to those of trees and sometimes higher. It would be interesting to combine these variants of PLS algorithms with others such as Sparse-PLS, which could perhaps help to solve the problem of sparse vectors. There are also a large number of neural architectures for the classification of judgments and a very large number of weighting metrics for the representation of texts, but none seems to fit all categories. Therefore, a study on the use of semantic embedding representations like Sent2Vec [38] or Doc2Vec [39] would be interesting.