Article

Identification of Judicial Outcomes in Judgments: A Generalized Gini-PLS Approach

1
EuroMov Digital Health in Motion, University of Montpellier, IMT Mines Ales, 30100 Ales, France
2
CHROME, University of Nîmes, Avenue du Dr. Georges Salan, 30000 Nimes, France
*
Author to whom correspondence should be addressed.
Stats 2020, 3(4), 427-443; https://doi.org/10.3390/stats3040027
Submission received: 19 July 2020 / Revised: 17 September 2020 / Accepted: 19 September 2020 / Published: 27 September 2020
(This article belongs to the Special Issue Interdisciplinary Research on Predictive Justice)

Abstract

This paper presents and compares several text classification models that can be used to extract the outcome of a judgment from justice decisions, i.e., legal documents summarizing the different rulings made by a judge. Such models can be used to gather important statistics about cases, e.g., success rate based on specific characteristics of cases’ parties or jurisdiction, and are therefore important for the development of judicial prediction, not to mention the study of law enforcement in general. We propose in particular the generalized Gini-PLS, which better considers the information in the distribution tails while attenuating, as in the simple Gini-PLS, the influence exerted by outliers. Modeling the studied task as a supervised binary classification, we also introduce the LOGIT-Gini-PLS, suited to the explanation of a binary target variable. In addition, various technical aspects of the evaluated text classification approaches, which consist of combinations of representations of judgments and classification algorithms, are studied using an annotated corpus of French justice decisions.

1. Introduction

Judicial prediction is the ability to predict what a judge will decide on a given case. Is it possible to develop efficient predictive models to automatize such predictions? This question has long been driving several initiatives at the crossroads of Artificial Intelligence and Law—in particular, through the development of predictive models based on the alignment of computable features of the case that were available to the judge prior to the judgment, with computable features of the judge’s decision on the case. In this line of work, this paper presents a study towards the development of such predictive models taking advantage of Machine Learning and Natural Language Processing techniques. The legal vocabulary being notoriously ambiguous, we first detail important concepts that will be used thereafter.
A case begins with a complaint requesting a remedy for harm suffered due to the wrongdoing of the defendant. The features of the case are the circumstances existing prior to the filing of the complaint, that is, a set of facts sufficient to justify a right to file a complaint.
A claim is a request made by a plaintiff against a defendant, seeking legal remedy. Claims can be grouped into different categories, depending on the rule applicable and the type of remedy sought (e.g., injunctive relief, cease and desist order, damages).
A judgment summarizes the different rulings made by a judge about a certain case into a document. Judgments therefore contain many features that can be extracted (e.g., type of court, name of the parties, claims made by the parties, judges' decisions on the claims). A complaint, and hence the resulting judgment, can contain many different claims seeking different types of remedy. Therefore, in general, a judgment concerns different types of claims.
The decision is a ruling made on a particular claim. We further consider that the judge's decision on a claim is either to accept or to reject it. Note that a judgment must be distinguished from the judge’s decision on a specific claim.
In recent years, judicial prediction has relied almost exclusively on neural networks, which may be seen as the most flexible models for the classification and prediction of legal decisions when large datasets are available. Chalkidis and Androutsopoulos [1] use a Bi-LSTM network operating on words for a contractual clause extraction task. Wei et al. [2] have shown the superiority of convolutional networks over Support Vector Machines for the classification of texts on large specific datasets. The use of a Bi-GRU has become a standard approach, see [3]. Performance of 92% was obtained on the identification of criminal charges and on judicial outcomes from Chinese criminal decisions [4]. These types of approaches can also be used successfully on judgments in civil matters [5]. Bi-LSTM networks coupled with a representation of the judgment in the form of a tensor achieve performance around 93% on a corpus of 1.8 million Chinese criminal judgments [6]. This work has been successfully replicated on a body of judgments of the European Court of Human Rights in English, with F-measure performance of 80% for Bi-GRU networks with attention and Hierarchical BERT [7]. On the same corpus, the development of a specific lexical embedding, ECHR2Vec, makes it possible to reach performance around 86% [8]. A similar performance of 79% is obtained with TF-IDF (Term Frequency-Inverse Document Frequency) representations in the Portuguese language [9]. Although neural networks achieve very good performance, we defend in this paper the use of compression machine learning models based on word representations such as TF-IDF, with different variants corresponding to different weighting schemes. These approaches are particularly suited to dealing with small- to medium-size annotated datasets.
As we stressed, claims can be grouped into specific categories depending on their nature, e.g., several claims may refer to the notion of child care; such categories are defined a priori by jurists for the analysis of a corpus of judgments of interest. In addition, a judgment most of the time only contains a single claim of a given category (A corpus description and a descriptive analysis are provided in the next section). In this context, we are interested in the definition of predictive models able to predict the judge’s decision expressed in a judgment for a specific category of claims. Stated otherwise, knowing that a judgment contains a single claim of a given category, the model will have to answer the following question analyzing the judgment (textual document): has the claim been accepted or rejected? Developing efficient predictors of the outcome of specific categories of claims is of major interest for the analysis of large corpus of judgments. It, for instance, paves the way for large statistical analysis of correlations between aspects of the case (e.g., parties, location of the court) and outcomes for specific categories of claims. Such analyses are important for theoretical studies on law enforcement and future development of models able to predict the outcome of cases. Note that traditional text classification techniques obtain good performance predicting if a judgment contains a claim of a specific category, see [10]. Obtaining relevant statistics about judge’s decisions on a given category of claim would therefore be based on (i) applying the aforementioned model to distinguish judgments containing a claim of the category of interest, and (ii) applying the type of models studied in this paper to know the outcome of previously identified judgments.
The methodology of judicial predictions therefore depends on the ability of a model to predict the judge’s decision on a claim inherent to a given category—without knowing the precise localization of the statement of the judge’s decision inside a judgment. In this context, extracting the result of a claim can be formulated as a task of binary text classification. To tackle this task, we consider in this paper the supervised machine learning paradigm assuming that a set of annotated judgments, i.e., labelled dataset, is provided for each category of claims of interest. We therefore aim to use the labeled dataset for training an algorithm to recognize whether the request has been rejected or accepted. Considering this setting, the paper presents various models and empirically compares them on a corpus of French judgments. A statistical analysis of the impact of various technical aspects generally involved in the classification of texts which consists of a combination of representations of judgments and classification algorithms is proposed. This analysis sheds light on certain configurations making it possible to determine judges’ decisions of a claim. We also propose the generalized Gini-PLS algorithm which is an extension of the simple Gini-PLS model [11]. This generalized Gini-PLS consists in adding a regularization parameter that makes it possible to better adapt the regression with respect to the information in the distribution tails while attenuating, as in the simple Gini-PLS, the influence exerted by outliers. We also propose a new regression (LOGIT-Gini-PLS) which is better suited to the explanation of a target variable when the latter is a binary variable. These two models have never been applied to text classification.
The paper is organized as follows: Section 2 presents characteristics of the corpus used for this study and motivates the modeling of the task adopted in this paper (i.e., decision outcome prediction as a binary text classification). Section 3 presents the different TF-IDF vectorizations of the judgments. Section 4 presents the proposed generalized (LOGIT) Gini-PLS algorithms for text classification. Section 5 presents our experiments and results. Section 6 concludes our study.

2. Datasets and Modeling Motivations

We assume in this paper that predicting judges' decisions may be studied through the lens of binary text classification models. This positioning is based on discussions with jurists and motivated by analyses performed on labeled datasets of French judgments. Six datasets built from a corpus of French judgments are considered in our study, one for each of the six categories of claims introduced in Table 1.
The semantics of the membership of a judgment in a category is that the judgment contains a claim of that category, e.g., all the judgments in the ACPA category contain a claim related to Civil fine for abuse of process. Table 2 presents sections of a judgment of that category [ACPA]. The sections refer to the mentions of the claim and to the corresponding decision, respectively. Figure 1 presents additional details about the datasets, in particular the number of claims of a category found in the judgments.
Observation 1.
Decisions most often only contain a single claim of a specific category.
On the one hand, the statistics on the labelled data show that the judgments contain for the most part a single claim of a category (or at least one claim of the category). The percentage of judgments having only one request of a category is respectively: 100% for ACPA, 63.33% for CONCDEL, 95.45% for DANAIS, 80.22% for DCPPC, and 76.21% for DORIS. However, we note the exception of the STYX category (damages on article 700 CPC), where, in most of the judgments, there are instead two claims. This exception can be justified by the fact that each party generally makes this type of request because it relates to the reimbursement of legal costs.
On the other hand, few judgments with two or more claims exist. In this case, the classification task of any claim becomes difficult since specific vocabulary and sentences related to other claims may appear in the judgment (although they are in the same category). This may show up as noise or outliers in the dataset of each claim category. The use of Gini estimators is therefore welcome to handle outlying observations.
Observation 2.
Most judges’ decisions are binary: accept or reject.
Figure 2 highlights the fact that outcomes of a given claim are most often accepted or rejected, and that other forms of results are very rare. These observations motivate the interest of developing a binary classifier for predicting the outcome of a claim belonging to a specific category.
Observation 3.
The algorithm must be able to deal with a large number of tokens of judgments.
Figure 3 illustrates the distribution of the judgments’ lengths (number of tokens, i.e., words). We note that the texts are long in comparison to those usually considered by state-of-the-art text classification approaches. As we will discuss later, this particularity will hamper the use of some efficient existing approaches such as PLS algorithms for compression.
Observation 4.
In some claim categories, a strong imbalance may exist between the outcomes accept/reject.
Table 3 presents the final statistics of the dataset used for both training and evaluating the predictive models considered in this study. As can be seen, four claim categories out of six exhibit strongly unbalanced decisions.

3. Texts Classification

Text classification allows judgments to be organized in predefined groups. This technique has received a large audience for a long time. Two technical choices mainly influence the performance of the classification: the representation of the texts and the choice of the classification algorithm. In the following, the predicted variable is denoted $y$, the predictors are denoted $x$, the learning base including the observations of the sample is expressed as $D = \{(x_i, y_i)\}_{i=1,\ldots,N}$, and $C$ represents claim categories.
Considering a vocabulary $V = \{t_1, t_2, \ldots, t_n\}$, we further assume that every judgment $d \in D$ is represented as a TF-IDF (Term Frequency-Inverse Document Frequency) vector embedding [12] $d \in \mathbb{R}^n$, where each dimension $1 \le k \le n$ refers to word $t_k \in V$ and $d[k] = w(t_k, d)$ is the weight of $t_k$ in $d$, defined as the normalized product of a global weight $g(t_k)$ depending on the training corpus and a local weight $l(t_k, d)$ stressing the importance of $t_k$ in judgment $d$:
$$w(t_k, d) = l(t_k, d) \times g(t_k) \times nf(d)$$
with $nf$ a normalization factor. Table 4 summarizes the notations used in the paper. The global weight is computed following one of the methods presented in Table 5. The local weight is computed from the frequency of occurrences of the word in the judgment using one of the methods of Table 6.
The vector representation of texts generally results in high-dimensional vectors whose coordinates are mostly zero. Consequently, dimension reduction (compression) techniques, such as PLS regressions, make it possible to obtain vectors more relevant to classification tasks.

4. Generalized Gini-PLS Algorithms for Text Classification

The Gini-PLS regression was introduced by [11]. In what follows, we propose two Gini-PLS algorithms: a generalized Gini-PLS regression based on the generalized Gini covariance operator, and a combination of the latter with logistic regression. We first review the PLS algorithm.

4.1. PLS

The advantage of the Gini-PLS algorithm is to reduce the sensitivity to outliers. It is an extension of the PLS (partial least squares) analysis [24]. The PLS analysis explains the dependence between one or more predicted variables $y$ and predictors $x = (x_1, x_2, \ldots, x_m)$. It mainly consists in transforming the predictors into a reduced number $h$ of orthogonal principal components $t_1, \ldots, t_h$. It is therefore a method of dimension reduction in the same way as principal components analysis (PCA), linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA). The components $t_1, \ldots, t_h$ are built in different steps by applying the PLS algorithm repeatedly. More precisely, at each iteration $i \in [1, h]$, the component $t_i$ is calculated by the formula $t_i = x \cdot w_i$, and then the target $y$ is regressed by OLS on the components obtained so far. PLS analysis has several advantages [25] including robustness to the high-dimensional problem and the ability to eliminate the multicollinearity problem [26]. These problems are likely to arise on small corpora of texts with a large number of words as in our case. The PLS method has been extended and successfully applied to various regression problems [25], to the classification of data in general [27,28,29], and to texts in particular [30].

4.2. The Gini Covariance Operator

Schechtman and Yitzhaki [31] have recently generalized the Gini covariance operator, i.e., co-Gini, in order to impose more or less weight at the tails of distributions. This Gini covariance operator is given by:
$$\operatorname{cog}(x_\ell, x_k) := \operatorname{cov}(x_\ell, F(x_k)) = \frac{1}{N} \sum_{d=1}^{N} (x_{d\ell} - \bar{x}_\ell)(F(x_{dk}) - \bar{F}_{x_k}),$$
where $F(x_k)$ is the cdf of variable $x_k$. Let us denote by $r_k = (R(x_{1k}), \ldots, R(x_{Nk}))$ the decreasing rank vector of variable $x_k$, in other words, the vector which assigns the lowest rank (1) to the observation with the highest value $x_{dk}$, and so on:
$$R(x_{dk}) := \begin{cases} N + 1 - \#\{x \le x_{dk}\} & \text{if there is no similar observation} \\ N + 1 - \dfrac{\sum_{d=1}^{p} \#\{x \le x_{dk}\}}{p} & \text{if there are } p \text{ similar observations } x_{dk}. \end{cases}$$
The generalized co-Gini operator is given by Schechtman and Yitzhaki [31]:
$$\operatorname{cog}_\nu(x_\ell, x_k) := -\nu \operatorname{cov}(x_\ell, r_k^{\nu-1}); \quad \nu > 1.$$
The role of the co-Gini operator can be explained as follows. When $\nu \to 1$, the variability of the variables is attenuated so that $\operatorname{cog}_\nu(x_k, x_\ell)$ tends to zero (even if the variables $x_k$ and $x_\ell$ are strongly correlated). On the contrary, if $\nu \to \infty$, then $\operatorname{cog}_\nu(x_k, x_\ell)$ allows one to focus on the distribution tails of $x_\ell$. The use of the co-Gini operator attenuates the influence of outliers because the rank vector acts as an instrument in the regression of $y$ on $x$ (regression by instrumental variables) [32].
Thus, by proposing a Gini-PLS regression based on the ν parameter, we can calibrate the coefficient ν of the co-Gini operator in order to dilute the influence of the outlying observations. This generalized Gini-PLS regression becomes a regularized Gini-PLS regression where the parameter ν plays the role of a regularization parameter.
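To make the operator concrete, here is a minimal pure-Python sketch of the decreasing rank vector and of the generalized co-Gini. It assumes a leading minus sign in the operator, so that positively associated variables yield a positive value under decreasing ranks, and it averages the ranks of tied observations. The function names are hypothetical and are not taken from the authors' repository.

```python
def decreasing_ranks(values):
    """Rank 1 for the largest value; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average
        start = end + 1
    return ranks


def covariance(a, b):
    """Population covariance (division by N), as in the co-Gini definition."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / n


def generalized_cogini(x_l, x_k, nu=2.0):
    """cog_nu(x_l, x_k) = -nu * cov(x_l, r_k ** (nu - 1)), with nu > 1.

    The minus sign is an assumption of this sketch (see the lead-in).
    """
    if nu <= 1:
        raise ValueError("nu must be greater than 1")
    powered = [r ** (nu - 1) for r in decreasing_ranks(x_k)]
    return -nu * covariance(x_l, powered)


value = generalized_cogini([30.0, 10.0, 20.0], [3.0, 1.0, 2.0], nu=2.0)
tied = decreasing_ranks([5.0, 5.0, 1.0])
```

Because only the ranks of $x_k$ enter the covariance, replacing the largest value of $x_k$ by an arbitrarily larger one leaves the result unchanged, which illustrates how the operator dampens outliers.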

4.3. Generalized Gini-PLS Regressions

The first Gini-PLS algorithm was proposed by [11]. We describe below the new Gini-PLS algorithm based on the generalized co-Gini operator. The generalized Gini-PLS algorithm is depicted in Figure 4.
Step 1: A weight vector $w_1$ is first built to improve the link (in the co-Gini sense) between the predicted variable $y$ and the predictors $x$:
$$\max \operatorname{cog}_\nu(y, x w_1), \quad \text{s.t. } \|w_1\|_1 = 1.$$
The solution of this program is:
$$w_1 = \frac{\operatorname{cog}_\nu(y, x)}{\|\operatorname{cog}_\nu(y, x)\|_1}.$$
As in the standard PLS case, the target $y$ is regressed by OLS on the first component $t_1 = x w_1$:
$$y = \hat{c}_1 t_1 + \hat{\varepsilon}_1.$$
Step 2: The rank vector of each regressor $R(x_k)$ is regressed by OLS on $t_1$ (with residuals $\hat{u}_1$):
$$R(x) = \hat{\beta} t_1 + \hat{u}_1.$$
The second component $t_2$ is given by:
$$\max \operatorname{cog}_\nu(\hat{\varepsilon}_1, \hat{u}_1 w_2) \ \text{ s.t. } \|w_2\|_1 = 1 \;\Longrightarrow\; w_2 = \frac{\operatorname{cog}_\nu(\hat{\varepsilon}_1, \hat{u}_1)}{\|\operatorname{cog}_\nu(\hat{\varepsilon}_1, \hat{u}_1)\|_1} \;\Longrightarrow\; t_2 = \hat{u}_1 w_2.$$
Thereby, the components $t_1$ and $t_2$ allow a link to be established between $y$ and $x$ by OLS:
$$y = \hat{c}_1 t_1 + \hat{c}_2 t_2 + \hat{\varepsilon}_2.$$
Step h: Partial regressions are run up to $t_{h-1}$:
$$R(x) = \beta t_1 + \cdots + \gamma t_{h-1} + \hat{u}_{h-1}.$$
Then, after maximization:
$$w_h = \frac{\operatorname{cog}_\nu(\hat{\varepsilon}_{h-1}, \hat{u}_{h-1})}{\|\operatorname{cog}_\nu(\hat{\varepsilon}_{h-1}, \hat{u}_{h-1})\|_1} \;\Longrightarrow\; t_h = \hat{u}_{h-1} w_h,$$
we have by OLS,
$$y = \hat{c}_1 t_1 + \cdots + \hat{c}_h t_h + \varepsilon_h.$$
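Step 1 of the procedure above can be sketched as follows in pure Python. This is only an illustration under stated assumptions: the co-Gini carries a leading minus sign, tied ranks are averaged, the OLS regression of $y$ on $t_1$ has no intercept (mirroring the equation), and only the first component is computed. The helper names and the toy data are hypothetical; the authors' repository may proceed differently.

```python
def decreasing_ranks(values):
    """Rank 1 for the largest value; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def cogini(y, column, nu):
    """Generalized co-Gini with an assumed minus sign: -nu * cov(y, r ** (nu - 1))."""
    powered = [r ** (nu - 1) for r in decreasing_ranks(column)]
    n = len(y)
    mean_y, mean_r = sum(y) / n, sum(powered) / n
    return -nu * sum((a - mean_y) * (b - mean_r) for a, b in zip(y, powered)) / n


def first_gini_pls_component(x_rows, y, nu=2.0):
    """Return (w1, t1, c1, residuals) for Step 1 of the generalized Gini-PLS."""
    columns = list(zip(*x_rows))
    raw = [cogini(y, col, nu) for col in columns]
    l1_norm = sum(abs(v) for v in raw)
    w1 = [v / l1_norm for v in raw]  # so that ||w1||_1 = 1
    t1 = [sum(w * v for w, v in zip(w1, row)) for row in x_rows]  # t1 = x w1
    # OLS of y on t1 without intercept: c1 = <t1, y> / <t1, t1>.
    c1 = sum(t * v for t, v in zip(t1, y)) / sum(t * t for t in t1)
    residuals = [v - c1 * t for v, t in zip(y, t1)]
    return w1, t1, c1, residuals


# Toy data: 4 judgments, 3 word weights, binary outcome (hypothetical values).
x = [[2.0, 0.0, 1.0], [0.0, 1.0, 1.0], [3.0, 0.0, 0.0], [0.0, 2.0, 1.0]]
y = [1, 0, 1, 0]
w1, t1, c1, eps1 = first_gini_pls_component(x, y)
```

Later components would repeat the same pattern on the residuals $\hat{\varepsilon}_{h-1}$ and $\hat{u}_{h-1}$; they are left out here to keep the sketch short.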
A cross validation makes it possible to find the optimal number $h > 1$ of components to retain. To test for a component $t_h$, we compute the model prediction with $h$ components including document $d$, $\hat{y}_{hd}$, and then without document $d$, $\hat{y}_{h(-d)}$. The operation is iterated for all $d$ varying from 1 to $N$: each time we remove the observation $d$, we re-estimate the model. To measure the significance of the model, we measure the predicted residual sum of squares issued from the model with $h$ components:
$$PRESS_h = \sum_{d=1}^{N} \left( y_d - \hat{y}_{h(-d)} \right)^2.$$
The sum of squared residuals of the model with $h-1$ components is:
$$RSS_{h-1} = \sum_{d=1}^{N} \left( y_d - \hat{y}_{(h-1)d} \right)^2.$$
The test statistic is:
$$Q_h^2 = 1 - \frac{PRESS_h}{RSS_{h-1}}.$$
The component $t_h$ is retained in the analysis if $\sqrt{PRESS_h} \le 0.95 \sqrt{RSS_{h-1}}$. In other terms, if $Q_h^2 \ge 0.0975 = (1 - 0.95^2)$, $t_h$ is significant in the sense that it improves the power of prediction of the model. In order to test for $t_1$, we use:
$$RSS_0 = \sum_{d=1}^{N} \left( y_d - \bar{y} \right)^2.$$
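The retention test can be illustrated with a few lines of Python. The sketch takes the leave-one-out predictions of the model with $h$ components and the in-sample predictions of the model with $h-1$ components as given; how these predictions are produced is outside its scope. The function names and the toy numbers are hypothetical.

```python
def q2_statistic(y, loo_predictions_h, predictions_h_minus_1):
    """Q2_h = 1 - PRESS_h / RSS_{h-1}."""
    press = sum((obs - pred) ** 2 for obs, pred in zip(y, loo_predictions_h))
    rss_prev = sum((obs - pred) ** 2 for obs, pred in zip(y, predictions_h_minus_1))
    return 1.0 - press / rss_prev


def keep_component(q2, threshold=1 - 0.95 ** 2):
    """A component is retained when Q2_h >= 0.0975 = 1 - 0.95**2."""
    return q2 >= threshold


# Testing the first component: RSS_0 uses the mean of y as the prediction.
y = [1, 0, 1, 0]
mean_y = sum(y) / len(y)
baseline = [mean_y] * len(y)
loo_h1 = [0.8, 0.3, 0.6, 0.1]  # hypothetical leave-one-out predictions
q2 = q2_statistic(y, loo_h1, baseline)
retained = keep_component(q2)
```

In this toy case the leave-one-out error is well below the baseline error, so the first component would be retained.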
As in the standard PLS regression, the $VIP_{hj}$ statistic is measured in order to select the words $x_j$ which have the most significant impact on the decision $y$. The most significant words are those with $VIP_{hj} > 1$, where:
$$VIP_{hj} := \sqrt{\frac{m \sum_{\ell=1}^{h} Rd(y; t_\ell)\, w_{\ell j}^2}{Rd(y; t_1, \ldots, t_h)}}$$
and
$$Rd(y; t_1, \ldots, t_h) := \frac{1}{m} \sum_{\ell=1}^{h} \operatorname{cor}^2(y, t_\ell) =: \sum_{\ell=1}^{h} Rd(y; t_\ell),$$
with $\operatorname{cor}^2(y, t_\ell)$ being the squared Pearson correlation between $y$ and component $t_\ell$. This information is back propagated into the model (only once) in order to obtain the optimal number of components (on training data). The target variable $y$ is then predicted as follows:
$$category(x) = \begin{cases} 0 & \text{if } \hat{y} < 0.5 \\ 1 & \text{otherwise.} \end{cases}$$

Generalized LOGIT-Gini-PLS

As can be seen in the generalized Gini-PLS algorithm, the weights $w_j$ come from the generalized co-Gini operator applied to a Boolean variable $y \in \{0, 1\}$. In order to find the weights $w_j$ which maximize the link between the words $x_j$ and the decision $y$, we propose to use the LOGIT regression, in other words, a sigmoid which is better adapted to Boolean variables. Thus, in each step of the Gini-PLS regression, we replace the maximization of the co-Gini by measuring the following conditional probability:
$$P(y_d = 1 \mid x = x_d) = \frac{\exp(x_d \beta)}{1 + \exp(x_d \beta)}$$
where $x_d$ is the $d$-th line of the matrix $x$ of the predictors (being the words in judgment $d$). The estimation of the vector $\beta$ is done by maximum likelihood. Therefore, at each step $h$ of the PLS algorithm, the weights $w_h$ are derived as follows:
$$w_h = \frac{\beta}{\|\beta\|_2}$$
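The derivation of the weights from a logistic fit can be sketched as follows. The optimizer is an assumption of this example: a plain batch gradient ascent on the log-likelihood with a fixed number of iterations, whereas the paper only states that $\beta$ is estimated by maximum likelihood. The function name, the learning rate, the iteration count, and the toy data are hypothetical.

```python
import math


def logit_weights(x_rows, y, learning_rate=0.1, n_iter=500):
    """Fit beta by gradient ascent on the log-likelihood, return beta / ||beta||_2."""
    n_features = len(x_rows[0])
    beta = [0.0] * n_features
    for _ in range(n_iter):
        gradient = [0.0] * n_features
        for row, target in zip(x_rows, y):
            score = sum(b * v for b, v in zip(beta, row))
            # Sigmoid: 1 / (1 + exp(-s)) equals exp(s) / (1 + exp(s)).
            prob = 1.0 / (1.0 + math.exp(-score))
            for j, v in enumerate(row):
                gradient[j] += (target - prob) * v
        beta = [b + learning_rate * g / len(y) for b, g in zip(beta, gradient)]
    l2_norm = math.sqrt(sum(b * b for b in beta))
    return [b / l2_norm for b in beta]


# Toy data: the first word signals acceptance, the second rejection.
x = [[2.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 2.0]]
y = [1, 0, 1, 0]
w = logit_weights(x, y)
```

Normalizing by the Euclidean norm keeps only the direction of $\beta$, so the scale of the fitted coefficients (which can grow on separable data such as this toy example) does not affect the resulting component.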
The generalized LOGIT-Gini-PLS algorithm is depicted in Algorithm 1.
Algorithm 1: Generalized LOGIT-Gini-PLS (training).

5. Experiments and Results

We discuss the performance of various popular algorithms and the impact of data quantity and imbalance, heuristics, and explicit restriction of judgments to sections (regions) related to the claim category, as well as their ability to ignore other requests in the judgment. These experiments also aim to compare the effectiveness of LOGIT-Gini-PLS with other machine learning techniques. As in Im et al. [33], we compare different combinations of classification algorithms and term weighting methods (used for text representation). The Python code of the Gini-PLS algorithms is available at https://github.com/tagnyngompe/taj-ginipls. These combinations represent 600 experimental configurations including:
  • 12 algorithms of classification: Naive Bayes (NB), Support Vector Machine (SVM), K-nearest neighbors (KNN), Linear and quadratic discriminant analysis (LDA/QDA), Tree, fastText, Naive Bayes SVM (NBSVM), generalized Gini-PLS (Gini-PLS), LOGIT-PLS [34], generalized LOGIT-Gini-PLS (LogitGiniPLS), and the usual PLS algorithm (StandardPLS);
  • 11 global weighting schemes (cf. Table 5): $\chi^2$, $dbidf$, $\Delta DF$, $dsidf$, $gss$, $idf$, $ig$, $mar$, $ngl$, $rf$, $avg_{global}$ (mean of the global metrics);
  • 6 local weighting schemes (cf. Table 6): $tf$, $tp$, $logtf$, $atf$, $logave$, and $avg_{local}$ (mean of the local metrics).

5.1. Assessment Protocol

Two assessment metrics are used: precision and $F_1$-measure. To take into account the imbalance between the classes, the macro-average is preferred (as suggested by a reviewer, the MCC could also be used to handle data imbalance). It is the aggregation of the individual contribution of each class. It is calculated from the macro-averages of the precision ($P_{macro}$) and of the recall ($R_{macro}$), which are calculated according to the average numbers of true positives ($\overline{TP}$), false positives ($\overline{FP}$), and false negatives ($\overline{FN}$) as follows [35]:
$$P_{macro} = \frac{\overline{TP}}{\overline{TP} + \overline{FP}}, \qquad R_{macro} = \frac{\overline{TP}}{\overline{TP} + \overline{FN}}.$$
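The two formulas can be illustrated with the short sketch below, which follows them literally: true positives, false positives, and false negatives are counted per class, averaged over the classes, and then combined. Combining precision and recall into $F_1$ through their harmonic mean is an assumption of this sketch, as are the function name and the toy labels.

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Precision, recall and F1 computed from class-averaged TP, FP and FN counts."""
    tp = fp = fn = 0.0
    for label in labels:
        tp += sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp += sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn += sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    # Average counts over the classes, as in the formulas.
    tp_bar, fp_bar, fn_bar = (v / len(labels) for v in (tp, fp, fn))
    precision = tp_bar / (tp_bar + fp_bar)
    recall = tp_bar / (tp_bar + fn_bar)
    # Harmonic mean of precision and recall (assumed definition of F1).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 1 = accept, 0 = reject (hypothetical labels and predictions).
y_true = [1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0]
precision, recall, f1 = macro_scores(y_true, y_pred)
```

A common alternative is to compute precision and recall separately for each class and then average those ratios; the two conventions generally give different values on imbalanced data, so it is worth checking which one a given library uses before comparing scores.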
The efficiency of algorithms often depends on the hyper-parameters for which optimal values must be determined. The scikit-learn [36] library implements two strategies for finding these values: randomized search (RandomizedSearchCV) and grid search (GridSearchCV). Despite its speed, the randomized search is non-deterministic, and the values it finds give a less accurate prediction than the default values. The same holds for the grid search, which is moreover very slow and therefore impractical in view of the large number of configurations to be evaluated. Consequently, the values used for the experiments are the default values (Table 7).

5.2. Classification on the Basis of the Whole Judgment

By representing the entire judgment using various vector representations, the algorithms are compared with the representations that are optimal for them. We note from the results of Table 8 that the trees are on average better for all the categories, even if their average $F_1$-measure is limited to 0.668. The results of PLS extensions are not very far from those of trees, with differences in $F_1$-measure around 0.1 (if we choose the right representation scheme).
The average $F_1$ scores of the NBSVM and fastText algorithms generally do not exceed 0.5 despite these algorithms being specially designed for texts. They are very sensitive to the imbalance of data between the classes (more rejections than acceptances). Furthermore, it is more difficult to detect the acceptance of the requests. Indeed, these algorithms classify all the test data with the majority label, i.e., rejection, and therefore hardly detect any request acceptance. The case of the categories doris and dcppc for the NBSVM ($F_{1,macro} = 0.834$) tends to demonstrate the strong sensitivity of these algorithms to negative cases, since the $F_1$-measure of “reject” is always higher than that of “accept” (Table 9).
PLS algorithms systematically exceed the performance ($F_1$-measure) of fastText and NBSVM by 10 to 20 points. This tends to demonstrate the effectiveness of PLS techniques in their role of dimension reduction. Gini-PLS algorithms do not perform any better than conventional PLS algorithms. Presumably, the reduction of dimensions still retains too much noise in the data. This is confirmed by the results of the trees, which remain very mixed: their $F_1$-measure (0.668) barely exceeds that of Logit-PLS (0.648). It therefore seems necessary to proceed with zoning in the judgment to better identify relevant information and thereby reduce the noise.

5.3. Classification Based on Sections of Judgments Including the Vocabulary of the Category

Since the judgments are related to several categories of claim, we experiment with restrictions of the textual content based on different types of regions in judgments. The first types of regions are sections of the judgment: the description of facts and proceedings (Facts and Proceedings), judges’ reasoning to justify their decisions (Opinion), the summary of judges’ decisions (Holding). The sections are identified using a text labeling method [37]. Other types of regions are statements extracted from the sections related to the category of claim. They express either a claim, a result, a previous result (result_a), judges’ reasoning about the category (context). These statements are extracted using regular expressions and are used in the restriction only if they contain a key-phrase of the category. The region-vector representation-algorithm combinations are compared in Table 10. The accuracy rate ( F 1 ) increases significantly with the reduction of the judgment to regions, except for the doris category. The best restriction combines regions including the vocabulary of the category in the section Facts and Proceedings (request and previous result), in the Opinion section (context), and in the Holding section (result).
It is noteworthy that restricting the training of the model to the section facts and proceedings corresponds to the prediction of the judge’s outcome. When additional information is employed to train the model, such as opinion and holding, the task is reduced to an identification or an extraction of the judge’s outcome.
After reducing the size of the judgment, the trees provide excellent results, followed very closely by our GiniPLS and LogitGiniPLS algorithms. For example, in the dcppc category (see Table 5), Tree performance ($F_1 = 0.985$) slightly exceeds the LogitPLS (0.94) and standard PLS (0.934) algorithms. In the category concdel, Tree performance ($F_1 = 0.798$) is still closely followed by LogitGiniPLS (0.703) and standard PLS (0.657) algorithms.
The most interesting case is concerned with neighborhood disturbances (doris category). These judgments often involve multiple information that is sometimes difficult to synthesize, even for humans. The argumentation exposed in doris is related to multiple information (problems of views, sunshine, trees, etc.) so that the factual elements conditioning the identification of the judicial outcomes are sometimes complex. This information can be either under-represented or over-represented depending on the vectorization scheme. Our GiniPLS algorithm (like our LogitGiniPLS) seems to be particularly suitable for this category of request. The F 1 -measures obtained in this category amount to 0.806 (for GiniPLS and LogitGiniPLS) and 0.772 for StandardPLS while the trees of decisions are not part of the relevant algorithms for this category of request (or among the best three algorithms). This result reinforces the idea that our GiniPLS algorithms can sometimes compete with decision trees that act as a benchmark in the literature dealing with small datasets. This result would make it possible in the future to consider including our GiniPLS algorithms in ensemble methods to broaden the spectrum of algorithms robust to outliers and which at the same time play a role of data compression.

6. Conclusions

This article attempted to simplify the extraction of the meaning of the result rendered by the judges on a request for a given category of claims. It consisted in formulating the problem as a task of judgments’ binary classification. Ten classification algorithms were tested over 55 methods of vector embeddings. We observed that the classification results were mainly influenced by three characteristics of our data. First, the very small number of training examples disadvantaged certain algorithms (sensitivity to outliers), such as fastText, which requires several thousand examples to update its parameters. Second, the strong imbalance between the classes (“accept” vs. “reject”) made it difficult to recognize the minority class which is generally the “accept” class. This was shown by the strong gap between the errors on “reject” and those on “accept”, as well as the good results obtained on dcppc. Finally, the presence of other claim categories in the judgment degraded the efficiency of the classification because the algorithms did not manage alone to find the elements in a direct relation with the analyzed category. This was demonstrated by the positive impact of the restriction of the content to be classified in certain particular regions of the decision, even if the appropriate restrictions depended on the category.
Decision trees were suitable for the classification task, but the use of Gini-PLS and Gini-Logit-PLS made it possible to obtain performances fairly close to those of trees and sometimes higher. It would be interesting to combine these variants of PLS algorithms, with others such as Sparse-PLS which could perhaps help to solve the problem of sparse vectors. There is also a large number of neural architectures for the classification of judgment and very large numbers of weighting metrics for the representation of texts, but none seem to fit all categories. Therefore, a study on the use of semantic embedding representations like Sent2Vec [38] or Doc2Vec [39] would be interesting.

Author Contributions

Conceptualization, G.T.-N.; Methodology, G.T.-N., S.M., G.Z., S.H. and J.M.; Software, G.T.-N.; Validation, G.T.-N.; Formal Analysis, S.M.; Resources, G.Z.; Original Draft Writing Preparation, G.T.-N., S.M., G.Z., S.H. and J.M.; Review and Editing, G.T.-N., S.M., G.Z., S.H. and J.M.; Visualization, G.T.-N.; Supervision, J.M., S.M., S.H. and G.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chalkidis, I.; Androutsopoulos, I. A Deep Learning Approach to Contract Element Extraction; JURIX: Luxembourg, 2017; pp. 155–164.
  2. Wei, F.; Qin, H.; Ye, S.; Zhao, H. Empirical study of deep learning for text classification in legal document review. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 3317–3320.
  3. Luo, B.; Feng, Y.; Xu, J.; Zhang, X.; Zhao, D. Learning to Predict Charges for Criminal Cases with Legal Basis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 2727–2736.
  4. Zhong, H.; Guo, Z.; Tu, C.; Xiao, C.; Liu, Z.; Sun, M. Legal Judgment Prediction via Topological Learning; EMNLP: Brussels, Belgium, 2018; pp. 350–354.
  5. Long, S.; Tu, C.; Liu, Z.; Sun, M. Automatic judgment prediction via legal reading comprehension. In Proceedings of the 18th China National Conference, Kunming, China, 18–20 October 2019; pp. 558–572.
  6. Guo, X.; Zhang, H.; Ye, L.; Li, S. RnRTD: Intelligent Approach Based on the Relationship-Driven Neural Network and Restricted Tensor Decomposition for Multiple Accusation Judgment in Legal Cases. Comput. Intell. Neurosci. 2019, 2019, 6705405.
  7. Chalkidis, I.; Androutsopoulos, I.; Aletras, N. Neural legal judgment prediction in English. arXiv 2019, arXiv:1906.02059.
  8. O’Sullivan, C.; Beel, J. Predicting the Outcome of Judicial Decisions made by the European Court of Human Rights. In Proceedings of the 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, Dublin, Ireland, 6–7 December 2018.
  9. Lage-Freitas, A.; Allende-Cid, H.; Santana, O.; de Oliveira-Lage, L. Predicting Brazilian court decisions. arXiv 2018, arXiv:1905.10348.
  10. Tagny Ngompé, G. Méthodes d’Analyse Sémantique de Corpus de Décisions Jurisprudentielles. Ph.D. Thesis, IMT Mines Ales, Ales, France, 2020.
  11. Mussard, S.; Souissi-Benrejab, F. Gini-PLS Regressions. J. Quant. Econ. 2018, 1–36.
  12. Salton, G.; Buckley, C. Term-weighting Approaches In Automatic Text Retrieval. Inf. Process. Manag. 1988, 24, 513–523.
  13. Sparck Jones, K. A statistical interpretation of term specificity and its application in retrieval. J. Doc. 1972, 28, 11–21.
  14. Wu, H.; Salton, G. A comparison of search term weighting: Term relevance vs. inverse document frequency. In Proceedings of the 4th Annual International ACM SIGIR Conference on Information Storage and Retrieval: Theoretical Issues in Information Retrieval; ACM: New York, NY, USA, 1981; Volume 16, pp. 30–39.
  15. Jones, K.S.; Walker, S.; Robertson, S.E. A Probabilistic Model Of Information Retrieval: Development And Comparative Experiments. Inf. Process. Manag. 2000, 36, 809–840.
  16. Yang, Y.; Pedersen, J.O. A Comparative Study on Feature Selection in Text Categorization; ICML: Broken Arrow, OK, USA, 1997; Volume 97, pp. 412–420.
  17. Lan, M.; Tan, C.L.; Su, J.; Lu, Y. Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 721–735.
  18. Schütze, H.; Hull, D.A.; Pedersen, J.O. A comparison of classifiers and document representations for the routing problem. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, USA, 9–13 July 1995; pp. 229–237.
  19. Ng, H.T.; Goh, W.B.; Low, K.L. Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, PA, USA, 27–31 July 1997; ACM: New York, NY, USA, 1997; Volume 31, pp. 67–73.
  20. Galavotti, L.; Sebastiani, F.; Simi, M. Experiments on the use of feature selection and negative evidence in automated text categorization. In Proceedings of the International Conference on Theory and Practice of Digital Libraries, Lisbon, Portugal, 18–20 September 2000; Springer: Berlin/Heidelberg, Germany, 2000; pp. 59–68.
  21. Marascuilo, L.A. Large-sample multiple comparisons. Psychol. Bull. 1966, 65, 280.
  22. Paltoglou, G.; Thelwall, M. A study of information retrieval weighting schemes for sentiment analysis. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010; Association for Computational Linguistics: Stroudsburg, PA, USA, 2010; pp. 1386–1395.
  23. Manning, C.D.; Raghavan, P.; Schütze, H. Scoring, term weighting and the vector space model. In Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2009; Chapter 6; pp. 109–133.
  24. Wold, H. Estimation of Principal Components and Related Models by Iterative Least Squares; Multivar. Anal. Academic Press: Cambridge, MA, USA, 1966; pp. 391–420.
  25. Lacroux, A. Les avantages et les limites de la méthode «Partial Least Square» (PLS): Une illustration empirique dans le domaine de la GRH. Rev. Gest. Ressour. Hum. 2011, 80, 45–64.
  26. Kroll, C.N.; Song, P. Impact of multicollinearity on small sample hydrologic regression models. Water Resour. Res. 2013, 49, 3756–3769.
  27. Liu, Y.; Rayens, W. PLS and dimension reduction for classification. Comput. Stat. 2007, 22, 189–208.
  28. Durif, G.; Modolo, L.; Michaelsson, J.; Mold, J.E.; Lambert-Lacroix, S.; Picard, F. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 2017, 34, 485–493.
  29. Bazzoli, C.; Lambert-Lacroix, S. Classification based on extensions of LS-PLS using logistic regression: Application to clinical and multiple genomic data. BMC Bioinform. 2018, 19, 314.
  30. Zeng, X.Q.; Wang, M.W.; Nie, J.Y. Text classification based on partial least square analysis. In Proceedings of the 2007 ACM Symposium on Applied Computing, Seoul, Korea, 11–15 March 2007; pp. 834–838.
  31. Schechtman, E.; Yitzhaki, S. A family of correlation coefficients based on the extended Gini index. J. Econ. Inequal. 2003, 1, 129–146.
  32. Olkin, I.; Yitzhaki, S. Gini regression analysis. Int. Stat. Rev./Rev. Int. Stat. 1992, 60, 185–196.
  33. Im, C.J.; Kim, D.W.; Mandl, T. Text Classification for Patents: Experiments with Unigrams, Bigrams and Different Weighting Methods. Int. J. Contents 2017, 13, 66–74.
  34. Tenenhaus, M. La régression logistique PLS. In Modèles Statistiques Pour Données Qualitatives; Droesbeke, J.-J., Lejeune, M., Saporta, G., Eds.; Editions Technip: Paris, France, 2005; Chapter 12; pp. 263–276.
  35. Van Asch, V. Macro- and Micro-Averaged Evaluation Measures; Technical Report; Computational Linguistics & Psycholinguistics (CLiPS): Antwerpen, Belgium, 2013.
  36. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  37. Tagny Ngompé, G.; Harispe, S.; Zambrano, G.; Montmain, J.; Mussard, S. Detecting Sections and Entities in Court Decisions Using HMM and CRF Graphical Models. In Advances in Knowledge Discovery and Management: Volume 8; Pinaud, B., Guillet, F., Gandon, F., Largeron, C., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 61–86.
  38. Pagliardini, M.; Gupta, P.; Jaggi, M. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. In Proceedings of the NAACL 2018 Conference of the North American Chapter of the Association for Computational Linguistics, New Orleans, LA, USA, 1–6 June 2018.
  39. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1188–1196.
Figure 1. Number of claims in judgments.
Figure 2. Distribution of judges’ decisions within each category of claims.
Figure 3. Distribution of the size of the decision by tokens.
Figure 4. Generalized Gini-PLS algorithm.
Table 1. Categories of claims of the study.
| Dataset | Description | Number of Judgments |
|---|---|---|
| ACPA | Civil fine for abuse of process | 246 |
| CONCDEL | Damages for unfair competition | 238 |
| DANAIS | Damages for abuse of process | 421 |
| DCPPC | Declaration of claim to liabilities of the collective procedure | 218 |
| DORIS | Damages for neighborhood disturbance | 164 |
| STYX | Irrecoverable expenditure | 123 |
Table 2. Extract from “Cour d’appel, Paris, Pôle 6, chambre 9, 18 Mai 2016 n° 14/11380”.
| Section | In French | In English |
|---|---|---|
| claim | À l’audience, la SA SFP reprenant oralement ses conclusions visées par le greffier, demande à la cour de: - confirmer le jugement déféré; - débouter M. S. de l’ensemble de ses demandes; - le condamner à payer une amende civile de 1.500 € pour procédure abusive en application de l’article 32-1 du code de procédure civile; - le condamner à payer la somme... | At the hearing, SA SFP, orally restating its submissions as endorsed by the clerk, requests the court to: - confirm the judgment under appeal; - dismiss all of Mr S.’s requests; - order him to pay a civil fine of €1500 for abuse of process pursuant to article 32-1 of the Code of Civil Procedure; - order him to pay the sum... |
| decision | PAR CES MOTIFS LA COUR, CONFIRME le jugement déféré en toutes ses dispositions; Y ajoutant, DIT n’y avoir lieu à application des dispositions de l’article 700 du code de procédure civile; REJETTE le surplus des demandes; CONDAMNE M Khellil S. aux dépens d’appel. | FOR THESE REASONS, THE COURT CONFIRMS the judgment under appeal in all its provisions; adding to it, RULES that there is no need to apply the provisions of article 700 of the Code of Civil Procedure; REJECTS the remaining requests; ORDERS Mr Khellil S. to pay the costs of the appeal. |
Table 3. Class distributions per claim category.
| Dataset | Accepted | Rejected | Total |
|---|---|---|---|
| ACPA | 6 (26.09%) | 17 (73.91%) | 23 |
| CONCDEL | 4 (22.22%) | 14 (77.78%) | 18 |
| DANAIS | 21 (11.23%) | 166 (88.77%) | 187 |
| DCPPC | 48 (66.66%) | 24 (33.33%) | 72 |
| DORIS | 23 (52.27%) | 21 (47.72%) | 44 |
| STYX | 4 (33.33%) | 8 (66.67%) | 12 |
Table 4. Notation used in formulas.
| Notation | Description |
|---|---|
| $t$ | a term |
| $d$ | a judgment (document) |
| $\lvert d \rvert$ | size of $d$ (number of tokens) |
| $c$ | a label |
| $\bar{c}$ | the other labels |
| $D$ | the global set of documents ($N = \lvert D \rvert$) |
| $D_c$ | the set of documents labeled with $c$ |
| $D_{\bar{c}}$ | the set of documents not labeled with $c$ |
| $N_t$ | the number of documents containing $t$ |
| $N_{\bar{t}}$ | the number of documents without $t$ |
| $N_{t,c}$ | the number of documents of $c$ with $t$ |
| $N_{\bar{t},c}$ | the number of documents of $c$ without $t$ |
| $N_{t,\bar{c}}$ | the number of documents of $\bar{c}$ with $t$ |
| $N_{\bar{t},\bar{c}}$ | the number of documents of $\bar{c}$ without $t$ |
| $DF_{t \mid c}$ | proportion of documents of $c$ with $t$ ($DF_{t \mid c} = \frac{N_{t,c}}{\lvert D_c \rvert}$) |
| $DF_{c \mid t}$ | proportion of documents of $c$ in the global set of documents with $t$ |
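The counts of Table 4 can be obtained from a labeled corpus with a single pass over the documents. The following is a minimal sketch, not the authors' implementation: the function name `contingency_counts`, the dictionary keys, and the four-document toy corpus are all hypothetical and serve only to illustrate the notation.

```python
def contingency_counts(docs, labels, term, label):
    """Return the document counts of Table 4 for one (term, label) pair.

    docs   : list of token lists, one per judgment (hypothetical input format)
    labels : list of class labels aligned with docs
    """
    has_t = [term in set(tokens) for tokens in docs]   # does d contain t?
    in_c = [y == label for y in labels]                # is d labeled with c?
    pairs = list(zip(has_t, in_c))
    n_t_c = sum(1 for h, c in pairs if h and c)                # N_{t,c}
    n_t_notc = sum(1 for h, c in pairs if h and not c)         # N_{t,c-bar}
    n_nott_c = sum(1 for h, c in pairs if not h and c)         # N_{t-bar,c}
    n_nott_notc = sum(1 for h, c in pairs if not h and not c)  # N_{t-bar,c-bar}
    return {
        "N": len(docs),
        "N_t": n_t_c + n_t_notc,
        "N_nott": n_nott_c + n_nott_notc,
        "D_c": n_t_c + n_nott_c,
        "D_notc": n_t_notc + n_nott_notc,
        "N_t_c": n_t_c,
        "N_t_notc": n_t_notc,
        "N_nott_c": n_nott_c,
        "N_nott_notc": n_nott_notc,
    }


# Toy corpus (invented for illustration only).
docs = [
    ["demande", "rejette"],
    ["condamne", "amende"],
    ["rejette", "surplus"],
    ["condamne", "payer"],
]
labels = ["reject", "accept", "reject", "accept"]

counts = contingency_counts(docs, labels, term="rejette", label="reject")
df_t_given_c = counts["N_t_c"] / counts["D_c"]   # DF_{t|c}
print(counts, df_t_given_c)
```

In this toy corpus the term appears in both "reject" documents and in no "accept" document, so $DF_{t \mid c} = 1$.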
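Table 5. Global weighting metrics.
| Description | Formula |
|---|---|
| Inverse document frequency (IDF) [13] | $idf(t) = \log_2 \frac{N}{N_t}$ |
| Probabilistic IDF [14] | $pidf(t) = \log_2 \left( \frac{N}{N_t} - 1 \right)$ |
| BM25 IDF [15] | $bidf(t) = \log_2 \frac{N_{\bar{t}} + 0.5}{N_t + 0.5}$ |
| Frequency difference | $\Delta DF(t,c) = DF_{t \mid c} - DF_{t \mid \bar{c}}$ |
| Information gain [16] | $ig(t,c) = \frac{N_{t,c}}{N} \log_2 \frac{N_{t,c}\,N}{N_t\,\lvert D_c \rvert} + \frac{N_{\bar{t},c}}{N} \log_2 \frac{N_{\bar{t},c}\,N}{N_{\bar{t}}\,\lvert D_c \rvert} + \frac{N_{t,\bar{c}}}{N} \log_2 \frac{N_{t,\bar{c}}\,N}{N_t\,\lvert D_{\bar{c}} \rvert} + \frac{N_{\bar{t},\bar{c}}}{N} \log_2 \frac{N_{\bar{t},\bar{c}}\,N}{N_{\bar{t}}\,\lvert D_{\bar{c}} \rvert}$ |
| Relevance frequency [17] | $rf(t,c) = \log \left( 2 + \frac{N_{t,c}}{\max(1, N_{t,\bar{c}})} \right)$ |
| $\chi^2$ coefficient [18] | $\chi^2(t,c) = \frac{N \left( N_{t,c} N_{\bar{t},\bar{c}} - N_{t,\bar{c}} N_{\bar{t},c} \right)^2}{N_t\,N_{\bar{t}}\,\lvert D_c \rvert\,\lvert D_{\bar{c}} \rvert}$ |
| Correlation coefficient [19] | $ngl(t,c) = \frac{\sqrt{N} \left( N_{t,c} N_{\bar{t},\bar{c}} - N_{t,\bar{c}} N_{\bar{t},c} \right)}{\sqrt{N_t\,N_{\bar{t}}\,\lvert D_c \rvert\,\lvert D_{\bar{c}} \rvert}}$ |
| GSS coefficient [20] | $gss(t,c) = N_{t,c} N_{\bar{t},\bar{c}} - N_{t,\bar{c}} N_{\bar{t},c}$ |
| Marascuilo coefficient [21] | $mar(t,c) = \frac{\left( N_{t,c} - N_t \lvert D_c \rvert / N \right)^2 + \left( N_{t,\bar{c}} - N_t \lvert D_{\bar{c}} \rvert / N \right)^2 + \left( N_{\bar{t},c} - \lvert D_c \rvert N_{\bar{t}} / N \right)^2 + \left( N_{\bar{t},\bar{c}} - N_{\bar{t}} \lvert D_{\bar{c}} \rvert / N \right)^2}{N}$ |
| Smoothed IDF delta [22] | $dsidf(t,c) = \log_2 \frac{\lvert D_{\bar{c}} \rvert \left( N_{t,c} + 0.5 \right)}{\lvert D_c \rvert \left( N_{t,\bar{c}} + 0.5 \right)}$ |
| BM25 IDF delta [22] | $dbidf(t,c) = \log_2 \frac{\left( \lvert D_{\bar{c}} \rvert - N_{t,\bar{c}} + 0.5 \right) \left( N_{t,c} + 0.5 \right)}{\left( \lvert D_c \rvert - N_{t,c} + 0.5 \right) \left( N_{t,\bar{c}} + 0.5 \right)}$ |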
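Several of the global metrics in Table 5 depend only on the four cells of the term/label contingency table, so they can be computed together. The sketch below is an illustration under stated assumptions rather than the authors' code: the function name `global_weights`, its argument names, and the toy counts are hypothetical, and the formulas follow the usual reading of IDF, BM25 IDF, frequency difference, $\chi^2$, GSS and smoothed IDF delta with the notation of Table 4.

```python
import math


def global_weights(n_tc, n_tnc, n_ntc, n_ntnc):
    """Global term weights from the four contingency cells.

    n_tc   = N_{t,c}          n_tnc  = N_{t,c-bar}
    n_ntc  = N_{t-bar,c}      n_ntnc = N_{t-bar,c-bar}
    Assumes the term occurs in at least one document, does not occur in
    all of them, and that both classes are non-empty (no zero division).
    """
    n_t = n_tc + n_tnc        # N_t
    n_nt = n_ntc + n_ntnc     # N_{t-bar}
    d_c = n_tc + n_ntc        # |D_c|
    d_nc = n_tnc + n_ntnc     # |D_{c-bar}|
    n = n_t + n_nt            # N
    cross = n_tc * n_ntnc - n_tnc * n_ntc
    return {
        "idf": math.log2(n / n_t),
        "bidf": math.log2((n_nt + 0.5) / (n_t + 0.5)),
        "delta_df": n_tc / d_c - n_tnc / d_nc,
        "chi2": n * cross ** 2 / (n_t * n_nt * d_c * d_nc),
        "gss": cross,
        "dsidf": math.log2(d_nc * (n_tc + 0.5) / (d_c * (n_tnc + 0.5))),
    }


# Toy counts (invented): 100 documents, 50 per class,
# the term occurs in 30 documents of c and 10 documents of c-bar.
weights = global_weights(n_tc=30, n_tnc=10, n_ntc=20, n_ntnc=40)
for name, value in weights.items():
    print(f"{name}: {value:.4f}")
```

With these counts the term is clearly associated with $c$: the frequency difference is 0.6 - 0.2 = 0.4, and the supervised weights ($\chi^2$, GSS, smoothed IDF delta) are all positive, whereas the unsupervised IDF ignores the labels entirely.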
Table 6. Local weighting metrics.
| Description | Formula |
|---|---|
| Raw term frequency [12] | $tf(t,d)$ = number of occurrences of $t$ in $d$ |
| Presence of the word [12] | $tp(t,d) = 1$ if $tf(t,d) > 0$, $0$ otherwise |
| Log normalization | $logtf(t,d) = 1 + \log tf(t,d)$ |
| Augmented and normalized frequency of the word [12] | $atf(t,d) = k + (1 - k) \frac{tf(t,d)}{\max_{t' \in V} tf(t',d)}$ |
| Normalization based on the average frequency of the word [23] ($avg$ is the average) | $logave(t,d) = \frac{1 + \log tf(t,d)}{1 + \log avg_{t' \in V}\, tf(t',d)}$ |
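The local metrics of Table 6 can be sketched in a few lines. Several choices below are assumptions that the table does not state: the smoothing constant `k = 0.5`, the use of the natural logarithm, taking the average only over terms that actually occur in the document, and returning 0 for the log-based weights when the term is absent. The function name `local_weights` and the toy token list are hypothetical.

```python
import math
from collections import Counter


def local_weights(tokens, term, k=0.5):
    """Local weights of `term` in one non-empty document.

    k = 0.5 is an assumed smoothing constant, not a value from the article.
    """
    counts = Counter(tokens)
    tf = counts[term]                             # raw term frequency
    max_tf = max(counts.values())                 # most frequent term in d
    avg_tf = sum(counts.values()) / len(counts)   # average over terms present in d
    logtf = 1 + math.log(tf) if tf > 0 else 0.0   # assumption: 0 for absent terms
    return {
        "tf": tf,
        "tp": 1 if tf > 0 else 0,
        "logtf": logtf,
        "atf": k + (1 - k) * tf / max_tf,
        "logave": logtf / (1 + math.log(avg_tf)),
    }


# Toy document (invented): "cour" x3, "rejette" x2, "demande" x1.
tokens = ["cour", "rejette", "cour", "demande", "cour", "rejette"]
local = local_weights(tokens, "rejette")
print(local)
```

In this toy document the average term frequency is 2, which equals the frequency of "rejette", so its `logave` weight is exactly 1; a final term weight is then typically the product of one local and one global metric, as in the familiar tf-idf scheme.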
Table 7. Values of the hyperparameters of the algorithms.
| Algorithms | Hyperparameters |
|---|---|
| SVM | $C = 1.0$; $\gamma = \frac{1}{\lvert V \rvert \times var(X)}$; $kernel = RBF$ |
| KNN | $k = 5$ |
| LDA | $solver = svd$, $n\_components = 10$ |
| QDA | no regularization of the covariance estimate |
| Tree | Gini criterion |
| NBSVM | n-grams of 1 to 3 words |
| Gini-PLS | $h_{max} = 10$ |
| Logit-PLS | $h_{max} = 10$ |
| Gini-Logit-PLS | $h_{max} = 10$; $\nu = 14$ |
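For the algorithms of Table 7 that exist in scikit-learn [36], the settings could be expressed roughly as follows. This is a hedged sketch, not the authors' configuration files: the dictionary name `models` is hypothetical, and it assumes a recent scikit-learn release. In scikit-learn, `gamma="scale"` evaluates to 1 / (n_features × X.var()), which corresponds to the $\gamma$ of the table when $\lvert V \rvert$ is the number of features. The LDA `n_components = 10` setting is deliberately left out, because recent versions reject values above min(n_classes - 1, n_features), which is 1 for a binary task. NBSVM, fastText and the PLS variants are not part of scikit-learn and are not shown.

```python
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical mapping from the Table 7 rows to scikit-learn estimators.
models = {
    # gamma="scale" -> 1 / (n_features * X.var()) at fit time
    "SVM": SVC(C=1.0, kernel="rbf", gamma="scale"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    # n_components omitted: it cannot exceed 1 for two classes
    "LDA": LinearDiscriminantAnalysis(solver="svd"),
    # reg_param=0.0 means no regularization of the covariance estimate
    "QDA": QuadraticDiscriminantAnalysis(reg_param=0.0),
    "Tree": DecisionTreeClassifier(criterion="gini"),
}

for name, model in models.items():
    print(name, model)
```

Each estimator would then be fitted on the weighted document-term matrix with the usual `fit(X, y)` and `predict(X)` calls.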
Table 8. Comparison of word representations and algorithms to detect the judicial outcome.
| Representation | Algorithm | $F_1$ | Min | Cat. Min | Max | Cat. Max | Best($F_1$) − $F_1$ | Max − Min | Rank |
|---|---|---|---|---|---|---|---|---|---|
| tf-gss | Tree | 0.668 | 0.5 | doris | 0.92 | dcppc | 0 | 0.42 | 1 |
| tf-avg_global | LogitPLS | 0.648 | 0.518 | danais | 0.781 | dcppc | 0.02 | 0.263 | 13 |
| tf-avg_global | StandardPLS | 0.636 | 0.49 | danais | 0.836 | dcppc | 0.032 | 0.346 | 24 |
| tf-ΔDF | GiniPLS | 0.586 | 0.411 | danais | 0.837 | dcppc | 0.082 | 0.426 | 169 |
| tf-ΔDF | LogitGiniPLS | 0.578 | 0.225 | styx | 0.772 | dcppc | 0.09 | 0.547 | 220 |
| - | NBSVM | 0.494 | 0.4 | styx | 0.834 | dcppc | 0.174 | 0.434 | |
| - | fastText | 0.412 | 0.343 | doris | 0.47 | danais | 0.256 | 0.127 | |
Table 9. Evaluation of fastText and NBSVM for detecting judicial outcomes for each claim category.
| Cat. | Algo. | Prec. | Prec. equi. | err-0 | err-1 | $F_1(0)$ | $F_1(1)$ | $F_1$ macro |
|---|---|---|---|---|---|---|---|---|
| dcppc | NBSVM | 0.875 | 0.812 | 0 | 0.375 | 0.916 | 0.752 | 0.834 |
| danais | fastText | 0.888 | 0.5 | 0 | 1 | 0.941 | 0 | 0.47 |
| danais | NBSVM | 0.888 | 0.5 | 0 | 1 | 0.941 | 0 | 0.47 |
| concdel | fastText | 0.775 | 0.5 | 0 | 1 | 0.853 | 0 | 0.437 |
| concdel | NBSVM | 0.775 | 0.5 | 0 | 1 | 0.873 | 0 | 0.437 |
| acpa | fastText | 0.745 | 0.5 | 0 | 1 | 0.853 | 0 | 0.426 |
| acpa | NBSVM | 0.745 | 0.5 | 0 | 1 | 0.853 | 0 | 0.426 |
| doris | NBSVM | 0.5 | 0.492 | 0.167 | 0.85 | 0.63 | 0.174 | 0.402 |
| dcppc | fastText | 0.667 | 0.5 | 0 | 1 | 0.8 | 0 | 0.4 |
| styx | fastText | 0.667 | 0.5 | 0 | 1 | 0.8 | 0 | 0.4 |
| styx | NBSVM | 0.667 | 0.5 | 0 | 1 | 0.8 | 0 | 0.4 |
| doris | fastText | 0.523 | 0.5 | 0 | 1 | 0.686 | 0 | 0.343 |

0 = “reject” and 1 = “accept”; Cat.: category of claim; Algo.: algorithm; err-0: error rate on “reject”; err-1: error rate on “accept”; Prec.: global precision ($accuracy = \frac{TP}{N}$); Prec. equi.: balanced precision, $\frac{1}{2} \left( accuracy(0) + accuracy(1) \right)$.
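Many rows of Table 9 share the pattern err-0 = 0 and err-1 = 1, which is what a classifier produces when it always predicts “reject”. The sketch below works through the arithmetic for that degenerate case, using the DANAIS class counts of Table 3 (21 accepted, 166 rejected). The function name `always_reject_scores` is hypothetical, and applying the Table 3 counts to the evaluation setting is an assumption made for illustration.

```python
def always_reject_scores(n_accept, n_reject):
    """Scores of a classifier that labels every judgment "reject" (class 0)."""
    total = n_accept + n_reject
    accuracy = n_reject / total          # only the "reject" documents are correct
    acc_0, acc_1 = 1.0, 0.0              # per-class accuracy: err-0 = 0, err-1 = 1
    balanced = 0.5 * (acc_0 + acc_1)     # "Prec. equi."
    precision_0 = n_reject / total       # everything is predicted "reject"
    recall_0 = 1.0                       # every true "reject" is found
    f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
    f1_1 = 0.0                           # "accept" is never predicted
    return {
        "accuracy": accuracy,
        "balanced": balanced,
        "f1_0": f1_0,
        "f1_1": f1_1,
        "f1_macro": (f1_0 + f1_1) / 2,
    }


scores = always_reject_scores(n_accept=21, n_reject=166)
print({name: round(value, 3) for name, value in scores.items()})
# {'accuracy': 0.888, 'balanced': 0.5, 'f1_0': 0.941, 'f1_1': 0.0, 'f1_macro': 0.47}
```

These values coincide with the danais rows of Table 9, which illustrates why accuracy alone is misleading under strong class imbalance and why the balanced precision and the macro-averaged $F_1$ are reported alongside it.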
Table 10. Accuracy of the classification with restriction of judgments to specific regions.
| Category | Region | Representation | Algorithm | $F_1$ |
|---|---|---|---|---|
| acpa | claim_result_a_result_context | tf-dbidf | Tree | 0.846 |
| acpa | facts and proceedings_opinion_holding | tf-dbidf | StandardPLS | 0.697 |
| acpa | facts and proceedings_opinion_holding | tf-avg_global | LogitPLS | 0.683 |
| concdel | facts and proceedings_opinion_holding | tf-gss | Tree | 0.798 |
| concdel | opinion | tf-idf | LogitGiniPLS | 0.703 |
| concdel | context | logave-dbidf | StandardPLS | 0.657 |
| danais | claim_result_a_result_context | avg_local-χ² | Tree | 0.813 |
| danais | claim_result_a_result_context | atf-avg_global | LogitPLS | 0.721 |
| danais | claim_result_a_result_context | atf-avg_global | StandardPLS | 0.695 |
| dcppc | claim_result_a_result_context | tf-χ² | Tree | 0.985 |
| dcppc | claim_result_a_result_context | tf-χ² | LogitPLS | 0.94 |
| dcppc | facts and proceedings_opinion_holding | tp-mar | StandardPLS | 0.934 |
| doris | facts and proceedings_opinion_holding | tp-dsidf | GiniPLS | 0.806 |
| doris | facts and proceedings_opinion_holding | tp-dsidf | LogitGiniPLS | 0.806 |
| doris | facts and proceedings_opinion_holding | atf-ig | StandardPLS | 0.772 |
| styx | opinion | tf-dsidf | Tree | 1 |
| styx | claim_result_a_result_context | logave-dsidf | LogitGiniPLS | 0.917 |
| styx | facts and proceedings_opinion_holding | tf-rf | GiniPLS | 0.833 |

MDPI and ACS Style

Tagny-Ngompé, G.; Mussard, S.; Zambrano, G.; Harispe, S.; Montmain, J. Identification of Judicial Outcomes in Judgments: A Generalized Gini-PLS Approach. Stats 2020, 3, 427-443. https://doi.org/10.3390/stats3040027
