Neural Legal Outcome Prediction with Partial Least Squares Compression

: Predicting the outcome of a case from a set of factual data is a common goal in legal knowledge discovery. In practice, solving this task is most of the time difﬁcult due to the scarcity of labeled datasets. Additionally, processing long documents often leads to sparse data, which adds another layer of complexity. This paper presents a study focused on the french decisions of the European Court of Human Rights (ECtHR) for which we build various classiﬁcation tasks. These tasks consist ﬁrst of all in the prediction of the potential violation of an article of the convention, using extracted facts. A multiclass problem is also created, with the objective of determining whether an article is relevant to plead given some circumstances. We solve these tasks by comparing simple linear models to an attention-based neural network. We also take advantage of a modiﬁed partial least squares algorithm that we integrate in the aforementioned models, capable of effectively dealing with classiﬁcation problems and scale with sparse inputs coming from natural language tasks.


Introduction
With the emergence of deep learning algorithms (neural networks), significant works have been done related to the legal domain and predictive justice. While classification tasks are important topics [1], some applications have been developed around, such as information extraction [2], legal norms classification [3] or topic classification [4]. Predicting the outcome of a case from factual data is a major topic in this literature, as it can offer great benefits to practitioners and can also be useful for citizens. If a lawyer or a judge can estimate the likelihood of an outcome, it can translate into greater efficiency, better access to justice, and greater fairness assuming this tool plays a supporting role.
The task of predicting the outcome has received much attention recently. In specific situations, high accuracies have been achieved [5] mainly using deep neural networks [6,7] which are able to significantly outperform standard and widely used algorithms like logistic regression, tree based models or SVM [8]. Some attention has recently been paid to the European Court of Human Rights (ECtHR) for a number of reasons. One of them is the accessibility of court decisions, another one has to do with the ability to create an automatic annotation process of these judgments. Solving tasks related to the legal domain requires experts to manually put labels on documents, it is often expensive and time consuming, leading to scarce data. ECtHR decisions possess a specific structure which can be exploited using simple regex, making them a good candidat for automatic legal analysis and forecasting. As the literature focuses on english decisions, some performances comparisons have already been made around standard classification algorithms [9,10] and deep learning approaches using state-of-art models [11,12] which require high computation cost compared to linear classifiers.
In the paper, we compare several algorithms on french ECtHR decisions on different binary tasks where the target is the outcome (violation of a given article). Another task is built as a multiclass problem and aims to find which article has potentially been violated. We do not voluntarily perform meaningful processing outside of tokenization and extraction of facts and circumstances in order to  The majority of documents follow this structure, in both french and english versions. Note that the available decisions are not always translated, resulting in differences between the datasets coming from the two languages. Related papers [9,10] which focus on english already have different inputs because the extraction process is tricky and slightly differs between the two studies, new decisions are also periodically added to the database which may explain differences in size.

Extraction
Extraction is simple in theory because we can easily extract headings and paragraphs. In practice we observe a certain variability between documents which makes automatic extraction more challenging. Some titles may change or some section may move or disappear for no specific reason. For example, we sometimes find the procedure inside the fact section, or the fact section does exist but we cannot extract its subsections which should feed our models. The second challenge is removing ambiguous outcomes. As our main task is to find out whether a given article of the convention has been violated or not, we need to filter the decisions which have different outcomes for the same article. Everything is done using plain regex to first detect headings and paragraphs, check their validity, extract all the outcomes and compare them and their associated articles. The overall extraction procedure follows seven steps: • We scrape all french decisions available.
• We check using multiple regex if the four sections titles (and some variations) can be found inside the decision to make sure everything is well located and available. • We check if both facts subsections are well defined using regex also, namely circumstances and relevant domestic laws. • We process the outcome section (operative provision) of the case by extracting sentences containing words "violation" and "article". • We process all these sentences by listing all named articles that belong to the European convention using a regex. • For each retrieved article, we infer the outcome (violation or non-violation). This is simple given that these sentences always have the same structure (e.g., "Holds that there has been a violation of Article 6 of the Convention." in english). • We remove all decisions where a given article has been violated and non violated at the same time.
This happens if there are multiple claims based on the same article in a given decision.
After extraction, we only keep decisions related to six articles which are also the most frequent ones to ensure a sufficient number of observations: article 3 (torture or inhuman and degrading treatment), article 5 (right to liberty and security), article 6 (right to a fair trial), article 8 (respect for private and family life), article 10 (freedom of expression) and article 13 (access to justice).

Datasets
For each selected article, we build a dataset consisting of valid decisions and their associated binary label (violation or non-violation of the article). The extraction process yields Table 1. As shown, the outcome is highly unbalanced and favors most of the time the claimant against the respondent state. It is however influenced by our extraction process as we have removed ambiguous cases where we can find violation and non violation of a given article at the same time. For each selected set of judgments, we extract the circumstances of the case and the entire fact section as we only want to predict a fact-based outcome. The stack dataset is used to test a multiclass problem by trying to find, from a set of facts, whether a given item is relevant to the case.
To more accurately measure model performances and follow a similar approach as described in the related papers [9,10], we balance the datasets to ensure the same number of violations and non-violations. Note that circumstances and facts can last over a hundred sentences in few situations. They are also closely related: the facts are actually circumstances plus domestic legal arguments which are very heterogeneous as the applicants come from different countries.
We also ensure that the stack dataset is a single label problem by removing cases where two or more articles are relevant, however is it not balanced. Each article representing 13%, 10%, 35%, 22%, 11% and 9% of the stack respectively.
In order to solve these tasks, we compare various models and a transformer based neural network architecture. We also test a modified and optimized partial least squares algorithm (PLS) suitable for classification tasks and sparse input as we rely on it to skip any preprocessing outside of tokenization.

Word Embeddings
Word embeddings are a fundamental building block for solving natural language tasks as they provide a way to represent textual data in a low dimensional space. Given a sentence made up of words, subwords or characters, a word embedding maps a discrete representation into a vector of real values. For a sequence of tokens X = [x 1 , x 2 , . . . , x t ] ∈ R t×d , where t is the length of the sequence, d the vocabulary size and x i a one-hot vector, a linear pretrained word embedding provides a weight matrix W ∈ R d×d with d d such that it can project the sequence into a lower dimensional space. This embedding matrix can either be estimated in an unsupervised fashion using compression models like PCA or SVD, or through a self supervised approach with a neural network. In Word2Vec [14], the best-known method, the matrix is computed by solving a large classification task: finding the correct word given its context. Although this model is simple, scalable and efficient to estimate with backpropagation, it lacks certain properties. Firstly, it cannot handle out of vocabulary tokens, the tokens also have a fixed representation, thus it is difficult to disambiguate a word which may have a different meaning depending the sentence. Secondly, the model is fully linear which limits its capabilities, it also uses a small context window which cannot handle long term dependencies. Word embeddings have been improved several times by replacing words with sub-words (FastText [15]) and combining them to handle unseen tokens. Taking advantage of the contextual and the sequential aspect of sentences has also been improved by adding non linear connections and recurrent layers [16], namely LSTM [17], which are able to process and store information in memory cells. Finally, significant progress has been made with the emergence of transformer [13] based word embeddings like BERT [18], which relies on a masked objective and a stack of self attention modules: each token in a given sentence is weighted by all other elements, allowing fast convergence and strong performances across a wide range of tasks.

Attention Mechanism
While recurrent neural networks like LSTM are able to model long term dependencies, they require a sequential processing of the input, which is expensive and difficult to optimize for long sequences. The attention mechanism partially solves this problem by computing all dependencies in parallel using a score (context) matrix. This context matrix stores information between all pairs of tokens taken from two inputs. These scores are squashed using a softmax function to get a weighting matrix representing a probability distribution ( Figure 2).

Heads
x 1 x 2 x 3 x 4 Scores Values Attention can be computed between different sources, generally called query, key and value, respectively Q ∈ R t×d , K ∈ R t×d and V ∈ R t×d where t is the sequence length of the source, d the embedding dimension and sigma a softmax function: Using attention once for each pair of sentences is not enough to fully capture dependencies. In practice the computation is done several times in parallel by dividing the input into smaller pieces called heads. The embedding dimension can be decomposed into h heads of size d h such that d = h × d h . In the self attention setup, Q, K and V usually have the same shape and are distinct linear projection from a same input X ∈ R t×d . Given W q , W k , W v ∈ R d×d m learned matrices with d m the hidden dimension of the projection subspace, self attention is computed head wise as: In practice, these products are also done batch wise on a set of examples at the same time. In order to stack multiples sentences of different length in a single tensor, they are artificially padded with zeros.
As computing the score matrix is memory intensive as its size depends on t 2 , we use a slightly modified version which compute sequentially two score matrices in order to reduce the overall computational cost [19]: Finally we build a transformer architecture by stacking the attention mechanism to various feed forward layers with a skip connection [20] and a normalization: The transformer layer is represented in Figure 3: Feed forward X X Figure 3. A transformer layer.

The Reduction Algorithm
As we deal with sparse inputs, we want to rely on a dimensional reduction tool to densify our matrix. Note that in most natural language tasks, the number of features can be extremely high due to the size of the vocabulary. One way to solve this problem in a supervised manner is to rely on the PLS algorithm and its NIPALS variant [21] which is known to provide independent, very informative components and is fairly efficient to compute. However the algorithm has few caveats: it is built for regression and sometimes underperforms for classification problems, it has to loop over the target columns which is slow for multitargets tasks, it does rely on an iterative process which can cause some variations during the training phase. Many alternatives has been proposed such as the logit PLS [22] or the PLS-DA [23] but they can be expensive to compute for large inputs, especially the first one. We propose an alternative algorithm which relies on a point biserial covariance.

Point Biserial Covariance
Let X ∈ R n×m , a real matrix of regressors x j (j = 1, . . . , m) where x (j,i) denotes the ith observation (i = 1, . . . , n) of the jth variable and y ∈ {0, 1} n a target binary vector. All regressors are assumed to be centered and scaled to unit variance. The point-biserial r pb ∈ [−1, 1] does compute a correlation between a continuous and a dummy variable and is defined as : where n (1) and n (0) denote each class sample size, s the standard deviation andx is the jth variable conditional mean which can be efficiently computed as : Point-biserial covariance is defined as : More generally, the point biserial covariance matrix is similar to a vanille covariance matrix and can be efficiently computed as: cov pb (X, y) = X ỹ (8) Assumingỹ comes from:ỹ This formula can be extended to a larger binary matrix Y ∈ {0, 1} n×k which can be a target matrix for a multiclass or a multilabel classification task. In these conditions, we define n (0) , n (1) ∈ N 1×k for which n (1) i (i = 1, . . . , k) denotes the number of individuals with value 1 in column i of matrix Y. Thus, with denoting a row wise product the computation ofỸ is done as follow:

Binary Algorithm
The algorithm consists of computing h orthogonal latent variables t 1 , . . . , t h ∈ R n×1 purged from multicollinearity and able to explain y's binary class behavior. The first step consists of finding a weight vector w 1 ∈ R m×1 in order to compute the first component t 1 : The weight vector is found by maximizing the biserial covariance between t 1 and y in a similar manner as in the vanilla PLS algorithm: The solution to this program is straightforward, assuming C = X ỹ: To ensure the orthogonality between components, each variable x j from X is deflated by running a simple linear regression and extracting its residuals for the next step. For j = 1 to m: In practice, the deflation is done in parallel by solving all linear regressions at the same time using the OLS estimator β ∈ R 1×m with a regularization parameter γ ≥ 0: More generally, the input matrixX d at step d ∈ [1, . . . , h], withX 0 = X, is equal to: The maximization and the computation of a new component is done again using the deflated input matrix. The same process is followed until we reach the require number of latent variables. Note that for step d ≥ 2, the projection weight vector w d does project the deflated matrix and not the original input. By unrollingX d , we can extract the projection matrix W * = [w * 1 , . . . , w * h ] ∈ R m×h such that the component matrix T = [t 1 , . . . , t h ] ∈ R n×h : where w * 1 = w 1 and for d ≥ 2: In most situations, these weights are computed using a matrix P ∈ R m×m such that: The overall algorithm is shown in the next table.
The f function generally comes from a logistic regression which is parameterized by θ. As this algorithm is also linear, we can make a class prediction by computing: Where σ is a sigmoid function and θ bias is the intercept of the model. Note that in the PLS logistic model, each component requires to solve m logistic regressions. As this algorithm relies on backpropagation, it is computationally expensive even for moderately sized matrices.

General Algorithm
While Algorithm 1 works well for a one dimensional target y, it has some drawbacks for a multiclass problem. For a k classes target matrix Y ∈ {0, 1} n×k , it does compute k components at the same time which are not independent in this situation. One natural way to compute a single component is to solve the following problem instead: which can be rewritten as: max Setting C =Ỹ X , the Lagrangian is written as: Yielding the following eigen problem: In practice, this problem is solved by using an iterative algorithm called NIPALS which does approximate the solution using multiples projection matrices. While being very fast on small problems, it does not scale well for large inputs. Additionally, it requires to loop over Y columns which is inefficient outside binary problems. Solving the eigen problem is computationally expensive if m is large or if m > k which is the case in the majority of classification tasks. We solve the transposed problem instead to improve efficiency: with w * e denoting the eigenvector associated to the highest eigen value λ * , we immediately get the unormalized weight:

11:
Update P matrix: P ← P(I m − w d β d ) 12: end if 13: end for 14: 15: Estimate the model: y = f (XW * ; θ) + Note that if k = 1, w * e is a scalar, thus we get the same formula as in the binary algorithm. To further improve efficiency, we do not use a projection matrix P ∈ R m×m to compute the optimal weights W * because the product P d (I m − w d β d ) (d = 1, . . . , h) is expensive if m is large which is often the case with natural language tasks. One way to improve performances is to compute the whole W * matrix at the end of the algorithm. In practice, the component matrix T = XW * can be written as [24]: with B = [β 1 , . . . , β h ] and W = [w 1 , . . . , w h ]. Thus we can compute the final projection weight matrix W * : The overall algorithm is written in the following table:

Scalability
Most PLS models do not scale well to medium sized or large inputs due to their iterative nature. In practice, Algorithm 2 does scale better overall if we increase both n or m (Figure 4).
We also plot a lower precision variant which had no effect on performances presented in the next section. NIPALS is actually efficient if the target is one dimensional. However is fails to scale for larger targets as shown in Figure 5.
NIPALS can be volatile as it requires some convergence before computing the next component. Note also that the fast PLS variant complexity highly depends on k, more precisely on min(k, m), due to the eigenvalue decomposition. To limit this dependency, we may rely on batching on n and m or k which will increase the number of required components but will also considerably reduce the complexity. In practice, the number of classes k is often very limited except outside word embeddings estimations.

Algorithm 2 Fast PLS algorithm
Require: X ∈ R n×m , x j centered and scaled ∀j ∈ (1, . . . , m) Require: Y ∈ {0, 1} n×k a binary matrix Require: n (0) , n (1) ∈ N 1×k class count vectors Require: γ ≥ 0 a regularization parameter Require: h the number of components Solve the eigen problem (CC − λI k )w = 0 Get the highest eigen value λ * and its associated eigen vector w e Store normalized weights: Solve the eigen problem (C C − λI m )w = 0 Get the highest eigen value λ * and its associated eigen vector w d Store normalized weights:

Experiments and Results
We first compare different linear models using different inputs to show how sparsity can affect performances. We focus on two main tasks, the first relative to the violation of an article given a set of facts or circumstances which is a binary classification task. The second aims to find which of the convention articles is appropriate given factual informations, in this case it is a multiclass problem.
All these tasks are challenging, even more for the one with large base vocabulary size and small sample size. For example, the dataset related to article 5 only contains 122 documents for a vocabulary greater than 40,000 words for the circumstances and 50,000 for the facts which show how heterogeneous can the input be. Most of the time, there is no obvious relation between several documents from the same source. Few of the datasets are more manageable, the one related to article 13 has a smaller vocabulary size and rare words are less common which may improve overall performances.
For the first set of tasks, we use a simple preprocessing. We first tokenize at the word level by splitting the sequences on white spaces and punctuation then we build a binary count matrix and a TF-IDF representation. Since sentences are very long, we limit the vocabulary size to the 25,000 most frequent words. While using smaller sizes like 15 or 20 thousands words gives close performances, sparsity is less pronounced for simple classifiers. Choosing a size of 25,000 offered the best compromise between performance and sparsity to fully show how the PLS model handles this effect.
For the multiclass problem, we extend the limit to 50,000 words to further increase sparsity. Note that we are only relying on unigrams and we do not add any additional preprocessing step compared to the related literature [9,10], punctuation and stop words are not removed in this experiment. We also suppose that the modified PLS algorithm is able to discriminate and keep most informative components without the need of vocabulary selection, hand-crafted features or heuristics.
We compare three simple linear models: a logistic regression, a ridge regression where the regularization parameter as been optimized and a linear SVM. The choice of linear models is similar to related works which are relying on linear SVM. Since the sample size is very limited most of the time, these models are less prone to overfitting compared to tree-based approaches or kernel-based approaches which are expensive to compute for a large feature space.
We intentionally limit the number of PLS components to h = 8 in order to show the efficiency of the model given thousand of features. All experiments are compared using accuracy and a 10-cross validation approach.
For information, we copy the results from the related literature [9] obtained on english documents and similar tasks using a linear SVM (Table 2). Beside a significant difference on article 8 which is difficult to explain for the circumstances, the results presented in the following section on french documents are similar to what has been achieved in the literature for linear models at least. Note that english is known to be grammatically simpler, which may improve overall performances.

Binary Input
Results for binary inputs are reported in Tables 3 and 4. As expected at least for the binary cases, running the models without a PLS reduction bring slightly better results overall but with a higher computational cost. Computing components then estimating a model is significantly faster than running the model on raw inputs for the logistic regression and the SVM. Note that we only use 8 components and we are already achieving very close performances. For the multiclass problem, standard algorithms already fail to generalize to new inputs due to sparsity (50 k features). Regularization does not fix the problem as ridge regression does not bring strong performances.
Running models on full facts does significantly increase sentences length which provide additional information but also adds some noise due to domestic legal arguments. In most situations, performances increase slightly except for article 5 where standard models struggle to keep their previous accuracy.
For the multiclass problem, we do not observe any significant difference in accuracy using different inputs. As shown in Table 5, some articles are easier to predict such as article 10 which refers to the freedom of expression and article 3 relating to torture or inhuman treatment. The reason is linked to the structure of their respective vocabulary, with predominant themes such as violence or the media, making it easier to detect them. On the other hand, articles 5 related to the right to liberty are generally more vague with a much broader vocabulary. Finally, one can observe that as in the binary case, classes with a lot of rare words tend to be more difficult to recognize.

TF-IDF Input
We now run the same experiment but replace the binary input with a TD-IDF representation. The input size is the same as before but does provide a different information as it relies on frequencies and inverse frequencies. The results for binary inputs are reported in Tables 6 and 7. The additional variance due to the nature of the input significantly affects the overall performances without PLS reduction. Ridge regression has slightly higher accuracy due to the regularization parameter. We observe that the PLS algorithm is still consistent and even show better scores because the TF-IDF representation is generally more informative than a simple binary matrix. Accuracy on the multiclass problem is even worse and is very close to the frequency of the larger class of this dataset which is related to article 6.
We observe the same problem as before. PLS reduction does bring efficiency and consistency. Computing components and adding a linear classifier on top provides very similar performances regardless of the model, this is a consequence of a low number of components. Note that increasing h did not provide significant gains on average. The choice should be based on the dataset, but we choose to keep h the same to make fair comparisons.
On the multiclass task (Table 8), the same behavior as in Table 5 is generally observed with slightly better performances with facts as inputs, this was not the case before.

Neural Approach
The neural network approach is different as it tales advantage of the temporal aspect thanks to the attention mechanism. The choice of attention is indeed natural considering the current state of the NLP literature which does rely entirely on transformer-based models. Training a RNN as an alternative is not appropriate here because convergence is extremely slow when sequences are very long. Another approach would have been to only use feedforward layers, but it requires using some averaging to remove the sequential aspect. Using a mean embedding over thousand of tokens provides a very noisy representation as all words are equally weighted in this case.
We estimate our models by mixing two different inputs, one comes from a pretrained FastText (128 dimensions) which has been trained on 10Gb of french legal documents (2 billions tokens). The second is either a binary input or a TF-IDF input. We use the attention mechanism on the embedding inputs while we simply rely on feedforward layers for the other one. We stack everything up before doing our prediction. Using this combination generally improves speed of convergence as one part of the network does focus on word discrimination and acts like a simple linear model while the second part is oriented toward disambiguation and the handling of rare words. We rely on cross entropy as the loss function, we stack 3 layers of transformers and use Adam [25] as the optimizer with a learning rate of 10 −3 and a weigth decay of 0.01.
We also train similar models with a dimensional reduction provided by the PLS algorithm. All results are reported in Tables 9 and 10. The use of neural networks greatly improves performance, but the computational cost is also much higher. Applying a reduction does hurt performances with a loss of 1.5 points in accuracy in average but the model is also easier and much faster to train which can be a good trade off in some situations.
We observe a similar behavior on facts. In either case, the binary or TF-IDF have little effect on performances due to the embedding which may already incorporate similar information.
In the multiclass task, one can observe huge performance gains over simple linear models with more than 10 points difference on average for each individual accuracy (Table 11). The most difficult classes to predict remain the same compared to those observed without the use of a neural network and the attention mechanism.

General Observations
Overall, article 13 is the easiest to predict while article 8 is the most difficult. For attention-based approaches, article 5 seems easier to predict compared to simple linear models as we can observe a significant performance gain from 10 to 15 points. If we pay attention to the overall vocabulary size, datasets with many rare words tend to be harder to predict for simpler models as they cannot properly handle words that appear only once. On the other hand, word embeddings are able to infer these words because they are trained on a large corpus before and may have seen them during pre-training. This advantage is also visible for the multiclass task where the overall vocabulary is very large (more than 100 K words) with a majority of rare words.
The natural ability of neural networks to disambiguate expressions also helps a lot when simplest models cannot properly handle sequences. However, while attention-based approaches are in theory able to provide some explainability, the exploitation of attention weights makes few sense for very long sequences (thousands of words). Most of them are actually very small due to the fact that they are positive and add up to one, they also provide very noisy outputs which cannot be properly exploited by practitioners.

Conclusions
This paper attempts to provide a dimension reduction algorithm for classification tasks when applying a model to raw inputs can lead to a bad solution or a high computational cost. We show that this PLS variant retains goods performances with varying sparse inputs size and can also be used to reduce the number of parameters of a large neural networks with limited effects on accuracy. Finally we observe that linear models while being simple and computationally cheap, can provide decent performances on complex tasks even when the sample size is small and the feature space large. However they are largely outclassed by sophisticated neural networks, capable of correctly processing sequences and handling rare words using a word embedding pretrained on a large corpus.