Communication
Peer-Review Record

Article 700 Identification in Judicial Judgments: Comparing Transformers and Machine Learning Models

Stats 2024, 7(4), 1421-1436; https://doi.org/10.3390/stats7040083
by Sid Ali Mahmoudi, Charles Condevaux, Guillaume Zambrano and Stéphane Mussard *
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 14 August 2024 / Revised: 2 October 2024 / Accepted: 7 November 2024 / Published: 26 November 2024
(This article belongs to the Special Issue Machine Learning and Natural Language Processing (ML & NLP))

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The article presents a well-executed analysis and comparison of machine learning models for identifying Article 700 in judicial decisions, particularly highlighting the efficacy of transformer models like Judicial CamemBERT. The methodologies are sound, and the results are clearly presented, demonstrating the model's superiority in handling long legal documents. However, there are a few minor issues that need correction: references in lines 104 and 129 are marked with question marks and should be appropriately cited or corrected. Additionally, the accuracy figure of 99.2% mentioned in line 300 needs clarification regarding its derivation. Once these issues are addressed, the article would be suitable for publication. 

Author Response

Thank you very much for your report. The errors are now corrected.

Reviewer 2 Report

Comments and Suggestions for Authors

The paper concerns a very important application of machine learning to predictive justice. Its main contribution lies in applying the methods to a novel application and case study. To reach the quality bar required for publication, the authors should satisfactorily reply to the following:

a) Polish the paper to fix omissions, typos, and misspellings. For example, '?' question marks appear in lieu of references.

b) Extend the model comparison to include not only accuracy but also robustness (e.g., stability of the results as the training/test sample varies).

c) Mention, possibly as future work, an extension of the evaluation criteria: not only accuracy and sustainability (robustness) but also explainability and fairness, quoting appropriate references on SAFE artificial intelligence models and/or on AI risk measurement.

d) Is there variability between the two expert annotators? Can that be included in the model?

 

e) The model behind the results in Table 7 should be better explained. For example, are the different models run only on texts that contain either the reason, conclusion, claim, or judgment?

Comments on the Quality of English Language

The language is good, but check for typos.

Author Response

The paper concerns a very important application of machine learning to predictive justice. Its main contribution lies in applying the methods to a novel application and case study. To reach the quality bar required for publication, the authors should satisfactorily reply to the following:

a) Polish the paper to fix omissions, typos, and misspellings. For example, '?' question marks appear in lieu of references.

Thank you very much for your report. The errors are now corrected.

b) Extend the model comparison to include not only accuracy but also robustness (e.g., stability of the results as the training/test sample varies).

=> 5-fold cross-validation experiments were added for both the binary classification and JC models. The results are now quite different, so we have added new comments about the performance (Tables 7 to 9 and Table 10, pages 9-11).
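For readers less familiar with the protocol, a minimal sketch of such a 5-fold cross-validation robustness check for a TF-IDF-based binary classifier might look as follows (the data and classifier below are hypothetical placeholders, not the paper's annotated corpus or exact models):

```python
# Minimal sketch of a 5-fold cross-validation robustness check, assuming a
# TF-IDF + logistic-regression binary classifier (toy data, not the paper's).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

texts = ["condamne X au titre de l'article 700 du code de procédure civile",
         "rejette la demande et statue sur les dépens"] * 50   # toy examples
labels = [1, 0] * 50                                            # 1 = Article 700 present

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(model, texts, labels, cv=cv, scoring=["accuracy", "f1"])

# Mean and spread across folds indicate how stable the results are
# when the training/test split varies.
print("accuracy: %.3f +/- %.3f" % (scores["test_accuracy"].mean(),
                                   scores["test_accuracy"].std()))
print("F1:       %.3f +/- %.3f" % (scores["test_f1"].mean(),
                                   scores["test_f1"].std()))
```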

c) Mention, possibly as future work, an extension of the evaluation criteria: not only accuracy and sustainability (robustness) but also explainability and fairness, quoting appropriate references on SAFE artificial intelligence models and/or on AI risk measurement.

=> Future work and extensions were added to the conclusion on XAI and the use of Shapley values to assign scores to the important words and expressions related to the prediction (page 12).
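As a rough illustration of that direction: for a linear classifier over TF-IDF features, the Shapley value of a feature reduces, under a feature-independence assumption, to its coefficient times its deviation from the background mean. A minimal sketch of word-level scores (the pipeline and variable names are hypothetical; the revised paper may rely on a dedicated attribution library instead):

```python
# Minimal sketch: Shapley-style word contributions for a linear model over
# TF-IDF features. Under feature independence, phi_j = w_j * (x_j - E[x_j]).
import numpy as np

def top_word_contributions(pipeline, background_texts, text, k=10):
    """Return the k words contributing most to the prediction for `text`."""
    vectorizer = pipeline.steps[0][1]   # fitted TfidfVectorizer
    clf = pipeline.steps[-1][1]         # fitted linear classifier with .coef_
    X_bg = vectorizer.transform(background_texts).toarray()
    x = vectorizer.transform([text]).toarray()[0]
    phi = clf.coef_[0] * (x - X_bg.mean(axis=0))    # per-feature Shapley value
    vocab = np.array(vectorizer.get_feature_names_out())
    order = np.argsort(-np.abs(phi))[:k]
    return list(zip(vocab[order], phi[order]))
```

This covers only the linear special case; it is shown here to make the planned use of Shapley values concrete, not as the authors' implementation.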

d) Is there variability between the two expert annotators? Can that be included in the model?

=>Kappa has been calculated for 100 documents taken at random (page 6).
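For context, agreement of this kind between two annotators is typically reported as Cohen's kappa; a minimal sketch with hypothetical labels (not the paper's actual annotations):

```python
# Minimal sketch: Cohen's kappa between two annotators on a random sample
# of documents (1 = Article 700 claim present, 0 = absent). Labels are
# hypothetical placeholders.
from sklearn.metrics import cohen_kappa_score

annotator_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_2 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

print("Cohen's kappa: %.2f" % cohen_kappa_score(annotator_1, annotator_2))
```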

e) The model behind the results in Table 7 should be better explained. For example, are the different models run only on texts that contain either the reason, conclusion, claim, or judgment?

=> Table 7 is now Table 6, and more clarifying text was added. Each line corresponds to the input text of the binary model.

 

Reviewer 3 Report

Comments and Suggestions for Authors

The paper reports on detecting mentions of Article 700 in legal documents.

The topic holds relevance for the predictive justice community, although I question the relevance of this topic for the Stats journal, which is focused on statistics, probability, stochastic processes and innovative applications of statistics. Also, the proposed approach lacks novelty, since it employs existing machine learning models.

The task of identifying mentions of Article 700 is highly specific and its broader utility remains unclear, apart from lucrative purposes on the part of lawyers (as suggested by the authors). The authors suggest that identifying these mentions could aid in predicting litigation outcomes, but they do not properly substantiate this claim with literature references or a preliminary study that demonstrates the connection between mentions of Article 700 and litigation outcomes.

Concerning the data used in this study, the authors mention an annotation process but omit crucial details, such as the annotation guidelines, the demographics of the annotators, and anonymization practices. There is no information on how the data were collected, the structure of the documents, the frequency of Article 700 mentions annotated in the whole documents or its subparts, or the agreement between annotators. These omissions leave significant gaps in the dataset description and its value for future research. Furthermore, the dataset is not made publicly available nor do the authors mention this aspect, affecting the reproducibility of their experiments. 

The methodology section is also confusing. It’s unclear what baseline is being used, as there are two sections labelled "baseline" (3.2 and 4.2), each describing different approaches: TF-IDF and a set of ML models with n-grams as features. The features of these models are not clearly explained, and there’s no indication of how many documents were used to fine-tune Judicial CamemBERT. Additionally, not all models are reported in the results tables, leaving the reader without a complete understanding of the comparative analysis.

Lastly, since the reference to juridical camembert is missing, I can only assume that the authors relied on Legal-CamemBERT-base (https://huggingface.co/maastrichtlawtech/legal-camembert-base), which is pre-trained on legal articles from the Belgian legislation, not the French one. If the authors actually relied on this model, I believe that this mismatch might have an impact on the results. The authors should discuss whether this foreign training corpus impacts the performance of their model.

While the paper addresses a possibly relevant task, it suffers from significant gaps in clarity, presentation, and rigour. Further revisions are needed to improve the coherence of the methodology, ensure a thorough presentation of the results, and provide a clearer discussion of the contributions and broader implications of the work.

Below, I discuss in more detail several major issues that need to be addressed. 

- Firstly, Article 700 should be better introduced and described. The significance of Article 700 in French law should be clearly explained from the beginning. This is not common knowledge, especially for an international audience, and assumptions should not be made.

Additionally, I wonder what is the generalisability of this issue outside the French legal system, or if this is relevant only in France. Obviously, this should not be considered a limitation of this work. However, the authors should discuss the generalizability of their task and method to other jurisdictions.

- Figures and References: There is a sample caption for Figure 2, but the figure itself is missing. Additionally, numerous references are missing, showing up as '?'.  While such errors can occur when writing a paper, the authors should check that the version submitted to the journal does not present so many distraction issues. Their frequency gives the impression of carelessness towards the journal and the reviewers, which is unpleasant.

- The paper also suffers from an overall lack of references, especially in the introduction and motivation sections, where claims should be backed by relevant literature.

- page 2, points 1 and 2: The paper implies that simple models perform comparably to more complex and computationally heavy models for this task. If this is the authors' main claim, it should be explicitly stated and discussed in greater depth, as it would be an important and impactful finding.

- page 6, preprocessing steps: The preprocessing steps are unclear. Initially, it seems these steps are applied to the data, but later it appears they might just be examples of possible normalizations. The authors need to clarify this. Additionally, the claim that stemming and lemmatization yield identical results in French (a morphologically rich language) is questionable. If true, this claim should be supported with evidence and discussed further.

- Tables. Tables 1 and 2 are redundant. Table 3 could be improved by including details such as the average length of text segments and the frequency of Article 700 mentions. Also, Table 4 is referenced as being in the appendix, but is placed at page 8 instead.

- Paper title: Please do not use the abbreviation (ML) in the title.

- page 4: The problem of document length exceeding 512 tokens, while familiar to those working with language models, is not universally known. Since the authors address this issue explicitly, the paper should discuss this problem for readers unfamiliar with such constraints.

- pag 4: "irrelevant parts". The term "irrelevant parts" is vague. The paper should define what is considered irrelevant in the context of the legal documents analyzed.

- page 5: The paper incorrectly refers to TF-IDF as a model, when in fact it is a metric. Unless the authors are referring to an ML model that exploits the TF-IDF score as a feature... In this case, 'TF-IDF model' can be used as a shorter name for the model, but it should be made explicit earlier in the text.

- "analysis of Article 700", "importance of the task underlying the Article 700". What is the task underlying art700, and in which sense does the layer analyse it? I get the general sense of these expressions, but I would argue that they are too vague. 

- page 6 "CLAIM" and "JUDGMENT": these classes are not clearly defined, so it is not clear to which part of the document they correspond.

- pag 7 "the previous results may suffer...": which are the previous results?

- The paper incorrectly uses "one-grams" instead of the "unigrams".

- page 9: F-scores are presented in some tables on a 0-1 scale, but elsewhere as percentages. Consistency should be maintained across the paper.

- page 9 "both models trained on the section CONCLUSION provide better scores compared with the baseline". If the results of the baseline model are those reported in Table 5, then Table 6 shows equal or lower performance on reasons and conclusions annotations.  Additionally, results are not provided for all models described earlier in the paper, leading to an incomplete analysis.

- Conclusions. The conclusion section lacks depth. It should not only summarize the results but also discuss unresolved challenges, limitations, and possible future directions for the research.

Comments on the Quality of English Language

There are only a few minor typos in this paper, so the overall grammatical quality of the English language is good.   

Author Response

Firstly, Article 700 should be better introduced and described. The significance of Article 700 in French law should be clearly explained from the beginning. This is not common knowledge, especially for an international audience, and assumptions should not be made.

Additionally, I wonder what is the generalisability of this issue outside the French legal system, or if this is relevant only in France. Obviously, this should not be considered a limitation of this work. However, the authors should discuss the generalizability of their task and method to other jurisdictions.

=> Thank you very much for your report. We have added many things in this revision thanks to your comments.

 

=> The first paragraph of the Motivation section (page 2), which explains the importance of Article 700, has been rewritten to be clearer and more understandable, with an example of usage.

=> We have added 13 experiments corresponding to 13 claim categories (Table 11, page 11) with the same models (binary classifiers and Transformers). We note that both the binary classification models and the Judicial CamemBERT models can detect other categories as well as Article 700, and sometimes even better, even though the datasets are smaller. We can therefore say that the technique generalizes to other French law categories. For other languages, other versions of BERT such as multilingual BERT can be employed, since CamemBERT is pretrained on French. Otherwise, the TF-IDF-based models can handle other languages, provided the dataset is sufficiently large.

 

- Figures and References: There is a sample caption for Figure 2, but the figure itself is missing. Additionally, numerous references are missing, showing up as '?'.  While such errors can occur when writing a paper, the authors should check that the version submitted to the journal does not present so many distraction issues. Their frequency gives the impression of carelessness towards the journal and the reviewers, which is unpleasant.

=> The errors were corrected.

- The paper also suffers from an overall lack of references, especially in the introduction and motivation sections, where claims should be backed by relevant literature.

=> We added some references that we consider relevant, especially in the introduction and motivation sections. See references 3, 4, 8, and 27.

 

- page 2, points 1 and 2: The paper implies that simple models perform comparably to more complex and computationally heavy models for this task. If this is the authors' main claim, it should be explicitly stated and discussed in greater depth, as it would be an important and impactful finding.

=> Binary classification models are robust: they perform as well as the CamemBERT base model and beat Judicial CamemBERT in the other categories. We have added a comment on page 10 (good results for the binary models, but at the cost of human annotations).

 

- page 6, preprocessing steps: The preprocessing steps are unclear. Initially, it seems these steps are applied to the data, but later it appears they might just be examples of possible normalizations. The authors need to clarify this. Additionally, the claim that stemming and lemmatization yield identical results in French (a morphologically rich language) is questionable. If true, this claim should be supported with evidence and discussed further.

=> We have clarified the preprocessing steps (page 5), and the ambiguous passage has been removed. The steps are now listed explicitly (with lemmatization only).
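A minimal sketch of such a preprocessing pipeline (lowercasing, stop-word removal, lemmatization only, no stemming), assuming spaCy's fr_core_news_sm model and NLTK's French stop-word list; the exact steps in the paper may differ:

```python
# Minimal sketch of French text preprocessing: lowercase, remove stop words,
# lemmatize (no stemming). Assumes `python -m spacy download fr_core_news_sm`
# and `nltk.download("stopwords")` have been run.
import spacy
from nltk.corpus import stopwords

nlp = spacy.load("fr_core_news_sm", disable=["parser", "ner"])
french_stops = set(stopwords.words("french"))

def preprocess(text):
    doc = nlp(text.lower())
    return [tok.lemma_ for tok in doc
            if tok.is_alpha and tok.text not in french_stops]

print(preprocess("Condamne la société X aux dépens au titre de l'article 700"))
```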

- Tables. Tables 1 and 2 are redundant. Table 3 could be improved by including details such as the average length of text segments and the frequency of Article 700 mentions. Also, Table 4 is referenced as being in the appendix, but is placed at page 8 instead.

=> Table 2 has been removed; the former Table 3 (now Table 2) has been improved by adding the average section length and the frequency of 'Article 700' mentions (page 7).

- Paper title: Please do not use the abbreviation (ML) in the title.

=> The title has been changed to “ART 700 identification in Judicial judgments: Comparing Transformers and machine learning models”.

- page 4: The problem of document length exceeding 512 tokens, while familiar to those working with language models, is not universally known. Since the authors address this issue explicitly, the paper should discuss this problem for readers unfamiliar with such constraints.

=> A paragraph explaining the problem has been added (page 10).
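To make the constraint concrete for readers unfamiliar with it: BERT-style encoders such as CamemBERT accept at most 512 subword tokens per input, so full judgments usually have to be truncated or split into chunks. A minimal sketch using the public camembert-base tokenizer (not the paper's fine-tuned Judicial CamemBERT checkpoint):

```python
# Minimal sketch: measuring a judgment against CamemBERT's 512-token limit
# using the public camembert-base tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
judgment_text = "Attendu que ..."  # a full judicial decision; often thousands of tokens

n_tokens = len(tokenizer(judgment_text)["input_ids"])
print("subword tokens:", n_tokens)  # full judgments frequently exceed 512

# Simplest mitigation: truncate to the first 512 tokens before encoding.
encoded = tokenizer(judgment_text, truncation=True, max_length=512,
                    return_tensors="pt")
```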

- pag 4: "irrelevant parts". The term "irrelevant parts" is vague. The paper should define what is considered irrelevant in the context of the legal documents analyzed.

=> An explanation has been added on page 4.

- page 5: The paper incorrectly refers to TF-IDF as a model, when in fact it is a metric. Unless the authors are referring to an ML model that exploits the TF-IDF score as a feature... In this case, 'TF-IDF model' can be used as a shorter name for the model, but it should be made explicit earlier in the text.

=>Error corrected (TF-IDF vectorization or TF-IDF-based models).
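For concreteness, a "TF-IDF-based model" in this sense couples TF-IDF vectorization with a standard classifier; a minimal sketch (the classifier choice and toy data are illustrative, not necessarily those of the paper):

```python
# Minimal sketch of a TF-IDF-based binary model for Article 700 detection.
# Classifier choice and toy data are illustrative only.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

tfidf_model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),   # unigrams and bigrams
    ("clf", LinearSVC()),
])

train_texts = ["condamne X à payer 1000 euros au titre de l'article 700",
               "rejette la demande fondée sur l'article 700",
               "statue sur les dépens sans autre condamnation"]
train_labels = [1, 1, 0]

tfidf_model.fit(train_texts, train_labels)
print(tfidf_model.predict(["au titre de l'article 700 du code de procédure civile"]))
```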

- "analysis of Article 700", "importance of the task underlying the Article 700". What is the task underlying art700, and in which sense does the layer analyse it? I get the general sense of these expressions, but I would argue that they are too vague. 

=> This refers to the modified paragraph in the Motivation section (page 2, with reference [8]).

- page 6 "CLAIM" and "JUDGMENT": these classes are not clearly defined, so it is not clear to which part of the document they correspond.

=> The CLAIM section description has been modified (in the Data subsection, page 6).

- pag 7 "the previous results may suffer...": which are the previous results?

=>Error corrected, page 7.

- The paper incorrectly uses "one-grams" instead of the "unigrams".

=>Error corrected

- page 9: F-scores are presented in some tables on a 0-1 scale, but elsewhere as percentages. Consistency should be maintained across the paper.

=>Error corrected

- page 9 "both models trained on the section CONCLUSION provide better scores compared with the baseline". If the results of the baseline model are those reported in Table 5, then Table 6 shows equal or lower performance on reasons and conclusions annotations.  Additionally, results are not provided for all models described earlier in the paper, leading to an incomplete analysis.

=> The paragraph was rewritten: we obtain good results on CONCLUSION before 5-fold validation, and on CLAIMS with 5-fold validation (Tables 7 to 9).

- Conclusions. The conclusion section lacks depth. It should not only summarize the results but also discuss unresolved challenges, limitations, and possible future directions for the research.

=> The conclusion has been revised (pages 11-12). Thanks a lot for all the comments.

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

I would like to express my appreciation to the authors for considering the majority of my feedback and revising the paper. 

However, I still find the originality and relevance of the methodology somewhat debatable. As the authors themselves note, the ART700 is frequently introduced by recurring expressions, making it unsurprising that n-gram models perform well or even better than LLMs in this context. In fact, one might question whether machine learning models are necessary at all. A pattern-based model that simply searches for these recurrent expressions and relevant keywords could likely achieve comparable results.

 

Below are my detailed comments on this second version of the paper:

- Line 27: The acronym "ML" appears for the first time and should be expanded as "machine learning (ML) model."

- Introduction, lines 55-56: It would be helpful to clearly state that the research is specifically relevant to the French legal system and focuses on that particular scenario and language.

- Line 139: The reference to the "HEADER" section should be contextualized. It is only defined later in the paper, along with the CLAIMS, REASONS, and CONCLUSION sections, which are mentioned a few lines below.

- Line 187 " recurrent (french) words (le, la, à ...etc)" > I would argue that these are not recurrent French words, but these are French prepositions and articles. I agree that these words belong to closed classes of parts-of-speech, and in this respect they convey limited meaning. Did you filter out function words and only retain content words (e.g., nouns, adjectives, verbs)?

- Line 224 >  I am curious about the perfect agreement (1.0) reported between annotators. Given such high agreement, it suggests that the document sections could perhaps be split using rule-based patterns, given how well-defined these sections are in the types of documents analyzed. Could this be a valid assumption?

- Line 225: "This sample includes 25 examples from each of the following categories: CLAIM, REASONS, CONCLUSION, and the overall JUDGMENT." Aren't CLAIM, REASONS, CONCLUSION, and JUDGMENT sections of the legal document rather than categories? Categories is used in other parts of the paper to refer to specific articles of the law, if I understood well.

- Line 247: Why was the agreement score for annotating the presence of Article 700 not reported? Since this is the core classification task, the agreement here is the actually interesting metric.

- Line 286: "The previous results (in Table 4)"—Table 4 presents the first set of results, so perhaps you meant "preliminary results"? 

- The authors introduce 10 binary models but only present results for the best-performing model in each class. While I understand this approach, it would be helpful to include an overview of the performance of all models, at least in the appendix, for completeness.

- Line 302: Since the authors conducted separate 5-fold cross-validations, how were the scores in Tables 4, 5, and 6 computed? What dataset was used as the test set? This is unclear from the paper.

- Line 326 "Compared to binary models, both CamemBERT models (see the first two columns of Table 10), [...] yielded better results than the JC". Isn't JC one of the two CamemBERT models? What does it mean that JC is better than JC? Please be more accurate in the description. 

- "when both models were trained on the entire judgment, their accuracy and F-measure decreased, though the JC model consistently outperformed the binary models" Also CamemBERT base accuracy is higher than binary models.  

- "Taking either the whole JUDGMENT or the CONCLUSION section does not allow the binary model to detect the presence or the absence of ARTICLE 700 which can explain the low precision and F-measures are outlined in Table 10 on 5-fold cross validation." This sentence should be revised. The word "are" is unnecessary. Furthermore, is a score of 0.812 really indicative of the model failing at this task? In NLP, such a score is generally considered quite good.

- Table 11: Throughout the paper, the authors focus on the best-performing models, but for this table, they present results from the worst-performing setup. Why is this the case? Additionally, why not compare them also against the CamemBERT base model?

 

Comments on the Quality of English Language

- Please revise the inconsistent use of capital letters in the title. 

- "In order to show that our models can be generalized to other datasets related to new claims different from the ARTICLE 700 category (a description of the other claim categories are provided in Appendix A)." > this sentence doesn't have a main clause. 

-"Taking either the whole JUDGMENT or the CONCLUSION section does not allow the binary model to detect the presence or the absence of ARTICLE 700 which can explain the low precision and F-measures are outlined in Table 10 on 5-fold cross validation." > "are" is unnecessary.

- Typos: "Judidial CamemBert"; CamemBERT is sometimes spelled Camembert or CamemBert in the last part of the paper;  Table 11 caption:  CJ Mode > JC model

Author Response

I would like to express my appreciation to the authors for considering the majority of my feedback and revising the paper. However, I still find the originality and relevance of the methodology somewhat debatable. As the authors themselves note, the ART700 is frequently introduced by recurring expressions, making it unsurprising that n-gram models perform well or even better than LLMs in this context. In fact, one might question whether machine learning models are necessary at all. A pattern-based model that simply searches for these recurrent expressions and relevant keywords could likely achieve comparable results.

Dear Reviewer,

Thank you very much again for your feedback on our paper. Hereafter we provide some clarifications about the experiments. While pattern matching may appear capable of capturing the common words for specific categories, it cannot capture the context around the claim. It inevitably misses unfamiliar expressions and typos, since it is a literal technique, which means it is neither generalizable nor adaptable to unseen variations. Also, rule-based methods require manual corrections and interventions over time. In addition, some claim categories have keywords that overlap with other categories or do not contain specific keywords at all. We have run another experiment with a rule-based model (regex) on the Article 700 category, based on a search for frequent words. The results (in Appendix C) show lower accuracy and F-measure than the binary classification models, and the approach falls short in generalization, especially when faced with complex or unstructured data.
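For context, such a rule-based baseline amounts to a regular-expression search of the following kind (the patterns below are illustrative; the actual rules of Appendix C are not reproduced here):

```python
# Minimal sketch of a regex baseline that flags Article 700 mentions.
# Patterns are illustrative; the paper's Appendix C rules may differ.
import re

ARTICLE_700 = re.compile(
    r"article\s*700(\s+du\s+(nouveau\s+)?code\s+de\s+proc[ée]dure\s+civile)?"
    r"|frais\s+irr[ée]p[ée]tibles",
    flags=re.IGNORECASE,
)

def predict_rule_based(text):
    """Return 1 if the text matches an Article 700 pattern, else 0."""
    return int(bool(ARTICLE_700.search(text)))

print(predict_rule_based("Condamne la société X au titre de l'article 700"))      # 1
print(predict_rule_based("Déboute les parties de l'ensemble de leurs demandes"))  # 0
```

As the response notes, such literal matching cannot exploit context and degrades on unseen phrasings, which is consistent with the lower scores reported in Appendix C.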

- Line 27: The acronym "ML" appears for the first time and should be expanded as "machine learning (ML) model."

Error corrected

- Introduction, lines 55-56: It would be helpful to clearly state that the research is specifically relevant to the French legal system and focuses on that particular scenario and language.

Remark considered

- Line 139: The reference to the "HEADER" section should be contextualized. It is only defined later in the paper, along with the CLAIMS, REASONS, and CONCLUSION sections, which are mentioned a few lines below.

References to definitions added

- Line 187 " recurrent (french) words (le, la, à ...etc)" > I would argue that these are not recurrent French words, but these are French prepositions and articles. I agree that these words belong to closed classes of parts-of-speech, and in this respect they convey limited meaning. Did you filter out function words and only retain content words (e.g., nouns, adjectives, verbs)?

We used NLTK stop words; we confirm that « le, la, à » are indeed included there.

- Line 224 >  I am curious about the perfect agreement (1.0) reported between annotators. Given such high agreement, it suggests that the document sections could perhaps be split using rule-based patterns, given how well-defined these sections are in the types of documents analyzed. Could this be a valid assumption?

Sectioning French judgments is a complicated task, and this cannot be deduced from inter-annotator agreement scores. See, for example: Detecting Sections and Entities in Court Decisions Using HMM and CRF Graphical Models, with G. Tagny Ngompé, S. Harispe, G. Zambrano, S. Mussard, and Jacky Montmain, Advances in Knowledge Discovery and Management, Springer, pp. 61-86, 2019. The inter-annotator agreement was calculated for the annotations of the Article 700 category dataset, not for sectioning the judgment. The perfect agreement is due to the fact that Article 700 is easy for lawyers to detect (which is not necessarily true for other categories).

- Line 225: "This sample includes 25 examples from each of the following categories: CLAIM, REASONS, CONCLUSION, and the overall JUDGMENT." Aren't CLAIM, REASONS, CONCLUSION, and JUDGMENT sections of the legal document rather than categories? Categories is used in other parts of the paper to refer to specific articles of the law, if I understood well.

Error corrected.

- Line 247: Why was the agreement score for annotating the presence of Article 700 not reported? Since this is the core classification task, the agreement here is the actually interesting metric.

To clarify, the inter-annotator agreement was calculated for the annotations of the Article 700 category dataset, not for the sectioning of the judgment. We have moved the inter-annotator agreement paragraph to the end of Section 4.1 to avoid confusion. See page 6.

- Line 286: "The previous results (in Table 4)"—Table 4 presents the first set of results, so perhaps you meant "preliminary results"? 

Error corrected

- The authors introduce 10 binary models but only present results for the best-performing model in each class. While I understand this approach, it would be helpful to include an overview of the performance of all models, at least in the appendix, for completeness.

The results of the 10 models are added in Appendices B and C. Additionally, we realized that the experiment related to Tables 7, 8, and 9 (5-fold) was not explained. The compression strategy was different, so a new paragraph has been added to explain this.

- Line 302: Since the authors conducted separate 5-fold cross-validations, how were the scores in Tables 4, 5, and 6 computed? What dataset was used as the test set? This is unclear from the paper.

Tables 4,5 & 6 titles were modified (train-test split 80/20 1-fold)

- Line 326 "Compared to binary models, both CamemBERT models (see the first two columns of Table 10), [...] yielded better results than the JC". Isn't JC one of the two CamemBERT models? What does it mean that JC is better than JC? Please be more accurate in the description. 

Error corrected

- "when both models were trained on the entire judgment, their accuracy and F-measure decreased, though the JC model consistently outperformed the binary models" Also CamemBERT base accuracy is higher than binary models.  

Remark considered and text rectified

- "Taking either the whole JUDGMENT or the CONCLUSION section does not allow the binary model to detect the presence or the absence of ARTICLE 700 which can explain the low precision and F-measures are outlined in Table 10 on 5-fold cross validation." This sentence should be revised. The word "are" is unnecessary. Furthermore, is a score of 0.812 really indicative of the model failing at this task? In NLP, such a score is generally considered quite good.

Error corrected (lower precision)

- Table 11: Throughout the paper, the authors focus on the best-performing models, but for this table, they present results from the worst-performing setup. Why is this the case? Additionally, why not compare them also against the CamemBERT base model?

Experiments in Table 11 were conducted only with Judicial CamemBERT, as it outperforms the CamemBERT base model (an explanation has been added on page 11).

 

Comments on the Quality of English Language

- Please revise the inconsistent use of capital letters in the title. 

Title corrected

- "In order to show that our models can be generalized to other datasets related to new claims different from the ARTICLE 700 category (a description of the other claim categories are provided in Appendix A)." > this sentence doesn't have a main clause. 

Sentence was corrected

-"Taking either the whole JUDGMENT or the CONCLUSION section does not allow the binary model to detect the presence or the absence of ARTICLE 700 which can explain the low precision and F-measures are outlined in Table 10 on 5-fold cross validation." > "are" is unnecessary.

Sentence was corrected

- Typos: "Judidial CamemBert"; CamemBERT is sometimes spelled Camembert or CamemBert in the last part of the paper;  Table 11 caption:  CJ Mode > JC model

Thanks for all your precise comments, this is very helpful.

 
