Evaluating Human versus Machine Learning Performance in a LegalTech Problem

Many machine learning-based document processing applications have been published in recent years. Applying these methodologies can reduce the cost of labor-intensive tasks and induce changes in a company's structure. An artificial intelligence-based application can take over work otherwise assigned to trainees and free up the time of experts, which can increase innovation inside the company by letting them be involved in tasks with greater added value. However, the development cost of these methodologies can be high, and development is usually not a straightforward task. This paper presents the results of a survey in which a machine learning-based legal text labeler competed with multiple people with different levels of legal domain knowledge. The machine learning-based application used binary SVM-based classifiers to solve the multi-label classification problem. The methods used were encapsulated and deployed as a digital twin in a production environment. The results show that machine learning algorithms can be effectively utilized for monotonous but domain knowledge- and attention-demanding tasks. The results also suggest that embracing the machine learning-based solution can increase discoverability and enrich the value of data. The test confirmed that the accuracy of a machine learning-based system matches the long-term accuracy of legal experts, which makes it applicable for automating the working process.


Introduction
Finding relevant court decisions is a cornerstone of legal research. It is a time-consuming part of a lawyer's job when preparing for a lawsuit. This mainly involves looking for arguments to convince the court to decide in favor of their clients [1,2]. These manual searches are often inaccurate [3]. Many studies have been published examining the effectiveness of attorney teams. Blair and Maron showed in their research that, although the attorneys thought that they had found 75% of the related documents, they had found only about 20% of them [4,5].
One reason for this difficulty is that legal documents contain a detailed description of the case, which uses a wide variety of language and synonyms to describe the same issues. Therefore, the human user has to use many possible combinations and synonyms of the keywords to find the connecting cases.
A good example of this is the case of an employee who committed complicity in smuggling. The employee sued his/her former employer for equal treatment violation and because of the non-payment of wages and cafeteria benefits, and he/she claimed that they terminated employment wrongfully. However, this was not a criminal case. When lawyers receive such a case, they find themselves in a difficult situation to find similar judgments. If they use words that refer to the illegal smuggling of goods from a foreign country in their search queries, they will obtain mainly criminal and non-labor cases; if they look for termination of employment in general or violation of equal treatment, the result list is also likely to be misleading.
Categorizing court decisions by the subject matter of the lawsuit can significantly improve the performance of these searches, and many research works have dealt with legal document categorization in recent years [6][7][8][9][10][11]. However, using human experts for this task is very time consuming and expensive because the documents are relatively long, usually containing thousands of words, and it is a multi-labeling task, meaning that one document can fit into more than one category [12]. Moreover, other research has shown that texts categorized in a binary manner (relevant/irrelevant for specific litigation) by two independent groups of human experts reached only 28% in F1 score, agreeing on labels in only 70% of the documents [5,13,14,15,16,17]. Hence, human categorization often cannot be treated as a ground truth solution.
Many machine learning-based classification solutions have been published in the literature, but so far, no study has directly compared the performance of ML algorithms to humans in terms of accuracy and reliability [6,7,18]. Guodong et al. [8] created a method for categorizing Chinese legal documents using a Graph LSTM (long short-term memory) network [19][20][21] combined with domain knowledge extraction [22]. They compared their algorithm with the traditional classification methods of support vector machine (SVM [23,24]) and LSTM. Thammaboosadee et al. [9] made a classifier that uses a two-stage model to identify legal charges and the punishment range, given case facts and attributes, which could exceed 90% precision. However, these studies calculated the absolute accuracy and the absolute performance of the given solutions. Legal firms and companies want to know when machine learning performance can reach or even surpass human-level performance, so that they can implement it in their business processes.
Significant research has been conducted in a wide variety of other fields that compared the accuracy and performance of human and automated classification [25][26][27][28][29][30][31][32]. Generally, more and more AI-based solutions are created to replace human activities in industrial applications [29,31,32]. Goh et al. [25] used the support vector machine (SVM) algorithm to classify European Research Council Starting Grant project abstracts and compared the results to human labelers. They found that while the best human classifiers can outperform the algorithm, on average, the algorithm is more accurate and more reliable than human classifiers. The results also showed that using a machine learning algorithm is a cost-effective method to classify different texts. Simundic et al. [27] compared automated detection and visual inspection of preanalytical interference, such as lipemic, icteric, and hemolyzed samples. They found that human inspection is unreliable and that an automated system should be the standard protocol. Weismayer et al. [28] compared the categorization of TripAdvisor reviews by traditional manual content analysis and fully automated domain-specific aspect-based sentiment analysis tools. They found that the automated tools can analyze the reviews better, and that the manual analysis is more time consuming.
This survey compares the performance of humans and machine learning-based algorithms on a multi-labeling task, namely, the classification of jurisprudence documents by their subject matter. The goal of the survey is to highlight when and how a machine learning-based application can be applied in business processes: when these methods can replace humans in data annotation tasks, and how they can improve the quality and the discoverability of a legal database. This experiment differs from the previous ones in that the participants had to read relatively long texts, and every document could be categorized into multiple classes. The performance comparison of human versus machine learning methods on the classification of long texts into multiple categories is an open research question and an interesting question for firms deciding when and how machine learning-based methodologies can be implemented in their business processes.

Research Questions
From the business point of view, the most interesting questions are regarding when the machine learning algorithm-based classifier reaches human-level performance and how these algorithms can be applied in business processes to accelerate the work or increase the discoverability of documents.
A study was designed to answer the following five major questions:
• How much time would the human categorizers need to label the whole dataset (about 170,000 documents)? How could this work be accelerated by the assistance of the computer?
• How much information would a human expert find with or without the aid of the machine learning classifier?
• Are machine learning algorithms more reliable than humans for classifying legal documents?
• Can machine learning algorithms hide the differences between the performance of legal experts and laymen or non-expert lawyers?
• How much is the inter-annotator agreement between legal experts on a specific task?

Study Design
The legal system in Hungary is a limited precedent-based system, and the judicial practice formally distinguishes six different groups of matters, in other words, law areas: criminal law, military criminal law, administrative law, labor law, civil law, and economic law. The published court decisions numbered more than 170,000 documents when the research was done. These documents are relatively long: an average text contains 3330 words. The published case law is entirely in Hungarian, and the agglutinative nature of the Hungarian language makes most natural language processing tasks quite difficult [33].
We selected 220 documents for this survey. These documents were pre-labeled and cross-checked by legal experts. We used this test set as a reference for further evaluations. It was important to select a similar number of documents from each of the six groups of matters and to cover two kinds of documents: first, documents that have only one exact solution and can be classified easily; second, documents that have many possible categorizations, all of which are very hard to find.
We selected two sets of roughly similar size: one with documents to which only one label could be added and another with documents to which multiple labels could be added. Moreover, we chose an equal proportion of rare categories, for which there was little training data for the machine learning algorithm, and common categories, for which there were many training documents. We did this to simulate real working conditions and to capture the effect that the monotony of the task, fatigue, and the different learning patterns of humans had on performance [5,34]. During the labeling process, the participants had to proceed in the same fixed order. When ordering the test set, we put each hard-to-categorize decision after a similar, simple case.
The participants had three hours to label as many documents as they could. There were 18 participants involved in this study, with three different competence levels:
• Laymen: People who had never received formal legal training, that is, they had never studied at a law school or received any law-related training, and had encountered the law only in their everyday life.
• Lawyers: Law students in at least their fourth year, or people with law degrees.
• Legal editors: Legal database editors employed by Wolters Kluwer Hungary, whose task is to categorize legal documents and manually enrich them with other metadata.
Every group was composed of six people, and they were divided into two subgroups. The first subgroup could use the assistance of the machine learning labeler, while the second subgroup did the labeling independently.

Evaluation Metrics
The selection of the legal categories followed the Hungarian legal system. We reduced the number of categories to 167, and every document could receive a maximum of four distinct labels during the labeling process. When reducing the category labels, we strove to exclude or merge those categories where the number of possible elements was under twenty (Figure 1). This was essential to provide enough training data for the machine learning classifier. Figure 1 shows the estimated sizes of the different label groups in the full dataset. The sizes of the different areas are not uniform: the number of elements varies from 30 to 30,000 across the categories. The information content of a label is inversely related to the number of its elements. This is because when a document with a rare subject matter label is found during a search, the label reduces the size of the set of similar documents significantly. Hence, the information content of a small subject matter label is higher than that of a very general label to which thousands of documents belong.
We introduced a scoring system to compensate for these differences and to better measure the information content. Labels that had been assigned to more than 200 documents were worth 1 point, labels assigned to between 50 and 200 documents were worth 5 points, and labels assigned to fewer than 50 documents were worth 10 points. Every correct label counted, and no penalty was applied for incorrect labels during the calculation. With this scoring system, the areas in Figure 1, which represent the value of the information, fall in the same range. The maximum total score achievable on the reference set was 1020 points.
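The scoring rule above can be written down in a few lines. The following is a minimal sketch, not the evaluation code used in the study; the function names, the label names, and the label sizes in the usage note are hypothetical, and the boundary values 50 and 200 are assumed to fall into the 5-point band, as the wording "between 50 and 200" suggests.

```python
def label_points(document_count):
    """Points awarded for one correct label, given how many documents carry it."""
    if document_count > 200:
        return 1   # very general label, low information content
    if document_count >= 50:
        return 5   # medium-sized label (50 to 200 documents, inclusive by assumption)
    return 10      # rare label, high information content


def score_document(assigned, reference, label_sizes):
    """Sum the points of the correctly assigned labels; wrong labels cost nothing."""
    correct = set(assigned) & set(reference)
    return sum(label_points(label_sizes[label]) for label in correct)
```

For example, with the hypothetical sizes `{"general": 12000, "medium": 120, "rare": 35}`, a participant who assigns `{"general", "rare", "wrong"}` to a document whose reference labels are `{"general", "medium", "rare"}` earns 1 + 10 = 11 points; the incorrect label is simply ignored.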

Machine Learning-Based Classification
Because each document could have more than one label, the original problem was decomposed into multiple binary classification problems [35,36]. Since some subject matters belonged to more than one law area, 229 different binary classification models were trained. Support vector machines were chosen as the machine learning algorithm for the labeling task, partly because SVMs tend to perform well in high-dimensional vector spaces [37] and partly because previous studies have shown the superiority of this algorithm in similar categorization tasks [25,38]. For small categories, text augmentation techniques (EDA, Word Vectors [39]) were used to generate synthetic samples to improve the performance of the training. The machine learning model was developed and deployed via the openly accessible digital-twin-distiller computation platform (https://github.com/montana-knowledge-management/digital-twin-distiller, accessed on 12 December 2021), where we used its plugins and the most important natural language processing libraries to accelerate the development [40,41].
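The decomposition into binary problems can be illustrated with a short sketch. It only shows how multi-label annotations are turned into one binary target vector per label and how per-label binary decisions are merged back into label sets; it is not the authors' implementation, the function names are hypothetical, and the training of the individual SVMs on these targets is deliberately left out.

```python
def binary_relevance_targets(label_sets):
    """Turn a list of label sets into one 0/1 target vector per label."""
    labels = sorted(set().union(*label_sets))
    return {
        label: [int(label in doc_labels) for doc_labels in label_sets]
        for label in labels
    }


def merge_binary_predictions(predictions):
    """Merge per-label 0/1 prediction vectors back into one label set per document."""
    n_docs = len(next(iter(predictions.values()), []))
    return [
        {label for label, flags in predictions.items() if flags[i]}
        for i in range(n_docs)
    ]
```

Each target vector would then be used to train one independent binary classifier, so a document receives every label whose classifier votes positive.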
The machine learning solution was elaborated by harnessing the following characteristics of legal documents: the legal expressions in the texts may refer to the subject matter, and legal references (e.g., certain acts or paragraphs of acts) can help to determine the subject matter of a document. Hence, TF-IDF (term frequency-inverse document frequency) vectorization was chosen as the vectorization process [42,43]. Law references were extracted from the texts and normalized by using a regular expression-based solution. The law reference extractor returns a list of the law references found in the legal document in the most specific form possible.
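To make the vectorization step concrete, the sketch below computes TF-IDF weights for tokenized documents. It uses the textbook variant (relative term frequency multiplied by the logarithm of the inverse document frequency); the exact weighting, smoothing, and tokenization used in the study are not stated here, so these choices and the function name are assumptions.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dictionary per document (unsmoothed textbook TF-IDF)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

A term that occurs in every document gets weight zero, while a term (or a normalized law reference treated as a token) that occurs in only a few documents gets a high weight, which is why rare legal expressions and references are informative for the classifier.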
The detailed description of the proposed machine-learning-based solution is a subject of another paper [44].

Throughput
The first question of the research was to estimate how long it would take to label all 170,000 documents by hand. The survey measured how many documents the different groups of participants could categorize in 3 h, according to their level of competence and whether they received pre-labeled decisions or not. The average number of labeled judgments for the different groups of participants is shown in Figure 2. The results confirmed the expectation that the experienced legal editors processed the most documents, 108.7 on average, almost double that of an inexperienced person, since the laymen processed only 56 documents on average. The results also illustrate that even the most competent participants could categorize only approximately 300 judgments per day without computer assistance. This means that if a database provider wanted to label all the available Hungarian judicial decisions, approximately 170,000 documents, by human work, it would take more than two years if the company employed one professional editor for this low added value task. If the employer used laymen, a cheaper workforce, the task would take about double the time, about 4.5 years. The accuracy of the work and the discoverability of the data were not considered in these calculations. If the data provider wants a result of reference set quality, it has to employ three professional editors for this task. In this case, about seven years of work is needed to process these documents; however, the discoverability of the data will be two times better. In contrast, the applied machine learning algorithm labels a batch of 300 judicial decisions in minutes, and only several hours are enough to label the whole dataset.
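The estimates in this paragraph can be reproduced with simple arithmetic. The sketch below is only a back-of-the-envelope check: the 8-hour working day and the 250 working days per year are assumptions that are not stated in the text, and the function name is hypothetical.

```python
def years_to_label(total_docs, docs_per_session, session_hours=3.0,
                   hours_per_day=8.0, working_days_per_year=250,
                   annotators_per_doc=1):
    """Person-years needed to label total_docs at the measured labeling speed."""
    docs_per_day = docs_per_session / session_hours * hours_per_day
    working_days = total_docs * annotators_per_doc / docs_per_day
    return working_days / working_days_per_year


editor_years = years_to_label(170_000, 108.7)                        # one legal editor
layman_years = years_to_label(170_000, 56.0)                         # one layman
cross_checked = years_to_label(170_000, 108.7, annotators_per_doc=3)  # three editors per document
```

Under these assumptions, an editor labels roughly 290 documents per day, and the three values come out at about 2.3, 4.6, and 7.0 person-years, which is consistent with the figures quoted above.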
There was a surprising result, shown in Figure 2. Those participants who received pre-labeled documents labeled more slowly (61.0 documents on average) than those who received unlabeled decisions (99.5 documents). However, the participants working with pre-labeled document sets extracted 50% more information from the same number of documents. It seems that the pre-labeled documents forced the labelers to read the decisions more thoroughly.

Accuracy
We applied different metrics to compare the results. Firstly, we calculated the accuracy of the labeled documents (Figure 3). This accuracy is the proportion of the documents that were completely or partially labeled correctly by each group. A document was considered partially labeled when at least one correct label was found for the given judgment. The results were based only on the documents that the participants managed to label, not on the whole dataset. Even laymen were capable of finding at least one correct label in 69% of the tagged documents. The laymen who worked with the pre-labeled documents reached the accuracy of those professionals who worked with the unlabeled documents. From this point of view, there is no significant difference between the lawyers and the professional legal editors. However, we got a different picture when we calculated the accuracy on those documents that fit at least three categories. Here, we accepted a solution from a participant if they found at least three labels correctly for the given document (Figure 4). Those participants who could use the support of the machine learning algorithm reached significantly higher accuracy, and there is no significant difference between the laymen and the legal editors in this type of contest. The application of the computer can increase the accuracy of the participants by more than 50%.
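Both accuracy variants described above can be expressed with one function. This is a minimal sketch of the metric as it is described in the text, not the evaluation script of the study; the function and parameter names and the document identifiers in the usage note are hypothetical.

```python
def accuracy_at_least(assigned, reference, min_correct=1, min_reference_labels=1):
    """Share of labeled documents with at least min_correct correct labels.

    assigned:  {doc_id: set of labels} for the documents a participant labeled
    reference: {doc_id: set of labels} from the reference set
    Only documents with at least min_reference_labels reference labels count.
    """
    eligible = [
        doc_id for doc_id in assigned
        if len(reference[doc_id]) >= min_reference_labels
    ]
    if not eligible:
        return 0.0
    hits = sum(
        1 for doc_id in eligible
        if len(set(assigned[doc_id]) & set(reference[doc_id])) >= min_correct
    )
    return hits / len(eligible)
```

Calling `accuracy_at_least(assigned, reference)` gives the "at least one correct label" accuracy, while `accuracy_at_least(assigned, reference, min_correct=3, min_reference_labels=3)` restricts the evaluation to documents with three or more reference labels and requires three correct labels. Documents the participant did not reach are absent from `assigned` and therefore do not lower the score.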

Performance
The performance of the different participants was calculated with the aid of the previously introduced scoring system (Section 2.3). The scores of the legal editors and of the machine learning algorithm are compared in Figure 5. The machine learning algorithm achieved 488 points out of the possible 1020, which seems to be a relatively low performance. However, compared with the legal editors, it found 50% more information in the same reference set than the human editors did in three hours. Surprisingly, those editors who could not use the assistance of the computer found more than 40% more information in the dataset than those who used the pre-labeled documents. Checking the normalized values in Figure 6, we recover the previous finding that those editors who used the computer assistance discovered 50% more information than the others and the computer. Figure 6 shows that the performance of the machine learning algorithm reaches human-level performance. Even the best-performing legal editor needed more than two hours to reach this score, and they scored 3 points more than the machine learning algorithm. The effect of fatigue is also visible in the figure, where the performance of the editor group started to decrease. This means that a machine learning system can be used in different ways in a LegalTech business process: it can replace the work of human experts or be used to increase the discoverability of the dataset.

Inter-Annotator Agreement
The aim of measuring the inter-annotator agreement (IAA) is to assess the reliability of an annotation process. During this evaluation, we did not use the reference set for the comparison.
Each of the three members of a group was considered as an annotator. The reliability of their annotations was measured with Krippendorff's alpha (K α) [45,46], a widely used statistical measure that differs from most other IAA methods because it calculates the disagreement between the different voters [47]. The K α measure was selected because it can handle missing data, various sample sizes, and categories. The reliability measure is easy to interpret and does not depend on the number of categories [49]. The simplest form of K α can be calculated by the following formula [45]:

$$K_\alpha = 1 - \frac{D_o}{D_e}$$

where D_o stands for the observed disagreement among values, and D_e is for "disagreement one would expect when the coding of units is attributable to chance rather than to the properties of these units" [45]. The result of the K α calculation is a number between −1 and 1, where 1 indicates perfect agreement, 0 indicates no agreement beyond chance, and negative values indicate inverse agreement. K α ≥ 0.8 is usually taken as the acceptance limit. Below this limit, results are still acceptable for tentative conclusions, and the lowest limit for an acceptable agreement is K α ≥ 0.667. This is the minimal requirement to consider an inter-annotator agreement calculation reliable [45,47]. The metric has an important, problem-specific component called the difference function, which is used to weight the numerator and the denominator. The MASI (Measuring Agreement on Set-valued Items) distance metric seemed to be the most appropriate in our case, due to the large number of possible annotation categories [50]. Figure 8 shows the calculated K α scores in percentage for each annotator group. There is a surprising result: the group of laymen who used the computer assistance achieved the highest K α score. They achieved 55% reliability, which is significantly higher than the reliability of the professional editors. There are two reasons for this surprising result.
Firstly, if we compare the results of the three groups who could not use the assistance of the machine learning methodology, we can see in Figure 8 that these groups achieved similar reliability, independently of their experience. This reliability score was very low in these cases. The results suggest that the laymen group, who trusted the result of the computer annotation, achieved the highest reliability score in the survey. This indicates that machine learning-based methodologies produce more consistent solutions for large databases than humans do. However, even this highest reliability score is lower than the required minimum (K α ≥ 66.7%). The second reason is that the set of categories should be revised. The poor agreement between the human experts suggests that the labels are not straightforward and not independent of each other, and that the annotators did not look for every possible combination due to the time limitations. The percentage agreement within the different groups was also calculated to examine further this reason for the poor reliability between the professional editors (Figure 9). This percentage agreement is a more permissive measure than K α. It shows the ratio of documents where all three members of the group gave at least one similar label to the given document. The resulting values confirm our expectation that the professional legal editors achieved the best results (K α 50%), and non-specialists were the worst (K α 23%) in this comparison. It seems surprising that those editors who did not use the assistance of the machine learning methodology achieved significantly higher (about 5%) scores than those editors who worked with the computer assistance. These observations seem to confirm that the learning curve of the human professionals is different from that of the machine learning methodology: the professionals found the simple labels with much higher accuracy and reliability than the computer.
However, the reliability of the human experts is worse than that of the machine learning solution for complex cases. This measurement supports the claim that machine learning methodologies can significantly increase the discoverability of large and complex datasets.
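The two agreement measures used in this section can be sketched as follows. This is an illustrative implementation written from the standard definitions, not the code used in the study: Krippendorff's alpha is computed as one minus the ratio of observed to expected disagreement for an arbitrary distance function, the MASI distance is taken as one minus the Jaccard coefficient multiplied by a monotonicity weight (1, 2/3, 1/3, or 0), and the permissive percentage agreement is interpreted as "all annotators share at least one common label", which is an assumption about the wording above. All function names are hypothetical.

```python
from itertools import permutations


def masi_distance(a, b):
    """MASI distance between two label sets: 1 - Jaccard * monotonicity."""
    a, b = frozenset(a), frozenset(b)
    if a == b:
        return 0.0
    common = a & b
    if not common:
        return 1.0
    jaccard = len(common) / len(a | b)
    monotonicity = 2 / 3 if (a <= b or b <= a) else 1 / 3
    return 1.0 - jaccard * monotonicity


def krippendorff_alpha(units, distance=masi_distance):
    """Krippendorff's alpha; units holds, per document, the label sets of its annotators."""
    pairable = [u for u in units if len(u) >= 2]  # missing annotations are simply absent
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("at least one document with two annotations is required")
    observed = sum(
        sum(distance(x, y) for x, y in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n
    expected = sum(distance(x, y) for x, y in permutations(values, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("all annotations are identical; alpha is undefined")
    return 1.0 - observed / expected


def percentage_agreement(units):
    """Share of documents where all annotators have at least one label in common."""
    complete = [u for u in units if len(u) >= 2]
    if not complete:
        return 0.0
    agreed = sum(1 for u in complete if frozenset.intersection(*map(frozenset, u)))
    return agreed / len(complete)
```

The contrast between the two measures is visible on small inputs: three annotators who assign `{"a", "b"}`, `{"a"}`, and `{"a", "c"}` to a document count as agreeing under the percentage agreement, while the MASI-weighted alpha still penalizes the partial mismatch between their label sets.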

Conclusions
This paper has presented the results of an experimental study that compares human performance with a machine learning-based solution on a fixed-length, low added value, monotonous task, legal text classification, where the solutions are not exactly defined. Enriching unstructured legal documents with specific labels is necessary to help lawyers, judges, and prosecutors to find similar cases. However, the classification of these texts is very time consuming, and it requires an unacceptable amount of time from the legal editors. The development time and cost of a machine learning-based solution mainly depend on the complexity of the problem. The motivation behind the experiment was to estimate the performance of the legal editors on this task because if a machine learning solution can reach human performance, it can be worth replacing human work. This assumption can significantly reduce the requirements and the cost of the implementation. During the experiment, the performance of three competence groups was examined: legal editors, lawyers, and laymen. Every group was divided into two parts. The first subgroup could use the results of the machine learning algorithm as assistance. The second subgroup completed the labeling without assistance. The results showed that the proposed machine learning solution, which found 48% of the information in the reference dataset, significantly outperformed the average of the legal editors in the whole test. Surprisingly, those participants who used the computer assistance were slower, but their precision increased by more than 50%. Moreover, the computer assistance increased the score of the laymen participants significantly: they achieved performance comparable to that of the expert participants. The results show that the application of a machine learning algorithm in solving a legal tech problem can have positive impacts.
It can improve the workflow by replacing the human in the loop and reducing the cost of production, it can improve the quality of the data, and it can shorten the learning curve of new colleagues working on data enrichment. The study results show that the applied machine learning algorithm can reach the average performance of human experts. Moreover, machine learning methodologies can be advantageous for those monotonous tasks where finding the correct solution needs deep focus and unique expertise, or where it is hard to define the exact solution, as in the case of law. Another insight gained from this study is that the label set should be reviewed from a legal perspective, and other domain knowledge should be taken into account to increase the agreement between the legal experts and to create a new ontology for the labeling system. This new ontology and the newly trained models can further increase the discoverability, usability, and value of the legal database.