A Survey of Research on Data Analytics-Based Legal Tech

Abstract: Data analytics provides important tools and methods for processing the data generated during legal services. This paper aims to provide a systematic survey of the research papers on the application of quantitative data analytics algorithms in the legal domain. To this end, relevant research papers were collected and used to analyze topics and trends of research on data analytics-based Legal Tech. The key findings of this paper are as follows. Firstly, the number of research papers about Legal Tech has increased dramatically recently. Secondly, the application of supervised learning techniques to legal judgment data is a very popular approach in this research area. Thirdly, preprocessing legal documents is a very important procedure, as many legal documents exist in text form. Fourthly, artificial neural networks and their variations are widely used in research on data analytics-based Legal Tech. Fifthly, data analytics-based Legal Tech is a multidisciplinary research topic related to computer science, social science, and other fields.


Introduction
Recently, Industry 4.0 has received much attention from both researchers and practitioners. Industry 4.0 technologies, including artificial intelligence (AI), machine learning (ML), robotics, the Internet of Things (IoT), wireless communication, big data, and cloud computing, are greatly affecting entire economies and societies [1,2]. These technologies are recognized as powerful tools for improving the productivity and competitiveness of a wide range of industries, including manufacturing [3], education [4], and healthcare [5]. The legal industry is no exception, and this paper focuses on the Industry 4.0 technologies applied to legal services. Traditionally, legal services have been provided by human experts; however, modern technologies can be used to automate the service procedures of the legal domain. Consequently, Legal Tech has emerged as an important research topic for the legal and IT industries [6,7].
Legal Tech can be defined as modern technologies and IT solutions that can be used to provide some types of legal services [8], and the application of Legal Tech can be classified into eight sub-areas, as shown in Table 1 [9].
Applications in the first sub-area (lawyer marketplace, lawyer-to-lawyer outsourcing, and social and referral networks) are used to find appropriate legal service providers conveniently. The second sub-area, document automation and assembly, includes information systems that can be used to create and process electronic documents in the legal domain. The objective of the third sub-area, involving practice management, case management for specific practice areas, and legal billing, is to provide tools and methods for managing the business data of lawyers and judges, such as work schedules, client information, and law articles. The fourth sub-area, legal research, focuses on legal data search services based on technologies that can be used to parse and interpret text data in the legal domain. The fifth sub-area of Legal Tech aims to apply predictive analysis methods to a training set collected from a law court in order to obtain patterns or models that can be used to predict the trial results of future law cases. The sixth sub-area, electronic ...
This paper focuses on data analytics-based Legal Tech applications, primarily identified from sub-areas 4 and 5 in Table 1. Data analytics can be defined as a procedure of creating value by processing, analyzing, and interpreting raw data [10]. Popular approaches and techniques of data analytics are AI, ML, and data mining, which are key Industry 4.0 technologies [11,12]. Moreover, data analytics can be a particularly useful tool for legal services, since many legal service procedures involve various data that need to be examined by human experts [13]. In other words, data analytics is a promising tool for automating and innovating existing legal services. There are several examples of commercial data analytics-based Legal Tech solutions, such as CaseText and Ross [7].
Several survey papers related to Legal Tech are listed in Table 2. Table 1 suggests that Legal Tech is a multidisciplinary research area associated with a wide range of disciplines, from law to IT and engineering. Hence, previous literature reviews on Legal Tech tended to consider research papers with a wide range of objectives. For instance, Chen [14] provided a survey of application areas of various Legal Tech solutions. Hongdao et al. [7] investigated the markets and business models of Legal Tech. Janoski-Haehlen [15] reviewed curricula and legal education programs that consider Legal Tech. Salmerón-Manzano [16] grouped the research area of Legal Tech into several clusters, including computer science, justice, legal profession, legal design, law firms, and legal education. These papers have focused primarily on commercial aspects of Legal Tech or its influences on society and legal industries.
In contrast, Chalkidis and Kampas [17] provided a literature review on a specific area of Legal Tech, deep learning (DL) applications in the legal domain. Moreover, they suggested three application areas for DL-based Legal Tech, including text classification, information extraction, and information retrieval. DL is one of the AI techniques that can be used to analyze the data collected during legal service procedures. In other words, the paper provided a literature review on Legal Tech applications that can be classified as sub-areas 4 and 5 in Table 1. However, there are many other approaches and methodologies that can also be utilized in those sub-areas. To fill this gap, this paper aims to provide a more comprehensive survey of data analytics applications in the legal domain.
The major contributions of this paper are two-fold. Firstly, a wide range of techniques and algorithms, including AI, ML, and data mining, are considered. Typically, they are applied to perform conventional data analytics tasks such as classification, regression, clustering, and association. This paper provides a systematic survey of Legal Tech from the perspective of such data analytics tasks. Secondly, additional issues related to data analytics, such as data sources and data structures, are also discussed. Traditionally, a large amount of unstructured data, such as text documents, is generated and utilized during legal service procedures. Unstructured data are not suitable for conventional data analytics methods and algorithms. Thus, data-related issues, including data types and preprocessing procedures, are important for data analytics-based Legal Tech. This paper provides comprehensive insight into both quantitative methods and data.
The remainder of this paper is organized as follows. Section 2 outlines our survey procedure and the research scope of this paper. Section 3 offers some summary statistics of relevant research papers, which suggest recent trends in research on data analytics-based Legal Tech. Section 4 provides discussions on the objectives, algorithms, and data sets of data analytics-based Legal Tech. Finally, Section 5 concludes this paper with some future research directions.

Survey Procedure
Figure 1 depicts the overall research procedure of our survey. The first step was to determine the search keywords that would be used to find research papers relevant to our survey in academic databases. As shown in Table 3, each search used a combination of two keywords, keyword 1 and keyword 2. Keyword 1 was chosen from the set {'Legal', 'Law'}, while keyword 2 was an element of {'Data Mining', 'Data Analytics', 'Text Mining', 'Classification', 'Machine Learning', 'Deep Learning', 'Prediction', 'Clustering', 'Tech'}. The sets of words for keyword 1 and keyword 2 were obtained based on a pilot study and the opinions of experts in data analytics.
The second step was to collect research papers from academic databases and journals. This paper focuses on research papers published in journals indexed in SCI (Science Citation Index), SCIE (Science Citation Index Expanded), SSCI (Social Science Citation Index), and A&HCI (Arts and Humanities Citation Index), which can be collected by using the search engine provided by the Web of Science. The Web of Science search engine was chosen because it can be used to search research papers published in high-quality academic journals in a wide range of research fields, such as science, technology, social science, and the humanities. In other words, research papers published in conference proceedings or other journals are outside the scope of this paper.
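The keyword pairing described for the first step can be enumerated mechanically. The following is a minimal sketch rather than the authors' actual tooling: the two keyword sets are taken from the text, while the function name `build_queries` and the choice to join each pair with a single space are assumptions made for illustration (the exact query syntax submitted to the Web of Science is not stated in the paper).

```python
from itertools import product

# Keyword sets as listed in the text.
KEYWORD_1 = ["Legal", "Law"]
KEYWORD_2 = [
    "Data Mining", "Data Analytics", "Text Mining", "Classification",
    "Machine Learning", "Deep Learning", "Prediction", "Clustering", "Tech",
]


def build_queries(first, second):
    """Return one search phrase per (keyword 1, keyword 2) pair.

    Joining the pair with a space is an illustrative assumption,
    not the query format used in the survey.
    """
    return [f"{k1} {k2}" for k1, k2 in product(first, second)]


queries = build_queries(KEYWORD_1, KEYWORD_2)
print(len(queries))  # number of distinct keyword combinations
print(queries[:3])   # first few search phrases
```

With two choices for keyword 1 and nine for keyword 2, this yields 18 distinct combinations.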
Table 3 shows that 3167 research papers were identified from the Web of Science database. Additionally, research papers published in Artificial Intelligence and Law, an individual academic journal related to data analytics-based Legal Tech, were collected in the second step in Figure 1. However, many of the collected papers were irrelevant to our survey on data analytics-based Legal Tech. For instance, many research papers in the fields of mineralogy or geology were included in the initial search results due to the use of the term 'Mining' in keyword 2. Thus, the objective of the third step was to filter out irrelevant research papers from the initial search results. To this end, the authors manually examined the titles, abstracts, and author affiliations of the collected papers.
Note that this paper focuses primarily on data analytics applications designed to analyze and process data generated during legal and judicial procedures, which include legal judgments, court records, regulations, and law articles. Such data have traditionally been analyzed and processed by human experts, such as lawyers and judges; however, this procedure is expected to be automated by using modern Industry 4.0 technologies. In contrast, research papers focusing on the analysis of data collected outside of legal and judicial procedures, including behavioral data of citizens, are not considered in this paper. For instance, research topics related to law enforcement, such as crime detection and digital forensics, and patent mining based on technical documents were filtered out at this step. Furthermore, research papers on the logical analysis of legal argumentation, the architecture of legal information systems, and legal issues related to AI are excluded from our survey study.

Among the research papers collected by using the Web of Science search engine, 57 papers were identified as being relevant for our survey, as shown in Table 3; however, 18 of them were duplicates. Thus, only 39 papers were used in this paper. In addition, 25 relevant research papers were identified from the aforementioned journal, Artificial Intelligence and Law. Consequently, 64 research papers remained for our survey study after the filtering of irrelevant papers and the removal of duplicates.
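The paper counts above follow from a simple deduplicate-then-merge step, sketched below. The counts (57, 18, 25) come from the text; the helper `deduplicate` and its use of a normalized title as the matching key are hypothetical, since the paper does not state how duplicates were detected.

```python
def deduplicate(titles):
    """Keep the first occurrence of each title.

    Matching on a lowercased, whitespace-normalized title is an
    illustrative assumption, not the authors' stated method.
    """
    seen = set()
    unique = []
    for title in titles:
        key = " ".join(title.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(title)
    return unique


# Counts reported in the text.
wos_relevant = 57      # relevant papers found via the Web of Science
wos_duplicates = 18    # duplicates among them
journal_relevant = 25  # relevant papers from Artificial Intelligence and Law

wos_unique = wos_relevant - wos_duplicates
total_reviewed = wos_unique + journal_relevant
print(wos_unique, total_reviewed)
```

The duplicates arise naturally because the same paper can match more than one keyword combination.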
In the fourth step of our survey procedure, the 64 identified research papers were carefully reviewed by the authors. Specifically, the authors extracted key features of the research papers, including the publication year, subject area of the journal, country, and continent of the first author's institute, data source, approaches and algorithms for data analytics, etc. Additionally, these features were summarized using tables or charts.
Finally, the results of the review of the relevant research papers were analyzed and discussed in the fifth step. Consequently, remarkable trends and future research directions for data-analytics-based Legal Tech are provided by our survey study.

Number of Research Papers Related to Data Analytics-Based Legal Tech
Figure 2 shows recent changes in the number of research papers that deal with data analytics-based Legal Tech. The topic was not so popular in the early 2000s, which might be due to the immaturity of related technologies such as AI, ML, and computer hardware. In contrast, published research papers on data analytics-based Legal Tech have been increasing since about 2010. In particular, research on data analytics-based Legal Tech has shown an exponential increase in the three years from 2018 to 2020. In other words, data analytics-based Legal Tech has recently been emerging as a popular research topic.

Geographical Distribution of Research Papers on Data Analytics-Based Legal Tech
Figure 4 reveals that, until a couple of years ago, the number of research papers from Asia was not significantly higher than that from other continents. In contrast, there was a dramatic increase in the number of research papers from Asia starting in 2019. Moreover, Figure 5 shows that most of the research papers from Asia are authored by researchers working for institutes in China. In other words, data analytics-based Legal Tech has recently emerged as a popular research topic, especially in China.

The Chinese government is supporting the development of Industry 4.0 technologies, including Legal Tech, and this seems to contribute to the achievements of Chinese researchers [18]. The Internet court at Hangzhou, China, established in 2017, is the first online court in the world, which provides various services required for judicial proceedings [19]. Other Internet courts in Beijing and Guangzhou were established in 2018, and Chinese media reported that millions of legal cases are handled by this innovative legal service [20]. In an Internet court, data and documents required for legal proceedings are created, collected, and stored electronically. Such electronic data and documents can be processed more conveniently and efficiently by applying data analytics that can be used to extract meaningful knowledge and useful patterns from large volumes of data. In this context, modern legal services such as Internet courts can provide promising application areas for data analytics-based Legal Tech.
Sustainability 2021, 13, x FOR PEER REVIEW 6 of 23

Subject Area of Journals
Figure 6 summarizes the subject areas of the journals that have published relevant research papers. The subject area of a journal can be identified on the Web of Science website, where a single journal can be associated with two or more subject areas. From Figure 6, the following observations can be made. Firstly, the most popular subject area for research papers on data analytics-based Legal Tech is computer science, where 57 out of 64 research papers are published in journals associated with computer science. This implies that the recent remarkable achievements of computer science, such as AI, ML, big data, and cloud, are important technological enablers for Legal Tech. Additionally, Legal Tech is emerging as a promising application domain for modern Industry 4.0 technologies.

Secondly, the second most popular subject area is social sciences, which includes "law" as one of its sub-areas. In other words, Legal Tech is receiving considerable attention from researchers in the field of law, since Legal Tech is expected to have significant impacts on legal industries and services. Moreover, the first and the second most popular subject areas in Figure 6 indicate that data analytics-based Legal Tech is a multidisciplinary research topic.
Thirdly, engineering, mathematics, and decision sciences are the third, the fourth, and the fifth most popular subject areas, respectively. These subject areas are related to quantitative analysis procedures for extracting meaningful knowledge and useful patterns from large volumes of data, which is an integral part of data analytics-based Legal Tech. In other words, such algorithms and methodologies are useful tools for implementing Legal Tech applications.
Lastly, other subject areas are classified as "Others" in Figure 6. These subject areas include materials science, physics and astronomy, medicine, and environmental science, which are not directly related to data analytics or Legal Tech. This reveals that the research papers on data analytics-based Legal Tech are published in a wide range of academic journals.

Data Sources for Data Analytics-Based Legal Tech Research
In this paper, the types of data sources for data analytics-based Legal Tech research are grouped into five categories, as shown in Table 4 and Figure 7. Note that two or more data sources can be utilized together in a single research paper.

'Legal judgment' is a historical record of previous cases, which includes court decisions and case descriptions, such as defendant profiles and information about the associated law articles. The majority (67.2%) of the papers applied data analytics techniques to legal judgment data. Conducting trials is the most fundamental service of the legal industry, and a large amount of data is created, collected, and processed during trials. In this context, legal judgment data are the most popular data source for research on data analytics-based Legal Tech.
'Law' includes clauses and articles of laws that can be found in legal codes, and 31.3% of the relevant research papers utilized this type of data. Law data also play an important role in the process of trials in that they provide a basis for judgments and court decisions.
Other data source types include 'Court record', 'Legislative document', and 'Civil petition'. Court record data contain records of testimonies, statements, and discussions collected during trials in court [21,22]. Five research papers analyzed this type of data by applying data analytics techniques. Legislative documents are historical records on the legislative process in congress, which include bill text, legislator profiles, and past vote histories [23]. This type of data is used in four relevant research papers. A civil petition is a request by a civil petitioner to an administrative agency to take a disposition or other specific action. This type of data is used in one relevant research paper.
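The reported shares can be cross-checked against the 64 reviewed papers with a short calculation. This is a sketch under stated assumptions: the paper counts for 'Legal judgment' and 'Law' are not given in the text, so they are inferred here from the reported percentages, and the helper names `share` and `implied_count` are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

TOTAL_PAPERS = 64  # number of reviewed papers stated in the text


def implied_count(percent, total=TOTAL_PAPERS):
    """Nearest whole number of papers consistent with a reported percentage."""
    return round(percent * total / 100)


def share(count, total=TOTAL_PAPERS):
    """Percentage of papers, rounded half-up to one decimal place."""
    value = Decimal(count) * 100 / Decimal(total)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Inferred from the reported 67.2% and 31.3% (an assumption, not stated counts).
judgment_papers = implied_count(67.2)
law_papers = implied_count(31.3)

# Counts stated directly in the text.
stated = {"Court record": 5, "Legislative document": 4, "Civil petition": 1}

total_usages = judgment_papers + law_papers + sum(stated.values())
print(judgment_papers, law_papers, total_usages)
print(share(judgment_papers), share(law_papers))
```

Under these inferred counts, the number of data source usages exceeds the number of papers, which is consistent with the note that a single paper can use two or more data sources.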
Traditionally, the types of data in Figure 7 would be processed and interpreted by human experts in order to provide a wide range of legal services. However, data analytics techniques can be used to extract meaningful knowledge and useful patterns from various data, and this allows for providing automated or semi-automated legal services. For instance, AI-based legal decision support systems, including an AI judge that can provide a sentencing recommendation on pending lawsuits, emerged as a promising tool for enhancing judicial decision-making procedures. In general, AI judges are designed to analyze legal judgment data, the most popular data source in Figure 7, and make decisions on pending lawsuits by using decision models extracted from the data. Human judges can make judicial decisions efficiently by using preliminary decisions provided by AI judges [24]. In this way, data analytics can be a powerful tool for innovating legal service procedures by analyzing legal data as shown in Figure 7.

Algorithms and Methods of Data Analytics
The main data analytics techniques include data mining, AI, statistics, etc. In this study, the approaches and algorithms of the relevant research papers are categorized by applying the taxonomies of these techniques.
Data mining can be defined as an automated or semi-automated procedure for extracting knowledge, rules, and patterns from large volumes of data [25]. Typically, data mining techniques are grouped into two categories, namely supervised learning (predictive analytics) and unsupervised learning (descriptive analytics), as shown in Figure 8. The ultimate goal of supervised learning is to predict the value of a target variable for new input data. To this end, supervised learning techniques are designed to build a model that explains the relationships between a target variable and predictors by analyzing a training set where the values of the target variable are known [26]. In other words, training sets for supervised learning contain both predictors and target variables. A target variable is an output value or dependent variable to be estimated, while predictors are independent variables that can affect the target variable [27].
Supervised learning is subdivided into two types: classification and regression. Classification is used for estimating the value of a categorical target variable [28]. Examples of classification techniques are decision trees, Bayesian classifiers, and nearest neighbor, etc. [26,29,30]. In contrast, the objective of regression is to estimate a numerical target variable. Examples of regression techniques are linear regression, ridge regression, lasso regression, and artificial neural networks (ANNs), etc. [31].
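As an illustration of the classification setting described above, the following minimal sketch implements a 1-nearest-neighbor classifier in plain Python. The predictor names (number of prior convictions, amount of damage) and the toy training records are hypothetical and are not taken from any of the surveyed papers.

```python
import math

# Hypothetical training set: each record pairs predictors with a known target.
# Predictors: (prior_convictions, damage_in_thousands); target: trial result.
TRAINING_SET = [
    ((0, 1.0), "fine"),
    ((1, 2.0), "fine"),
    ((3, 8.0), "imprisonment"),
    ((4, 9.5), "imprisonment"),
]


def euclidean(a, b):
    """Euclidean distance between two predictor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(predictors, training_set=TRAINING_SET):
    """Return the target value of the training record closest to `predictors`."""
    _, label = min(training_set, key=lambda rec: euclidean(rec[0], predictors))
    return label


print(predict_1nn((0, 1.5)))  # fine
print(predict_1nn((5, 7.0)))  # imprisonment
```

Because the target variable here is categorical, this is a classification task; replacing the labels with numerical values (for example, months of imprisonment) and averaging over several neighbors would turn the same idea into a regression model.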
Unsupervised learning is used to understand and describe the structure of a given data set. Typically, techniques and algorithms of unsupervised learning do not consider a target variable. Unsupervised learning includes clustering analysis and association analysis. The objective of clustering is to find groups of records, such that similar records belong to an identical group while dissimilar records belong to different groups. Examples of clustering algorithms are k-means, DBSCAN, and hierarchical clustering, etc. [26,29,30]. In contrast, association analysis is used to extract interesting association rules, which represent cause-and-effect relationships or correlations among the variables, from a transactional data set [32]. Association analysis can be performed by applying the well-known Apriori algorithm and its variations [32,33] or the FP-growth algorithm [34].
Supervised learning techniques can be used to analyze the relationship between target variables and predictors related to legal decisions. For instance, those techniques enable the prediction of the trial result if appropriate models and predictor variables are given. In contrast, unsupervised learning techniques are typically used to analyze similarities or correlations between legal documents. For instance, sub-groups of similar cases can be identified by applying clustering analysis to legal judgment data.
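The clustering use case mentioned above can be sketched with a plain k-means implementation. The two-dimensional document vectors below are hypothetical, and the choice of the first k points as initial centroids is an assumption made only to keep the example deterministic; random initialization is more common in practice.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids


# Hypothetical vectors for four legal documents (e.g., two topic weights each).
DOC_VECTORS = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]

labels, centers = kmeans(DOC_VECTORS, k=2)
print(labels)  # [0, 0, 1, 1]
```

The number of clusters k has to be supplied by the analyst, which is why the function takes it as an explicit argument.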
Recently, much attention has been paid to AI and ML. AI techniques are used to develop computers or machines that can mimic human intelligence. ML provides algorithms that enable machines to learn from given examples. Thus, the definitions of AI and ML are slightly different from that of data mining. Nevertheless, the aforementioned traditional data mining tasks of classification, regression, clustering, and association can also be performed by applying algorithms and techniques of AI and ML [31,35].
Additionally, two modern techniques, namely text mining and network analysis, are also important approaches for data analytics-based Legal Tech according to our survey. Text mining is the process of extracting useful information from text data. Text data are unstructured data that are hard to process and analyze. Since raw data for Legal Tech often exist in text form, preprocessing for raw data is an important issue of data analytics-based Legal Tech. The ultimate goal of preprocessing is to obtain useful and refined data to be analyzed by data analytics techniques. Well-known preprocessing techniques include feature selection, feature construction, missing value imputation, data integration, and data transformation, etc. These techniques help to obtain high-quality data more suitable for data analytics [29]. Among them, data transformation techniques are very important for Legal Tech, because most traditional data analytics algorithms cannot deal with text data directly.
Text mining algorithms provide non-trivial procedures for transforming text data into structured data [36-38]. Well-known text mining techniques include TF-IDF, bag of words (BoW), and word embedding, etc. TF-IDF generates structured data in the form of a document-term matrix, where the importance of a term is calculated by using term frequency and inverse document frequency. For a single term, TF (term frequency) denotes the number of occurrences of the term in a document, while IDF (inverse document frequency) is inversely related to the number of documents that contain the term [39]. Let us consider three short documents: document 1, 'Autonomous Weapons and International Humanitarian Law: Advantages, Open Technical Questions and Legal Issues to be Clarified'; document 2, 'Autonomous Weapons Systems and the Law of Armed Conflict'; and document 3, 'On the Right of Citizens to Assemble Peacefully, without Weapons, Freely Conduct Meetings and Demonstrations'. The TF values for the terms within documents 1-3 are summarized in Table 5.
The IDF value for a specific term is calculated by using DF (document frequency), the number of documents containing the term, as follows: where n is the number of given documents. Table 6 summarizes the IDF values for the terms within documents 1-3. Then, TF-IDF values can be calculated by using (2), and the TF-IDF values for this example are summarized in Table 7.
In a TF-IDF matrix, a term is frequently found in given documents if the sum of values in the corresponding column is small. For instance, 'Weapons' and 'and' are frequent terms in Table 7.
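The TF, DF, IDF, and TF-IDF quantities described above can be sketched in a few lines of Python for documents 1-3. The sketch assumes the common unsmoothed form IDF = log(n / DF) with the natural logarithm and TF-IDF = TF × IDF; the exact variant behind Tables 6 and 7 (log base, smoothing) may differ. Lowercasing and the regular-expression tokenizer are likewise assumptions.

```python
import math
import re
from collections import Counter

DOCUMENTS = [
    "Autonomous Weapons and International Humanitarian Law: Advantages, "
    "Open Technical Questions and Legal Issues to be Clarified",
    "Autonomous Weapons Systems and the Law of Armed Conflict",
    "On the Right of Citizens to Assemble Peacefully, without Weapons, "
    "Freely Conduct Meetings and Demonstrations",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


# TF: number of occurrences of each term in each document.
tf = [Counter(tokenize(doc)) for doc in DOCUMENTS]

# DF: number of documents that contain each term.
n = len(DOCUMENTS)
df = Counter(term for counts in tf for term in counts)

# IDF and TF-IDF under the assumed unsmoothed definition.
idf = {term: math.log(n / df[term]) for term in df}
tfidf = [
    {term: count * idf[term] for term, count in counts.items()} for counts in tf
]

print(tf[0]["and"], df["and"])  # 2 3
print(tfidf[0]["weapons"])      # 0.0, since 'weapons' occurs in every document
```

Under this assumed definition, terms that occur in every document, such as 'weapons' and 'and', receive a TF-IDF value of zero, which is consistent with the observation that frequent terms have small column sums.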
Similarly, BoW generates a document-term matrix by using the frequencies of given terms [40]. Assume three documents: document 1, 'Autonomous Weapons and International Humanitarian Law: Advantages, Open Technical Questions, and Legal Issues to be Clarified'; document 2, 'Autonomous Weapons Systems and the Law of Armed Conflict'; and document 3, 'On the Right of Citizens to Assemble Peacefully, without Weapons, Freely Conduct Meetings and Demonstrations'. A BoW-based document-term matrix for the terms within documents 1-3 is shown in Table 8. Note that this document requires the length of each document. In Table 8, a document is represented as a row vector containing frequency values of the given terms. These vectors can be used to calculate similarity or dissimilarity between documents.
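A BoW document-term matrix and a similarity computation on its row vectors can be sketched as follows. Cosine similarity is used here as one common choice of similarity measure; it is an assumption of this example rather than a measure prescribed by the text, as are the lowercasing and the tokenizer.

```python
import math
import re
from collections import Counter

DOCUMENTS = [
    "Autonomous Weapons and International Humanitarian Law: Advantages, "
    "Open Technical Questions, and Legal Issues to be Clarified",
    "Autonomous Weapons Systems and the Law of Armed Conflict",
    "On the Right of Citizens to Assemble Peacefully, without Weapons, "
    "Freely Conduct Meetings and Demonstrations",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


# Columns of the matrix: every distinct term, in alphabetical order.
vocabulary = sorted({term for doc in DOCUMENTS for term in tokenize(doc)})

# Rows of the matrix: term frequencies of each document.
counts = [Counter(tokenize(doc)) for doc in DOCUMENTS]
matrix = [[c[term] for term in vocabulary] for c in counts]


def cosine(u, v):
    """Cosine similarity between two frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(round(cosine(matrix[0], matrix[1]), 3))  # documents 1 and 2
print(round(cosine(matrix[0], matrix[2]), 3))  # documents 1 and 3
```

In this toy example, documents 1 and 2, which both concern autonomous weapons and law, obtain a higher similarity than documents 1 and 3.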
Word embedding is used to convert a term into a dense vector containing continuous values, which can be obtained by learning given documents or corpus data [41]. Word embedding enables one to represent a term by a vector with lower dimensionality. Moreover, the dense vector of word embedding can reflect similarity or dissimilarity between terms.
Typically, raw data generated and collected during legal service procedures are in the form of text. However, many data analytics algorithms are designed to handle structured data such as table data. Thus, text mining techniques, including TF-IDF, BoW, and word embedding, are important in that they provide useful preprocessing methods for data analytics-based Legal Tech. The transformed data in Tables 5-8 are in tabular form, where a record is characterized by a number of variables. Since many data analytics algorithms, both supervised and unsupervised, assume tabular structured data, this data structure is most commonly used in the fields of AI, ML, and data mining. In contrast to preprocessing, postprocessing techniques are used to interpret and utilize knowledge and patterns obtained by applying supervised and unsupervised learning techniques more effectively. Data summarization and visualization are examples of postprocessing tasks; however, the postprocessing procedure is out of the scope of this paper.
Another important modern data analysis approach is network analysis. Network analysis is the process of understanding the structures of a given network and the relationships between nodes therein [23]. For instance, legal documents and their citation relationships can be represented as nodes and edges of a network. Network analysis techniques can be used to find important nodes or a community of some nodes, which are useful for information visualization and document recommendation [42,43].
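As a minimal sketch of the citation-network idea, the following example represents documents as nodes and citations as directed edges, and ranks documents by the number of citations they receive (in-degree). The document identifiers and edges are hypothetical, and in-degree is only one simple importance measure; the surveyed papers may rely on other centrality or community detection methods.

```python
from collections import defaultdict

# Hypothetical citation edges: (citing_document, cited_document).
CITATIONS = [
    ("case_B", "case_A"),
    ("case_C", "case_A"),
    ("case_D", "case_A"),
    ("case_D", "case_C"),
    ("case_E", "case_C"),
]


def in_degree(edges):
    """Count how many times each document is cited."""
    counts = defaultdict(int)
    for citing, cited in edges:
        counts[cited] += 1
        counts[citing] += 0  # ensure documents that are never cited also appear
    return dict(counts)


# Most cited first; ties are broken alphabetically by document identifier.
ranking = sorted(in_degree(CITATIONS).items(), key=lambda kv: (-kv[1], kv[0]))
print(ranking[0])  # ('case_A', 3)
```

A ranking of this kind could feed a document recommendation list or determine node sizes in a visualization.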

Input Data and Algorithms
The research papers that applied supervised learning algorithms to legal data are listed in Table 9, while research papers that used unsupervised learning algorithms are shown in Table 10. In addition, research papers that are not contained in Table 9 or Table 10 are listed in Table 11.
The 'Data structure' columns of Tables 9-11 indicate the type of input data for the data analytics algorithms. This paper considers six types of data structure, namely bag of words, TF-IDF, word embedding, segmented document, structured document, and text. Data in the form of the first three types, namely bag of words, TF-IDF, and word embedding, are generated by applying text-mining algorithms. A segmented document can be defined as a set of elements obtained by splitting the contents of a given document. For instance, keywords, sentences, and paragraphs can be used as the elements for creating a segmented document. In structured document-type data, a single document is represented by using a number of features that can be identified from a given document. Examples of features for structured document-type data include information on the victim or defendant, location of judgment, and amount of money involved, etc. [44,45]. Moreover, this type of data is sometimes provided in the form of an XML (Extensible Markup Language) document [44,46,47] or electronic database [42,48,49]. Inherently, structured document-type data are a sort of table data. Thus, this type of data can be processed and analyzed in a convenient way if appropriate features are carefully developed [50,51]. Finally, documents containing plain text are classified as text-type data in Tables 9-11.
For each research paper, the algorithm(s) applied by the authors can be found in the 'Algorithm' columns of Tables 9-11, where all the ANN-based algorithms, such as multi-layer perceptron (MLP) and DL, are classified as ANN&DL.
Tables 9-11 provide the following observations. Firstly, preprocessing plays a significant role in most research papers. In other words, input data for data-analytics-based Legal Tech are generally obtained by applying preprocessing techniques to raw legal documents. Several research papers in Tables 9 and 10 proposed methodologies that can be applied to text-type data; however, those methodologies often contain their own preprocessing procedures that transform text-type data into a more structured form [87,88]. Thus, it can be concluded that it is difficult to directly use legal documents in the form of text to develop data analytics-based Legal Tech applications.
Secondly, a structured document is also a popular data structure. It is well known that data quality has a significant impact on the usefulness of analysis results [29]. Moreover, structured documents of high quality can be obtained by creating meaningful features or variables that describe the contents of the raw data well. Such features can be created automatically by using relevant tools such as natural language processing (NLP) [64]. In this context, it is expected that structured document-type data will continue to be widely adopted by Legal Tech applications.
Thirdly, a significant number of research papers applied supervised learning algorithms to legal judgment data. One reason is that legal judgment is quite an important decision-making process in the legal industry. The other reason is the structure of legal judgment data. In order to apply a supervised learning algorithm, input data should contain both a target variable and predictor variables, such that predictors affect the target variable. In legal judgment data, trial results such as the length of imprisonment and amount of penalty are affected by other information such as the associated law articles and defendant profiles [87]. In other words, trial results and other information can be used as target variables and predictors, respectively. Thus, legal judgment data can be regarded as a good data source for supervised learning tasks. This led the supervised learning task to be dealt with more frequently than unsupervised learning and other tasks.
Fourthly, among the traditional unsupervised learning tasks, clustering is dealt with more often than association, as shown in Table 10. For instance, clustering analysis is a useful tool for discovering a group of legal documents to be focused on [79,91].
Furthermore, the research papers listed in Table 11 generally provide methodologies and applications for information extraction. The topics of the research papers include entity recognition [43,48,93,99], similarity-score-based recommendation or information retrieval [96,98], information visualization [42], and opinion mining [100], etc.
Lastly, 21 of 64 (32.8%) research papers utilized ANN&DL algorithms, which revealed the good performance and wide applicability of ANN and its variations.
The tasks and algorithms of previous research papers are summarized in Table 12. Among supervised learning tasks, classification is tackled more frequently than regression. In other words, most research papers that utilize supervised learning techniques consider categorical target variables. For instance, length of imprisonment, which can be used as a target variable related to legal judgment, can be discretized into two intervals, [0, 1 year) and [1 year, ∞), in order to apply classification algorithms. Since trial results, such as length of imprisonment, are sometimes specified by using intervals in law articles, classification is frequently adopted in research on data analytics-based Legal Tech.
The most widely used algorithm for supervised learning tasks is ANN&DL. An ANN is a network of artificial neurons (nodes) and connections between them, used to generate output values for given input values. The nodes within an ANN form two or more layers. The input layer contains input nodes that indicate input values, while the output layer consists of output nodes that produce output values. Moreover, the layers between the input layer and output layer are called hidden layers. Typically, an ANN with many hidden layers is complex and time-consuming to train, although the hidden layers can contribute to obtaining output values appropriate for the given input values [35,103]. However, modern computer hardware and efficient activation functions allow for utilizing a number of hidden layers for a wide range of practical purposes. An ANN with many hidden layers is called a deep neural network (DNN), and DL is a set of algorithms that use a DNN [104]. Table 12 shows that ANN&DL techniques are also widely used in the legal domain.
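The layered structure of an ANN described above can be sketched as a forward pass through fully connected layers. The network size, the weights, and the choice of ReLU and sigmoid activation functions are hypothetical and chosen only for illustration; training the weights (e.g., by backpropagation) is not shown.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one row of W per node."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Propagate input values through the layers to obtain output values."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense(values, weights, biases, activation)
    return values


# Hypothetical network: 2 input nodes -> 3 hidden nodes -> 1 output node.
LAYERS = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1], relu),
    ([[1.0, -1.0, 0.5]], [0.0], sigmoid),
]

output = forward([1.0, 2.0], LAYERS)
print(output)  # a single value between 0 and 1
```

Adding further entries to `LAYERS` between the first and the last one corresponds to adding hidden layers, which is what distinguishes a DNN from a shallow ANN.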
The most frequently used clustering and association algorithms are k-means and Apriori, respectively. K-means is a well-known clustering algorithm that is used to find k centroid-based, non-overlapping clusters in a given data set. The number of clusters, k, should be prespecified by the analyst, and a single cluster should contain records similar to each other [26,29,30]. In the legal domain, a clustering algorithm is often used to find clusters of similar documents. Association analysis is rarely applied in the legal domain. The Apriori algorithm is a traditional algorithm used for association analysis that is designed to extract interesting association rules from a given transaction data set [32]. An association rule indicates cause-and-effect relationships or correlations between items within a given transaction data set, and a single association rule is useful if and only if its support and confidence measures simultaneously satisfy minimum threshold values [29]. Typically, association analysis is used to identify a set of items frequently found together in identical transactions. In the legal domain, Liu et al. [71] applied the Apriori algorithm to analyze citation relationships between statutes.
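The support and confidence criteria for association rules can be sketched as follows. The transactions, in which each transaction is the set of statutes cited together in one document, and the threshold values are hypothetical. The sketch only evaluates a given rule; it does not implement the candidate generation of the Apriori algorithm.

```python
# Hypothetical transactions: statutes cited together in the same document.
TRANSACTIONS = [
    {"statute_A", "statute_B"},
    {"statute_A", "statute_B", "statute_C"},
    {"statute_A", "statute_C"},
    {"statute_B"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by the support of its antecedent."""
    return support(antecedent | consequent, transactions) / support(
        antecedent, transactions
    )


def is_useful(antecedent, consequent, transactions, min_support, min_confidence):
    """A rule is kept only if both thresholds are satisfied simultaneously."""
    return (
        support(antecedent | consequent, transactions) >= min_support
        and confidence(antecedent, consequent, transactions) >= min_confidence
    )


rule = ({"statute_A"}, {"statute_B"})
print(support(rule[0] | rule[1], TRANSACTIONS))            # 0.5
print(is_useful(*rule, TRANSACTIONS, 0.5, 0.6))            # True
print(is_useful(*rule, TRANSACTIONS, 0.5, 0.7))            # False
```

The rule statute_A → statute_B holds in two of the three transactions that contain statute_A, so its confidence is about 0.67; it passes a minimum confidence of 0.6 but fails one of 0.7.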

Target Variable
As discussed in the previous section, supervised learning techniques that consider a target variable are widely used in research on data analytics-based Legal Tech. In a training set for supervised learning, both predictors and target variable values should be known, and supervised learning algorithms are generally designed to build models that reflect the relationship between predictors and the target variable. Once a model is obtained, it is used to estimate the value of the target variable for a new data object, where only the predictors are known. The target variables of the research papers listed in Table 9 are summarized in Table 13. Note that a single research paper can consider two or more types of target variables. In Table 13, the most frequently considered target variable type is the trial result, which includes the length of imprisonment, amount of penalty, guilt of the defendant, and validity of the patent, etc. [44,51,65,73,79]. This type of target variable is very popular in research on data analytics-based Legal Tech since it is the primary output of the most important legal service, i.e., legal procedures.
Element type as a target variable is used to specify the type of entity identified in legal documents. For instance, some noun phrases can indicate information, such as the name of the person, location, and time, in legal documents [21,45,64]. Ji et al. [22] used ANN&DL techniques to classify the type of paragraphs in court record data. Examples of application areas of classification models for element type target variables include annotation of legal documents and transformation of legal documents into structured documents. Document type is the third most popular target variable type. Typically, legal experts have to examine a large volume of legal documents in order to provide legal services. Sometimes, a legal expert or a department of an organization will specialize in specific types of documents. Similarly, different types of legal documents are often processed in different ways. Thus, the research papers that focused on document type as the target variable aimed to improve the efficiency of legal service procedures by classifying the involved legal documents into appropriate groups. Examples of document type target variables are case type, complaint type, accusation type, and topic type, etc. [57,62,72,77,78].
Law article as a target variable denotes law articles associated with a specific case. In other words, models for this type of target variable are able to find law articles relevant for a given case conveniently [71]. Moreover, the information about law articles determined by data analytics techniques can be used to estimate trial-result-type target variables [68,86].

Key Findings
This paper provides a systematic survey of research on Legal Tech, which provides innovative legal services in the Industry 4.0 era. The key findings of our survey are as follows: Firstly, the number of published research papers on data analytics-based Legal Tech dramatically increased in recent years, especially in Asia. Previous surveys suggested that the Legal Tech industry is rapidly growing across the world [7,16]. In contrast, this paper reveals that a specific sub-area of Legal Tech can be a popular research topic in specific regions.
Secondly, many of the associated research papers applied supervised learning techniques to legal judgment data. The most popular type of such application is trial result prediction, where the trial result, such as length of imprisonment and amount of penalty, is used as a target variable. Legal Tech applications for trial result prediction can help human judges to make legal decisions more efficiently. Furthermore, clients can take advantage of such applications in order to assess their chances of winning a case. Another interesting application of supervised learning techniques to legal judgment data is law article prediction, which aims to identify law articles relevant to a case. Law article prediction can also be used as a preprocessing procedure for trial result prediction in that law articles related to a case typically provide valuable information that affects the trial result. In this manner, data analytics enable the provision of legal services more efficiently and accurately [14].
Thirdly, the structured document is the most popular data structure for research on data analytics-based Legal Tech. It is difficult to process and analyze raw documents in text format. Thus, many researchers extract useful features from legal documents and utilize them to convert the documents into tabular form. Such features have to be identified and collected by applying nontrivial techniques such as text mining and NLP.
Fourthly, ANN&DL is the most frequently used algorithm type in research on data analytics-based Legal Tech. ANN&DL can be applied to both classification and regression analyses. Additionally, unsupervised learning tasks can be dealt with by using ANN&DL techniques. In other words, ANN&DL techniques have wide applicability. Moreover, ANN&DL techniques often perform well even if the training set is very complex. In this context, ANN&DL is a promising approach for developing data analytics-based Legal Tech applications.
Fifthly, data analytics-based Legal Tech is a multidisciplinary research topic related to computer science, social science, engineering, and mathematics, etc. In particular, Legal Tech has emerged as a promising application domain for quantitative methods and algorithms, which have been popular research topics in computer science and engineering.

Future Research Topics and Challenges
The authors conclude this paper by listing several future research topics and challenges for data analytics-based Legal Tech. The first topic is legal issues related to the applications of data analytics in the legal industry. For instance, misjudgments by data analytics-based Legal Tech applications can cause significant losses for stakeholders and spark arguments over who is responsible. Furthermore, the decision-making model in data analytics-based Legal Tech applications can contain biases that lead to discrimination. Thus, reliability, fairness, and scope of application will be relevant research topics for data analytics-based Legal Tech.
The second topic is business and service models based on data analytics-based Legal Tech. Modern Legal Tech enables innovative business and service models in the legal industry [7]. Data analytics-based Legal Tech can also contribute to providing new legal services. For instance, trial result prediction can be used to evaluate the difficulty of a lawsuit, which might affect the pricing of related legal services. The intelligent pricing policy can enable new business models in the legal industry. Acceptance intention of such business and service models is also an important future research topic.
Thirdly, the language of legal documents should be carefully considered in research on data analytics-based Legal Tech. Previous research papers generally consider only a single language. In particular, the structure of the specific language can affect the performance of the preprocessing procedure. An application successfully applied to one language can fail to process legal documents written in another language. Thus, diversity of language is an important challenge for data analytics-based Legal Tech.
Fourthly, the application of unsupervised learning techniques is expected to increase in the future. Thus far, these techniques have been less popular than supervised learning techniques in research on data analytics-based Legal Tech. However, they have been successfully applied in a wide range of application areas, including e-commerce, customer relationship management (CRM), marketing, manufacturing, education, and healthcare, etc. Similarly, the legal industry can be another promising application area for unsupervised learning and other techniques.
In this paper, previous studies on Legal Tech are reviewed from the perspective of data analytics, one of the most important Industry 4.0 technologies. The authors believe that this paper provides meaningful insights into the concepts, approaches, and research topics of data analytics-based Legal Tech.