Using Rule-Based Decision Trees to Digitize Legislation

Abstract: This article introduces a novel approach to digitizing legislation using rule-based decision trees (RBDTs). As regulation is one of the major barriers to innovation, novel methods for helping stakeholders better understand, and conform to, legislation are becoming increasingly important. Newly introduced medical device regulation has increased the complexity of regulatory strategy for manufacturers, and the pressure on notified body resources to support this process is making this a growing concern in industry. This paper explores a real-world classification problem that arises for medical device manufacturers when they want to be certified according to the In Vitro Diagnostic Regulation (IVDR). A modification to an existing RBDT algorithm is introduced (RBDT-1C) and a case study demonstrates how this method can be applied. The RBDT-1C algorithm is used to design a decision tree to classify IVD devices according to their risk-based classes: Class A, Class B, Class C and Class D. The applied RBDT-1C algorithm demonstrated accurate classification in line with published ground-truth data. This approach should enable users to better understand the legislation, has informed policy makers about potential areas for future guidance, and allowed for the identification of errors in the regulations that have already been recognized and amended by the European Commission.


Introduction
Regulations contain rules set up by (governmental) authorities to control specific aspects of certain industries, which often influence the way companies operate. These rules affect how industries are managed, and the importance of regulations is such that many companies create specific divisions focused solely on regulatory strategy [1]. Regulatory frameworks encourage consumers to adopt innovations by ensuring that their safety and effectiveness have been evaluated. Regulations promote better utilization of technologies and encourage the identification of novel technologies within industries [2]. When used effectively, regulation drives the direction of innovation [3], and can stimulate innovative ecosystems [4]. However, regulations also create barriers that can hold up the innovative process [5]. For innovative firms, they are one of the most significant barriers of perceived environmental uncertainty [6]. Organizational characteristics can have a significant influence on the effect of regulation on innovation [7]. SMEs need more information to better identify the relevant regulation and understand the requirements for conformity based on these regulations [8], which are the first steps in the regulatory navigation pathway. A better understanding of legislation can help to alleviate the barrier to innovation that regulation presents. One useful approach to providing a better understanding of complex legislation is to apply engineering methods to represent legislation in a more quantitative manner. Digitizing legislation is proposed as a method to present the rules within legislation in a simpler, logic-based format for the stakeholders that need to conform to the regulation. There are two types of classification problems resulting from legislation: identification of relevant regulation, and classification of certain use cases for conformity requirements within the legislation.
Prosthesis 2022, 4, 114
While all products and services need to consider relevant legislation, a smaller subset of legislation contains risk-based classification problems within the legislative document. Examples of this include the classification of data types within data protection regulation (e.g., the General Data Protection Regulation) and the classification of devices within EU medical device regulation (e.g., the Medical Device Regulation and In Vitro Diagnostic Regulation). Legislation containing risk-based classes has different conformity requirements depending on the risk level of the use case in question. This article presents a novel method for using rule-based decision trees to digitize classification problems within legislation, using the In Vitro Diagnostic Regulation (IVDR) as a case study example. Rule-based decision trees were selected because classification is based on the rules within the legislation and is not dependent on a specific training dataset. This use case will focus on the classification problem within the IVDR legislation, although the method is also applicable for determining whether devices are governed by the IVDR. These classification problems appear in complex legislation with multiple conformity requirements. Stakeholders require greater understanding of these types of legislation as the conformity requirements are dependent on the risk level of their product or service.

Digitizing Legislation
Early attempts to create digital tools for classification problems within legislation were constrained by challenges arising from differences between legal and technical semantics. Where engineering lexicon is based on precision [9], legal language is characterized by its construction to remain as generalized as possible to the class of problems that the legislator is addressing [10]. Legal requirements present multiple problems for managing compliance including ambiguities, cross-referencing, acronyms, domain-specific definitions, and frequent amendments due to revisions of legislation and case law [11]. One of the first attempts to digitize legislation was undertaken by Sergot et al. (1986) [12], when the British Nationality Act was converted into a logic program. They concluded that representation of rule-based legislation using logic programming is optimal, as it is simple for both naive users and experts to understand. This method allowed users to infer meaning from the legislative rules and made modifications of the system straightforward. Translation of laws into a logic program was shown to provide a better understanding of the rules within the legislation and identified specific interpretation issues within legislation [13]. The use of logic programming for representation of legislation was met with a scathing rebuttal by Leith (1986) [14]. The professor of law argued that 'legislation could not be formalized in a truly logical format'. The use of a logic program was criticized for embedding assumptions into the logic program arising from vague concepts. Challenges regarding the interpretation and ambiguity within the legislation were highlighted. Other attempts to digitize legislation consisted of a proof of concept for modelling the Italian data protection legislation created by Massaccia et al. (2005) [15], and an evaluation tool for the compliance of websites to Canada's Personal Information Protection and Electronic Documents Act (PIPEDA) [16].
This article builds on the concept of using decision trees to better understand medical device regulation proposed by Bergmann et al. [17]. It will focus on the use of decision trees for representation of legislation, as they are suited to deal with issues arising from the differences between legal and technical semantics. Where engineers previously embedded assumptions into logic programs, decision trees allow users to directly evaluate the rules within the legislation at the level of attributes along the decision pathway.

Decision Trees for Digitizing Legislation
Bergmann et al. [17] proposed a data-driven methodology using the Iterative Dichotomiser 3 (ID3) decision tree learning methodology, introduced by Quinlan (1985) [18], to classify medical devices according to the rules set out in the Medical Device Directive (MDD). The alternative method to building a decision tree using a data-driven approach is the use of a rule-based approach. Knowledge is stored in the form of a set of rules, which are converted into a decision tree when a decision-making process is required [19]. Rule-based decision trees can be built from either static or dynamically changing decision rules [20]. If classification rules are stored as a set of declarative statements, then there are no constraints on the order in which the attributes can be evaluated, meaning that it is much easier to update the decision tree when new information and data are available. This is preferable for decision trees based on legislation, as there are often amendments added, revisions made in the legal text, and real-world examples published to further clarify ambiguities and issues with interpretation.
Decision trees also have advantages over 'black-box' classification models, including neural networks, as the logical rules in a decision tree are much easier to interpret than the complex relationship between hidden nodes in a neural network for the decision makers [21]. This allows stakeholders to obtain a one-to-one mapping between their product or service and the classification rules in the legislation, often required for conformity assessment. In addition to the ease of interpretability, a decision tree built from a pre-defined set of decision rules allows the classification of examples where there is no available data for training.
An important advantage that rule-based decision trees have over both data-driven decision trees and neural network approaches is the accuracy of the classifications from the legislative text. Classification is directly based on the rules that are set out in the legislation, and the accuracy of the classification is not dependent on training the decision tree using available data.
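As a minimal sketch of the declarative storage described above, rules can be held as plain data and matched in any order. The attribute names, values and classes below are hypothetical and are not taken from any legislation; the function name `matching_classes` is likewise an illustrative assumption.

```python
# Hypothetical rule set: each rule pairs its conditions with a decision class.
# Attributes missing from a rule's conditions are "don't care" for that rule.
RULES = [
    ({"expertise": "high", "investment": "high"}, "high risk"),
    ({"expertise": "low"}, "low risk"),
    ({"expertise": "high", "investment": "low"}, "medium risk"),
]


def matching_classes(example, rules):
    """Return the set of classes of every rule whose conditions all hold."""
    return {
        decision
        for conditions, decision in rules
        if all(example.get(attr) == value for attr, value in conditions.items())
    }


example = {"expertise": "high", "size": "small", "investment": "low"}
print(matching_classes(example, RULES))
# Reordering the rules does not change the outcome.
print(matching_classes(example, list(reversed(RULES))))
```

Because no rule depends on its position in the list, adding or amending a rule only requires editing the data, which is the property that makes this representation convenient when legislation is revised.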

Overview of Rule Based Decision Tree Algorithms
The earliest paper identified that introduces the concept of a rule-based decision tree was Imam et al. [19]. The paper introduces a method called the AQDT-1 (AQ-derived Decision Tree-1), which builds a decision tree from decision rules generated using AQ-type inductive learning. The algorithm optimizes the order in which the rules are assessed in the decision tree structure. It is important to note that while this paper uses a set of rules that has been generated using inductive learning, the algorithm can be used to build a decision tree from any set of rules, including those in legislation. Rules at each node of the decision tree subdivide the examples into smaller groups depending on whether the attribute at the node applies or not. The attributes are derived from the rules and placed at respective nodes along the decision tree. An attribute selection criterion is used to analyze the relationship between the attributes and the classification rules, and the order in which they are evaluated is optimized based on multiple selection criteria. The ordering of the attributes in the decision tree is determined by an attribute utility ranking comprising three criteria: disjointness, dominance, and extent. These measure, respectively, the effectiveness an attribute has in determining the final decision class for an example, the frequency with which the attribute is present in the rules, and the number of values the attribute can take within the rules. The algorithm aims to maximize the disjointness and dominance of an attribute while minimizing its extent. Similar to greedy splitting decision trees, this method creates a large number of leaf nodes in the decision tree, and decision rules are often pruned depending on the rule strength.
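Two of the three criteria above can be sketched directly from their descriptions in this section. The following is a simplified illustration under stated assumptions: dominance is taken as the number of rules that mention an attribute and extent as the number of distinct values it takes, disjointness is omitted, and the rules and function names are hypothetical rather than the published AQDT-1 definitions.

```python
from collections import Counter

# Hypothetical rules: (conditions, decision class).
RULES = [
    ({"expertise": "high", "investment": "high"}, "high"),
    ({"expertise": "low"}, "low"),
    ({"expertise": "high", "size": "small", "investment": "low"}, "medium"),
    ({"expertise": "high", "size": "large", "investment": "low"}, "low"),
]


def dominance(rules):
    """Count, per attribute, the rules in which the attribute appears."""
    return Counter(attr for conditions, _ in rules for attr in conditions)


def extent(rules):
    """Count, per attribute, the distinct values it takes across the rules."""
    values = {}
    for conditions, _ in rules:
        for attr, value in conditions.items():
            values.setdefault(attr, set()).add(value)
    return {attr: len(vals) for attr, vals in values.items()}


def rank_attributes(rules):
    """Order attributes by higher dominance, then lower extent, then name."""
    dom, ext = dominance(rules), extent(rules)
    return sorted(dom, key=lambda attr: (-dom[attr], ext[attr], attr))


print(rank_attributes(RULES))
```

Sorting on a tuple mirrors the idea of applying the criteria in sequence, with later criteria only breaking ties left by earlier ones.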
The AQDT-1 tree, whilst novel in terms of the idea that it proposed, was far from optimal when evaluating the complexity of the decision trees created. This led to a publication shortly after by Michalski et al. [22], which outlined a refined method for this idea: the AQDT-2 algorithm. This method was shown to outperform the AQDT-1 in terms of both accuracy and decision tree complexity, as determined by the number of decision tree leaf nodes. The notion of the cost of evaluating certain attributes was introduced in this paper. It was deduced that the "inexpensive" attributes should be evaluated first, and thus should be assigned close to the root node. The "expensive" attributes should only be evaluated when necessary, and are therefore placed further away from the root node. This cost is quantified using an importance score. While this optimization was shown to improve the predictive accuracy and reduce the complexity of the decision tree, it has the drawback that a training dataset is required to calculate the cost of an attribute.
The next rule-based approach was created when Abdelhalim et al. (2014) [20] introduced the Rule Based Decision Tree (RBDT-1) algorithm, which does not depend on a training dataset for optimization. This paper highlighted the fact that rule-based approaches are often used in situations with limited data, and commented that the previously developed AQDT-2 required a training set for optimization. The RBDT-1 algorithm depends only on the rules themselves and uses three criteria to optimize the order in which attributes are evaluated: attribute effectiveness, attribute autonomy and minimum value distribution. Attribute effectiveness (AE) is the first of these to be considered. It depends on the influence that the attribute has in determining the class of an example. If an attribute does not contribute to the decision process of assigning a class to an example, then it has a lower AE score than one that does. The attribute with the highest AE score is chosen for the root node, and subsequent attributes are then selected according to the next highest AE score until a leaf node is reached. The attribute autonomy (AA) is considered when two attributes have the same AE score. This criterion selects attributes that create fewer subsequent nodes before the leaf node in an attempt to reduce the depth of the decision tree. The minimum value distribution (MVD) is only taken into account when two attributes have both the same AE and AA score. It also aims to minimize the complexity of the decision tree by favoring attributes that have fewer values in a given rule. An attribute with fewer values will in turn generate fewer branches, which will result in a smaller and less complex decision tree. Table 1 introduces an example problem, which is a modified version of a decision tree classification problem outlined in the literature [20].
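The selection loop described above can be sketched as a recursive tree builder that needs only the rules. This is a simplified illustration, not the published RBDT-1 procedure: AE is approximated by the share of rules that specify an attribute, MVD by the number of distinct values it takes, and the AA criterion is omitted. The rules and all function names are hypothetical.

```python
# Hypothetical rules: (conditions, decision class). A missing attribute is "don't care".
RULES = [
    ({"expertise": "high", "investment": "high"}, "high"),
    ({"expertise": "low"}, "low"),
    ({"expertise": "high", "size": "small", "investment": "low"}, "medium"),
    ({"expertise": "high", "size": "large", "investment": "low"}, "low"),
]


def effectiveness(attr, rules):
    """Simplified AE stand-in: share of rules that specify the attribute."""
    return sum(attr in conditions for conditions, _ in rules) / len(rules)


def value_count(attr, rules):
    """Simplified MVD stand-in: number of distinct values the attribute takes."""
    return len({conditions[attr] for conditions, _ in rules if attr in conditions})


def build_tree(rules, attributes):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    classes = {decision for _, decision in rules}
    if len(classes) == 1:
        return classes.pop()
    usable = [a for a in attributes if value_count(a, rules) > 0]
    if not usable:
        return sorted(classes)  # the remaining rules cannot be separated further
    # Highest effectiveness first; fewer values breaks ties.
    best = max(usable, key=lambda a: (effectiveness(a, rules), -value_count(a, rules)))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({c[best] for c, _ in rules if best in c}):
        # Rules that do not mention the attribute follow every branch.
        subset = [(c, d) for c, d in rules if c.get(best, value) == value]
        branches[value] = build_tree(subset, remaining)
    return {best: branches}


TREE = build_tree(RULES, ["expertise", "size", "investment"])
print(TREE)
```

No training examples appear anywhere in the sketch: the ordering of nodes is derived entirely from the rule set, which is the property that distinguishes this family of algorithms from data-driven methods such as ID3.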
Information is given in Table 1 for a range of different medical device companies, including the expertise level of the staff, size of the company (small <50 employees; medium 50-250 employees; large >250 employees) and investment attracted. The final column shows what level of (medical) risk the product developed by that specific company carries. More stringent regulations govern higher risk devices, which translates into additional costs and time for businesses.
Table 1. Device risk dataset used to compare the different decision tree algorithms. The "Expertise" column captures the human capital within a company with regards to medical devices. The "Size" indicates the number of employees within a company. The "Investment" shows how much investment that company had attracted, whilst the "Device risk" shows the risk of the medical devices they develop.

[Table 1 column headers: Company, Expertise, Size, Investment, Device Risk]
The RBDT-1 is compared to previous rule-based and data-driven decision trees across a variety of decision tree classification problems, and is shown to outperform the AQDT-1, AQDT-2 as well as the ID3 (entropy driven) algorithms in terms of complexity, as shown in Figures 1 and 2. This is exhibited by the smaller number of attribute nodes and leaf nodes generated by the RBDT-1 decision tree. The RBDT-1 algorithm generated three attribute nodes and five leaf nodes, whilst the other algorithms ended up with five attribute nodes and seven leaf nodes.
Figure 1. Decision tree generated from Table 1 using the RBDT-1 algorithm [20]. It consists of three attribute nodes and five leaf nodes.
Figure 2. Decision tree generated from Table 1 for the AQDT-1, AQDT-2 and ID3 algorithms [20]. It consists of five attribute nodes and seven leaf nodes.
While there is some literature introducing the design of rule-based decision trees, there is a gap in the literature regarding the optimization of rule-based decision trees with additional case-specific constraints. Currently, rule-based decision tree algorithms require modification to incorporate any additional constraints arising from the legislation. A rule-based decision tree that allows for constraints to be set (RBDT-1C) is introduced in this paper and the In Vitro Diagnostic Regulation (IVDR) is used as a case study example.

IVDR Legislation as a Case Study Example of RBDT-1C
Previous European medical device regulation promoted innovation by favoring faster regulatory approval and patient access to novel technologies, which resulted in a series of patient outcome scandals, most notably the Poly Implant Prothèse (PIP) scandal. This led to a major overhaul of European medical device regulation in 2017, where the new regulation aims to increase the safety and effectiveness of medical devices by placing more emphasis on clinical evaluation and post-market surveillance [23]. The new regulation has reacted to changes in the industry, including a rise in the use of medical device software [24]. However, Arnould et al. [5] have demonstrated that this has led to an increased complexity of Medical Device Regulations (both the MDR and IVDR). There is currently a shortage of resources within organizations responsible for certifying European medical devices due to the introduction of the new legislation [25], political factors [26], and other factors. The case study introduced in this paper is the use of a digital decision tree for classification of medical devices according to the new IVDR legislation. This can help to alleviate the strain on regulatory stakeholders by providing resources to help medical device manufacturers classify their devices according to the new regulation. The proposed RBDT-1C algorithm was used to design a decision tree for the classification of in vitro diagnostic (IVD) devices into the risk-based classes outlined in the IVDR. The risk-based classes in the IVDR combine both patient and population risk as follows: Class A, low patient and population risk; Class B, moderate to low patient risk and low population risk; Class C, high patient risk and low population risk; and Class D, high patient and population risk. Population risk is important for IVD devices as a false negative (meaning the patient is positive) could endanger others if the disease is transmissible and life-threatening.
Decision tree attributes were created from the classification and implementing rules in Annex VIII of the IVDR [27]. Rules with multiple classification criteria were split into multiple attributes, as prior research has shown that complex questions are harder to comprehend [28] and interfere with the mapping process [29], the cognitive process by which a user simultaneously recalls the question and answers it based on the example they are testing. This additional complexity is likely due to the increased demand on the working memory of the user [30]. All complex and lengthy rules can be split into multiple attributes. Where rules contained many sub-rules, natural language processing (NLP) techniques were used to assess the semantic similarity of the sub-rules and to cluster them into semantically meaningful groups. While long and complex questions are undesirable for a decision tree, excessive splitting of the sub-rules would lead to a deeper and more complex decision tree. Semantically similar sub-rules do not overcomplicate attributes, meaning multiple sub-rules can be asked at a single node of the decision tree if they relate to similar devices. The semantic properties of the sub-rules were assessed using word embedding methods, which convert words into vector representations that capture their semantic properties in the vector space. The word embeddings for each sub-rule were determined using the pre-trained BioBERT sentence embedding algorithm [31]. BioBERT's pre-trained and fine-tuned deep neural network allows it to create vector representations in which both the complex characteristics of words, such as syntax and semantics, and the linguistic context of each word (to model polysemy) are reflected in the sentence embedding. BioBERT is also pre-trained on all PubMed data, meaning it can capture the semantic properties of biomedical specific terminology.
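The grouping of semantically similar sub-rules can be illustrated with cosine similarity over sentence embeddings. The sketch below is an assumption-laden illustration: the three-dimensional vectors are made-up stand-ins for BioBERT embeddings, the threshold of 0.8 is arbitrary, and the greedy grouping strategy is one simple option rather than the clustering method used in the study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_by_similarity(embeddings, threshold=0.8):
    """Greedily place each sub-rule in the first group whose seed is similar enough."""
    groups = []
    for name, vector in embeddings.items():
        for group in groups:
            seed = embeddings[group[0]]
            if cosine(vector, seed) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no similar group found, start a new one
    return groups


# Toy vectors standing in for sentence embeddings of three hypothetical sub-rules.
TOY_EMBEDDINGS = {
    "sub_rule_a": [0.9, 0.1, 0.0],
    "sub_rule_b": [0.8, 0.2, 0.1],
    "sub_rule_c": [0.0, 0.1, 0.9],
}

print(group_by_similarity(TOY_EMBEDDINGS))
```

Each resulting group would correspond to one candidate attribute in the decision tree, so a higher threshold produces more, narrower questions and a lower threshold produces fewer, broader ones.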

Building the RBDT-1C
After the attributes were created from the classification and implementing rules in the IVDR, the decision tree was built using a modified version of the RBDT-1 algorithm proposed by Abdelhalim et al. (2014) [20]. The use of rule-based decision trees for the representation of legislation is possible when using the RBDT-1 algorithm, as no data is required to train the decision tree. Additional constraints imposed by legislation can be designed into the decision tree by introducing supplementary criteria to the RBDT-1 algorithm, creating the RBDT-1C methodology, which is detailed in the Supplementary Materials.
For the IVDR, the classification of a device is determined by the rule with the highest risk-based classification. If multiple rules apply to one device, then the rule with the highest risk-based classification determines the class of the device (IVDR, Annex VIII, Implementing rule 1.9), where Class D is the highest risk-based classification and Class A is the lowest. The risk-based classification is based on the personal risk and public health risk associated with an IVD device. The conformity constraint for the classification of devices is that all classification and implementing rules have to be considered when classifying a device (IVDR, Annex VIII, Implementing rule 1.7). This means that even if a higher risk-based rule applies, the lower risk-based rules must still be assessed, even though the "hierarchy" of classification means they do not contribute to the final classification. The conformity constraint was incorporated into the tree using a novel attribute hierarchy (AH) criterion. This is defined for each attribute as the number of subsequent attributes that can be excluded owing to the attribute taking a specific value. It is calculated multiple times for each attribute, once for each of the potential values that the attribute can take. These exclusions were then designed into the final decision tree structure, where the ordering of the IVDR attributes was determined by the three criteria within the original RBDT-1 algorithm: attribute effectiveness, attribute autonomy and minimum value distribution.
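The two constraints described above, that every rule is considered and that the highest applicable class wins, can be sketched as follows. The rule names, device flags and class assignments are hypothetical placeholders rather than the text of the IVDR rules, and the attribute hierarchy criterion itself is not reproduced here.

```python
RISK_ORDER = ["A", "B", "C", "D"]  # lowest to highest risk class

# Hypothetical stand-ins for classification rules; no real rule text is encoded.
RULES = [
    {"name": "rule_p", "cls": "D", "applies": lambda d: d.get("meets_p", False)},
    {"name": "rule_q", "cls": "C", "applies": lambda d: d.get("meets_q", False)},
    {"name": "rule_r", "cls": "B", "applies": lambda d: d.get("meets_r", False)},
]


def classify(device, rules):
    """Evaluate every rule, then return (highest applicable class, applied rule names)."""
    applied = [rule for rule in rules if rule["applies"](device)]
    if not applied:
        return None, []  # placeholder outcome when no rule applies
    highest = max((rule["cls"] for rule in applied), key=RISK_ORDER.index)
    return highest, [rule["name"] for rule in applied]


device = {"meets_q": True, "meets_r": True}
print(classify(device, RULES))
```

Returning the full list of applied rules alongside the class keeps a record that the lower-risk rules were assessed, even though only the highest class determines the result.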
If devices are not independent medical devices (e.g., calibrators), they are classified according to the risk level of the 'Parent Device', which is the device that they are used with. The dotted line in Figure 3 represents the choice that the medical device manufacturer is given to proceed to classify the 'Parent Device' when a device does not have an independent classification. Additionally, Implementing Rule 1.4 (Table 2) refers to medical device software, which is only classified in its own right if it is independent of any other medical devices. If the software drives or influences the use of another medical device, then it is classified according to the 'Parent Device' that it is used with.
Figure 3. IVDR Decision Tree with Rules, classification rules and device risks corresponding to the IVDR Classification Rules and criteria in Table 2. The final classification given to a device consisted of Class A, B, C or D.
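The 'Parent Device' behaviour described above amounts to following a reference until a device with its own classification is reached. The sketch below assumes a simple dictionary of device records; the device names, the record layout and the function `effective_class` are hypothetical.

```python
def effective_class(device_name, devices):
    """Follow 'parent' links until a device with its own classification is found."""
    visited = set()
    current = device_name
    while True:
        if current in visited:
            raise ValueError(f"circular parent reference at {current!r}")
        visited.add(current)
        record = devices[current]
        if record["parent"] is None:
            return record["cls"]
        current = record["parent"]


# Hypothetical records: dependent devices carry no class of their own.
DEVICES = {
    "assay": {"cls": "C", "parent": None},
    "calibrator": {"cls": None, "parent": "assay"},
    "driving_software": {"cls": None, "parent": "assay"},
}

print(effective_class("calibrator", DEVICES))
```

The cycle check is a defensive choice for the sketch, so that a data entry error in the parent links raises an error instead of looping indefinitely.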

Classification Results from the IVDR Decision Tree, Built Using the RBDT-1C Algorithm
The IVDR decision tree was tested using 55 example IVD devices with known classifications according to the IVDR. This list of devices, which included devices from all of the classification and implementing rules, was compiled from multiple sources including: Medicines and Healthcare products Regulatory Agency (MHRA) In Vitro Diagnostic Directive (IVDD) guidance document [32]; British Standards Institution (BSI) classification of IVD devices whitepaper and infographic [33,34]; Australian Therapeutic Goods Administration (TGA) classification of IVD devices [35]; Food and Drug Administration (FDA) IVD device examples for human genetic tests and companion diagnostics [36,37]; and an International Society of Blood Transfusion (ISBT) report on human blood grouping systems [38]. Each of the devices was classified using the IVDR decision tree, independently, by three experts comprising medical and regulatory specialists. The independent classifications were assigned using the IVDR decision tree without prior knowledge of the known classifications. The results were discussed collaboratively to highlight different interpretations of the rules within Annex VIII of the IVDR and human error within the results. Ethical approval was in place (R63968/RE001) when collecting this data. Classification of 52 of the 55 device examples was agreed by all members of the research team, and these results were compared to the ground truth classifications. The decision tree was shown to accurately classify all IVD devices where there was no ambiguity in interpretation of the rules in the IVDR. Classification was not agreed for three devices, which identified ambiguity and interpretation issues within the classification rules of the IVDR.
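The evaluation procedure above, unanimous agreement between experts followed by comparison with known classes, can be sketched on toy data. The device names and labels below are invented for illustration and are not the study data; in the study, 52 of 55 devices (roughly 94.5%) reached unanimous agreement.

```python
def unanimous(labels_by_expert):
    """Return {device: class} for devices on which all experts gave the same class."""
    devices = set.intersection(*(set(labels) for labels in labels_by_expert))
    agreed = {}
    for device in sorted(devices):
        classes = {labels[device] for labels in labels_by_expert}
        if len(classes) == 1:
            agreed[device] = classes.pop()
    return agreed


def accuracy(agreed, ground_truth):
    """Share of agreed classifications that match the known class."""
    correct = sum(agreed[device] == ground_truth[device] for device in agreed)
    return correct / len(agreed)


# Invented labels from three hypothetical experts for three hypothetical devices.
EXPERTS = [
    {"device_1": "A", "device_2": "C", "device_3": "D"},
    {"device_1": "A", "device_2": "C", "device_3": "C"},
    {"device_1": "A", "device_2": "C", "device_3": "D"},
]
GROUND_TRUTH = {"device_1": "A", "device_2": "C", "device_3": "D"}

AGREED = unanimous(EXPERTS)
print(AGREED, accuracy(AGREED, GROUND_TRUTH))
```

Devices that drop out of the agreed set, such as `device_3` here, are the ones worth inspecting for ambiguous rule wording, which mirrors how the three disputed devices were used in the study.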
The process of design and testing of the RBDT-1C identified difficulties in interpretation of the legislation. These issues resulted from differences between legal and technical semantics, and consisted of: ambiguity in terms used to distinguish between different device classifications; interpretation issues within classification rules; vagueness in healthcare domain specific vocabulary; and a lack of coherence between cross-referenced regulatory texts.

Discussion
The approach of converting rules within legislation into a rule-based decision tree identified specific interpretation issues within the IVDR. These interpretation issues were identified when the decision tree was implemented in January 2020, and many of them were independently clarified in guidance documents published in November 2020 [39] by the designated European Medical Device Coordination Group (MDCG). This demonstrates that the introduced RBDT-1C algorithm can help legislators identify areas where guidance and further clarification are required. In addition, digitizing the IVDR legislation identified cross-referencing errors in the IVDR (and Medical Device Regulation), which have already been recognized and amended by the European Commission.
The medical device regulation field relies heavily on experience, and this process gives knowledge and insight into the regulation that usually takes years to acquire, which is especially important for new and recent legislation. It was shown that this kind of algorithm could help IVD device manufacturers conform to, and better understand, the IVDR legislation, as the logic-based evaluation allows understanding of the interaction between the classification rules within the legislation and provides insight into how users navigate the complex regulation. This approach can be used for other legislation and is not limited to the medical device industry. The conversion of legislation into a digital decision-making tool can also aid engineering management decisions when considering regulatory strategy within organizations.
It should be noted that these algorithms must be kept in line with any rule changes, as an outdated representation of the rules within an algorithm can lead to incorrect outcomes. Applying these kinds of algorithms should therefore also include a clear process for keeping them up to date. This is particularly important when these algorithms are integrated within larger digital platforms. The digitization of legal frameworks can also take these considerations into account in order to allow software engineers to build more robust systems.
In general, resources that aid regulatory strategy can increase the understanding of legislation for all stakeholders within an industry. This can help overcome the barrier to innovation that is often presented by regulation. While this is not limited to the medical device industry, this research example showed that digital solutions can be built to help users navigate the regulations. Nonetheless, this rule-based methodology is limited to legislation with structured and well-defined rules, as the attributes need to be clearly defined according to the legislation in order to allow for accurate classification based on a rule-based decision tree. An additional limitation of this method is the interpretation issues within the legislation, as ambiguity within the legislation resulted in divergent classification when using the IVDR decision tree. However, we believe that highlighting and identifying these challenges can also lead to further clarification and guidance from the legislators. This could in turn result in a better understanding of the regulations themselves. The introduced RBDT-1C algorithm provides a valuable tool within the field of regulatory science and engineering management.