An Interpretable Predictive Model for Health Aspects of Solvents via Rough Set Theory

Abstract: This paper presents a machine learning (ML) approach to predict the potential health issues of solvents by uncovering the hidden relationship between substances and toxicity. Solvent selection is a crucial step in industrial processes. However, prolonged exposure to solvents has been found to pose significant risks to human health. To mitigate these hazards, it is crucial to identify the factors that contribute to solvent toxicity, and this research therefore aims to develop a predictive model for health issues related to solvent toxicity. Among the various ML algorithms, Rough Set Machine Learning (RSML) was chosen for this work because the models it generates are interpretable. The models were developed through data collection on the toxicity of various organic solvents, construction of predictive models with decision rules, and model verification. The results reveal correlations between solvent toxicity and the Balaban index, valence connectivity index, Wiener index, and boiling point. The predictive model generated using RSML provides insightful observations about the correlation between human toxicity and molecular attributes.


Introduction
Solvents have been widely used in the chemical industry for dissolving, suspending, diluting, and separating substances. The paints and coatings sector holds the largest share of the global solvent market, followed by the printing inks segment, with the industrial cleaning industry in third position [1]. However, prolonged exposure to solvents, particularly those containing volatile organic compounds (VOCs), can adversely affect human health, especially the respiratory, nervous, and reproductive systems. According to the National Institute for Occupational Safety and Health (NIOSH), approximately 9.8 million workers [2] are regularly exposed to high dosages of solvents each year through various exposure pathways. To address these health issues, organizations such as the Occupational Safety and Health Administration (OSHA) and the U.S. Environmental Protection Agency (EPA) have established guidelines, such as permissible exposure limits (PEL), to detect and classify the associated health consequences.
Recent research has highlighted the health issues related to exposure to organic solvents [3]. The findings revealed that N-methyl-2-pyrrolidone (NMP), a solvent conventionally used to manufacture membranes, has neurotoxic, hepatotoxic, genotoxic, carcinogenic, and mutagenic effects on humans [3], so it is crucial to replace this toxic solvent with less toxic alternatives. Other researchers have reviewed the toxicokinetics, general toxicity, and reproductive toxicity of various frequently used solvents to which humans are exposed primarily through inhalation [4]. These reports suggested that more detailed information is needed to understand the potential toxicity of solvents. However, current research lacks a generalized predictive model for determining the health performance of solvents systematically. Although the concentration of a harmful substance can be measured, the toxic effect of a solvent is known only after it has been applied in a process. Therefore, it is crucial to identify which molecular attributes affect toxicity so that the results can be applied in solvent design. This research aims to bridge the gap between health issues and predictive models by determining the underlying relationship between solvents and health hazards based on their molecular structure.
This paper is divided into five main sections. The first section reviews the literature on solvent toxicity, topological indices, and rough set machine learning (RSML), which led to the identification of the research gap. Section 2 details the proposed methodology for closing this gap. Sections 3 and 4 focus on the results and discussion and highlight the main insights gained. The last section summarizes the work and its key contributions and outlines potential future work.

Toxicity
Aromatic organic solvents account for 35% [5] of industrial utilization and have high solvency [6], forming solutions by dissolving a significant amount of solute. Because of their high vapor pressure and low boiling point, solvents with a boiling point in the range of 50 °C to 260 °C [7] are known as volatile organic compounds (VOCs) and are easily emitted into the atmosphere at room temperature. The vaporized solvents can be readily absorbed by the human body through inhalation, ingestion, and skin contact, thereby affecting human health.
The impairment of health performance caused by exposure to VOCs varies depending on the exposure route, the exposure level, and the type of solvent. Exposure level, which considers both concentration and duration, gives rise to short-term and long-term effects on human health, mainly through inhalation. The short-term health effects are dizziness, headache, nausea, and irritation of the eyes, nose, and throat [8]. The dispersed chemicals attach to the mucous layer of the membrane, leading to irritation and inflammation. In contrast, prolonged exposure to solvents has long-term health effects in terms of mutagenicity, toxicity, and carcinogenicity [8]. Thus, it is necessary to investigate the long-term health effects of solvents in terms of toxicity and carcinogenicity.
In the context of occupational health, solvent exposure through inhalation is significantly more likely, in both quantity and frequency, than exposure through the oral and dermal routes [9]. Depending on the dosage, the degree of toxicity can be ranked on the Hodge and Sterner scale in terms of the lethal concentration 50% (LC50). Table 1 summarizes the toxicity rating of the Hodge and Sterner scale. A substance that causes death at a low dosage of 10 ppm or less is placed in Class 1, the most toxic rating. Based on this rating, organic solvents rated 1 to 4 are considered toxic, and extreme exposure to them must be avoided.
Since the toxicity of organic solvents is related to their chemical structures, structural descriptors, such as Topological Indices (TI), have the potential to form a predictive model for toxicity. TIs are valuable tools in providing unique information about the structure of a molecule and are commonly used in predicting physicochemical properties of molecules.
Table 1. Toxicity rating based on Hodge and Sterner scale [10].

Topological Indices
Topological indices (TIs) are numerical values that characterize the structural features or properties of chemical compounds based on their molecular graphs [11]. The molecular topologies are influenced by the structure of the molecules in dimensions, configuration, bonding, symmetry, and degree of complexity. The most commonly used topological indices are the connectivity index, Wiener index, Randic index, Balaban index, and Zagreb index [12]. The TIs can be computed for any molecule based on their molecular structure. These indices can then be used to develop correlations in the studies of quantitative structure-activity (QSAR)/property (QSPR)/toxicity relationship (QSTR) [11].
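To make the idea of a graph-based index concrete, the following minimal sketch computes the Wiener index, taken here in its usual sense as the sum of shortest-path distances between all pairs of atoms in the hydrogen-suppressed molecular graph. The adjacency dictionaries and the function name `wiener_index` are illustrative assumptions and are not the calculation procedure of the cited references.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path lengths over all unordered atom pairs."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives the topological distance from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adjacency[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both ends

# Hydrogen-suppressed carbon skeletons; the atom labels are arbitrary.
n_butane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
isobutane = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The two isomers share the same formula but give different values, which is the structural sensitivity that makes such indices usable as conditional attributes.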
TIs have been utilized for predicting the toxicity and properties of chemicals [11]. For instance, the valence connectivity index, the Balaban index, and the electropy index were used to predict the underlying relationship between the molecular structure of ethers and their toxicity in mice [13]. In toxicology, molecular topological indices have been applied to study the toxicity of alcohols, pesticides, and ionic liquids. To assess how accurately the TIs describe toxicity, log LC50 was plotted against the TIs, with high accuracy and correlation indicated by a close fit to a quadratic model. In addition, the toxicity of organophosphorus pesticides has been determined using the Randic-Kier-Hall connectivity indices and Topological Charge Indices (TCI) [14]. The indices are entered into the corresponding regression equation to calculate the LD50 value, and the calculated LD50 is then cross-validated against the experimental LD50 to identify the most strongly correlated TIs. These studies have shown that topological indices correlate excellently with LD50. Moreover, topological studies significantly reduce cost and time compared to conducting experiments, making them an efficient approach [11]. However, there exists a research gap in connecting topological indices to a predictive model for assessing human health issues. With the incorporation of RSML, a predictive model can be developed based on the topological indices of organic solvents.

Rough Set Machine Learning (RSML)
Machine learning (ML) is a specialized area of artificial intelligence (AI) that focuses on the automatic learning, analysis, and discovery of data [15]. ML is capable of analyzing massive and fuzzy databases to discover patterns, reveal the underlying structure and relationships, and subsequently develop a robust prediction model. Among the learning methods, supervised learning uncovers the structure underlying the data by learning the relationship between past input-output training data under supervision. Supervised learning algorithms include decision trees, support vector machines (SVM), K-Nearest Neighbors (KNN), RSML, Artificial Neural Networks (ANN), and Bayesian Networks.
In a recent study, individual sets of mathematical tools, including fuzzy sets, rough sets, and soft sets, were combined into a framework named Z-fuzzy soft β-covering-based rough matrices to solve a multiple attributes group decision-making (MAGDM) problem [16]. This combination showed satisfactory results in recruiting the best applicant for an assistant professor position and can be further applied to decision-making problems. It also shows the capability and flexibility of rough sets to be combined with other tools in developing decision rules that are useful for real-world problems. RSML has the advantage of processing data without prior or complete information, whereas SVM approaches require maximum information to solve binary classification problems [17]. Moreover, RSML is preferred over ANN because it does not require human intervention and achieves similar accuracy within a shorter period [18]. Even though ANNs have been utilized for decision-making tasks, such as Hepatitis B prediction and control in medical applications, they face challenges when it comes to uncertain data and uncommon diseases [19]. Although the Frequent Pattern (FP) Growth algorithm is good at identifying patterns and generating association rules between items based on support and confidence, it cannot provide explanations understandable by humans and is less preferred when the data are uncertain and incomplete. Another commonly used ML method is random forest (RF), owing to its high predictive precision and flexibility. Nevertheless, compared to RSML, RF lacks the interpretability needed to understand how a decision is made. This is mainly because RF combines multiple decision trees, which makes it harder to interpret and understand the individual contributions of features.
It is rather challenging to unveil meaningful physical insights through the aforementioned "black-box" ML approaches, as they often disclose relevance rather than cause and effect [20]. The importance of interpretable ML in driving knowledge generation has been emphasized in various research fields. For example, interpretable ML models are important in the electrocatalysis field, where they offer new insights for identifying novel catalytic materials and their mechanisms [20]. Recent advances in interpretable ML for estimating the reactivity properties of solid surfaces, along with the existing challenges, were also critically discussed in a recent contribution [21].
Due to the aforementioned benefits, RSML is capable of identifying and predicting the health performance of solvents, handling incomplete and uncertain data during the selection of important features with excellent data efficiency and high versatility. Furthermore, RSML generates if-then decision rules that relate the conditional attributes of the objects to the decision attribute in a straightforward manner, and these rules can then be applied to establish predictive models. In short, RSML supports data analysis and decision-making for datasets without requiring probability statistics or assumptions [22].
Rough set theory (RST) is a mathematical approach to analyzing imprecise and uncertain data or knowledge [23]. RST performs data classification, feature selection, and knowledge discovery tasks on large datasets. RSML is based on the approximation method, in which indiscernible data are grouped into elementary sets and a concept is bounded by its lower and upper approximations. This is illustrated in Figure 1 [22]. The approximation method can be applied to toxic solvents, incorporating various pieces of information to perform classification based on these boundaries. By using the lower and upper approximations to handle the uncertainty caused by missing data, RSML is capable of handling uncertain and incomplete data. Likewise, RSML selects features by determining reducts, which are the minimal subsets of attributes affecting the decision attribute. This is important in generating meaningful decision rules that help in understanding and interpreting the data.
Rough set theory presents the information in a decision table consisting of objects, conditional attributes, and decision attributes. An example of an information table is presented in Table 2. In the decision table, the object is the element to be analyzed, and the conditional attributes [24] describe the properties and characteristics of the corresponding object. The decision (class) attribute is determined from the conditional input attributes through the generated decision rules [24]. In Table 2, the type of chemical is the object, the conditional attributes comprise the boiling point and the Wiener index, and the toxicity class is the decision attribute. Both kinds of attributes can be quantitative (integer or decimal values) or nominal, covering numerical, categorical, and binary attributes. A structured information table in RSML benefits data classification, analysis, and discovery of the object-attribute relationship.
RSML has been extensively applied for data mining and pattern recognition. For example, RSML has been integrated with CO2 capture and storage (CCS) for carbon management to select potential CO2 storage sites [25]. RSML approaches are capable of predicting the storage sites with high certainty, improving the CCS and negative emissions technologies (NETs) process. Besides storage sites, the RSML approach can also be applied to predict storage depth and geographical location.
Moreover, the properties of pyrolysis bio-oil can be predicted by RSML algorithms from data on pyrolysis temperature and feedstock characteristics [26]. The bio-oil properties of interest are the higher heating value and pH. The identified characteristics of the feedstock samples are related to their carbon, nitrogen, and oxygen content. In addition, another contribution applied RSML to construct a predictive model of the odor of fragrances [27]. RSML relates the odor properties, in the form of topological indices and dilution as conditional attributes, to the odor characteristics. Based on the rules induced, the topological indices Kappa 3 and Kappa 2 dominate the odor characteristics. The fragrance prediction model built through RSML greatly impacts the development of chemical products. Because of its interpretability, RSML was also used to construct predictive models for the estimation of physical and transport properties of polymers, such as glass transition temperature and cohesive energy [28]. The promising rules generated from RSML were then incorporated as property constraints in a computer-aided molecular design (CAMD) model to determine potential polymeric membrane molecular structures for air separation [28].
As the RSML applications above show, rough set theory with feature selection, clustering, and rule induction has promising outcomes in dealing with vague, ambiguous, and fuzzy datasets. As a result, the RSML algorithm is appropriate and applicable for developing an interpretable predictive model of the health performance of organic solvents. RSML efficiently aids in identifying the hidden structure and relationships with minimal resources and time consumption. Hence, this research aims to develop a predictive model of the health performance of organic solvents. The potential conditional attributes for the toxicity of organic solvents to human health focus on the topological indices and physical properties of the solvents.

Methodology
This section explains the steps to identify the underlying structure and develop predictive models for health performance. The main steps include data collection (Step 1), developing a rough set model (Step 2), and verifying the prediction model (Step 3). The proposed methodology is illustrated in Figure 2.
Step 1: Data collection on organic solvents. The first step involves collecting toxicity data for a variety of chemicals, including aromatics, alcohols, ketones, ethers, esters, alkanes, and alkenes. The objects in this research are organic solvents, for which 100 data points were collected from various sources and chemical compositions. Training data account for 70% of the dataset and are used for modeling, whereas the remaining 30% [26] is reserved for validating the generated decision rules to ensure the accuracy of the predictive model. The database of organic solvents used for training and validation can be found in Appendix A, Tables A1 and A2, respectively.
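The 70/30 partition of Step 1 can be sketched as follows. The text does not state how the solvents were assigned to each subset, so the seeded random shuffle, the function name `split_dataset`, and the placeholder solvent names are assumptions made purely for illustration.

```python
import random

def split_dataset(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into training and validation parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Placeholder names standing in for the 100 collected data points.
solvents = [f"solvent_{i:03d}" for i in range(100)]
train, validation = split_dataset(solvents)
print(len(train), len(validation))  # 70 30
```

Keeping the validation subset untouched until Step 3 is what allows the induced rules to be checked on unseen solvents.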
Step 2a: Identify the health indices as the decision attribute. Organic solvents cause various human health issues, such as toxicity and carcinogenicity. Therefore, this step identifies the targeted health issue of toxicity as a decision attribute in quantifying the toxic effects. The toxicity within this study encompasses all the toxic effects in the human body and is not restricted to particular human organs or systems.
Step 2b: Classify the decision attribute. Once the decision attribute has been identified, it has to be classified and well defined so that each object can be assigned to the corresponding class. Depending on the type of expression, the classification can take the form of categories (categories 1, 2, 3), labels (high or low), or binary values (yes/no). Toxicity, as the decision attribute, is classified according to the Hodge and Sterner scale of standard LC50 through inhalation routes and is expressed in the form of categories (categories 1, 2, and 3).
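As a sketch of how an LC50 value could be mapped to a category, the function below uses the simplified three-class thresholds that this work adopts in Table 3 (100 ppm and 100,000 ppm). The function name, the treatment of values lying exactly on a threshold, and the sample LC50 values are assumptions for illustration only.

```python
def simplified_toxicity_rating(lc50_ppm):
    """Map a 4 h inhalation LC50 (ppm) to the simplified rating 1, 2 or 3."""
    if lc50_ppm <= 0:
        raise ValueError("LC50 must be positive")
    if lc50_ppm <= 100:        # extremely to highly toxic (boundary choice assumed)
        return 1
    if lc50_ppm <= 100_000:    # slightly toxic (boundary choice assumed)
        return 2
    return 3                   # non-toxic

# Hypothetical LC50 values, chosen only to exercise each branch.
for lc50 in (50, 5_000, 250_000):
    print(lc50, simplified_toxicity_rating(lc50))
```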
In the predictive model, the toxic substances are classified into a simplified version based on Hodge and Sterner scale for rules induction, as shown in Table 3. For instance, the simplified toxicity rating 1 merges classes 1 and 2 from Hodge and Sterner scale. The toxicity classification for the predictive model of the health performance of solvents is then defined and classified into three main classes.

Table 3. Simplified toxicity rating for the prediction model.
Toxicity Rating | Commonly Used Term | Inhalation LC50 (Exposure of Routes for 4 h), ppm
1 | Extremely to highly toxic | <10-100
2 | Slightly toxic | 100-100,000
3 | Non-toxic | >100,000

Step 2c: Identify the possible conditional attributes. A conditional attribute describes how the objects contribute to the decision attribute, expressed in a range or an exact value. The possible conditional attributes can be linked to the solvent's topological indices and physical properties.
In identifying conditional attributes, several topological indices related to human toxicity are chosen, including the Balaban Index, Wiener Index, and molecular connectivity index, whereas the chosen physical property is the boiling point of the solvent. The chosen topological indices have been supported by literature on the correlations of the Balaban index and valence connectivity index to toxicity [13].
Step 2d: Establish a decision table. The identified conditional attributes are then translated into quantifiable properties through calculations. For example, numerical values of relevant topological indices such as the Balaban index [29], the valence connectivity index [30], and the Wiener index [30] are calculated for each molecule. A decision table serves as the foundation of the RSML model. Table 4 represents the simplified decision table for the organic solvents in the toxicity study. In Table 4, the object is the organic solvent, and the condition attributes comprise the topological indices (Balaban index, valence connectivity index, Wiener index) and the physical property (boiling point). The decision attribute is the toxicity classification, with three main classes based on the simplified toxicity rating in Table 3. In rough set theory (RST), a decision table, also known as an information table, is made up of the universe U (a nonempty set of objects) and a set of attributes A, expressed as S = (U, A). Each attribute a ∈ A takes its values from a value set V_a.
Step 2e: Perform data processing. The collected data must be pre-processed to eliminate redundant and unessential data before simulating the prediction model. In data processing, RST uses the indiscernibility relation, an equivalence relation defined on a subset of condition attributes B ⊆ A [31]: two objects are indiscernible when they share the same values for every attribute in B. In training for rule induction, the objects are grouped by their conditional attributes into indiscernibility classes, and each decision class is then approximated from these groups. The lower and upper approximations are expressed in Equations (1) and (2) [31], respectively. The lower approximation is the set of objects that certainly belong to the class. In contrast, the upper approximation is the set of objects that possibly belong to the class.
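The approximation step can be illustrated with a small sketch. The decision table below is hypothetical and already discretized (it is not data from this study), and the function names are illustrative. Objects with identical values on the chosen attributes form one indiscernibility class; a class that lies wholly inside the target decision class contributes to the lower approximation, and a class that merely overlaps it contributes to the upper approximation.

```python
from collections import defaultdict

# Hypothetical, already-discretized decision table (not data from this study).
table = {
    "s1": {"balaban": "low",  "bp": "high", "toxicity": 1},
    "s2": {"balaban": "low",  "bp": "high", "toxicity": 2},
    "s3": {"balaban": "high", "bp": "high", "toxicity": 1},
    "s4": {"balaban": "high", "bp": "low",  "toxicity": 3},
    "s5": {"balaban": "low",  "bp": "low",  "toxicity": 2},
}

def indiscernibility_classes(table, attributes):
    """Group objects that share the same values on every chosen attribute."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in attributes)].add(obj)
    return list(groups.values())

def approximations(table, attributes, concept):
    """Return the lower and upper approximations of `concept`."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(table, attributes):
        if block <= concept:   # block certainly belongs to the concept
            lower |= block
        if block & concept:    # block possibly belongs to the concept
            upper |= block
    return lower, upper

B = ["balaban", "bp"]
class_1 = {obj for obj, row in table.items() if row["toxicity"] == 1}
lower, upper = approximations(table, B, class_1)
print(sorted(lower), sorted(upper))  # ['s3'] ['s1', 's2', 's3']
print(sorted(upper - lower))         # ['s1', 's2']
```

Here s1 and s2 cannot be told apart by the two attributes yet carry different toxicity classes, so they are the objects that can only possibly, not certainly, be assigned to class 1.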
The difference between the upper and lower approximation is named boundary region, expressed in Equation (3).
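The display equations referenced here did not survive in this text. In the standard notation of rough set theory, and offered as a reconstruction under that assumption rather than as the authors' exact typesetting, Equations (1) to (3) can be written with [x]_B denoting the indiscernibility class of object x under the attribute subset B:

```latex
\begin{align}
\underline{B}(X) &= \{\, x \in U : [x]_B \subseteq X \,\} \tag{1}\\
\overline{B}(X) &= \{\, x \in U : [x]_B \cap X \neq \emptyset \,\} \tag{2}\\
BN_B(X) &= \overline{B}(X) - \underline{B}(X) \tag{3}
\end{align}
```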
where B(X) represents the approximation of the decision concept X, U denotes the dataset, and BN_B represents the boundary of the concept. Furthermore, data pre-processing involves the concepts of reduct and core. In attribute reduction, a reduct identifies exactly those conditional attributes that affect the decision attribute, as shown in Equation (4) [32].
where A′ represents a subset of the attribute set A and X is a decision class. A′ is referred to as a reduct when the dependency of X on A′ is identical to its dependency on A. It is worth noting that multiple reducts can exist for the same dataset. When this occurs, one can select reducts by considering criteria such as mechanistic plausibility or consistency with first principles. The core, in turn, is defined as the intersection of all reducts, that is, of all minimal subsets of attributes, as represented by Equation (5).
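Equations (4) and (5) are likewise missing from this text. One common way to write them, assuming a dependency measure γ (the symbol is not taken from the source) and writing A′ for the candidate subset of A, is the following, where a reduct is additionally understood to be minimal, so that no attribute can be dropped from A′ without changing the dependency:

```latex
\begin{align}
\gamma(A', X) &= \gamma(A, X), \qquad A' \subseteq A \tag{4}\\
\mathrm{Core}(A) &= \bigcap_{i} R_i \tag{5}
\end{align}
```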
where R_i represents the ith reduct set.
Step 2f: Simulate the prediction model using Rough Set Data Explorer2 (ROSE2). The cores and reducts for model development were generated via a software system, Rough Set Data Explorer, version 2 (ROSE2). ROSE2, based on rough set theory, has the core module coded in C++ [33] for the interface modules. ROSE2 performs data pre-processing with core and reduct functions and rule induction based on the Learning from Examples Module (LEM2) algorithm [33]. ROSE2 has a significantly lower 5% error rate than feature selection [34]. The rules generated by LEM2 take the form of "IF-THEN" decision rules. An example of an "IF-THEN" rule is as follows: "(Balaban Index in [34,63)) & (Boiling Point ≥ 104), Decision: 1". This means that if the molecular structure of an organic solvent has a Balaban index from 34 up to, but not including, 63, along with a boiling point greater than or equal to 104 °C, then it is classified as class 1 toxicity.
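The example rule above can be encoded and applied as in the following sketch. This is not ROSE2 or LEM2 code; the dictionary layout, the function names, and the two sample solvents (whose descriptor values are hypothetical) are assumptions for illustration. Each condition is stored as a half-open interval [low, high), matching the bracket notation of the rule.

```python
import math

# One rule: a decision class plus a half-open interval [low, high) per attribute.
EXAMPLE_RULE = {
    "conditions": {
        "balaban_index": (34, 63),
        "boiling_point_c": (104, math.inf),
    },
    "decision": 1,
}

def rule_fires(rule, solvent):
    """True when every condition interval contains the solvent's value."""
    return all(low <= solvent[attr] < high
               for attr, (low, high) in rule["conditions"].items())

def classify(rules, solvent):
    """Return the decision of the first matching rule, or None if none match."""
    for rule in rules:
        if rule_fires(rule, solvent):
            return rule["decision"]
    return None

# Hypothetical descriptor values, not taken from the study's database.
solvent_a = {"balaban_index": 45.0, "boiling_point_c": 110.0}
solvent_b = {"balaban_index": 20.0, "boiling_point_c": 56.0}
print(classify([EXAMPLE_RULE], solvent_a))  # 1
print(classify([EXAMPLE_RULE], solvent_b))  # None
```

A solvent that matches no rule is left unclassified here; how such cases are resolved in practice depends on the rule set and is not specified in the text.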
Step 3a: Perform validation on the developed predictive model. The developed predictive model and its generated rules are validated with the 30% of the collected data that were excluded from the training data. Validation is a vital step in rough set theory for evaluating the model and determining its appropriateness for evaluating the human toxicity of an organic solvent. The validation techniques numerically determine the strength, certainty, and coverage of the models with Equations (6)-(8). High certainty and coverage indicate satisfactory model performance and a well-determined underlying relationship. A model with low coverage and accuracy, however, can be re-modeled by repeating the previous steps.
Strength, σ_x(C, D), refers to the degree to which the objects support the decision rule. Equation (6) expresses it as the number of supporting objects, supp_x(C, D), over the total number of objects (U) in the decision table.
where C and D denote the condition and decision attributes. The certainty, cer_x(C, D), shown in Equation (7), measures the probability that objects satisfying the conditional attributes are classified into the decision attribute. A certainty of 100% (certainty factor cer_x(C, D) = 1) indicates that the generated rule assigns every matching object to the correct decision class, whereas an uncertain rule has a certainty of less than 100%, that is, a certainty factor below 1 (0 < cer_x(C, D) < 1).
In simplified terms, certainty is calculated as the objects that fulfill the decision rule in the particular decision class, σ_x(C, D), divided by all the objects that meet the conditions of the decision rule, σ_x(C).
Besides, coverage, cov_x(C, D), measures the percentage of objects in the corresponding class that fall under the rule.
Based on Equation (8), coverage is calculated as the objects that fulfill the decision rule in the particular decision class, σ_x(C, D), divided by the total number of objects in that decision class, σ_x(D).
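Equations (6) to (8) are not reproduced in this text. Following the verbal definitions above, and assuming that σ_x(C) and σ_x(D) denote the fractions of objects that satisfy the condition part and that belong to the decision class, respectively, they can be written as:

```latex
\begin{align}
\sigma_x(C, D) &= \frac{\mathrm{supp}_x(C, D)}{|U|} \tag{6}\\
\mathrm{cer}_x(C, D) &= \frac{\sigma_x(C, D)}{\sigma_x(C)} \tag{7}\\
\mathrm{cov}_x(C, D) &= \frac{\sigma_x(C, D)}{\sigma_x(D)} \tag{8}
\end{align}
```

Because each σ is a count divided by the same |U|, the certainty and coverage reduce to simple ratios of object counts.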

Cores and Reducts
Five reduct sets were identified among the four conditional attributes: the Balaban index, valence connectivity index, Wiener index, and boiling point. However, no core was identified across the reducts. The absence of a core indicates that no single conditional attribute is common to all the reduct sets. Thus, the results are analyzed based on the conditional attributes in each reduct. The study generated a total of 166 decision rules across the five reduct sets. In reduct 1, 36 decision rules were generated, all determined by the Balaban index and valence connectivity index. Reduct 2 is determined by the valence connectivity index and Wiener index, whereas reduct 3 comprises the Balaban index and boiling point. Furthermore, reduct 4 comprises the valence connectivity index and boiling point. Lastly, reduct 5 consists of the Wiener index and boiling point. A summary of the reduct sets and the number of rules generated for each is shown in Table 5.
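The absence of a core follows directly from the five reducts listed above, since the core is their intersection. The short check below encodes the reducts exactly as described in this section; the variable names are illustrative.

```python
from collections import Counter

# The five reducts as listed in this section.
reducts = {
    1: {"Balaban index", "valence connectivity index"},
    2: {"valence connectivity index", "Wiener index"},
    3: {"Balaban index", "boiling point"},
    4: {"valence connectivity index", "boiling point"},
    5: {"Wiener index", "boiling point"},
}

# Core = attributes shared by every reduct.
core = set.intersection(*reducts.values())
print(core)  # set()

# How many reducts each attribute appears in.
frequency = Counter(attr for attrs in reducts.values() for attr in attrs)
for attr in sorted(frequency):
    print(attr, frequency[attr])
```

No attribute appears in all five reducts: the valence connectivity index and the boiling point each appear in three, and the Balaban index and the Wiener index in two.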

Rule-Based Prediction Models
In the predictive model, five different sets of decision rules, along with their certainty and coverage, were generated. The complete set of decision rules from each reduct is shown in Appendix A, Tables A3-A7. For rule interpretation, rules with good coverage and certainty are normally considered. Table 6 shows some examples of decision rules in reduct 5, and these rules are used to illustrate the explanation. Rule 4 can be expressed as the "IF-THEN" statement: "If the molecular structure of an organic solvent has a Balaban index in the range of 34 to 63 and a boiling point greater than or equal to 104 °C, then the inhalation LC50 falls within the range of 10 to 100 ppm and the organic solvent is classified as extremely to highly toxic to human health." The coverage of rule 4 was 32.26%, as the rule was fulfilled by 10 of the 31 organic solvents in the training data under class 1 toxicity. The certainty of 100% indicates that all 10 organic solvents were classified under the correct decision attribute.
For class 2, the rule is interpreted by the following "IF-THEN" statement: "If the organic solvent has a Balaban index in the range of 17 to 33 and a boiling point between 36 °C and 78 °C, then the organic solvent is classified under class 2 toxicity as slightly toxic, with an inhalation LC50 in the range of 100 to 100,000 ppm." The coverage of rule 22, at 19.23%, is lower than that of the class 1 rule, while its certainty is 100%.
For class 3, the "IF-THEN" statement for the decision rule is: "If the Balaban index of the organic solvent is greater than or equal to 153 and the boiling point is greater than or equal to 199 °C, then the organic solvent is classified under class 3 as relatively harmless to human health." The decision rule has 23.08% coverage and 100% certainty.
As with the remaining reduct sets, the interpretation of each decision rule is governed by its coverage and certainty. A high coverage indicates that a large number of organic solvents in the particular class meet the rule's requirements. Meanwhile, certainty shows how accurately the objects are classified under the correct class. Thus, decision rules with high coverage and certainty yield a robust predictive model for predicting human toxicity from inhalation exposure to organic solvents.
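The figures quoted for rule 4 can be checked with a few lines of arithmetic. In the sketch below, the counts 10 and 31 come from the text; the total of 70 training objects is assumed from the 70% share of the 100 collected data points, and the function name is illustrative.

```python
def rule_metrics(n_support, n_condition, n_class, n_total):
    """Strength, certainty and coverage of one decision rule from object counts.

    n_support   objects matching the conditions AND lying in the decision class
    n_condition objects matching the conditions
    n_class     objects in the decision class
    n_total     objects in the decision table
    """
    strength = n_support / n_total
    certainty = n_support / n_condition
    coverage = n_support / n_class
    return strength, certainty, coverage

# Rule 4: 10 supporting solvents, all correctly classified, 31 solvents in class 1.
# n_total = 70 is an assumption (70% of the 100 collected data points).
strength, certainty, coverage = rule_metrics(10, 10, 31, 70)
print(f"coverage {coverage:.2%}, certainty {certainty:.0%}")
# coverage 32.26%, certainty 100%
```

As a consistency check, the coverages quoted for the class 2 and class 3 rules, 19.23% and 23.08%, match 5/26 and 3/13; these class sizes are inferred from the percentages rather than stated in the text, and 31 + 26 + 13 gives the 70 training objects assumed above.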

Validation of Decision Rules
The 30% of the dataset that was not used in developing the model was used to validate the decision rules. The validation data consist of exactly 30 organic solvents, which were applied to all five reduct sets to validate the coverage and certainty of the generated decision rules. The validation results exhibited high certainty within each class and were promising in showing the underlying relationship between the condition attributes and the decision attribute. Figures 3-7 show the validation results for the five reduct sets. A reduct set is a minimal feature subset that correlates with the decision attribute while retaining the main backbone of the data set. For class 1 toxicity, reduct 3 fulfilled the largest number of decision rules, 9 out of 13, followed by reduct 1, which met 7 out of 15 rules in validation.
In terms of certainty, Figure 5 illustrates that reduct 3 has 5 validated rules with 100% certainty, whereas reduct 1, shown in Figure 3, has only 4 rules with certainty higher than 70%. However, rule no. 10 (R10) in reduct 1 is met by a notable 5 data points with 83.33% certainty. Thus, reduct 1, with high to moderate certainty and the maximum coverage, was chosen for class 1 toxicity. In general, the conditional attributes in reduct 1, namely the Balaban index and the valence connectivity index, affect class 1 toxicity.
For class 2 toxicity, the validation dataset met four decision rules, of which R17 and R24 had 100% certainty. However, only one data point was present in each decision rule. Reduct 5, with the second-highest number of decision rules, was therefore reviewed: the certainty for all three of its rules was comparatively high, with 66.67% certainty from 2 data points in R12 and 100% certainty from 1 data point each in R20 and R22. With a higher number of data points and high certainty, reduct 5 correlates better with class 2 toxicity.
Molecules in class 3 are non-toxic to human health and were determined by reduct 3, with 3 decision rules. The Balaban index and boiling point mainly influence class 3 toxicity.
In summary, each toxicity class as a decision attribute is affected by various conditional attributes as shown in Table 7. The Balaban and valence connectivity indices from reduct 1 affect class 1 toxicity. Moreover, class 2 toxicity was influenced by the Wiener index and boiling point from reduct 5. Lastly, the Balaban index and boiling point from reduct 3 lead to the non-toxicity of class 3.
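The notion of a reduct used in this section can be made concrete with a small brute-force sketch. It assumes the standard rough-set definition: a reduct is a minimal subset of condition attributes that preserves the positive region of the full attribute set, that is, the objects whose decision is unambiguous. The attribute labels follow the A1-A4 notation of this paper, but the decision table is a synthetic toy example, and the exhaustive search is an illustrative choice rather than the procedure used in this study.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]


def positive_region_size(rows: List[Row], attrs: Sequence[str], decision: str) -> int:
    """Count objects whose indiscernibility class under `attrs` has one decision value."""
    blocks: Dict[Tuple[str, ...], List[str]] = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        blocks.setdefault(key, []).append(row[decision])
    return sum(len(d) for d in blocks.values() if len(set(d)) == 1)


def find_reducts(rows: List[Row], condition_attrs: Sequence[str], decision: str):
    """Exhaustively search for minimal attribute subsets preserving the positive region."""
    target = positive_region_size(rows, condition_attrs, decision)
    reducts: List[Tuple[str, ...]] = []
    for size in range(1, len(condition_attrs) + 1):
        for subset in combinations(condition_attrs, size):
            # Skip supersets of an already found reduct: they are not minimal.
            if any(set(r) <= set(subset) for r in reducts):
                continue
            if positive_region_size(rows, subset, decision) == target:
                reducts.append(subset)
    return reducts


# Synthetic, discretized decision table (assumption, not the study's data).
toy_table = [
    {"A1": "high", "A2": "high", "A3": "low", "A4": "low", "class": "1"},
    {"A1": "high", "A2": "low", "A3": "low", "A4": "high", "class": "2"},
    {"A1": "low", "A2": "high", "A3": "high", "A4": "low", "class": "2"},
    {"A1": "low", "A2": "low", "A3": "high", "A4": "high", "class": "3"},
]

reducts = find_reducts(toy_table, ["A1", "A2", "A3", "A4"], "class")
print(reducts)
```

In this toy table no single attribute separates the classes, so every reduct contains two attributes. Exhaustive search grows exponentially with the number of attributes and is only practical for small attribute sets such as this one.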

Discussion
Each toxicity class can be explained by the conditional attributes of its selected reduct set, which makes the results for each class of the decision attribute interpretable.

Class 1 Toxicity
Class 1 toxicity is defined as high to moderate toxicity of organic solvents via inhalation on the LC50 scale. The conditional attributes affecting class 1 toxicity are the Balaban index, denoted by A1, and the valence connectivity index, denoted by A2. In the validated decision rules, both topological indices (A1 and A2) have higher values in class 1 than in the other classes. Table 8 shows the decision rules for classes 1 and 2 in reduct 1. The Balaban index, also known as the averaged distance sum connectivity [35], is referred to as the J index. The J index increases with increasing branching and number of (aromatic) rings. In other words, the J index is determined by a molecule's branching (shape) in a proportional relationship. The valence connectivity index, in turn, is a sum over all bonds in the molecule, in which each bond is weighted according to the valence states of the two atoms it joins. This study focuses on the first-order valence connectivity index. The index measures the degree of connectivity between atoms based on their valence electron counts. The index correlates with the polarity of a molecule through its charge distribution. Polarity, in turn, is linked to the organic solvent's intermolecular forces and boiling point.
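As an illustration of the two indices described above, the following Python sketch computes them on hydrogen-suppressed molecular graphs. It assumes the textbook definitions: the Balaban index J = m/(μ+1) Σ (s_i s_j)^(-1/2) over all bonds, where m is the number of bonds, μ = m − n + 1 is the number of rings, and s_i is the distance sum of atom i; and the first-order valence connectivity index as Σ (δv_i δv_j)^(-1/2) over all bonds, with δv taken as the number of valence electrons minus the number of attached hydrogens. Heteroatom and bond-order corrections to J are ignored, and the function names and example molecules are hypothetical. The descriptor software used in this study may apply different conventions.

```python
from collections import deque
from math import sqrt
from typing import Dict, Hashable, List, Set, FrozenSet

Graph = Dict[Hashable, List[Hashable]]  # atom -> bonded neighbours (no hydrogens)


def distance_sums(graph: Graph) -> Dict[Hashable, int]:
    """Sum of shortest-path distances (in bonds) from each atom to all others."""
    sums = {}
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search on the unweighted graph
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        sums[start] = sum(dist.values())
    return sums


def bonds_of(graph: Graph) -> Set[FrozenSet]:
    """Undirected bonds, each listed once."""
    return {frozenset((a, b)) for a in graph for b in graph[a]}


def balaban_j(graph: Graph) -> float:
    """Balaban J index of a connected graph (no heteroatom corrections)."""
    bonds = bonds_of(graph)
    m, n = len(bonds), len(graph)
    mu = m - n + 1  # cyclomatic number (number of rings)
    s = distance_sums(graph)
    return m / (mu + 1) * sum(1 / sqrt(s[a] * s[b]) for a, b in map(tuple, bonds))


def valence_connectivity_1(graph: Graph, delta_v: Dict[Hashable, int]) -> float:
    """First-order valence connectivity index from per-atom delta-v values."""
    return sum(1 / sqrt(delta_v[a] * delta_v[b]) for a, b in map(tuple, bonds_of(graph)))


# Carbon skeletons of two C4 isomers (illustrative examples).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
j_linear = balaban_j(n_butane)
j_branched = balaban_j(isobutane)

# Ethanol, CH3-CH2-OH: delta-v = valence electrons minus attached hydrogens.
ethanol = {"C1": ["C2"], "C2": ["C1", "O"], "O": ["C2"]}
chi1v_ethanol = valence_connectivity_1(ethanol, {"C1": 1, "C2": 2, "O": 5})

print(round(j_linear, 4), round(j_branched, 4), round(chi1v_ethanol, 4))
```

In this sketch the branched isomer has the larger J value, in line with the statement that the J index increases with branching.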
Solvent lipophilicity increases with molecular weight [36]. Hence, an increase in the Balaban index and aromaticity increases the hydrophobicity of non-polar, lipid-soluble molecules [37]. An organic solvent dispersed in the atmosphere that has a high Balaban index and high lipophilicity tends, once inhaled, to bind and accumulate in hydrophobic regions of the human body, such as lipid-rich tissues. Moreover, a solvent with a high valence connectivity index has a high degree of valence electrons in uniform distribution, resulting in low electronegativity and hence low polarity. Low polarity indicates weak intermolecular forces, which require little energy to overcome. As a result, molecules with low polarity have a low boiling point. Molecules with a low boiling point are highly volatile and tend to vaporize. Therefore, the volatile solvent tends to disperse into the air and accumulate in the human body [35], risking extreme toxicity to human health. In summary, a high Balaban index and a high valence connectivity index of organic solvents can be linked on scientific grounds to high to moderate toxicity to human health via inhalation. Figure 8 summarizes how the topological indices contribute to high toxicity. In short, it can be concluded that organic solvents with high values of the Balaban index and valence connectivity index exhibit high to moderate toxicity to human health when inhaled.

Class 2 Toxicity
Class 2 toxicity is classified as slightly toxic to humans, with an inhalation LC50 (4 h rat exposure) ranging from 1000 to 100,000 ppm. Based on the validated decision rules, the Wiener index, denoted by A3, and the boiling point, a physical property denoted by A4, are the conditional attributes expected to contribute to class 2 toxicity. From the decision rules, class 2 is characterized by a low Wiener index, with the lowest index below 13, and a moderately high boiling point of up to 288 °C.
In topological studies, the Wiener index is the sum of the shortest-path distances, counted in bonds, between every pair of vertices [35]. The Wiener index correlates with molecular properties in QSAR and QSPR. This distance-based index is a good measure of the compactness of a molecule [38], to which it is inversely related. Hence, a molecule's Wiener index relates to its compactness and size. Moreover, boiling point is inversely related to volatility.
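The Wiener index described above can be sketched in a few lines of Python. The sketch assumes the usual definition on the hydrogen-suppressed molecular graph, namely the sum of shortest-path distances over all unordered pairs of atoms; the function name and the example molecules are illustrative and are not taken from the study's data set.

```python
from collections import deque
from typing import Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]  # atom -> bonded neighbours (no hydrogens)


def wiener_index(graph: Graph) -> int:
    """Sum of shortest-path distances (in bonds) over all unordered atom pairs."""
    total = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both of its ends


# Carbon skeletons of two C5 isomers (illustrative examples).
n_pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
neopentane = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}

w_linear = wiener_index(n_pentane)
w_compact = wiener_index(neopentane)
print(w_linear, w_compact)
```

For the same number of carbon atoms, the more compact neopentane skeleton gives the lower Wiener index, which matches the inverse relationship between the index and compactness noted above.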
A slightly toxic organic solvent exhibits a low Wiener index, indicating that its molecules are closely packed, with high compactness. A compact molecule is smaller, with its vertices squeezed into a confined space. Small molecules in particle form are easily inhaled into the lungs, then absorbed and distributed through the bloodstream. At the same time, volatile organic compound (VOC) solvents with moderately high boiling points have stronger intermolecular forces, resulting in only moderate volatility for molecules escaping into the atmosphere. Thus, small particles with moderate volatility lead to slight toxicity. A summary of the relationship between the conditional attributes leading to toxicity is shown in Figure 9.

Figure 9. Summary of the effects of conditional attributes on class 2 toxicity: low Wiener index and moderately high boiling point → larger molecular compactness and stronger intermolecular force → smaller size and moderate volatility → slight toxicity.
Table 9 shows a few decision rules extracted from reduct 5 that result in class 2 toxicity. Rules 13 to 15 show the binary relation of the Wiener index and boiling point in a decision rule. The rules with binary conditional attributes can be stated as "If the Wiener index is lower, then the solvent has a moderate boiling point" and, in the other case, "If the Wiener index is high, the boiling point of the solvent has to be higher for it to remain in class 2 toxicity". On the other hand, a single conditional attribute, either the Wiener index or the boiling point, can also contribute to toxicity. In rule 20, the Wiener index takes a higher value, ranging from 72 to 83, in contributing towards slight toxicity. This higher value is reasonable, given that a lower Wiener index of 10 as a single attribute would lead to a highly toxic classification. Moreover, the single conditional attribute of a boiling point of 154 °C has been identified with class 2 toxicity.
Table 9. Extracted rules on class 2 toxicity from reduct 5.
Therefore, class 2 toxicity is characterized by a low Wiener index and moderately high boiling points of organic solvents. The compactness and intermolecular forces of the molecules allow for easy inhalation into the human body.

Class 3 Toxicity
Organic solvents with low toxicity are classified as class 3, with an inhalation LC50 of 100,000 ppm. In the validation results, reduct 3 shows a high level of certainty for class 3 toxicity. Its conditional attributes, the Balaban index, denoted by A1, and the boiling point, denoted by A4, were interpreted as dominating whether an organic solvent meets the criteria of class 3. A non-toxic organic solvent is predicted to have a low Balaban index and a high boiling point.
For an organic solvent with low toxicity, the Balaban index is low, with less branching and aromaticity, which decreases lipophilicity. Low lipophilicity decreases the tendency of a molecule to be absorbed into body cells and tissues, resulting in low accumulation and hence lower toxicity. In conjunction, a high-boiling-point solvent has stronger intermolecular forces that require more energy to overcome, leading to lower volatility. Therefore, solvents with a less complicated structure, low aromaticity, and low volatility pass from the liquid to the gas phase, and hence interact with the human body, only with relative difficulty, resulting in a non-toxic effect on the human body. An overall relationship between the Balaban index and boiling point is presented in Figure 10.
The validated decision rules in Table 10 support the attribute relationship between a low Balaban index and a high boiling point. In rule 27 and rule 28, the Balaban indices are lower than those in class 1, reduct 1. Similarly, a high boiling point is obtained in reduct 3 compared to class 3, reduct 5, as tabulated in Table 10. Rule 27 is stated as "If an organic solvent has a Balaban index ranging from 2.22 to 3.16 and a boiling point in the range of 199 °C to 284 °C, then the solvent is classified under class 3 toxicity." Moreover, the single conditional attribute of a low Balaban index between 2.67 and 2.755 in rule 31 results in extremely low toxicity. Hence, the validated decision rules for class 3 toxicity highlight the significance of a low Balaban index and a high boiling point in determining the non-toxic effects of organic solvents.
In summary, five reduct sets were determined from the data inputs in the predictive model. The generated decision rules were validated with reasonably good certainty and coverage and can be explained scientifically. When interpreting the results, the decision rules induced by the RSML approach showed the relationship between the molecular structure and the respective toxicity classification. When an organic solvent is classified as extremely toxic (class 1), this is mainly attributed to high Balaban and valence connectivity indices. A solvent classified as slightly toxic (class 2) was found to have a low Wiener index and a moderately high boiling point. Lastly, a low Balaban index and a high boiling point are significant factors that contribute to low toxicity (class 3). Compared to other machine learning approaches, which are normally "black-box models" with limited explainability, this work has successfully revealed the key molecular attributes that lead to distinct classes of toxicity. This understanding is significant for the design of new molecules or products.

Conclusions
This research paper presents the topological indices and physical properties of solvents as conditional attributes in a predictive model of the human toxicity of solvents based on RSML. The impacts of solvents on human health can be estimated from factors such as the boiling point of the solvent and the topological indices, including the Balaban index, the valence connectivity index, and the Wiener index. The predictive model, built from uncertain and ambiguous data, has generated rules that uncover the underlying molecular structure contributing to human toxicity. The Balaban index, valence connectivity index, and Wiener index provide quantitative values for the structural connection to the toxicity of organic solvents.
The proposed RSML-based predictive model offers significant advantages in evaluating the health performance of solvents by discovering the conditional attributes that affect the different classes of toxicity. This is particularly useful in solvent design and screening. However, the research is limited to the assessment of human toxicity and may not account for other health issues caused by solvents. Further research should focus on developing prediction models for other health issues caused by solvents, such as carcinogenicity and mutagenicity. Additionally, future research should enhance the machine learning techniques and use larger, more comprehensive datasets in order to further improve the model's performance. In conclusion, this research successfully demonstrated the potential of using topological indices and physical properties of solvents in a predictive model using machine learning tools for assessing human toxicity.