Knowledge Graph- and Bayesian Network-Based Intelligent Diagnosis of Highway Diseases: A Case Study on Maintenance in Xinjiang
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- “Error! Reference source not found” → Figure 1
- Fig. 1: the text is almost unreadable
- Lines 235-236: what are nnn and mmm? Did you mean n and m ?
- Line 238: Insert a space before “Since..”
- Eq.(5) the αi,j have not been defined. Did you mean ai,j from line 262? What is the meaning of these random boundaries for ui,j ?
- In Eq.(7) there is another α (“hyperparameter”). How do you choose it? Is there any connection with the αi,j values?
- Line 329: mmm → m
- Line 387: If ui,j → 0, then the third box from the right at the line 353 indicates that in this case f(ui,j)=4ui,j=0. How can the logarithm (6) be calculated then?
- Eq.(11): P(Fi) → P(Fi)
- Line 415: Insert a space before “is”
- Figure 7: are you sure the pure textual information should really be presented in a graphics format?
General comment:
- All the equations are numbered twice
Author Response
Comments 1: “Error! Reference source not found ” → Figure 1
We thank the reviewer for pointing out this issue. We have made revisions in the manuscript.
Comments 2: Fig. 1: the text is almost unreadable
We thank the reviewer for pointing out this issue. We have made revisions in the manuscript.
Comments 3: Lines 235-236: what are nnn and mmm? Did you mean n and m ?
We thank the reviewer for pointing out this issue. We have made revisions in the manuscript.
Comments 4: Line 238: Insert a space before “Since..”
We thank the reviewer for pointing out this issue. We have made revisions in the manuscript.
Comments 5: Eq.(5) the αi,j have not been defined. Did you mean ai,j from line 262? What is the meaning of these random boundaries for ui,j ?
We thank the reviewer for pointing out this issue. Indeed, the symbol αi,j in Eq. (5) is a typographical error and should be replaced with ai,j to maintain consistency with the notation used in line 260. We have corrected this in the revised manuscript.
Regarding the random boundaries for ui,j, the term in question is a random variable uniformly distributed between 0 and 1, which is introduced to account for the uncertainty when expert judgments are not definitive. The expert-assigned value reflects the expert's belief in the likelihood of a causal connection between a disease node and a phenomenon node. The random variable is compared with this expert-assigned value to determine the prior model. Specifically, if the random variable is greater than or equal to the expert-assigned value, the connection between the two nodes is considered to be absent (= 0); otherwise, the connection is assumed to exist (= 1). This introduces a stochastic element to the prior model, which reflects the inherent uncertainty in expert judgments.
We have revised the manuscript to include a clearer explanation of this process and the role of the random variable.
Changes in the manuscript:
Corrected "αi,j" to "ai,j" in Eq. (5).
Added clarification in the manuscript regarding the role of the random variable in determining the prior model.
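The thresholding rule described in the response to Comment 5 can be sketched as follows. This is only an illustration under stated assumptions, not the manuscript's implementation: the names `belief` and `sample_prior_structure` are hypothetical, and the belief values are made up for the example.

```python
import random

def sample_prior_structure(belief, rng=None):
    """Sample a 0/1 disease-to-phenomenon matrix from expert beliefs.

    belief[i][j] (hypothetical name) is the expert's belief in [0, 1]
    that disease i is causally connected to phenomenon j.
    """
    rng = rng or random.Random()
    prior = []
    for row in belief:
        sampled_row = []
        for b in row:
            draw = rng.random()  # uniform draw in [0, 1)
            # Draw >= belief: connection absent (0); otherwise present (1).
            sampled_row.append(0 if draw >= b else 1)
        prior.append(sampled_row)
    return prior

# Made-up beliefs: 0.0 never yields an edge, 1.0 always does.
belief = [[0.0, 1.0], [0.5, 0.9]]
prior = sample_prior_structure(belief, random.Random(0))
```

Under this rule each connection appears with probability equal to the expert's belief, so a definitive judgment (0 or 1) is always respected while intermediate beliefs produce varied prior structures.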
Comments 6: In Eq.(7) there is another α (“hyperparameter”). How do you choose it? Is there any connection with the αi,j values?
α in Eq. (7) is a global hyperparameter used to control the strength of the prior belief about the network structure. By contrast, the pair-specific term in Eq. (5) is a random variable that reflects the relationship between specific pairs of disease and phenomenon nodes in the Bayesian network. It is specific to each pair and is part of the model's parameters.
The hyperparameter α influences the overall network structure, while the pair-specific term in Eq. (5) influences the connection probabilities between specific nodes. The two play distinct roles in the Bayesian network framework.
Comments 7: Line 329: mmm → m
We thank the reviewer for pointing out this issue. We have made revisions in the manuscript.
Comments 8: Line 387: If ui,j → 0, then the third box from the right at the line 353 indicates that in this case f(ui,j)=4ui,j=0. How can the logarithm (6) be calculated then?
Thank you for your insightful comment. You are correct in pointing out that when ui,j → 0, f(ui,j) = 4ui,j = 0, which could cause the logarithmic term in the hybrid scoring function (Equation 6) to become undefined. To address this issue, we have added a clarification and modification in lines 326-329 of the manuscript.
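The specific modification made in the manuscript is not reproduced in this response. Purely as an illustration of the issue, one common guard is to clamp the argument of the logarithm to a small positive floor. The names `safe_log` and `EPS` and the floor value are assumptions for this sketch, not the authors' stated fix.

```python
import math

EPS = 1e-12  # hypothetical floor; the manuscript's actual choice is not stated here

def safe_log(x, eps=EPS):
    """Return log(max(x, eps)) so the result stays finite when x is 0."""
    return math.log(max(x, eps))

u = 0.0
value = safe_log(4 * u)  # f(u) = 4u is 0 here; plain math.log(0.0) would raise ValueError
```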
Comments 9: Eq.(11): P(Fi) → P(Fi)
We thank the reviewer for pointing out this issue. We have made revisions in the manuscript.
Comments 10: Line 415: Insert a space before “is”
We thank the reviewer for pointing out this issue. We have made revisions in the manuscript.
Comments 11: Figure 7: are you sure the pure textual information should really be presented in a graphics format?
Comments 12: All the equations are numbered twice
We thank the reviewer for pointing out this issue. We have made revisions in the manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
Comments:
(1) The knowledge graph construction process in Section 3.2 is relatively vague. This part should be explained in detail. For example, for road data sets, what entities include and what attributes include.
(2) Is the fitness function of genetic algorithm the performance of the model on the overall training set? If so, will it lead to the problem of overfitting?
(3) In line 190, it is stated, "evolve the Bayesian network within a sample training dataset derived from highway maintenance records"; it should be clarified whether this refers to the entire training dataset. If it is the case, why not use a test set to evaluate the model's capabilities?
(4) The paper preferably includes an overall network structure diagram, which is convenient for readers to read and understand.
(5) There are formatting errors in the text, such as on line 166.
Author Response
Comments 1: The knowledge graph construction process in Section 3.2 is relatively vague. This part should be explained in detail. For example, for road data sets, what entities include and what attributes include.
Thank you for your valuable feedback on the knowledge graph construction process in Section 3.2. Based on your suggestion, we have revised and expanded the explanation to provide a more detailed description of the knowledge graph construction process, particularly in terms of the entities and attributes involved in the road datasets. Below is a summary of the modifications we have made:
Entities: We have clearly defined the key entities involved in the knowledge graph, including but not limited to: road types, highway diseases, geographical locations, disease occurrence time, and maintenance methods.
Attributes: We have also provided a detailed description of the attributes for each entity. For example, the attributes of the "Road Type" entity include "Material" and "Design Life," while the "Highway Disease" entity includes attributes such as "Severity" and "Cause."
Dataset Description: We have clarified the number of entities and attributes in the dataset and provided a detailed description of the dataset's composition, ensuring the reviewer has a clear understanding of the data used in the knowledge graph.
Specific changes have been made in the manuscript.
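To make the entity and attribute description above concrete, the following minimal sketch shows one way such records could be represented. The `Entity` class, the placeholder values, and the relation name `occurs_on` are hypothetical; only the entity types and attribute names come from the response.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_type: str
    name: str
    attributes: dict = field(default_factory=dict)

# Attribute names follow the response; the values are placeholders.
road = Entity("Road Type", "example road type",
              {"Material": "<material>", "Design Life": "<years>"})
disease = Entity("Highway Disease", "example disease",
                 {"Severity": "<severity>", "Cause": "<cause>"})

# (head, relation, tail) triple; "occurs_on" is a hypothetical relation name.
triples = [(disease.name, "occurs_on", road.name)]
```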
Comments 2: Is the fitness function of genetic algorithm the performance of the model on the overall training set? If so, will it lead to the problem of overfitting?
We sincerely thank the reviewer for their thoughtful comments and suggestions. Regarding the question raised about the fitness function of the genetic algorithm and the potential issue of overfitting, we would like to clarify the following:
Fitness Function Definition: The fitness function used in our genetic algorithm is indeed based on the model's performance on the overall training set. Specifically, it evaluates the accuracy of the Bayesian network model in diagnosing highway diseases, considering both the structure of the network and its performance on the training data.
Overfitting Concern: We acknowledge that using training set performance as the sole criterion for fitness can lead to overfitting, especially in cases where the model complexity increases. To mitigate this risk, we employed cross-validation techniques during the training phase. This allowed us to assess the model’s generalization capability by testing it on multiple subsets of the data, rather than relying solely on the performance of the model on the training set.
Additionally, to prevent excessive complexity and reduce the possibility of overfitting, we incorporated constraints during the evolutionary process of the genetic algorithm. These constraints limited the complexity of the Bayesian network by controlling the number of nodes and connections, which helped in maintaining a balance between model accuracy and complexity.
Modifications Made: To address the reviewer’s concern, we have updated the manuscript in the following sections:
Section 4.2 (Experimental results): We added a detailed explanation about the fitness function of the genetic algorithm and the steps taken to prevent overfitting, including the use of cross-validation and complexity constraints during model optimization.
Section 5 (Discussion): We included a more thorough discussion on the potential issue of overfitting, the preventive measures implemented, and the role of cross-validation and model complexity constraints in ensuring robust model performance.
Comments 3: In line 190, it is stated, "evolve the Bayesian network within a sample training dataset derived from highway maintenance records"; it should be clarified whether this refers to the entire training dataset. If it is the case, why not use a test set to evaluate the model's capabilities?
Thank you for your insightful comment. You are correct to point out that the statement in line 190, "evolve the Bayesian network within a sample training dataset derived from highway maintenance records," may lead to confusion regarding the use of the training dataset and the test set.
In the current methodology, the training dataset is indeed used for evolving the Bayesian network, but this dataset represents only a portion of the available highway maintenance records. The model is subsequently evaluated on a separate test dataset to assess its generalization capability and performance on unseen data.
To clarify, we will revise the text to explicitly state that the evolution of the Bayesian network occurs using a training set (which is a subset of the full dataset), and that its performance is evaluated on a separate test set. We will also add a note explaining that the use of a separate test set is crucial for evaluating the model’s generalization capabilities.
Original Text: In the next phase, genetic algorithms, guided by expert diagnostic schemes, are used to evolve the Bayesian network within a sample training dataset derived from highway maintenance records.
Revised Text: In the next phase, genetic algorithms, guided by expert diagnostic schemes, are used to evolve the Bayesian network using a training dataset, which is a subset of the complete highway maintenance records. The model’s capabilities are then evaluated using a separate test dataset to assess its generalization ability and performance on unseen data.
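The separation between training and test data described in this response can be sketched as below. The 80/20 ratio, the fixed seed, and the name `split_records` are assumptions for illustration; the response does not state the actual split used.

```python
import random

def split_records(records, test_fraction=0.2, seed=0):
    """Shuffle records and hold out a fraction as an unseen test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled records are held out; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))  # stand-in for maintenance records
train, test = split_records(records)
```

Only `train` would be passed to the structure search; `test` is touched once, after the network is fixed, to estimate performance on unseen data.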
Comments 4: The paper preferably includes an overall network structure diagram, which is convenient for readers to read and understand.
We thank the reviewer for pointing out this issue. We have made revisions in the manuscript. A diagram describing the structure of the technical route has been added in Section 3.1.
Comments 5: There are formatting errors in the text, such as on line 166.
We thank the reviewer for pointing out this issue. We have checked the full text and corrected any formatting errors.
Reviewer 3 Report
Comments and Suggestions for Authors
1. The paper title may be modified to be more pertinent to the main advancement of the current study; for example, "knowledge graph" and "heuristic" could be added to the original title.
2. Regarding Figure 2, the edges are not shown, which reduces readability. Additionally, Line 203 states that 'node A is the parent node and connects to child node A through a directed edge'. A is both the parent and the child node; please check. It is recommended that highway diseases be used to illustrate the Bayesian network.
3. The quality of many figures should be improved, for example, Fig. 1, 2, 4, 8, and 9.
4. In Equation (7), some variables or functions should be described explicitly after the equation.
5. The genetic algorithm is used to optimize the model structure (i.e., the matrices rij). Please show which parameters you used to implement the GA, for example, the population size and the crossover and mutation coefficients, and how they influence the GA performance.
6. The proposed method outperforms the traditional method when comparing the DET and DIA in Table 4, where the results from the proposed method are somewhat promising compared to the traditional method. Please show the times in order to compare the efficiency of the methods.
Author Response
Comments 1: The paper title may be modified to be more pertinent to the main advancement of the current study; for example, "knowledge graph" and "heuristic" could be added to the original title.
We sincerely thank the reviewer for their thoughtful comments and suggestions. "Knowledge Graph" has been added to the title. Since "Intelligent" arguably already conveys the meaning of "heuristic", that word has not been added.
Comments 2: Regarding Figure 2, the edges are not shown, which reduces readability. Additionally, Line 203 states that 'node A is the parent node and connects to child node A through a directed edge'. A is both the parent and the child node; please check. It is recommended that highway diseases be used to illustrate the Bayesian network.
We sincerely thank the reviewer for their thoughtful comments and suggestions. We have made changes in the manuscript.
Comments 3: The quality of many figures should be improved, for example, Fig. 1, 2, 4, 8, and 9.
Thank you for your insightful comment. We have made changes in the manuscript.
Comments 4: In Equation (7), some variables or functions should be described explicitly after the equation.
We thank the reviewer for pointing out this issue. We have made revisions in the manuscript. Some variables were missing from the description; we have completed the description to make the equation easier to understand.
Comments 5: The genetic algorithm is used to optimize the model structure (i.e., the matrices rij). Please show which parameters you used to implement the GA, for example, the population size and the crossover and mutation coefficients, and how they influence the GA performance.
We sincerely thank the reviewer for their valuable feedback and for raising this important point. In response to the comment, we would like to clarify that the genetic algorithm (GA) parameters used for optimizing the Bayesian network structure (i.e., the matrices rij) were selected through a set of experimental trials to balance computational efficiency and optimization performance.
Although we did not include the specific details of the GA parameters in the main text, we acknowledge the importance of providing this information for a clearer understanding of our approach. Therefore, we have decided to include the following details in the revised manuscript:
- Population Size: 100
- Crossover Rate: 0.8
- Mutation Rate: 0.02
- Number of Generations: 200
- Selection Method: Tournament Selection
Explanation of the Parameter Choices:
- Population Size (100): This population size was chosen to provide a reasonable balance between exploration and computational efficiency. A larger population would increase computational costs, while a smaller one could lead to insufficient diversity in the search space.
- Crossover Rate (0.8): A high crossover rate was chosen to enhance exploration of the solution space. This ensures that the algorithm generates new, diverse solutions by combining parts of high-performing individuals.
- Mutation Rate (0.02): A low mutation rate was used to maintain stability in the population, with occasional mutations helping to avoid local optima and introduce beneficial variations.
- Number of Generations (200): This number of generations was determined to be sufficient for convergence, allowing the GA to explore potential solutions effectively without overfitting or excessive computation time.
- Selection Method (Tournament Selection): Tournament selection was employed to choose individuals for reproduction, as it maintains diversity in the population while ensuring that high-performing individuals have a higher chance of being selected.
These parameters were fine-tuned based on preliminary experiments, and their performance was validated through empirical testing. We did not explicitly include these details in the original manuscript to avoid excessive technical detail, but we appreciate the reviewer’s suggestion to clarify them. This will be addressed in the revised manuscript.
We hope that this additional explanation helps clarify the choices made regarding the GA parameters, and we trust it will meet the reviewer's expectations.
We have included this section in Section 4.2 of the manuscript.
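A minimal sketch of a GA loop using the parameter values listed above is given here for orientation. Everything beyond those five settings is an assumption: the tournament size, one-point crossover, bit-flip mutation, the bit-string encoding of the structure matrix, and the placeholder fitness (`sum`) that stands in for the manuscript's scoring function.

```python
import random

POPULATION_SIZE = 100
CROSSOVER_RATE = 0.8
MUTATION_RATE = 0.02
NUM_GENERATIONS = 200
TOURNAMENT_SIZE = 3  # assumption: not stated in the response

def tournament_select(population, fitness, rng):
    """Pick the fittest of a few randomly drawn individuals."""
    contenders = rng.sample(population, TOURNAMENT_SIZE)
    return max(contenders, key=fitness)

def crossover(a, b, rng):
    """One-point crossover applied with probability CROSSOVER_RATE."""
    if rng.random() < CROSSOVER_RATE and len(a) > 1:
        point = rng.randrange(1, len(a))
        return a[:point] + b[point:]
    return a[:]

def mutate(bits, rng):
    """Flip each bit independently with probability MUTATION_RATE."""
    return [1 - bit if rng.random() < MUTATION_RATE else bit for bit in bits]

def run_ga(fitness, n_bits, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(POPULATION_SIZE)]
    for _ in range(NUM_GENERATIONS):
        population = [
            mutate(crossover(tournament_select(population, fitness, rng),
                             tournament_select(population, fitness, rng),
                             rng), rng)
            for _ in range(POPULATION_SIZE)
        ]
    return max(population, key=fitness)

# Placeholder fitness: count of 1-bits in a flattened 0/1 structure matrix.
best = run_ga(fitness=sum, n_bits=12)
```

In an actual run, the placeholder fitness would be replaced by the network score evaluated on the training data, and each bit string would be reshaped into the structure matrix before scoring.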
Comments 6: The proposed method outperforms the traditional method when comparing the DET and DIA in Table 4, where the results from the proposed method are somewhat promising compared to the traditional method. Please show the times in order to compare the efficiency of the methods.
We have added the total number of diseases in the records to the manuscript to indicate the specific number of cases for which the new method performs better.