Identifying Factors that Influence the Patterns of Road Crashes Using Association Rules: A Case Study from Wisconsin, United States
Round 1
Reviewer 1 Report
The paper is very interesting, but it is neither supported by a good literature review nor by a correct methodology.
1) An appropriate state of the art is missing. The authors describe the use of data mining in general terms, without describing how data mining theory is already applied to these problems. For example, I recommend that you read the following articles:
a. Mannering, F. L., & Bhat, C. R. (2014). Analytic methods in accident research: Methodological frontier and future directions. Analytic Methods in Accident Research, 1, 1–22.
b. Pande, A., & Abdel-Aty, M. (2009). Market basket analysis of crash data from large jurisdictions and its potential as a decision support tool. Safety science, 47(1), 145-154.
c. Prati, G., Pietrantoni, L., Fraboni, F. (2017). Using data mining techniques to predict the severity of bicycle crashes. Accident Analysis and Prevention, 101, 44-54.
2) The data structure is not clear.
a. Authors should specify which variables are present in the data set (e.g. Montella, A., Andreassen, D., Tarko, A., Turner, S., Mauriello, F., Imbriani, L., & Romero, M. (2013). Crash databases in Australasia, the European Union, and the United States: review and prospects for improvement. Transportation Research Record, vol. 2386, pp. 128-136.), then describe how these have been reworked.
b. Authors must insert a table with summary statistics of incidents.
3) The soundness of the results is not examined. Moreover, this large sample size would make it possible to split the sample into two sub-samples, one of which would be used to evaluate the results.
In any case, the authors should give the reader some elements for judging the robustness of the findings (even though the aim is not to develop a predictive model, since the approach described is explanatory and exploratory). As mentioned by G. I. Webb, "Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-I error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data." (G. I. Webb, 2007, Discovering significant patterns, Machine Learning 68, 1-33). In my opinion, it would be absolutely necessary to split the sample into one learning sample for generating the association rules and one test sample for evaluating the statistical significance of the rules obtained (or to apply any other appropriate technique to take the problem of statistical significance into account: see the article by Webb).
4) The results and conclusions sections could be improved by adding more references to other papers supporting these findings.
Author Response
To: Referee
The authors are extremely grateful to the anonymous referee for providing excellent comments and valuable advice on this paper. We have revised the paper based on the referee’s comments. The following statements compare the referee’s comments with the authors’ revisions. We kindly ask the referee to review the revised paper. Thank you. Your prompt attention to this paper will be much appreciated.
Yours sincerely
The authors
Response to Reviewer 1 Comments
Point 1: An appropriate state of the art is missing. The authors describe the use of data mining in general terms, without describing how data mining theory is already applied to these problems. For example, I recommend that you read the following articles:
a. Mannering, F. L., & Bhat, C. R. (2014). Analytic methods in accident research: Methodological frontier and future directions. Analytic Methods in Accident Research, 1, 1–22.
b. Pande, A., & Abdel-Aty, M. (2009). Market basket analysis of crash data from large jurisdictions and its potential as a decision support tool. Safety science, 47(1), 145-154.
c. Prati, G., Pietrantoni, L., Fraboni, F. (2017). Using data mining techniques to predict the severity of bicycle crashes. Accident Analysis and Prevention, 101, 44-54.
Response 1: Thank you very much for your comments. We have carefully read the three articles listed in your comments, and they played an important role in improving our manuscript. In addition, we have read and referred to several other related articles. On this basis, we added a detailed overview (page 2, lines 52-56) to explain how data mining theory has been applied in this research area.
Point 2: The data structure is not clear.
a. Authors should specify which variables are present in the data set (e.g. Montella, A., Andreassen, D., Tarko, A., Turner, S., Mauriello, F., Imbriani, L., & Romero, M. (2013). Crash databases in Australasia, the European Union, and the United States: review and prospects for improvement. Transportation Research Record, vol. 2386, pp. 128-136.), then describe how these have been reworked.
b. Authors must insert a table with summary statistics of incidents.
Response 2:
a. Thank you for your comments. The original crash data set contains 29 attributes per accident. In this research, 15 valid attributes were selected through correlation analysis among the attribute variables. Unselected attribute variables such as DOCTNMBR (i.e. the pre-printed number of a crash), NTFYHOUR (i.e. the one-hour range in which the enforcement agency was notified of the crash, listed in military time), and ONHWY (i.e. the name of the highway on which the crash took place) are not direct impact factors of accidents and have no direct impact on the mining of association rules for accident causes.
b. Thank you for pointing out our weaknesses. Your comments have played a great role in improving our manuscript. Following your comments, we have redesigned Table 3 and added the percentage of each attribute variable in the fifth row of the table. This statistical summary of the data improves the readability of the article.
Point 3: The soundness of the results is not examined. Moreover, this large sample size would make it possible to split the sample into two sub-samples, one of which would be used to evaluate the results. In any case, the authors should give the reader some elements for judging the robustness of the findings (even though the aim is not to develop a predictive model, since the approach described is explanatory and exploratory). As mentioned by G. I. Webb, "Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-I error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data." (G. I. Webb, 2007, Discovering significant patterns, Machine Learning 68, 1-33). In my opinion, it would be absolutely necessary to split the sample into one learning sample for generating the association rules and one test sample for evaluating the statistical significance of the rules obtained (or to apply any other appropriate technique to take the problem of statistical significance into account: see the article by Webb).
Response 3: Thank you very much for your insightful comments, which pointed out the shortcomings of this article and gave us direction for improving our manuscript. Following your comments, we have applied the Bonferroni correction, a multiple hypothesis testing approach, to avoid the extreme risk of type-I error. Moreover, the results of the multiple hypothesis tests indicate that the number of false discoveries is extremely low because the support, confidence, and lift thresholds already do an excellent job of pruning out most rules that are not statistically significant (pages 7-8, lines 166-180).
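As a minimal sketch of the Bonferroni correction mentioned in this response: each rule's p-value is compared with the global significance level divided by the number of rules tested. The function name and the p-values below are hypothetical illustrations, not values from the manuscript, and the statistical test that produces the p-values is left unspecified.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected at alpha / n."""
    n = len(p_values)
    if n == 0:
        return []
    threshold = alpha / n  # adjusted per-test significance level
    return [p <= threshold for p in p_values]


# Hypothetical p-values for four mined rules (illustrative numbers only).
p_values = [0.0001, 0.004, 0.03, 0.2]
decisions = bonferroni_reject(p_values)
print(decisions)
```

With four tests and a global level of 0.05, the per-test threshold is 0.0125, so only the first two hypothetical rules would be retained.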
Point 4: The results and conclusions sections could be improved by adding more references to other papers supporting these findings.
Response 4: Thank you for your valuable and thoughtful comments. Following the comment, we have rewritten the results and conclusions and added more detailed discussion of the results in relation to other research in section 4 (page 9, lines 193, 201, page 10, lines 221, 233, 238, and page 12, line 273). As mentioned in section 5 (page 13, lines 296-300), the present study did not optimize the parameters with any optimization method, because it obtained objective and significant results with the current size of the database. In future work, genetic algorithms and particle swarm optimization could be combined with the Apriori algorithm to optimize the parameter values and to obtain significant results efficiently when analysing large databases.
Reviewer 2 Report
The paper is interesting. Unfortunately, several issues limit its quality. The main drawback is the language - the text contains a high number of atypical expressions, typos, etc. For example:
- in the title: "study case" (should probably be "case study")
- in the abstract: "significant associated" (significantly), "the data for this study drives" (comes)
- in the introduction: "complexity nature" (complex nature), "measures makers" (decision makers)
Please consult with a native speaker and correct the text. Focus also on singular/plural forms (data is/are).
Next, in the discussion, you should compare your findings with other studies (name some examples, do not just summarize that the results are consistent) and address potential limitations and further progress (which you list in the last paragraph of the conclusions).
Last but not least - you mention that your findings are useful for decision makers. Please explain how they can be applied - name some specific examples.
Minor issue - in the references, you use several forms of author names and surnames. Please be consistent, according to the journal guidelines.
Author Response
Response to Reviewer 2 Comments
Point 1: The main drawback is the language - the text contains a high number of atypical expressions, typos, etc. For example: 1) in the title: "study case" (should probably be "case study"); 2) in the abstract: "significant associated" (significantly), "the data for this study drives" (comes); 3) in the introduction: "complexity nature" (complex nature), "measures makers" (decision makers). Please consult with a native speaker and correct the text. Focus also on singular/plural forms (data is/are).
Response 1: Thank you very much for pointing out the atypical expressions, typos, sentence structure, and grammatical issues in our manuscript. Following the comments, we have modified the title to: Identify factors that influence the patterns of road-crashes by using association rules: A case study from Wisconsin, United States. Moreover, we have changed ‘significant’ to ‘significantly’ (page 1, line 13) and ‘drives’ to ‘comes’ (page 1, line 19) in the abstract. We have also changed ‘complexity nature’ to ‘complex nature’ (page 1, line 34) and ‘measures makers’ to ‘decision makers’ (page 1, line 39) in the introduction. In addition, we have checked singular/plural forms throughout our manuscript. Finally, we have rechecked and revised similar grammatical errors throughout our manuscript.
Point 2: In the discussion, you should compare your findings with other studies (name some examples, do not just summarize that the results are consistent) and address potential limitations and further progress (which you list in the last paragraph of the conclusions).
Response 2: Thank you for your valuable and thoughtful comments. Following the comment, we have revised the results and conclusions and added more detailed discussion of the results in relation to other research in section 4 (page 9, lines 193, 201, page 10, lines 221, 233, 238, and page 12, line 273). As mentioned in section 5 (page 13, lines 296-300), the present study did not optimize the parameters with any optimization method, because it obtained objective and significant results with the current size of the database. In future work, genetic algorithms and particle swarm optimization could be combined with the Apriori algorithm to optimize the parameter values and to obtain significant results efficiently when analysing large databases.
Point 3: You mention that your findings are useful for decision makers. Please explain how they can be applied - name some specific examples.
Response 3: Thank you for your valuable and thoughtful comments. Following the comments, we have added some specific examples to explain how decision makers can apply the useful association rules. As mentioned in section 4 of our manuscript, according to support association rules 11, 12, and 17, decision makers can reduce the occurrence of crashes by setting up physical separation on crash-prone sections (page 9, lines 204-206). According to support association rules 5 and 16 and confidence association rule 17, decision makers can strengthen traffic safety education for drivers aged 16-25 to reduce the occurrence of traffic crashes (page 10, lines 226-227). Moreover, according to confidence association rules 7, 13, 14, and 15, an appropriate organization of intersection flow might help decision makers control the occurrence of crashes effectively (page 10, lines 232-234).
Point 4: In the references, you use several forms of author names and surnames. Please be consistent, according to the journal guidelines.
Response 4: Thank you for your valuable and thoughtful comments. Following the comment, we have rechecked and revised the format of the references in the manuscript, including using a unified form for author names and surnames.
Reviewer 3 Report
The study presents a different analysis perspective on traffic crash data. However, the paper should be improved to be clear and useful to readers. Therefore, the following improvements are suggested:
- add a literature review of similar studies, or at least of similar traffic crash studies that use data mining techniques
- better explain the advantages of association rules compared to other data mining techniques (e.g. ANN), not only in terms of a priori assumptions but also in terms of results. In other words, what kind of information can we get with this method that could not be obtained using other methods?
- better clarify, in the context of traffic crash analysis, the three top-20 analyses. What do association rules, confidence, support, and lift value represent in the context of the analysed data?
- explain the threshold used to select the top 20. What supported this decision?
Finally, the English language should be improved, and a homogenization of the terms is suggested (crash versus accident and so on).
Author Response
Response to Reviewer 3 Comments
Point 1: Add a literature review of similar studies, or at least of similar traffic crash studies that use data mining techniques.
Response 1: Thank you for your comments. Following the comments, we have added some new literature in section 1 (page 2, lines 53-56). Moreover, some discussion of these references has been added to the introduction of our manuscript.
Point 2: Better explain the advantages of association rules compared to other data mining techniques (e.g. ANN), not only in terms of a priori assumptions but also in terms of results. In other words, what kind of information can we get with this method that could not be obtained using other methods?
Response 2: Thank you for your comment. Following the comment, we have applied an ANN algorithm to mine association rules from the same crash data set. A total of 712 association rules were obtained under the same parameter thresholds (minimum support 0.1, minimum confidence 0.4). This shows that the method yields fewer effective association rules than the Apriori algorithm in our research. Moreover, more than 11 association rules fail to pass the multiple hypothesis tests. This indicates that the association rules mined by the Apriori algorithm are more credible than those mined by the ANN algorithm.
Point 3: Better clarify, in the context of traffic crash analysis, the three top-20 analyses. What do association rules, confidence, support, and lift value represent in the context of the analysed data?
Response 3: Thank you for your comment. Following the comment, we have added further explanation and discussion of the association rules, confidence, support, and lift value in the context of traffic crash analysis (page 8, lines 181-186).
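To make the three measures concrete, here is a minimal sketch using the standard definitions of support, confidence, and lift for a rule "antecedent implies consequent". The toy crash records and attribute names are hypothetical and are not taken from the Wisconsin data set.

```python
# Hypothetical toy records: each crash is a set of attribute=value items.
crashes = [
    frozenset({"weather=clear", "age=16-25", "severity=injury"}),
    frozenset({"weather=clear", "age=16-25", "severity=property"}),
    frozenset({"weather=snow", "age=26-45", "severity=injury"}),
    frozenset({"weather=clear", "age=26-45", "severity=property"}),
]


def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    return sum(itemset <= r for r in records) / len(records)


def confidence(antecedent, consequent, records):
    """Share of records with the antecedent that also contain the consequent."""
    return support(antecedent | consequent, records) / support(antecedent, records)


def lift(antecedent, consequent, records):
    """Confidence relative to the consequent's baseline frequency."""
    return confidence(antecedent, consequent, records) / support(consequent, records)


rule_lhs = frozenset({"age=16-25"})
rule_rhs = frozenset({"severity=injury"})
print(support(rule_lhs | rule_rhs, crashes))   # share of all crashes matching the rule
print(confidence(rule_lhs, rule_rhs, crashes)) # injury share among drivers aged 16-25
print(lift(rule_lhs, rule_rhs, crashes))       # 1.0 means no association
```

In this toy data the rule has a lift of exactly 1, meaning that injury crashes are no more frequent among the 16-25 group than overall; a lift above 1 would indicate a positive association.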
Point 4: Explain the threshold used to select the top 20. What supported this decision?
Response 4: More than 700 valid association rules were obtained by the Apriori algorithm in this research. We chose the top 20 association rules by support and by confidence for discussion because there are too many valid rules, and useful rules can be found by choosing those with high support and confidence. A similar treatment has been adopted in related research, for example: ‘Das, S.; Dutta, A.; Jalayer, M.; Bibeka, A.; Wu, L. Factors influencing the patterns of wrong-way driving crashes on freeway exit ramps and median crossovers: Exploration using ‘Eclat’ association rules to promote safety. International Journal of Transportation Science and Technology 2018, S2046043017301028’.
Point 5: The English language should be improved, and a homogenization of the terms is suggested (crash versus accident and so on).
Response 5: Thank you for your valuable comments. Following the comment, we have checked and polished the full text for grammatical errors, improper word use, and so on. Moreover, to ensure the homogenization of the terms in our manuscript, we changed ‘accident’ to ‘crash’, because ‘accident’ refers to all kinds of accidents, whereas ‘crash’ mainly denotes a collision. ‘Crash’ is therefore more accurate in describing the content of this paper.
Round 2
Reviewer 1 Report
The authors have not yet provided a correct state of the art. An appropriate state of the art must indicate what has been done up to now on this topic, what problems are present, and how they have been addressed, thus highlighting what this study builds on. The subject addressed is very interesting, but given the great interest it must be explained better. Currently the paper seems to be a simple statistical exercise.
The second important point is the validation of the rules, as already mentioned in the previous review. Validation is generally performed in one of two ways. The first divides the database into two parts: one is used to build the rules and the other to validate them. The second method is called cross-validation and uses a simultaneous construction and validation process. The authors should specify how they validated these rules; at present they only claim to have used the Bonferroni correction.
Author Response
Response to Reviewer 1 Comments
Point 1: The authors have not yet provided a correct state of the art. An appropriate state of the art must indicate what has been done up to now on this topic, what problems are present, and how they have been addressed, thus highlighting what this study builds on. The subject addressed is very interesting, but given the great interest it must be explained better. Currently the paper seems to be a simple statistical exercise.
Response 1: Thank you very much for your valuable and thoughtful comments. We have carefully read the literature listed in your last comment and other related articles again.
(1) According to the comment, first, we emphasized the importance of identifying critical risk factors (page 1, lines 31-41) as follows:
Traffic crashes can be decreased significantly, and identifying the causes of a traffic crash is the most critical procedure in adopting precautionary measures to reduce the severity and quantity of traffic crashes. However, some previous studies estimated a model of crash frequency and severity using only the volume of traffic as an explanatory variable, while clearly many other factors affect the frequency and severity of crashes such as environmental conditions, roadway geometrics, driver characteristics, and so on. Due to the complex nature of traffic crashes, the policy decision makers must consider numerous contributory factors when making decisions on the improvement of safety. It is vital for decision makers to find the most significant factors that affect the occurrence and consequence of traffic crashes. After years of research, it is generally accepted that through recognizing risk factors as shown in Figure 1, which affect the severity of a crash and corresponding coping strategies, the impact of crashes can be significantly reduced.
(2) Then we described the efforts of previous studies, which can be divided into statistical methods and data mining. Both kinds of methods identified the risk factors while ignoring the correlation between factors (pages 1-2, lines 42-61) as follows:
Some previous studies have been devoted to identifying the contributing factors that affect the occurrence and severity of traffic crashes through traffic data. These studies proposed various approaches, such as binary logit/probit models, multinomial logit models, nested logit models, log-linear models, artificial neural networks, spatial and temporal correlations, Markov switching models, and genetic algorithms. Meanwhile, various contributing factors to the frequency and severity of traffic crashes have been identified in the above literature, such as weather, gender and age of drivers, posted speed, roadway geometrics, condition of drivers, and so on.
In recent years, the analysis of various types of data using data mining techniques has been attracting more and more attention among researchers. Data mining technology has been employed in traffic crash analysis and has achieved satisfactory results in areas such as assessing the inherent connection between crashes and road geometry, identifying critical points, identifying factors that contribute to the severity of traffic crashes, and examining the relationship between driver characteristics and traffic crashes. Many studies have analyzed crash data with data mining techniques. Agrawal et al. utilized the data mining technique of association analysis for crash data analysis. Golob and Recker used clustering analysis to relate prevailing traffic conditions on freeways to the type of collision most likely to occur. Prati et al. applied the decision tree technique and Bayesian networks to predict the severity of bicycle crashes. However, some of these studies are based on the hypothesis that these factors are independent of one another, which might misrepresent the contribution of each single factor.
(3) Finally, we proposed the Apriori algorithm, which does not rely on any hypothesis and can discover meaningful rules (page 2, lines 62-69) as follows:
Among these data mining techniques, association rules mining is a valid technique to analyze traffic crashes, since data mining methods do not rely on any hypothesis and can discover meaningful connections hidden in large data sets. There are three kinds of basic algorithms for association rules mining: the Apriori algorithm, partition-based algorithms, and FP-Tree. The Apriori algorithm is succinct and clear, and adopts an iterative, layer-by-layer search. Compared to the other two algorithms, the Apriori algorithm is more capable of processing large-scale data sets. In the current study, the Apriori algorithm was used to discover the significant rules between the factors and crashes in Wisconsin.
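The layer-by-layer search described above can be sketched as follows. This is a minimal, unoptimized illustration of the general Apriori idea (count candidates, keep the frequent ones, join them into larger candidates, prune candidates that have an infrequent subset); it is not the authors' implementation, and the function name, toy records, and threshold are hypothetical.

```python
from itertools import combinations


def apriori(records, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(records)
    # Level 1 candidates: every single item seen in the data.
    level = {frozenset([item]) for r in records for item in r}
    frequent = {}
    k = 1
    while level:
        # Count how many records contain each candidate of size k.
        counts = {c: sum(c <= r for r in records) for c in level}
        kept = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(kept)
        # Join step: merge frequent k-itemsets into candidates of size k + 1.
        k += 1
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k - 1)-subset of a candidate must be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))}
    return frequent


# Hypothetical toy crash records.
toy_records = [
    frozenset({"road=wet", "light=dark", "age=16-25"}),
    frozenset({"road=wet", "light=dark"}),
    frozenset({"road=wet", "age=16-25"}),
    frozenset({"light=dark", "age=16-25"}),
    frozenset({"road=wet", "light=dark", "age=16-25"}),
]
frequent_sets = apriori(toy_records, min_support=0.5)
for itemset, sup in sorted(frequent_sets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), sup)
```

Rules would then be generated from these frequent itemsets and filtered by minimum confidence, a step omitted here for brevity.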
Point 2: The second important point is the validation of the rules, as already mentioned in the previous review. Validation is generally performed in one of two ways. The first divides the database into two parts: one is used to build the rules and the other to validate them. The second method is called cross-validation and uses a simultaneous construction and validation process. The authors should specify how they validated these rules; at present they only claim to have used the Bonferroni correction.
Response 2: Thank you very much for your insightful comments. We are sorry that we did not specify the validity test process in the previous manuscript. Following the comment, we added a section (3.3) to introduce the validity test of the association rules in this revised manuscript. The current manuscript considered two approaches to validity testing: the direct adjustment approach and the holdout approach. We adopted the direct adjustment approach because it has the advantage of using the data for both association rule discovery and statistical evaluation; meanwhile, no more statistical tests will be required under this approach than under the holdout approach (pages 6-7, lines 158-182) as follows:
An extreme risk of type-I error exists because of the large number of association rules, so a validity test is needed to evaluate the statistical significance of the rules obtained. The validation process generally takes one of two forms. The first is the direct adjustment approach, which requires all association rules to pass statistical tests at the adjusted critical value. The second is the holdout approach, which divides the data into exploratory data, used to generate association rules without regard for the problem of multiple testing, and holdout data, used for statistical testing.
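For comparison with the direct adjustment approach, the holdout approach described above could be sketched as follows. The sketch assumes an even random split and a one-sided Fisher exact test on each rule's 2x2 contingency table, with a Bonferroni threshold over the rules carried into the holdout stage; the test choice, the split ratio, and all function names are assumptions made for illustration rather than details from the manuscript.

```python
import random
from math import comb


def split_records(records, seed=0):
    """Shuffle a copy of the records and split it into two halves."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]  # exploratory, holdout


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a: antecedent and consequent, b: antecedent only,
    c: consequent only, d: neither.
    """
    n, row1, col1 = a + b + c + d, a + b, a + c
    # Sum hypergeometric counts for tables at least as extreme as observed.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / comb(n, col1)


def rule_p_value(antecedent, consequent, records):
    """Build the 2x2 table for antecedent -> consequent and test it."""
    a = sum(antecedent <= r and consequent <= r for r in records)
    b = sum(antecedent <= r and not consequent <= r for r in records)
    c = sum(not antecedent <= r and consequent <= r for r in records)
    d = len(records) - a - b - c
    return fisher_one_sided(a, b, c, d)


def validate_on_holdout(rules, holdout, alpha=0.05):
    """Keep rules whose holdout p-value passes alpha / (number of rules)."""
    if not rules:
        return []
    threshold = alpha / len(rules)
    return [(x, y) for x, y in rules if rule_p_value(x, y, holdout) <= threshold]
```

In use, rules would be mined on the exploratory half only and then passed, together with the holdout half, to `validate_on_holdout`. Because only the rules that survive the exploratory stage are tested, the Bonferroni divisor is the number of candidate rules rather than the size of the full search space.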
In the current study, a direct adjustment approach is applied to test the validity of the association rules, as it has the advantage of using the data for both association rule discovery and statistical evaluation. Meanwhile, no more statistical tests will be required under this approach than under the holdout approach. A number of direct adjustment approaches can be employed to perform multiple hypothesis tests, such as the Bonferroni correction, the Sequentially Rejective Bonferroni procedure, the Adaptive Benjamini-Hochberg algorithm, and so on. The Bonferroni correction states that if an experimenter is testing n independent hypotheses on a set of data, then the statistical significance level that should be used for each hypothesis separately is 1/n times what it would be if only one hypothesis were tested. Because of its principle and characteristics, the Bonferroni correction makes the results more rigorous, with a tight upper bound. Thus, the Bonferroni correction was applied in the current study. The definition of the Bonferroni correction is as follows:
Let H1, H2, ..., Hn be a family of hypotheses and p1, p2, ..., pn be their corresponding p-values. Here n is the total number of null hypotheses, and n0 is the number of true null hypotheses. The familywise error rate (FWER) is the probability of rejecting at least one true Hi, in other words, of making at least one type I error. The Bonferroni correction rejects the null hypothesis for each pi ≤ α/n, where α is the global significance level. Proof that this controls the FWER follows from Boole's inequality.
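For reference, the Boole's inequality argument can be written out in a few lines. This is the standard textbook derivation rather than a quotation from the manuscript; I_0 (the index set of the true null hypotheses, with |I_0| = n_0) is notation introduced here, and the derivation assumes that each p-value of a true null hypothesis satisfies P(p_i ≤ t) ≤ t. The `align*` environment requires the amsmath package.

```latex
\begin{align*}
\mathrm{FWER}
  &= P\Bigl(\bigcup_{i \in I_0} \bigl\{ p_i \le \tfrac{\alpha}{n} \bigr\}\Bigr) \\
  &\le \sum_{i \in I_0} P\bigl( p_i \le \tfrac{\alpha}{n} \bigr)
     && \text{(Boole's inequality)} \\
  &\le n_0 \cdot \frac{\alpha}{n} \;\le\; \alpha .
\end{align*}
```

The bound uses only the union bound and n_0 ≤ n, so it holds whether or not the individual tests are independent.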
Reviewer 2 Report
Thank you for the revision. I believe you addressed all the review points and put effort into clarifying the outlined issues. I really appreciate your work. However, I am still not sure that the language is 100% understandable for all readers - there are still many uncommon phrases. Please have the text checked and corrected by a native English speaker.
In addition, please once again carefully check the information in the reference list, for example:
- Sometimes you mix names and surnames (you use them vice versa), for example in refs. 5 or 31.
- Sometimes you use incomplete source information. For example, ref. 2 has nothing in common with TRL (TRL is only listed as the library which provided information to the TRID database) - the paper comes from the PIARC congress.
Author Response
Response to Reviewer 2 Comments
Point 1: Thank you for the revision. I believe you addressed all the review points and put effort into clarifying the outlined issues. I really appreciate your work. However, I am still not sure that the language is 100% understandable for all readers - there are still many uncommon phrases. Please have the text checked and corrected by a native English speaker.
Response 1: Thank you for your valuable and thoughtful comments. Following the comment, we have checked the full text for grammatical errors, improper word use, singular and plural forms, homogenization of terms, and so on.
Point 2: In addition, please once again carefully check the information in the reference list, for example:
- Sometimes you mix names and surnames (you use them vice versa), for example in refs. 5 or 31.
- Sometimes you use incomplete source information. For example, ref. 2 has nothing in common with TRL (TRL is only listed as the library which provided information to the TRID database) - the paper comes from the PIARC congress.
Response 2: Thank you for your valuable and thoughtful comments. According to the comment, we have corrected the order of names and surnames in references 5 (page 14, line 337) and 31 (page 15, line 394). We have corrected the source information in ref.2 (page 13, lines 331-332). Moreover, we checked all the references.
Reviewer 3 Report
The authors addressed the reviewer's comments accordingly.
Author Response
Response to Reviewer 3 Comments
Point 1: The authors addressed the reviewer's comments accordingly.
Response 1: Thank you very much for your valuable and insightful comments, which pointed out the shortcomings of this article and gave us direction for improving our manuscript. Your comments have played a great role in improving our manuscript.