Peer-Review Record

Identifying a Correlation among Qualitative Non-Numeric Parameters in Natural Fish Microbe Dataset Using Machine Learning

Appl. Sci. 2022, 12(12), 5927; https://doi.org/10.3390/app12125927
by Hideaki Shima 1, Yuho Sato 2, Kenji Sakata 1, Taiga Asakura 1 and Jun Kikuchi 1,2,3,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4:
Submission received: 29 April 2022 / Revised: 6 June 2022 / Accepted: 8 June 2022 / Published: 10 June 2022
(This article belongs to the Special Issue Latest Advances and Prospects in Big Data)

Round 1

Reviewer 1 Report

The paper works towards mining correlation rules in datasets, extracted from experiments of microbiology domain. The machine learning techniques used are k-means, random forest and Apriori. R software is utilized for applying machine learning algorithms and results visualization. K-means and random forest algorithms are used for clustering and verification of clusters, respectively; then Apriori is applied for the association rules identification. 

The paper tries to identify the relationship between bacterial data and NMR data related to fish.

Observations:- 

Major revision 

The paper presented a novel application of combination (i-means) of machine learning techniques, for finding relationship rules among non-numeric parameters related to fish microbes' cultivation. 

1) The title does not depict the content of the paper; a possible title could be "Identifying correlation among qualitative non-numeric parameters in natural fish microbe dataset using machine learning".

2) "Market Basket Analysis" topic is related to association rule mining in the field of commerce. 

3) Table 1 and 2 must be in the main text. 

4) The problem being researched is not clearly presented. The discussion section should clearly discuss the results, i.e. the association rules and their implications. Secondly, in multiple places the authors refer to their other works, forcing the reader to go through those works first to understand the point of discussion!

4) The authors could not clearly outline the association rules elucidated from the machine learning experiments. More explanation required! Along with discussion on the dataset, to support the claim of Big Data. 

5) One model does not fit all scenarios. Therefore, authors should present the conclusion of their work specifically related to fish bacterial/microbe cultivated in natural environment scenario! 

6) The present presentation of work seems to be more towards the microbiology domain, something they already have presented in their previous publications, related to the fish research project. If the work has to be considered from machine learning algorithms, then the authors must give more detail of parameters setting for K-means, Random Forest and association rules (e.g. Confidence, support etc.), along with comprehensive rational for the parameter values! 

9) To justify the generic nature of the presented work, authors must come up with more experiments, with ensemble technique, on other datasets in the domain. Justification required for the statement in abstract ... "The analysis scheme presented here may increase the potential to identify important characteristics of big data sets". 

Author Response

The paper works towards mining correlation rules in datasets, extracted from experiments of microbiology domain. The machine learning techniques used are k-means, random forest and Apriori. R software is utilized for applying machine learning algorithms and results visualization. K-means and random forest algorithms are used for clustering and verification of clusters, respectively; then Apriori is applied for the association rules identification.

 

The paper tries to identify the relationship between bacterial data and NMR data related to fish

 

[Response]

Thank you for your understanding and for your appropriate comments on our analysis scheme. We have considered our responses to your comments.

 

The title does not depict the content of the paper; a possible title could be "Identifying correlation among qualitative non-numeric parameters in natural fish microbe dataset using machine learning".

 

[Response]

Thank you for your suggestion. We changed the title as you suggested.

 

"Market Basket Analysis" topic is related to association rule mining in the field of commerce.

 

[Response]

Thank you for your comment. We added “association analysis” to part of the text. However, because some papers use “Market Basket Analysis” in their titles for a concept similar to ours, we have kept the term “market basket analysis” in part of the paper.

 

Table 1 and 2 must be in the main text.

 

[Response]

Thank you for your comment. We reconsidered what is important and reorganized the figures and tables. Some figures and tables were moved to the main text.

 

The problem being researched is not clearly presented. The discussion section should clearly discuss the results, i.e. the association rules and their implications. Secondly, in multiple places the authors refer to their other works, forcing the reader to go through those works first to understand the point of discussion!

 

[Response]

Following your comment, we rewrote the introduction, the discussion, and related parts, and included the relevant information from our previous research. We also restructured the paper and added the formulas to the Related Work and Proposed Framework sections.

 

The authors could not clearly outline the association rules elucidated from the machine learning experiments. More explanation required! Along with discussion on the dataset, to support the claim of Big Data.

[Response]

Thank you for your comments. Our scheme could extract association rules, but their meaning was not known. To establish their meaning, we would need to check the association rules experimentally, as you said; in this paper, however, we focused on the method of mining association rules. This is why we could only refer to information from past reports for some of the extracted rules.

 

 

The present presentation of work seems to be more towards the microbiology domain, something they already have presented in their previous publications, related to the fish research project. If the work has to be considered from machine learning algorithms, then the authors must give more detail of parameters setting for K-means, Random Forest and association rules (e.g. Confidence, support etc.), along with comprehensive rational for the parameter values!

 

[Response]

Following your comment, we added the reference URL in line 166.

 One model does not fit all scenarios. Therefore, authors should present the conclusion of their work specifically related to fish bacterial/microbe cultivated in natural environment scenario! 

 

[Response]

We agree with your opinion that one model is not enough. The same is true of analysis techniques: we consider that conventional analysis methods and schemes are not sufficient. We believe that some meaningful rules can be extracted by our scheme. In addition, many microbes are difficult to culture, so it is still difficult to fully address the scenario of cultivation in a natural environment.

 

The present presentation of work seems to be more towards the microbiology domain, something they already have presented in their previous publications, related to the fish research project. If the work has to be considered from machine learning algorithms, then the authors must give more detail of parameters setting for K-means, Random Forest and association rules (e.g. Confidence, support etc.), along with comprehensive rational for the parameter values! 

 

[Response]

Following your comment, we inserted new sections: Related Work, Proposed Framework, and Future Work. We introduced the algorithm information, previous reports, formulas, and parameters in these sections and in Materials and Methods.

 

To justify the generic nature of the presented work, authors must come up with more experiments, with ensemble technique, on other datasets in the domain. Justification required for the statement in abstract ... "The analysis scheme presented here may increase the potential to identify important characteristics of big data sets".

 

[Response]

Thank you for raising this very important point. Owing to time constraints, we could not compare other machine learning methods as replacements for random forest, so we discuss this point in the Future Work section.

On the other hand, we needed to check and demonstrate the significance of our scheme. We therefore compared it with a currently used index, the silhouette coefficient, which is a well-known internal validation index for clustering. Using the data from our paper, we found little or no correlation between the random forest error rate and the silhouette coefficient; any correlation present was small. Of course, with an ideal data set, the error rate and the silhouette coefficient may be highly correlated. The important results are included in Supplemental Table 2 of the paper and discussed in line 289.

 

Author Response File: Author Response.docx

Reviewer 2 Report

In this manuscript, the authors propose an analysis scheme for unsupervised data and non-numerical variables. In this method, K-means clustering and random forest are combined to extract association rules with market basket analysis. The proposed method is investigated by using natural fish samples from Japanese hydrosphere cultivation.

Major comments:

  1. In the introduction section, the motivation of this work is not clear. The authors should introduce some previous related works in this field, then describe the existing methods’ advantages and disadvantages. Compared with the existing methods, the authors should highlight the major motivation of the proposed method.
  2. In the i-means analysis, it is not clear how to use the random forest after the K-means clustering? The authors should provide more details about this.
  3. In results section, the proposed method is not compared with any state-of-the-art methods. Moreover, it is not clear the effect of the proposed analysis scheme using other machine learning models (such as SVM). The authors should conduct the experiments for this.

 

Minor comments:

  1. Some important figures should be moved into the main text instead of the supplementary material. For example, Figure S3 (The results of mining the characteristic variance of the groups that were determined using i means).
  2. Page 3, line 111, the reference or URL of the TopSpin software should be cited.

Author Response

In this manuscript, the authors propose an analysis scheme for unsupervised data and non-numerical variables. In this method, K-means clustering and random forest are combined to extract association rules with market basket analysis. The proposed method is investigated by using natural fish samples from Japanese hydrosphere cultivation.

 

[Response]

Thank you for your understanding and for your appropriate comments on our analysis scheme. We have considered our responses to your comments.

 

  • In the introduction section, the motivation of this work is not clear. The authors should introduce some previous related works in this field, then describe the existing methods’ advantages and disadvantages. Compared with the existing methods, the authors should highlight the major motivation of the proposed method.

 

[Response]

Our motivation is to overcome the problems faced when using data sets collected in the field, toward the accomplishment of the Sustainable Development Goals. We rewrote this content in the introduction, lines 46 and 74.

 

In the i-means analysis, it is not clear how to use the random forest after the K-means clustering? The authors should provide more details about this.

 

[Response]

Thank you for your comment. We rewrote the details of how the K-means result is passed to the random forest algorithm. Briefly, the K-means result is converted into tentative classes, which are then used as the supervised labels. We added the details from line 189 onward.

 

In results section, the proposed method is not compared with any state-of-the-art methods. Moreover, it is not clear the effect of the proposed analysis scheme using other machine learning models (such as SVM). The authors should conduct the experiments for this.

 

[Response]

Thank you for raising this very important point. Owing to time constraints, we could not compare other machine learning methods as replacements for random forest, so we discuss this point in the Future Work section.

On the other hand, we compared our scheme with a currently used index, the silhouette coefficient, which is a well-known internal validation index for clustering. Using the data from our paper, we found little or no correlation between the random forest error rate and the silhouette coefficient; any correlation present was small. Of course, with an ideal data set, the error rate and the silhouette coefficient may be highly correlated. The important results are included in Supplemental Table 2 of the paper and discussed in line 289.

 

Some important figures should be moved into the main text instead of the supplementary material. For example, Figure S3 (The results of mining the characteristic variance of the groups that were determined using i means).

[Response]

Thank you for your comment. We reconsidered what is important and reorganized the figures and tables.

 

In the results, you have mentioned that a heatmap graph illustrates the correlations (Figure S2). Where is it? The link does not work; the same applies to S1, S3, and S4.

[Response]

Thank you for your comments. This was our mistake. In the resubmission, we have separated the main and supplemental materials.

Page 3, line 111, the reference or URL of the TopSpin software should be cited.

[Response]

Following your comment, we added the reference URL in line 166.

Author Response File: Author Response.docx

Reviewer 3 Report

Please see the attached file

Comments for author File: Comments.pdf

Author Response

Visualization of association rules among multiple measurement data with qualitative nonnumeric parameters by machine learning and market basket analyses

In this work, the authors have investigated and proposed an analysis scheme that combines two machine learning steps to mine association rules between non-numerical parameters. They claim that the aim of this analysis is to identify relationships between variables and to enable visualization of association rules from data of samples collected in the field.

 

[Response]

Thank you for your understanding and for your appropriate comments on our visualization scheme. We have considered our responses to your comments.

 

  • Since you are discussing the complex relations between variables with the aid of random forest, it would be good to discuss some nonlinear relations in the second paragraph of your introduction; see, e.g., Mesiar and Sheikhi (2021) below.

 

[Response]

Thank you for your comments and appropriate suggestion. We added more information to the introduction and the Related Work section, from line 53 onward.

 

Please provide the motivation for your work in a separate paragraph in the introduction.

 

[Response]

Our motivation is to overcome the problems faced when using data sets collected in the field, toward the accomplishment of the Sustainable Development Goals. We rewrote this content in the introduction, lines 46 and 74.

 

In the pre-processing step, did you encounter missing values, outliers, or other ill-conditioned data that you needed to handle? Please discuss in detail.

[Response]

In the market basket analysis, missing values can be treated as NULL, so missing values do not pose a problem. In i-means, the effect of outliers diminishes as i-means is repeated, so we did not treat outliers specially. Moreover, when handling non-experimental data sets from the field, there is no way to identify whether a value is an outlier.

1- There are some typos in the text, for example:
line 86: the validity checked using --> the validity "is" checked using
line 112: obtained --> were obtained
line 121: fish intestine for each samples --> fish intestine for each sample

[Response]

Following your comment, we corrected the mistakes. Because we rewrote the paper, the corrected lines are now 138, 168, and 176.

 

In the results, you have mentioned that a heatmap graph illustrates the correlations (Figure S2). Where is it? The link does not work; the same applies to S1, S3, and S4.

[Response]

Thank you for your comments. This was our mistake. In the resubmission, we have separated the main and supplemental materials.

 

 

Please equip the results section with a table of results that helps readers evaluate your results and identify their strengths and weaknesses.

Refs:

Mesiar, R. and Sheikhi, A., 2021. Nonlinear random forest classification, a copula-based approach. Applied Sciences, 11(15), p.7140.

[Response]

Following your comment, we inserted new tables of association rules. Tables 3-5 are shown below.

Author Response File: Author Response.docx

Reviewer 4 Report

This paper proposes a semi-supervised data mining framework to identify relationships between variables and visualize the association rules from various types of data. In the first step, the model uses k-means to segment the dataset into several groups and then applies random forest to choose the best segmentation with the lowest error rate. In the second step, the model uses association rules mining algorithm to extract patterns from each group.

The experiment was conducted on at least two datasets Miseq and NMR datasets (as I may understand correctly).

In general, the idea of combining both machine learning algorithms and association rules mining is interesting. However, the current version has many issues that the authors need to revise to improve the quality of the paper.

First, did the authors attach the supplemental material to the paper submission? I cannot find Tables S1, S2, and Figures S1, S2, and S3 in the paper.

In the Introduction, highlight the main contribution of the paper.

The authors need to reorganize the structure of the paper. I suggest a standard structure that includes 2. Related work, 3. Proposed Framework, 4. Experimental Results, 5. Discussion, 6. Conclusion and Future work

Following the above structure, in the Related work, the authors should introduce several works that combine both data mining and machine learning for the task of exploratory data analysis. I can find many works that use both clusterings, decision trees, and pattern mining in the literature such as 978-3-030-06155-5_26 and s10489-020-01677-5

In the proposed framework, the authors should insert a subsection to introduce and put the mathematical formulations for k-means, random forest, and association rules mining.

In this section, the authors should revise Figure 1 to make the workflow more detailed. I suggest authors make a workflow that shows all steps taken in this research to help readers easily capture what the authors have proposed. It should be a thorough workflow from the input to the output of the model.

Section 2 (in the current version) is very confusing. The authors should insert a subsection in section 4-Experimental Results to introduce the datasets used in the experiment. Also, in this section, the authors should show results yielded by the random forest. I suggest the authors conduct the experiment to evaluate the running time and memory usage of the proposed framework.

In the discussion, the authors should highlight the main findings in each dataset and visualize the results instead of using plain text. I also think there are multiple ways for performing only clustering to reduce the complexity of the framework. The authors can use the internal validation metrics such as the Silhouette coefficient to evaluate the quality of the clustering and thus do not need to use Random Forest. Another way is to use Hierarchical clustering for such problems. Here are several good examples that authors should refer into the discussion [https://doi.org/10.3390/app12010072], [https://doi.org/10.1007/978-981-15-1209-4_1].

In the clustering step, the authors should compare the performance when using random initialization and k-means++. If k-means++ can produce a better result, then do not need to repeat the clustering step for at least 10,000 rounds.

Author Response

This paper proposes a semi-supervised data mining framework to identify relationships between variables and visualize the association rules from various types of data. In the first step, the model uses k-means to segment the dataset into several groups and then applies random forest to choose the best segmentation with the lowest error rate. In the second step, the model uses association rules mining algorithm to extract patterns from each group.

The experiment was conducted on at least two datasets Miseq and NMR datasets (as I may understand correctly).

In general, the idea of combining both machine learning algorithms and association rules mining is interesting. However, the current version has many issues that the authors need to revise to improve the quality of the paper

 

[Response]

Thank you for your understanding and for your appropriate comments on association rules mining. We have considered our responses to your comments.

 

First, did the authors attach the supplemental material to the paper submission? I cannot find Tables S1, S2, and Figures S1, S2, and S3 in the paper.

 

[Response]

Thank you for your comments. This was our mistake. In the resubmission, we have separated the main and supplemental materials.

 

In the Introduction, highlight the main contribution of the paper.

 

[Response]

Our main contribution is the development of an appropriate method for unsupervised data and/or non-numerical variables collected in the field. We believe that such a method is needed to overcome the problems faced in accomplishing the Sustainable Development Goals. We rewrote this content in the introduction, lines 46 and 74.

 

The authors need to reorganize the structure of the paper. I suggest a standard structure that includes 2. Related work, 3. Proposed Framework, 4. Experimental Results, 5. Discussion, 6. Conclusion and Future work

Following the above structure, in the Related work, the authors should introduce several works that combine both data mining and machine learning for the task of exploratory data analysis. I can find many works that use both clusterings, decision trees, and pattern mining in the literature such as 978-3-030-06155-5_26 and s10489-020-01677-5

[Response]

Following your comment, we reorganized the structure of the paper. We added 2. Related Work, 3. Proposed Framework, and 7. Conclusion and Future Work to the paper.

In the proposed framework, the authors should insert a subsection to introduce and put the mathematical formulations for k-means, random forest, and association rules mining.

[Response]

Following your comment, we introduced the formulations that appear in our scheme, together with the related previous papers, in Section 3, Proposed Framework.

 

In this section, the authors should revise Figure 1 to make the workflow more detailed. I suggest authors make a workflow that shows all steps taken in this research to help readers easily capture what the authors have proposed. It should be a thorough workflow from the input to the output of the model.

 

[Response]

Following your comment, we replaced Figure 1 with a version that depicts our scheme in more detail.

Figure 1: Overview of our proposed analysis.

First, K-means clustering is performed on each data set and the validity is checked using random forest importance-based error rate. This part is referred to as “importance-based K-means,” or “i-means.” The resulting and original data sets are converted into zero-one data by data ranking.

 

Section 2 (in the current version) is very confusing. The authors should insert a subsection in section 4-Experimental Results to introduce the datasets used in the experiment. Also, in this section, the authors should show results yielded by the random forest. I suggest the authors conduct the experiment to evaluate the running time and memory usage of the proposed framework.

 

 

[Response]

Following your comment, we inserted subsections in the results and added the important information to Figure 2 and to the new Tables 3-5, which list the association rules extracted by our scheme.

In the discussion, the authors should highlight the main findings in each dataset and visualize the results instead of using plain text. I also think there are multiple ways for performing only clustering to reduce the complexity of the framework. The authors can use the internal validation metrics such as the Silhouette coefficient to evaluate the quality of the clustering and thus do not need to use Random Forest. Another way is to use Hierarchical clustering for such problems. Here are several good examples that authors should refer into the discussion [https://doi.org/10.3390/app12010072], [https://doi.org/10.1007/978-981-15-1209-4_1].

 

[Response]

We had empirically found that good clustering quality does not always correspond to the proper classification. However, this was only an intuition. Your suggestion motivated us to confirm it. We examined the correlation between the random forest classification error rate and the silhouette coefficient. Using the data from our paper, we found little or no correlation between the two; any correlation present was small. Of course, with an ideal data set, the error rate and the silhouette coefficient may be highly correlated. The important results are included in Supplemental Table 2 of the paper and discussed in line 289.
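For readers unfamiliar with the index discussed in this exchange, the following is a minimal Python sketch of the mean silhouette coefficient. It is an illustration only, not the authors' implementation (the paper used R): the function name `silhouette_score`, the Euclidean distance, the score of 0 for singleton clusters, and the toy points are all assumptions made here.

```python
import math
from typing import List, Sequence


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient over all points (Euclidean distance assumed)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    def mean_dist(i: int, cluster: int) -> float:
        # Mean distance from point i to the members of `cluster`, excluding i itself.
        others = [j for j, lab in enumerate(labels) if lab == cluster and j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores: List[float] = []
    for i, own in enumerate(labels):
        if sum(1 for lab in labels if lab == own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_dist(i, own)                                   # cohesion
        b = min(mean_dist(i, c) for c in clusters if c != own)  # separation
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated pairs of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # labels follow the geometry
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # labels cut across it
print(good, bad)
```

The index depends only on distances between points and their cluster labels, which is why it can disagree with a classifier-based error rate such as the one used in i-means.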

 

In the clustering step, the authors should compare the performance when using random initialization and k-means++. If k-means++ can produce a better result, then do not need to repeat the clustering step for at least 10,000 rounds.

 

[Response]

As you pointed out, K-means++ may improve the performance of i-means.

However, as mentioned above, because there is little correlation between the error rate and the silhouette coefficient, it is uncertain whether the error rate would converge faster than with standard K-means. We discuss this in line 344.
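The k-means++ initialization mentioned by the reviewer can be sketched as follows. This is a minimal illustration of the seeding step only; it says nothing about whether the i-means error rate would converge faster. The function name `kmeans_pp_init`, the fixed seed, and the toy data are assumptions made here.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans_pp_init(points: Sequence[Point], k: int, rng: random.Random) -> List[Point]:
    """Pick k initial centers with k-means++ seeding.

    Assumes `points` contains at least k distinct points; otherwise all
    sampling weights become zero and random.choices raises ValueError.
    """
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest already-chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Next center: sampled with probability proportional to that squared distance.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Hypothetical toy data: two tight groups far apart.
toy_data = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0)]
seeds = kmeans_pp_init(toy_data, k=2, rng=random.Random(0))
print(seeds)
```

Compared with uniform random initialization, this seeding favors centers that are far apart, which is the reason the reviewer suggests it could reduce the number of repeated clustering rounds.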

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Paper improved a lot.
1. Authors can use term Association Rule Mining instead of Market Basket Analysis. Market Basket Analysis term used for Retailers data analysis.
2. Lines 95 : Algorithms listed below ....
No algorithms given .... only mathematical expression given for support, confidence and lift!
3. Line 121: Give a couple of sentences explaining how the values were computed. Please also rephrase the sentences, because the cited articles [15,25-27] do not make clear whether they provide the values of support, confidence and lift for the authors' work.
4. Figure 1 : Normalised by DSS ... give a sentence ... what is DSS!

Author Response

Point-by-point letter to Reviewer 1

 

  1. Authors can use term Association Rule Mining instead of Market Basket Analysis. Market Basket Analysis term used for Retailers data analysis.

[Response]

Thank you for your comment. We replaced “market basket analysis” with “association rules mining” in this paper. However, some papers use “Market Basket Analysis” in the title and main text for a concept similar to ours, outside retail data analysis, so we have kept the term “market basket analysis” in part of the paper. Keeping the term makes the paper easier to find for readers who are interested in market basket analysis and want to apply it to data sets from other fields.

 

  2. Lines 95 : Algorithms listed below .... No algorithms given .... only mathematical expression given for support, confidence and lift!

 

[Response]

Following your comment, we added information on the algorithm used, “Apriori”, in line 115.

The listed names “K-means”, “Random forest” and “Apriori” are algorithm names. We also give formulas for the variables that appear in this paper.
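The three rule measures named in this exchange (support, confidence, and lift) follow their standard definitions, which the short Python sketch below illustrates. It is not the authors' code, which used R and the Apriori algorithm; the function names and the item labels such as `cluster_A` are hypothetical.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    return sum(1 for t in transactions if wanted <= t) / len(transactions)


def confidence(transactions: List[FrozenSet[str]], lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """support(lhs and rhs) / support(lhs); undefined if lhs never occurs."""
    both = frozenset(lhs) | frozenset(rhs)
    return support(transactions, both) / support(transactions, lhs)


def lift(transactions: List[FrozenSet[str]], lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """confidence(lhs -> rhs) / support(rhs); values above 1 suggest association."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


# Hypothetical zero-one data: each sample lists the items flagged as 1.
samples = [
    frozenset({"cluster_A", "taxon_X_high"}),
    frozenset({"cluster_A", "taxon_X_high", "metabolite_Y_high"}),
    frozenset({"cluster_B", "metabolite_Y_high"}),
    frozenset({"cluster_A"}),
]

rule_lhs, rule_rhs = {"cluster_A"}, {"taxon_X_high"}
rule_support = support(samples, rule_lhs | rule_rhs)
rule_confidence = confidence(samples, rule_lhs, rule_rhs)
rule_lift = lift(samples, rule_lhs, rule_rhs)
print(rule_support, rule_confidence, rule_lift)
```

For this toy rule, two of the four samples contain both items, three contain the left-hand side, and two contain the right-hand side, so the support is 2/4, the confidence is 2/3, and the lift is 4/3.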

 

  3. Line 121: Give a couple of sentences explaining how the values were computed. Please also rephrase the sentences, because the cited articles [15,25-27] do not make clear whether they provide the values of support, confidence and lift for the authors' work.

 

[Response]

Thank you for your comment. Although we are unsure if the changes accurately answer your comment, we have detailed how the parameters were set, in line 123. We referred to parameter setting and computing in Shiokawa’s paper (Shiokawa, Misawa et al. 2016).

 

  4. Figure 1 : Normalised by DSS ... give a sentence ... what is DSS!

 

[Response]

Thank you for your comment. This was our mistake. When we revised Figure 1, the position at which “DSS” first appears in the main text changed. We rewrote the legend of Figure 1.

 

 

Figure 1: Overview of our proposed analysis.

NMR measurements were normalized by 2,2-dimethyl-2-silapentane-5-sulfonate (DSS). Each data set was converted to the ratio to the sum for each sample. K-means clustering is performed on each data set, and the validity is checked using the random forest importance-based error rate. This part is referred to as “importance-based K-means,” or “i-means.” The resulting and original data sets are converted into zero-one data by data ranking or by the class information resulting from i-means. The zero-one data were analyzed by the Apriori algorithm. Finally, from the association rules extracted by the Apriori algorithm, we selected meaningful rules according to the importance calculated by random forest.
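Two of the preprocessing steps in this legend, the ratio-to-sum conversion and the conversion to zero-one data by ranking, can be sketched in Python as below. This is an assumption-laden illustration: the record does not state the ranking rule, so the `top_fraction` threshold, the function names, and the example values are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def ratio_to_sum(sample: Dict[str, float]) -> Dict[str, float]:
    """Scale one sample so that its values sum to 1."""
    total = sum(sample.values())
    return {name: value / total for name, value in sample.items()}


def rank_to_zero_one(column: List[float], top_fraction: float = 0.25) -> List[int]:
    """Flag the highest-ranked samples of one variable as 1 and the rest as 0.

    ASSUMPTION: `top_fraction` is a hypothetical threshold; the actual
    ranking rule used in the paper is not given in this record.
    """
    n_top = max(1, round(len(column) * top_fraction))
    order = sorted(range(len(column)), key=lambda i: column[i], reverse=True)
    flags = [0] * len(column)
    for i in order[:n_top]:
        flags[i] = 1
    return flags


# Hypothetical intensities for three samples and two variables.
raw = [
    {"var_a": 2.0, "var_b": 6.0},
    {"var_a": 5.0, "var_b": 5.0},
    {"var_a": 9.0, "var_b": 1.0},
]
normalized = [ratio_to_sum(s) for s in raw]
var_a_flags = rank_to_zero_one([s["var_a"] for s in normalized], top_fraction=0.34)
print(normalized[0], var_a_flags)
```

Each 1 in the resulting zero-one table becomes an item in a sample's transaction, which is the input form the Apriori algorithm expects.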

 

Reference

 

Shiokawa, Y., et al. (2016). "Application of market basket analysis for the visualization of transaction data based on human lifestyle and spectroscopic measurements." Analytical Chemistry 88(5): 2714-2719.

   

Author Response File: Author Response.docx

Reviewer 2 Report

The authors have addressed all my previous concerns. I have no more comments.

Author Response

Thank you so much for your previous comments.

Reviewer 4 Report

The authors have improved the paper based on my suggestions, thus I vote for acceptance.

Author Response

Thank you so much for your previous comments.
