Peer-Review Record

Using Decision Tree to Predict Response Rates of Consumer Satisfaction, Attitude, and Loyalty Surveys

Sustainability 2019, 11(8), 2306; https://doi.org/10.3390/su11082306
by Jian Han 1, Miaodan Fang 1, Shenglu Ye 1, Chuansheng Chen 2, Qun Wan 3,* and Xiuying Qian 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 21 March 2019 / Revised: 12 April 2019 / Accepted: 15 April 2019 / Published: 17 April 2019

Round 1

Reviewer 1 Report

I found this an interesting and generally well-presented analysis of a topic which is important in research practice. I do think however that much more analysis could have been undertaken.

Major Points

1) The dependent variable (response rate) is quantitative, but the decision was taken to dichotomise it. Why? Why not carry out regression tree analysis with the actual response rate as dependent variable?

2) Even granted the decision to dichotomise, it is not obvious to me that the 50% cut-off (albeit close to the mean) is justified. Many important surveys demand 70% response. At least, different cut-offs could have been tried.

3) Regression/classification tree methodology generally uses "pruning" to avoid over-fitting. This does not appear to have been done. Pruning might have removed the split of node 4 in Figure 6, which has only 8 items. Suggesting a critical length of 20 items may well be over-interpreting the results.


Other points

4) Training and test sets can be employed with any statistical prediction methodology, so it is slightly misleading to emphasise their use as an advantage of trees.

5) Why merge the telephone and online survey modes? One has personal contact, the other doesn't, which is a big difference.

6) The logistic regression model (or linear regression if a quantitative dependent variable is used) could be presented, to show which factors are significant in that analysis, for comparison.

7) In Tables 1 and 2, the symbols "a", "b" etc. need to be superscript everywhere they are used.

8) In Table 2, "d" and "e" footnotes are the wrong way round. (There is no need to use d and e, they could have been a and b.) Also, these two definitions are unclear. There is nothing in the text to explain to the reader what "identified relevant information" means.

9) Write "p<0.001" rather than "p=0.000"

10) Correct the spelling of "invitation" in "Direct invitation" in Figures 4 and 6.

11) Line 122: "have" not "has"

Author Response

Major Points

1)     The dependent variable (response rate) is quantitative, but the decision was taken to dichotomise it. Why? Why not carry out regression tree analysis with the actual response rate as dependent variable?

Following this and the other reviewers’ helpful suggestions, we have added a regression tree analysis (see below for our justification for keeping the original analysis and the reasons for selecting 50% as the cutoff point). We used all attributes to construct a decision tree regression model to predict response rate. The Classification and Regression Tree (C&RT) algorithm implemented in IBM SPSS Modeler 18.0 was used to construct the prediction model. The data were divided into a training set (80%) and a test set (20%). Results showed that the linear correlation coefficient between the predicted and actual values of the decision tree regression model was 0.722 in the training set and 0.578 in the test set.

Of all the predictors, direct invitation had by far the highest importance. Mode of data collection, confidentiality, relevance of topics, type of survey sponsors, and questionnaire length had similar levels of importance. To simplify the decision tree model and avoid overfitting, the decision tree was pruned.
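For readers who want to reproduce this kind of workflow outside SPSS Modeler, here is a minimal sketch using scikit-learn's CART implementation (an analogue of C&RT, not the Modeler procedure itself; the data file and column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data file: one row per survey, response_rate stored as a proportion.
df = pd.read_csv("surveys.csv")
X = pd.get_dummies(df.drop(columns=["response_rate"]))  # encode categorical attributes
y = df["response_rate"]

# 80/20 training/test split, as in the reported analysis.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# CART regression tree; the depth/leaf-size limits stand in for Modeler's internal defaults.
tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10, random_state=0).fit(X_train, y_train)

# Linear (Pearson) correlation between predicted and actual response rates.
r_train = np.corrcoef(tree.predict(X_train), y_train)[0, 1]
r_test = np.corrcoef(tree.predict(X_test), y_test)[0, 1]
print(f"train r = {r_train:.3f}, test r = {r_test:.3f}")
```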


2)     Even granted the decision to dichotomise, it is not obvious to me that the 50% cut-off (albeit close to the mean) is justified. Many important surveys demand 70% response. At least, different cut-offs could have been tried.

 

We kept the dichotomized analysis because researchers’ eventual decisions are binary (whether or not to conduct the survey), so we think this analysis remains useful. In terms of the cutoff point, we selected 50% for at least two reasons. First, Baruch et al. (2008) found that the average response rate for 1607 studies that utilized data collected from individuals was 52.7%, so achieving an average or higher response rate is a reasonable goal for researchers. Second, given our limited sample size, 50% gave us the maximum variance to run the analyses. If we had used another criterion such as 70%, the number of cases in one category would have been too small.

 

3)     Regression/classification tree methodology generally uses "pruning" to avoid over-fitting. This does not appear to have been done. Pruning might have removed the split of node 4 in Figure 6, which has only 8 items. Suggesting a critical length of 20 items may well be over-interpreting the results.

 

As mentioned earlier, our new regression tree analysis used pruning. As for the classification analysis, we used the C&RT and C5.0 algorithms implemented in IBM SPSS Modeler 18.0 to construct the prediction model. In SPSS Modeler, the system automatically adjusts the parameter values, and pruning is performed based on the prediction accuracy of the model.
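SPSS Modeler handles pruning internally; to make the idea concrete, here is a sketch of explicit cost-complexity pruning in scikit-learn, reusing X_train and y_train from the sketch above (this illustrates the general technique, not Modeler's exact procedure):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Candidate pruning strengths from the cost-complexity pruning path.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, float("-inf")
for alpha in path.ccp_alphas:
    model = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
    score = cross_val_score(model, X_train, y_train, cv=5).mean()  # mean R^2 over folds
    if score > best_score:
        best_alpha, best_score = alpha, score

# Refit with the cross-validated alpha; a larger alpha prunes more aggressively.
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)
```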

 

Other points

4)     Training and test sets can be employed with any statistical prediction methodology, so it is slightly misleading to emphasise their use as an advantage of trees.

We have modified this point according to your comment.

 

5)     Why merge the telephone and online survey modes? One has personal contact, the other doesn't, which is a big difference.

Following this reviewer’s helpful suggestion, we have separated online and phone surveys.

 

 

6)     The logistic regression model (or linear regression if a quantitative dependent variable is used) could be presented, to show which factors are significant in that analysis, for comparison.

We compared our new regression decision tree model with a prediction model based on traditional linear regression. The regression equation was significant, F(7, 88) = 9.273, p < 0.001, with an adjusted R² of 0.379. The linear correlation coefficient between the predicted and actual values of the linear regression model was 0.615 in the training set and 0.423 in the test set, both lower than the corresponding coefficients of our regression tree model.

Consistent with the results of the decision tree model, direct invitation had the highest importance (β = 0.498, p < 0.001). However, the second most important attribute was confidentiality (β = 0.159, p = 0.058) and the third was mode of data collection (β = -0.154, p = 0.066).
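A comparison along these lines can be sketched with statsmodels (standardizing variables so the coefficients read as βs; again reusing the hypothetical data from the earlier sketch, and assuming no constant-valued columns):

```python
import statsmodels.api as sm
from scipy.stats import zscore

# Standardize predictors and outcome so coefficients are standardized betas.
Xz = X_train.apply(zscore)
ols = sm.OLS(zscore(y_train), sm.add_constant(Xz)).fit()
print(ols.summary())  # F-statistic, adjusted R-squared, per-predictor p-values
```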

 

 

7)     In Tables 1 and 2, the symbols "a", "b" etc. need to be superscript everywhere they are used.

We have made the change.

 

8)     In Table 2, "d" and "e" footnotes are the wrong way round. (There is no need to use d and e, they could have been a and b.) Also, these two definitions are unclear. There is nothing in the text to explain to the reader what "identified relevant information" means.

In Table 2, we have changed the ‘d’ and ‘e’ footnotes to ‘a’ and ‘b’ and corrected their order.

We have also revised the definitions of precision rate and recall rate:

Precision rate = TP / (TP + FP),

Recall rate = TP / (TP + FN).

where,

TP (true positive) is the number of positive samples predicted by the classifier as positive;

TN (true negative) is the number of negative samples predicted by the classifier as negative;

FP (false positive) is the number of negative samples predicted by the classifier as positive;

FN (false negative) is the number of positive samples predicted by the classifier as negative.
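In code, these quantities fall out of the confusion matrix directly; a quick scikit-learn sketch with toy labels:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = ["High", "High", "Low", "High", "Low", "Low"]  # toy ground truth
y_pred = ["High", "Low", "Low", "High", "High", "Low"]  # toy predictions

# With labels ordered [Low, High], the matrix is [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred, labels=["Low", "High"]))
print(precision_score(y_true, y_pred, pos_label="High"))  # TP / (TP + FP)
print(recall_score(y_true, y_pred, pos_label="High"))     # TP / (TP + FN)
```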

 

9)     Write "p<0.001" rather than "p=0.000"

We have made the change.

 

10)  Correct the spelling of "invitation" in "Direct invitation" in Figures 4 and 6.

We have corrected it.

 

11)  Line 122: "have" not "has"

We have added the author’s name, so “has” is correct.


Author Response File: Author Response.pdf

Reviewer 2 Report

An interesting paper and research on the factors of survey response rates, with appropriate methodology and data analysis. 

The only addition that I think would improve its overall merit, especially in what concerns readers' comprehension of the data analysis section, would be a description of the specific approach of the C5.0 algorithm, maybe in comparison to logistic regression.

Author Response

The only addition that I think would improve its overall merit, especially in what concerns readers' comprehension of the data analysis section, would be a description of the specific approach of the C5.0 algorithm, maybe in comparison to logistic regression.

We thank this reviewer for his/her encouraging comments. We have added the following description: “C5.0 is one of the classic decision tree algorithms. It learns to predict discrete outputs based on the values of the inputs it receives, and it performs robustly whether the inputs are linear or nonlinear, even when values are missing. C5.0 splits the sample on the field that yields the maximum information gain (reduction in entropy). Information entropy reflects the impurity of the data: the more impure the data, the larger the entropy. Finally, C5.0 generates a decision tree or rule set with very straightforward explanations.”
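C5.0 itself is proprietary, but the entropy-based splitting criterion it relies on is easy to illustrate; here is a small self-contained Python sketch of information gain for a candidate binary split (the attribute and labels are purely hypothetical):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array: larger means more impure."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    """Entropy reduction achieved by splitting labels with a boolean mask."""
    n = len(labels)
    left, right = labels[split_mask], labels[~split_mask]
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Toy example: High/Low response-rate labels split by a hypothetical attribute.
y = np.array(["High", "High", "Low", "Low", "High", "Low"])
mask = np.array([True, True, True, False, False, False])
print(information_gain(y, mask))  # the split the algorithm prefers maximizes this value
```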

Reviewer 3 Report

I think this paper addresses an interesting topic with a novel methodology; however, I think further analysis is required. Most importantly, you have a nicely sized sample, but you lose a lot of data by categorising it to a binary level to assist in using a decision tree. The arbitrary use of 50% to almost evenly split your data and to categorise high and low response is also of concern, as I think some further justification from the literature as to actual response rates should be provided. This should be included in the introduction. Often a 35 or 40% response rate is considered high.


It would be good to know a little more about the samples: are they probabilistic, convenience, etc.?


My line-by-line commentary:

Line 35: the use of "after all" is too colloquial and unnecessary.

Lines 56 & 83

I am not sure as to editorial direction but I believe a name and reference should be used here.


The materials and methods section needs more work. You need to explain your statistical methods, cut-offs, tests for significance, and packages used. Some of this can be found in the results section near lines 181-184.


Line 84

The use of "however" is redundant.


Line 174 

Regression analysis would result in significant retention of data here, as binary categorisation results in substantial information loss.


Line 187: spelling error in the title of the column.


Lines 201-203 

It would be good to see the regressions done as a linear model rather than a logistic model. Further, it is not clear whether this regression was performed as a multivariable or univariable analysis. Most importantly, I would like to see what would happen if the order of events were changed in the decision tree.


Overall I think this paper has merit and is of interest to the readers and with a little more sensitivity analysis the paper would be an excellent presentation of a novel approach. 


 

Author Response

I think this paper addresses an interesting topic with a novel methodology; however, I think further analysis is required. Most importantly, you have a nicely sized sample, but you lose a lot of data by categorising it to a binary level to assist in using a decision tree.

Following your and Reviewer 1’s suggestions, we have added a regression tree analysis, which treats the response rate as a continuous variable. See our responses to Reviewer 1’s comments #1 and #2 for more details.


The arbitrary use of 50% to almost evenly split your data and to categorise high and low response is also of concern, as I think some further justification from the literature as to actual response rates should be provided. This should be included in the introduction. Often a 35 or 40% response rate is considered high.

 

Reviewer 1 raised the same concern, though interestingly he/she suggested a higher cutoff point of 70%. Please see our response to Reviewer 1’s comment #2 for our reasons for using 50%.



My line-by-line commentary:

Line 35... the use of after all is too colloquial and unnecessary

We have deleted “After all”.

 

Line 56 & 83

I am not sure as to editorial direction but I believe a name and reference should be used here.

We apologize for the automatic referencing program error. We have added the authors’ names where needed (in about 10 places throughout the manuscript).

e.g., Line 56: Helgeson et al. [57] concluded that….

Line 83: As Groves and Lyberg [58] pointed out…

 

The materials and methods section needs more work. You need to explain your statistical methods, cut-offs, tests for significance, and packages used. Some of this can be found in the results section near lines 181-184.

We have added the following content.

“We first analyzed the influence of each survey attribute on response rates by performing ANOVA in SPSS 22.0. Then, we constructed a decision tree regression model by applying the C&RT algorithm, as well as a linear regression model, with the response rate as the dependent variable and all attributes as predictors, in IBM SPSS Modeler 18.0. Finally, we used 50% as the cutoff point to divide our sample studies into those with ‘High’ response rates and those with ‘Low’ response rates, and the decision tree classification model was constructed with all useful attributes as predictors by applying the C5.0 algorithm.”
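For illustration, the classification step described above might look like this in scikit-learn (an entropy-criterion CART standing in for C5.0, which has no open scikit-learn implementation; file and column names are hypothetical, and response rates are assumed to be stored as proportions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("surveys.csv")  # hypothetical data file
X = pd.get_dummies(df.drop(columns=["response_rate"]))
y = (df["response_rate"] >= 0.5).map({True: "High", False: "Low"})  # 50% cutoff

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=10, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # precision and recall per class
```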

 

Line 84

The use of "however" is redundant.

We have deleted “however”.

 

Line 174 

Regression analysis would result in significant retention of data here, as binary categorisation results in substantial information loss.

See our response to Comment #1.

 

Line 187 spelling error in the title of the column

We have made the correction.

 

Lines 201-203 

It would be good to see the regressions done as a linear model rather than a logistic model. Further, it is not clear whether this regression was performed as a multivariable or univariable analysis.

 

As mentioned in our response to Reviewer 1’s Comment #1, we have added a regression tree analysis as well as a linear regression. See that comment for the results.

 

Most importantly I would like to see what happened if the order of events were changed in the decision tree

 

In the decision tree classification and regression analyses, the order of the factors was determined by the algorithm, not by the researcher: at each node, the algorithm selects the attribute that yields the greatest impurity reduction, so the order of splits emerges from the data and cannot be fixed in advance.


Round 2

Reviewer 1 Report

Following the various additions that the authors have made, I have no further comments on the substance of the paper.

Some trivial points:

Line 164: "and found" not just "found"

Line 226: "In the previous section" (insert "the")

Line 247: need space between "countries" and "(59"

Lines 298-299: it looks like "Hackler and Bourgett" is in smaller font than the rest of the text

Author Response

Some trivial points:

Line 164: "and found" not just "found"

We have made the change.

Line 226: "In the previous section" (insert "the")

We have made the change.

Line 247: need space between "countries" and "(59"

We have made the change.

Lines 298-299: it looks like "Hackler and Bourgett" is in smaller font than the rest of the text

We have made the change.

