Review Reports - Multi-Class Assessment Based on Random Forests

Round 1

Reviewer 1 Report

Dear authors,

I have read your manuscript with great interest.

I would like to point out just a few aspects / suggestions related to the structure:

I think that the content of section 2.1. Related works should be part of 1. Introduction. Thus, the content of 2.2. Methodology would be 2. Material and methods.

In section 1. Introduction it would be helpful to add some citations supporting the main ideas of the text (lines 18 to 38).

Line 80. You could write Random Forests (RF), so the acronym RF would be defined for the entire article.

In Figure 1, the text is misplaced.

In the rest of the document, you could replace Random Forests with RF.

Line 94. There should be a citation next to Breiman []

Line 184. The phrase "You must therefore add the hyperparameter class_weight = balanced." could be written in a more impersonal way.

Lines 189-190. I would like to understand (perhaps the wording could be clearer) whether 80% is randomly selected and the remaining 20% is used for the test, or whether other data is arbitrarily selected up to 20% of the total. I understand that you mean the former, but the wording could be improved.

Line 220. Are there 1081 leaves including the root?

Please check the numbering of the Figures in the document (I have marked them in the attached file).

Lines 238-239. Maybe that paragraph should be deleted.

Table 2. Please correct the names.

Figures 7 and following: would it be possible to edit them so that the numbers are easier to read?

References should be reviewed and adapted to the format of the journal.

Thank you very much.

Best Regards,

Comments for author File: Comments.pdf

Author Response

Dear Reviewer, below you will find all the changes that have been made in response to your comments:

I think that the content of section 2.1. Related works should be part of 1. Introduction. Thus, the content of 2.2. Methodology would be 2. Material and methods.

--> Done, the numbering has been modified

In section 1. Introduction it would be helpful to add some citations supporting the main ideas of the text (lines 18 to 38).

--> We have added some citations

Line 80. You could write Random Forests (RF), so the acronym RF would be defined for the entire article.

--> Done

In Figure 1, the text is misplaced.

--> The text area has been resized

In the rest of the document, you could replace Random Forests with RF.

--> Done, RF is now used throughout the document

Line 94. There should be a citation next to Breiman []

--> The citation has been added

Line 184. The phrase "You must therefore add the hyperparameter class_weight = balanced." could be written in a more impersonal way.

--> The sentence has been changed: "It can be therefore useful to add the hyperparameter "class_weight = balanced". This automatically weights classes inversely proportional to the frequency of their appearance in the data."
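
To make the weighting in this response concrete, the following is a minimal sketch in plain Python. It assumes the manuscript relies on scikit-learn, whose documentation gives the "balanced" weight of a class as n_samples / (n_classes * class_count). The function name and the label set are hypothetical and are not taken from the manuscript.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical, imbalanced label set: 8 passing students, 2 failing.
labels = ["pass"] * 8 + ["fail"] * 2
weights = balanced_class_weights(labels)
print(weights)  # {'pass': 0.625, 'fail': 2.5}
```

The rarer class receives the larger weight, so errors on it count more during training.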

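Lines 189-190. I would like to understand (perhaps the wording could be clearer) whether 80% is randomly selected and the remaining 20% is used for the test, or whether other data is arbitrarily selected up to 20% of the total. I understand that you mean the former, but the wording could be improved.

--> The first interpretation is the correct one. I have changed the sentence to make it clearer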
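
The interpretation confirmed here (a random 80% for training, the remaining 20% for testing) can be sketched as follows. This is an illustration only: the function name, the fixed seed, and the sample count of 100 are assumptions, and the manuscript may well use a library routine instead.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the sample indices once, then cut them into two disjoint parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    # The first n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(100)
print(len(train_idx), len(test_idx))  # 80 20
```

Every sample lands in exactly one of the two sets, which is what distinguishes this reading from selecting the test data separately.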

Line 220. Are there 1081 leaves including the root?

--> No, we refer only to the terminal leaves of the tree, not to its total number of nodes. The sentence has been changed
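
The distinction drawn in this response between leaves and total nodes can be illustrated with a toy tree. The sketch below is hypothetical and unrelated to the authors' implementation; it only shows that the two counts differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def count_nodes(node):
    """Count every node: root, internal nodes, and leaves."""
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)


def count_leaves(node):
    """Count only terminal nodes, i.e. nodes without children."""
    if node is None:
        return 0
    if node.left is None and node.right is None:
        return 1
    return count_leaves(node.left) + count_leaves(node.right)


# Toy tree: the root splits into a leaf and an internal node with two leaves.
tree = Node(left=Node(), right=Node(left=Node(), right=Node()))
print(count_leaves(tree), count_nodes(tree))  # 3 5
```

If every split has exactly two children, as in CART-style trees, a tree with n leaves has 2n - 1 nodes in total. Under that assumption, 1081 leaves would correspond to 2161 nodes.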

Please check the numbering of the Figures in the document (I have marked them in the attached file).

--> Thank you for pointing out our mistake, we have corrected it

Lines 238-239. Maybe that paragraph should be deleted.

--> yes.... a stupid oversight ;-)

Table 2. Please correct the names.

--> done

Figures 7 and following: would it be possible to edit them so that the numbers are easier to read?

--> We corrected the numbers and the colors; I hope it is better now

References should be reviewed and adapted to the format of the journal.

--> Done

Thank you for all your constructive remarks.

Author Response File: Author Response.docx

Reviewer 2 Report

The authors describe the application of the random forest method to the classification of students according to their study results. The manuscript is well structured; however, it suffers from many insufficiencies.

The first sentences in the introduction describe a general problem of many universities: failure rate of students in the first year of the study. However, it is not stated whether this problem is across all study programs (i.e. engineering, life sciences, humanities, management...) or whether it is more serious in some study areas.
Page 2, lines 69-70 and 78-79 are almost the same.
Page 3, line 97 - formal comment - the symbol ρ¯ - the macron should be over ρ and not next to it (like here in the editor, which has limited options).
Page 3, lines 113-114 - the sentence starting with "We are also aware..." is not clear. There is no additional information about the pre-processing.
Page 3, line 121 - selection of students is mentioned. However, no information about the criteria and procedure of selection is presented.
Pages 3-4 - section Methodology - it is written that from all the information about the candidates the authors get "a fairly substantial data set". More detailed information about the features (total number, content) is missing. It would be interesting for the reader to see a complete list of considered features.
Page 4, table 1 - the table needs more detailed explanation. For example, why was this particular division into intervals chosen?
Page 6, line 195 - "train dataset" - it should be "training dataset".
Figures 7 and 8 require enlargement. Please consider other colors. It is difficult to read black numbers on dark blue background.
Page 10, lines 291-292 - the sentence is unclear. Please try to re-formulate it.
Page 11, line 322 - what is the meaning of the abbreviation "PACES"? It is not explained in the text.
References No. 3, 16, 17 and 18 are incomplete.

Author Response

Dear Reviewer, below you will find all the changes that have been made in response to your comments:

The first sentences in the introduction describe a general problem of many universities: failure rate of students in the first year of the study. However, it is not stated whether this problem is across all study programs (i.e. engineering, life sciences, humanities, management...) or whether it is more serious in some study areas.

--> This paragraph has been changed: "Many universities face the same problem: the first-year failure rate is far too high. For example, the average failure rate observed in first year in France is more than 60%. This is a problem common to all areas, albeit to varying degrees [6]. It is particularly high in France in science, psychology, law and so on. This phenomenon concerns both small and very large cohorts [7-8]. This failure rate is found both in training courses that do not select their students and in so-called selective training."

Page 2, lines 69-70 and 78-79 are almost the same.

--> The second sentence has been deleted

Page 3, line 97 - formal comment - the symbol ρ¯ - the macron should be over ρ and not next to it (like here in the editor, which has limited options).

--> Done, we put the correct notation
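
For reference, the notation requested by the reviewer is written in LaTeX with `\bar`, which places the macron directly over the symbol. The second line is included only as context: it is the generalization error bound from Breiman's 2001 Random Forests paper, in which the mean correlation between trees appears. Whether line 97 of the manuscript states exactly this bound is an assumption.

```latex
\bar{\rho}
\qquad
PE^{*} \le \frac{\bar{\rho}\,(1 - s^{2})}{s^{2}}
```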

Page 3, lines 113-114 - the sentence starting with "We are also aware..." is not clear. There is no additional information about the pre-processing.

--> We added additional information to make this clearer: "We are also aware that the nature of the data we consider requires us to make a significant effort of pre-processing. This is because the dataset contains almost empty columns because all the subjects existing in the school are represented and not all the pupils follow them. Moreover, only the grades obtained in the subjects by the students, the best and worst averages and the class average are available. A lot of information must be removed from the initial dataset in order to ensure strict anonymization (surname, first name, date of birth, address, telephone, e-mail, etc.). It is also important to be able to obtain indications such as the evolution of grades in a subject over time in order to have a better profile. In the following, we will show the strategies we have implemented to improve our prediction."

Page 3, line 121 - selection of students is mentioned. However, no information about the criteria and procedure of selection is presented.

--> We added this sentence to better explain the criteria for selection : "This selection is made by looking at scores in mathematics, French and English, and also by looking at behavioral remarks."

Pages 3-4 - section Methodology - it is written that from all the information about the candidates the authors get "a fairly substantial data set". More detailed information about the features (total number, content) is missing. It would be interesting for the reader to see a complete list of considered features.

--> We added the number of characteristics of the dataset at the beginning of the process and after pre-processing: "We have thus obtained a fairly large set of data if we consider the information that characterizes each candidate for our training. Thus, our initial data set contains more than 400 characteristics. After removing characteristics that are either unnecessary or related to the identity of the students and adding additional information, the dataset contains more than 1000 characteristics."

Page 4, table 1 - the table needs more detailed explanation. For example, why was this particular division into intervals chosen?

--> We added more information: "The class numbered 0 is also the most obvious: success (result above 10/20) and failure (result below 10/20). But we also tried to be more precise in our classification. For class number 1, we therefore consider pupils with great difficulties (result below 7.5/20), those who succeed (result above 10/20) and pupils who have results below 10/20 but who have not completely failed. As this study is exploratory, we have declined this class into two other similar classes (classes 2 and 3) by slightly modifying the boundaries of each group, as shown in Table 1. Finally, we also plan to separate the pupils in two other even more precise configurations (classes 4 and 5) by dividing the notes into 4 and then 5 intervals."
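
The interval scheme described in this response can be sketched as a threshold lookup. Only configurations 0 and 1 are covered, because their boundaries (7.5 and 10 out of 20) are the only ones stated here; the boundaries for classes 2 to 5 are in Table 1 and are not reproduced. The names are hypothetical, and assigning a grade exactly on a boundary to the upper group is an assumption.

```python
from bisect import bisect_right

# Boundaries (out of 20) stated in the response, for configurations 0 and 1 only.
CONFIG_THRESHOLDS = {
    0: [10.0],       # group 0: failure (< 10), group 1: success
    1: [7.5, 10.0],  # group 0: great difficulties, group 1: below 10, group 2: success
}


def grade_to_group(grade, config):
    """Return the index of the interval that contains `grade`.

    Assumption: a grade exactly on a boundary falls into the upper group.
    """
    return bisect_right(CONFIG_THRESHOLDS[config], grade)


print([grade_to_group(g, 1) for g in (5.0, 8.0, 12.0)])  # [0, 1, 2]
```

Adding a configuration then only requires a new list of boundaries, which matches the exploratory approach of shifting the limits between classes.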

Page 6, line 195 - "train dataset" - it should be "training dataset".

--> Done

Figures 7 and 8 require enlargement. Please consider other colors. It is difficult to read black numbers on dark blue background.

--> The size of the numbers and the colors have been changed. I hope it is clearer now

Page 10, lines 291-292 - the sentence is unclear. Please try to re-formulate it.

--> You are right, it was totally incomprehensible, please excuse us. We changed it: "This model is therefore more interesting for predicting students who will largely fail than those with an average around 10/20."

Page 11, line 322 - what is the meaning of the abbreviation "PACES"? It is not explained in the text.

--> This refers to studies in the medical field. We have changed the sentence to make it more understandable: "This can be considered very important like biology or medical training where it can exceed a thousand each year, or low like graphic design schools, for example Itecom Art Design with around twenty a year."

References No. 3, 16, 17 and 18 are incomplete.

--> Thank you, we had missed it.

Thank you for all your constructive remarks.

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

The manuscript has been significantly improved. However, I did not find a response to my question concerning features. I understand that it is impossible to list all considered features. The authors should present the most important features, those that contribute most to the prediction.

In section 2.2 Data Processing the authors mention that they did not consider features that mostly have no values and that they added new data correlated with the dataset. I recommend adding examples of both groups for better understanding.

Author Response

Dear reviewer, you will find our responses to your comments below:

--> We added the following paragraph:

The dataset has been designed to be compatible with all the university's courses, whatever their field of study. Thus we consider all the subjects that can be followed before arriving at the university. Our dataset therefore contains all existing subjects. However, a high school student cannot follow all these subjects. For example, a high school student who has specialized in a literary field will not have any marks in the final year before university in science subjects such as mathematics or physics and chemistry. Thus, when we consider for our study a university department of studies specializing in computer science, all the columns in our dataset that correspond to non-scientific subjects (literature, Greek, Latin, etc.) will therefore be almost empty.
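
The removal of almost empty columns described in this paragraph could look like the sketch below. It is a hedged illustration: the function name, the 50% fill threshold, and the sample records are assumptions, since the response does not state what cutoff the authors used.

```python
def drop_sparse_columns(rows, min_fill_ratio=0.5):
    """Keep only columns filled in for at least `min_fill_ratio` of the rows.

    `rows` is a list of dicts mapping a subject name to a grade; None (or a
    missing key) means the student did not take that subject.
    """
    if not rows:
        return [], []
    columns = sorted({col for row in rows for col in row})
    kept = [
        col for col in columns
        if sum(row.get(col) is not None for row in rows) / len(rows) >= min_fill_ratio
    ]
    return [{col: row.get(col) for col in kept} for row in rows], kept


# Hypothetical records for a computer science department: few students took Latin.
records = [
    {"mathematics": 14.0, "physics": 12.5, "latin": None},
    {"mathematics": 9.0, "physics": 10.0, "latin": None},
    {"mathematics": 11.5, "physics": None, "latin": 15.0},
    {"mathematics": 16.0, "physics": 13.0, "latin": None},
]
cleaned, kept_columns = drop_sparse_columns(records)
print(kept_columns)  # ['mathematics', 'physics']
```

Here the Latin column is filled for only one student in four and is dropped, while a subject with a single missing grade is kept.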

--> We added the following paragraph:

"More precisely, we calculate those kind of information for each of the subjects followed by the candidates:

Averages: calculation of averages in all subjects for each term of each school year two by two (première and terminale).
Weighted averages: weighted averages that distinguish scientific from non-scientific subjects. A higher coefficient is given depending on whether the calculated average corresponds to a scientific or non-scientific subject.
Candidate average delta: candidate average - class average.
Low average delta: candidate's average - lowest average.
High average delta: highest average - candidate average.
Difference: difference between the highest and the lowest mark in each subject.
Range: highest average - lowest average.
Bonus: (candidate's average - class average) / (highest average - lowest average).

It emerges that the characteristics that have the greatest influence on prediction vary from one training course to another. Nevertheless, low and high delta values stand out as important characteristics in several courses, so they seem to be relevant metrics."
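
The per-subject metrics listed above translate directly into arithmetic. The sketch below follows the formulas as stated in the response. The function and key names are hypothetical, the input values are invented for illustration, and returning 0.0 for the bonus when the highest and lowest averages coincide is an assumption, because the response does not say how that case is handled. The "Difference" metric is omitted because it needs individual marks rather than averages.

```python
def subject_features(candidate_avg, class_avg, lowest_avg, highest_avg):
    """Compute the delta, range, and bonus metrics for one subject."""
    spread = highest_avg - lowest_avg
    return {
        "candidate_delta": candidate_avg - class_avg,
        "low_delta": candidate_avg - lowest_avg,
        "high_delta": highest_avg - candidate_avg,
        "range": spread,
        # Assumption: fall back to 0.0 when every average in the class is identical.
        "bonus": (candidate_avg - class_avg) / spread if spread else 0.0,
    }


# Invented values for one candidate in one subject (grades out of 20).
features = subject_features(
    candidate_avg=13.0, class_avg=11.0, lowest_avg=6.0, highest_avg=16.0
)
print(features["low_delta"], features["high_delta"], features["bonus"])  # 7.0 3.0 0.2
```

The low and high deltas place the candidate relative to the extremes of the class, which may explain why the authors find them informative across several courses.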

Author Response File: Author Response.docx