Article
Peer-Review Record

A Depression Recognition Method Based on the Alteration of Video Temporal Angle Features

Appl. Sci. 2023, 13(16), 9230; https://doi.org/10.3390/app13169230
by Zhiqiang Ding 1,2, Yahong Hu 1,2, Runhui Jing 1,2, Weiguo Sheng 3 and Jiafa Mao 1,2,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4:
Submission received: 17 July 2023 / Revised: 9 August 2023 / Accepted: 11 August 2023 / Published: 14 August 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

The manuscript contributes to depression recognition using machine learning, which is an interesting topic. The writing and presentation of the manuscript are good.

The authors can provide a separate contribution section or paragraph for clearer understanding. An overall block diagram is missing, which would help readers get a good overview of the manuscript at the beginning.

In the performance comparison section, it is observed that most of the state-of-the-art techniques perform well either in precision or in recall, but not in both. Can the authors provide a different analysis or results to understand this better? For example, I suggest they include discrete ROC graphs for a better understanding of the comparison. For discrete ROC, the authors can follow this paper: https://doi.org/10.1016/j.dsp.2022.103763

English is OK; minor proofreading is required.

Author Response

Response to Reviewer 1 Comments

 

 

Point 1: The authors can provide a separate contribution section or paragraph for clearer understanding. An overall block diagram is missing, which would help readers get a good overview of the manuscript at the beginning.

 

Response 1:

Thank you for reviewing our paper and providing valuable feedback. We greatly appreciate your attention to our research work.

Based on your suggestions, we have made the following revisions to enhance the clarity and logical structure of the paper.

In Point 1, two comments were provided, which have been carefully addressed in the revised version of this article.

(1). As recommended, a dedicated paragraph highlighting the contributions of this study has been incorporated into the introduction, precisely placed on page 2. The contents of this new paragraph are outlined below:

“ The main contributions of this paper are as follows:

(a) This paper introduces a novel feature extraction method based on facial angles. The method produces facial features that possess translation invariance and rotation invariance. Additionally, it counteracts the interference caused by patients' head and limb movements through flipping correction, resulting in highly robust extracted features.

(b) A new model is developed by combining GhostNet with multi-layer perceptron (MLP) modules and a video processing head. The model is fine-tuned with respect to the features extracted using the proposed method in this paper. This novel model offers a valuable addition to the depression classification task.

(c) Extensive experiments are conducted on the DAIC-WOZ dataset to validate the feasibility of the proposed method. The results obtained outperform those of similar methods, establishing the efficacy of the proposed approach. ”

(2). As per the provided comments, we have included an overall block diagram to enhance the visual representation of the proposed framework. This figure is numbered as Figure 1 and is positioned on the third page of the article, immediately following Section 3 - Proposed Framework. The contents of Figure 1 are as follows:

 

Figure 1. The framework of the proposed approach and its three phases.

“ The framework comprises three main components: feature extraction and training unit acquisition, the GhostNet neural classification network, and aggregation classification. These components are introduced in detail in Sections 3 and 4. ”

As suggested, we have made several modest yet important modifications in different sections of the article to improve readers' understanding of the experimental content through the framework diagram. The key points of these enhancements are summarized below for your review:

“ a). 3.5.3. Training unit acquisition

A training unit refers to the extraction of several instances from a single piece of data, where each instance serves as a representative of that data. For instance, in the case of experimental data obtained from a subject without mental illness, the majority of instances are likely to be non-mental-illness instances. In this paper, a sliding window is employed, enabling the extraction of multiple instances at a larger scale; this technique transforms one sample into multiple instances of the same size. The size of the sliding window is specified in Section 4.3, where the experimental dataset is introduced.

b). The video processing head is used to shape the unit to fit the GhostNet input. ”
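To illustrate the sliding-window idea, the following minimal NumPy sketch splits a video's angle-feature sequence into equally sized training units and assigns each unit the video's pseudo label. The frame rate, window size, feature count, and stride used below are hypothetical placeholders rather than the paper's exact parameters (those are given in Section 4.3).

import numpy as np

def extract_training_units(features, window_size, stride=None):
    # Split a (num_frames, num_angles) feature sequence into fixed-size units.
    stride = stride or window_size                    # non-overlapping windows by default
    units = [features[s:s + window_size]
             for s in range(0, features.shape[0] - window_size + 1, stride)]
    return np.stack(units)

# Hypothetical example: 10 minutes of 30 fps angle features, 120 s windows
video_features = np.random.rand(30 * 600, 20)         # (frames, feature angles)
units = extract_training_units(video_features, window_size=30 * 120)
unit_labels = np.full(len(units), 1)                  # each unit inherits the video's pseudo label
print(units.shape, unit_labels.shape)                 # (5, 3600, 20) (5,)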

The framework is composed of three phases: Phase 1 is responsible for extracting features from the video, dividing the extracted features into training units, and assigning pseudo labels to each unit; Phase 2 is responsible for training and testing each data unit using the designed network structure; Phase 3 is responsible for aggregating and classifying the output labels of all data units of a video, finally obtaining the classification result of the current video.

 

Point 2: In the performance comparison section, it is observed that most of the state-of-the-art techniques perform well either in precision or in recall, but not in both. Can the authors provide a different analysis or results to understand this better? For example, I suggest they include discrete ROC graphs for a better understanding of the comparison. For discrete ROC, the authors can follow this paper: https://doi.org/10.1016/j.dsp.2022.103763

 

Response 2: 

In response to your comments, we have included a ROC scatter plot in the article to illustrate the advantages and disadvantages of our model. The figure, labeled as Figure 9, can be found on page 16.

We acknowledge that existing research on depression identification primarily focuses on comparing metrics such as F1 score, recall, accuracy, RMSE, and MAE. Consequently, it is challenging to directly compare our model's ROC performance with other papers in the field. To address this limitation, we conducted comprehensive experiments within our study and based our ROC comparisons on the results obtained from these experiments.

The relevant explanation of the ROC diagram and the corresponding reference are as follows:

 

 

 

Figure 9. ROC scatter plot for the different models.

“ In order to assess the model optimization, this paper utilizes the ROC scatter plot [51]. As depicted in Figure 9, a total of 5 models were tested on the DAIC-WOZ [12] dataset: (1) the experiment using the native structure of GhostNet with a sliding window of 3600, referred to as GhostNet; (2) the experiment using the native structure of GhostNet combined with MLP at a sliding window of 3600, referred to as GhostNet+MLP; (3) the modified GhostNet plus MLP with sliding windows of 5400, 3600, and 1800, referred to as GhostNet+Win1, GhostNet+Best, and GhostNet+Win2, respectively.

According to the nature of the ROC scatter diagram, a model located in the upper-left quadrant is considered better, and the larger the area of the circle representing the model, the higher its optimization level. Based on this, the model with a sliding window of 3600, namely GhostNet+Best, performs the best and serves as the final experimental result of this paper. Both the GhostNet+Win2 and GhostNet+Win1 models are inferior to the GhostNet+MLP model due to the sliding window being set too small and too large, respectively. ”

References

[51] Sahoo S P, Modalavalasa S, Ari S. DISNet: A sequential learning framework to handle occlusion in human action recognition with video acquisition sensors[J]. Digital Signal Processing, 2022, 131: 103763. “
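For reference, here is a minimal matplotlib sketch of the kind of discrete ROC scatter plot used above. The (false positive rate, true positive rate) coordinates and marker sizes below are hypothetical placeholders, not the paper's measured operating points, which come from the confusion matrices of the five experiments.

import matplotlib.pyplot as plt

# Hypothetical (FPR, TPR) operating points; the real values come from the
# confusion matrices of the five experiments compared in Figure 9.
models = {
    "GhostNet":      (0.40, 0.62),
    "GhostNet+MLP":  (0.30, 0.72),
    "GhostNet+Win1": (0.35, 0.68),
    "GhostNet+Best": (0.22, 0.83),
    "GhostNet+Win2": (0.38, 0.70),
}

fig, ax = plt.subplots()
for name, (fpr, tpr) in models.items():
    ax.scatter(fpr, tpr, s=300 * tpr, alpha=0.6)       # marker area as a rough optimization cue
    ax.annotate(name, (fpr, tpr), textcoords="offset points", xytext=(5, 5))
ax.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance line
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate (recall)")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.show()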

 

Dear Reviewer:

Thank you for dedicating your time to reviewing our manuscript. We deeply appreciate your thoughtful consideration and valuable feedback, and we apologize for any confusion arising from inadequacies in explaining the content of the article. Your insightful comments on its shortcomings have been instrumental in driving significant improvements to our work. We have addressed all your concerns through thorough revisions, and the details of these modifications are outlined in the current document. We sincerely hope that you find the revised version more comprehensive and suitable for publication. Your approval is of great importance to us and to our academic work. Thank you once again for your invaluable support and guidance.

Best regards.

Author Response File: Author Response.pdf

Reviewer 2 Report

1. Any chance to put some photos in Section 3.5.1 or 3.5.2 to make the given concept easier to understand? Otherwise, too many words make it complicated.

2. "The algorithm proposed by Manoret et al. [36] achieves a recall rate of 0.91, but its accuracy rate is low at 0.64": I think not accuracy but precision.

3. What are the methods used in Tables 3 and 4? Are they also based on neural networks, or different kinds? A little explanation would be appreciated.

4. Any chance to compare training times? Maybe we can see the complexity of the algorithms with that.

5. Is a statistical comparison of the algorithms possible?

6. There are two Table 5s.

7. Should "F" in Table 6 be "F1"?

8. What is the difference between the 3 different modalities? In your opinion, which one or ones are better for the problem?

9. Please further explain Table 5. Did you use all those different vision effects on the same network and dataset? Is it the same methodology you proposed for Tables 3 and 4?

Author Response

Response to Reviewer 2 Comments

Point 1: Any chance to put some photos in Section 3.5.1 or 3.5.2 to make the given concept easier to understand? Otherwise, too many words make it complicated.

 

Response 1:

To facilitate your understanding of the feature extraction process in Section 3.5.2, we have included two photos of Scarlett, one from the front and another from the side. Suppose both of these photos are part of the dataset of a patient with depression, taken at the same time. In the case of photo (b), when extracting a certain angle between the key points of the target face, it becomes evident that the extracted angle will differ from the corresponding angle of the face in photo (a), primarily due to the tilted head captured in photo (b).

To address this issue arising from head movement, we employ the techniques described in Section 3.4. These content-related approaches are expected to mitigate the interference caused by head movement during the feature extraction process, thereby enhancing the accuracy and reliability of our analysis.

 

 

(a) Full-face photo; (b) profile face photo.

From picture (a) we can obtain some fixed angles between certain points on the face, which should not change with facial expression; these fixed angles are treated as hyperparameters. We then find the angles of the corresponding points in picture (b) and obtain the deflection angle through formula (13), that is, how much deflection the head movement introduces from picture (a) to picture (b). As shown in Table 1, all angles defined on picture (b) are then corrected by the obtained deflection angle. In theory, the feature angles we obtain from photo (b) are then the angles that would be measured if the face were photographed from the front.

 

Formula (18) in Section 3.5.1 assists in locating a correctly oriented face photo, as shown in picture (a).

In addition, the references to formulas (19) and (20) in Section 3.5.2, which do not exist, might have caused confusion. This issue resulted from an editing error on our part. We have now rectified the situation, redirecting formula (19) to formula (16) and formula (20) to formula (17). The specific corrections made are outlined below:

Figure 7. The process of feature extraction.

In Figure 7, the "Determine the reference frame" step yields correction parameters that are subsequently used for flip correction during feature extraction, as described in step (1) below. The "Feature extraction" step in Figure 7 then performs the following operations:

(1) Flip correction: use formula (16) and formula (17) to correct the deflection and flip of the angles of this frame.

(2) Feature calculation: compute the expression difference between each pair of consecutive frames; the feature is then calculated by formula (19), where n is the total number of frames in which expressions can be detected.

Following the above feature extraction process, we obtain the original input data needed for the experiment from the video. The subsequent step involves partitioning the original feature data into training units.
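As a rough illustration of the angle-based idea (not the paper's exact formulas (13) and (16)-(19)), the hedged NumPy sketch below computes the angle at a landmark triple from dot products, subtracts a deflection estimate obtained from a "fixed" reference angle, and averages frame-to-frame differences over the frames with detected faces. The landmark indices and the reference angle are hypothetical.

import numpy as np

def angle_at(p_center, p_a, p_b):
    # Angle (radians) at p_center formed by points p_a and p_b.
    v1, v2 = p_a - p_center, p_b - p_center
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def corrected_angles(landmarks, triples, ref_triple, ref_angle):
    # The deviation of a "fixed" reference angle from its expected value is used
    # as a crude deflection estimate and subtracted from all measured angles.
    deflection = angle_at(*landmarks[list(ref_triple)]) - ref_angle
    return np.array([angle_at(*landmarks[list(t)]) for t in triples]) - deflection

# Hypothetical usage: 68x2 landmark arrays per frame and a few illustrative triples
frames = [np.random.rand(68, 2) for _ in range(300)]
triples = [(30, 36, 45), (48, 31, 35), (8, 48, 54)]         # placeholder landmark triples
ref_triple, ref_angle = (27, 39, 42), 0.9                   # placeholder "fixed" angle
per_frame = np.array([corrected_angles(f, triples, ref_triple, ref_angle) for f in frames])
feature = np.abs(np.diff(per_frame, axis=0)).mean(axis=0)   # mean frame-to-frame change per angle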

Point 2: "The algorithm proposed by Manoret et al. [36] achieves a recall rate of 0.91, but its accuracy rate is low at 0.64": I think not accuracy but precision.

 

Response 2: 

We have identified an error in our previous statement: the correct term is "precision," not "accuracy." We sincerely appreciate your help in detecting this mistake. The error has been rectified in the article.

 

Point 3: What are the methods used in Tables 3 and 4? Are they also based on neural networks, or different kinds? A little explanation would be appreciated.

 

Response 3: 

For articles utilizing speech modes as input, two main types of data are commonly employed: MFCC features and raw audio files. For those employing text modal data as input, typically for NLP tasks, Word Embeddings in combination with RNN networks are widely used for accomplishing depression classification tasks. In the case of video modes, original modal features are predominantly utilized as inputs.

In Tables 3 and 4, the majority of papers leverage neural networks for both feature extraction and classification, with the exception of Yang [38], who employed a random forest as the final classifier.

 

Point 4: Any chance to compare training times? Maybe we can see the complexity of the algorithms with that.

 

Response 4: 

Based on the functions provided by the third-party Python package "thop", we have obtained the number of parameters and computation amount of our GhostNet model (excluding MLP), which are approximately 2.7 million parameters and 0.1 billion FLOPs, respectively.
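For reference, the following is a minimal sketch of how such counts can be obtained with thop; the stand-in model and the input shape are placeholders rather than the paper's exact configuration, and thop's profile returns multiply-accumulate and parameter counts.

import torch
from thop import profile   # pip install thop

# Placeholder model standing in for the GhostNet backbone used in the paper.
model = torch.nn.Sequential(
    torch.nn.Conv2d(20, 84, kernel_size=3, stride=2, padding=1),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(84, 2),
)
dummy_input = torch.randn(1, 20, 64, 64)           # (batch, feature channels, H, W), assumed shape
macs, params = profile(model, inputs=(dummy_input,))
print(f"params: {params / 1e6:.2f} M, MACs: {macs / 1e9:.2f} G")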

However, we regret to inform you that we do not currently possess sufficient experimental data for comparison. The reason behind this limitation is that the majority of studies in depression recognition primarily focus on evaluating recognition effects, such as precision, recall rate, F1 score, RMSE, and MAE. Additionally, the training time of the model can be influenced by various factors, including the training environment, input parameters, and equipment conditions. As a result, the training time may not hold as much significance as the recognition performance in the context of depression recognition. Consequently, there is a lack of relevant papers that consider training time as a primary indicator.

Nonetheless, we deeply understand the importance of the issue you raised and fully agree that future research should consider the algorithm's running speed or execution time. To address this concern, we have incorporated a description of the number of model parameters and computation amount on page 12 of the paper.

Thank you for your valuable suggestion, and we strive to improve our research to align with your recommendations in future endeavors.

 

Point 5: Is a statistical comparison of the algorithms possible?

 

Response 5:

Unfortunately, we acknowledge the absence of concepts and data related to statistical comparison in relevant papers on depression identification, which has led to its exclusion as an element in our experiments. Moreover, our own understanding of statistical comparison is currently limited. Therefore, in the short term, we are unable to incorporate statistical comparison into our research.

Even if we attempt to supplement the related concepts in subsequent experiments, the number of papers available for comparison is limited. Replicating other researchers' experiments would present significant challenges, as many papers do not disclose their code, and reproducing their results from the original papers alone is difficult due to variations in training equipment, code hyperparameters, and model workloads.

Conducting experiments for the statistical comparison of various algorithms would require a separate effort, as it involves considerable workload and is beyond the scope of this paper's current objectives.

In light of these limitations, we have included an ROC graph in the hope of partially addressing the gap in statistical comparison. The ROC graph provides valuable insights into the model's performance, especially in distinguishing between depression and non-depression instances.

The relevant explanation of the ROC diagram and the corresponding reference are as follows:

 

 

 

Figure 9. ROC scatter plot for the different models.

“ In order to assess the model optimization, this paper utilizes the ROC scatter plot [51]. As depicted in Figure 9, a total of 5 models were tested on the DAIC-WOZ [12] dataset: (1) the experiment using the native structure of GhostNet with a sliding window of 3600, referred to as GhostNet; (2) the experiment using the native structure of GhostNet combined with MLP at a sliding window of 3600, referred to as GhostNet+MLP; (3) the modified GhostNet plus MLP with sliding windows of 5400, 3600, and 1800, referred to as GhostNet+Win1, GhostNet+Best, and GhostNet+Win2, respectively.

According to the nature of the ROC scatter diagram, a model located in the upper-left quadrant is considered better, and the larger the area of the circle representing the model, the higher its optimization level. Based on this, the model with a sliding window of 3600, namely GhostNet+Best, performs the best and serves as the final experimental result of this paper. Both the GhostNet+Win2 and GhostNet+Win1 models are inferior to the GhostNet+MLP model due to the sliding window being set too small and too large, respectively. ”

 

 

Point 6: There are two Table 5s.

 

Response 6:

We express our gratitude for your meticulous examination of our article and for bringing the mistake to our attention. We have promptly corrected the error you pointed out (page 16). We must also acknowledge an editing oversight: the ablation experiment in Table 7 originally contained only two groups of data, whereas it should include three. We have revised Table 7 to include the missing third group, as shown below.

Table 7. Ablation experiment for optimal sliding window.

Window_size    P       R       F1
180s           0.53    0.75    0.62
120s           0.77    0.83    0.80
60s            0.56    0.75    0.64

 

Point 7: Should "F" in Table 6 be "F1"?

 

Response 7:

We appreciate your attention to detail in identifying this issue. We have made the corresponding correction ("F" to "F1") in Table 6.

 

Point 8: What is the difference between the 3 different modalities? In your opinion, which one or ones are better for the problem?

 

Response 8:

While the multi-modal approach can enhance the recognition performance to some extent, it also comes with certain challenges, such as handling large amounts of data, facing high data collection difficulty, dealing with numerous model parameters, enduring long training times, incurring substantial training costs, and encountering difficulties in widespread adoption. Furthermore, synchronizing multimodal data poses a significant issue. In contrast, single-mode data is easier to collect, and the models used for single-mode data are lightweight. Nonetheless, existing single-mode models suffer from low accuracy, which makes enhancing their accuracy a prominent research focus. Among the three modalities, video data exhibits richer data features [15] and is less affected by subjective factors, cultural backgrounds, and racial differences. Consequently, it proves more suitable for the task of depression identification. Therefore, we believe that addressing this problem using video modal data may yield favorable results. We have included a comprehensive explanation of the benefits offered by video modal data in the introduction, specifically on page 2. The modifications are as follows:

“After the Audio-Visual Emotion Challenge and Workshop (AVEC) [12], numerous scholars have embarked on studying the utilization of text, audio, and visual modalities for depression prediction. According to related studies, facial expressions, vocal features, and semantic features play distinct roles in conveying emotional information [13-15]. Facial expressions have shown high reliability and consistency, accounting for 55% of the emotional information. Voice characteristics convey 38% of emotional information, but they are susceptible to interference from distractions such as noise, accents, and background music. In comparison, semantic features, which are emotional expressions based on text analysis, convey only 7% of the emotional information. Due to the influence of language habits, spoken language, abbreviations, and other factors, the processing of semantic features is relatively complex. Considering these factors, in the context of depression recognition algorithms, feature extraction and classification recognition based on video modality data prove to be highly effective. This approach holds the potential for further promotion and application in medical equipment.”

 

Point 9: Please further explain Table 5. Did you use all those different vision effects on the same network and dataset? Is it the same methodology you proposed for Tables 3 and 4?

 

Response 9:

The experiments presented in Table 5 were conducted on the same dataset, but not all of them were executed using the same network; rather, different facial feature extraction methods were employed. The outcomes of the experiments substantiate the superiority of the feature extraction method proposed in this paper. It is worth noting that the methods utilized in Table 5 are consistent with those applied in Tables 3 and 4.

In response to your comments, we have further elucidated Table 5 in the article. The modifications made are as follows:

“ In the DAIC-WOZ dataset [12], several types of visual modal data are provided, including head posture (Pose), gaze direction (Gaze), and the 2D coordinates of 68 facial key points. Table 5 presents our results alongside those of Guo et al. [50], who used a temporal dilated convolutional network (TDCN) with a feature-wise attention (FWA) module on the different visual modalities. Among all the visual modalities, the performance of our method is second only to the configuration that uses facial key points and head pose together, with an F1 value only 0.01 lower.

Table 5. Comparison of the effects of the vision itself.

Feature                 M     P       R       F1
Gaze                    V     0.85    0.50    0.63
Pose                    V     0.72    0.66    0.69
2D Landmarks            V     0.69    0.75    0.72
2D Landmarks+Pose       V     0.73    0.91    0.81
2D Landmarks+Gaze       V     0.61    0.91    0.73
Ours (Angle)            V     0.77    0.83    0.80

 

As shown in Table 5, when using Gaze data alone, they obtained an F1 value of 0.63, whereas our method achieved an F1 value 0.17 higher. For the experiment using Pose data alone, their F1 value was 0.69, which is 0.11 lower than ours. In the case of 2D Landmark data, they obtained an F1 value of 0.72, which is 0.08 lower than ours. When utilizing both 2D Landmark and Gaze data, our method outperformed theirs by 0.07 in F1 value. When using 2D Landmark and Pose data simultaneously, our method achieved an F1 value only 0.01 lower than theirs, but the difference between the recall rate and precision rate of our method is only 0.06, whereas theirs shows a difference of 0.18. These results indicate that although our method is slightly lower in F1 value, it exhibits greater robustness. ”

 

 

Dear Reviewer:

We sincerely appreciate the time you dedicated to reviewing our paper and providing your valuable feedback. We apologize for any confusion arising from the inadequate explanation of the article's content and value your comments on its shortcomings. In response, we have made significant revisions to the article and have provided detailed elaborations of these revisions in the current document. We sincerely hope that you find the revised version suitable for publication. Your approval is of great importance to us and to our academic work. Thank you once again for your invaluable support and guidance.

Best regards.

Author Response File: Author Response.pdf

Reviewer 3 Report

1. A detailed description of Figure 6 is necessary to understand it properly.

2. The abstract doesn't contain the number of classes in the dataset. The authors must mention it.

3. It is suggested to highlight the contributions in bullet points.

4. The practical usage of the proposed model is mentioned neither in the literature review nor in the introduction.

5. The literature is not sufficient to cover the said area. It is suggested to refer to the following works for more clarity:

https://doi.org/10.3389/fpubh.2022.860396

https://doi.org/10.1145/3241056

https://doi.org/10.1166/jmihi.2017.2187

 


 

Author Response

Response to Reviewer 3 Comments

 

Point 1: The detailed description of figure 6 is necessary to understand it properly. 

 

Response 1:

The references to formulas (19) and (20) in Section 3.5.2, which do not exist, might have caused confusion. This issue resulted from an editing error on our part. We have now rectified the situation, redirecting formula (19) to formula (16) and formula (20) to formula (17). The specific corrections made are outlined below:

Figure 7. The process of feature extraction.

In Figure 7, the "Determine the reference frame" step yields correction parameters that are subsequently used for flip correction during feature extraction, as described in step (1) below. The "Feature extraction" step in Figure 7 then performs the following operations:

(1) Flip correction: use formula (16) and formula (17) to correct the deflection and flip of the angles of this frame.

(2) Feature calculation: compute the expression difference between each pair of consecutive frames; the feature is then calculated by formula (19), where n is the total number of frames in which expressions can be detected.

For better comprehension of the feature extraction process in Section 3.5.2, two photos of Scarlett, one from the front and one from the side, have been provided. Let us assume that both photos are present in the dataset of a patient with depression. When extracting a certain angle between the key points of the target face from photo (b), it becomes evident that the extracted angle will differ from the corresponding angle of the face in photo (a), owing to the tilted head in photo (b). This interference resulting from head movement is minimized through the techniques outlined in Section 3.4.

 

(a) Full-face photo; (b) profile face photo.

From picture (a), fixed angles between specific points on the face can be determined, which remain unchanged despite facial expressions; these fixed angles are referred to as hyperparameters. By identifying the corresponding points in picture (b) and utilizing formula (13), the deflection angle caused by head movement from picture (a) to picture (b) can be calculated; this deflection angle reflects the extent of head movement interference. As shown in Table 1, all angles defined on picture (b) are then corrected by the obtained deflection angle. In theory, the feature angles we obtain from photo (b) are then the angles that would be measured if the face were photographed from the front.

 

Formula (18) in Section 3.5.1 assists in locating a correctly oriented face photo, as shown in picture (a).

All that remains is to perform deflection correction by using formula (16) and formula (17) in conjunction with the hyperparameters.

 

Point 2: The abstract doesn't contain the number of classes in the dataset. The authors must mention it.

 

Response 2: 

We have addressed the issue highlighted regarding the abstract and made the following change:

In the depression binary classification task of DAIC-WOZ dataset, our proposed framework significantly improves the classification performance, achieving an F1 value of 0.80 for depression detection. Experimental results demonstrate that our method outperforms other existing depression detection models based on a single modality.

 

 

Point 3: It is suggested to highlight the contribution in bullet points.

 

Response 3: 

We extend our gratitude for your valuable advice. As per your suggestion, we have included a dedicated contribution paragraph in the introduction, found on page 2. The newly added contents are as follows:

“ The main contributions of this paper are as follows:

(1) This paper introduces a novel feature extraction method based on facial angles. The method produces facial features that possess translation invariance and rotation invariance. Additionally, it counteracts the interference caused by patients' head and limb movements through flipping correction, resulting in highly robust extracted features.

(2) A new model is developed by combining GhostNet with multi-layer perceptron (MLP) modules and a video processing head. The model is fine-tuned with respect to the features extracted using the proposed method in this paper. This novel model offers a valuable addition to the depression classification task.

(3) Extensive experiments are conducted on the DAIC-WOZ dataset to validate the feasibility of the proposed method. The results obtained outperform those of similar methods, establishing the efficacy of the proposed approach. ”

 

Point 4: Practical usage of proposed model is neither mentioned in the literature nor in the introduction.

 

Response 4:

 We sincerely appreciate your valuable advice. The depression identification model developed by our team can serve as a valuable tool for the preliminary screening of depression patients in public health settings. Conventional depression screening methods heavily rely on questionnaire assessments and expert consultations, which require substantial human and material resources. In contrast, the recognition model proposed in this paper can perform this task automatically. Subsequently, we have included a discussion of the model's practical applications in the introduction section. The specific details are as follows:

After the Audio-Visual Emotion Challenge and Workshop (AVEC) [12], numerous scholars have embarked on studying the utilization of text, audio, and visual modalities for depression prediction. According to related studies, facial expressions, vocal features, and semantic features play distinct roles in conveying emotional information [13-15]. Facial expressions have shown high reliability and consistency, accounting for 55% of the emotional information. Voice characteristics convey 38% of emotional information, but they are susceptible to interference from distractions such as noise, accents, and background music. In comparison, semantic features, which are emotional expressions based on text analysis, convey only 7% of the emotional information. Due to the influence of language habits, spoken language, abbreviations, and other factors, the processing of semantic features is relatively complex. Considering these factors, in the context of depression recognition algorithms, feature extraction and classification recognition based on video modality data prove to be highly effective. This approach holds the potential for further promotion and application in medical equipment.

In this paper, we present a framework based on visual temporal modality for depression detection. We extract features from patients' faces in spatial and temporal dimensions using facial landmarks. To address the issue of recognition capability in single-modal depression detection models, we design a feature extraction method that exhibits strong robustness and low dimensionality. Finally, we utilize the DAIC-WOZ [12] dataset to construct and validate the efficacy of our proposed approach.

The main contributions of this paper are as follows:

(1) In this paper, a feature extraction method based on facial angles is proposed. The facial features extracted by this method have translation invariance and rotation invariance, and they can resist interference from the patient's head and limb movements through flipping correction, giving the extracted features good robustness.

(2) A new model is designed based on GhostNet, which combines multi-layer perceptron (MLP) modules and a video processing head and is fine-tuned in terms of model parameters according to the features extracted by the method in this paper. It provides a new model for the depression classification task.

(3) Sufficient experiments are carried out on the DAIC-WOZ dataset to prove the feasibility of the proposed method, and the best results among similar methods are obtained.

 

 

Point 5: The literature is not sufficient to cover the said area. It is suggested to refer to the following works for more clarity: https://doi.org/10.3389/fpubh.2022.860396;

https://doi.org/10.1145/3241056;

https://doi.org/10.1166/jmihi.2017.2187

 

Response 5: 

We express our sincere gratitude for your valuable comments, and in response, we have appropriately cited the literature you provided. The specific details are as follows:

 

[8] Rashid J, Batool S, Kim J, et al. An augmented artificial intelligence approach for chronic diseases prediction[J]. Frontiers in Public Health, 2022, 10: 860396.

[37] Hossain M S, Amin S U, Alsulaiman M, et al. Applying deep learning for epilepsy seizure detection and brain mapping visualization[J]. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2019, 15(1s): 1-17.

[52] Saeed S, Shaikh A, Memon M A, et al. Assessment of brain tumor due to the usage of MATLAB performance[J]. Journal of Medical Imaging and Health Informatics, 2017, 7(6): 1454-1460.

 

Author Response File: Author Response.pdf

Reviewer 4 Report

I have read the full paper, and though the topic is really interesting, I am not satisfied with the paper's presentation and findings. I have added some concerns that must be addressed before resubmitting the paper.

1. The dataset is not described well. The authors need to provide proper information regarding the dataset.

2. The GhostNet architecture is not well presented; provide equations step by step to understand the model architecture.

3. The overall working flow of this paper needs to be presented properly.

4. Model tuning needs to address parameters and requires an ablation study of the model.

5. Add some of the latest references to your paper:

https://www.mdpi.com/2077-1312/11/2/426

https://www.sciencedirect.com/science/article/pii/S2772662223000851

https://www.techscience.com/CMES/v131n3/47389/html

https://link.springer.com/article/10.1007/s11334-022-00523-w

Need to fix writing issues properly.

Author Response

Response to Reviewer 4 Comments

 

Point 1: The dataset is not described well. The authors need to provide proper information regarding the dataset.

 

Response 1:

In response to your valuable suggestions, we have provided a more detailed description of the dataset in the introduction section of the article. The specific description can be found on page 13 as follows:

The DAIC-WOZ dataset [12] mainly comprises video, audio, and questionnaire data collected to diagnose mental disorders such as depression and anxiety. This dataset was used in the depression recognition tasks of the 2014 and 2016 AVEC competitions. The interviews are conducted through situational dialogues between the virtual interviewer, Ellie, and the subjects to elicit corresponding emotional responses and record the characteristics in real time. According to the official split, the dataset releases 189 samples, divided into a training set of 107 samples, a validation set of 35 samples, and a test set of 47 samples. Since no official public label indicates whether the subjects in the test set are depressed, this paper employs only the 142 samples from the official training and validation sets, which serve as the training and test sets in our experiments, to ensure the accuracy and validity of the results. Moreover, to safeguard privacy, the DAIC-WOZ dataset does not provide the original video data but instead offers the 2D coordinates of facial landmarks extracted by OpenFace [30]. Therefore, the experiments in this paper directly utilize the provided facial landmark coordinates, eliminating the need for a facial landmark extraction step. The detailed distribution of depressed and non-depressed samples is shown in Table 3.

Table 3. This table presents the sample distribution of the DAIC-WOZ dataset [12].

Datasets         non-depression    depression
Training set     77                30
Test set         23                12
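As a convenience for readers, the following hedged pandas sketch shows how the per-session 2D landmark files could be loaded; the file name pattern and column names (frame, timestamp, confidence, success, x0..x67, y0..y67) are our assumption about the OpenFace/CLNF export in the DAIC-WOZ release and should be adjusted to the actual files.

import numpy as np
import pandas as pd

def load_landmarks(csv_path):
    # Keep only frames where the face tracker reported success, then stack (x, y) pairs.
    df = pd.read_csv(csv_path, skipinitialspace=True)
    df = df[df["success"] == 1]
    xs = df[[f"x{i}" for i in range(68)]].to_numpy()
    ys = df[[f"y{i}" for i in range(68)]].to_numpy()
    return np.stack([xs, ys], axis=-1)              # (num_frames, 68, 2)

landmarks = load_landmarks("300_CLNF_features.txt") # hypothetical session file name
print(landmarks.shape)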

 

Point 2: The GhostNet architecture is not well presented; provide equations step by step to understand the model architecture.

 

Response 2: 

We extend our gratitude for your valuable comments. Since presenting the model structure step by step with equations would occupy considerable space, we opted to present the architecture in tabular form, which offers a clear and concise representation. On page 12 of the article, we have included comprehensive information concerning the parameters of the GhostNet network structure, along with a detailed step-by-step explanation. Selected details from the content are provided below:

“ Inspired by the work presented in [31, 34, 35, 36, 37], we made changes to GhostNet based on the following factors: (1) In our video classification task, the initial input channels differ from those of traditional images. For images, the initial channel number is 3, representing the RGB channels; in this paper, the initial channel number corresponds to the total number of feature angles designed in Section 3.1. (2) While largely preserving the original GhostNet structure, this section selectively modifies the second part of GhostNet, namely the GhostNet bottleneck, so that the GhostNet bottlenecks serve as the backbone of the network. (3) The fourth part of GhostNet, which comprises an FC classification network, has been removed and replaced with a 4-class MLP [32] structure for classification purposes. The revised network structure is depicted in Figure 8. The video processing head is used to shape the unit to fit the GhostNet input.

In addition, we have included Table 2, which presents data to facilitate understanding of the modifications made in GhostNet. The current GhostNet network consists of approximately 2.7M parameters, and the flops (floating-point operations) are around 1.0 G. The specific parameters in the network structure depicted in Figure 8 are listed in Table 2.”

Table 2. Parameters of GhostNet network

Input        Operator       #exp    #out    SE    Stride
64²×20       Conv2d 3×3     -       84      -     2
32²×84       G-bneck        120     84      1     1
32²×84       G-bneck        240     84      0     2
16²×84       G-bneck        200     84      0     1
16²×84       G-bneck        184     84      0     1
16²×84       G-bneck        184     84      0     1
16²×84       G-bneck        480     112     1     1
16²×112      G-bneck        672     112     1     1
16²×112      G-bneck        672     160     1     2
8²×160       G-bneck        960     160     0     1
8²×160       G-bneck        960     160     1     1
8²×160       G-bneck        960     160     0     1
8²×160       G-bneck        960     160     1     1
8²×160       Conv2d 1×1     -       960     -     1
1²×960       AvgPool 8×8    -       -       -     -

Table 2 uses the following designations: G-bneck refers to the Ghost bottleneck module, #exp represents the expansion size, #out signifies the number of output channels, and SE indicates whether the SE module is used.
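To make the modifications concrete, here is a minimal PyTorch sketch of the overall arrangement: a video processing head that reshapes a training unit of angle features into a 2D input, a backbone (a small placeholder standing in for the Ghost bottleneck stack of Table 2), and an MLP classifier replacing GhostNet's original FC layer. The 64×64 folding, channel counts, and MLP sizes are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class VideoProcessingHead(nn.Module):
    # Folds a unit of shape (batch, frames, n_angles) into (batch, n_angles, side, side),
    # loosely matching the 64^2 x 20 input row of Table 2 (assumed layout).
    def __init__(self, side=64):
        super().__init__()
        self.side = side

    def forward(self, x):
        b, t, c = x.shape
        assert t == self.side * self.side, "unit length must equal side*side"
        return x.permute(0, 2, 1).reshape(b, c, self.side, self.side)

class DepressionClassifier(nn.Module):
    def __init__(self, n_angles=20, n_classes=2):
        super().__init__()
        self.head = VideoProcessingHead()
        # Placeholder backbone standing in for the modified Ghost bottlenecks.
        self.backbone = nn.Sequential(
            nn.Conv2d(n_angles, 84, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(84, 960, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # MLP classifier replacing the original FC layer.
        self.mlp = nn.Sequential(nn.Linear(960, 256), nn.ReLU(),
                                 nn.Dropout(0.2), nn.Linear(256, n_classes))

    def forward(self, units):
        return self.mlp(self.backbone(self.head(units)))

logits = DepressionClassifier()(torch.randn(4, 64 * 64, 20))   # 4 units, 20 angle features each
print(logits.shape)                                            # torch.Size([4, 2])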

 

Point 3: The overall working flow of this paper needs to be presented properly.

 

Response 3: 

In response to your comments, we have included an overall working flow diagram. The diagram is numbered as Figure 1 and is positioned on the third page of the article, following section 3. Proposed Framework. The added Figure is presented as follows:

 

Figure 1. Framework of the proposed approach and its three phases.

“ The framework comprises three main components: feature extraction and training unit acquisition, the GhostNet neural classification network, and aggregation classification. These components are introduced in detail in Sections 3 and 4. ”

In addition, we have added some explanatory words in section 3 and Section 4 to facilitate readers to better understand the experimental content of this paper through the frame diagram. Due to the modest modifications in different locations, we have listed some important points below for your review and understanding:

“ a). 3.5.3. Training unit acquisition

A training unit refers to the extraction of several instances from a single piece of data, where each instance serves as a representative of that data. For instance, in the case of experimental data obtained from a subject without mental illness, the majority of instances are likely to be non-mental-illness instances. In this paper, a sliding window is employed, enabling the extraction of multiple instances at a larger scale; this technique transforms one sample into multiple instances of the same size. The size of the sliding window is specified in Section 4.3, where the experimental dataset is introduced.

b). The video processing head is used to shape the unit to fit the GhostNet input. ”

The framework is composed of three phases: Phase 1 is responsible for extracting features from the video, dividing the extracted features into training units, and assigning pseudo labels to each unit; Phase 2 is responsible for training and testing each data unit using the designed network structure; Phase 3 is responsible for aggregating and classifying the output labels of all data units of a video, finally obtaining the classification result of the current video.
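For Phase 3, a simple way to aggregate the per-unit outputs of one video into a single decision is a majority vote over the units' predicted labels; the paper's exact aggregation rule may differ, so the sketch below is an illustrative assumption.

import numpy as np

def aggregate_video_label(unit_logits):
    # unit_logits: (num_units, num_classes) outputs for the units of one video.
    unit_preds = np.argmax(unit_logits, axis=1)
    counts = np.bincount(unit_preds, minlength=unit_logits.shape[1])
    return int(np.argmax(counts))                    # majority-vote video label

# Hypothetical logits for 5 units of one interview (columns: non-depressed, depressed)
logits = np.array([[0.8, 0.2], [0.3, 0.7], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
print(aggregate_video_label(logits))                 # -> 1 (depressed)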

 

Point 4: Model tuning needs to address parameters and requires an ablation study of the model.

 

Response 4: 

Due to an oversight during the editing process, the ablation experiment in Table 7 previously included only two groups of data instead of the intended three.

The tuning parameters of the model primarily involve the sliding window size, the number of epochs, and the learning rate. Since epochs and learning rate are commonly tuned neural network parameters, we did not devote separate ablation experiments to them. However, we have conducted an ablation experiment for the sliding window size, and the results are presented in Table 7.

Table 7. Ablation experiment for optimal sliding window.

Window_size    P       R       F1
180s           0.53    0.75    0.62
120s           0.77    0.83    0.80
60s            0.56    0.75    0.64
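For completeness, the precision, recall, and F1 values in Table 7 can be computed per window setting with scikit-learn as in the sketch below; the labels and predictions shown are hypothetical placeholders, not the experiment's outputs.

from sklearn.metrics import precision_score, recall_score, f1_score

def report(y_true, y_pred, window_name):
    # Precision/recall/F1 for the depressed class (label 1), as reported in Table 7.
    p = precision_score(y_true, y_pred, pos_label=1)
    r = recall_score(y_true, y_pred, pos_label=1)
    f1 = f1_score(y_true, y_pred, pos_label=1)
    print(f"{window_name}: P={p:.2f} R={r:.2f} F1={f1:.2f}")

# Hypothetical video-level labels and predictions for one window setting
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
report(y_true, y_pred, "120s")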

 

 

Point 5: Add some of the latest references to your paper:

https://www.mdpi.com/2077-1312/11/2/426

https://www.sciencedirect.com/science/article/pii/S2772662223000851

https://www.techscience.com/CMES/v131n3/47389/html

https://link.springer.com/article/10.1007/s11334-022-00523-w.

 

Response 5: 

We express our sincere gratitude for your valuable comments. In response, we have appropriately cited the literature you provided. The specific details are as follows:

[34] Mei S, Chen Y, Qin H, et al. A Method Based on Knowledge Distillation for Fish School Stress State Recognition in Intensive Aquaculture[J]. CMES-Computer Modeling in Engineering & Sciences, 2022, 131(3).

[35] Hassan M M, Hassan M M, Yasmin F, et al. A comparative assessment of machine learning algorithms with the Least Absolute Shrinkage and Selection Operator for breast cancer detection and prediction[J]. Decision Analytics Journal, 2023, 7: 100245.

[53] Chen L, Yang Y, Wang Z, et al. Lightweight Underwater Target Detection Algorithm Based on Dynamic Sampling Transformer and Knowledge-Distillation Optimization[J]. Journal of Marine Science and Engineering, 2023, 11(2): 426.

[54] Hassan M M, Zaman S, Mollick S, et al. An efficient Apriori algorithm for frequent pattern in human intoxication data[J]. Innovations in Systems and Software Engineering, 2023, 19(1): 61-69.

 

 

 

Author Response File: Author Response.pdf

Round 2

Reviewer 4 Report

This paper can be accepted, but the equations of the proposed model need to be added step by step. Cite recent works, i.e.:

https://www.thelancet.com/journals/lanplh/article/piis2542-5196(23)00025-6/fulltext

https://www.sciencedirect.com/science/article/pii/S1877050923004532

Proofreading is needed.

Author Response

Response to Reviewer 4 Comments

 

Point 1: This paper can be accepted, but the equations of the proposed model need to be added step by step. Cite recent works, i.e.:

https://www.thelancet.com/journals/lanplh/article/piis2542-5196(23)00025-6/fulltext

https://www.sciencedirect.com/science/article/pii/S1877050923004532

 

Response 1:

Thank you for your advice; the equations of the model have been added to the article as recommended. Because the structure of each layer of the neural network is similar, we present the equations for the Ghost Module (GM) involved in Figure 8; the GhostNet neural network can be viewed as a stack of GM modules. The particular additions are shown below, and the changes in the original text can be seen on page 12.

The Ghost Module (GM) in Figure 8 consists of three steps, namely conventional convolution, Ghost generation, and feature map splicing.

(a). First, given the input data X ∈ R^(c×h×w), a conventional convolution is used to obtain the m intrinsic feature maps Y′:

Y′ = X ∗ f′,

where f′ ∈ R^(c×k×k×m) is the convolution kernel used, k is the size of the convolution kernel, c is the number of input channels, h and w are the height and width of the input data, and the bias term is omitted.

(b). To obtain the required Ghost feature maps, each intrinsic feature map y′_i of Y′ is used to generate Ghost feature maps with the cheap operation Φ, as shown in Figure 9(b):

y_(i,j) = Φ_(i,j)(y′_i),  i = 1, …, m,  j = 1, …, s.

(c). The intrinsic feature maps obtained in the first step and the Ghost feature maps obtained in the second step are spliced together (identity connection) to obtain the output of the Ghost Module, as shown in Figure 9(b).
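A minimal PyTorch sketch of a Ghost Module along these lines (primary convolution, cheap depthwise "ghost" generation, and concatenation) is given below; the ratio and kernel sizes are illustrative defaults, not necessarily the settings used in the paper.

import torch
import torch.nn as nn

class GhostModule(nn.Module):
    # Minimal Ghost Module: primary conv -> cheap depthwise conv -> concatenation.
    def __init__(self, in_ch, out_ch, ratio=2, kernel=1, cheap_kernel=3):
        super().__init__()
        init_ch = out_ch // ratio                     # m intrinsic maps, step (a)
        ghost_ch = out_ch - init_ch                   # maps produced by cheap ops, step (b)
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel, padding=kernel // 2, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, ghost_ch, cheap_kernel, padding=cheap_kernel // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(ghost_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y_prime = self.primary(x)                     # intrinsic feature maps Y'
        y_ghost = self.cheap(y_prime)                 # ghost feature maps
        return torch.cat([y_prime, y_ghost], dim=1)   # step (c): splicing

out = GhostModule(20, 84)(torch.randn(1, 20, 64, 64))
print(out.shape)                                      # torch.Size([1, 84, 64, 64])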

 

Figure 9. Ghost module compared with general convolution operations [31].

In response to your comments, we have also integrated the two references you supplied into our article. The specific material is as follows:

[55] Ava L T, Karim A, Hassan M M, et al. Intelligent Identification of Hate Speeches to address the increased rate of Individual Mental Degeneration[J]. Procedia Computer Science, 2023, 219: 1527-1537.

[56] Nguyen P Y, Astell-Burt T, Rahimi-Ardabili H, et al. Effect of nature prescriptions on cardiometabolic and mental health, and physical activity: a systematic review[J]. The Lancet Planetary Health, 2023, 7(4): e313-e328.

 

 

Figure 8. The network architecture of GhostNet[31].

 

Author Response File: Author Response.docx
