A Depression Recognition Method Based on the Alteration of Video Temporal Angle Features

Abstract: In recent years, significant progress has been made in auxiliary diagnosis systems for depression. However, most research has focused on combining features from multiple modalities to enhance classification accuracy. This approach increases space-time overhead and introduces feature synchronization problems. To address this issue, this paper presents a single-modal framework for detecting depression based on changes in facial expressions. First, we propose a robust method for extracting angle features from facial landmarks. Theoretical evidence is provided to demonstrate the translation and rotation invariance of these features. Additionally, we introduce a flip correction method to mitigate angle deviations caused by head flips. The proposed method not only preserves the spatial topological relationships of the facial landmarks but also maintains the temporal correlation of the landmarks between consecutive frames. Finally, the GhostNet network is employed for depression detection, and the effectiveness of various modal data is compared. In the depression binary classification task on the DAIC-WOZ dataset, our framework significantly improves classification performance, achieving an F1 score of 0.80 for depression detection. Experimental results demonstrate that our method outperforms other existing single-modality depression detection models.


Introduction
Depression is a mental disorder characterized by a persistent low mood, typically accompanied by diminished self-esteem, impaired concentration, excessive guilt or self-deprecation, a sense of hopelessness regarding the future, sleep disturbances, and feelings of fatigue or reduced energy [1]. Furthermore, depression exacerbates the risk of suicide among affected individuals, imposing a substantial burden and exerting significant societal impacts [2]. A more concerning issue is the ongoing increase in the prevalence of this disease, attributed to various factors, including societal pressures [3].
At present, the diagnosis of depression primarily relies on clinical interviews conducted by doctors and utilizes standardized rating scales, such as the Hamilton Depression Rating Scale (HDRS) [4]. Additionally, self-assessment tools like the Self-Rating Depression Scale (SDS) [5], the Patient Health Questionnaire (PHQ) [6], and the Beck Depression Inventory (BDI) [7] are commonly employed. However, these methods can be influenced by external factors, including the clinician's expertise and the patient's tendency to conceal symptoms. Consequently, an accurate automatic diagnosis system [8][9][10] can offer patients effective daily prevention and auxiliary assessment, enabling the early detection of depression and reducing medical costs and societal impact [11].
After the Audio-Visual Emotion Challenge and Workshop (AVEC) [12], numerous scholars embarked on studying the utilization of text, audio, and visual modalities for depression prediction. According to related studies, facial expressions, vocal features, and semantic features play distinct roles in conveying emotional information [13][14][15]. Facial expressions have shown high reliability and consistency, accounting for 55% of the emotional information. Voice characteristics convey 38% of emotional information, but they are susceptible to interference from distractions such as noise, accents, and background music. In comparison, semantic features, which are emotional expressions based on text analysis, convey only 7% of the emotional information. Due to the influence of language habits, spoken language, abbreviations, and other factors, the processing of semantic features is relatively complex. Considering these factors, in the context of depression recognition algorithms, feature extraction and classification based on video modality data prove to be highly effective. This approach holds the potential for further promotion and application in medical equipment.
In this paper, we present a framework based on visual temporal modality for depression detection. We extract features from patients' faces in spatial and temporal dimensions using facial landmarks. To address the issue of recognition capability in single-modal depression detection models, we design a feature extraction method that exhibits a strong robustness and low dimensionality. Finally, we utilize the DAIC-WOZ [12] dataset to construct and validate the efficacy of our proposed approach.
The main contributions of this paper are as follows: (1) This paper introduces a novel feature extraction method based on facial angle. The method produces facial features that possess translation invariance and rotation invariance. Additionally, it counteracts the interference caused by patients' head and limb movements through flipping correction, resulting in highly robust extracted features.
(2) A new model is developed by combining GhostNet with multi-layer perceptron (MLP) modules and video process headers. The model is fine-tuned with respect to the features extracted using the proposed method in this paper. This novel model offers a valuable addition to the depression classification task. (3) Extensive experiments are conducted on the DAIC-WOZ dataset to validate the feasibility of the proposed method. The results obtained outperform those of similar methods, establishing the efficacy of the proposed approach.

Related Works
Facial expressions and facial movements have a significant impact on nonverbal communication [16][17][18][19]. They reflect an individual's mental and emotional state and are less influenced by personal subjectivity. In particular, individuals with depression exhibit distinct facial expressions and movements, typically characterized by frowning, drooping eyes, downturned corners of the mouth, and diminished smiles [20,21]. Analytical modelling based on facial expressions and movements proves to be highly effective in the field of depression recognition.
In the study of facial features, many scholars utilize the Facial Action Coding System (FACS) as the foundation for constructing features. Cohn et al. [22] employed the Active Appearance Model (AAM) to establish 68 facial landmark points, which were subsequently utilized in the classification process. Features such as distance, angle, and area between these landmark points were extracted for analysis. McIntyre et al. [23] defined facial action units (AUs) as distinct regions and examined the facial expressions of patients with depression using both original AUs and the defined AU regions. They obtained facial activity patterns based on these analyses. Hamm et al. [24] developed an automated FACS system that calculated the frequency of individual AUs and combined AUs in videos. This system was applied to analyze facial expressions in depressed patients.
Yang et al. [25] utilized a hidden Markov information fusion method to extract the cross-correlation of AU features in a time information sequence, demonstrating its advantages and effectiveness in detecting emotional disorders. Gupta et al. [26] chose structural features, including head movement, eyes opening and closing, the average distance between the upper and lower eyelids, and the distance from the mouth to specific landmarks.

Nasir et al. [27] examined facial landmarks as a time series, investigating displacement, velocity, and acceleration as motion-related features, and attained the highest performance indicators in the AVEC2016 competition.
Pampouchidou et al. [17] encoded facial landmark features into grayscale images, generating the Landmark Motion History Image (LMHI) of facial landmarks. White pixels represent the most recent motion, while the darkest grey represents the earliest motion. Grayscale image extraction was performed using Local Binary Pattern (LBP) and Histogram of Oriented Gradients (HOG) features. Wang et al. [28] proposed Sampling Sliced Long Short-Term Memory Multiple Instance Learning (SS-LSTM-MIL), a multi-instance learning method that utilizes two-dimensional facial key points as input. This approach employs LSTM and pooling techniques for detecting facial depression in specific video segments. Sun et al. [29] presented the Robust Adversarial Adaptation (SRADDA) method, which addresses domain adaptation across varying scales, label spaces, and depression severity assessments in deep evaluation networks.
The previous studies demonstrate that these facial features rely on a human clinical understanding of depression. These features are derived by quantifying geometric properties like distance, angle, and area to analyze the facial changes exhibited by individuals with depression. Dynamic information regarding features is represented through the calculation of speed and acceleration for each motion frame. Nevertheless, this method of feature extraction leads to the loss of both the topological structure information of the face and additional features that go beyond clinical understanding.
This paper presents a framework for depression detection based on visual temporal modality. Facial landmarks are utilized to extract features from patients' faces in both spatial and temporal dimensions. A feature extraction method is then designed to classify depression, addressing the limitation of single-modal depression detection models. The method exhibits strong robustness and low dimensionality.

Proposed Framework
The framework comprises three main components: feature extraction and training unit acquisition, the GhostNet neural classification network, and aggregation classification, as shown in Figure 1. Each of these components is introduced in detail in Sections 3 and 4.

This article establishes two concepts regarding angles for the ease of subsequent discussion:
Definition 1. Target angle: in the target plane, the angle between two straight lines, called "angle" for short.
Definition 2. Target image angle: in the image plane, the angle between two straight lines, called "image angle" for short.

Facial Landmark Detection
Facial landmark detection is a computer vision technique [30] that automatically detects the locations of specific points (e.g., eyes, nose, mouth) in facial images. These specific points are commonly referred to as facial landmarks or face landmarks. This paper utilizes the facial landmark detection method proposed by Tadas et al. [29,30]. The method establishes 68 landmark points on the human face, comprising 17 points for face shape, 10 points for the eyebrows, 12 points for the nose, and 18 points for the mouth. These 68 landmark points provide a more intuitive outline of the human face model. Figure 2 illustrates the effect of facial landmark detection.
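In DAIC-WOZ, per-frame facial landmarks are distributed as precomputed feature files rather than raw video. A minimal sketch of loading such landmarks into (x, y) pairs, assuming hypothetical column names x0..x67 and y0..y67 (match these to the actual file header):

```python
import csv
import io

def parse_landmarks(row, n=68):
    # Collect n (x, y) landmark pairs from one per-frame feature row.
    # Column names x0..x67 / y0..y67 are an assumption, not the
    # dataset's guaranteed header.
    return [(float(row[f"x{i}"]), float(row[f"y{i}"])) for i in range(n)]

# Tiny synthetic example with only 3 of the 68 points for brevity.
sample = "x0,x1,x2,y0,y1,y2\n10,20,30,5,6,7"
row = next(csv.DictReader(io.StringIO(sample)))
pts = parse_landmarks(row, n=3)  # [(10.0, 5.0), (20.0, 6.0), (30.0, 7.0)]
```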

Facial Landmark Detection
Facial landmark detection is a computer vision technique [30] that automatically detects the locations of specific points (e.g., eyes, nose, mouth) in facial images. These specific points are commonly referred to as facial landmarks or face landmarks. This paper utilizes the facial landmark detection method proposed by Tadas et al. [29,30]. The method establishes 68 landmark points on the human face, comprising 17 points for face shape, 10 points for eyelashes, 12 points for the nose, and 18 points for the mouth. These 68 landmark points provide a more intuitive outline of the human face model. Figure 2 illustrates the impact of facial landmark detection.

Angle Selection Method
The fundamental principle of feature selection is to accurately capture the angle of expression change. This paper selects the following angles as the features representing facial expression changes, based on this principle.
The face is analyzed using the key point detection method described in Section 3.1, resulting in the extraction of 68 marker point positions. The 68 landmarks are assigned numbers ranging from 1 to 68. The numbering sequence is provided in Table 1.
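Each angle feature is the vertex angle formed by a triple of numbered landmarks. A minimal sketch of the underlying computation (the helper names ang and mean mirror the paper's notation, but the implementation itself is ours):

```python
import math

def ang(p1, v, p2):
    # Vertex angle Ang(p1, v, p2) at point v, in degrees.
    ax, ay = p1[0] - v[0], p1[1] - v[1]
    bx, by = p2[0] - v[0], p2[1] - v[1]
    cos_t = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def mean(points):
    # Centroid of several landmarks, e.g. Le = mean of points 37~42.
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

right_angle = ang((1.0, 0.0), (0.0, 0.0), (0.0, 1.0))  # 90.0
```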

In feature A18, Le = mean(37~42) represents the centroid position of the left eye, while Re = mean(43~48) represents the centroid position of the right eye. Here, mean(·) denotes the function for calculating the average value.
Each feature carries a distinct significance and can partially depict the variations in facial expressions. For instance, when individuals smile, their eyes contract into a line, while they widen when experiencing fear. The selected features A5 to A8 in this study are capable of reflecting these changes in facial expressions.

Robustness Analysis of Angle Features
Facial expressions primarily manifest through variations in angle information. For instance, a smile is characterized by an upward movement of the corners of the mouth, while intense crying involves a downward movement of the corners of the mouth. In this paper, we extract features A11 to A14 to precisely capture these expression changes. Human head movement can be decomposed into six components: horizontal translation (x-axis), vertical translation (y-axis), depth translation (z-axis), yaw rotation (turning the head left/right), pitch rotation (tilting the head up/down), and roll rotation (tilting the head toward the left or right shoulder). It is important to investigate whether these six actions affect our angle features and whether the features extracted in this study effectively withstand the influence of these actions, thus demonstrating their robustness. The subsequent sections present an analysis and discussion of the robustness of these features.

Rotation Invariance
The target undergoes rotation along the z-axis, as depicted in Figure 3. Patients frequently tilt their heads to the left and right during the recording of the interrogation video. Consequently, the tilting of the patient's head to the left and right can be considered facial rotation.

Theorem 1.
Let ∆ABC be a target on the target surface π, and let ∆A′B′C′ be its corresponding image after camera imaging. When the target rotates along the z-axis, the image angles ∠A′B′C′, ∠B′A′C′, and ∠B′C′A′ remain unchanged. This property is referred to as the rotation invariance theorem.
Proof of Theorem 1. The rotation of the target around the z-axis is equivalent to the rotation of the camera around its optical axis. Figure 3a shows the imaging model when the camera rotates around the optical axis, and Figure 3b shows the imaging model when the target rotates around the z-axis. Establish the camera space coordinate system OXYZ as shown in Figure 3a, and let A1B1 be the image of the target AB in the image plane π1. When the camera rotates while shooting, although the image A1B1 of the target AB changes position in image coordinates, its length |A1B1| does not change, as shown in the inset of Figure 3a. Thus, the following conclusion can be drawn: during target rotation, the length of the image obtained by shooting the target obliquely at any angle does not change.
Let the image of the target ∆ABC be ∆A′B′C′; when the target ∆ABC rotates around the z-axis, a new target ∆A1B1C1 is formed, and correspondingly there is a new image ∆A′1B′1C′1 on the image plane. According to the above conclusion on target rotation around the z-axis, Formula (1) holds, which means ∆A′B′C′ ≅ ∆A′1B′1C′1. Therefore, the image angles ∠A′B′C′, ∠B′A′C′, and ∠B′C′A′ do not change as the target rotates.
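Theorem 1 can also be checked numerically with a pinhole camera model; a small sketch (the focal length and coordinates are arbitrary choices of ours):

```python
import math

def project(p, f=1.0):
    # Pinhole projection of a 3D point onto the image plane.
    x, y, z = p
    return (f * x / z, f * y / z)

def rot_z(p, a):
    # Rotate a 3D point about the z-axis (the optical axis) by angle a.
    x, y, z = p
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)

def vertex_angle(p1, v, p2):
    ax, ay = p1[0] - v[0], p1[1] - v[1]
    bx, by = p2[0] - v[0], p2[1] - v[1]
    return math.acos((ax * bx + ay * by) /
                     (math.hypot(ax, ay) * math.hypot(bx, by)))

A, B, C = (0.3, 0.1, 4.0), (-0.2, 0.4, 5.0), (0.1, -0.3, 6.0)
before = vertex_angle(project(A), project(B), project(C))
after = vertex_angle(*(project(rot_z(p, 0.7)) for p in (A, B, C)))
# before and after agree: the image angle is rotation-invariant.
```

Because rotation about the optical axis leaves every point's depth unchanged, each image point is rotated by the same in-plane rotation, so the image triangle moves rigidly and its angles are preserved.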

Translation Invariance
Translation involves the parallel displacement of a target point along the x-, y-, or z-axis of the coordinate system OXYZ. In the context of a patient's medical examination, it corresponds to bodily movements such as flexion, extension, lateral shifting, and anterior and posterior shifting.

Theorem 2.
Let the image of any ∆ABC on the target surface π after camera imaging be ∆A′B′C′; when the target ∆ABC translates arbitrarily along the x-axis, y-axis, or z-axis, the image angles ∠A′B′C′, ∠B′A′C′, and ∠B′C′A′ do not change with the target's position. Hence, this property is referred to as the translation invariance theorem.
Proof of Theorem 2. Establish the coordinate system shown in Figure 4, set the image of triangle ABC in image plane π1 as ∆A′B′C′, and move triangle ABC to A1B1C1 along the y-axis. According to the principle of pinhole imaging, triangle ABC is similar to triangle A′B′C′; Formula (2) represents this relationship. When triangle ABC moves to A1B1C1 along the y-axis, it is not deformed by the parallel movement, so triangle ABC and triangle A1B1C1 are congruent, giving Formula (3), and hence Formula (4). It can then be concluded that triangle A′B′C′ and triangle A′1B′1C′1 are similar, giving Formula (5). Thus, the image angles ∠A′B′C′, ∠B′A′C′, and ∠B′C′A′ do not change as the target moves along the y-axis. Similarly, when the target moves along the x-axis or z-axis, the image angles do not change either.
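Theorem 2 can likewise be checked numerically. Note that the argument relies on the target triangle lying on the target surface, i.e. in a plane parallel to the image plane (constant depth), which the sketch below assumes:

```python
import math

def project(p, f=1.0):
    # Pinhole projection of a 3D point onto the image plane.
    x, y, z = p
    return (f * x / z, f * y / z)

def vertex_angle(p1, v, p2):
    ax, ay = p1[0] - v[0], p1[1] - v[1]
    bx, by = p2[0] - v[0], p2[1] - v[1]
    return math.acos((ax * bx + ay * by) /
                     (math.hypot(ax, ay) * math.hypot(bx, by)))

# Target triangle at constant depth (plane parallel to the image plane).
tri = [(0.3, 0.1, 4.0), (-0.2, 0.4, 4.0), (0.1, -0.3, 4.0)]
base = vertex_angle(*(project(p) for p in tri))
diffs = []
for shift in [(0.5, 0.0, 0.0), (0.0, 0.7, 0.0), (0.0, 0.0, 2.0)]:
    moved = [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in tri]
    diffs.append(abs(vertex_angle(*(project(p) for p in moved)) - base))
# All diffs are (numerically) zero: x/y shifts move the image rigidly,
# and a z shift scales it uniformly; neither changes the image angles.
```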

Flip Correction
Human head movements commonly involve shaking the head laterally (left/right deflection) as well as tilting it upward and downward (pitch movement). Lateral deflection and pitch movement have a significant impact on the angle features, consequently influencing facial expression recognition. Angle correction is therefore essential to address the changes in angle features caused by target flipping. To achieve an accurate flip correction, it is crucial to understand the relationship between the flip angle, the target angle, and the image angle in the two given scenarios. First, establish the space coordinate system shown in Figure 5, where the point F is the optical center of the camera. Assume that AB is parallel to the y-axis and that the image of AB is A′B′; in the image plane π1, establish the uO′v plane coordinate system, where u and v represent the horizontal and vertical axes, respectively. Obviously, A′B′ and the v axis are parallel. Suppose the intersection point of the target and the main optical axis of the camera is O, and construct a plane π parallel to the image plane π1 through point O; connect FA and extend it to intersect π at A1, and connect FB and extend it to intersect π at B1. A′B′ intersects the u axis at D′; connect D′F and extend it to intersect AB and A1B1 at D and D1, respectively. Connect OD and OD1. In plane XFZ, a straight line parallel to the z-axis passing through point D intersects OD1 at point D2, and a straight line drawn through point D along OD1 intersects the z-axis at D3.
Figure 5. The relationship between the target angle, image angle, and turnover angle during left and right deflections.
Because π is parallel to π1, Formula (6) holds. Set the distance between point O and the optical center of the camera as m, that is, |FO| = m, and set the distance from point O to line segment AB as d, that is, |OD| = d. Suppose the length of line segment AB is c, the length of AD is c1, and the length of BD is c2; then c1 + c2 = c. Let α be the angle between plane OAB and plane π1. Since DD2 ⊥ OD1, Formula (7) follows. Because ∆DFD3 ∼ ∆D1FO, Formula (10) holds in ∆D1OA1, and in the same way Formula (12) follows. To sum up, the relationship among the angle in the target image (referred to as the image angle) ∠A′O′B′, the target angle ∠AOB, and the deflection angle α is obtained as Formula (13). It follows from Formula (13) that when the target ∆AOB deflects around the axis through point O parallel to the y-axis, its image angle φ does not depend on the distance m between the target point O and the optical center of the camera. The relationship between φ, θ, and α is shown in Figure 6.
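Although Formula (13) itself is not reproduced in this excerpt, its two stated consequences, independence from m and an image angle at least as large as the target angle, can be checked numerically for the Figure 5 configuration (vertex O on the optical axis, AB parallel to the y-axis; the parameterization below is ours):

```python
import math

def image_angle(c1, c2, d, m, alpha, f=1.0):
    # Image angle at O' for vertex O on the optical axis at depth m,
    # with OD = d perpendicular to AB, |AD| = c1, |BD| = c2, and the
    # target plane deflected by alpha about the axis through O
    # parallel to the y-axis.
    dirs = []
    for y in (c1, -c2):
        X, Y, Z = d * math.cos(alpha), y, m + d * math.sin(alpha)
        dirs.append((f * X / Z, f * Y / Z))  # O' sits at the image origin
    (ax, ay), (bx, by) = dirs
    return math.acos((ax * bx + ay * by) /
                     (math.hypot(ax, ay) * math.hypot(bx, by)))

theta = image_angle(1.0, 1.5, 2.0, 5.0, 0.0)      # alpha = 0: target angle
phi_near = image_angle(1.0, 1.5, 2.0, 5.0, 0.5)   # deflected, m = 5
phi_far = image_angle(1.0, 1.5, 2.0, 50.0, 0.5)   # deflected, m = 50
# phi_near == phi_far (no dependence on m), and both exceed theta.
```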

It can be determined from Figure 6 that when the face is not deflected, that is, α = 0, the target angle θ and the image angle φ are equal in value. When the face is deflected, α ≠ 0: the target angle θ is no longer equal to the image angle φ, and the image angle φ is always greater than the target angle θ. The flip action thus disturbs the angle features. To mitigate this interference, the approach is to capture a frontal shot of the face, extract reference values from specific angles that are unaffected by facial expression changes, and subsequently apply corrections to the other face frames using Formula (13). Detailed information on obtaining a frontal face photo can be found in Section 3.5.1.

When the Target Is Flipped Vertically, There Exists a Relationship among the Target Angle, Image Angle, and Flip Angle
The target flips up and down (pitch), which means that the target rotates around the x-axis or a line parallel to the x-axis. When the target flips up and down, the image angle will change.
Due to the symmetry of camera imaging, there is symmetry between left and right deflections and up and down flips; thus, simply swap the x-axis and y-axis in Figure 5 to obtain the relationship between the target angle, image angle, and flip angle when flipping up and down.
Let the flip angle be β; according to the symmetry, the relationship between them can be obtained as Formula (14). In Formula (14), φ and θ have the same meanings as in Formula (13).

Deflection/Flip Correction Method
From Formulas (13) and (14), it can be seen that if the target angle is known in advance, then the deflection/flip angle can be solved. In this paper, A18 is used for flip correction, and either A19 or A20 is used for deflection correction.
For angle A18 (Ang(Le, 31, Re)), point Le is the center point of the left eye, point 31 is the tip of the nose, and point Re is the center point of the right eye. These three points do not move as the expression changes, which means that the angle formed by them does not grow or shrink with facial expressions, but does increase or decrease accordingly with up-and-down flipping. Therefore, flip correction can be performed by means of this single feature.
Suppose there is a frontal image frame I; then let AngI(Le, 31, Re) = θ, where θ is calculated by Formula (15). For another frame I1, AngI1(Le, 31, Re) = φ can likewise be calculated by Formula (15), and the flip angle β of image I1 relative to the reference frame I can then be calculated by Formula (14).
Once the flip angle β is obtained, the error that flipping introduces into some features can be corrected. Denote the angle to be corrected as Ai^UD; then Ai^UD is calculated by Formula (16), where Ai in Formula (16) is the image angle obtained by Formula (15).
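Since Formulas (14)-(16) are not reproduced in this excerpt, the sketch below stands in for them with the symmetric special case of the deflection relation, tan(φ/2) = tan(θ/2)/cos β. This is an assumption of ours, chosen because A18's two arms are near-symmetric about the nose tip:

```python
import math

def flip_angle(theta_ref, phi_obs):
    # Solve the flip angle beta from the reference (frontal) angle theta
    # and the observed image angle phi, assuming the symmetric relation
    # tan(phi/2) = tan(theta/2) / cos(beta).
    return math.acos(math.tan(theta_ref / 2) / math.tan(phi_obs / 2))

def correct_angle(phi_i, beta):
    # Undo the flip-induced widening of another image angle A_i,
    # under the same symmetric assumption.
    return 2.0 * math.atan(math.tan(phi_i / 2) * math.cos(beta))

# Round trip: widen a frontal angle by a known flip, then recover it.
theta = 0.9                                          # frontal A18 (radians)
beta_true = 0.4
phi = 2.0 * math.atan(math.tan(theta / 2) / math.cos(beta_true))
beta_est = flip_angle(theta, phi)                    # ~0.4
corrected = correct_angle(phi, beta_est)             # ~0.9
```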
In the same way, angle A19 or A20 can be used to correct the left and right deflection of the angles that require deflection correction, as given by Formula (17).

Figure 1 illustrates the initial stage of the framework, which involves feature extraction and training unit acquisition. Feature extraction in the first stage is divided into two steps: (1) the selection of a reference frame and (2) the extraction of angle features using difference calculation. The acquisition of training units, i.e., multiple instances of a video, is accomplished using a sliding-window mechanism. The feature extraction process is described below.

Determine the Reference Frame
The reference frame image should have the patient's face plane parallel to the image plane, that is, a frontal photo of the patient's face. Although it may not be possible to find a frame in which the face plane is exactly parallel to the image plane during the patient's consultation, it suffices to identify the frame in the diagnosis video that is closest to this condition and use it as the reference frame image.
From Formulas (13) and (14), it can be known that when the head neither deflects nor flips, that is, α = β = 0, the face plane and the image plane are parallel, and the image angle φ is exactly equal to the target angle θ. When there is deflection or flipping, the image angles A18, A19, and A20 will decrease accordingly. Therefore, only one frame of the video needs to be found such that Formula (18) holds. When a reference frame satisfying Formula (18) is found, use it to calculate all of A1~A20 in Table 1, denote them as Aref,1~Aref,20, and save them. Among them, the three angles A18, A19, and A20 will be used as hyperparameters in the flip correction of the feature extraction process (Figure 7); n is the number of video frames.

Figure 7. The process of feature extraction.
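Since deflection and flipping only shrink A18, A19, and A20, the reference frame search reduces to maximizing their sum over all frames. A minimal sketch (the per-frame data layout is our assumption):

```python
def pick_reference_frame(angles_per_frame):
    # angles_per_frame: one dict per frame mapping feature index -> image
    # angle; the frame with the largest A18 + A19 + A20 is the one
    # closest to a frontal view.
    return max(range(len(angles_per_frame)),
               key=lambda j: sum(angles_per_frame[j][i] for i in (18, 19, 20)))

frames = [
    {18: 31.0, 19: 44.0, 20: 52.0},   # slightly turned
    {18: 35.0, 19: 47.0, 20: 55.0},   # most frontal
    {18: 33.0, 19: 45.0, 20: 53.0},
]
ref = pick_reference_frame(frames)  # 1
```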


Difference Extraction of Angle Features
In order to accurately describe changes in expression and mitigate misjudgment caused by deflection and flip actions, the angle feature extraction scheme aims to minimize their influence. To achieve this objective, Figure 7 illustrates the designed process for angle feature extraction in this paper.
In Figure 7, the "Determine the reference frame" step yields correction parameters, which are subsequently utilized for flip correction in the feature extraction process, as described in step (1) below. The "Feature extraction" step in Figure 7 must execute the following operations: (1) Flip correction: use Formulas (16) and (17) to correct the deflection and flip of the angles of this frame, and obtain Ai = 0.5 × (Ai^UD + Ai^LR), i = 1, 2, ..., 20. (2) Feature calculation: calculate the expression difference between each pair of consecutive frames; the feature Fj,i is then calculated by Formula (19), with i = 1, 2, ..., 20 and j = 1, 2, ..., n − 1, where n is the total number of frames in which expressions can be detected.
Following the above feature extraction process, we obtain the original input data needed for the experiment from the video. The subsequent step involves partitioning the original feature data into training units.
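As a hedged illustration of the feature calculation step, the sketch below computes the per-angle difference between consecutive frames, assuming the feature F_{j,i} is the change of angle i from frame j to frame j+1. The function and variable names are illustrative, not from the paper's implementation.

```python
def angle_differences(frame_angles):
    """Compute F[j][i] = A[j+1][i] - A[j][i] for each pair of
    consecutive frames; frame_angles is an n x k list of corrected
    angles (the paper uses k = 20 angles per frame)."""
    n = len(frame_angles)
    return [
        [frame_angles[j + 1][i] - frame_angles[j][i]
         for i in range(len(frame_angles[j]))]
        for j in range(n - 1)
    ]

# Three frames of three angles, for brevity (the paper uses 20).
A = [[10.0, 20.0, 30.0],
     [12.0, 19.0, 30.0],
     [11.0, 21.0, 33.0]]
F = angle_differences(A)
print(F)  # [[2.0, -1.0, 0.0], [-1.0, 2.0, 3.0]]
```

An n-frame video thus yields n − 1 difference vectors, preserving the temporal ordering of the landmarks.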

Training Unit Acquisition
A training unit refers to the extraction of several instances from a single sample, where each instance serves as a representative of that sample. For instance, in the case of experimental data obtained from a subject without mental illness, the majority of the extracted instances are likely to be non-pathological as well. In this paper, a sliding window is employed to extract multiple instances from each recording, transforming one sample into multiple instances of the same scale. Regarding the size of the sliding window, we specify the parameters in Section 4.3 of the experimental dataset introduction.
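The sliding-window split described above can be sketched as follows; this is a minimal illustration, assuming a fixed window advanced by (window − overlap) frames, with illustrative names not taken from the paper.

```python
def sliding_windows(frames, win, overlap):
    """Split a frame sequence into fixed-size overlapping units;
    each unit becomes one training instance of the same scale."""
    step = win - overlap
    units = []
    start = 0
    while start + win <= len(frames):
        units.append(frames[start:start + win])
        start += step
    return units

# 10 frames, window of 4, overlap of 2 -> units start at 0, 2, 4, 6.
units = sliding_windows(list(range(10)), win=4, overlap=2)
print(len(units))          # 4
print(units[0], units[1])  # [0, 1, 2, 3] [2, 3, 4, 5]
```

All units inherit the label of the source video; the aggregation step later reconciles the per-unit predictions.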

GhostNet
GhostNet [31] is a lightweight neural network with several key features. Firstly, it demonstrates a high efficiency when compared to other neural network structures like ResNet [32] and MobileNetV2 [33]. GhostNet achieves this through faster processing, fewer parameters, and reduced calculations. Secondly, GhostNet utilizes group convolution technology, dividing the convolution kernel into smaller groups. This approach reduces parameter size and calculation requirements for each convolution layer, resulting in an improved operational efficiency. Thirdly, GhostNet incorporates a channel attention mechanism. This involves adding a channel attention module after each convolutional layer to dynamically adjust the weight of each channel. This adjustment enhances the network's learning capability. Lastly, GhostNet introduces a feature reorganization mechanism. This involves rearranging the feature maps generated by different convolutional layers to further enhance training efficiency and network accuracy.

The Ghost Module (GM) [31] in Figure 8 consists of three steps, namely conventional convolution, Ghost generation, and feature map splicing.
(a) First, given the input data X ∈ R^(c×h×w), a convolution operation is used to obtain the intrinsic feature map Y ∈ R^(h′×w′×m), where c is the number of input channels, and h and w are the height and width of the input data:

Y = X ∗ f,

where f ∈ R^(c×k×k×m) is the convolution kernel, k × k is the size of the convolution kernel, and the bias term is omitted.
(b) In order to obtain the required Ghost feature maps, each feature map y_i of Y is used to generate Ghost feature maps with the operation Φ_(i,j):

y_(ij) = Φ_(i,j)(y_i), ∀i = 1, . . . , m, j = 1, . . . , s.

(c) The intrinsic feature map obtained in the first step and the Ghost feature maps obtained in the second step are spliced (identity connection) to obtain the output of the Ghost module.
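The efficiency gain of the Ghost module can be illustrated with a quick parameter count: a standard convolution producing n output maps needs c·k·k·n weights, while the Ghost module produces only m = n/s intrinsic maps with the standard convolution and generates the rest with cheap d×d per-map operations, which is roughly s times cheaper. The figures below are an illustrative arithmetic check, not values from the paper.

```python
def conv_params(c, k, n):
    """Weights of a standard k x k convolution (bias omitted)."""
    return c * k * k * n

def ghost_params(c, k, n, s, d):
    """Weights of a Ghost module: m = n // s intrinsic maps from a
    standard convolution, plus (s - 1) cheap d x d maps generated
    from each intrinsic map."""
    m = n // s
    return c * k * k * m + m * (s - 1) * d * d

# Example: 64 -> 128 channels, 3x3 kernels, s = 2, d = 3.
standard = conv_params(64, 3, 128)      # 64*9*128 = 73728
ghost = ghost_params(64, 3, 128, 2, 3)  # 64*9*64 + 64*1*9 = 37440
print(standard, ghost)  # 73728 37440, i.e. close to the s = 2 saving
```

This is why GhostNet keeps accuracy comparable to ResNet and MobileNetV2 at a fraction of the parameters and FLOPs.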
Inspired by the work presented in [31,[34][35][36][37], we made changes to GhostNet based on the following factors: (1) In our video classification task, the initial input channels vary from traditional images. For images, the initial channel number is 3, representing the RGB channels. However, in this paper, the initial channel number corresponds to the total number of each feature angle designed in Section 3.1. (2) While largely preserving the original GhostNet structure, this section selectively modifies the second part of GhostNet's structure, specifically the GhostNet Bottleneck, with the objective of employing GhostNet as a bottleneck graph to extract the network's backbone. (3) The fourth part of GhostNet, which comprises an FC neural classification network, has been removed and replaced with a 4-class MLP [32] structure for classification purposes. The revised network structure is depicted in Figure 8, and the video processing head is used to shape the unit to fit the GhostNet input.
In addition, we have included Table 2 to facilitate the understanding of the modifications made to GhostNet. The current GhostNet network consists of approximately 2.7 M parameters, and its FLOPs (floating-point operations) are around 1.0 G. The specific parameters of the network structure depicted in Figure 8 are listed in Table 2. Table 2. Parameters of the GhostNet network.

Input    Operator    #Exp    #Out    SE    Stride

In Table 2, the following designations are used: G-bneck refers to the GhostBottleneck module, #exp represents the expansion size, #out signifies the number of output channels, and SE indicates the usage of the SE module.

Aggregate Classification
In the PHQ-8 scale [38], the scoring is as follows: 0-4 points indicate no depression, 5-9 points suggest mild depression, 10-14 points indicate moderate depression, and a score of 15 or higher indicates moderate to severe depression.
The four pseudo-labels (1, 2, 3, and 4) correspond to these PHQ-8 score ranges, denoted as PHQ_s. General depression consultation data provide only a single label for the entire duration of a long video. To transform the depression classification task on long videos into a score classification task on segmented short time series units, a new label must be created for each unit.
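The mapping from PHQ-8 score to pseudo-label can be sketched directly from the ranges listed above; the function name and comments are illustrative, assuming pseudo-labels 1-4 follow the four severity bins in order.

```python
def phq8_label(score):
    """Map a PHQ-8 score (0-24) to a severity pseudo-label 1-4."""
    if score <= 4:
        return 1   # no depression
    if score <= 9:
        return 2   # mild depression
    if score <= 14:
        return 3   # moderate depression
    return 4       # moderate to severe depression

print([phq8_label(s) for s in (0, 4, 5, 9, 10, 14, 15, 24)])
# [1, 1, 2, 2, 3, 3, 4, 4]
```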
Based on depression-related research findings, the short-sequence units corresponding to pseudo-label 4 have been found to be more significant than those corresponding to pseudo-labels 1, 2, and 3 in assessing depression. However, the frequency of facial changes corresponding to label 4 is lower throughout the entire video. Thus, we developed a score aggregation function for adjusting the weights of various pseudo-labels during the aggregation process.
Assuming that video V has l short time series units, the model output label of the ith (0 < i ≤ l) short time series unit is denoted as v_i. All short sequential units can be expressed as V = {v_1, v_2, v_3, ..., v_l}. The score aggregation function then combines the unit labels into an aggregate score ŝ. For the weight parameters ρ_1, ρ_2, ρ_3, and ρ_4, we make selections via the differential evolution algorithm [39]. Finally, we determine whether the corresponding patient suffers from depression by judging whether the aggregated score ŝ is greater than the set threshold.
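Since the exact aggregation formula is not reproduced here, the sketch below shows one plausible reading, assuming each unit's pseudo-label v_i contributes its weight ρ_{v_i} to ŝ, which is then compared against a threshold. The weights and threshold are placeholders, not the values selected by differential evolution.

```python
def aggregate_score(labels, weights):
    """Sum the weight of each unit's pseudo-label (1-4);
    weights[k-1] is the rho_k weight for pseudo-label k."""
    return sum(weights[v - 1] for v in labels)

def is_depressed(labels, weights, threshold):
    """Final decision: aggregate score above the set threshold."""
    return aggregate_score(labels, weights) > threshold

# Placeholder weights rho_1..rho_4; label 4 is weighted most
# heavily, as motivated above, since it is rarer but more telling.
rho = [0.1, 0.2, 0.3, 1.0]
labels = [1, 2, 4, 4, 3]
print(round(aggregate_score(labels, rho), 2))  # 2.6
print(is_depressed(labels, rho, 2.0))          # True
```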

Experimental Dataset
The DAIC-WOZ dataset [12] mainly comprises video, audio, and questionnaire survey data collected to diagnose mental disorders such as depression and anxiety. This dataset was referenced in the depression recognition task of the 2014 and 2016 AVEC competitions. The interviews are conducted through situational dialogues between the virtual interviewer, Ellie, and the subjects to elicit corresponding emotional responses and record the characteristics in real time. According to official standards, the dataset comprises 189 samples, divided into 107 training samples, 35 validation samples, and 47 test samples. Since no official public label indicates whether the subjects in the test set are depressed, this paper employs only the 142 samples from the combined training and validation sets to ensure the accuracy and validity of the experimental results. Moreover, to safeguard privacy, the DAIC-WOZ dataset does not provide the original video data but instead offers the 2D coordinate data of facial landmarks extracted by OpenFace [30]. Therefore, the experiments in this paper directly utilize the provided coordinate data of facial landmarks, eliminating the need for facial landmark extraction steps. The detailed classification of depression or non-depression in the sample data is shown in Table 3. Table 3. This table presents the sample distribution of the DAIC-WOZ dataset [12].

                Non-Depression    Depression
Training set          77              30
Test set              23              12

In this paper, the DAIC-WOZ dataset was utilized, and a sliding window approach was implemented to sample the original data. The window size was set to 120 s, with an overlap size defined as 1/6 of the window size. The original data consisted of video samples at a rate of 30 Hz. During the data cleaning process, a filter was applied to exclude video frames with a face confidence lower than 0.7 from the original video. This filtering process aimed to mitigate the influence of significant face deflection or flipping on the experimental results.
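The sampling parameters above translate directly into frame counts: at 30 Hz, a 120 s window spans 3600 frames, and a 1/6 overlap means consecutive windows share 600 frames. A quick arithmetic check:

```python
FPS = 30              # original video sample rate (Hz)
WINDOW_SECONDS = 120  # sliding window length

window_frames = FPS * WINDOW_SECONDS           # 3600 frames per unit
overlap_frames = window_frames // 6            # 1/6 overlap -> 600 frames
step_frames = window_frames - overlap_frames   # window advances 3000 frames

print(window_frames, overlap_frames, step_frames)  # 3600 600 3000
```

The 3600-frame unit also matches the 60 × 60 × 20 input shape used for the network in the next subsection.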

Evaluation
In this paper, the GhostNet network is chosen as the feature extraction network and modified to suit the input. An MLP [36] is employed for four-class classification. The original video data consist of 20 feature angles across 3600 frames, which are reshaped into a 60 × 60 × 20 input format. The learning rate plays a crucial role in optimizing neural networks. For this experiment, the Adam optimizer [40] is utilized. The initial learning rate is set to 0.001 and reduced to 0.1 times the current rate at the 200th and 400th epochs, respectively. Soft labels are used during training to calculate the loss, mitigating the impact of hard labels on network training. In this paper, the precision rate (P), recall rate (R), and F1 value (F-score) are used as the evaluation indicators of classification performance:

P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2 × P × R / (P + R),

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
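The reported precision and recall figures are consistent with the F1 value via the standard harmonic mean; a quick check with the values reported for our method:

```python
def f1_score(p, r):
    """F1 as the harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

# Precision 0.77 and recall 0.83, as reported for our method.
print(round(f1_score(0.77, 0.83), 2))  # 0.8
```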

Experimental Results Comparison
The methods compared in this section extract one-dimensional or multidimensional modal features from the patient's face, voice, and text data; these extracted features are then used for training and verification on various neural networks. Table 4 provides a comprehensive comparison between our unimodal depression detection algorithm based on facial features and previous studies utilizing different modalities. The designed features demonstrate commendable performance in terms of precision rate, recall rate, and F1 score, achieving values of 0.77, 0.83, and 0.80, respectively. Manoret et al. [41] obtained their result with a mixed model of multiple neural networks, including CNN and LSTM, with a recall rate of 0.91. However, the precision rate of that algorithm is only 0.64, resulting in a final F1 of 0.75, which is 0.05 lower than that of the algorithm proposed in this paper. Additionally, the F1 scores of the audio single-mode algorithms of Othmani [14] and Rejaibi et al. [42] are 0.30 and 0.34 lower, respectively, than the performance achieved by the features designed in this paper.
Text is considered unimodal due to its characteristic of containing precise information. The F1 score of the text single-modal feature proposed by Sun et al. [13] is 0.23 lower compared to this paper. Dinkel et al. [39] and other researchers effectively utilize text features by employing the BGRU neural network for processing text information after performing word embedding (Word2Vec) on the text data, resulting in an F1 value that is 0.04 higher compared to ours. Additionally, when comparing with the multimodal detection algorithms proposed by Arioz [44] and Lam et al. [45], it is evident that the algorithm presented in this paper outperforms them in terms of precision, recall, and F1 scores when utilizing mixed modes.
To objectively evaluate the algorithm proposed in this paper, we compare the video modality used in this paper with other algorithms that also employ the video modality. The comparison results, presented in Table 5, demonstrate that the features proposed in this study yield remarkable improvements in algorithm performance. Wei et al. [47] achieved a precision rate of 0.89 in their three-modal (visual + audio + text) fusion, surpassing the pure visual mode in this study by 0.12. However, their recall rate was adversely affected, leading to an F1 value 0.1 lower than that in this paper. Haque et al. [49], utilizing the Causal Convolutional Network (C-CNN), achieved a recall rate similar to this study in multimodal analysis; however, their F1 value was 0.03 lower than ours. Both Wang [28] and Song [46] utilized visual modality data in their studies. In general, the algorithm proposed in this study excels at extracting features from the pure visual modality, leading to superior outcomes.
In the DAIC-WOZ dataset [12], we are provided with several types of visual modal data, including head pose (Pose), eye annotation direction (Gaze), and 2D face68 key points. Table 6 presents our method alongside the results of Guo et al. [50], who used a time-space expanded convolutional network (TDCN) and a feature attention module (FWA) on different video modalities. Among all the visual modalities, the effectiveness of our method is second only to the approach that utilizes the mixed visual information of facial key points and head pose simultaneously, with an F1 value only 0.01 lower. As shown in Table 6, when using Gaze data alone, they obtained an F1 value of 0.63, whereas our method achieved a 0.17 higher F1 value. For the experiment using Pose data alone, their F1 value was 0.69, which is 0.11 lower than ours. In the case of 2D Landmark data, they obtained an F1 value of 0.72, which is 0.08 lower than our F1 value. When utilizing both 2D Landmark and Gaze data, our method outperformed theirs by 0.07 in F1 value. While using 2D Landmark and Pose simultaneously, our method achieved an F1 value only 0.01 lower than theirs, but the gap between recall rate and precision rate for our method is only 0.06, whereas theirs shows a gap of 0.18. These results indicate that although our method has a slightly lower F1 value, it exhibits greater robustness and a more balanced precision-recall trade-off.
Our comparative analysis reveals that the features we designed exhibit a superior expressive capability. Additionally, this verification highlights the enhanced robustness of our specifically designed angular features compared to general visual unimodal data.
To determine the optimal window size for capturing angle features effectively, we performed three sets of ablation experiments, as presented in Table 7. Three indicators are considered for determining the window size. The optimal result is achieved when using a sliding window of 120 s. Therefore, we selected a sliding window size of 120 s for the experiment. In order to assess the model optimization [51][52][53], this paper utilizes the ROC scatter plot [54]. As depicted in Figure 9, a total of five models were tested on the DAIC-WOZ [12] dataset: (1) the experiment using the native structure of GhostNet with a sliding window of 3600, referred to as GhostNet; (2) the experiment using the native structure of GhostNet combined with MLP at a sliding window of 3600, referred to as GhostNet+MLP; and (3) the modified GhostNet plus MLP with sliding windows of 5400, 3600, and 1800, referred to as GhostNet+Win1, GhostNet+Best, and GhostNet+Win2, respectively.
According to the nature of the ROC scatter diagram, a model located in the upper-left quadrant is considered better, and the larger the area of the circle representing the model, the higher its optimization level. Based on this, the model with a sliding window of 3600, namely GhostNet+Best, performs the best and serves as the final experimental result of this paper. Both the GhostNet+Win2 and GhostNet+Win1 models are inferior to the GhostNet+MLP model due to the sliding window being set too small and too large, respectively.

Conclusions
This paper introduces facial angle features as a novel replacement for the original visual modality data. A theoretical proof is provided to demonstrate that these angle features exhibit translation invariance in the front-back, left-right, and up-down directions, as well as rotation invariance. This illustrates the effectiveness of the proposed method in countering translation and rotation attacks on the head. To address head-up and head-down flips, as well as left and right deflection attacks, this paper presents a corresponding countermeasure scheme that effectively mitigates these attacks. This further demonstrates the robustness of the proposed features.
For classifier selection, this paper presents a GhostNet deep neural network classifier. The use of hard labels (0, 1) for loss calculations during training is abandoned in favor of soft labels. Three sets of comparative experiments are conducted to evaluate the proposed method's detection performance against other single modalities. The experimental results demonstrate the superiority of the proposed method. Furthermore, compared to other multimodal methods, the proposed method outperforms them in certain detection indicators, highlighting its superior feature extraction performance. Additionally, three sets of ablation experiments are performed to investigate the impact of sliding window size. The results verify that a window size of 120 s yields the best effect, achieving an F1 score of 0.80, which surpasses most other single-mode or mixed-mode models.
In our future work, we plan to accomplish the following three tasks: (1) construct a larger dataset for depression; (2) investigate more robust feature angles to enhance the performance of our handcrafted features; and (3) explore the fusion of angle features from multiple modalities, including textual speech, with our visual modal data.