Search Results (36)

Search Parameters:
Keywords = dimensional emotion representation

23 pages, 4228 KiB  
Article
Evaluation on AI-Generative Emotional Design Approach for Urban Vitality Spaces: A LoRA-Driven Framework and Empirical Research
by Ruoshi Zhang, Xiaoqing Tang, Lifang Wu, Yuchen Wang, Xiaojing He and Mengjie Liu
Land 2025, 14(6), 1300; https://doi.org/10.3390/land14061300 - 18 Jun 2025
Viewed by 656
Abstract
Recent advancements in urban vitality space design reflect increasing academic attention to emotional experience dimensions, paralleled by the emergence of AI-based generative technology as a transformative tool for systematically exploring the emotional attachment potential in preliminary designs. To effectively utilize AI-generative design results for spatial vitality creation and evaluation, exploring whether generated spaces respond to people’s emotional demands is necessary. This study establishes a comparative framework analyzing emotional attachment characteristics between LoRA-generated spatial designs and the real urban vitality space, using the representative case of THE BOX in Chaoyang, Beijing. Empirical data were collected through structured on-site surveys with 115 validated participants, enabling a comprehensive emotional attachment evaluation. SPSS 26.0 was employed for multi-dimensional analyses, encompassing aggregate attachment intensity, dimensional differentiation, and correlation mapping. Key findings reveal that while both generative and original spatial representations elicit measurable positive responses, AI-generated designs demonstrate a limited capacity to replicate the authentic three-dimensional experiential qualities inherent to physical environments, particularly regarding structural articulation and material tactility. Furthermore, significant deficiencies persist in the generative design’s cultural semiotic expression and visual-interactive spatial legibility, resulting in diminished user satisfaction. The analysis reveals that LoRA-generated spatial solutions require strategic enhancements in dynamic visual hierarchy, interactive integration, chromatic optimization, and material fidelity to bridge this experiential gap. These insights suggest viable pathways for integrating generative AI methodologies with conventional urban design practices, potentially enabling more sophisticated hybrid approaches that synergize digital innovation with built environment realities to cultivate enriched multisensory spatial experiences. Full article
(This article belongs to the Section Land Planning and Landscape Architecture)

20 pages, 1849 KiB  
Article
Speech Emotion Recognition Model Based on Joint Modeling of Discrete and Dimensional Emotion Representation
by John Lorenzo Bautista and Hyun Soon Shin
Appl. Sci. 2025, 15(2), 623; https://doi.org/10.3390/app15020623 - 10 Jan 2025
Cited by 1 | Viewed by 1702
Abstract
This paper introduces a novel joint model architecture for Speech Emotion Recognition (SER) that integrates both discrete and dimensional emotional representations, allowing for the simultaneous training of classification and regression tasks to improve the comprehensiveness and interpretability of emotion recognition. By employing a joint loss function that combines categorical and regression losses, the model ensures balanced optimization across tasks, with experiments exploring various weighting schemes using a tunable parameter to adjust task importance. Two adaptive weight balancing schemes, Dynamic Weighting and Joint Weighting, further enhance performance by dynamically adjusting task weights based on optimization progress and ensuring balanced emotion representation during backpropagation. The architecture employs parallel feature extraction through independent encoders, designed to capture unique features from multiple modalities, including Mel-frequency Cepstral Coefficients (MFCC), Short-term Features (STF), Mel-spectrograms, and raw audio signals. Additionally, pre-trained models such as Wav2Vec 2.0 and HuBERT are integrated to leverage their robust latent features. The inclusion of self-attention and co-attention mechanisms allows the model to capture relationships between input modalities and interdependencies among features, further improving its interpretability and integration capabilities. Experiments conducted on the IEMOCAP dataset using a leave-one-subject-out approach demonstrate the model’s effectiveness, with results showing a 1–2% accuracy improvement over classification-only models. The optimal configuration, incorporating the joint architecture, dynamic weighting, and parallel processing of multimodal features, achieves a weighted accuracy of 72.66%, an unweighted accuracy of 73.22%, and a mean Concordance Correlation Coefficient (CCC) of 0.3717. These results validate the effectiveness of the proposed joint model architecture and adaptive balancing weight schemes in improving SER performance. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
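The joint objective described in the abstract can be illustrated with a short sketch: a shared feature vector feeds a classification head and a valence–arousal regression head, and the two losses are combined under a tunable weight. The layer sizes, loss choices, and the α value below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class JointSERHead(nn.Module):
    """Shared encoder output -> discrete emotion logits + dimensional (valence/arousal) regression."""
    def __init__(self, feat_dim=256, n_classes=4, n_dims=2):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, n_classes)   # discrete emotion categories
        self.regressor = nn.Linear(feat_dim, n_dims)        # e.g., valence and arousal

    def forward(self, features):
        return self.classifier(features), self.regressor(features)

def joint_loss(logits, preds, labels, targets, alpha=0.5):
    """Weighted sum of classification and regression losses; alpha trades off the two tasks."""
    ce = nn.functional.cross_entropy(logits, labels)
    mse = nn.functional.mse_loss(preds, targets)
    return alpha * ce + (1.0 - alpha) * mse

# toy usage with random pooled utterance features
head = JointSERHead()
feats = torch.randn(8, 256)
logits, dims = head(feats)
loss = joint_loss(logits, dims, torch.randint(0, 4, (8,)), torch.rand(8, 2), alpha=0.7)
loss.backward()
```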

28 pages, 8081 KiB  
Article
PortraitEmotion3D: A Novel Dataset and 3D Emotion Estimation Method for Artistic Portraiture Analysis
by Shao Liu, Sos Agaian and Artyom Grigoryan
Appl. Sci. 2024, 14(23), 11235; https://doi.org/10.3390/app142311235 - 2 Dec 2024
Cited by 1 | Viewed by 1663
Abstract
Facial Expression Recognition (FER) has been widely explored in realistic settings; however, its application to artistic portraiture presents unique challenges due to the stylistic interpretations of artists and the complex interplay of emotions conveyed by both the artist and the subject. This study addresses these challenges through three key contributions. First, we introduce the PortraitEmotion3D (PE3D) dataset, designed explicitly for FER tasks in artistic portraits. This dataset provides a robust foundation for advancing emotion recognition in visual art. Second, we propose an innovative 3D emotion estimation method that leverages three-dimensional labeling to capture the nuanced emotional spectrum depicted in artistic works. This approach surpasses traditional two-dimensional methods by enabling a more comprehensive understanding of the subtle and layered emotions often present in artistic representations. Third, we enhance the feature learning phase by integrating a self-attention module, significantly improving facial feature representation and emotion recognition accuracy in artistic portraits. This advancement addresses the stylistic variations and complexity of this domain, setting a new benchmark for FER in artistic works. Evaluation on the PE3D dataset demonstrates our method’s high accuracy and robustness compared to existing state-of-the-art FER techniques. The integration of our module yields an average accuracy improvement of over 1% in recent FER systems. Additionally, combining our method with ESR-9 achieves a comparable accuracy of 88.3% on the FER+ dataset, demonstrating its generalizability to other FER benchmarks. This research deepens our understanding of emotional expression in art and facilitates potential applications in diverse fields, including human–computer interaction, security, healthcare diagnostics, and the entertainment industry. Full article
(This article belongs to the Special Issue Advanced Digital Signal Processing and Its Applications)
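As a rough illustration of this kind of pipeline, the sketch below runs self-attention over patch-level facial features and regresses a three-dimensional emotion vector. The feature shapes, the valence/arousal/dominance reading of the three dimensions, and the module placement are assumptions for illustration, not the PE3D method.

```python
import torch
import torch.nn as nn

class AttentiveEmotion3D(nn.Module):
    """Self-attention over spatial feature tokens, then a 3-dimensional emotion regression head."""
    def __init__(self, feat_dim=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)
        self.head = nn.Linear(feat_dim, 3)    # e.g., valence, arousal, dominance

    def forward(self, tokens):                # tokens: (batch, n_patches, feat_dim)
        attended, _ = self.attn(tokens, tokens, tokens)
        pooled = self.norm(tokens + attended).mean(dim=1)   # residual connection + average pooling
        return self.head(pooled)

model = AttentiveEmotion3D()
portrait_tokens = torch.randn(4, 49, 512)     # e.g., a 7x7 CNN feature map flattened into tokens
print(model(portrait_tokens).shape)           # torch.Size([4, 3])
```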

15 pages, 12596 KiB  
Article
ARMNet: A Network for Image Dimensional Emotion Prediction Based on Affective Region Extraction and Multi-Channel Fusion
by Jingjing Zhang, Jiaying Sun, Chunxiao Wang, Zui Tao and Fuxiao Zhang
Sensors 2024, 24(21), 7099; https://doi.org/10.3390/s24217099 - 4 Nov 2024
Viewed by 1321
Abstract
Compared with discrete emotion space, image emotion analysis based on dimensional emotion space can more accurately represent fine-grained emotion. Meanwhile, this high-precision representation of emotion requires dimensional emotion prediction methods to sense and capture emotional information in images as accurately and richly as possible. However, the existing methods mainly focus on emotion recognition by extracting the emotional regions where salient objects are located while ignoring the joint influence of objects and background on emotion. Furthermore, in the existing literature, when fusing multi-level features, no consideration has been given to the varying contributions of features from different levels to emotional analysis, which makes it difficult to distinguish valuable and useless features and cannot improve the utilization of effective features. This paper proposes an image emotion prediction network named ARMNet. In ARMNet, a unified affective region extraction method that integrates eye fixation detection and attention detection is proposed to enhance the combined influence of objects and backgrounds. Additionally, the multi-level features are fused with the consideration of their different contributions through an improved channel attention mechanism. In comparison to the existing methods, experiments conducted on the CGnA10766 dataset demonstrate that the performance of valence and arousal, as measured by Mean Squared Error (MSE), Mean Absolute Error (MAE), and Coefficient of Determination (R²), has improved by 4.74%, 3.53%, 3.62%, 1.93%, 6.29%, and 7.23%, respectively. Furthermore, the interpretability of the network is enhanced through the visualization of attention weights corresponding to emotional regions within the images. Full article
(This article belongs to the Special Issue Recent Advances in Smart Mobile Sensing Technology)
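The channel-attention fusion idea can be sketched generically as a squeeze-and-excitation-style gate applied to each feature level before concatenation; the tensor shapes and layer sizes below are assumptions, not the ARMNet code.

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Squeeze-and-excitation-style gate: reweight channels before multi-level fusion."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                         # x: (batch, channels, H, W)
        w = self.fc(x.mean(dim=(2, 3)))           # global average pooling -> per-channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)

# gate each of three feature levels, then concatenate and regress valence/arousal
gates = nn.ModuleList([ChannelGate(256) for _ in range(3)])
levels = [torch.randn(2, 256, 14, 14) for _ in range(3)]     # equal shapes assumed for simplicity
fused = torch.cat([g(f) for g, f in zip(gates, levels)], dim=1)
regressor = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(768, 2))
valence_arousal = regressor(fused)
```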

22 pages, 1714 KiB  
Article
Cardiometabolic Morbidity (Obesity and Hypertension) in PTSD: A Preliminary Investigation of the Validity of Two Structures of the Impact of Event Scale-Revised
by Amira Mohammed Ali, Saeed A. Al-Dossary, Carlos Laranjeira, Maha Atout, Haitham Khatatbeh, Abeer Selim, Abdulmajeed A. Alkhamees, Musheer A. Aljaberi, Annamária Pakai and Tariq Al-Dwaikat
J. Clin. Med. 2024, 13(20), 6045; https://doi.org/10.3390/jcm13206045 - 10 Oct 2024
Cited by 4 | Viewed by 2072
Abstract
Background: Posttraumatic stress disorder (PTSD) and/or specific PTSD symptoms may evoke maladaptive behaviors (e.g., compulsive buying, disordered eating, and an unhealthy lifestyle), resulting in adverse cardiometabolic events (e.g., hypertension and obesity), which may implicate the treatment of this complex condition. The diagnostic criteria for PTSD have lately expanded beyond the three common symptoms (intrusion, avoidance, and hyperarousal). Including additional symptoms such as emotional numbing, sleep disturbance, and irritability strengthens the representation of the Impact of Event Scale-Revised (IES-R), suggesting that models with four, five, or six dimensions better capture its structure compared to the original three-dimensional model. Methods: Using a convenience sample of 58 Russian dental healthcare workers (HCWs: mean age = 44.1 ± 12.2 years, 82.8% females), this instrumental study examined the convergent, concurrent, and criterion validity of two IES-R structures: IES-R3 and IES-R6. Results: Exploratory factor analysis uncovered five factors, which explained 76.0% of the variance in the IES-R. Subscales of the IES-R3 and the IES-R6 expressed good internal consistency (coefficient alpha range = 0.69–0.88), high convergent validity (item total correlations r range = 0.39–0.81, and correlations with the IES-R’s total score r range = 0.62–0.92), excellent concurrent validity through strong correlations with the PTSD Symptom Scale-Self Report (PSS-SR: r range = 0.42–0.69), while their criterion validity was indicated by moderate-to-low correlations with high body mass index (BMI: r range = 0.12–0.39) and the diagnosis of hypertension (r range = 0.12–0.30). In the receiver-operating characteristic (ROC) curve analysis, all IES-R models were perfectly associated with the PSS-SR (all areas under the curve (AUCs) > 0.9, p values < 0.001). The IES-R, both hyperarousal subscales, and the IES-R3 intrusion subscale were significantly associated with high BMI. Both avoidance subscales and the IES-R3 intrusion subscale, not the IES-R, were significantly associated with hypertension. In the two-step cluster analysis, five sets of all trauma variables (IES-R3/IES-R6, PSS-SR) classified the participants into two clusters according to their BMI (normal weight/low BMI vs. overweight/obese). Meanwhile, only the IES-R, PSS-SR, and IES-R3 dimensions successfully classified participants as having either normal blood pressure or hypertension. Participants in the overweight/obese and hypertensive clusters displayed considerably higher levels of most trauma symptoms. Input variables with the highest predictor importance in the cluster analysis were those variables expressing significant associations in correlations and ROC analyses. However, neither IES-R3 nor IES-R6 contributed to BMI or hypertension either directly or indirectly in the path analysis. Meanwhile, age significantly predicted both health conditions and current smoking. Irritability and numbing were the only IES-R dimensions that significantly contributed to current smoking. Conclusions: The findings emphasize the need for assessing the way through which various PTSD symptoms may implicate cardiometabolic dysfunctions and their risk factors (e.g., smoking and the intake of unhealthy foods) as well as the application of targeted dietary and exercise interventions to lower physical morbidity in PTSD patients. However, the internal and external validity of our tests may be questionable due to the low power of our sample size. 
Replicating the study in larger samples that comprise different physical and mental conditions from heterogeneous cultural contexts is pivotal to validating the results (e.g., in specific groups, such as those with confirmed traumatic exposure and comorbid mood dysfunction). Full article
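For readers who want to reproduce this style of psychometric analysis on their own data (not the data above), a minimal Python sketch with pandas and scikit-learn is shown below; the column names and synthetic values are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Internal consistency of a subscale (one column per item)."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

# placeholder data frame: subscale items plus a binary criterion variable
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 5, size=(58, 8)),
                  columns=[f"intrusion_{i}" for i in range(1, 9)])
df["hypertension"] = rng.integers(0, 2, size=58)

intrusion = df[[c for c in df.columns if c.startswith("intrusion_")]]
print("Cronbach alpha:", round(cronbach_alpha(intrusion), 2))
print("item-total r:", round(intrusion["intrusion_1"].corr(intrusion.sum(axis=1)), 2))
print("ROC AUC vs. criterion:", round(roc_auc_score(df["hypertension"], intrusion.sum(axis=1)), 2))
```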

18 pages, 13718 KiB  
Article
A Hybrid EEG-Based Stress State Classification Model Using Multi-Domain Transfer Entropy and PCANet
by Yuefang Dong, Lin Xu, Jian Zheng, Dandan Wu, Huanli Li, Yongcong Shao, Guohua Shi and Weiwei Fu
Brain Sci. 2024, 14(6), 595; https://doi.org/10.3390/brainsci14060595 - 12 Jun 2024
Cited by 1 | Viewed by 1576
Abstract
This paper proposes a new hybrid model for classifying stress states using EEG signals, combining multi-domain transfer entropy (TrEn) with a two-dimensional PCANet (2D-PCANet) approach. The aim is to create an automated system for identifying stress levels, which is crucial for early intervention and mental health management. A major challenge in this field lies in extracting meaningful emotional information from the complex patterns observed in EEG. Our model addresses this by initially applying independent component analysis (ICA) to purify the EEG signals, enhancing the clarity for further analysis. We then leverage the adaptability of the fractional Fourier transform (FrFT) to represent the EEG data in time, frequency, and time–frequency domains. This multi-domain representation allows for a more nuanced understanding of the brain’s activity in response to stress. The subsequent stage involves the deployment of a two-layer 2D-PCANet network designed to autonomously distill EEG features associated with stress. These features are then classified by a support vector machine (SVM) to determine the stress state. Moreover, stress induction and data acquisition experiments are designed. We employed two distinct tasks known to trigger stress responses. Other stress-inducing elements that enhance the stress response were included in the experimental design, such as time limits and performance feedback. The EEG data collected from 15 participants were retained. The proposed algorithm achieves an average accuracy of over 92% on this self-collected dataset, enabling stress state detection under different task-induced conditions. Full article
(This article belongs to the Section Cognitive, Social and Affective Neuroscience)
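A drastically simplified stand-in for the pipeline described above is sketched below: an ICA pass over each trial, simple time- and frequency-domain features in place of the FrFT/PCANet stages, and an SVM classifier. The shapes, features, and hyperparameters are assumptions for illustration, not the authors' model.

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def simple_features(trial):
    """Per-channel time-domain statistics plus mean spectral power as a stand-in feature vector."""
    power = np.abs(np.fft.rfft(trial, axis=1)) ** 2
    return np.concatenate([trial.mean(axis=1), trial.std(axis=1), power.mean(axis=1)])

rng = np.random.default_rng(1)
trials = rng.standard_normal((60, 32, 512))      # 60 trials, 32 channels, 512 samples
labels = rng.integers(0, 2, size=60)             # stress vs. non-stress

# ICA stage: in practice artifact components would be zeroed before inverse_transform;
# here decomposition/reconstruction is shown without any component rejection.
ica = FastICA(n_components=32, random_state=0, max_iter=1000)
def ica_pass(trial):                              # trial: (channels, samples)
    sources = ica.fit_transform(trial.T)
    return ica.inverse_transform(sources).T
cleaned = np.stack([ica_pass(t) for t in trials])

X = np.stack([simple_features(t) for t in cleaned])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X[:48], labels[:48])
print("held-out accuracy:", clf.score(X[48:], labels[48:]))
```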

21 pages, 4367 KiB  
Article
Chinese Multicategory Sentiment of E-Commerce Analysis Based on Deep Learning
by Hongchan Li, Jianwen Wang, Yantong Lu, Haodong Zhu and Jiming Ma
Electronics 2023, 12(20), 4259; https://doi.org/10.3390/electronics12204259 - 15 Oct 2023
Cited by 6 | Viewed by 1942
Abstract
With the continuous rise of information technology and social networks, and the explosive growth of network text information, text sentiment analysis technology now plays a vital role in public opinion monitoring and product development analysis on networks. Text data are high-dimensional and complex, and traditional binary classification can only classify sentiment from positive or negative aspects. This does not fully cover the various emotions of users, and, therefore, natural language semantic sentiment analysis has limitations. To solve this deficiency, we propose a new model for analyzing text sentiment that combines deep learning and the bidirectional encoder representation from transformers (BERT) model. We first use an advanced BERT language model to convert the input text into dynamic word vectors; then, we adopt a convolutional neural network (CNN) to obtain the relatively significant partial emotional characteristics of the text. After extraction, we use the bidirectional recurrent neural network (BiGRU) to bidirectionally capture the contextual feature message of the text. Finally, with the MultiHeadAttention mechanism we obtain correlations among the data in different information spaces from different subspaces so that the key information related to emotion in the text can be selectively extracted. The final emotional feature representation obtained is classified using Softmax. Compared with other similar existing methods, our model in this research paper showed a good effect in comparative experiments on an e-commerce text dataset, and the accuracy and F1-score of the classification were significantly improved. Full article
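The layer ordering described (BERT word vectors, then CNN, then BiGRU, then multi-head attention, then softmax) can be sketched roughly as follows; hidden sizes and kernel widths are assumptions, and the BERT encoder is represented by precomputed token embeddings to keep the sketch self-contained.

```python
import torch
import torch.nn as nn

class BertCnnBiGruAttn(nn.Module):
    """Multicategory sentiment head over BERT token embeddings: Conv1d -> BiGRU -> attention -> softmax."""
    def __init__(self, emb_dim=768, n_classes=5, conv_ch=128, gru_hidden=128, heads=4):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, conv_ch, kernel_size=3, padding=1)
        self.bigru = nn.GRU(conv_ch, gru_hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * gru_hidden, heads, batch_first=True)
        self.out = nn.Linear(2 * gru_hidden, n_classes)

    def forward(self, emb):                        # emb: (batch, seq_len, emb_dim) from BERT
        local = torch.relu(self.conv(emb.transpose(1, 2))).transpose(1, 2)   # local emotional cues
        ctx, _ = self.bigru(local)                 # bidirectional context features
        attended, _ = self.attn(ctx, ctx, ctx)     # self-attention over the BiGRU states
        return torch.softmax(self.out(attended.mean(dim=1)), dim=-1)

model = BertCnnBiGruAttn()
token_embeddings = torch.randn(2, 64, 768)         # stand-in for a BERT last_hidden_state
print(model(token_embeddings).shape)               # torch.Size([2, 5])
```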

16 pages, 5093 KiB  
Article
Research on Learning Concentration Recognition with Multi-Modal Features in Virtual Reality Environments
by Renhe Hu, Zihan Hui, Yifan Li and Jueqi Guan
Sustainability 2023, 15(15), 11606; https://doi.org/10.3390/su151511606 - 27 Jul 2023
Cited by 7 | Viewed by 3437
Abstract
Learning concentration, as a crucial factor influencing learning outcomes, provides the basis for learners’ self-regulation and teachers’ instructional adjustments and intervention decisions. However, the current research on learning concentration recognition lacks the integration of cognitive, emotional, and behavioral features, and the integration of interaction and vision data for recognition requires further exploration. The way data are collected in a head-mounted display differs from that in a traditional classroom or online learning. Therefore, it is vital to explore a recognition method for learning concentration based on multi-modal features in VR environments. This study proposes a multi-modal feature integration-based learning concentration recognition method in VR environments. It combines interaction and vision data, including measurements of interactive tests, text, clickstream, pupil facial expressions, and eye gaze data, to measure learners’ concentration in VR environments in terms of cognitive, emotional, and behavioral representation. The experimental results demonstrate that the proposed method, which integrates interaction and vision data to comprehensively represent the cognitive, emotional, and behavioral dimensions of learning concentration, outperforms single-dimensional and single-type recognition results in terms of accuracy. Additionally, it was found that learners with higher concentration levels achieve better learning outcomes, and learners’ perceived sense of immersion is an important factor influencing their concentration. Full article

16 pages, 3386 KiB  
Article
Multi-Level Attention-Based Categorical Emotion Recognition Using Modulation-Filtered Cochleagram
by Zhichao Peng, Wenhua He, Yongwei Li, Yegang Du and Jianwu Dang
Appl. Sci. 2023, 13(11), 6749; https://doi.org/10.3390/app13116749 - 1 Jun 2023
Cited by 3 | Viewed by 1678
Abstract
Speech emotion recognition is a critical component for achieving natural human–robot interaction. The modulation-filtered cochleagram is a feature based on auditory modulation perception, which contains multi-dimensional spectral–temporal modulation representation. In this study, we propose an emotion recognition framework that utilizes a multi-level attention network to extract high-level emotional feature representations from the modulation-filtered cochleagram. Our approach utilizes channel-level attention and spatial-level attention modules to generate emotional saliency maps of channel and spatial feature representations, capturing significant emotional channel and feature space from the 3D convolution feature maps, respectively. Furthermore, we employ a temporal-level attention module to capture significant emotional regions from the concatenated feature sequence of the emotional saliency maps. Our experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset demonstrate that the modulation-filtered cochleagram significantly improves the prediction performance of categorical emotion compared to other evaluated features. Moreover, our emotion recognition framework achieves comparable unweighted accuracy of 71% in categorical emotion recognition by comparing with several existing approaches. In summary, our study demonstrates the effectiveness of the modulation-filtered cochleagram in speech emotion recognition, and our proposed multi-level attention framework provides a promising direction for future research in this field. Full article
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)
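As one piece of such a multi-level scheme, a temporal-attention pooling layer that weights a concatenated frame-level feature sequence might look like the sketch below; the feature dimensions and scoring network are assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Score each time step of a feature sequence and pool with softmax-normalized weights."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, seq):                               # seq: (batch, time, feat_dim)
        weights = torch.softmax(self.score(seq), dim=1)   # higher weight on salient frames
        return (weights * seq).sum(dim=1), weights.squeeze(-1)

pool = TemporalAttentionPool()
frame_feats = torch.randn(4, 150, 256)     # e.g., per-frame features after channel/spatial attention
utterance_vec, frame_weights = pool(frame_feats)
classifier = nn.Linear(256, 4)             # four categorical emotions, as on IEMOCAP
logits = classifier(utterance_vec)
```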

14 pages, 1268 KiB  
Article
Classification of Arabic Poetry Emotions Using Deep Learning
by Sakib Shahriar, Noora Al Roken and Imran Zualkernan
Computers 2023, 12(5), 89; https://doi.org/10.3390/computers12050089 - 22 Apr 2023
Cited by 12 | Viewed by 4128
Abstract
The automatic classification of poems into various categories, such as by author or era, is an interesting problem. However, most current work categorizing Arabic poems into eras or emotions has utilized traditional feature engineering and machine learning approaches. This paper explores deep learning methods to classify Arabic poems into emotional categories. A new labeled poem emotion dataset was developed, containing 9452 poems with emotional labels of joy, sadness, and love. Various deep learning models were trained on this dataset. The results show that traditional deep learning models, such as one-dimensional Convolutional Neural Networks (1DCNN), Gated Recurrent Unit (GRU), and Long Short-Term Memory (LSTM) networks, performed with F1-scores of 0.62, 0.62, and 0.53, respectively. However, the AraBERT model, an Arabic version of the Bidirectional Encoder Representations from Transformers (BERT), performed best, obtaining an accuracy of 76.5% and an F1-score of 0.77. This model outperformed the previous state-of-the-art in this domain. Full article
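A minimal fine-tuning sketch with the Hugging Face transformers library is shown below; the checkpoint name, label mapping, and hyperparameters are assumptions rather than the paper's exact setup.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# assumed checkpoint name; any Arabic BERT-family model with the same interface would work
checkpoint = "aubmindlab/bert-base-arabertv2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

labels = {"joy": 0, "sadness": 1, "love": 2}
poems = ["...verse one...", "...verse two..."]        # placeholder poem texts
batch = tokenizer(poems, padding=True, truncation=True, max_length=128, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=torch.tensor([labels["joy"], labels["sadness"]]))
outputs.loss.backward()                                # one illustrative training step
optimizer.step()
```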

18 pages, 2427 KiB  
Article
A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition
by Zhongwen Tu, Bin Liu, Wei Zhao, Raoxin Yan and Yang Zou
Appl. Sci. 2023, 13(7), 4124; https://doi.org/10.3390/app13074124 - 24 Mar 2023
Cited by 14 | Viewed by 3414
Abstract
The Speech Emotion Recognition (SER) algorithm, which aims to analyze the expressed emotion from a speech, has always been an important topic in speech acoustic tasks. In recent years, the application of deep-learning methods has made great progress in SER. However, the small scale of the emotional speech dataset and the lack of effective emotional feature representation still limit the development of research. In this paper, a novel SER method, combining data augmentation, feature selection and feature fusion, is proposed. First, aiming at the problem that there are inadequate samples in the speech emotion dataset and the number of samples in each category is unbalanced, a speech data augmentation method, Mix-wav, is proposed which is applied to the audio of the same emotion category. Then, on the one hand, a Multi-Head Attention mechanism-based Convolutional Recurrent Neural Network (MHA-CRNN) model is proposed to further extract the spectrum vector from the Log-Mel spectrum. On the other hand, Light Gradient Boosting Machine (LightGBM) is used for feature set selection and feature dimensionality reduction in four emotion global feature sets, and more effective emotion statistical features are extracted for feature fusion with the previously extracted spectrum vector. Experiments are carried out on the public dataset Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Chinese Hierarchical Speech Emotion Dataset of Broadcasting (CHSE-DB). The experiments show that the proposed method achieves 66.44% and 93.47% of the unweighted average test accuracy, respectively. Our research shows that the global feature set after feature selection can supplement the features extracted by a single deep-learning model through feature fusion to achieve better classification accuracy. Full article
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
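Mix-wav is described only at a high level here, so the sketch below shows one plausible reading (randomly weighted mixing of two waveforms from the same emotion class), together with LightGBM importance ranking for feature selection; both are assumptions rather than the authors' exact procedure.

```python
import numpy as np
import lightgbm as lgb

def mix_same_class(wav_a, wav_b, low=0.3, high=0.7, rng=np.random.default_rng()):
    """Blend two waveforms that share an emotion label with a random ratio (one reading of Mix-wav)."""
    n = min(len(wav_a), len(wav_b))
    lam = rng.uniform(low, high)
    return lam * wav_a[:n] + (1.0 - lam) * wav_b[:n]

rng = np.random.default_rng(0)
angry_1, angry_2 = rng.standard_normal(16000), rng.standard_normal(16000)
augmented = mix_same_class(angry_1, angry_2, rng=rng)    # new sample keeps the label "angry"

# feature selection: keep the global statistical features that LightGBM ranks highest
X, y = rng.standard_normal((200, 384)), rng.integers(0, 4, 200)   # placeholder feature set
ranker = lgb.LGBMClassifier(n_estimators=100).fit(X, y)
top_idx = np.argsort(ranker.feature_importances_)[::-1][:88]      # top-88 is an arbitrary cutoff
X_selected = X[:, top_idx]
```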

39 pages, 1368 KiB  
Review
A Survey on High-Dimensional Subspace Clustering
by Wentao Qu, Xianchao Xiu, Huangyue Chen and Lingchen Kong
Mathematics 2023, 11(2), 436; https://doi.org/10.3390/math11020436 - 13 Jan 2023
Cited by 18 | Viewed by 5495
Abstract
With the rapid development of science and technology, high-dimensional data have been widely used in various fields. Due to the complex characteristics of high-dimensional data, it is usually distributed in the union of several low-dimensional subspaces. In the past several decades, subspace clustering (SC) methods have been widely studied as they can restore the underlying subspace of high-dimensional data and perform fast clustering with the help of the data self-expressiveness property. The SC methods aim to construct an affinity matrix by the self-representation coefficient of high-dimensional data and then obtain the clustering results using the spectral clustering method. The key is how to design a self-expressiveness model that can reveal the real subspace structure of data. In this survey, we focus on the development of SC methods in the past two decades and present a new classification criterion to divide them into three categories based on the purpose of clustering, i.e., low-rank sparse SC, local structure preserving SC, and kernel SC. We further divide them into subcategories according to the strategy of constructing the representation coefficient. In addition, the applications of SC methods in face recognition, motion segmentation, handwritten digits recognition, and speech emotion recognition are introduced. Finally, we have discussed several interesting and meaningful future research directions. Full article
(This article belongs to the Special Issue Advances in Machine Learning, Optimization, and Control Applications)
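The basic SC recipe the survey describes (self-representation coefficients, then an affinity matrix, then spectral clustering) can be sketched with scikit-learn as below; this is the generic sparse-subspace-clustering idea on toy data, not any particular method from the survey.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def sparse_self_representation(X, alpha=0.01):
    """Express each column of X as a sparse combination of the other columns (diagonal forced to zero)."""
    n = X.shape[1]
    C = np.zeros((n, n))
    for i in range(n):
        others = np.delete(X, i, axis=1)
        coef = Lasso(alpha=alpha, max_iter=5000).fit(others, X[:, i]).coef_
        C[np.arange(n) != i, i] = coef
    return C

rng = np.random.default_rng(0)
# two toy 2-dimensional subspaces embedded in 30 dimensions, 40 points each (columns are samples)
bases = [rng.standard_normal((30, 2)) for _ in range(2)]
X = np.hstack([B @ rng.standard_normal((2, 40)) for B in bases])

C = sparse_self_representation(X)
affinity = np.abs(C) + np.abs(C).T                                 # symmetrized affinity matrix
labels = SpectralClustering(n_clusters=2, affinity="precomputed").fit_predict(affinity)
```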

18 pages, 3106 KiB  
Article
A Video Sequence Face Expression Recognition Method Based on Squeeze-and-Excitation and 3DPCA Network
by Chang Li, Chenglin Wen and Yiting Qiu
Sensors 2023, 23(2), 823; https://doi.org/10.3390/s23020823 - 11 Jan 2023
Cited by 7 | Viewed by 3162
Abstract
Expression recognition is a very important direction for computers to understand human emotions and human-computer interaction. However, for 3D data such as video sequences, the complex structure of traditional convolutional neural networks, which stretch the input 3D data into vectors, not only leads to a dimensional explosion, but also fails to retain structural information in 3D space, simultaneously leading to an increase in computational cost and a lower accuracy rate of expression recognition. This paper proposes a video sequence face expression recognition method based on Squeeze-and-Excitation and 3DPCA Network (SE-3DPCANet). The introduction of a 3DPCA algorithm in the convolution layer directly constructs tensor convolution kernels to extract the dynamic expression features of video sequences from the spatial and temporal dimensions, without weighting the convolution kernels of adjacent frames by shared weights. Squeeze-and-Excitation Network is introduced in the feature encoding layer, to automatically learn the weights of local channel features in the tensor features, thus increasing the representation capability of the model and further improving recognition accuracy. The proposed method is validated on three video face expression datasets. Comparisons were made with other common expression recognition methods, achieving higher recognition rates while significantly reducing the time required for training. Full article
(This article belongs to the Section Sensing and Imaging)

13 pages, 1249 KiB  
Article
A Dual-Path Cross-Modal Network for Video-Music Retrieval
by Xin Gu, Yinghua Shen and Chaohui Lv
Sensors 2023, 23(2), 805; https://doi.org/10.3390/s23020805 - 10 Jan 2023
Cited by 5 | Viewed by 2861
Abstract
In recent years, with the development of the internet, video has become increasingly widespread in everyday life. Pairing a video with harmonious music is increasingly treated as an artistic task, but selecting music manually takes considerable time and effort, so we propose a method for recommending background music for videos. The emotional message of music is rarely taken into account in current work, yet it is crucial for video–music retrieval. To achieve this, we design two paths that process content information and emotional information across modalities. Based on the characteristics of video and music, we design various feature extraction schemes and common representation spaces. In the content path, a pre-trained network is used for feature extraction; because these features contain some redundant information, we use an encoder–decoder structure for dimensionality reduction, with encoder weights shared to obtain content-sharing features for video and music. In the emotion path, an emotion key-frame scheme is used for video and a channel attention mechanism for music in order to capture emotional information effectively, and an emotion-discrimination loss is added to guarantee that the network acquires this information. More importantly, we propose a way to combine content information with emotional information: content features are first concatenated with sentiment features and then passed through a fused shared space, structured as an MLP, to obtain more effective fused shared features. In addition, a polarity penalty factor is added to the classical metric loss function to make it more suitable for this task. Experiments show that this dual-path video–music retrieval network can effectively merge information; compared with existing methods, Recall@1 increases by 3.94. Full article
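The fusion step described above (content features concatenated with emotion features and mapped through an MLP shared space) can be sketched as follows; the dimensions are assumptions, and a simple pairwise ranking loss stands in for the paper's metric loss with a polarity penalty, which is not reproduced here.

```python
import torch
import torch.nn as nn

class FusedSharedSpace(nn.Module):
    """Concatenate content and emotion features, then project into a shared retrieval space."""
    def __init__(self, content_dim=512, emotion_dim=128, shared_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(content_dim + emotion_dim, 512), nn.ReLU(), nn.Linear(512, shared_dim))

    def forward(self, content, emotion):
        return nn.functional.normalize(self.mlp(torch.cat([content, emotion], dim=-1)), dim=-1)

video_proj, music_proj = FusedSharedSpace(), FusedSharedSpace()
v = video_proj(torch.randn(8, 512), torch.randn(8, 128))
m = music_proj(torch.randn(8, 512), torch.randn(8, 128))

# simple ranking objective: matched video-music pairs should be closer than mismatched ones
pos = (v * m).sum(dim=-1)                       # cosine similarity of matched pairs
neg = (v * m.roll(1, dims=0)).sum(dim=-1)       # shifted rows act as mismatched pairs
loss = torch.clamp(0.2 + neg - pos, min=0).mean()
loss.backward()
```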

15 pages, 3771 KiB  
Article
Facial Emotion Recognition with Inter-Modality-Attention-Transformer-Based Self-Supervised Learning
by Aayushi Chaudhari, Chintan Bhatt, Achyut Krishna and Carlos M. Travieso-González
Electronics 2023, 12(2), 288; https://doi.org/10.3390/electronics12020288 - 5 Jan 2023
Cited by 23 | Viewed by 4756
Abstract
Emotion recognition is a very challenging research field due to its complexity, as individual differences in cognitive–emotional cues involve a wide variety of ways, including language, expressions, and speech. If we use video as the input, we can acquire a plethora of data for analyzing human emotions. In this research, we use features derived from separately pretrained self-supervised learning models to combine text, audio (speech), and visual data modalities. The fusion of features and representation is the biggest challenge in multimodal emotion classification research. Because of the large dimensionality of self-supervised learning characteristics, we present a unique transformer and attention-based fusion method for incorporating multimodal self-supervised learning features that achieved an accuracy of 86.40% for multimodal emotion classification. Full article
(This article belongs to the Special Issue Signal and Image Processing Applications in Artificial Intelligence)
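A rough sketch of fusing pre-extracted self-supervised features from the three modalities with a small transformer encoder follows; the feature widths and the use of a vanilla TransformerEncoder are assumptions, not the paper's inter-modality attention transformer.

```python
import torch
import torch.nn as nn

class ModalityFusionTransformer(nn.Module):
    """Project text/audio/visual SSL features to a shared width, fuse with a transformer, classify."""
    def __init__(self, dims=(768, 1024, 512), d_model=256, n_classes=7):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Linear(d_model, n_classes)

    def forward(self, text_f, audio_f, visual_f):
        tokens = torch.stack([p(f) for p, f in zip(self.proj, (text_f, audio_f, visual_f))], dim=1)
        fused = self.encoder(tokens).mean(dim=1)       # attention mixes the three modality tokens
        return self.cls(fused)

model = ModalityFusionTransformer()
logits = model(torch.randn(4, 768), torch.randn(4, 1024), torch.randn(4, 512))
print(logits.shape)                                    # torch.Size([4, 7])
```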
