MDPI - Publisher of Open Access Journals

19 pages, 1311 KB

Open AccessArticle

An Interpretable Soft-Sensor Framework for Dissertation Peer Review Using BERT

by Meng Wang, Jincheng Su, Zhide Chen, Wencheng Yang and Xu Yang

Sensors 2025, 25(20), 6411; https://doi.org/10.3390/s25206411 - 17 Oct 2025

Viewed by 226

Graduate education has entered the era of big data, and systematic analysis of dissertation evaluations has become crucial for quality monitoring. However, the complexity and subjectivity inherent in peer-review texts pose significant challenges for automated analysis. While natural language processing (NLP) offers potential solutions, most existing methods fail to adequately capture nuanced disciplinary criteria or provide interpretable inferences for educators. Inspired by the soft-sensor concept, this study employs a BERT-based model enhanced with additional attention mechanisms to quantify latent evaluation dimensions from dissertation reviews. The framework integrates Shapley Additive exPlanations (SHAP) to ensure the interpretability of model predictions, combining deep semantic modeling with SHAP to quantify characteristic importance in academic evaluation. The experimental results demonstrate that the implemented model outperforms baseline methods in accuracy, precision, recall, and F1-score. Furthermore, its interpretability mechanism reveals the key evaluation dimensions that experts prioritize during the paper assessment. This analytical framework establishes an interpretable soft-sensor paradigm that bridges NLP with substantive review principles, providing actionable insights for enhancing dissertation improvement strategies.

(This article belongs to the Special Issue AI and Sensors in Computer-Based Educational Systems)

15 pages, 2861 KB

Open AccessArticle

Sustainable Real-Time NLP with Serverless Parallel Processing on AWS

by Chaitanya Kumar Mankala and Ricardo J. Silva

Information 2025, 16(10), 903; https://doi.org/10.3390/info16100903 - 15 Oct 2025

Viewed by 574

Abstract

This paper proposes a scalable serverless architecture for real-time natural language processing (NLP) on large datasets using Amazon Web Services (AWS). The framework integrates AWS Lambda, Step Functions, and S3 to enable fully parallel sentiment analysis with Transformer-based models such as DistilBERT, RoBERTa, and ClinicalBERT. By containerizing inference workloads and orchestrating parallel execution, the system eliminates the need for dedicated servers while dynamically scaling to workload demand. Experimental evaluation on the IMDb Reviews dataset demonstrates substantial efficiency gains: parallel execution achieved a 6.07× reduction in wall-clock duration, an 81.2% reduction in total computing time and energy consumption, and a 79.1% reduction in variable costs compared to sequential processing. These improvements directly translate into a smaller carbon footprint, highlighting the sustainability benefits of serverless architectures for AI workloads. The findings show that the proposed framework is model-independent and provides consistent advantages across diverse Transformer variants. This work illustrates how cloud-native, event-driven infrastructures can democratize access to large-scale NLP by reducing cost, processing time, and environmental impact while offering a reproducible pathway for real-world research and industrial applications.

(This article belongs to the Special Issue Generative AI Transformations in Industrial and Societal Applications)
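
The abstract above mixes "N-fold" and percentage reductions, which can be hard to compare at a glance. The sketch below shows the standard conversion between the two forms; the function names `percent_reduction` and `fold_reduction` are illustrative and do not come from the paper.

```python
def percent_reduction(fold: float) -> float:
    """Convert an N-fold reduction (e.g. 6.07x) into a percentage reduction."""
    if fold <= 0:
        raise ValueError("fold must be positive")
    return (1.0 - 1.0 / fold) * 100.0


def fold_reduction(percent: float) -> float:
    """Convert a percentage reduction (e.g. 81.2) into an N-fold reduction."""
    if not 0.0 <= percent < 100.0:
        raise ValueError("percent must be in [0, 100)")
    return 1.0 / (1.0 - percent / 100.0)


if __name__ == "__main__":
    # Figures quoted in the abstract; the conversions are plain arithmetic.
    print(f"6.07x wall-clock reduction = {percent_reduction(6.07):.1f}% less time")
    print(f"81.2% compute reduction    = {fold_reduction(81.2):.2f}x less compute")
```

Under this conversion, the reported 6.07× wall-clock reduction corresponds to roughly 83.5% less elapsed time, while the 81.2% reduction in total computing time corresponds to roughly a 5.3-fold reduction.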

21 pages, 771 KB

Open AccessArticle

LLM-Driven Offloading Decisions for Edge Object Detection in Smart City Deployments

by Xingyu Yuan and He Li

Smart Cities 2025, 8(5), 169; https://doi.org/10.3390/smartcities8050169 - 10 Oct 2025

Viewed by 460

Abstract

Object detection is a critical technology for smart city development. As request volumes surge, inference is increasingly offloaded from centralized clouds to user-proximal edge sites to reduce latency and backhaul traffic. However, heterogeneous workloads, fluctuating bandwidth, and dynamic device capabilities make offloading and scheduling difficult to optimize in edge environments. Deep reinforcement learning (DRL) has proved effective for this problem, but in practice, it relies on manually engineered reward functions that must be redesigned whenever service objectives change. To address this limitation, we introduce an LLM-driven framework that retargets DRL policies for edge object detection directly through natural language instructions. By leveraging the text-understanding and encoding capabilities of large language models (LLMs), our system (i) interprets the current optimization objective; (ii) generates executable, environment-compatible reward-function code; and (iii) iteratively refines the reward via closed-loop simulation feedback. In simulations on a real-world dataset, policies trained with LLM-generated rewards adapt from prompts alone and outperform counterparts trained with expert-designed rewards, while eliminating manual reward engineering.

(This article belongs to the Section Internet of Things)

47 pages, 3137 KB

Open AccessArticle

DietQA: A Comprehensive Framework for Personalized Multi-Diet Recipe Retrieval Using Knowledge Graphs, Retrieval-Augmented Generation, and Large Language Models

by Ioannis Tsampos and Emmanouil Marakakis

Computers 2025, 14(10), 412; https://doi.org/10.3390/computers14100412 - 29 Sep 2025

Viewed by 663

Abstract

Recipes available on the web often lack nutritional transparency and clear indicators of dietary suitability. While searching by title is straightforward, exploring recipes that meet combined dietary needs, nutritional goals, and ingredient-level preferences remains challenging. Most existing recipe search systems do not effectively support flexible multi-dietary reasoning in combination with user preferences and restrictions. For example, users may seek gluten-free and dairy-free dinners with suitable substitutions, or compound goals such as vegan and low-fat desserts. Recent systematic reviews report that most food recommender systems are content-based and often non-personalized, with limited support for dietary restrictions, ingredient-level exclusions, and multi-criteria nutrition goals. This paper introduces DietQA, an end-to-end, language-adaptable chatbot system that integrates a Knowledge Graph (KG), Retrieval-Augmented Generation (RAG), and a Large Language Model (LLM) to support personalized, dietary-aware recipe search and question answering. DietQA crawls Greek-language recipe websites to extract structured information such as titles, ingredients, and quantities. Nutritional values are calculated using validated food composition databases, and dietary tags are inferred automatically based on ingredient composition. All information is stored in a Neo4j-based knowledge graph, enabling flexible querying via Cypher. Users interact with the system through a user-friendly natural language chatbot interface, where they can express preferences for ingredients, nutrients, dishes, and diets, and filter recipes based on multiple factors such as ingredient availability, exclusions, and nutritional goals. DietQA supports multi-diet recipe search by retrieving both compliant recipes and those adaptable via ingredient substitutions, explaining how each result aligns with user preferences and constraints. An LLM extracts intents and entities from user queries to support rule-based Cypher retrieval, while the RAG pipeline generates contextualized responses using the user query and preferences, retrieved recipes, statistical summaries, and substitution logic. The system integrates real-time updates of recipe and nutritional data, supporting up-to-date, relevant, and personalized recommendations. It is designed for language-adaptable deployment and has been developed and evaluated using Greek-language content. DietQA provides a scalable framework for transparent and adaptive dietary recommendation systems powered by conversational AI.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling (2nd Edition))

19 pages, 662 KB

Open AccessArticle

Mind the Link: Discourse Link-Aware Hallucination Detection in Summarization

by Dawon Lee, Hyuckchul Jung and Yong Suk Choi

Appl. Sci. 2025, 15(19), 10506; https://doi.org/10.3390/app151910506 - 28 Sep 2025

Viewed by 478

Abstract

Recent studies on detecting hallucinations in summaries follow a method of decomposing summaries into atomic content units (ACUs) and then determining whether each unit logically matches the document text based on natural language inference. However, this fails to consider discourse link relations such as temporal order, causality, and purpose, leading to the inability to detect conflicts in semantic connections between individual summary ACUs, even when the conflicts are present in the document. To overcome this limitation, this study proposes a method of extracting Discourse Link-Aware Content Units (DL-ACUs) by converting the summary into an Abstract Meaning Representation (AMR) graph and structuring the discourse link relations between ACUs. Additionally, to align summary ACUs with corresponding document information in a fine-grained manner, we propose a Selective Document-Atomic Content Unit (SD-ACU). For each summary ACU, the SD-ACU retrieves only the most relevant document sentences and then decomposes them into document ACUs. Applying the DL-ACU module to existing hallucination detection systems such as FIZZ and FENICE reduces the error rate on discourse link errors in FRANK. When both modules are combined, the system improves balanced accuracy and ROC-AUC across major benchmarks. This suggests the proposed method effectively captures discourse link errors while enabling ACU-to-ACU alignment.

20 pages, 3793 KB

Open AccessArticle

Exploring Selective Layer Freezing Strategies in Transformer Fine-Tuning: NLI Classifiers with Sub-3B Parameter Models

by Taewook Hwang, Hyein Seo, Jeesu Jung and Sangkeun Jung

Appl. Sci. 2025, 15(19), 10434; https://doi.org/10.3390/app151910434 - 26 Sep 2025

Viewed by 689

Abstract

In recent years, methods that selectively fine-tune or reduce the number of layers in large language models (LLMs) have garnered attention as an efficient alternative to traditional fine-tuning, where all layers are trained. In this study, we revisit the classical concept of layer freezing and propose a simple, effective strategy that selectively fine-tunes only a portion of transformer layers. We show that freezing the bottom 25% or 50% of layers in small-scale LLMs with sub-3 billion parameters yields significant improvements in memory efficiency and training speed while maintaining, or even surpassing, the performance of full fine-tuning and Low-Rank Adaptation (LoRA). In experiments on Natural Language Inference (NLI) tasks using LLMs with fewer than 3 billion parameters, our approach achieves up to 50% memory savings and 30% faster training. Notably, our method does not require architectural modifications or additional parameters, making it particularly suitable for resource-constrained environments.

(This article belongs to the Section Computing and Artificial Intelligence)
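
The freezing rule described in the abstract above (freeze the bottom 25% or 50% of transformer layers) can be sketched as a small index computation. This is a minimal illustration, not the authors' code: the function name `frozen_layer_indices` is hypothetical, and it assumes that "bottom" means the layers closest to the input (index 0) and that fractional counts are rounded down.

```python
from typing import List


def frozen_layer_indices(num_layers: int, freeze_fraction: float) -> List[int]:
    """Return the indices of the bottom layers to freeze.

    Assumption: layer 0 is the layer closest to the input embeddings, and a
    fractional layer count is rounded down.
    """
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    if not 0.0 <= freeze_fraction <= 1.0:
        raise ValueError("freeze_fraction must be in [0, 1]")
    num_frozen = int(num_layers * freeze_fraction)
    return list(range(num_frozen))


if __name__ == "__main__":
    # Example: a hypothetical 24-layer model.
    for fraction in (0.25, 0.5):
        print(fraction, frozen_layer_indices(24, fraction))
    # In a PyTorch training script, one would then typically set
    # `requires_grad = False` on every parameter of the selected layers
    # so that the optimizer skips them.
```

Because only the trainable set changes, this kind of selection needs no extra parameters or architectural changes, which matches the property the abstract highlights.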

31 pages, 2653 KB

Open AccessFeature PaperArticle

A Machine Learning and Econometric Framework for Credibility-Aware AI Adoption Measurement and Macroeconomic Impact Assessment in the Energy Sector

by Adriana AnaMaria Davidescu, Marina-Diana Agafiței, Mihai Gheorghe and Vasile Alecsandru Strat

Mathematics 2025, 13(19), 3075; https://doi.org/10.3390/math13193075 - 24 Sep 2025

Viewed by 568

Abstract

Artificial intelligence (AI) adoption in strategic sectors such as energy is often framed in optimistic narratives, yet its actual economic contribution remains under-quantified. This study proposes a novel, multi-stage methodology at the intersection of machine learning, statistics, and big data analytics to bridge this gap. First, we construct a media-derived AI Adoption Score using natural language processing (NLP) techniques, including dictionary-based keyword extraction, sentiment analysis, and zero-shot classification, applied to a large corpus of firm-related news and scientific publications. To enhance reliability, we introduce a Misinformation Bias Score (MBS), developed via zero-shot classification and named entity recognition, to penalise speculative or biased reporting, yielding a credibility-adjusted adoption metric. Using these scores, we classify firms and apply a Fixed Effects Difference-in-Differences (FE DiD) econometric model to estimate the causal effect of AI adoption on turnover. Finally, we scale firm-level results to the macroeconomic level via a Leontief Input–Output model, quantifying direct, indirect, and induced contributions to GDP and employment. Results show that AI adoption in Romania’s energy sector accounts for up to 42.8% of adopter turnover, contributing 3.54% to national GDP in 2023 and yielding a net employment gain of over 65,000 jobs, despite direct labour displacement. By integrating machine learning-based text analytics, statistical causal inference, and big data-driven macroeconomic modelling, this study delivers a replicable framework for measuring credible AI adoption and its economy-wide impacts, offering valuable insights for policymakers and researchers in digital transformation, energy economics, and sustainable development.

(This article belongs to the Special Issue Machine Learning, Statistics and Big Data, 2nd Edition)
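
For orientation, the two econometric building blocks named in the abstract above are usually written in the generic textbook forms below. The notation is assumed here for illustration and may differ from the authors' exact specification.

```latex
% Two-way fixed-effects difference-in-differences (generic textbook form)
\begin{align}
  Y_{it} &= \alpha_i + \lambda_t + \delta\, D_{it} + \varepsilon_{it} \\
% Leontief input-output model: gross output needed to meet final demand
  x &= A x + d \quad\Longrightarrow\quad x = (I - A)^{-1} d
\end{align}
```

Here $Y_{it}$ would be the turnover of firm $i$ in year $t$, $\alpha_i$ and $\lambda_t$ are firm and time fixed effects, $D_{it}$ equals 1 for adopters in post-adoption periods, and $\delta$ is the estimated effect. In the second equation, $x$ is the vector of sectoral gross output, $A$ the matrix of technical coefficients, and $d$ the final-demand vector.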

21 pages, 3747 KB

Open AccessArticle

Open-Vocabulary Crack Object Detection Through Attribute-Guided Similarity Probing

by Hyemin Yoon and Sangjin Kim

Appl. Sci. 2025, 15(19), 10350; https://doi.org/10.3390/app151910350 - 24 Sep 2025

Viewed by 815

Abstract

Timely detection of road surface defects such as cracks and potholes is critical for ensuring traffic safety and reducing infrastructure maintenance costs. While recent advances in image-based deep learning techniques have shown promise for automated road defect detection, existing models remain limited to closed-set detection settings, making it difficult to recognize newly emerging or fine-grained defect types. To address this limitation, we propose an attribute-aware open-vocabulary crack detection (AOVCD) framework, which leverages the alignment capability of pretrained vision–language models to generalize beyond fixed class labels. In this framework, crack types are represented as combinations of visual attributes, enabling semantic grounding between image regions and natural language descriptions. To support this, we extend the existing PPDD dataset with attribute-level annotations and incorporate a multi-label attribute recognition task as an auxiliary objective. Experimental results demonstrate that the proposed AOVCD model outperforms existing baselines. In particular, compared to CLIP-based zero-shot inference, the proposed model achieves approximately a 10-fold improvement in average precision (AP) for novel crack categories. Attribute classification performance, covering geometric, spatial, and textural features, also increases by 40% in balanced accuracy (BACC) and 23% in AP. These results indicate that integrating structured attribute information enhances generalization to previously unseen defect types, especially those involving subtle visual cues. Our study suggests that incorporating attribute-level alignment within a vision–language framework can lead to more adaptive and semantically grounded defect recognition systems.

(This article belongs to the Section Computing and Artificial Intelligence)

26 pages, 1823 KB

Open AccessArticle

Scalable Gender Profiling from Turkish Texts Using Deep Embeddings and Meta-Heuristic Feature Selection

by Hakan Gunduz

J. Theor. Appl. Electron. Commer. Res. 2025, 20(4), 253; https://doi.org/10.3390/jtaer20040253 - 24 Sep 2025

Viewed by 496

Abstract

Accurate gender identification from written text is critical for author profiling, recommendation systems, and demographic analytics in digital ecosystems. This study introduces a scalable framework for gender classification in Turkish, combining contextualized BERTurk and subword-aware FastText embeddings with three meta-heuristic feature selection algorithms: Genetic Algorithm (GA), Jaya, and Artificial Rabbit Optimization (ARO). Evaluated on the IAG-TNKU corpus of 43,292 balanced Turkish news articles, the best-performing model (BERTurk+GA+LSTM) achieves 89.7% accuracy, while ARO reduces feature dimensionality by 90% with minimal performance loss. Beyond in-domain results, exploratory zero-shot and few-shot adaptation experiments on Turkish e-commerce product reviews demonstrate the framework’s transferability: while zero-shot performance dropped to 59.8%, few-shot adaptation with only 200–400 labeled samples raised accuracy to 69.6–72.3%. These findings highlight both the limitations of training exclusively on news articles and the practical feasibility of adapting the framework to consumer-generated content with minimal supervision. In addition to technical outcomes, we critically examine ethical considerations in gender inference, including fairness, representation, and the binary nature of current datasets. This work contributes a reproducible and linguistically informed baseline for gender profiling in morphologically rich, low-resource languages, with demonstrated potential for adaptation across domains such as social media and e-commerce personalization.

(This article belongs to the Special Issue Human–Technology Synergies in AI-Driven E-Commerce Environments)

29 pages, 1260 KB

Open AccessArticle

Modelling Social Attachment and Mental States from Facebook Activity with Machine Learning

by Stavroula Kridera and Andreas Kanavos

Information 2025, 16(9), 772; https://doi.org/10.3390/info16090772 - 5 Sep 2025

Viewed by 688

Abstract

Social networks generate vast amounts of data that can reveal patterns of human behaviour, social attachment, and mental states. This paper explores advanced machine learning techniques to detect and model such patterns, focusing on community structures, influential users, and information diffusion pathways. To address the scale, noise, and heterogeneity of social data, we leverage recent advances in graph theory, natural language processing, and anomaly detection. Our framework combines clustering for community detection, sentiment analysis for emotional state inference, and centrality metrics for influence estimation, while integrating multimodal data, including textual and visual content, for richer behavioural insights. Experimental results demonstrate that the proposed approach effectively extracts actionable knowledge, supporting mental well-being and strengthening digital social ties. Furthermore, we emphasise the role of privacy-preserving methods, such as federated learning, to ensure ethical analysis. These findings lay the groundwork for responsible and effective applications of machine learning in social network analysis.

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

17 pages, 1307 KB

Open AccessArticle

Representationalism and Enactivism in Cognitive Translation Studies: A Predictive Processing Perspective

by Michael Carl

Information 2025, 16(9), 751; https://doi.org/10.3390/info16090751 - 29 Aug 2025

Viewed by 882

Abstract

Representational Theories of Mind have long dominated Cognitive Translation Studies, typically assuming that translation involves the manipulation of internal representations (symbols) that stand in for external states of affairs. In recent years, classical representationalism has given way to more nuanced, inferential, interpretive, context-sensitive, and modern representational models, some of which align naturally with probabilistic and predictive approaches. While these frameworks remain broadly compatible with one another, radical enactivism offers a more disruptive alternative: it denies representational content altogether, viewing translation instead as an affectively grounded, context-sensitive, self-evidencing activity shaped by the translator’s embodied engagement with text, context, and sociocultural norms. From an enactivist standpoint, translation emerges not from static symbolic mappings, but from situated, embodied, and affectively modulated inference processes that dynamically negotiate meaning across languages. The paper provides a theoretical synthesis, arguing that the Free Energy Principle under Predictive Processing and Active Inference provides a suitable mathematical framework amenable to representational and enactive accounts.

(This article belongs to the Special Issue Human and Machine Translation: Recent Trends and Foundations)

23 pages, 16525 KB

Open AccessArticle

Real-Time Vision–Language Analysis for Autonomous Underwater Drones: A Cloud–Edge Framework Using Qwen2.5-VL

by Wannian Li and Fan Zhang

Drones 2025, 9(9), 605; https://doi.org/10.3390/drones9090605 - 27 Aug 2025

Viewed by 1527

Abstract

Autonomous Underwater Vehicles (AUVs) equipped with vision systems face unique challenges in real-time environmental perception due to harsh underwater conditions and computational constraints. This paper presents a novel cloud–edge framework for real-time vision–language analysis in underwater drones using the Qwen2.5-VL model. Our system employs a uniform frame sampling mechanism that balances temporal resolution with processing capabilities, achieving near real-time analysis at 1 fps from 23 fps input streams. We construct a comprehensive data flow model encompassing image enhancement, communication latency, cloud-side inference, and semantic result return, which is supported by a theoretical latency framework and sustainable processing rate analysis. Simulation-based experimental results across three challenging underwater scenarios (pipeline inspection, coral reef monitoring, and wreck investigation) demonstrate consistent scene comprehension with end-to-end latencies near 1 s. The Qwen2.5-VL model successfully generates natural language summaries capturing spatial structure, biological content, and habitat conditions, even under turbidity and occlusion. Our results show that vision–language models (VLMs) can provide rich semantic understanding of underwater scenes despite challenging conditions, enabling AUVs to perform complex monitoring tasks with natural language scene descriptions. This work contributes to advancing AI-powered perception systems for the growing autonomous underwater drone market, supporting applications in environmental monitoring, offshore infrastructure inspection, and marine ecosystem assessment.

(This article belongs to the Special Issue Advances in Autonomous Underwater Drones: 2nd Edition)
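
The uniform frame sampling mentioned in the abstract above (analysis at 1 fps from a 23 fps stream) can be illustrated with a short index computation. This is a sketch under stated assumptions rather than the authors' implementation: the function name `uniform_sample_indices` is hypothetical, sampling is assumed to start at frame 0, and fractional positions are truncated to the nearest earlier frame.

```python
import math
from typing import List


def uniform_sample_indices(total_frames: int, input_fps: float,
                           target_fps: float) -> List[int]:
    """Pick evenly spaced frame indices so the output runs at target_fps.

    Assumptions: sampling starts at frame 0 and non-integer positions are
    truncated to the nearest earlier frame.
    """
    if input_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    if target_fps > input_fps:
        raise ValueError("target_fps cannot exceed input_fps")
    step = input_fps / target_fps              # input frames per sampled frame
    count = math.ceil(total_frames / step)     # number of samples that fit
    indices = (int(i * step) for i in range(count))
    return [idx for idx in indices if idx < total_frames]


if __name__ == "__main__":
    # Three seconds of a 23 fps stream, sampled down to 1 fps.
    print(uniform_sample_indices(69, input_fps=23, target_fps=1))
```

With these rates, one frame in every 23 is forwarded for analysis, so the per-frame processing budget becomes about one second, which is consistent with the end-to-end latencies near 1 s reported in the abstract.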

17 pages, 2418 KB

Open AccessArticle

InstructSee: Instruction-Aware and Feedback-Driven Multimodal Retrieval with Dynamic Query Generation

by Guihe Gu, Yuan Xue, Zhengqian Wu, Lin Song and Chao Liang

Sensors 2025, 25(16), 5195; https://doi.org/10.3390/s25165195 - 21 Aug 2025

Viewed by 975

Abstract

In recent years, cross-modal retrieval has garnered significant attention due to its potential to bridge heterogeneous data modalities, particularly in aligning visual content with natural language. Despite notable progress, existing methods often struggle to accurately capture user intent when queries are expressed through complex or evolving instructions. To address this challenge, we propose a novel cross-modal representation learning framework that incorporates an instruction-aware dynamic query generation mechanism, augmented by the semantic reasoning capabilities of large language models (LLMs). The framework dynamically constructs and iteratively refines query representations conditioned on natural language instructions and guided by user feedback, thereby enabling the system to effectively infer and adapt to implicit retrieval intent. Extensive experiments on standard multimodal retrieval benchmarks demonstrate that our method significantly improves retrieval accuracy and adaptability, outperforming fixed-query baselines and showing enhanced cross-modal alignment and generalization across diverse retrieval tasks.

(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications—2nd Edition)

14 pages, 2231 KB

Open AccessArticle

OpenMamba: Introducing State Space Models to Open-Vocabulary Semantic Segmentation

by Viktor Ungur and Călin-Adrian Popa

Appl. Sci. 2025, 15(16), 9087; https://doi.org/10.3390/app15169087 - 18 Aug 2025

Viewed by 1506

Abstract

Open-vocabulary semantic segmentation aims to label each pixel of an image based on text descriptions provided at inference time. Recent approaches to this task rely on two-stage methods: the first stage uses a mask generator to produce mask proposals, while the second classifies the segments using a pre-trained vision–language model, such as CLIP. However, since CLIP is pre-trained on natural images, the model struggles with segmentation masks because of their abstract nature. In this paper, we introduce OpenMamba, a novel approach to creating high-level guidance maps to assist in extracting CLIP features within the masked regions for classification. High-level guidance maps are generated by leveraging both visual and textual modalities and introducing State Space Duality (SSD) as an efficient way to tackle the open-vocabulary semantic segmentation task. We also propose a new matching technique for the mask proposals, based on IoU with a dynamic threshold conditioned on mask quality, and we introduce a contrastive loss to ensure that similar mask proposals obtain similar CLIP embeddings. Comprehensive experiments across open-vocabulary benchmarks show that our method can achieve superior performance compared to other approaches while managing to reduce memory consumption.

(This article belongs to the Special Issue Application of Machine Learning to Image Classification and Image Segmentation)

25 pages, 1734 KB

Open AccessArticle

A Multimodal Affective Interaction Architecture Integrating BERT-Based Semantic Understanding and VITS-Based Emotional Speech Synthesis

by Yanhong Yuan, Shuangsheng Duo, Xuming Tong and Yapeng Wang

Algorithms 2025, 18(8), 513; https://doi.org/10.3390/a18080513 - 14 Aug 2025

Viewed by 1136

Abstract

Addressing the issues of coarse emotional representation, low cross-modal alignment efficiency, and insufficient real-time response capabilities in current human–computer emotional language interaction, this paper proposes an affective interaction framework integrating BERT-based semantic understanding with VITS-based speech synthesis. The framework aims to enhance the naturalness, expressiveness, and response efficiency of human–computer emotional interaction. By introducing a modular layered design, a six-dimensional emotional space, a gated attention mechanism, and a dynamic model scheduling strategy, the system overcomes challenges such as limited emotional representation, modality misalignment, and high-latency responses. Experimental results demonstrate that the framework achieves superior performance in speech synthesis quality (MOS: 4.35), emotion recognition accuracy (91.6%), and response latency (<1.2 s), outperforming baseline models like Tacotron2 and FastSpeech2. Through model lightweighting, GPU parallel inference, and load balancing optimization, the system validates its robustness and generalizability across English and Chinese corpora in cross-linguistic tests. The modular architecture and dynamic scheduling ensure scalability and efficiency, enabling a more humanized and immersive interaction experience in typical application scenarios such as psychological companionship, intelligent education, and high-concurrency customer service. This study provides an effective technical pathway for developing the next generation of personalized and immersive affective intelligent interaction systems.

(This article belongs to the Section Algorithms for Multidisciplinary Applications)

Search Results (163)
