MDPI - Publisher of Open Access Journals

23 pages, 6093 KB

Open Access Article

Quantifying Risk Levels for Active Safety Systems in Autonomous Forest Machinery Using Vision Language Models

by Kengo Usui

Forests 2026, 17(6), 708; https://doi.org/10.3390/f17060708 - 17 Jun 2026

Viewed by 183

Forestry is recognized as one of the most dangerous industries in the world. To enhance forestry safety, autonomous machinery and safety systems for such machinery are essential. This study aims to introduce large language models (LLMs)—especially their extensions to images, vision–language models (VLMs)—to enable human-like decision-making for autonomous forest machinery. This research focused on VLMs as an active safety system that can adapt to environments and evaluated the effectiveness of a system that quantitatively makes decisions regarding hazard levels using contrastive language–image pretraining (CLIP). The results of industry type, tree state, and road state classification using pretrained models showed that for three tasks—forestry identification, hung-up tree detection, and road collapse sensing—the target classes consistently exhibited higher similarity with disaster texts compared with nontarget classes. Although the F1 scores were 0.693, 0.324 and 0.634, respectively—indicating that the system is insufficient as a direct active safety system—the application of a similarity threshold optimized to maintain a recall of 0.9 yielded F1 scores of 0.291 and 0.584 for tree state and road state, respectively. These results suggest that the system can potentially be used as a quantitative indicator of hazard by setting a threshold on the similarity score. Full article
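The thresholding step described in the abstract above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `threshold_for_recall`, the toy similarity scores, and the labels are all assumptions made for the example. It picks the highest similarity threshold that still keeps recall at or above 0.9 and reports the resulting precision and F1.

```python
def threshold_for_recall(scores, labels, target_recall=0.9):
    """Return (threshold, precision, recall, f1) for the highest threshold
    whose recall on the positive (hazard) class is at least target_recall."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive example")
    # Try each observed score as a candidate threshold, highest first.
    # Recall can only grow as the threshold drops, so the first hit is the highest.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y)
        fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
        recall = tp / positives
        if recall >= target_recall:
            precision = tp / (tp + fp)
            f1 = 2 * precision * recall / (precision + recall)
            return threshold, precision, recall, f1
    raise ValueError("target_recall must be at most 1.0")


# Hypothetical image-text similarity scores and hazard labels (1 = hazard).
scores = [0.31, 0.28, 0.27, 0.25, 0.24, 0.22, 0.21, 0.19, 0.18, 0.15]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

threshold, precision, recall, f1 = threshold_for_recall(scores, labels)
print(threshold, round(precision, 3), recall, round(f1, 3))
```

Fixing recall first and accepting whatever precision results reflects the safety framing of the abstract: a missed hazard costs more than a false alarm.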

(This article belongs to the Section Forest Operations and Engineering)

13 pages, 248 KB

Open Access Protocol

Storytelling as a Means to Reduce Polarization on Climate Change: A Protocol Paper

by Daryl Stephens, Saraniya Tharmarajah, Valicia Browne, Graham Sack, Wonjung Bae and Rajiv N. Rimal

Climate 2026, 14(6), 122; https://doi.org/10.3390/cli14060122 - 9 Jun 2026

Viewed by 502

Abstract

Despite overwhelming scientific consensus that human activity drives climate change, public opinion in the United States remains sharply polarized along political lines. This project tests whether a theory-driven narrative intervention can reduce divergence between individuals skeptical of climate change and those who accept the scientific consensus. Guided by narrative transportation theory, we hypothesize that an inclusive, character-driven video grounded in the authentic language of skeptical audiences will reduce polarization and increase civic engagement. The study proceeds in three phases. Phase 1 uses focus group discussions to identify words, phrases, and perspectives used by skeptical and accepting participants. Phase 2 integrates these findings into the production of a 2–3 min narrative short film, refined through iterative audience testing. Phase 3 employs a stratified online experiment assessing climate attitudes, policy support, and activism behaviors before exposure, immediately after, and one week later. Mediators include narrative transportation, perceived similarity, and character identification. We test whether pre-exposure divergence narrows over time and whether engagement mechanisms explain observed changes. Findings will inform climate communication policy, intervention design, and broader research on depolarization in polarized public issues. Full article

(This article belongs to the Section Climate Adaptation and Mitigation)

19 pages, 1764 KB

Open Access Article

Automated Dataset Construction for Composed Video Retrieval in Soccer

by Riku Yoshida, Ryota Goka, Keisuke Maeda, Takahiro Ogawa and Miki Haseyama

Appl. Sci. 2026, 16(11), 5360; https://doi.org/10.3390/app16115360 - 27 May 2026

Viewed by 253

Abstract

Composed Video Retrieval (CoVR) enables flexible video search by retrieving a target video that reflects a specified modification to a query video. The triplet datasets—consisting of query videos, query text, and target videos—required for model training have been collected manually. Recent studies have explored automatic construction of training triplets for CoVR; however, most existing approaches rely heavily on caption similarity. This limitation is particularly problematic in soccer videos, where identical or highly similar captions can correspond to visually distinct situations, making it difficult to construct triplets with appropriate relationships. To address this issue, this paper proposes a multimodal triplet construction framework specialized for soccer videos. The key idea is to explicitly incorporate visual similarity alongside textual similarity. Specifically, candidate target videos are selected by combining visual similarity with commentary caption filtering, enabling the identification of videos that are visually similar yet semantically different. The semantic difference between videos is then generated as query text using a large language model (LLM) without manual annotation. Furthermore, a multimodal large language model (MLLM) is introduced to estimate whether the generated modification is visually and semantically consistent with the video pair. Rather than replacing human verification, this step provides an automated screening signal to identify potentially unreliable triplets. The experiments show that the proposed framework automatically constructs triplets with reasonable validity under limited human validation. These results demonstrate the potential of scalable triplet construction for CoVR in soccer videos. Full article

(This article belongs to the Collection Computer Science in Sport)

34 pages, 1346 KB

Open Access Article

by Dan Curavale, Georgian Nicolae, Alexandru Caranica, Horia Cucu, Corneliu Burileanu, Valentina Davidoiu, Andi Buzo and Georg Pelz

Electronics 2026, 15(11), 2301; https://doi.org/10.3390/electronics15112301 - 26 May 2026

Viewed by 274

Abstract

Component obsolescence and supply-chain disruptions increasingly force engineers to spend significant time manually searching and comparing PDF datasheets to identify compatible replacement parts. We propose an AI-powered datasheet assistant based on a Retrieval-Augmented Generation (RAG) pipeline that automatically processes datasheets to accelerate component identification and matching. The core contribution is a summary-driven retrieval mechanism: a Large Language Model (LLM) generates a structured semantic summary of an input datasheet, and the vector embedding of this summary is used to retrieve semantically similar components from a reference database. The system also supports natural language question answering and structured component comparison. Its architecture separates scalable text-only reference indexing from more expensive query-time summarization and reranking. Validation includes a controlled synthetic benchmark and a pilot-scale real-world evaluation on 18 publicly listed microcontroller datasheets grouped into six engineering families. The synthetic benchmark is used to assess pipeline behavior under controlled conditions, while the real-world evaluation measures performance on heterogeneous manufacturer datasheets. In the real-world evaluation, structured summaries generated with Claude Sonnet 4.5 combined with cross-encoder reranking achieved a 72.2% Family Retrieval Rate at k = 1 (13/18; Wilson 95% CI: 49.1–87.5%). Additional experiments with local LLM summaries indicate that retrieval performance depends strongly on summary quality and model capability, with lightweight local summarizers producing lower first-candidate retrieval performance in this setup. The analysis further reports confidence intervals, no-summary baselines, chunking sensitivity, and an Image Reference Rate metric used as a lexical reference proxy rather than a direct measure of visual grounding. Full article
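The Wilson interval quoted in the abstract above can be checked with a few lines of arithmetic. The sketch below uses the standard Wilson score formula with z = 1.96; the function name `wilson_interval` is chosen here for illustration, and nothing in the abstract says how the authors computed their interval.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 13 of 18 datasheets retrieved a same-family component at k = 1.
low, high = wilson_interval(13, 18)
print(f"{13 / 18:.1%} ({low:.1%} to {high:.1%})")
```

With only 18 trials the interval spans almost 40 percentage points, which is consistent with the abstract describing the evaluation as pilot-scale.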

(This article belongs to the Special Issue AI-Powered Natural Language Processing Applications)

20 pages, 1571 KB

Open Access Article

Construction Safety Risk Identification and Coupling Analysis Based on Data Mining

by Guozong Zhang, Dexin Yang and Yuan Sun

Buildings 2026, 16(10), 1917; https://doi.org/10.3390/buildings16101917 - 12 May 2026

Viewed by 330

Abstract

Frequent accidents in the construction sector arise from the dynamic coupling of multiple risk factors, while conventional single-factor approaches fail to capture the underlying complexity. Drawing on 702 accident investigation reports, this study develops an intelligent, data-driven framework that integrates large language model–based risk identification with association rule mining to systematically uncover risk factors and their coupling patterns. A DeepSeek-based model is employed to extract risk factors from unstructured text, followed by cosine similarity–based optimization to refine factor representations. The FP-Growth algorithm is then applied to identify strong association rules among risk factors. The results reveal that deficiencies in the management dimension account for 68.30% of all identified risks, with inadequate safety education and training emerging as the central hub in the risk coupling network, which is further corroborated by complex network analysis. Moreover, a cascading transmission pathway is identified, whereby environmental deficiencies induce weakened safety awareness, which in turn leads to unsafe behaviors. These findings further demonstrate the nonlinear amplification effects arising from concurrent management failures. By establishing a transformation pathway from unstructured textual data to structured risk knowledge, this study provides a robust, data-driven foundation for precise risk identification and systematic prevention in construction safety management. Full article
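The association rules mentioned in the abstract above are usually ranked by support, confidence, and lift. The sketch below computes those three measures for a single rule by direct counting. It is not FP-Growth itself, which is an efficient way of finding the frequent itemsets first, and the factor names and toy reports are hypothetical rather than taken from the study's 702 reports.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_both = sum(1 for t in transactions if (a | c) <= t)
    support = count_both / n            # share of reports containing both sides
    confidence = count_both / count_a   # P(consequent | antecedent)
    lift = confidence / (count_c / n)   # > 1 means co-occurrence above chance
    return support, confidence, lift


# Hypothetical risk factors extracted from five accident reports.
reports = [
    {"inadequate_training", "unsafe_behavior"},
    {"inadequate_training", "unsafe_behavior", "poor_supervision"},
    {"inadequate_training", "poor_supervision"},
    {"poor_environment", "unsafe_behavior"},
    {"poor_environment"},
]

support, confidence, lift = rule_metrics(
    reports, {"inadequate_training"}, {"unsafe_behavior"}
)
print(round(support, 3), round(confidence, 3), round(lift, 3))
```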

(This article belongs to the Section Construction Management, and Computers & Digitization)

35 pages, 12654 KB

Open Access Article

An Integrated BIM–NLP Framework for Design-Informed Automated Construction Schedule Generation

by Mahmoud Donia, Emad Elbeltagi, Ahmed Elhakeem and Hossam Wefki

Designs 2026, 10(2), 43; https://doi.org/10.3390/designs10020043 - 7 Apr 2026

Viewed by 1548

Abstract

Artificial intelligence has attracted increasing attention in the construction industry; however, automated time scheduling remains limited in practical applications. Schedule development remains manual, requiring planners to analyze project documents, define activities, estimate durations, and identify relationships based on logical sequence. This process primarily depends on individual experience and skills, making it both time-consuming and prone to human error. From an engineering design perspective, delayed or inconsistent schedule development weakens design-to-construction feedback, limiting the ability to evaluate constructability and time implications of alternative design decisions during early-stage planning. This study proposes an integrated BIM–Natural Language Processing (NLP) framework to automate activity identification, duration estimation, and logical sequencing for construction scheduling. The framework extracts project data from Revit, organizes it into a bill of quantities format, and then generates an activity list, each activity with a unique ID. Using Sentence-BERT (SBERT) embeddings, the framework estimates activity durations based on semantic similarity. The same semantic process is combined with rule-based reasoning to identify logical relationships, including sequences, supported by an Excel-based reference dictionary that includes logical relationships, productivity, and ID structure. Finally, the framework incorporates a crashing module that proportionally adjusts the duration of activities on the longest path to target the project’s completion time without violating relationships. The proposed framework was validated using real construction project data and produced reliable results. 
By producing a tool-ready schedule directly from design-model information, the proposed workflow enables earlier schedule feedback loops and supports design-informed planning by allowing designers and planners to assess the time consequences of model-driven scope changes. The results demonstrate that integrating BIM and NLP can transform conventional schedules into faster, more consistent processes, thereby supporting the construction industry. Full article
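The proportional crashing idea in the abstract above can be illustrated with a small sketch. It rests on several assumptions not stated in the source: the longest path is already known, every activity on it can be shortened freely, and no other path becomes the longest after the adjustment. The function name, activity names, and durations are hypothetical.

```python
def crash_longest_path(durations, longest_path, target_duration):
    """Scale activities on the longest path so their total equals the target.

    Activities off the path keep their original durations. If the path
    already meets the target, nothing is changed.
    """
    current = sum(durations[a] for a in longest_path)
    adjusted = dict(durations)
    if current <= target_duration:
        return adjusted
    factor = target_duration / current  # same proportional cut for each activity
    for activity in longest_path:
        adjusted[activity] = durations[activity] * factor
    return adjusted


# Hypothetical activity durations in days.
durations = {"excavation": 10, "foundation": 20, "structure": 30, "landscaping": 5}
path = ["excavation", "foundation", "structure"]

crashed = crash_longest_path(durations, path, target_duration=48)
print(crashed)
```

A fuller implementation would recompute the longest path after each adjustment and respect a minimum feasible duration per activity. The abstract does not say how the framework handles either case.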

(This article belongs to the Special Issue Quantum and AI Technologies in Engineering and Construction Projects: Design Challenges and Opportunities)

24 pages, 2520 KB

Open Access Article

MAFQA: A Dataset for Benchmarking Multi-Hop Arabic Fatwa Question Answering

by Manal Ali Al-Qahtani, Bader Fahad Alkhamees and Mourad Ykhlef

Data 2026, 11(3), 64; https://doi.org/10.3390/data11030064 - 20 Mar 2026

Viewed by 863

Abstract

Developing reliable Arabic question answering (QA) systems for Islamic fatwas requires datasets that capture the linguistic complexity and multi-step reasoning inherent in jurisprudential inquiries. However, the existing Arabic religious QA datasets primarily focus on direct retrieval or classification, often failing to address the multi-hop reasoning necessary for complex fatwa questions. To bridge this gap, we introduce MAFQA, a benchmark dataset specifically designed for multi-hop Arabic fatwa question answering. MAFQA was constructed from an extensive corpus of authentic fatwa records sourced from authoritative Islamic institutions. The dataset was developed via a semi-automated pipeline that integrates expert-guided identification of complex inquiries with a structured decomposition framework. This framework employs automated reasoning-pattern classification, semantic feature extraction, and template-guided annotation of subquestions and subanswers, followed by rigorous validation to ensure contextual grounding, logical coherence, and structural consistency. To evaluate the utility of the dataset, we conduct an extensive benchmarking study using Arabic-specialized, multilingual, and instruction-tuned language models across two primary tasks: question decomposition (QD) and generative question answering (QA). Performance is assessed using a comprehensive suite of lexical, semantic, relevance, and faithfulness metrics. Experimental results demonstrate that Arabic-specialized models consistently outperform their multilingual counterparts, with AraT5-base and AraBART achieving the highest performance in terms of lexical similarity, semantic alignment, and answer faithfulness. Full article

(This article belongs to the Section Information Systems and Data Management)

51 pages, 1067 KB

Open Access Article

Language Models Are Polyglots: Language Similarity Predicts Cross-Lingual Transfer Learning Performance

by Juuso Eronen, Michal Ptaszynski, Tomasz Wicherkiewicz, Robert Borges, Katarzyna Janic, Zhenzhen Liu, Tanjim Mahmud and Fumito Masui

Mach. Learn. Knowl. Extr. 2026, 8(3), 65; https://doi.org/10.3390/make8030065 - 7 Mar 2026

Viewed by 3679

Abstract

Selecting a source language for zero-shot cross-lingual transfer is typically done by intuition or by defaulting to English, despite large performance differences across language pairs. We study whether linguistic similarity can predict transfer performance and support principled source-language selection. We introduce quantified WALS (qWALS), a typology-based similarity metric derived from features in the World Atlas of Language Structures, and evaluate it against existing similarity baselines. Validation uses three complementary signals: computational similarity scores, zero-shot transfer performance of multilingual transformers (mBERT and XLM-R) on four NLP tasks (dependency parsing, named entity recognition, sentiment analysis, and abusive language identification) across eight languages, and an expert-linguist similarity survey. Across tasks and models, higher linguistic similarity is associated with better transfer, and the survey provides independent support for the computational metrics. Full article

(This article belongs to the Special Issue Advancing Natural Language Processing for Low-Resource Languages and Dialects)

19 pages, 1898 KB

Open Access Article

A Backdoor Label Verification Method Based on Consensus Deviation for Pre-Trained Language Models

by Xiang Yang, Kai Zeng, Jiangming Luo, Peicheng Yang and Xiaohui Zhang

Electronics 2026, 15(5), 1015; https://doi.org/10.3390/electronics15051015 - 28 Feb 2026

Viewed by 484

Abstract

Backdoor attacks pose a critical security risk to pre-trained language models (PLMs) by utilizing concealed triggers to manipulate model outputs. Existing defense strategies largely depend on statistical thresholds, which often struggle to identify sophisticated backdoor samples that exhibit high cognitive similarity to benign data. Such similarities make precise threshold calibration difficult, frequently leading to unreliable or failed detection. To overcome these limitations, we propose a backdoor detection method based on consensus deviation, shifting the defensive paradigm from surface-level statistical metrics to deep cognitive consensus verification. This approach obviates the reliance on fixed thresholds, enabling the more robust identification of covert triggers. Extensive experiments on the SST-2, HSOL, and AG's News datasets revealed that our method achieved significantly lower attack success rates (ASRs) and enhanced robustness compared with the current baselines across word-, sentence-, and structural-level attack scenarios. Full article

(This article belongs to the Special Issue Research on Privacy and Security Issues in Cloud Computing)

22 pages, 2153 KB

Open Access Article

Benchmark of Genomic Language Models on Human and Rice Genomic Tasks

by Xiaosheng Gao, Shunyao Wu and Weihua Pan

Appl. Sci. 2026, 16(4), 1745; https://doi.org/10.3390/app16041745 - 10 Feb 2026

Viewed by 1004

Abstract

Genomic Language Models (GLMs), leveraging their vast parameter scales and the similarities between DNA sequences and natural languages, demonstrate immense potential in processing large-scale genomic data and elucidating gene regulation and evolutionary relationships. However, the cross-species generalization capability of large genomic models has not yet been systematically evaluated. This study addresses this critical gap by benchmarking five GLMs (DNABERT-2, GROVER, HyenaDNA, NT-V2, and AgroNT) and a CNN baseline model using human (Homo sapiens) and rice (Oryza sativa) genomes across four downstream tasks: promoter detection, transcription start site (TSS) scanning, species classification, and gene region identification, through both zero-shot testing and fine-tuning. During testing, factors such as hyperparameters, early stopping protocols, and computational resources were fixed to ensure fairness, enabling us to systematically evaluate their performance and cross-species generalization capabilities. The results were further analyzed from multiple mathematical and representational perspectives to provide a more rigorous and objective assessment of each model’s performance. The results show that AgroNT consistently leads on rice tasks, while NT-V2 and DNABERT-2 achieved the best overall performance in fine-tuning and zero-shot experiments, respectively. Although their pretraining data did not include plants, they demonstrate excellent performance on rice-related tasks thanks to cross-species pretraining that enhances their generalization ability across human–rice domains. This benchmark study offers guidance on selecting appropriate genomic language models based on task characteristics and provides insights for future development in this field. Full article

27 pages, 3763 KB

Open Access Feature Paper Article

GO-PILL: A Geometry-Aware OCR Pipeline for Reliable Recognition of Debossed and Curved Pill Imprints

by Jaehyeon Jo, Sungan Yoon and Jeongho Cho

Mathematics 2026, 14(2), 356; https://doi.org/10.3390/math14020356 - 21 Jan 2026

Viewed by 1546

Abstract

Manual pill identification is often inefficient and error-prone due to the large variety of medications and frequent visual similarity among pills, leading to misuse or dispensing errors. These challenges are exacerbated when pill imprints are engraved, curved, or irregularly arranged, conditions under which conventional optical character recognition (OCR)-based methods degrade significantly. To address this problem, we propose GO-PILL, a geometry-aware OCR pipeline for robust pill imprint recognition. The framework extracts text centerlines and imprint regions using the TextSnake algorithm. During imprint refinement, background noise is suppressed and contrast is enhanced to improve the visibility of embossed and debossed imprints. The imprint localization and alignment stage then rectifies curved or obliquely oriented text into a linear representation, producing geometrically normalized inputs suitable for OCR decoding. The refined imprints are processed by a multimodal OCR module that integrates a non-autoregressive language–vision fusion architecture for accurate character-level recognition. Experiments on a pill image dataset from the U.S. National Library of Medicine show that GO-PILL achieves an F1-score of 81.83% under set-based evaluation and a Top-10 pill identification accuracy of 76.52% in a simulated clinical scenario. GO-PILL consistently outperforms existing methods under challenging imprint conditions, demonstrating strong robustness and practical feasibility. Full article

(This article belongs to the Special Issue Applications of Deep Learning and Convolutional Neural Network)

26 pages, 2786 KB

Open Access Article

Time-Series Modeling and LLM-Based Agents for Peak Energy Management in Smart Campus Environments

by Mossab Batal, Youness Tace, Hassna Bensag, Sanaa El Filali and Mohamed Tabaa

Sustainability 2026, 18(2), 875; https://doi.org/10.3390/su18020875 - 15 Jan 2026

Viewed by 1275

Abstract

Smart campuses increasingly rely on data-driven operations, but growing energy demand puts their cost control and sustainability at risk. This study addresses the challenge of anticipating and managing energy consumption peaks in multi-campus environments by proposing a hybrid framework that combines advanced time-series forecasting models with a large language model (LLM)-driven multi-agent system. Based on the UNICON dataset, LSTM, CNN, GRU, and a combined architecture are trained and compared in terms of MAE and RMSE. The hybrid configuration achieves the best forecasting results, returning the lowest loss values. To identify critical periods, we employed a strategy based on median thresholding, which sorts consumption into low, normal, and extreme categories so that peak mitigation actions can be targeted. We also introduce an LLM-based multi-agent system, comprising a data aggregator, a forecaster, and a policy advisor, which creates actionable, context-informed policies. We further compare LLMs (Qwen-2.5, Gemma-2, Phi-4, Mistral, Llama-3.3) in terms of context accuracy, response relevance, semantic similarity, and retrieval/recall accuracy and fidelity, with Llama-3.3 achieving the best overall results. The framework shows potential not only for energy consumption forecasting but also for developing precise policies to manage energy consumption peaks effectively. Full article

(This article belongs to the Section Environmental Sustainability and Applications)

23 pages, 2741 KB

Open Access Article

Subjective Evaluation of Operator Responses for Mobile Defect Identification in Remanufacturing: Application of NLP and Disagreement Tagging

by Abbirah Ahmed, Reenu Mohandas, Arash Joorabchi and Martin J. Hayes

Big Data Cogn. Comput. 2025, 9(12), 312; https://doi.org/10.3390/bdcc9120312 - 4 Dec 2025

Viewed by 998

Abstract

In the context of remanufacturing, particularly mobile device refurbishing, effective operator training is crucial for accurate defect identification and process inspection efficiency. This study examines the application of Natural Language Processing (NLP) techniques to evaluate operator expertise based on subjective textual responses gathered during a defect analysis task. Operators were asked to describe screen defects using open-ended questions, and their responses were compared with expert responses to evaluate their accuracy and consistency. We employed four NLP models, including finetuned Sentence-BERT (SBERT), pre-trained SBERT, Word2Vec, and Dice similarity, to determine their effectiveness in interpreting short, domain-specific text. A novel disagreement tagging framework was introduced to supplement traditional similarity metrics with explainable insights. This framework identifies the root causes of model–human misalignment across four categories: defect type, severity, terminology, and location. Results show that a finetuned SBERT model significantly outperforms the other models, achieving a Pearson's correlation of 0.93 with MAE and RMSE scores of 0.07 and 0.12, respectively, and providing more accurate and context-aware evaluations. In contrast, the other models exhibit limitations in semantic understanding and consistency. The results highlight the importance of finetuning NLP models for domain-specific applications and demonstrate how qualitative tagging methods can enhance interpretability and model debugging. This combined approach indicates a scalable and transparent methodology for the evaluation of operator responses, supporting the development of more effective training programmes in industrial settings where remanufacturing and sustainability are key performance metrics. Full article
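Of the four similarity methods listed in the abstract above, Dice similarity is the simplest to show in code. The sketch below computes it over whitespace-separated word sets and adds the MAE and RMSE error measures used to compare model scores with human scores. The tokenization, the example sentences, and the score lists are assumptions made for illustration and do not come from the study.

```python
import math


def dice_similarity(text_a, text_b):
    """Dice coefficient over lowercase word sets: 2|A & B| / (|A| + |B|)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0  # two empty responses are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def mae(predicted, reference):
    """Mean absolute error between two equal-length score lists."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def rmse(predicted, reference):
    """Root mean squared error between two equal-length score lists."""
    return math.sqrt(
        sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference)
    )


# Hypothetical operator and expert descriptions of the same screen defect.
operator = "deep scratch on top left corner"
expert = "scratch on the top left"
similarity = dice_similarity(operator, expert)

# Hypothetical model similarity scores versus human ratings.
model_scores = [0.9, 0.4, 0.7]
human_scores = [1.0, 0.5, 0.5]
print(round(similarity, 3), round(mae(model_scores, human_scores), 3),
      round(rmse(model_scores, human_scores), 3))
```

Because Dice only counts shared surface words, it cannot tell that "crack" and "fracture" describe the same defect, which is the kind of limitation in semantic understanding the abstract attributes to the non-finetuned models.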

(This article belongs to the Special Issue Artificial Intelligence (AI) and Natural Language Processing (NLP))

19 pages, 2154 KB

Open Access Article

Mining Patient Cohort Discovery: A Synergy of Medical Embeddings and Approximate Nearest Neighbor Search

by Dimitrios Karapiperis, Antonios P. Antoniadis and Vassilios S. Verykios

Electronics 2025, 14(22), 4505; https://doi.org/10.3390/electronics14224505 - 18 Nov 2025

Cited by 1 | Viewed by 838

Abstract

Traditional methods for patient cohort identification from Electronic Health Records (EHRs) are often slow, labor-intensive, and fail to capture the rich semantic nuance embedded in unstructured clinical narratives. This paper introduces a scalable, end-to-end framework that creates a synergy between deep medical embeddings and Approximate Nearest Neighbor Search (ANNs) to overcome these limitations. We detail a complete pipeline that begins with preprocessing multi-modal EHR data and creating holistic patient representations using a domain-specific language model combined with an intelligent gated fusion mechanism. These high-dimensional embeddings are then indexed using an ANN method to enable near real-time retrieval. A comprehensive experimental evaluation was conducted on the MIMIC-III and MIMIC-IV datasets, comparing the retrieval performance of ClinicalBERT against BioBERT across several ANN algorithms. The results demonstrate that the combination of ClinicalBERT and HNSW consistently achieves the highest retrieval accuracy, with F1-Scores exceeding 0.78, and query latencies under 10 ms. This framework enables a paradigm shift towards high-speed, semantic patient similarity search, with significant implications for accelerating clinical trial recruitment, augmenting clinical decision support, and paving the way for a new era in data-driven precision medicine. Full article

(This article belongs to the Special Issue Advanced Research in Technology and Information Systems, 2nd Edition)

31 pages, 3629 KB

Open Access Review

Sequence-Based Protein–Protein Interaction Prediction and Its Applications in Drug Discovery

by François Charih, James R. Green and Kyle K. Biggar

Cells 2025, 14(18), 1449; https://doi.org/10.3390/cells14181449 - 16 Sep 2025

Cited by 2 | Viewed by 5974

Abstract

Aberrant protein–protein interactions (PPIs) underpin a plethora of human diseases, and disruption of these harmful interactions constitutes a compelling treatment avenue. Advances in computational approaches to PPI prediction have closely followed progress in deep learning and natural language processing. In this review, we outline the state-of-the-art methods for sequence-based PPI prediction and explore their impact on target identification and drug discovery. We begin with an overview of commonly used training data sources and the techniques used to curate these data to enhance the quality of the training set. Subsequently, we survey various PPI predictor types, including traditional similarity-based approaches and deep learning-based approaches, with a particular emphasis on the transformer architecture. Finally, we provide examples of PPI prediction in system-level proteomics analyses, target identification, and the design of therapeutic peptides and antibodies. This review sheds light on sequence-based PPI prediction, a broadly applicable alternative to structure-based methods, from a unique perspective that emphasizes its role in the drug discovery process and rigorous model assessment. Full article

(This article belongs to the Special Issue Unraveling Protein–Protein Interactions for Innovative Therapeutics and Nanodelivery)

Search Results (79)
