Search Results (88)

Search Parameters:
Keywords = text document clustering

19 pages, 1474 KB  
Article
Trends of CEO Messages in Corporate Sustainability Reports: Text Mining and CONCOR Analysis
by Yoojin Shin and Hyejin Lee
Sustainability 2026, 18(2), 856; https://doi.org/10.3390/su18020856 - 14 Jan 2026
Viewed by 248
Abstract
Sustainability has become a central concern globally, and efforts to enhance it are being made across various fields. In line with this trend, corporate sustainability reports have become more widely published. These reports provide both financial and non-financial information on a company’s sustainability. In this context, this study aims to, first, analyze the main keywords contained in CEO messages. Second, it examines whether the keywords emphasized by CEOs change in response to shifts in corporate risk under economic uncertainty. Finally, it identifies how the categories of words included in these messages are classified. To address these research questions, text analysis was selected as the methodology. Specifically, a qualitative research approach using text mining and CONCOR analysis was conducted on the text of sustainability reports. According to the Term Frequency and Term Frequency-Inverse Document Frequency analyses, the most frequently occurring keywords were ESG, Sustainable, Society, Stakeholders, Growth, Environment, Effort, and Future. Centrality analysis identified the following keywords as having high centrality: Sustainable, ESG, Society, Environment, Growth, Effort, and Stakeholders. Finally, CONCOR analysis revealed four clusters: Eco-friendly Energy, ESG Management, Global Crisis, and Technological Competitiveness. This study is significant in that it analyzes the major keywords and their changes within unstructured text data using text mining and CONCOR analysis, and it suggests the possibility of future quantitative analysis of non-financial information using these keywords. Full article
(This article belongs to the Special Issue Sustainable Organization Management and Entrepreneurial Leadership)

33 pages, 465 KB  
Article
A Multi-Stage NLP Framework for Knowledge Discovery from Crop Disease Research Literature
by Jantima Polpinij, Manasawee Kaenampornpan, Christopher S. G. Khoo, Wei-Ning Cheng and Bancha Luaphol
Mathematics 2026, 14(2), 299; https://doi.org/10.3390/math14020299 - 14 Jan 2026
Viewed by 184
Abstract
Extracting and organizing knowledge from the agricultural crop disease research literature are challenging tasks because of the heterogeneous terminologies, complicated symptom descriptions, and unstructured nature of scientific documents. In this study, we developed a multi-stage natural language processing (NLP) pipeline to automate knowledge extraction, organization, and integration from the agricultural research literature into a domain-consistent crop disease knowledge graph. The model combines transformer-based sentence embeddings with variational deep clustering to extract topics, which are further refined via facet-aware relevance scoring for sentence selection to be included in the summary. Lexicon-guided named entity recognition helps in the precise identification and normalization of terms for crops, diseases, symptoms, etc. Relation extraction based on a combination of lexical, semantic, and contextual features leads to the meaningful generation of triplets for the knowledge graph. The experimental results show that the method yielded consistently good results at each stage of the knowledge extraction process. Among the combinations of embedding and deep clustering methods, SciBERT + VaDE achieved the best clustering results. The extraction of representative sentences for disease symptoms, control/treatment, and prevention obtained high F1-scores of around 0.8. The resulting knowledge graph has high node coverage and high relation completeness, as well as high precision and recall in triplet generation. The multi-stage NLP pipeline effectively converts unstructured agricultural research texts into a coherent and semantically rich knowledge graph, providing a basis for further research in crop disease analysis, knowledge retrieval, and data-driven decision support in agricultural informatics. Full article

25 pages, 3423 KB  
Article
Unsupervised Text Feature Selection for Clustering via a Hybrid Breeding Cooperative Whale Optimization Algorithm
by Yufeng Zheng, Zhiwei Ye and Songsong Zhang
Algorithms 2026, 19(1), 44; https://doi.org/10.3390/a19010044 - 5 Jan 2026
Viewed by 280
Abstract
In machine learning, feature selection (FS) is crucial for simplifying data while preserving the variables that most influence predictive performance. Although FS has been extensively studied, addressing it in an unsupervised setting remains challenging. Without class labels, optimization is more prone to slow convergence and local optima. In particular, unsupervised text FS has received comparatively little attention, and its effectiveness is often limited by the underlying search strategy. To address this issue, we propose a hybrid breeding cooperative whale optimization algorithm (HBCWOA) tailored to unsupervised text FS. HBCWOA combines the cooperative evolutionary mechanism of hybrid breeding optimization with the global search capability of the whale optimization algorithm. The population is partitioned into three lines that evolve independently, while high-quality candidates are periodically exchanged among them to maintain diversity and promote stable, progressive convergence. Moreover, we design an adaptive dynamic accurate probabilistic transfer function (ADAPTF) to balance exploration and exploitation. By integrating the refinement ability of S-shaped transfer functions with the broader search ability of V-shaped ones, ADAPTF adaptively adjusts the exploration depth, reduces redundancy, and improves the convergence stability. After FS, K-means clustering is employed to assess how well the selected features structure document groups. Experiments on the CEC2022 benchmark functions and eight text datasets, under multiple evaluation metrics, show that HBCWOA attains faster convergence, more effective search exploration, and higher clustering accuracy than its S-shaped and V-shaped variants as well as several competitive text FS methods. Full article
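The S-shaped and V-shaped transfer functions named in the abstract above can be illustrated with their most common textbook forms. This is a minimal sketch, assuming the logistic sigmoid for the S-shape and |tanh(x)| for the V-shape; the abstract does not define ADAPTF, so nothing here reproduces it, and the function names are hypothetical.

```python
import math
import random


def s_shaped(x: float) -> float:
    """Logistic sigmoid: maps a continuous position to a probability of the bit being 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # avoids overflow for large negative x
    return e / (1.0 + e)


def v_shaped(x: float) -> float:
    """|tanh(x)|: maps a continuous position to a probability of flipping the bit."""
    return abs(math.tanh(x))


def binarize_s(x: float, rng: random.Random) -> int:
    """S-shaped rule: select the feature (bit = 1) with probability s_shaped(x)."""
    return 1 if rng.random() < s_shaped(x) else 0


def binarize_v(x: float, current_bit: int, rng: random.Random) -> int:
    """V-shaped rule: flip the current bit with probability v_shaped(x)."""
    return 1 - current_bit if rng.random() < v_shaped(x) else current_bit


if __name__ == "__main__":
    rng = random.Random(42)
    positions = [-2.0, -0.5, 0.0, 0.5, 2.0]  # hypothetical continuous search positions
    mask = [binarize_s(x, rng) for x in positions]
    print(mask)
```

In this reading, the S-shaped rule sets each bit directly from the position, while the V-shaped rule only decides whether to flip the existing bit, so positions near zero leave the current feature mask unchanged.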

24 pages, 802 KB  
Article
AI-Facilitated Lecturers in Higher Education Videos as a Tool for Sustainable Education: Legal Framework, Education Theory and Learning Practice
by Anastasia Atabekova, Atabek Atabekov and Tatyana Shoustikova
Sustainability 2026, 18(1), 40; https://doi.org/10.3390/su18010040 - 19 Dec 2025
Viewed by 569
Abstract
The study aims to establish a comprehensive framework aligning institutional governance, pedagogical theories, and teaching practice for the sustainable adoption of AI-facilitated digital representatives of human instructors in higher education videos within universities. The study employs a systemic qualitative approach and grounded theory principles to analyze administrative/legal documents and academic publications. The methodology includes source searching and screening, automated text analysis using the Lexalytics tool, clustering and thematic interpretation of the findings, and a subsequent discussion of the emerging perspectives. Following the analysis of international/supranational/national regulations, the findings reveal a significant regulatory gap for humans’ digital representatives in educational videos and suggest a governance baseline for tailored institutional guidelines that address data protection, copyright, and ethical compliance. Theoretically, the study synthesizes evidence-informed educational theories and concepts to form a robust theoretical foundation for using humans’ digital representatives in higher education instructional videos and identifies constructivism, student-centered personalized learning, multimodal multimedia-based learning principles, smart and flipped classrooms, and post-digital relations pedagogy as crucial foundational concepts. The findings suggest a thematic taxonomy that outlines diverse digital representative types, their varying efficiency based on knowledge and course type, and university community attitudes highlighting benefits and challenges. The overall contribution of this research lies in an integrated interdisciplinary framework—including the legal context, pedagogical theory, and promising practices—that guides the responsible use of digital human representatives in higher education videos. Full article

20 pages, 1030 KB  
Article
VISTA: A Multi-View, Hierarchical, and Interpretable Framework for Robust Topic Modelling
by Tvrtko Glunčić, Domjan Barić and Matko Glunčić
Mach. Learn. Knowl. Extr. 2025, 7(4), 162; https://doi.org/10.3390/make7040162 - 8 Dec 2025
Viewed by 619
Abstract
Topic modeling is a fundamental technique in natural language processing used to uncover latent themes in large text corpora, yet existing approaches struggle to jointly achieve interpretability, semantic coherence, and scalability. Classical probabilistic models such as LDA and NMF rely on bag-of-words assumptions that obscure contextual meaning, while embedding-based methods (e.g., BERTopic, Top2Vec) improve coherence at the expense of diversity and stability. Prompt-based frameworks (e.g., TopicGPT) enhance interpretability but remain sensitive to prompt design and are computationally costly on large datasets. This study introduces VISTA (Vector-Similarity Topic Analysis), a multi-view, hierarchical, and interpretable framework that integrates complementary document embeddings, mutual-nearest-neighbor hierarchical clustering with selective dimension analysis, and large language model (LLM)-based topic labeling enforcing hierarchical coherence. Experiments on three heterogeneous corpora—BBC News, BillSum, and a mixed U.S. Government agency news + Twitter dataset—show that VISTA consistently ranks among the top-performing models, achieving the highest C_UCI coherence and a strong balance between topic diversity and semantic consistency. Qualitative analyses confirm that VISTA identifies domain-relevant themes overlooked by probabilistic or prompt-based models. Overall, VISTA provides a scalable, semantically robust, and interpretable framework for topic discovery, bridging probabilistic, embedding-based, and LLM-driven paradigms in a unified and reproducible design. Full article
(This article belongs to the Section Visualization)

21 pages, 3145 KB  
Article
Machine Learning-Based Semantic Analysis of Scientific Publications for Knowledge Extraction in Safety-Critical Domains
by Pavlo Nosov, Oleksiy Melnyk, Mykola Malaksiano, Pavlo Mamenko, Dmytro Onyshko, Oleksij Fomin, Václav Píštěk and Pavel Kučera
Mach. Learn. Knowl. Extr. 2025, 7(4), 150; https://doi.org/10.3390/make7040150 - 24 Nov 2025
Cited by 1 | Viewed by 666
Abstract
This article presents the development of a modular software suite for automated analysis of scientific publications in PDF format. The system integrates vectorization, clustering, topic modelling, dimensionality reduction, and fuzzy logic to combine both formal (vector-based) and semantic (topic-based) approaches. Interactive 3D visualization supports intuitive exploration of thematic clusters, allowing users to highlight relevant documents and adjust analytical parameters. Validation on a maritime safety case study confirmed the system’s ability to process large publication collections, identify relevant sources, and reveal underlying knowledge structures. Compared to established frameworks such as PRISMA or Scopus/WoS Analytics, the proposed tool operates directly on full-text content, provides deeper thematic classification, and does not require subscription-based databases. The study also addresses the limitations arising from data bias and reproducibility issues in the semantic interpretability of safety-critical decision-making systems. The approach offers practical value for organizations in safety-critical domains—including transportation, energy, cybersecurity, and human–machine interaction—where rapid access to thematically related research is essential. Full article

24 pages, 1537 KB  
Article
Creative Tourist Segmentation for Nature-Based Tourism: A Social Media Framework for Sustainable Recreation Planning and Development in Thailand’s National Parks
by Kinggarn Sinsup and Sangsan Phumsathan
Sustainability 2025, 17(22), 10005; https://doi.org/10.3390/su172210005 - 9 Nov 2025
Viewed by 1111
Abstract
This study investigates the potential of creative tourism in Thailand’s national parks and the role of social media in promoting creative tourism experiences. The objectives were to examine creative tourism activities, identify visitor segments based on activity preferences and media use, and propose targeted communication strategies to enhance engagement and support sustainable tourism. A mixed-methods design combined document reviews of 133 national parks, field surveys in 10 parks, and a structured visitor survey with 1133 respondents across terrestrial and marine parks. The study identified 25 tourism activities, of which 20 were classified as creative tourism. Exploratory Factor Analysis revealed four key dimensions: nature-based learning, scenic immersion, community participation, and culinary experiences. Cluster analysis segmented visitors into five groups: Local Advocates, Nature Explorers, Food Enthusiasts, Nature Learners, and Diverse Enthusiasts. Media preferences varied across groups. Nature Explorers and Food Enthusiasts engaged strongly with short-form videos and scenic visuals, while Local Advocates and Nature Learners preferred educational and text-based formats. Diverse Enthusiasts, the largest segment, interacted with multiple content types. Scenic imagery emerged as the most influential theme overall. These results provide practical implications for designing creative tourism strategies and creating social media campaigns targeted at diverse groups of tourists in Thailand’s national parks. Full article

18 pages, 397 KB  
Article
Towards Stringent Ecological Protection and Sustainable Spatial Planning: Institutional Grammar Analysis of China’s Urban–Rural Land Use Policy Regulations
by Yuewen Chen, Cheng Zhou and Clare Richardson-Barlow
Land 2025, 14(9), 1896; https://doi.org/10.3390/land14091896 - 16 Sep 2025
Cited by 2 | Viewed by 1387
Abstract
Emerging hybrid governance models are transforming conventional approaches to land-use regulation by simultaneously enabling urban–rural development and enforcing ecological safeguards. This study investigates the regulatory mechanisms underpinning China’s urban–rural land-use policies through an innovative mixed-methods approach, integrating systematic text analysis and the Institutional Grammar Tool (IGT). Drawing on a comprehensive dataset of 62 national policy documents (2012–2024), we employ textual coding and thematic clustering to identify seven core policy pathways, ranging from territorial spatial planning to ecological protection. These pathways are further deconstructed using IGT to assess their regulatory intensity, revealing a tripartite governance model: (1) flexible AIC-strategies (e.g., land market mechanisms), which enable local experimentation by specifying actors, aims, and conditions without rigid obligations; (2) adaptive ADIC-norms (e.g., collective land reforms), which balance central directives with localized discretion through conditional deontic rules; and (3) rigid ADICO-rules (e.g., ecological redlines), which enforce absolute compliance through binding sanctions. Through systematic analysis of land use policy regulations, we reveal how China’s hybrid governance system operationalizes a tripartite institutional logic—maintaining rigid regulatory control (ADICO-rules) in ecologically critical zones, adaptive policy experimentation (ADIC-norms) in transitional areas, and flexible market-based instruments (AIC-strategies) in development zones—thereby dynamically reconciling environmental conservation with socioeconomic diversification. The study advances both institutional theory through its grammatical analysis of policy instruments and governance theory by transcending the traditional command-and-control versus flexible governance dichotomy. 
Practically, the research offers actionable insights for policymakers in emerging economies, emphasizing spatially differentiated regulation, dynamic monitoring systems, and strategic coupling of binding rules with flexible implementation mechanisms. Full article
(This article belongs to the Section Land Socio-Economic and Political Issues)

27 pages, 1481 KB  
Article
Integration of Associative Tokens into Thematic Hyperspace: A Method for Determining Semantically Significant Clusters in Dynamic Text Streams
by Dmitriy Rodionov, Boris Lyamin, Evgenii Konnikov, Elena Obukhova, Gleb Golikov and Prokhor Polyakov
Big Data Cogn. Comput. 2025, 9(8), 197; https://doi.org/10.3390/bdcc9080197 - 25 Jul 2025
Cited by 1 | Viewed by 1305
Abstract
With the exponential growth of textual data, traditional topic modeling methods based on static analysis demonstrate limited effectiveness in tracking the dynamics of thematic content. This research aims to develop a method for quantifying the dynamics of topics within text corpora using a thematic signal (TS) function that accounts for temporal changes and semantic relationships. The proposed method combines associative tokens with original lexical units to reduce thematic entropy and information noise. Approaches employed include topic modeling (LDA), vector representations of texts (TF-IDF, Word2Vec), and time series analysis. The method was tested on a corpus of news texts (5000 documents). Results demonstrated robust identification of semantically meaningful thematic clusters. An inverse relationship was observed between the level of thematic significance and semantic diversity, confirming a reduction in entropy using the proposed method. This approach allows for quantifying topic dynamics, filtering noise, and determining the optimal number of clusters. Future applications include analyzing multilingual data and integration with neural network models. The method shows potential for monitoring information flows and predicting thematic trends. Full article

36 pages, 3724 KB  
Article
Security Hardening and Compliance Assessment of Kubernetes Control Plane and Workloads
by Zlatan Morić, Vedran Dakić and Tomislav Čavala
J. Cybersecur. Priv. 2025, 5(2), 30; https://doi.org/10.3390/jcp5020030 - 4 Jun 2025
Cited by 1 | Viewed by 5581
Abstract
Containerized applications are pivotal to contemporary cloud-native architectures, yet they present novel security challenges. Kubernetes, a prevalent open-source platform for container orchestration, provides robust automation but lacks inherent security measures. The intricate architecture and scattered security documentation may result in misconfigurations and vulnerabilities, jeopardizing system confidentiality, integrity, and availability. This paper analyzes the key aspects of Kubernetes security by combining theoretical examination with practical application, concentrating on architectural hardening, access control, image security, and compliance assessment. The text commences with a synopsis of Kubernetes architecture, networking, and storage, analyzing prevalent security issues in containerized environments. The emphasis transitions to practical methodologies for safeguarding clusters, encompassing image scanning, authentication and authorization, monitoring, and logging. The paper also examines recognized Kubernetes CVEs and illustrates vulnerability scanning on a local cluster. The objective is to deliver explicit, implementable recommendations for enhancing Kubernetes security, assisting organizations in constructing more robust containerized systems. Full article
(This article belongs to the Special Issue Cyber Security and Digital Forensics—2nd Edition)

28 pages, 5257 KB  
Article
Comparative Evaluation of Sequential Neural Network (GRU, LSTM, Transformer) Within Siamese Networks for Enhanced Job–Candidate Matching in Applied Recruitment Systems
by Mateusz Łępicki, Tomasz Latkowski, Izabella Antoniuk, Michał Bukowski, Bartosz Świderski, Grzegorz Baranik, Bogusz Nowak, Robert Zakowicz, Łukasz Dobrakowski, Bogdan Act and Jarosław Kurek
Appl. Sci. 2025, 15(11), 5988; https://doi.org/10.3390/app15115988 - 26 May 2025
Cited by 2 | Viewed by 2988
Abstract
Job–candidate matching is pivotal in recruitment, yet traditional manual or keyword-based methods can be laborious and prone to missing qualified candidates. In this study, we introduce the first Siamese framework that systematically contrasts GRU, LSTM, and Transformer sequential heads on top of a multilingual Sentence Transformer backbone, which is trained end-to-end with triplet loss on real-world recruitment data. This combination captures both long-range dependencies across document segments and global semantics, representing a substantial advance over approaches that rely solely on static embeddings. We compare the three heads using ranking metrics such as Top-K accuracy and Mean Reciprocal Rank (MRR). The Transformer-based model yields the best overall performance, with an MRR of 0.979 and a Top-100 accuracy of 87.20% on the test set. Visualization of learned embeddings (t-SNE) shows that self-attention more effectively clusters matching texts and separates them from irrelevant ones. These findings underscore the potential of combining multilingual base embeddings with specialized sequential layers to reduce manual screening efforts and improve recruitment efficiency. Full article
(This article belongs to the Special Issue Innovations in Artificial Neural Network Applications)
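The two ranking metrics named in the abstract above, Top-K accuracy and Mean Reciprocal Rank (MRR), can be sketched as follows. This is a minimal illustration using the standard definitions, assuming each query has exactly one correct match whose 1-based rank in the returned list is already known; the function names and sample ranks are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries, where rank is the 1-based
    position of the correct match in each ranked result list."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


def top_k_accuracy(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct match appears within the top k results."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


if __name__ == "__main__":
    # Hypothetical ranks of the correct candidate for three job postings.
    ranks = [1, 2, 4]
    print(mean_reciprocal_rank(ranks))
    print(top_k_accuracy(ranks, k=2))
```

MRR rewards placing the correct match near the top of the list, whereas Top-K accuracy only records whether it appears anywhere in the first K positions.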

15 pages, 2051 KB  
Article
Analysis of Short Texts Using Intelligent Clustering Methods
by Jamalbek Tussupov, Akmaral Kassymova, Ayagoz Mukhanova, Assyl Bissengaliyeva, Zhanar Azhibekova, Moldir Yessenova and Zhanargul Abuova
Algorithms 2025, 18(5), 289; https://doi.org/10.3390/a18050289 - 19 May 2025
Cited by 2 | Viewed by 2074
Abstract
This article presents a comprehensive review of short text clustering using state-of-the-art methods: Bidirectional Encoder Representations from Transformers (BERT), Term Frequency-Inverse Document Frequency (TF-IDF), and the novel hybrid method Latent Dirichlet Allocation + BERT + Autoencoder (LDA + BERT + AE). The article begins by outlining the theoretical foundation of each technique and its merits and limitations. BERT is critiqued for its ability to understand word dependence in text, while TF-IDF is lauded for its applicability in terms of importance assessment. The experimental section compares the efficacy of these methods in clustering short texts, with a specific focus on the hybrid LDA + BERT + AE approach. A detailed examination of the LDA-BERT model’s training and validation loss over 200 epochs shows that the loss values start above 1.2 and quickly decrease to around 0.8 within the first 25 epochs, eventually stabilizing at approximately 0.4. The close alignment of these curves suggests the model’s practical learning and generalization capabilities, with minimal overfitting. The study demonstrates that the hybrid LDA + BERT + AE method significantly enhances text clustering quality compared to individual methods. Based on the findings, the study recommends the optimum choice and use of clustering methods for different short texts and natural language processing operations. The applications of these methods in industrial and educational settings, where successful text handling and categorization are critical, are also addressed. The study ends by emphasizing the importance of the holistic handling of short texts for deeper semantic comprehension and effective information retrieval. Full article
(This article belongs to the Section Databases and Data Structures)

13 pages, 1277 KB  
Article
Variations in Sleep, Fatigue, and Difficulty with Concentration Among Emergency Medical Services Clinicians During Shifts of Different Durations
by Paul D. Patterson, Sarah E. Martin, Sean A. MacAllister, Matthew D. Weaver and Charity G. Patterson
Int. J. Environ. Res. Public Health 2025, 22(4), 573; https://doi.org/10.3390/ijerph22040573 - 6 Apr 2025
Cited by 1 | Viewed by 1906
Abstract
We sought to characterize momentary changes in fatigue, sleepiness, and difficulty with concentration during short and long duration shifts worked by emergency medical services (EMS) and fire personnel across the United States. In addition, we tested for differences in pre-shift and on-shift sleep stratified by shift duration. We examined real-time mobile-phone text message queries during scheduled shifts from the EMS Sleep Health Study, a nationwide, cluster-randomized trial (ClinicalTrials.gov Identifier: NCT04218279). Linear mixed effects models were used and Bonferroni p-values reported for multiple comparisons. In total, 388 EMS clinicians from 35 EMS/fire agencies documented 4573 shifts and responded to 64.6% of 161,888 text message queries. Most shifts (85.5%) were 12 or 24 h in duration. Mean sleep hours pre-shift was 6.2 (SD 1.9) and mean sleep hours on shift was 3.4 (SD 2.9) and varied by shift duration (p < 0.0001). The highest level of fatigue, sleepiness, and difficulty with concentration during any shift occurred during 24 h shifts and corresponded to the early morning hours at 03:00 or 04:00 a.m. The real-time assessments of sleep hours and fatigue in this study revealed deficits in sleep health for EMS and fire personnel and critical time points for intervention during shifts when the risk to safety is high. Full article
(This article belongs to the Topic New Research in Work-Related Diseases, Safety and Health)

17 pages, 4169 KB  
Article
Benchmarking Interpretability in Healthcare Using Pattern Discovery and Disentanglement
by Pei-Yuan Zhou, Amane Takeuchi, Fernando Martinez-Lopez, Malikeh Ehghaghi, Andrew K. C. Wong and En-Shiun Annie Lee
Bioengineering 2025, 12(3), 308; https://doi.org/10.3390/bioengineering12030308 - 18 Mar 2025
Cited by 1 | Viewed by 1586
Abstract
The healthcare industry seeks to integrate AI into clinical applications, yet understanding AI decision making remains a challenge for healthcare practitioners as these systems often function as black boxes. Our work benchmarks the Pattern Discovery and Disentanglement (PDD) system’s unsupervised learning algorithm, which provides interpretable outputs and clustering results from clinical notes to aid decision making. Using the MIMIC-IV dataset, we process free-text clinical notes and ICD-9 codes with Term Frequency-Inverse Document Frequency and Topic Modeling. The PDD algorithm discretizes numerical features into event-based features, discovers association patterns from a disentangled statistical feature value association space, and clusters clinical records. The output is an interpretable knowledge base linking knowledge, patterns, and data to support decision making. Despite being unsupervised, PDD demonstrated performance comparable to supervised deep learning models, validating its clustering ability and knowledge representation. We benchmark interpretability techniques—Feature Permutation, Gradient SHAP, and Integrated Gradients—on the best-performing models (in terms of F1, ROC AUC, balanced accuracy, etc.), evaluating these based on sufficiency, comprehensiveness, and sensitivity metrics. Our findings highlight the limitations of feature importance ranking and post hoc analysis for clinical diagnosis. Meanwhile, PDD’s global interpretability effectively compensates for these issues, helping healthcare practitioners understand the decision-making process and providing suggestive clusters of diseases to assist their diagnosis. Full article
(This article belongs to the Special Issue Mathematical Models for Medical Diagnosis and Testing)

32 pages, 1286 KB  
Article
Real-Time Fuzzy Record-Matching Similarity Metric and Optimal Q-Gram Filter
by Ondřej Rozinek, Jaroslav Marek, Jan Panuš and Jan Mareš
Algorithms 2025, 18(3), 150; https://doi.org/10.3390/a18030150 - 6 Mar 2025
Cited by 2 | Viewed by 2578
Abstract
In this paper, we introduce an advanced Fuzzy Record Similarity Metric (FRMS) that improves approximate record matching and models human perception of record similarity. The FRMS utilizes a newly developed similarity space with favorable properties combined with a metric space, employing a bag-of-words model with general applications in text mining and cluster analysis. To optimize the FRMS, we propose a two-stage method for approximate string matching and search that outperforms baseline methods in terms of average time complexity and F-measure on various datasets. In the first stage, we construct an optimal Q-gram count filter as an optimal lower bound for fuzzy token similarities such as FRMS. The approximated Q-gram count filter achieves a high accuracy rate, filtering over 99% of dissimilar records, with a constant time complexity of O(1). In the second stage, FRMS runs for a polynomial time of approximately O(n^4) and models human perception of record similarity by maximum weight matching in a bipartite graph. The FRMS architecture has widespread applications in structured document storage such as databases and has already been commercialized by one of the largest IT companies. As a side result, we explain the behavior of the singularity of the Q-gram filter and the advantages of a padding extension. Overall, our method provides a more accurate and efficient approach to approximate string matching and search with real-time runtime. Full article
(This article belongs to the Section Analysis of Algorithms and Complexity Theory)
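The idea of a Q-gram count filter mentioned in the abstract above can be illustrated with the classic, unpadded version of the technique. This is a minimal sketch, assuming the well-known bound that two strings within edit distance k share at least max(|a|, |b|) - q + 1 - k*q q-grams; it is not the optimal filter the paper constructs, it runs in time linear in the string lengths rather than O(1), and all function names are hypothetical.

```python
from collections import Counter


def qgrams(s: str, q: int = 2) -> Counter:
    """Bag (multiset) of all overlapping substrings of length q; no padding is applied."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(a: str, b: str, q: int = 2) -> int:
    """Size of the multiset intersection of the two q-gram bags."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


def passes_count_filter(a: str, b: str, k: int, q: int = 2) -> bool:
    """Return False only when a and b cannot be within edit distance k.

    A True result means the pair survives the cheap filter and still needs
    the expensive similarity computation; it does not guarantee a match.
    """
    threshold = max(len(a), len(b)) - q + 1 - k * q
    return shared_qgrams(a, b, q) >= threshold


if __name__ == "__main__":
    print(passes_count_filter("smith", "smyth", k=1))     # candidate pair kept
    print(passes_count_filter("kitten", "sitting", k=1))  # pair discarded early
```

The design point is that the filter is only a necessary condition: it cheaply discards pairs that cannot match, so the costly second-stage comparison runs on a small surviving fraction of records.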