Search Results (266)

Search Parameters:
Keywords = bag-of-words

20 pages, 1030 KB  
Article
VISTA: A Multi-View, Hierarchical, and Interpretable Framework for Robust Topic Modelling
by Tvrtko Glunčić, Domjan Barić and Matko Glunčić
Mach. Learn. Knowl. Extr. 2025, 7(4), 162; https://doi.org/10.3390/make7040162 - 8 Dec 2025
Abstract
Topic modeling is a fundamental technique in natural language processing used to uncover latent themes in large text corpora, yet existing approaches struggle to jointly achieve interpretability, semantic coherence, and scalability. Classical probabilistic models such as LDA and NMF rely on bag-of-words assumptions that obscure contextual meaning, while embedding-based methods (e.g., BERTopic, Top2Vec) improve coherence at the expense of diversity and stability. Prompt-based frameworks (e.g., TopicGPT) enhance interpretability but remain sensitive to prompt design and are computationally costly on large datasets. This study introduces VISTA (Vector-Similarity Topic Analysis), a multi-view, hierarchical, and interpretable framework that integrates complementary document embeddings, mutual-nearest-neighbor hierarchical clustering with selective dimension analysis, and large language model (LLM)-based topic labeling enforcing hierarchical coherence. Experiments on three heterogeneous corpora—BBC News, BillSum, and a mixed U.S. Government agency news + Twitter dataset—show that VISTA consistently ranks among the top-performing models, achieving the highest C_UCI coherence and a strong balance between topic diversity and semantic consistency. Qualitative analyses confirm that VISTA identifies domain-relevant themes overlooked by probabilistic or prompt-based models. Overall, VISTA provides a scalable, semantically robust, and interpretable framework for topic discovery, bridging probabilistic, embedding-based, and LLM-driven paradigms in a unified and reproducible design.
(This article belongs to the Section Visualization)
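
The abstract names a mutual-nearest-neighbor merge criterion for hierarchical clustering of document embeddings but gives no implementation. As a minimal sketch of that one step, assuming cosine similarity and toy data (this is not the authors' code):

```python
import numpy as np

def mutual_nn_pairs(embeddings: np.ndarray) -> list:
    """Index pairs (i, j) that are each other's nearest neighbor under
    cosine similarity -- a candidate merge rule for hierarchical clustering."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T           # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)    # exclude self-matches
    nn = sim.argmax(axis=1)           # nearest neighbor of each row
    return [(i, j) for i, j in enumerate(nn) if i < j and nn[j] == i]

# Toy usage: rows 0/1 and 2/3 are near-duplicates, so they pair up.
docs = np.array([[1.0, 0.1, 0.0], [0.9, 0.2, 0.0],
                 [0.0, 1.0, 0.9], [0.1, 0.9, 1.0]])
print(mutual_nn_pairs(docs))  # [(0, 1), (2, 3)]
```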

21 pages, 737 KB  
Article
A Comparative Analysis of Corporate Sustainability Reporting: A Multi-Method Approach to China and the United States
by Qiao Meng, Daniel Knapp, Leo Brecht and Roland Eckert
Sustainability 2025, 17(22), 10315; https://doi.org/10.3390/su172210315 - 18 Nov 2025
Abstract
The increasing importance of sustainability reporting requires a deeper understanding of how companies communicate their sustainability efforts across regions and sectors. This study focuses on China and the United States. By analyzing corporate sustainability reports from these two major economies in 2022, it evaluates the effects of regional and sectoral differences on sustainable practices, with the aim of deepening the understanding of organizational sustainability. Using topic modeling, this study identified the key topics and patterns that companies in the two countries prioritize in their corporate sustainability reporting. A bag-of-words approach was adopted to analyze the attitudes of corporations in the two countries toward environmental, social, and governance dimensions, with a focus on sector-specific differences. Finally, sentiment analysis with ClimateBERT assessed the tone of the reports. The findings reveal similarities and sector-specific differences in corporate sustainability reporting between China and the United States, as well as divergent emphases on climate-related risks and opportunities. This study offers a multi-method approach to evaluating corporate sustainability reporting, contributing to a better understanding of sustainability practices in different national and industrial contexts and offering practical guidance for industry regulators and stakeholders.
(This article belongs to the Section Economic and Business Aspects of Sustainability)
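
As an illustration of the bag-of-words attitude analysis the abstract describes, here is a minimal sketch of a dictionary-based tally over ESG dimensions; the term lists below are invented placeholders, not the paper's lexicons:

```python
from collections import Counter
import re

# Hypothetical mini-dictionaries; the paper's actual ESG word lists are not given.
ESG_TERMS = {
    "environmental": {"emissions", "carbon", "renewable", "waste"},
    "social":        {"employees", "community", "diversity", "safety"},
    "governance":    {"board", "audit", "compliance", "ethics"},
}

def esg_profile(report_text: str) -> dict:
    """Bag-of-words tally: how often each ESG dimension's terms occur."""
    tokens = Counter(re.findall(r"[a-z]+", report_text.lower()))
    return {dim: sum(tokens[t] for t in terms) for dim, terms in ESG_TERMS.items()}

print(esg_profile("The board reviewed carbon emissions and workplace safety."))
# {'environmental': 2, 'social': 1, 'governance': 1}
```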

14 pages, 1592 KB  
Article
Fine-Tuning Large Language Models for Effective Nutrition Support in Residential Aged Care: A Domain Expertise Approach
by Mohammad Alkhalaf, Dinithi Vithanage, Jun Shen, Hui Chen (Rita) Chang, Chao Deng and Ping Yu
Healthcare 2025, 13(20), 2614; https://doi.org/10.3390/healthcare13202614 - 17 Oct 2025
Abstract
Background: Malnutrition is a serious health concern among older adults in residential aged care (RAC), and timely identification is critical for effective intervention. Recent advancements in transformer-based large language models (LLMs), such as RoBERTa, provide context-aware embeddings that improve predictive performance in clinical tasks. Fine-tuning these models on domain-specific corpora, like nursing progress notes, can further enhance their applicability in healthcare. Methodology: We developed a RAC domain-specific LLM by training RoBERTa on 500,000 nursing progress notes from RAC electronic health records (EHRs). The model’s embeddings were used for two downstream tasks: malnutrition note identification and malnutrition prediction. Long sequences were truncated to 1536 tokens and processed in 512-token segments to fit RoBERTa’s input limit. Performance was compared against Bag of Words, GloVe, baseline RoBERTa, BlueBERT, ClinicalBERT, BioClinicalBERT, and PubMed models. Results: Using 5-fold cross-validation, the RAC domain-specific LLM outperformed the other models. For malnutrition note identification, it achieved an F1-score of 0.966, and for malnutrition prediction, it achieved an F1-score of 0.687. Conclusions: This approach demonstrates the feasibility of developing specialised LLMs for identifying and predicting malnutrition among older adults in RAC. Future work includes further optimisation of prediction performance and integration with clinical workflows to support early intervention.
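
The segmentation step (1536-token truncation processed in 512-token pieces) can be sketched with the Hugging Face tokenizer; the checkpoint name is an assumed stand-in, and this is not the authors' pipeline:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # assumed checkpoint

def segment_ids(note: str, seg_len: int = 510, cap: int = 1536) -> list:
    """Truncate a long nursing note to `cap` tokens, then split it into
    pieces that, once <s> and </s> are added back, fit the 512-token limit."""
    ids = tokenizer(note, add_special_tokens=False,
                    truncation=True, max_length=cap)["input_ids"]
    return [ids[i:i + seg_len] for i in range(0, len(ids), seg_len)]
```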

20 pages, 2227 KB  
Article
Tuberculosis Detection from Cough Recordings Using Bag-of-Words Classifiers
by Irina Pavel and Iulian B. Ciocoiu
Sensors 2025, 25(19), 6133; https://doi.org/10.3390/s25196133 - 3 Oct 2025
Abstract
The paper proposes the use of Bag-of-Words classifiers for the reliable detection of tuberculosis infection from cough recordings. The effect of using distinct feature extraction procedures and encoding strategies, both independently and in combination, is evaluated in terms of standard performance metrics such as the Area Under Curve (AUC), accuracy, sensitivity, and F1-score. Experiments were conducted on two distinct large datasets, using both the original recordings and extended versions obtained by augmentation techniques. Performances were assessed by repeated k-fold cross-validation and by employing external datasets. An extensive ablation study revealed that the proposed approach yields up to 0.77 accuracy and 0.84 AUC values, comparing favorably against existing solutions and exhibiting robustness against various combinations of the setup parameters.
(This article belongs to the Section Biomedical Sensors)
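
Bag-of-words over audio is typically realized as a codebook over frame-level features. A minimal sketch under that assumption, using MFCC frames and a k-means codebook (librosa and scikit-learn; not the authors' exact feature set):

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def bow_encode(wav_paths: list, n_words: int = 64) -> np.ndarray:
    """Encode each cough recording as a histogram over an MFCC codebook."""
    mfccs = [librosa.feature.mfcc(y=librosa.load(p, sr=16000)[0],
                                  sr=16000, n_mfcc=13).T for p in wav_paths]
    codebook = KMeans(n_clusters=n_words, n_init=10, random_state=0)
    codebook.fit(np.vstack(mfccs))                   # pool frames, learn codewords
    hists = np.array([np.bincount(codebook.predict(m), minlength=n_words)
                      for m in mfccs], dtype=float)
    return hists / hists.sum(axis=1, keepdims=True)  # per-recording histogram
```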

16 pages, 1007 KB  
Article
Learning SMILES Semantics: Word2Vec and Transformer Embeddings for Molecular Property Prediction
by Saya Hashemian, Zak Khan, Pulkit Kalhan and Yang Liu
Algorithms 2025, 18(9), 547; https://doi.org/10.3390/a18090547 - 1 Sep 2025
Abstract
This paper investigates the effectiveness of Word2Vec-based molecular representation learning on SMILES (Simplified Molecular Input Line Entry System) strings for a downstream prediction task related to the market approvability of chemical compounds. Here, market approvability is treated as a proxy classification label derived from approval status, where only the molecular structure is analyzed. We train character-level embeddings using Continuous Bag of Words (CBOW) and Skip-Gram with Negative Sampling architectures and apply the resulting embeddings in a downstream classification task using a multi-layer perceptron (MLP). To evaluate the utility of these lightweight embedding techniques, we conduct experiments on a curated SMILES dataset labeled by approval status under both imbalanced and SMOTE-balanced training conditions. In addition to our Word2Vec-based models, we include a ChemBERTa-based baseline using the pretrained ChemBERTa-77M model. Our findings show that while ChemBERTa achieves higher performance, the Word2Vec-based models offer a favorable trade-off between accuracy and computational efficiency. This efficiency is especially relevant in large-scale compound screening, where rapid exploration of the chemical space can support early-stage cheminformatics workflows. These results suggest that traditional embedding models can serve as viable alternatives for scalable and interpretable cheminformatics pipelines, particularly in resource-constrained environments.
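
The CBOW and Skip-Gram-with-negative-sampling training the abstract mentions maps directly onto gensim's Word2Vec; a character-level toy sketch, with corpus and hyperparameters chosen for illustration only:

```python
import numpy as np
from gensim.models import Word2Vec

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]   # toy corpus
corpus = [list(s) for s in smiles]                       # character-level "sentences"

# sg=0 -> CBOW, sg=1 -> Skip-Gram; negative=5 turns on negative sampling.
cbow = Word2Vec(corpus, vector_size=32, window=5, sg=0, negative=5,
                min_count=1, epochs=50, seed=0)
sgns = Word2Vec(corpus, vector_size=32, window=5, sg=1, negative=5,
                min_count=1, epochs=50, seed=0)

# One simple molecule vector: the mean of its character embeddings.
mol_vec = np.mean([cbow.wv[ch] for ch in smiles[0]], axis=0)
print(mol_vec.shape)  # (32,)
```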

29 pages, 1051 KB  
Article
Urdu Toxicity Detection: A Multi-Stage and Multi-Label Classification Approach
by Ayesha Rashid, Sajid Mahmood, Usman Inayat and Muhammad Fahad Zia
AI 2025, 6(8), 194; https://doi.org/10.3390/ai6080194 - 21 Aug 2025
Abstract
Social media empowers freedom of expression but is often misused for abuse and hate. The detection of such content is crucial, especially in under-resourced languages like Urdu. To address this challenge, this paper first presents a comprehensive multilabel dataset, the Urdu toxicity corpus (UTC). Second, an Urdu toxicity detection model is developed, which detects toxic content from an Urdu dataset written in Nastaliq font. The proposed framework initially preprocesses the gathered data and then applies feature engineering using term frequency-inverse document frequency, bag-of-words, and N-gram techniques. Subsequently, the synthetic minority over-sampling technique is used to address the data imbalance problem, and manual data annotation is performed to ensure label accuracy. Four machine learning models, namely logistic regression, support vector machine, random forest (RF), and gradient boosting, are applied to the preprocessed data. The results indicate that RF outperformed the other models on all evaluation metrics. Deep learning algorithms, including long short-term memory (LSTM), Bidirectional LSTM, and gated recurrent unit, were also applied to UTC for classification purposes. Random forest again outperforms the other models, achieving a precision, recall, F1-score, and accuracy of 0.97, 0.99, 0.98, and 0.99, respectively. The proposed model demonstrates strong potential for detecting rude, offensive, abusive, and hate speech content in Urdu Nastaliq user comments.
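
A hedged sketch of the feature-engineering, SMOTE, and random-forest pipeline described above, using scikit-learn and imbalanced-learn (hyperparameters are placeholders, not the paper's):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline   # sampler-aware pipeline

# Word uni/bi-grams as TF-IDF features; SMOTE resamples only during fit,
# so test data are never synthetically altered.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("smote", SMOTE(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
])
# clf.fit(train_texts, train_labels); preds = clf.predict(test_texts)
```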

23 pages, 978 KB  
Article
Emotional Analysis in a Morphologically Rich Language: Enhancing Machine Learning with Psychological Feature Lexicons
by Ron Keinan, Efraim Margalit and Dan Bouhnik
Electronics 2025, 14(15), 3067; https://doi.org/10.3390/electronics14153067 - 31 Jul 2025
Abstract
This paper explores emotional analysis in Hebrew texts, focusing on improving machine learning techniques for depression detection by integrating psychological feature lexicons. Hebrew’s complex morphology makes emotional analysis challenging, and this study seeks to address that by combining traditional machine learning methods with sentiment lexicons. The dataset consists of over 350,000 posts from 25,000 users on the health-focused social network “Camoni” from 2010 to 2021. Various machine learning models—SVM, Random Forest, Logistic Regression, and Multi-Layer Perceptron—were used, alongside ensemble techniques like Bagging, Boosting, and Stacking. TF-IDF was applied for feature selection, with word and character n-grams, and pre-processing steps like punctuation removal, stop word elimination, and lemmatization were performed to handle Hebrew’s linguistic complexity. The models were enriched with sentiment lexicons curated by professional psychologists. The study demonstrates that integrating sentiment lexicons significantly improves classification accuracy. Specific lexicons—such as those for negative and positive emojis, hostile words, anxiety words, and no-trust words—were particularly effective in enhancing model performance. Our best model classified depression with an accuracy of 84.1%. These findings offer insights into depression detection, suggesting that practitioners in mental health and social work can improve their machine learning models for detecting depression in online discourse by incorporating emotion-based lexicons. The societal impact of this work lies in its potential to improve the detection of depression in online Hebrew discourse, offering more accurate and efficient methods for mental health interventions in online communities.
(This article belongs to the Special Issue Techniques and Applications of Multimodal Data Fusion)
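
One way to combine TF-IDF n-grams with lexicon-derived features, as the study does, is a scikit-learn FeatureUnion; a minimal sketch in which `ANXIETY_WORDS` is an invented stand-in for the psychologist-curated lexicons:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

ANXIETY_WORDS = {"afraid", "worried", "panic"}   # invented placeholder lexicon

class LexiconCounter(BaseEstimator, TransformerMixin):
    """One extra feature per post: how many lexicon words it contains."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[sum(w in ANXIETY_WORDS for w in x.lower().split())]
                         for x in X])

model = Pipeline([
    ("features", FeatureUnion([
        ("word_tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("char_tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
        ("lexicon", LexiconCounter()),
    ])),
    ("svm", LinearSVC()),
])
# model.fit(posts, labels) trains on the n-grams and lexicon count jointly.
```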

22 pages, 579 KB  
Article
Automated Classification of Crime Narratives Using Machine Learning and Language Models in Official Statistics
by Klaus Lehmann, Elio Villaseñor, Alejandro Pimentel, Javiera Preuss, Nicolás Berhó, Oswaldo Diaz and Ignacio Agloni
Stats 2025, 8(3), 68; https://doi.org/10.3390/stats8030068 - 30 Jul 2025
Abstract
This paper presents the implementation of a language model–based strategy for the automatic codification of crime narratives for the production of official statistics. To address the high workload and inconsistencies associated with manual coding, we developed and evaluated three models: an XGBoost classifier with bag-of-words and word-embedding features, an LSTM network using pretrained Spanish word embeddings as a language model, and a fine-tuned BERT language model (BETO). Deep learning models outperformed the traditional baseline, with BETO achieving the highest accuracy. The new ENUSC (Encuesta Nacional Urbana de Seguridad Ciudadana) workflow integrates the selected model into an API for automated classification, incorporating a certainty threshold to distinguish between cases suitable for automation and those requiring expert review. This hybrid strategy led to a 68.4% reduction in manual review workload while preserving high-quality standards. This study represents the first documented application of deep learning for the automated classification of victimization narratives in official statistics, demonstrating its feasibility and impact in a real-world production environment. Our results demonstrate that deep learning can significantly improve the efficiency and consistency of crime statistics coding, offering a scalable solution for other national statistical offices.
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
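
The certainty-threshold routing is straightforward to sketch; the 0.9 cut-off below is illustrative, not the ENUSC production threshold:

```python
import numpy as np

def route(probs: np.ndarray, threshold: float = 0.9):
    """Split predictions into auto-coded cases and cases for expert review.

    `probs` holds one softmax row per narrative; 0.9 is an assumed cut-off."""
    certainty = probs.max(axis=1)
    auto = np.where(certainty >= threshold)[0]    # codified automatically
    review = np.where(certainty < threshold)[0]   # sent to a human coder
    return probs.argmax(axis=1), auto, review
```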

32 pages, 465 KB  
Article
EsCorpiusBias: The Contextual Annotation and Transformer-Based Detection of Racism and Sexism in Spanish Dialogue
by Ksenia Kharitonova, David Pérez-Fernández, Javier Gutiérrez-Hernando, Asier Gutiérrez-Fandiño, Zoraida Callejas and David Griol
Future Internet 2025, 17(8), 340; https://doi.org/10.3390/fi17080340 - 28 Jul 2025
Abstract
The rise of online communication platforms has significantly increased exposure to harmful discourse, presenting ongoing challenges for digital moderation and user well-being. This paper introduces the EsCorpiusBias corpus, designed to enhance the automated detection of sexism and racism within Spanish-language online dialogue, specifically sourced from the Mediavida forum. By means of a systematic, context-sensitive annotation protocol, approximately 1000 three-turn dialogue units per bias category are annotated, ensuring the nuanced recognition of pragmatic and conversational subtleties. Here, annotation guidelines are meticulously developed, covering explicit and implicit manifestations of sexism and racism. Annotations are performed using the Prodigy tool (v1.16.0), resulting in moderate to substantial inter-annotator agreement (Cohen’s Kappa: 0.55 for sexism and 0.79 for racism). Models including logistic regression, SpaCy’s baseline n-gram bag-of-words model, and transformer-based BETO are trained and evaluated, demonstrating that contextualized transformer-based approaches significantly outperform baseline and general-purpose models. Notably, the single-turn BETO model achieves an ROC-AUC of 0.94 for racism detection, while the contextual BETO model reaches an ROC-AUC of 0.87 for sexism detection, highlighting BETO’s superior effectiveness in capturing nuanced bias in online dialogues. Additionally, lexical overlap analyses indicate a strong reliance on explicit lexical indicators, highlighting limitations in handling implicit biases. This research underscores the importance of contextually grounded, domain-specific fine-tuning for effective automated detection of toxicity, providing robust resources and methodologies to foster socially responsible NLP systems within Spanish-speaking online communities.
(This article belongs to the Special Issue Deep Learning and Natural Language Processing—3rd Edition)
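
Inter-annotator agreement figures like the Cohen's Kappa values above can be reproduced with scikit-learn; a toy check on invented annotator labels:

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators' binary labels over the same dialogue units (toy data).
ann_a = [1, 0, 1, 1, 0, 0, 1, 0]
ann_b = [1, 0, 1, 0, 0, 0, 1, 1]
# 6/8 raw agreement, 0.5 expected by chance -> kappa = 0.5
print(round(cohen_kappa_score(ann_a, ann_b), 2))  # 0.5
```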

18 pages, 591 KB  
Article
Active Learning for Medical Article Classification with Bag of Words and Bag of Concepts Embeddings
by Radosław Pytlak, Paweł Cichosz, Bartłomiej Fajdek and Bogdan Jastrzębski
Appl. Sci. 2025, 15(14), 7955; https://doi.org/10.3390/app15147955 - 17 Jul 2025
Abstract
Systems supporting systematic literature reviews often use machine learning algorithms to create classification models to assess the relevance of articles to study topics. The proper choice of text representation for such algorithms may have a significant impact on their predictive performance. This article presents an in-depth investigation of the utility of the bag of concepts representation for this purpose, which can be considered an enhanced form of the ubiquitous bag of words representation, with features corresponding to ontology concepts rather than words. Its utility is evaluated in the active learning setting, in which a sequence of classification models is created, with training data iteratively expanded by adding articles selected for human screening. Different versions of the bag of concepts are compared with bag of words, as well as with combined representations, including both word-based and concept-based features. The evaluation uses the support vector machine, naive Bayes, and random forest algorithms and is performed on datasets from 15 systematic medical literature review studies. The results show that concept-based features may have additional predictive value in comparison to standard word-based features and that the combined bag of concepts and bag of words representation is the most useful overall.
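
A minimal sketch of one active learning iteration of the kind the evaluation setting implies; the SVM margin criterion and batch size here are assumptions, not the paper's protocol:

```python
import numpy as np
from sklearn.svm import SVC

def screening_round(model, X_labeled, y_labeled, X_pool, batch: int = 10):
    """One active learning iteration: retrain, then pick the pool articles
    the classifier is least certain about for the next human screening batch."""
    model.fit(X_labeled, y_labeled)
    margin = np.abs(model.decision_function(X_pool))   # distance to the boundary
    return np.argsort(margin)[:batch]                  # indices to hand to reviewers

# model = SVC(kernel="linear")  # label the returned batch, append it, repeat
```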

34 pages, 5774 KB  
Article
Approach to Semantic Visual SLAM for Bionic Robots Based on Loop Closure Detection with Combinatorial Graph Entropy in Complex Dynamic Scenes
by Dazheng Wang and Jingwen Luo
Biomimetics 2025, 10(7), 446; https://doi.org/10.3390/biomimetics10070446 - 6 Jul 2025
Abstract
In complex dynamic environments, the performance of SLAM systems on bionic robots is susceptible to interference from dynamic objects or structural changes in the environment. To address this problem, we propose a semantic visual SLAM (vSLAM) algorithm based on loop closure detection with combinatorial graph entropy. First, building on the dynamic feature detection results of YOLOv8-seg, feature points at the edges of dynamic objects are finely judged by calculating the mean absolute deviation (MAD) of their pixel depths. Then, a high-quality keyframe selection strategy is constructed by combining the semantic information, the average coordinates of the semantic objects, and the degree of variation in the dense region of feature points. Subsequently, the unweighted and weighted graphs of keyframes are constructed according to the distribution of feature points, characterization points, and semantic information, and then a high-performance loop closure detection method based on combinatorial graph entropy is developed. The experimental results show that our loop closure detection approach exhibits higher precision and recall in real scenes compared to the bag-of-words (BoW) model. Compared with ORB-SLAM2, the absolute trajectory accuracy in high-dynamic sequences improved by an average of 97.01%, while the number of extracted keyframes decreased by an average of 61.20%.
(This article belongs to the Special Issue Artificial Intelligence for Autonomous Robots: 3rd Edition)
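
The MAD-based edge-point check can be sketched in a few lines; the deviation measure and cut-off `k` here are assumptions rather than the paper's exact formulation:

```python
import numpy as np

def mad_filter(depths: np.ndarray, k: float = 2.5) -> np.ndarray:
    """Flag feature points whose depth deviates strongly from the local
    median, a proxy for points straddling a dynamic object's edge.

    `depths` holds each candidate point's depth; `k` is an assumed cut-off."""
    med = np.median(depths)
    mad = np.mean(np.abs(depths - med))     # mean absolute deviation
    return np.abs(depths - med) > k * mad   # True -> discard as unreliable
```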

24 pages, 41430 KB  
Article
An Optimal Viewpoint-Guided Visual Indexing Method for UAV Autonomous Localization
by Zhiyang Ye, Yukun Zheng, Zheng Ji and Wei Liu
Remote Sens. 2025, 17(13), 2194; https://doi.org/10.3390/rs17132194 - 25 Jun 2025
Abstract
The autonomous positioning of drone-based remote sensing plays an important role in navigation in urban environments. Due to GNSS (Global Navigation Satellite System) signal occlusion, obtaining precise drone locations is still a challenging issue. Inspired by vision-based positioning methods, we propose an autonomous positioning method based on multi-view reference images rendered from the scene’s 3D geometric mesh and apply a bag-of-words (BoW) image retrieval pipeline to achieve efficient and scalable positioning, without utilizing deep learning-based retrieval or 3D point cloud registration. To minimize the number of reference images, scene coverage quantification and optimization are employed to generate the optimal viewpoints. The proposed method jointly exploits a visual bag-of-words tree to accelerate reference image retrieval and improve retrieval accuracy, and the Perspective-n-Point (PnP) algorithm is utilized to obtain the drone’s pose. Experiments conducted in real-world urban scenarios show that positioning errors are decreased, with accuracy ranging from sub-meter to 5 m and an average latency of 0.7–1.3 s; this indicates that our method significantly improves accuracy and latency, offering robust, real-time performance over extensive areas without relying on GNSS or dense point clouds.
(This article belongs to the Section Engineering Remote Sensing)
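
The final pose step uses the standard Perspective-n-Point solver; a sketch with OpenCV's RANSAC variant on toy correspondences (the intrinsics `K` and the point sets are placeholders, not the paper's data):

```python
import cv2
import numpy as np

# Toy 2D-3D correspondences standing in for matches against the rendered view.
object_pts = np.random.rand(12, 3).astype(np.float32)         # mesh points (toy)
image_pts = (np.random.rand(12, 2) * 500).astype(np.float32)  # matched pixels (toy)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float32)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_pts, image_pts, K, None)
if ok:
    R, _ = cv2.Rodrigues(rvec)           # rotation vector -> rotation matrix
    drone_pos = (-R.T @ tvec).ravel()    # camera center in the world frame
```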

18 pages, 839 KB  
Article
From Narratives to Diagnosis: A Machine Learning Framework for Classifying Sleep Disorders in Aging Populations: The sleepCare Platform
by Christos A. Frantzidis
Brain Sci. 2025, 15(7), 667; https://doi.org/10.3390/brainsci15070667 - 20 Jun 2025
Abstract
Background/Objectives: Sleep disorders are prevalent among aging populations and are often linked to cognitive decline, chronic conditions, and reduced quality of life. Traditional diagnostic methods, such as polysomnography, are resource-intensive and limited in accessibility. Meanwhile, individuals frequently describe their sleep experiences through unstructured narratives in clinical notes, online forums, and telehealth platforms. This study proposes a machine learning pipeline (sleepCare) that classifies sleep-related narratives into clinically meaningful categories, including stress-related, neurodegenerative, and breathing-related disorders. The proposed framework employs natural language processing (NLP) and machine learning techniques to support remote applications and real-time patient monitoring, offering a scalable solution for the early identification of sleep disturbances. Methods: sleepCare consists of a three-tiered classification pipeline to analyze narrative sleep reports. First, a baseline model used a Multinomial Naïve Bayes classifier with n-gram features from a Bag-of-Words representation. Next, a Support Vector Machine (SVM) was trained on GloVe-based word embeddings to capture semantic context. Finally, a transformer-based model (BERT) was fine-tuned to extract contextual embeddings, using the [CLS] token as input for SVM classification. Each model was evaluated using stratified train-test splits and 10-fold cross-validation. Hyperparameter tuning via GridSearchCV optimized performance. The dataset contained 475 labeled sleep narratives, classified into five etiological categories relevant for clinical interpretation. Results: The transformer-based model utilizing BERT embeddings and an optimized Support Vector Machine classifier achieved an overall accuracy of 81% on the test set. Class-wise F1-scores ranged from 0.72 to 0.91, with the highest performance observed in classifying normal or improved sleep (F1 = 0.91). The macro average F1-score was 0.78, indicating balanced performance across all categories. GridSearchCV identified the optimal SVM parameters (C = 4, kernel = ‘rbf’, gamma = 0.01, degree = 2, class_weight = ‘balanced’). The confusion matrix revealed robust classification with limited misclassifications, particularly between overlapping symptom categories such as stress-related and neurodegenerative sleep disturbances. Conclusions: Unlike generic large language model applications, our approach emphasizes the personalized identification of sleep symptomatology through targeted classification of the narrative input. By integrating structured learning with contextual embeddings, the framework offers a clinically meaningful, scalable solution for early detection and differentiation of sleep disorders in diverse, real-world, and remote settings.
(This article belongs to the Special Issue Perspectives of Artificial Intelligence (AI) in Aging Neuroscience)
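
The third tier ([CLS] embeddings fed to an SVM) can be sketched with Hugging Face and scikit-learn; the checkpoint is an assumed stand-in, while the SVM settings are those the abstract reports from GridSearchCV:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.svm import SVC

tok = AutoTokenizer.from_pretrained("bert-base-uncased")       # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def cls_embed(texts: list) -> torch.Tensor:
    """Return each narrative's [CLS] vector as its feature representation."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return bert(**batch).last_hidden_state[:, 0]   # position 0 is [CLS]

# SVM parameters as reported by GridSearchCV in the abstract:
svm = SVC(C=4, kernel="rbf", gamma=0.01, degree=2, class_weight="balanced")
# svm.fit(cls_embed(train_texts).numpy(), train_labels)
```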

25 pages, 2920 KB  
Article
Compiler Identification with Divisive Analysis and Support Vector Machine
by Changlan Liu, Yingsong Zhang, Peng Zuo and Peng Wang
Symmetry 2025, 17(6), 867; https://doi.org/10.3390/sym17060867 - 3 Jun 2025
Abstract
Compilers play a crucial role in software development, as most software must be compiled into binaries before release. Analyzing the compiler version from binary files is of great importance in software reverse engineering, maintenance, traceability, and information security. In this work, we propose a novel framework for compiler version identification. Firstly, we generated 1000 C language source codes using CSmith and subsequently compiled them into 16,000 binary files using 16 distinct versions of compilers. The symmetric distribution of the dataset among different compiler versions may ensure unbiased model training. Then, IDA Pro was used to decompile the binary files into assembly instruction sequences. From these sequences, we extracted frequency-based features via the Bag-of-Words (BOW) model and sequence-based features derived from the grey-level co-occurrence matrix (GLCM). Finally, we introduced a divide-and-conquer framework (DIANA-SVM) to effectively classify compiler versions. The experimental results demonstrate that traditional Support Vector Machine (SVM) models struggle to accurately identify compiler versions using compiled executable files. In contrast, DIANA-SVM’s symmetric data separation approach enhances performance, achieving an accuracy of 94% (±0.375%). This framework enables precise identification of high-risk compiler versions, offering a reliable tool for software supply chain security. Theoretically, our GLCM-based sequence modeling and divide-and-conquer framework advance feature extraction methodologies for binary files, offering a scalable solution for similar classification tasks beyond compiler identification.
(This article belongs to the Special Issue Advanced Studies of Symmetry/Asymmetry in Cybersecurity)
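
A rough analogue of the frequency (BOW) plus sequence (GLCM-style co-occurrence) features, on a toy opcode vocabulary; the paper's actual GLCM construction over assembly is more involved:

```python
from collections import Counter
import numpy as np

def opcode_features(asm: list, vocab: list) -> np.ndarray:
    """Opcode frequencies (bag-of-words) concatenated with adjacency
    co-occurrence counts, a simplified stand-in for the BOW + GLCM pair."""
    idx = {op: i for i, op in enumerate(vocab)}
    bow = np.zeros(len(vocab))
    cooc = np.zeros((len(vocab), len(vocab)))
    for a, b in zip(asm, asm[1:]):
        cooc[idx[a], idx[b]] += 1          # which opcode follows which
    for op, n in Counter(asm).items():
        bow[idx[op]] = n                    # plain frequency counts
    return np.concatenate([bow, cooc.ravel()])

print(opcode_features(["mov", "add", "mov", "jmp"],
                      ["mov", "add", "jmp"]).shape)  # (12,)
```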

40 pages, 3224 KB  
Article
A Comparative Study of Image Processing and Machine Learning Methods for Classification of Rail Welding Defects
by Mohale Emmanuel Molefe, Jules Raymond Tapamo and Siboniso Sithembiso Vilakazi
J. Sens. Actuator Netw. 2025, 14(3), 58; https://doi.org/10.3390/jsan14030058 - 29 May 2025
Abstract
Defects formed during the thermite welding of two rail sections require the welded joints to be inspected for quality, and the most widely used non-destructive inspection method is radiography testing. However, the conventional defect investigation process from the obtained radiography images is costly, lengthy, and subjective, as it is conducted manually by trained experts. Additionally, it has been shown that most rail breaks occur due to a crack initiated from a weld joint defect that was either misclassified or undetected. To improve the condition monitoring of rails, the railway industry requires an automated defect investigation system capable of detecting and classifying defects automatically. Therefore, this work proposes a method based on image processing and machine learning techniques for the automated investigation of defects. Histogram Equalization methods are first applied to improve image quality. Then, the extraction of the weld joint from the image background is achieved using the Chan–Vese Active Contour Model. A comparative investigation is carried out between Deep Convolution Neural Networks, Local Binary Pattern extractors, and Bag of Visual Words methods (with the Speeded-Up Robust Features extractor) for extracting features in weld joint images. Classification of features extracted by local feature extractors is achieved using Support Vector Machines, K-Nearest Neighbor, and Naive Bayes classifiers. The highest classification accuracy of 95% is achieved by the Deep Convolution Neural Network model. A Graphical User Interface is provided for the onsite investigation of defects.
(This article belongs to the Special Issue AI-Assisted Machine-Environment Interaction)
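
The bag-of-visual-words branch of the comparison can be sketched with OpenCV and k-means; ORB stands in here for the paper's SURF extractor, which requires the non-free opencv-contrib build:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bovw_histograms(images: list, n_words: int = 100) -> np.ndarray:
    """Bag-of-visual-words encoding: local descriptors -> k-means codebook
    -> one codeword histogram per weld-joint image."""
    orb = cv2.ORB_create()
    descs = [orb.detectAndCompute(img, None)[1] for img in images]
    pooled = np.vstack([d for d in descs if d is not None]).astype(np.float32)
    codebook = KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(pooled)
    return np.array([np.bincount(codebook.predict(d.astype(np.float32)),
                                 minlength=n_words)
                     if d is not None else np.zeros(n_words, dtype=np.int64)
                     for d in descs])
```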
