Search Results (86)

Search Parameters:
Keywords = fastText Embeddings

37 pages, 4917 KB  
Article
Transformer and Pre-Transformer Model-Based Sentiment Prediction with Various Embeddings: A Case Study on Amazon Reviews
by Ismail Duru and Ayşe Saliha Sunar
Entropy 2025, 27(12), 1202; https://doi.org/10.3390/e27121202 - 27 Nov 2025
Viewed by 1245
Abstract
Sentiment analysis is essential for understanding consumer opinions, yet selecting the optimal models and embedding methods remains challenging, especially when handling ambiguous expressions, slang, or mismatched sentiment–rating pairs. This study provides a comprehensive comparative evaluation of sentiment classification models across three paradigms: traditional machine learning, pre-transformer deep learning, and transformer-based models. Using the Amazon Magazine Subscriptions 2023 dataset, we evaluate a range of embedding techniques, including static embeddings (GloVe, FastText) and contextual transformer embeddings (BERT, DistilBERT, etc.). To capture predictive confidence and model uncertainty, we include categorical cross-entropy as a key evaluation metric alongside accuracy, precision, recall, and F1-score. In addition to detailed quantitative comparisons, we conduct a systematic qualitative analysis of misclassified samples to reveal model-specific patterns of uncertainty. Our findings show that FastText consistently outperforms GloVe in both traditional and LSTM-based models, particularly in recall, due to its subword-level semantic richness. Transformer-based models demonstrate superior contextual understanding and achieve the highest accuracy (92%) and lowest cross-entropy loss (0.25) with DistilBERT, indicating well-calibrated predictions. To validate the generalisability of our results, we replicated our experiments on the Amazon Gift Card Reviews dataset, where similar trends were observed. We also adopt a resource-aware approach by reducing the dataset size from 25 K to 20 K to reflect real-world hardware constraints. This study contributes to both sentiment analysis and sustainable AI by offering a scalable, entropy-aware evaluation framework that supports informed, context-sensitive model selection for practical applications. Full article
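The abstract above reports categorical cross-entropy alongside accuracy as a way to capture predictive confidence. A minimal Python sketch of that metric follows, assuming each prediction is a probability vector over classes and each label is a class index. The function name and the sample probabilities are illustrative and are not taken from the paper.

```python
import math

def mean_cross_entropy(probs, labels, eps=1e-12):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Clamp to eps so a zero probability does not raise a math domain error.
        total += -math.log(max(p[y], eps))
    return total / len(labels)

# Two hypothetical classifiers with identical accuracy on these two samples:
# both rank the true class first, but with different confidence.
confident = [[0.9, 0.1], [0.2, 0.8]]
hesitant = [[0.6, 0.4], [0.45, 0.55]]
labels = [0, 1]

print(mean_cross_entropy(confident, labels))
print(mean_cross_entropy(hesitant, labels))
```

Both toy classifiers classify every sample correctly, yet the hesitant one receives the higher loss, which is the kind of difference accuracy alone would not show.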

24 pages, 7431 KB  
Article
Research on Technical Condition of Concrete Bridges Based on FastText+CNN
by Shiwen Li, Zhihai Deng, Junguang Wang, Xiaoguang Wu and Qingyuan Feng
Appl. Sci. 2025, 15(23), 12386; https://doi.org/10.3390/app152312386 - 21 Nov 2025
Viewed by 416
Abstract
Addressing the challenges of scarce measured data for Class 3–4 bridges and strong subjectivity in manual assessments in bridge technical-condition evaluation, this study innovatively proposes a FastText+CNN evaluation model that integrates semantic features with spatial pattern recognition. By constructing a hierarchical data structure of bridge scale matrices using the analytic hierarchy process (AHP) and generating a balanced training set encompassing Class 1–5 bridges through computational code, the model overcomes the bottleneck of training under small-sample conditions. It employs N-Gram embeddings to achieve semantic representation of defect feature combinations, combines one-dimensional convolutional neural networks to capture cross-component spatial correlation patterns, and utilizes hierarchical Softmax to optimize multi-classification efficiency. Experiments show that the model achieves 92.4% accuracy on the test set, outperforming random forest and multi-layer CNN models by 15.9% and 3.7%, respectively, with recognition rates for Class 3–5 bridges rising to 85% and cross-entropy loss reduced to 0.36. Validated with data from 30 actual bridges, the model maintains 92.3% accuracy and demonstrates the ability to discover implicit patterns in cross-component defect chains, providing an intelligent solution for bridge technical condition evaluation that combines semantic understanding with spatial feature extraction. Full article

28 pages, 1341 KB  
Article
HyEWCos: A Comparative Study of Hybrid Embedding and Weighting Techniques for Text Similarity in Short Subjective Educational Text
by Hendry Hendry, Tukino Tukino, Eko Sediyono, Ahmad Fauzi and Baenil Huda
Information 2025, 16(11), 995; https://doi.org/10.3390/info16110995 - 17 Nov 2025
Viewed by 858
Abstract
This study is intended to evaluate and contrast the performance of varying combinations of embedding algorithms and weighting systems in measuring perception-based text similarity using the Cosine Similarity approach. Within a structured experiment design, a hybrid model referred to as HyEWCos (Hybrid Embedding and Weighting for Cosine Similarity) was built incorporating conventional embedding models (Word2Vec, FastText), transformer-based models (BERT, GPT), and statistical and linguistic word weighting schemes (TFIDF, BM25, POS-weighting, and N-weighting). The test results indicate that Word2Vec merged with the CBOW architecture and TFIDF weighting always returned the most reliable performance, with lowest error values (RMSE and MAE of 0.9868) and the highest rating correlation with expert judgment (Pearson’s, 0.524; Spearman’s, 0.543). These results show that contextually conditioned distributional representation approaches perform better in maintaining the semantic subtlety of short and subjective texts than transformer models that are not fine-tuned. This work is unique in terms of its evaluation framework because it integrates embedding and weighting approaches that have hitherto been examined mostly in separation. The main contribution of the study is the development of an experimental framework that serves as a foundation for building more stable and accurate text-based assessment systems. The research also proves the need for making decisions on representation methods based on the data type and domain and opens a door for continuing research in adaptive hybrid models and how their potential can be achieved through combining the best of various approaches. Full article
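The combination described above, word embeddings weighted by a term-weighting scheme and compared with cosine similarity, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the HyEWCos implementation: the two-dimensional vectors and the weights are invented, and in practice the vectors would come from a trained Word2Vec or FastText model and the weights from TF-IDF or BM25.

```python
import math

def weighted_sentence_vector(tokens, vectors, weights):
    """Weighted average of the word vectors for tokens that have a vector."""
    dim = len(next(iter(vectors.values())))
    acc = [0.0] * dim
    total = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # out-of-vocabulary tokens are skipped in this sketch
        w = weights.get(tok, 1.0)  # unweighted tokens default to 1.0
        for i, x in enumerate(vectors[tok]):
            acc[i] += w * x
        total += w
    return [x / total for x in acc] if total else acc

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy vectors and weights, invented for illustration only.
vectors = {"clear": [1.0, 0.0], "lecture": [0.0, 1.0], "the": [0.7, 0.7]}
weights = {"clear": 2.0, "lecture": 1.5, "the": 0.1}

a = weighted_sentence_vector(["the", "lecture", "clear"], vectors, weights)
b = weighted_sentence_vector(["clear", "lecture"], vectors, weights)
print(cosine_similarity(a, b))
```

Because the function word "the" carries a low weight, it barely shifts the sentence vector, so the two short answers remain close in cosine terms.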

26 pages, 1823 KB  
Article
Scalable Gender Profiling from Turkish Texts Using Deep Embeddings and Meta-Heuristic Feature Selection
by Hakan Gunduz
J. Theor. Appl. Electron. Commer. Res. 2025, 20(4), 253; https://doi.org/10.3390/jtaer20040253 - 24 Sep 2025
Cited by 1 | Viewed by 974
Abstract
Accurate gender identification from written text is critical for author profiling, recommendation systems, and demographic analytics in digital ecosystems. This study introduces a scalable framework for gender classification in Turkish, combining contextualized BERTurk and subword-aware FastText embeddings with three meta-heuristic feature selection algorithms: Genetic Algorithm (GA), Jaya and Artificial Rabbit Optimization (ARO). Evaluated on the IAG-TNKU corpus of 43,292 balanced Turkish news articles, the best-performing model—BERTurk+GA+LSTM—achieves 89.7% accuracy, while ARO reduces feature dimensionality by 90% with minimal performance loss. Beyond in-domain results, exploratory zero-shot and few-shot adaptation experiments on Turkish e-commerce product reviews demonstrate the framework’s transferability: while zero-shot performance dropped to 59.8%, few-shot adaptation with only 200–400 labeled samples raised accuracy to 69.6–72.3%. These findings highlight both the limitations of training exclusively on news articles and the practical feasibility of adapting the framework to consumer-generated content with minimal supervision. In addition to technical outcomes, we critically examine ethical considerations in gender inference, including fairness, representation, and the binary nature of current datasets. This work contributes a reproducible and linguistically informed baseline for gender profiling in morphologically rich, low-resource languages, with demonstrated potential for adaptation across domains such as social media and e-commerce personalization. Full article
(This article belongs to the Special Issue Human–Technology Synergies in AI-Driven E-Commerce Environments)
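The "subword-aware" property mentioned in the abstract above refers to FastText representing a word through its character n-grams. A rough Python sketch of the n-gram extraction step follows, using the boundary markers `<` and `>` and the default n-gram lengths of 3 to 6 from the fastText scheme. It omits the hashing of n-grams into buckets and the extra whole-word token, and the Turkish word pair is an illustrative choice that does not come from the paper.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Inflected forms of the same stem share many n-grams, which is why subword
# models can cope with morphologically rich languages such as Turkish.
shared = set(char_ngrams("evler", 3, 3)) & set(char_ngrams("evlerde", 3, 3))
print(sorted(shared))
```

Here "evler" (houses) and "evlerde" (in the houses) share four trigrams, so their vectors are built from partly the same components even if one of the forms is rare in the training corpus.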

21 pages, 588 KB  
Article
Research on an MOOC Recommendation Method Based on the Fusion of Behavioral Sequences and Textual Semantics
by Wenxin Zhao, Lei Zhao and Zhenbin Liu
Appl. Sci. 2025, 15(18), 10024; https://doi.org/10.3390/app151810024 - 13 Sep 2025
Viewed by 1115
Abstract
To address the challenges of user behavior sparsity and insufficient utilization of course semantics on MOOC platforms, this paper proposes a personalized recommendation method that integrates user behavioral sequences with course textual semantic features. First, shallow word-level features from course titles are extracted using FastText, and deep contextual semantic representations from course descriptions are obtained via a fine-tuned BERT model. The two sets of semantic features are concatenated to form a multi-level semantic representation of course content. Next, the fused semantic features are mapped into the same vector space as course ID embeddings through a linear projection layer and combined with the original course ID embeddings via an additive fusion strategy, enhancing the model’s semantic perception of course content. Finally, the fused features are fed into an improved SASRec model, where a multi-head self-attention mechanism is employed to model the evolution of user interests, enabling collaborative recommendations across behavioral and semantic modalities. Experiments conducted on the MOOCCubeX dataset (1.26 million users, 632 courses) demonstrated that the proposed method achieved NDCG@10 and HR@10 scores of 0.524 and 0.818, respectively, outperforming SASRec and semantic single-modality baselines. This study offers an efficient yet semantically rich recommendation solution for MOOC scenarios. Full article

22 pages, 1579 KB  
Article
Stance Detection in Arabic Tweets: A Machine Learning Framework for Identifying Extremist Discourse
by Arwa K. Alkhraiji and Aqil M. Azmi
Mathematics 2025, 13(18), 2965; https://doi.org/10.3390/math13182965 - 13 Sep 2025
Viewed by 1559
Abstract
Terrorism remains a critical global challenge, and the proliferation of social media has created new avenues for monitoring extremist discourse. This study investigates stance detection as a method to identify Arabic tweets expressing support for or opposition to specific organizations associated with extremist activities, using Hezbollah as a case study. Thousands of relevant Arabic tweets were collected and manually annotated by expert annotators. After extensive preprocessing and feature extraction using term frequency–inverse document frequency (tf-idf), we implemented traditional machine learning (ML) classifiers—Support Vector Machines (SVMs) with multiple kernels, Multinomial Naïve Bayes, and Weighted K-Nearest Neighbors. ML models were selected over deep learning (DL) approaches due to (1) limited annotated Arabic data availability for effective DL training; (2) computational efficiency for resource-constrained environments; and (3) the critical need for interpretability in counterterrorism applications. While interpretability is not a core focus of this work, the use of traditional ML models (rather than DL) makes the system inherently more transparent and readily adaptable for future integration of interpretability techniques. Comparative experiments using FastText word embeddings and tf-idf with supervised classifiers revealed superior performance with the latter approach. Our best result achieved a macro F-score of 78.62% using SVMs with the RBF kernel, demonstrating that interpretable ML frameworks offer a viable and resource-efficient approach for monitoring extremist discourse in Arabic social media. These findings highlight the potential of such frameworks to support scalable and explainable counterterrorism tools in low-resource linguistic settings. Full article
(This article belongs to the Special Issue Machine Learning Theory and Applications)

23 pages, 3847 KB  
Article
Optimizing Sentiment Analysis in Multilingual Balanced Datasets: A New Comparative Approach to Enhancing Feature Extraction Performance with ML and DL Classifiers
by Hamza Jakha, Souad El Houssaini, Mohammed-Alamine El Houssaini, Souad Ajjaj and Abdelali Hadir
Appl. Syst. Innov. 2025, 8(4), 104; https://doi.org/10.3390/asi8040104 - 28 Jul 2025
Viewed by 5120
Abstract
Social network platforms have a big impact on the development of companies by influencing clients’ behaviors and sentiments, which directly affect corporate reputations. Analyzing this feedback has become an essential component of business intelligence, supporting the improvement of long-term marketing strategies on a larger scale. The implementation of powerful sentiment analysis models requires a comprehensive and in-depth examination of each stage of the process. In this study, we present a new comparative approach for several feature extraction techniques, including TF-IDF, Word2Vec, FastText, and BERT embeddings. These methods are applied to three multilingual datasets collected from hotel review platforms in the tourism sector in English, French, and Arabic languages. Those datasets were preprocessed through cleaning, normalization, labeling, and balancing before being trained on various machine learning and deep learning algorithms. The effectiveness of each feature extraction method was evaluated using metrics such as accuracy, F1-score, precision, recall, ROC AUC curve, and a new metric that measures the execution time for generating word representations. Our extensive experiments demonstrate significant and excellent results, achieving accuracy rates of approximately 99% for the English dataset, 94% for the Arabic dataset, and 89% for the French dataset. These findings confirm the important impact of vectorization techniques on the performance of sentiment analysis models. They also highlight the important relationship between balanced datasets, effective feature extraction methods, and the choice of classification algorithms. So, this study aims to simplify the selection of feature extraction methods and appropriate classifiers for each language, thereby contributing to advancements in sentiment analysis. Full article
(This article belongs to the Topic Social Sciences and Intelligence Management, 2nd Volume)

24 pages, 1991 KB  
Article
A Multi-Feature Semantic Fusion Machine Learning Architecture for Detecting Encrypted Malicious Traffic
by Shiyu Tang, Fei Du, Zulong Diao and Wenjun Fan
J. Cybersecur. Priv. 2025, 5(3), 47; https://doi.org/10.3390/jcp5030047 - 17 Jul 2025
Cited by 1 | Viewed by 2029
Abstract
With the increasing sophistication of network attacks, machine learning (ML)-based methods have showcased promising performance in attack detection. However, ML-based methods often suffer from high false rates when tackling encrypted malicious traffic. To break through these bottlenecks, we propose EFTransformer, an encrypted flow transformer framework which inherits semantic perception and multi-scale feature fusion, can robustly and efficiently detect encrypted malicious traffic, and make up for the shortcomings of ML in the context of modeling ability and feature adequacy. EFTransformer introduces a channel-level extraction mechanism based on quintuples and a noise-aware clustering strategy to enhance the recognition ability of traffic patterns; adopts a dual-channel embedding method, using Word2Vec and FastText to capture global semantics and subword-level changes; and uses a Transformer-based classifier and attention pooling module to achieve dynamic feature-weighted fusion, thereby improving the robustness and accuracy of malicious traffic detection. Our systematic experiments on the ISCX2012 dataset demonstrate that EFTransformer achieves the best detection performance, with an accuracy of up to 95.26%, a false positive rate (FPR) of 6.19%, and a false negative rate (FNR) of only 5.85%. These results show that EFTransformer achieves high detection performance against encrypted malicious traffic. Full article
(This article belongs to the Section Security Engineering & Applications)

27 pages, 1817 KB  
Article
A Large Language Model-Based Approach for Multilingual Hate Speech Detection on Social Media
by Muhammad Usman, Muhammad Ahmad, Grigori Sidorov, Irina Gelbukh and Rolando Quintero Tellez
Computers 2025, 14(7), 279; https://doi.org/10.3390/computers14070279 - 15 Jul 2025
Cited by 2 | Viewed by 4125
Abstract
The proliferation of hate speech on social media platforms poses significant threats to digital safety, social cohesion, and freedom of expression. Detecting such content—especially across diverse languages—remains a challenging task due to linguistic complexity, cultural context, and resource limitations. To address these challenges, this study introduces a comprehensive approach for multilingual hate speech detection. To facilitate robust hate speech detection across diverse languages, this study makes several key contributions. First, we created a novel trilingual hate speech dataset consisting of 10,193 manually annotated tweets in English, Spanish, and Urdu. Second, we applied two innovative techniques—joint multilingual and translation-based approaches—for cross-lingual hate speech detection that have not been previously explored for these languages. Third, we developed detailed hate speech annotation guidelines tailored specifically to all three languages to ensure consistent and high-quality labeling. Finally, we conducted 41 experiments employing machine learning models with TF–IDF features, deep learning models utilizing FastText and GloVe embeddings, and transformer-based models leveraging advanced contextual embeddings to comprehensively evaluate our approach. Additionally, we employed a large language model with advanced contextual embeddings to identify the best solution for the hate speech detection task. The experimental results showed that our GPT-3.5-turbo model significantly outperforms strong baselines, achieving up to an 8% improvement over XLM-R in Urdu hate speech detection and an average gain of 4% across all three languages. This research not only contributes a high-quality multilingual dataset but also offers a scalable and inclusive framework for hate speech detection in underrepresented languages. Full article
(This article belongs to the Special Issue Recent Advances in Social Networks and Social Media)

24 pages, 2410 KB  
Article
UA-HSD-2025: Multi-Lingual Hate Speech Detection from Tweets Using Pre-Trained Transformers
by Muhammad Ahmad, Muhammad Waqas, Ameer Hamza, Sardar Usman, Ildar Batyrshin and Grigori Sidorov
Computers 2025, 14(6), 239; https://doi.org/10.3390/computers14060239 - 18 Jun 2025
Cited by 3 | Viewed by 5085
Abstract
The rise in social media has improved communication but also amplified the spread of hate speech, creating serious societal risks. Automated detection remains difficult due to subjectivity, linguistic diversity, and implicit language. While prior research focuses on high-resource languages, this study addresses the underexplored multilingual challenges of Arabic and Urdu hate speech through a comprehensive approach. To this end, the study makes four key contributions. First, we created a multilingual, manually annotated binary and multi-class dataset (UA-HSD-2025) sourced from X, which contains the five most important multi-class categories of hate speech. Second, we created detailed annotation guidelines to make the hate speech dataset robust. Third, we explore two strategies to address the challenges of multilingual data: a joint multilingual approach and a translation-based approach. The translation-based approach involves converting all input text into a single target language before applying a classifier. In contrast, the joint multilingual approach employs a unified model trained to handle multiple languages simultaneously, enabling it to classify text across different languages without translation. Finally, we ran 54 experiments covering machine learning models with TF-IDF features, deep learning models with pre-trained word embeddings such as FastText and GloVe, and pre-trained language models with contextual embeddings. Based on the analysis of the results, our language-based model (XLM-R) outperformed traditional supervised learning approaches, achieving 0.99 accuracy in binary classification for Arabic, Urdu, and joint-multilingual datasets, and 0.95, 0.94, and 0.94 accuracy in multi-class classification for joint-multilingual, Arabic, and Urdu datasets, respectively. Full article
(This article belongs to the Special Issue Recent Advances in Social Networks and Social Media)

15 pages, 5650 KB  
Article
Enhancing Interprofessional Communication in Healthcare Using Large Language Models: Study on Similarity Measurement Methods with Weighted Noun Embeddings
by Ji-Young Yeo, Sungkwan Youm and Kwang-Seong Shin
Electronics 2025, 14(11), 2240; https://doi.org/10.3390/electronics14112240 - 30 May 2025
Viewed by 923
Abstract
Large language models (LLMs) are increasingly applied to specialized domains like medical education, necessitating tailored approaches to evaluate structured responses such as SBAR (Situation, Background, Assessment, Recommendation). This study developed an evaluation tool for nursing student responses using LLMs, focusing on word-based learning and assessment methods to align automated scoring with expert evaluations. We propose a three-stage biasing approach: (1) integrating reference answers into the training corpus; (2) incorporating high-scoring student responses; (3) applying domain-critical token weighting through Weighted Noun Embeddings to enhance similarity measurements. By assigning higher weights to critical medical nouns and lower weights to less relevant terms, the embeddings prioritize domain-specific terminology. Employing Word2Vec and FastText models trained on general conversation, medical, and reference answer corpora alongside Sentence-BERT for comparison, our results demonstrate that biasing with reference answers, high-scoring responses, and weighted embeddings improves alignment with human evaluations. Word-based models, particularly after biasing, effectively distinguish high-performing responses from lower ones, as evidenced by increased cosine similarity differences. These findings validate that the proposed methodology enhances the precision and objectivity of evaluating descriptive answers, offering a practical solution for educational settings where fairness and consistency are paramount. Full article
(This article belongs to the Special Issue Deep Learning Approaches for Natural Language Processing)

28 pages, 1007 KB  
Article
Predicting the Event Types in the Human Brain: A Modeling Study Based on Embedding Vectors and Large-Scale Situation Type Datasets in Mandarin Chinese
by Xiaorui Ma and Hongchao Liu
Appl. Sci. 2025, 15(11), 5916; https://doi.org/10.3390/app15115916 - 24 May 2025
Viewed by 917
Abstract
Event types classify Chinese verbs based on the internal temporal structure of events. The categorization of verb event types is the most fundamental classification of concept types represented by verbs in the human brain. Meanwhile, event types exhibit strong predictive capabilities for exploring collocational patterns between words, making them crucial for Chinese teaching. This work focuses on constructing a statistically validated gold-standard dataset, forming the foundation for achieving high accuracy in recognizing verb event types. Utilizing a manually annotated dataset of verbs and aspectual markers’ co-occurrence features, the research conducts hierarchical clustering of Chinese verbs. The resulting dendrogram indicates that verbs can be categorized into three event types—state, activity and transition—based on semantic distance. Two approaches are employed to construct vector matrices: a supervised method that derives word vectors based on linguistic features, and an unsupervised method that uses four models to extract embedding vectors, including Word2Vec, FastText, BERT and ChatGPT. The classification of verb event types is performed using three classifiers: multinomial logistic regression, support vector machines and artificial neural networks. Experimental results demonstrate the superior performance of embedding vectors. Employing the pre-trained FastText model in conjunction with an artificial neural network classifier, the model achieves an accuracy of 98.37% in predicting 3133 verbs, thereby enabling the automatic identification of event types at the level of Chinese verbs and validating the high accuracy and practical value of embedding vectors in addressing complex semantic relationships and classification tasks. This work constructs datasets of considerable semantic complexity, comprising a substantial volume of verbs along with their feature vectors and situation type labels, which can be used for evaluating large language models in the future. Full article
(This article belongs to the Special Issue Application of Artificial Intelligence and Semantic Mining Technology)

20 pages, 2912 KB  
Article
Effective Context-Aware File Path Embeddings for Anomaly Detection
by Ra-Kyung Lee, Hyun-Min Song and Taek-Young Youn
Systems 2025, 13(6), 403; https://doi.org/10.3390/systems13060403 - 23 May 2025
Viewed by 1928
Abstract
In digital forensics, especially Windows forensics, identifying anomalous file paths is crucial when dealing with large-scale data. Traditional static embedding methods, which aggregate token-level representations, discard hierarchical and sequential relationships in file paths, leading to misclassification of anomalies. This study introduces a Transformer-based sequence modeling approach to classify anomalous file paths, addressing these limitations by preserving positional and contextual relationships. File paths from the NTFS Master File Table (MFT) were embedded using FastText to capture structural and contextual dependencies. Unlike static embeddings, the proposed method processes file paths as structured sequences to enhance anomaly detection accuracy. Extensive experiments showed that Transformer models generally outperformed traditional methods in detecting structured anomalies. The Transformer model with FastText embeddings (32 dimensions) achieved an accuracy of 0.9781 and an F1-score of 0.9782, while Random Forest with FastText embeddings (64 dimensions) achieved an accuracy of 0.9729 and an F1-score of 0.9729. These findings suggest that a hybrid anomaly detection framework combining Transformer-based models with traditional techniques could enhance robustness in forensic investigations. Future research should explore combining both methods to improve adaptability across diverse forensic scenarios. Full article

36 pages, 4245 KB  
Article
An Unsupervised Integrated Framework for Arabic Aspect-Based Sentiment Analysis and Abstractive Text Summarization of Traffic Services Using Transformer Models
by Alanoud Alotaibi and Farrukh Nadeem
Smart Cities 2025, 8(2), 62; https://doi.org/10.3390/smartcities8020062 - 8 Apr 2025
Cited by 4 | Viewed by 2795
Abstract
Social media is crucial for gathering public feedback on government services, particularly in the traffic sector. While Aspect-Based Sentiment Analysis (ABSA) offers a means to extract actionable insights from user posts, analyzing Arabic content poses unique challenges. Existing Arabic ABSA approaches rely heavily on supervised learning and manual annotation, limiting scalability. To tackle these challenges, we propose an integrated framework combining unsupervised BERTopic-based Aspect Category Detection with distant supervision using a fine-tuned CAMeLBERT model for sentiment classification. This is further complemented by transformer-based summarization through a fine-tuned AraBART model. Key contributions of this paper include: (1) the first comprehensive Arabic traffic services dataset containing 461,844 tweets, enabling future research in this previously unexplored domain; (2) a novel unsupervised approach for Arabic ABSA that eliminates the need for large-scale manual annotation, using FastText custom embeddings and BERTopic to achieve superior topic clustering; (3) a pioneering integration of aspect detection, sentiment analysis, and abstractive summarization that provides a complete pipeline for analyzing Arabic traffic service feedback; (4) state-of-the-art performance metrics across all tasks, achieving 92% accuracy in ABSA and a ROUGE-L score of 0.79 for summarization, establishing new benchmarks for Arabic NLP in the traffic domain. The framework significantly enhances smart city traffic management by enabling automated processing of citizen feedback, supporting data-driven decision-making, and allowing authorities to monitor public sentiment, identify emerging issues, and allocate resources based on citizen needs, ultimately improving urban mobility and service responsiveness.

16 pages, 1255 KB  
Article
Text Alignment in the Service of Text Reuse Detection
by Hadar Miller, Tsvi Kuflik and Moshe Lavee
Appl. Sci. 2025, 15(6), 3395; https://doi.org/10.3390/app15063395 - 20 Mar 2025
Cited by 1 | Viewed by 1470
Abstract
This study introduces a novel approach to text alignment tailored for ancient languages, with a focus on Hebrew and Aramaic, aimed at enhancing text reuse detection. Unlike previous methods, our approach integrates multiple NLP components into a specialized comparison pipeline, which is then incorporated into the Smith–Waterman algorithm. This integration enables improved alignment accuracy, particularly for historical texts characterized by fluctuations, orthographic changes, transcription variations, and word transpositions. Our key contributions include (1) a refined distance function that integrates fastText embeddings, allowing robust handling of out-of-vocabulary words; (2) a typological correction mechanism that can be integrated into automatic transcription pipelines to enhance text normalization; and (3) an evaluation of historical Hebrew texts, demonstrating an 11% improvement in the F1 score over existing approaches. These findings underscore the importance of computational methodologies in digital humanities and lay the groundwork for future multilingual extensions.
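The idea of plugging a soft word-similarity function into Smith–Waterman can be sketched as follows. This is not the authors' pipeline: the character n-gram Jaccard similarity below is an assumed stand-in for a fastText-based distance (chosen because it also tolerates spelling variants and unseen words), and the scoring scheme (`2 * sim - 1` as the match score, a gap penalty of 0.5) and the English example tokens are illustrative assumptions.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(max(1, len(padded) - n + 1))}


def ngram_similarity(a, b):
    """Jaccard overlap of character n-grams, in [0, 1] (stand-in for an embedding similarity)."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)


def smith_waterman(xs, ys, sim=ngram_similarity, gap=0.5):
    """Local alignment of two token lists with a soft match score."""
    rows, cols = len(xs) + 1, len(ys) + 1
    H = [[0.0] * cols for _ in range(rows)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            score = 2.0 * sim(xs[i - 1], ys[j - 1]) - 1.0  # maps [0, 1] to [-1, 1]
            H[i][j] = max(0.0, H[i - 1][j - 1] + score,
                          H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the local alignment score drops to zero.
    i, j = best_pos
    pairs = []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = 2.0 * sim(xs[i - 1], ys[j - 1]) - 1.0
        if H[i][j] == H[i - 1][j - 1] + score:
            pairs.append((xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


source = ["in", "the", "beginning", "god", "created"]
reuse = ["and", "in", "the", "begining", "god", "made"]
best_score, aligned = smith_waterman(source, reuse)
for left, right in aligned:
    print(left, "<->", right)
```

Because the match score is graded rather than binary, a spelling variant such as "begining" still contributes positively to the alignment instead of breaking it.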