Search Results (5)

Search Parameters:
Keywords = IndoBERT

17 pages, 1662 KB  
Proceeding Paper
Performance Analysis of IndoBERT for Detection of Online Gambling Promotion in YouTube Comments
by Kamdan Kamdan, Malik Pajar Anugrah, Moh Jeli Almutaali, Restu Ramdani and Ivana Lucia Kharisma
Eng. Proc. 2025, 107(1), 66; https://doi.org/10.3390/engproc2025107066 - 2 Sep 2025
Viewed by 1148
Abstract
The proliferation of online gambling promotions on social media platforms, particularly YouTube, poses a significant challenge in digital security and regulation. This study evaluates the performance of IndoBERT in detecting online gambling-related spam in YouTube comments. The research utilizes the YouTube Data API to collect comments, preprocesses the text through cleaning and tokenization, and fine-tunes IndoBERT for classification. The model’s performance is assessed using accuracy, precision, recall, and F1-score metrics. IndoBERT achieves outstanding results with an accuracy of 98.26%, proving its effectiveness in detecting online gambling promotion. The confusion matrix analysis highlights a low error rate, with minimal false positives and false negatives. IndoBERT is a promising tool for combating online gambling spam, offering high reliability for automated content moderation. Future improvements should focus on handling implicit promotional language, enhancing dataset diversity, and integrating rule-based filtering. This study contributes to NLP advancements in Indonesian text classification, supporting efforts to maintain a safer digital environment.
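The four reported metrics all derive from the confusion matrix the abstract mentions. A minimal sketch of how they are computed from binary counts (the counts below are illustrative, not the paper's actual matrix):

```python
# Classification metrics from a binary confusion matrix, as used to
# evaluate the fine-tuned IndoBERT spam classifier.
# tp/fp/fn/tn values here are illustrative, not the paper's data.

def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1-score."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=480, fp=10, fn=7, tn=503)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

Note that with few false positives and false negatives, as the abstract reports, precision and recall both approach 1 and the F1-score follows.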

28 pages, 2499 KB  
Article
Optimizing Aspect-Based Sentiment Analysis Using BERT for Comprehensive Analysis of Indonesian Student Feedback
by Ahmad Jazuli, Widowati and Retno Kusumaningrum
Appl. Sci. 2025, 15(1), 172; https://doi.org/10.3390/app15010172 - 28 Dec 2024
Cited by 5 | Viewed by 5519
Abstract
Evaluating the learning process requires a platform for students to express feedback and suggestions openly through online reviews. Sentiment analysis is often used to analyze review texts but typically captures only overall sentiment without identifying specific aspects. This study develops an aspect-based sentiment analysis (ABSA) model using IndoBERT, a pre-trained model tailored for the Indonesian language. The research uses 10,000 student reviews from Indonesian universities, processed through data labeling, text preprocessing, and splitting, followed by model training and performance evaluation. The model demonstrated superior performance with an aspect extraction accuracy of 0.973, an F1-score of 0.952, a sentiment classification accuracy of 0.979, and an F1-score of 0.974. Experimental results indicate that the proposed ABSA model surpasses previous state-of-the-art models in analyzing sentiment related to specific aspects of educational evaluation. By leveraging IndoBERT, the model effectively handles linguistic complexities and provides detailed insights into student experiences. These findings highlight the potential of the ABSA model in enhancing learning evaluations by offering precise, aspect-focused feedback, contributing to strategies for improving the quality of higher education.
(This article belongs to the Special Issue Application of Artificial Intelligence and Semantic Mining Technology)
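The key difference from plain sentiment analysis is the output shape: one (aspect, sentiment) pair per aspect mentioned, rather than one label per review. A toy sketch of that structure, using an invented keyword lexicon in place of the paper's learned IndoBERT extractor:

```python
# Toy aspect-based sentiment output: one (aspect, sentiment) pair per
# aspect mentioned in a review, instead of a single review-level label.
# The lexicons below are invented for illustration; the paper's model
# learns both aspect extraction and sentiment classification.

ASPECT_TERMS = {"lecturer": "teaching", "materials": "curriculum", "wifi": "facilities"}
POSITIVE = {"helpful", "clear", "great"}
NEGATIVE = {"slow", "outdated", "unclear"}

def toy_absa(review: str) -> list[tuple[str, str]]:
    tokens = review.lower().replace(".", " ").split()
    pairs = []
    for i, tok in enumerate(tokens):
        if tok in ASPECT_TERMS:
            window = tokens[max(0, i - 2): i + 3]  # small context window
            if any(w in POSITIVE for w in window):
                pairs.append((ASPECT_TERMS[tok], "positive"))
            elif any(w in NEGATIVE for w in window):
                pairs.append((ASPECT_TERMS[tok], "negative"))
            else:
                pairs.append((ASPECT_TERMS[tok], "neutral"))
    return pairs

print(toy_absa("The lecturer was helpful but the wifi is slow."))
# → [('teaching', 'positive'), ('facilities', 'negative')]
```

A single review can thus contribute opposing sentiments to different aspects, which is exactly what a review-level classifier flattens away.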

28 pages, 2857 KB  
Article
IndoGovBERT: A Domain-Specific Language Model for Processing Indonesian Government SDG Documents
by Agus Riyadi, Mate Kovacs, Uwe Serdült and Victor Kryssanov
Big Data Cogn. Comput. 2024, 8(11), 153; https://doi.org/10.3390/bdcc8110153 - 9 Nov 2024
Cited by 3 | Viewed by 3369
Abstract
Achieving the Sustainable Development Goals (SDGs) requires collaboration among various stakeholders, particularly governments and non-state actors (NSAs). This collaboration results in but is also based on a continually growing volume of documents that needs to be analyzed and processed in a systematic way by government officials. Artificial Intelligence and Natural Language Processing (NLP) could, thus, offer valuable support for progressing towards SDG targets, including automating the government budget tagging and classifying NSA requests and initiatives, as well as helping uncover the possibilities for matching these two categories of activities. Many non-English speaking countries, including Indonesia, however, face limited NLP resources, such as, for instance, domain-specific pre-trained language models (PTLMs). This circumstance makes it difficult to automate document processing and improve the efficacy of SDG-related government efforts. The presented study introduces IndoGovBERT, a Bidirectional Encoder Representations from Transformers (BERT)-based PTLM built with domain-specific corpora, leveraging the Indonesian government’s public and internal documents. The model is intended to automate various laborious tasks of SDG document processing by the Indonesian government. Different approaches to PTLM development known from the literature are examined in the context of typical government settings. The most effective, in terms of the resultant model performance, but also most efficient, in terms of the computational resources required, methodology is determined and deployed for the development of the IndoGovBERT model. The developed model is then scrutinized in several text classification and similarity assessment experiments, where it is compared with four Indonesian general-purpose language models, a non-transformer approach of the Multilabel Topic Model (MLTM), as well as with a Multilingual BERT model. Results obtained in all experiments highlight the superior capability of the IndoGovBERT model for Indonesian government SDG document processing. The latter suggests that the proposed PTLM development methodology could be adopted to build high-performance specialized PTLMs for governments around the globe which face SDG document processing and other NLP challenges similar to the ones dealt with in the presented study.
(This article belongs to the Special Issue Artificial Intelligence and Natural Language Processing)
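The similarity-assessment experiments the abstract describes compare document representations, typically via cosine similarity between embedding vectors. A minimal sketch of that score (the vectors below are toy placeholders, not real IndoGovBERT embeddings):

```python
import math

# Cosine similarity between two document embeddings, the standard score
# behind embedding-based similarity assessment, e.g. for matching
# government budget items against NSA initiatives.
# The vectors below are toy placeholders, not real model embeddings.

def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

budget_doc = [0.9, 0.1, 0.3]   # placeholder embedding of a budget document
nsa_request = [0.8, 0.2, 0.4]  # placeholder embedding of an NSA request
print(f"similarity = {cosine_similarity(budget_doc, nsa_request):.3f}")
```

Because the score is normalized by vector length, it compares direction only, so documents of very different lengths can still score as similar when their embeddings point the same way.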

19 pages, 601 KB  
Article
Multilingual Hate Speech Detection: A Semi-Supervised Generative Adversarial Approach
by Khouloud Mnassri, Reza Farahbakhsh and Noel Crespi
Entropy 2024, 26(4), 344; https://doi.org/10.3390/e26040344 - 18 Apr 2024
Cited by 7 | Viewed by 6722
Abstract
Social media platforms have surpassed cultural and linguistic boundaries, thus enabling online communication worldwide. However, the expanded use of various languages has intensified the challenge of online detection of hate speech content. Despite the release of multiple Natural Language Processing (NLP) solutions implementing cutting-edge machine learning techniques, the scarcity of data, especially labeled data, remains a considerable obstacle, which further requires the use of semisupervised approaches along with Generative Artificial Intelligence (Generative AI) techniques. This paper introduces an innovative approach, a multilingual semisupervised model combining Generative Adversarial Networks (GANs) and Pretrained Language Models (PLMs), more precisely mBERT and XLM-RoBERTa. Our approach proves its effectiveness in the detection of hate speech and offensive language in Indo-European languages (in English, German, and Hindi) when employing only 20% annotated data from the HASOC2019 dataset, thereby presenting significantly high performances in each of multilingual, zero-shot crosslingual, and monolingual training scenarios. Our study provides a robust mBERT-based semisupervised GAN model (SS-GAN-mBERT) that outperformed the XLM-RoBERTa-based model (SS-GAN-XLM) and reached an average F1 score boost of 9.23% and an accuracy increase of 5.75% over the baseline semisupervised mBERT model.
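In the semi-supervised GAN setups of this kind, the discriminator typically classifies inputs into the K real classes plus one extra "fake" class for generated samples; summing the real-class probabilities gives the real-vs-fake decision for unlabeled data. A numeric sketch of that K+1 softmax head (class count and logits are illustrative, not taken from the paper):

```python
import math

# In a semi-supervised GAN for text classification, the discriminator
# commonly outputs K real classes plus one extra "fake" class for
# generated samples. Sketch of that K+1 softmax head; the logit values
# below are illustrative, not model outputs.

K = 3  # e.g. hate speech / offensive / neither (illustrative labels)

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.1, 0.3, -0.5, -1.2]  # K real-class logits + 1 fake-class logit
probs = softmax(logits)
p_real = sum(probs[:K])  # probability the sample is real (any real class)
p_fake = probs[K]        # probability the sample was generated
print(f"p_real={p_real:.3f} p_fake={p_fake:.3f}")
```

Labeled examples supervise the K real classes directly, while unlabeled examples only need the coarser real-vs-fake split, which is how the approach extracts signal from the 80% of data without annotations.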

22 pages, 1108 KB  
Article
We Know You Are Living in Bali: Location Prediction of Twitter Users Using BERT Language Model
by Lihardo Faisal Simanjuntak, Rahmad Mahendra and Evi Yulianti
Big Data Cogn. Comput. 2022, 6(3), 77; https://doi.org/10.3390/bdcc6030077 - 7 Jul 2022
Cited by 35 | Viewed by 6133
Abstract
Twitter user location data provide essential information that can be used for various purposes. However, user location is not easy to identify because many profiles omit this information, or users enter data that do not correspond to their actual locations. Several related works attempted to predict location on English-language tweets. In this study, we attempted to predict the location of Indonesian tweets. We utilized machine learning approaches, i.e., long short-term memory (LSTM) and bidirectional encoder representations from transformers (BERT), to infer Twitter users’ home locations using the profile display name, user description, and user tweets. By concatenating display name, description, and aggregated tweets, the model achieved the best accuracy of 0.77. The performance of the IndoBERT model outperformed several baseline models.
(This article belongs to the Topic Machine and Deep Learning)
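The best-performing input described above is a concatenation of the three profile signals into a single text sequence before classification. A minimal sketch of that step; the separator token and sample profile data are assumptions for illustration, not the paper's exact preprocessing:

```python
# Concatenating a user's display name, profile description, and
# aggregated tweets into one input string for the location classifier.
# The "[SEP]" separator and the sample data below are assumptions for
# illustration, not the study's exact preprocessing.

SEP = " [SEP] "

def build_input(display_name: str, description: str, tweets: list[str]) -> str:
    aggregated_tweets = " ".join(tweets)
    return SEP.join([display_name, description, aggregated_tweets])

text = build_input(
    "Made W.",
    "Surf instructor in Kuta",
    ["Sunset di pantai hari ini", "Kelas surfing besok pagi"],
)
print(text)
```

Feeding all three fields as one sequence lets the model weigh weak signals jointly, e.g. a location-bearing description compensating for tweets that never name a place.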
