A Generative Approach to Enhancing Forums Through SVM-Based Spam Detection

Rivera-Hernandez, Jose Antonio; Barbosa-Santillán, Liliana Ibeth; Sánchez-Escobar, Juan Jaime

doi:10.3390/data11040078

Open AccessArticle

A Generative Approach to Enhancing Forums Through SVM-Based Spam Detection

by

Jose Antonio Rivera-Hernandez

^1,†

,

Liliana Ibeth Barbosa-Santillán

^1,*,†

and

Juan Jaime Sánchez-Escobar

^2,*,†

¹

School of Engineering and Sciences, Monterrey Institute of Technology and Higher Education, Monterrey 45138, Nuevo León, Mexico

²

Subdirección de Investigación, Centro de Enseñanza Técnica Industrial, Guadalajara 44638, Jalisco, Mexico

^*

Authors to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Data 2026, 11(4), 78; https://doi.org/10.3390/data11040078

Submission received: 9 February 2026 / Revised: 24 March 2026 / Accepted: 3 April 2026 / Published: 8 April 2026

Download

Browse Figures

Versions Notes

Abstract

Spam consists of unsolicited messages, and the posting of such irrelevant messages often presents significant challenges in technical forums. Two particular challenges are the dynamic nature of spamming tactics and the inadequacy of adaptable spam databases for automated classifiers. Our work addresses the need for a robust spam classification solution that can be seamlessly integrated with database, SQL, and APEX applications. We developed a labeled spam database by asking experts to categorize 1916 posts as spam or regular posts to ensure accurate classification and then created an SVM-based spam classification model that achieves an average validation accuracy of 90%. Our research enhances the current understanding of spam in technical forums and represents a solution for embedding spam classifiers into widely used platforms with an accuracy of 98.1%. Furthermore, we explore the incorporation of generative topics into our approach by integrating generative topic modeling techniques, such as latent Dirichlet allocation. In our work, the spam classifier is dynamically updated to account for emerging spam patterns and topics based on a generative approach that improves the robustness of the classifier against new spamming tactics and enables nuanced, context-aware filtering of messages. In addition, our experiments highlight the potential of text SVM classifiers for real-time applications through the fine-tuning of text features.

Keywords:

spam; technical forums; automated classifiers; spam databases; generative adversarial networks (GANs); latent Dirichlet allocation (LDA)

1. Introduction

Question and answer (QA) platforms have become an invaluable tool for finding answers quickly to various questions. Such platforms range from general forums, such as Yahoo Answers, to highly specialized ones, such as Stack Overflow, and serve a critical function by providing swift and reliable information. Their success rests on the quality of the content and the efficiency of their moderation processes. However, the vast amounts of spam that are received daily pose significant challenges. The presence of spam imposes a burden on content moderators in the form of additional work and diminishes user trust, discourages expert contributions, and, ultimately, lowers the quality and value of a platform. Artificial intelligence (AI) offers promising solutions to these challenges. By leveraging AI, QA systems can automatically detect and manage inappropriate content, thereby automating moderation. Our study explores how AI technologies can improve the moderation of a QA system, particularly by reducing spam and enhancing content quality. Research has shown that AI can effectively identify patterns of harmful behavior and unacceptable content. AI addresses the challenges associated with moderation by leveraging data analysis and pattern recognition, thus creating a more reliable and engaging user environment. Spam detection systems are crucial for preserving the integrity and enhancing the user experience on various communication platforms. Our study specifically targets spam detection in technical forums. Spam is a pervasive issue, as it significantly damages online discussions, complicates information retrieval, and disrupts the overall user experience. Given the dynamic nature of the tactics of spammers, existing spam databases often fail to keep pace with developments, rendering automated classifiers less effective. The aim of this work is to delve into the complexities of spam and its impact on technical forums. Our goal is to develop a robust spam classification solution that combines platforms such as databases, SQL, and APEX applications. In addition, this study seeks to create an accurately labeled spam database to improve the effectiveness of automated spam classifiers. To ensure accurate classification, content moderation experts meticulously categorized a spam pool of 1916 spam posts and sampled regular posts to form train/test benchmarks, thereby addressing the problem of the inadequacy of existing adaptable spam databases. Our objective is to develop an advanced spam detection system using AI techniques. The proposed system aims to automatically identify and filter spam in forums, thereby reducing the workload of human moderators and significantly improving the overall user experience and trust in forum communications. Our research focuses on the introduction of an AI-based model for detecting spam forums with high accuracy, in which patterns typical of spam messages are analyzed and identified. By efficiently filtering spam, we can enhance user trust and satisfaction in forum communications. Moreover, our automated spam detection system reduces the manual effort required of content moderators, thus allowing them to focus on other critical tasks. Our research question Q1 and hypothesis H1 are as follows: Q1: What impact does an AI-based model for detecting spam in forums have on user trust in forum communications?

Hypothesis 1.

The AI-based model for detecting spam in forums, specifically using SVM, will achieve higher accuracy in identifying spam messages.

Our work comprehensively evaluates the spam detection system by comparing it with existing methods to validate its effectiveness, and we show that AI can transform and elevate the reliability and user engagement of forum communication platforms. The remainder of this paper is structured as follows: Section 2 discusses the state of the art in this field; Section 3 provides background information and outlines our proposed architecture; Section 4 introduces the proposed SVM-based spam classification model; Section 5 describes some experiments conducted on our model; Section 6 presents the results obtained from these experiments; and this paper is concluded with some final remarks.

2. State of the Art

As described in [1], a technical forum is a virtual platform that facilitates discussion, exchange of information, and collaboration among professionals, experts, or enthusiasts with specialized knowledge in a specific technological area. Rheingold notes that platforms enable users to ask questions, share knowledge, address technical problems, and participate in dialogues centered on specific technological topics.

In their work, Ref. [2] mentioned that the distinctive features of a technical forum include the ability to post messages, create discussion threads, attach files, and organize information into categories. Online environments offer significant value to the technical community by providing an effective means of learning, solving practical problems, keeping up with the latest trends and technologies, and establishing connections with other professionals in the same sector.

Some of the most common features of technical forums, according to [1,2,3], are as follows:

Users can start conversations called “threads” on specific topics, and others can respond to these threads to contribute to the discussion.
Topics are organized into specific categories to facilitate navigation and the search for relevant information.
Most forums require users to register to participate in order to monitor each member’s activity and contributions.
Some forums have moderators who supervise discussions to ensure a respectful tone and compliance with the forum’s rules.
Forums usually provide the tools necessary to quote messages and send private messages and notify users about responses to their posts.

Research in the area of advanced spam detection includes specialized methodologies tailored for specific contexts. For instance, models designed for technical forums focus on real-time classification and leverage exploratory analysis of the characteristics of opinion spam and descriptive statistics [4] to identify the unique features of spam posts. Techniques such as user profile analysis [5] and frequent itemset mining are used to analyze user profiles, spamminess, and registration patterns, using SVM with radial basis function (RBF) kernels to detect patterns in user behavior. Moreover, spam filtering techniques for IRC discussions [6] leverage data from platforms such as Stack Overflow and YouTube to build and evaluate classifiers that distinguish on-topic from off-topic discussions. Studies of the prevalence of forum spamming have developed lightweight features based on elements such as spammers’ IP addresses [7], commenting activity, and post anatomy. Innovations such as the XRumer Forum Spam Automator Analysis involve the use of reverse engineering [8] to identify vulnerabilities and suggest countermeasures. Context-based analyses of spam blogs and honey forums [9] detect spam based on redirection and cloaking techniques.

The SMOTE approach [10] is an essential technique in this field as it addresses the issue of imbalanced training data by oversampling minority classes, thereby enhancing the performance of supervised classification algorithms. In addition, the link graph-based approach [11] classifies URLs as spam or legitimate mail by analyzing graph metrics and metadata, using techniques such as varying graph depths and subgraph aggregation to manage data noise effectively and ensure robust spam detection. Our research models for spam detection use a combination of SVMs and full-text search. This is an embedded solution that operates independently of external libraries, thus significantly reducing the burden on human moderators and enhancing user experience by maintaining the quality and relevance of forum discussions.

Ghourabi and Alohaly [12] developed a hybrid model for spam detection in SMS by combining transformers.

In [13], an SVM was combined with a TF-IDF for spam detection in forums. The authors highlighted that SVM was effective in high-dimensional spaces.

Jain et al. [14,15,16,17,18,19,20] also explored combinations of SVM with other classification methods and evaluated their performance across various types of spam, from SMS to forums and social networks. In their research, they concluded that SVM is effective when combined with dimensionality reduction techniques and hyperparameter optimization.

A comparative analysis of current spam detection methodologies is presented in Table 1.

3. A Generative Approach to Enhancing Forums Through SVM-Based Spam Detection

In this paper, we present a spam classification model that detects and filters spam posted in technical forums using an SVM integrated with full-text search. The focus of this model is on English language content and data cleaning and preprocessing. Our model is designed for environments such as the Rapid Application Development (RAD) platform, with a fully embedded solution that does not rely on external libraries.

Dataset source and dataset definitions. The data used in this study were obtained from a public technical forum and spanned user-generated content from 1998 to 2023. From this source, we derived (i) a manually verified spam pool and (ii) sampled benchmark datasets for model development, as well as (iii) an independent, large-scale validation dataset.

Spam pool (N = 1916 spam posts). A total of 1916 posts previously flagged as spam were identified using two signals: (a) manual moderation, where posts were flagged by community moderators, users, or internal systems, and (b) historical data migration, where posts migrated from the earlier “Communities” system were flagged based on account activity (e.g., accounts previously marked as spammers). This pool is used as the source of positive (spam) examples for training and testing.

Sampled benchmark datasets (training/testing). For controlled model development and evaluation, we paired the spam pool with regular posts sampled from the same forum to construct two benchmark subsets with an approximately 2:1 regular-to-spam ratio: the training set contained 1159 spam and 2300 regular posts (3459 total) and the held-out test set contained 757 spam and 1536 regular posts (2293 total). These sampled ratios were used to ensure sufficient spam examples for stable learning and benchmarking.

Large scale validation dataset. To evaluate generalization under real-world prevalence, we additionally tested the classifier on an independent validation dataset comprising 1,980,755 regular posts and 2408 spam posts, as shown in Table 2.

The dataset size was limited because previously flagged spam data, as shown in Figure 1, were barely preserved during earlier versions of the forum. However, the SVM algorithm is well-suited to small datasets and can deliver robust, accurate results without requiring large amounts of training data. In addition to language filtering, a thorough content-based cleanup was performed. We distinguished two edge-case groups.

(1) Low-information/no-value posts: posts consisting only of punctuation, repeated characters, extremely short text, or otherwise lacking meaningful content. These entries do not provide sufficient context for a binary spam-vs.-regular classifier and would introduce label noise, so they were removed from the training corpus. (2) Ambiguous-but-meaningful posts: short or contextually incomplete messages (e.g., “Check this out!”) that can be legitimate but are difficult to label consistently as spam or regular. To preserve labeling reliability, such cases were not used as supervised training examples when a clear ground-truth label could not be established.

This cleanup step was intended to reduce noise and improve training data relevance; we do not claim a quantified performance gain from this step without an explicit ablation study.

The proposed spam classification system was seamlessly integrated with the rapid application development platform, and operated entirely within the relational database management system infrastructure, with no external dependencies. This integration offers low latency and scalability, making the model suitable for forums with high content volume. By automating spam detection, the system reduces the manual workload for forum moderators and improves the user experience by preventing spam from disrupting discussions. The data preprocessing workflow in Figure 2 shows the data flow through each preprocessing step, from raw text to the final processed dataset. The processed dataset was then used for training and testing.

A detailed preprocessing pipeline was applied to prepare the text data for classification. The steps included the following:

Tokenization:
- Each post (title and body) was tokenized, which involved breaking the text into individual words and punctuation marks.
- These unstructured data were converted into structured input suitable for the model.
Stopword removal:
- A custom stopword list was created to exclude frequent but non-informative words (e.g., “the”, “is”, “and”).
- A total of 109 stopwords were identified and removed to reduce noise and improve feature relevance (Table 3).
Stemming:
- The Porter stemming algorithm was applied to reduce words to their root forms (e.g., ”running“ to “run”).
- This step consolidated word variations, simplifying the feature space.
Feature extraction:
- The frequency-inverse document frequency (TF-IDF) metric quantified the importance of each word in the dataset.
- Higher weights were assigned to terms that were more frequent in a document but less common across the entire dataset.
Dimensionality reduction:
- Principal component analysis (PCA) was used to address the issue of high-dimensional text data.
- The top 100 principal components were retained, preserving 95% of the variance.
- This step improved computational efficiency and reduced the risk of overfitting.
- The configuration details for the full-text search SVM classifier were as follows:
Datastore setup:
- A multi-column datastore was used to combine the title and body into a virtual document for simultaneous analysis.
- The columns used were the title and body.
- Filtering was applied to remove irrelevant metadata or formatting tags (e.g., XML, HTML).
Lexer configuration:
- A lexer was applied to handle tokenization, stemming, and theme extraction for English-language text.
- Indexing was enabled for both text and thematic features.
- English stemming was applied to standardize the forms of words (Table 4).
Word list configuration:
- Full-text search capabilities were enhanced using stemming and fuzzy matching.
- An English language stemmer was applied to index word roots in order to improve search precision.
- Fuzzy matching was enabled with:
  - Fuzzy score: 60 (balancing tolerance for minor character variations);
  - Fuzzy result limit: 5000 entries to ensure comprehensive but manageable search outcomes (Table 5).
SVM classifier configuration (Figure 3):
- Maximum document term count: 8192.
- Maximum feature count: 100,000.

The dataset was split into three parts (training, testing, and validation sets) for evaluation of the SVM classifier, using dataset partitioning.

The split was as follows: the training set made up 60% of the dataset (1159 spam and 2300 regular posts), with a 2:1 ratio of regular posts to spam, to ensure a balanced training set. The testing set consisted of 40% of the data (757 spam and 1536 regular posts) and was used to evaluate the model after training. The validation set comprised an additional 1,980,755 regular posts and 2408 new spam posts and was used to test the model’s generalization capabilities on more extensive imbalanced data. To handle the class imbalance that arose due to the disproportionate number of regular posts compared to spam, class weights were applied: during the optimization process, the SVM classifier was fine-tuned by assigning higher weights to the minority spam class. Errors in spam classification were penalized more heavily than errors in regular post classification. The model was found to improve user engagement and reduce the need for manual intervention in forums. Its real-time application capability means it is a scalable solution that can be deployed across various technical forums.

4. Experiments and Results

One of the key challenges encountered during the development of the spam classification model was the presence of posts in multiple languages on the forum, since the model was initially designed to classify only content written in English. Posts were submitted in languages such as Spanish, Portuguese, Chinese, German (Figure 4), and Japanese, all of which were supported by the forum but were outside the scope of this phase of model development. These non-English posts posed a risk of introducing noise into the dataset, potentially reducing the accuracy of the classifier. A language identification and filtering process was therefore implemented using OCI AI-Language Detection.The OCI Language API was used; the coefficient determined the dominant language of the post. Posts with a low confidence score for being English, with a threshold of 0.4, were flagged as non-English and removed. This tool allowed for the automatic detection of the language of each post. As a result, posts not written in English were identified and filtered out. By excluding these non-English posts, the dataset was refined to ensure that it contained only relevant content for training the classifier. Through this filtering process, 9893 non-English posts were identified and removed from the dataset (see Figure 4). This choice matched our Oracle Text indexing configuration, which applied English stemming and English word list settings.

This step was critical in maintaining the consistency and quality of the training data, thereby ensuring that the model could accurately classify English language posts. Mechanistically, this alignment mattered because the Oracle Text lexer/word list settings used in our pipeline applied English stemming and English fuzzy matching. Mixing non-English content would have increased vocabulary sparsity and introduced feature–space mismatch for an English-configured index. By focusing exclusively on English posts in this phase, we avoided multilingual tokenization and stemming and reduced preprocessing/indexing complexity. Because our Oracle Text configuration was English-specific (e.g., English stemming and English word list settings), removing non-English posts helped maintain a consistent feature space for English language classification; we therefore describe this step as improving efficiency and preserving classification quality of English posts, rather than claiming a quantified accuracy gain without an explicit ablation study. In this manuscript, we use “performance” to refer to (i) classification metrics on English language posts (e.g., accuracy/precision/recall/F1) and (ii) computational efficiency (training/indexing time and real-time inference latency).

Real-time handling of low-information and ambiguous posts. In deployment, classification is executed synchronously during post creation (rather than as a background job) to avoid any window where inappropriate content could become visible before being assessed. Low-information/no-value submissions are handled through lightweight content validation (e.g., minimum content checks) prior to invoking the SVM classifier. For ambiguous-but-meaningful posts that pass basic validation but still lack context, the system relies on the existing moderator workflow: the model outputs a classification decision, and moderator tools are used to review edge cases and manage false positives/negatives when needed.

Figure 5 shows the class imbalance in the training dataset, which was composed of spam and regular posts. This reflects our sampled benchmark setting (approximately a 2:1 regular-to-spam ratio) used to ensure sufficient spam examples for stable model development; real-world prevalence was assessed separately using the large-scale validation dataset. A histogram illustrates this distribution, highlighting the prevalence 263 of regular posts. The spam posts, although fewer, were carefully curated for diversity 264 through a manual selection process, which involved a rigorous evaluation of the posts by content experts to ensure that the spam dataset included representative, varied instances of spam activity. Manual expert labeling involves domain experts reviewing posts and categorizing them based on their subjective judgments of factors such as “value” and “ambiguity”. The density trace in Figure 4 reveals the spread within each class: regular posts displayed a broad distribution, which captured the varied nature of legitimate content, while for the spam posts, a greater density could be seen, which represented targeted sampling efforts. These visualizations confirm the effectiveness of the sampling strategy in terms of mitigating imbalance, thus enhancing the model’s capability for accurate spam detection.

Figure 6 shows the class imbalance in the test dataset, which comprised 1536 regular posts and 757 spam posts, following a 2:1 ratio that was selected as a controlled sampling strategy to create a stable development benchmark with sufficient spam examples for evaluation. This balanced structure captured the diverse patterns of both regular and spam posts and provided a representative basis for evaluation of the model. The classifier achieved an impressive value of 98% for accuracy, with only 28 misclassified posts out of 2293 entries, which were processed within 73 s, showcasing the model’s precision in distinguishing spam from regular content. The training dataset was imbalanced; the model had learned patterns, leading to misclassification of posts with rare or unusual features. The 28 posts were forum entries that had a hidden hyperlink within a period (.). The text content was valid for the forum, but after the SVM model processed it, the period (.) revealed the deception. The histogram and density trace charts for this dataset reveal a dominant cluster of regular posts alongside a targeted spread for spam, results that support the use of this sampled benchmark for development–time evaluation; real-world prevalence and robustness under extreme imbalance were assessed using the independent validation dataset.

Figure 7 shows the composition of the validation dataset, which included the entire content of recent forum data up to the latest entries, with all regular and spam posts created between the first deployment of the forum and February 2023. This dataset encompassed 1,980,755 regular posts and 2408 spam posts, providing a highly imbalanced yet realistic view of actual forum interactions. Unlike the sampled training/test benchmarks, this validation set preserved the forum’s real-world prevalence of spam and therefore provided the most representative distribution for deployment-level evaluation. The classifier’s performance on this dataset, which was processed over nearly 49 h, yielded an accuracy of just under 90%. The confusion matrix reveals that 589 spam posts were misclassified as regular posts, while 197,679 regular posts were incorrectly identified as spam. This performance discrepancy, primarily involving false positives in the regular class, was attributed to the limited diversity in regular post samples due to prior subsampling adjustments. Despite these challenges, however, the results on the validation set were deemed satisfactory for real-world application of the model.

The datasets used for training, testing, and validation were processed separately to mitigate the risks of overfitting or conditioning. Although the training and testing datasets originated from the same initial collection, they were carefully partitioned to prevent overlap. The training dataset comprised 60% of the spam posts, which were paired with a proportional number of regular posts to maintain a 2:1 ratio of regular posts to spam. We emphasize that this 2:1 ratio was a development–time sampling choice (to ensure enough positive spam examples), whereas the validation dataset was intentionally left highly imbalanced to reflect real-world deployment conditions. The test dataset included the remaining 40% of the spam posts and a corresponding number of regular posts to give the same ratio. Furthermore, a completely independent validation dataset, consisting of newly collected posts from February 2023 onward, was introduced to evaluate the model’s performance on unseen data. This rigorous separation ensured that the training process was distinct from testing and validation, thus effectively reducing the likelihood of overfitting and allowing for an accurate assessment of the model’s generalizability in real-world conditions.

The performance of the trained SVM classifier was initially evaluated on a test set comprising 2293 posts. Standard classification metrics such as accuracy, precision, recall, and F1-score were used to assess its effectiveness. Decision trees reduce accuracy because they perform poorly on imbalanced data. SVM is a robust algorithm that is especially useful when classes are not linearly separable.

The confusion matrix for the test set is shown in Figure 8, which represents a heatmap showing the distribution of correctly and incorrectly classified posts in the test set.

The confusion matrix reveals that only 23 regular posts were incorrectly classified as spam (false positives), while five were misclassified as regular posts (false negatives), reflecting a high level of classification performance, as shown in Figure 8. The results for the performance metrics on the test set are shown in Table 6.

The classifier achieved an impressive accuracy of 98.1%, with a precision of 97.2% for spam detection and a recall of 99.3%, thus demonstrating the model’s effectiveness in minimizing false negatives (i.e., spam classified as regular posts). The generalizability of the classifier was assessed using the validation set, which contained 1,980,755 regular posts and 2408 new spam posts. This larger dataset introduced new challenges for the model, mainly due to the larger data imbalance. Table 6 shows the results for the performance metrics on the validation set and Table 7 shows the confusion matrix for the validation set.

The histogram in Figure 9 visualizes the distribution of correctly classified regular, spam, and false positive posts.

Although the overall accuracy of the model dropped to 90%, it achieved a respectable value for precision of 89.7% and recall of 91.5%. The main challenge was the increase in false positives, with 197,679 regular posts misclassified as spam. This reflects the difficulty of handling larger-scale, imbalanced data. The results for the performance metrics on the validation set are shown in Table 8.

An error analysis was conducted to gain further insight into the model’s behavior, with a focus on false positives. The precision–recall curve shown in Figure 10 was generated to visualize the trade-offs between precision and recall.

The integration of advanced technology such as AI represents a critical step forward in combating spam in QA systems. As the techniques used by spammers continuously evolve in order to bypass traditional anti-spam filters, it becomes imperative for moderation systems to adopt equally sophisticated countermeasures. The research presented here demonstrates the necessity of complementing automated processes with manual tools for content moderators to ensure the overall robustness of the system. This dual approach is essential for managing false positives and negatives, thus maintaining the reliability of our spam detection system.

Our real-time classification model, in which SVM is leveraged through full-text search, delivers impressive accuracy, with classification times not exceeding 2 s for posts with a maximum length of 8000 characters. Despite the various constraints encountered during the project, which involved developing a generic spam classification model for technology forums, the diligent use of curated datasets and the selection of an appropriate classifier algorithm enabled an accuracy of approximately 90% to be achieved. This relative accuracy underscores the importance of regular retraining, which should be carried out monthly to integrate new data and enhance the effectiveness of the model. With an average duration of 200 s, the retraining process ensures continuous improvement without significant operational disruptions.

The model encompasses additional moderation categories in addition to spam, such as abuse detection and report categorization. The potential for incorporating sentiment analysis and image classification further broadens the scope of future developments. The high recall, in particular, suggests that the model was very effective at identifying spam posts with minimal oversight. This is a crucial factor in spam detection systems, where undetected spam can lead to significant user dissatisfaction and platform integrity issues. In addition, the precision score indicates that the model minimized the number of non-spam posts misclassified as spam, reflecting a low risk of unnecessarily flagging legitimate content. Maintaining user trust by accurately distinguishing between spam and legitimate posts is essential in environments such as public forums, where content is user-driven. However, although the test set results were promising, the performance on the more extensive validation set highlighted several challenges. The accuracy of the classifier dropped to 90% when it was applied to real-world data consisting of 1,980,755 regular posts and 2408 spam posts. This degradation in performance primarily stemmed from the significant class imbalance, where regular posts vastly outnumbered spam posts by a ratio of over 800:1. As a result, the model struggled with false positives, misclassifying 197,679 regular posts as spam. This finding underscores the difficulty of generalizing a model trained on a relatively balanced dataset to a more imbalanced real-world dataset.

During the experimentation phase, certain types of posts could not be precisely classified as either spam or regular posts due to their ambiguous or unconventional characteristics. Posts with vague or contextually incomplete messages, such as “Check this out!” without additional context, frequently defied clear categorization. Similarly, hybrid posts that combined legitimate technical inquiries with spam-like elements, such as promotional links, proved challenging for the classifier. Multilingual posts in unsupported languages, such as Indonesian, further complicated classification efforts, as the model was designed specifically for English language content. Edge cases included posts where unconventional yet legitimate user behavior closely resembled spam patterns, making classification ambiguous. Other problematic cases involved specialized topics, incomplete or truncated posts, and unstructured or noisy data featuring excessive symbols or irregular formatting. Posts incorporating sarcasm or humor also confused the classifier, which relied on literal text analysis. These types of posts highlight the limitations of the spam classification model in handling content that does not conform neatly to predefined categories.

5. Conclusions

The landscape of spam detection has evolved significantly with the emergence of more sophisticated adversarial spam attacks. Spammers now employ various techniques to evade traditional classifiers, such as intentional misspellings, random insertion of irrelevant words, and the use of special characters. These adversarial tactics present substantial challenges for SVM models trained on standard datasets, making accurate spam detection difficult in real-world scenarios. The present work explored the effectiveness of using a full-text search SVM classifier to detect spam in an imbalanced dataset drawn from a public forum. Representing over 25 years of user-generated content, including regular and spam posts, the dataset provided a rich yet challenging environment for text classification. Through the use of a robust feature extraction pipeline and an SVM classifier, a high level of performance was achieved, especially in the controlled test environment, with an accuracy of 98.1%. However, the results on a large, imbalanced validation set revealed several limitations that warrant further investigation. The discussion below will explore the broader implications of these findings, the limitations of the current approach, and potential avenues for future research. The results for the test set, consisting of 2293 posts (60% regular and 40% spam), revealed a high classification accuracy. The classifier achieved high values for precision (97.2%) and recall (99.3%) for the detection of spam posts, leading to a balanced F1-score of 98.2% In summary, this research highlights the critical role of AI and machine learning in modern content moderation. In this study, a support vector machine (SVM)-based spam classification model was used, with an average validation accuracy that was shown to reach 90%. Our experiments highlight the potential of text SVM classifiers for real-time applications and demonstrate their ability to improve classification efficiency through the fine-tuning of text features. We understand the level of spam in technical forums and carried out practical implementation involving the embedding of spam classifiers within widely used platforms, with an accuracy of 98.1%. By addressing current limitations and exploring future enhancements, the accuracy and efficiency of spam detection systems can be significantly improved, which will ultimately foster a more secure and reliable user environment.

6. Discussions

The exclusion of 9893 non-English posts from the dataset did not significantly change the accuracy of the model, as the classifier was explicitly designed to operate within an English language context. While this step ensured consistency in the dataset and aligned it with the design of the model, it also highlighted a limitation in the approach described here. The exclusion of these posts prevented an evaluation of the classifier’s performance in multilingual environments, which represents an area for future improvement. Expanding the model to include additional languages could significantly enhance its applicability and robustness, as this would allow it to handle the diverse linguistic nature of many technical forums. Our methodology, based on SVM integrated with full-text search and with a focus on technical forums, offers distinct advantages and disadvantages. Firstly, it provides a fully embedded solution that is tailored for RAD platforms; this ensures a high level of relevance and effectiveness in these specific environments and enhances the model’s performance. The automation of real-time spam detection significantly reduces the manual workload for human moderators, allowing them to pay attention to more complex tasks, and ensures the prompt removal of spam, thereby maintaining the quality of the discussion and the user experience. Operating without external libraries simplifies deployment and reduces potential compatibility issues, leading to good processing times and resource utilization performance. By maintaining the quality and relevance of forum discussions, the model can help sustain engagement and trust within the community, which is essential for the success of a technical forum. However, the methodology has certain limitations, such as its focus on technical forums and RAD platforms, which restricts its applicability to other forums or general spam detection. Although avoiding the use of external libraries simplifies some aspects of its construction, the process of integration into existing systems may still be complex, requiring significant initial setup and ongoing fine-tuning. Customization for specific environments carries a risk of overfitting, making the model less adaptable to evolving spam tactics. The emphasis on real-time performance may mean that high computational resources are required, which will give rise to scalability issues for very large or high-traffic forums, and the primary focus on full-text search may limit its effectiveness in identifying non-textual spam elements, such as image- or video-based spam, which could be a constraint for forums that support various types of content. Our choice of SVM was driven by its superior performance for our specific use case, its robustness and efficiency in handling high-dimensional text data, and its ease of integration with the target platforms. SVM is known for its effectiveness in text classification problems, such as spam detection in technical forums. A rigorous set of experiments showed that our SVM-based model has a high level of accuracy (90% during validation and 98.1% in practical implementation), indicating its suitability and effectiveness for our specific spam classification task. In addition, SVMs can be seamlessly integrated with SQL and APEX-based systems; this was one of the primary objectives of this research in order to ensure smooth deployment within existing technical infrastructure. The proposed model reduces the workload for human moderators and enhances the user experience by maintaining the quality and relevance of forum discussions. Limitations of our work include the fact that extremely short or context-poor posts can be challenging to label consistently (e.g., vague messages without context). These cases are handled via basic input validation and moderator-in-the-loop review and are a target for future refinement (e.g., confidence-based triage). Our model for detecting spam in forums significantly enhanced user trust by ensuring cleaner, safer, and more relevant communication environments. By accurately identifying and removing spam, the model helped protect users from malicious content, such as scams or harmful links, and fostered confidence in the reliability of forum interactions. This, in turn, encouraged greater user engagement and trust in the model. However, challenges such as false positives, undetected spam, lack of transparency, and potential biases in detection undermined trust if not addressed. To maximize its positive impact, the AI system prioritized accuracy while providing users with mechanisms to appeal incorrect moderation decisions.

In future work, these limitations could be addressed by incorporating multilingual content and expanding the dataset to include more extensive historical data, which would enhance the model’s generalizability and applicability. Future development could also include expanding the model to handle additional languages or refining its performance with more advanced features, thereby offering greater flexibility and broader applicability.

Author Contributions

J.A.R.-H.: Software, Validation, Formal Analysis, Investigation, Writing—Original Draft. L.I.B.-S.: Writing—Review and Editing, Funding Acquisition. J.J.S.-E.: Methodology, Validation, Investigation, Funding Acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is available from the corresponding authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Preece, J.; Maloney-Krichmar, D. Online Communities: Design, Theory, and Practice. J.-Comput.-Mediat. Commun. 2005, 10, 10. [Google Scholar] [CrossRef]
Wellman, B.; Salaff, J.; Dimitrova, D.; Garton, L.; Gulia, M.; Haythornthwaite, C. Computer Networks as Social Networks: Collaborative Work, Telework, and Virtual Community. Annu. Rev. Sociol. 1996, 22, 213–238. [Google Scholar] [CrossRef]
Rheingold, H. The Virtual Community: Homesteading on the Electronic Frontier; MIT Press: Cambridge, MA, USA, 1993. [Google Scholar]
Chen, Y.R.; Chen, H.H. Opinion Spam Detection in Web Forum: A Real Case Study. In The Web Conference; International World Wide Web Conferences Steering Committee: Geneva, Switzerland, 2015. [Google Scholar]
Chen, Y.R.; Chen, H.H. Opinion Spammer Detection in Web Forum. In Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Association for Computing Machinery: New York, NY, USA, 2015. [Google Scholar]
Chowdhury, S.; Hindle, A. Mining StackOverflow to Filter Out Off-Topic IRC Discussion. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories; IEEE: New York, NY, USA, 2015. [Google Scholar]
Shin, Y.; Gupta, M.; Myers, S. Prevalence and mitigation of forum spamming. In 2011 Proceedings IEEE INFOCOM; IEEE: New York, NY, USA, 2011. [Google Scholar]
Shin, Y.; Gupta, M.; Myers, S. The Nuts and Bolts of a Forum Spam Automator. In Proceedings of the 4th USENIX Workshop on Large-Scale Exploits and Emergent Threats, Boston, MA, USA, 29 March 2011. [Google Scholar]
Yuan, N.; Chen, H.; Hsu, F.; Wang, Y.-M.; Ma, M. A Quantitative Study of Forum Spamming Using Context-based Analysis. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, USA, 28 February–2 March 2007. [Google Scholar]
Ratadiya, P.; Moorthy, R. Spam filtering on forums: A synthetic oversampling approach. arXiv 2019, arXiv:1909.04826. [Google Scholar] [CrossRef]
Shin, Y.; Myers, S.; Gupta, M.; Radivojac, P. A link graph-based approach to identify forum spam. Secur. Commun. Netw. 2015, 8, 176–188. [Google Scholar] [CrossRef]
Ghourabi, A.; Alohaly, M. Enhancing Spam Message Classification and Detection Using Transformer-Based Embedding and Ensemble Learning. Sensors 2023, 23, 3861. [Google Scholar] [CrossRef] [PubMed]
University of California. Comparative Analysis for Email Spam Detection Using SVM and Other Classifiers; University of California: Berkeley, CA, USA, 2023. [Google Scholar]
Jain, R. Performance Evaluation of SVM with Dimensionality Reduction for Various Types of Spam Detection. Int. J. Comput. Appl. 2022, 184, 1–7. [Google Scholar]
Alpaydin, E.; Bach, F. Introduction to Machine Learning; MIT Press: Cambridge, MA, USA, 2014. [Google Scholar]
Lorica, B.; Loukides, M. What Is Artificial Intelligence? O’Reilly Media, Inc.: Sebastopol, CA, USA, 2016. [Google Scholar]
Marr, B.; Ward, M. Artificial Intelligence in Practice; Wiley: Hoboken, NJ, USA, 2019. [Google Scholar]
National Research Council, Commission on Physical Sciences, Mathematics, and Applications; Naval Studies Board, Panel on Computer Science and Artificial Intelligence. Computer Science and Artificial Intelligence; National Academies Press: Washington, DC, USA, 1997. [Google Scholar]
Swamynathan, M. Mastering Machine Learning with Python in Six Steps: A Practical Implementation Guide to Predictive Data Analytics Using Python; Academic Press: Bangalore, India, 2019. [Google Scholar]
Müller, A.C.; Guido, S. Introduction to Machine Learning with Python: A Guide for Data Scientists; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2016. [Google Scholar]

Figure 1. Manually flagged posts.

Figure 2. Workflow for data preprocessing.

Figure 3. Proposed SVM-based spam classification model.

Figure 4. Example of a post in German in the forum.

Figure 5. Class imbalance in the training dataset, showing the predominance of regular posts over spam in forum data.

Figure 6. Class imbalance in the test dataset, showing a sampled 2:1 ratio of regular posts (1536) to spam posts (757) used as a controlled evaluation benchmark.

Figure 7. Class imbalance for the validation set, showing a high level of imbalance, with 1,980,755 regular posts vs. 2408 spam posts.

Figure 8. Confusion matrix for the test set.

Figure 9. Histogram of results on the validation set.

Figure 10. Results for precision and recall on the test and validation sets.

Table 1. Comparative analysis of different spam detection methods.

Features	Ghourabi & Alohaly (2023) [12]	University of California (2023) [13]	Jain et al. (Various Years) [14,15,16,17,18,19,20]	Technical Forums (Context-Based) [4]
Methods	Hybrid model combining transformers and ensemble learning	SVM with TF-IDF vectorizer	SVM combined with other classification methods and dimensionality reduction	Real-time classification using SVM with full-text search
Data Focus	SMS	Forums	SMS, forums, social networks	Technical forums
Key Techniques	Pre-trained embeddings for text representation	Preprocessing, stemming, tokenization, TF-IDF	Dimensionality reduction, hyperparameter optimization	Exploratory analysis, descriptive statistics, user profile analysis
Performance	-	97.5% accuracy	-	-
Specialization	-	High-dimensional spaces, binary classification	Adaptable to various types of spam	Maintains quality and relevance of forum discussions
Computational Resources	-	-	-	High computational demand, emphasis on processing times
Scalability	-	-	-	Scalability issues for large or high-traffic forums
Spam Elements Detected	Textual	Textual	Textual	Textual, some limitations for non-textual (image/video) spam
Integration Complexity	-	-	-	Complex integration, initial setup and fine-tuning required
Risk of Overfitting	-	-	-	Customization for specific environments, less adaptable to evolving tactics

Table 2. Composition of the dataset.

Dataset	Regular	Spam	Total
Spam pool (source of positives)	–	1916	1916
Training set (sampled, $\tilde{2}$ :1)	2300	1159	3459
Test set (sampled, $\tilde{2}$ :1)	1536	757	2293
Validation set (real-world scale)	1,980,755	2408	1,983,163

Table 3. Stop words used in the model.

Stop Words
a	all	almost	also	although
an	and	any	are	as
but	by	can	could	d
did	do	does	either	for
from	had	has	have	having
he	her	here	hers	him
his	how	however	i	if
in	intro	is	it	its
just	ll	me	might	Mr
Mrs	Ms	my	no	non
nor	not	of	on	one
only	onto	or	our	ours
s	shall	she	should	since
so	some	still	such	t
than	that	the	their	them
then	there	therefore	these	they
this	those	though	through	thus
to	too	until	ve	very
was	we	were	what	when
where	whether	which	while	who
whose	why	will	with	would
yet	you	your	yours

Table 4. Lexer configuration cited.

Attribute	Value
Index Text	Enabled
Index Themes	Enabled
Stem Language	English

Table 5. Basic word list configuration cited.

Attribute	Value
Fuzzy Match	English
Fuzzy Score	60
Fuzzy Num Results	5000
Stemmer	English

Table 6. Performance metrics on test set.

Metric	SVM	Decision Tree
Accuracy	98.1%	1.67%
Precision	97.2%	1.83%
Recall	99.3%	1.50%
F1-Score	98.2%	1.65%

Table 7. Confusion matrix for the validation set.

	Predicted Regular	Predicted Spam
Actual regular	1,783,076	589
Actual spam	197,679	1819

Table 8. Performance metrics on validation set.

Metric	Value
Accuracy	90.0%
Precision	89.7%
Recall	91.5%
F1-Score	90.5%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Rivera-Hernandez, J.A.; Barbosa-Santillán, L.I.; Sánchez-Escobar, J.J. A Generative Approach to Enhancing Forums Through SVM-Based Spam Detection. Data 2026, 11, 78. https://doi.org/10.3390/data11040078

AMA Style

Rivera-Hernandez JA, Barbosa-Santillán LI, Sánchez-Escobar JJ. A Generative Approach to Enhancing Forums Through SVM-Based Spam Detection. Data. 2026; 11(4):78. https://doi.org/10.3390/data11040078

Chicago/Turabian Style

Rivera-Hernandez, Jose Antonio, Liliana Ibeth Barbosa-Santillán, and Juan Jaime Sánchez-Escobar. 2026. "A Generative Approach to Enhancing Forums Through SVM-Based Spam Detection" Data 11, no. 4: 78. https://doi.org/10.3390/data11040078

APA Style

Rivera-Hernandez, J. A., Barbosa-Santillán, L. I., & Sánchez-Escobar, J. J. (2026). A Generative Approach to Enhancing Forums Through SVM-Based Spam Detection. Data, 11(4), 78. https://doi.org/10.3390/data11040078

Article Menu

A Generative Approach to Enhancing Forums Through SVM-Based Spam Detection

Abstract

1. Introduction

2. State of the Art

3. A Generative Approach to Enhancing Forums Through SVM-Based Spam Detection

4. Experiments and Results

5. Conclusions

6. Discussions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI