Article

BERT-Based Models for Normalization of Adverse Drug Event Expressions in Social Media to Standard Medical Terminology for Drug Safety Analysis

1 National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR 72079, USA
2 Pengcheng Laboratory, Shenzhen 518000, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Big Data Cogn. Comput. 2026, 10(5), 141; https://doi.org/10.3390/bdcc10050141
Submission received: 19 February 2026 / Revised: 20 April 2026 / Accepted: 21 April 2026 / Published: 2 May 2026
(This article belongs to the Section Data Mining and Machine Learning)

Abstract

Social media platforms host abundant and timely descriptions of medication experiences that can complement traditional pharmacovigilance systems. Yet the linguistic informality of these data presents a major challenge for mapping adverse drug event (ADE) expressions to standardized medical terminology. In this study, we developed BERT-based language models to classify ADE mentions from social media into MedDRA System Organ Classes (SOCs). Using the SMM4H and CADEC corpora, as well as their combination, we performed 20 iterations of 20% holdout validation for 3-, 6-, 22-, and 25-SOC classification tasks with a selected fixed training configuration (learning rate, batch size, and training epochs) based on training-loss convergence. The models achieved accuracies ranging from 75% to 94%, demonstrating strong performance for SOC-level classification of noisy and informal ADE expressions under the evaluated settings. These results are based on a controlled mention-level evaluation using deduplicated adverse drug event strings and do not establish document-level or real-world deployment generalization. This work provides a systematic evaluation of BERT-based models for SOC-level classification of ADEs and demonstrates consistent performance within the evaluated datasets and label granularities. While direct comparison with prior studies is limited by differences in datasets and evaluation protocols, the results demonstrate that transformer-based models can effectively classify ADEs into SOCs. These findings support the use of transformer-based normalization for SOC-level aggregation of user-reported adverse events and their integration into large-scale social media pharmacovigilance pipelines as a downstream component under controlled conditions.

1. Introduction

Drug safety is a fundamental component of public health and regulatory science. Spontaneous reporting systems, such as the FDA Adverse Event Reporting System (FAERS), are widely used for post-marketing pharmacovigilance and safety signal detection [1,2,3]. Beyond these reporting systems, patients increasingly share information about adverse drug reactions, treatment experiences, and medication use on social media platforms. As a result, social media has emerged as an important source of real-world data for drug safety surveillance in recent years [4,5,6,7,8,9]. Experiences and opinions on drug safety are posted on platforms like Facebook, X, and web forums. Such content provides real-time, rich, and diverse information that is complementary to traditional sources in pharmacovigilance [10,11,12]. For instance, social media data can be used to identify adverse drug events that may not be detected in traditional sources such as clinical trial reports, which are constrained by limited numbers of participants and controlled environments [13,14]. Recent studies showed that social media data can be effectively utilized to detect adverse drug events in real-time, aiding drug safety signal detection [15,16]. It is noteworthy that social media data have been used for identifying adverse events associated with COVID-19 treatment drugs during the COVID-19 pandemic [17,18,19,20], demonstrating the value of social media data in drug safety monitoring.
Social media posts often contain informal language, including slang, abbreviations, and misspellings, which makes the terms used to describe adverse drug events quite different from those used in clinical trial reports. Techniques used for identifying adverse drug events in traditional data sources may not be able to accurately detect adverse drug events in social media posts. Therefore, language models have been developed to identify adverse drug events in social media posts [21,22,23].
Another challenge is that different terms are used to describe the same adverse drug event in social media posts, affecting the accuracy of subsequent drug safety profile analysis. Therefore, normalizing adverse drug events extracted from social media posts into standardized terms is important for improving the accuracy of analysis on drug safety profiles obtained from these posts. Rule-based methods have been explored for normalizing adverse drug events identified from drug labeling documents to Medical Dictionary for Regulatory Activities (MedDRA) terms [24].
Normalizing adverse drug events identified from social media data is a challenging task due to the unstructured format and informal language used in user-generated content [25]. Models trained on clinical trial reports to normalize adverse drug events may not be effective when applied to social media data, owing to the significant differences in language style between clinical trial reports and social media content [26,27]. Moreover, the large amount of irrelevant information in social media posts makes it harder to detect and normalize adverse drug events. These irrelevant inputs can lead to incorrect associations, ultimately reducing the accuracy of models for drug safety monitoring [6,7,28,29,30,31]. These challenges underscore the need for language models to improve both the extraction and SOC-level classification of adverse drug events. Mapping to MedDRA System Organ Classes (SOCs) provides a coarse-grained representation that supports the aggregation of user-reported adverse drug events into standardized safety profiles. This task is distinct from the finer-grained task of mapping to Preferred Terms (PTs) or Lowest Level Terms (LLTs), which involves detailed concept-level normalization.
Traditional methods use preferred terms or lower-level terms from medical dictionaries to match and filter adverse drug events. These rule-based methods work well with adverse drug events identified in medical reports and drug labeling documents which are in a formal style. However, such rule-based methods are not able to handle adverse drug events expressed in informal language, which are common in social media data. Therefore, different methods are needed to understand and standardize adverse drug events identified from social media data [5,32,33].
MedDRA [34] is a widely used standardized medical terminology that facilitates uniform reporting and analysis of adverse drug events globally. MedDRA is organized in a hierarchy with 27 System Organ Classes (SOCs) at the top. The terms in MedDRA are grouped at different levels based on the organ system affected. Under SOCs, levels of terms (from high to low) are High-Level Group Terms (HLGTs), High-Level Terms (HLTs), Preferred Terms (PTs), and Lowest Level Terms (LLTs). There are over 80,000 LLTs that give detailed descriptions of adverse drug events. MedDRA is updated frequently to incorporate common terms used by healthcare professionals and to hierarchically classify frequently used adverse drug event terms into 27 SOCs. It serves as a standard for ensuring consistency and providing a framework for the classification and comparison of adverse drug events [35,36,37,38]. This standard is crucial for integrating findings from diverse data sources into a cohesive drug safety profile and maintaining a clear and consistent analysis of drug safety data with a shared understanding of adverse drug events.
BERT can understand the context of words by learning bi-directional representations of text, making it effective for handling complex and unstructured textual data [39,40]. Therefore, BERT-based models could be developed for extracting and normalizing adverse drug events from diverse sources, including social media. Normalizing adverse drug events to MedDRA term categories by BERT-based models ensures that adverse drug events identified from social media data can be compared with those detected in clinical trial documents, improving drug safety monitoring [41,42,43]. The ability of BERT-based models to standardize adverse drug events makes them valuable tools for creating consistent and reliable drug safety profiles across different data sources.
Recent studies have explored automated extraction and normalization of adverse drug event mentions from social media, using methods such as rule-based pipelines, deep learning architectures, semantic similarity models, and early transformer-based approaches [44,45,46,47,48,49]. Although these efforts have advanced the field of social media-based pharmacovigilance, adverse drug event normalization—particularly the mapping of noisy, informal patient language to MedDRA terminology—remains a persistent challenge. Current methods often struggle with spelling variability, colloquial symptom expressions, and contextual ambiguity, resulting in limited accuracy and inconsistent performance across datasets. Because robust normalization is essential for aggregating user-reported events, detecting emerging safety trends, and reliably integrating social media signals into broader pharmacovigilance workflows, there remains a clear need for more effective, domain-adapted supervised models.
Although mapping to PTs or LLTs is the ideal objective, it remains challenging in the context of social media due to extreme label sparsity and long-tail distribution, where many specific terms lack sufficient training examples. In this setting, mapping to SOCs provides a more tractable alternative, offering a clinically meaningful, coarse-grained categorization that supports high-level drug safety profiling.
In this study, we address this gap by developing a supervised BERT-based language model for SOC-level classification of adverse drug events identified from social media posts. This is a coarse-grained task, distinct from fine-grained normalization to PTs or LLTs. We therefore focus on evaluating the ability of BERT-based models to map informal adverse drug event expressions into clinically meaningful SOCs under a controlled supervised setting. We conduct a systematic evaluation across multiple datasets, including SMM4H, CADEC, and their combination, and examine performance under different class configurations. In particular, we assess the impact of class size and class distribution on model performance and compare results across datasets with different characteristics. Our results show that BERT-based models can classify informal adverse drug event mentions into MedDRA SOCs within the defined evaluation framework. These findings highlight the potential of transformer-based models for addressing linguistic variability in normalizing adverse drug events from social media. More broadly, this study demonstrates that adverse drug events reported in social media data can be classified at a coarse-grained level. It supports the use of SOC-level classification as a practical step toward aggregating user-reported adverse drug events and integrating social media data into pharmacovigilance workflows, while recognizing that fine-grained PT/LLT normalization remains an open challenge.

2. Methods

2.1. Study Design

The adverse drug event normalization models were designed to map adverse drug events identified in social media posts to MedDRA SOCs. The modeling framework assumes that adverse drug event mentions have been correctly identified and focuses on the downstream classification task. Nine models were developed and tested with two data sources, SMM4H (Social Media Mining for Health) [50] and CADEC (CSIRO Adverse Drug Event Corpus) [51], as shown in Figure 1. In brief, for each of the nine datasets that were derived from the two initial data sources, 80% were randomly selected and used to train a model and the remaining 20% were used to test the developed model. To prevent information leakage from repeated expressions, we removed duplicated adverse drug event mentions prior to data partitioning. This preprocessing step reduced the SMM4H corpus from 1717 raw records to 1106 unique adverse drug event strings, and the CADEC corpus from 5959 raw records to 3348 unique adverse drug event strings. Datasets were then partitioned such that no adverse drug event string in the test set appeared in the training set. During model development, the training dataset (80% of the whole dataset) was further split into 80% and 20% to evaluate training behavior under different parameter settings. Hyperparameters were selected based on training convergence and applied consistently across all datasets. The developed model was then evaluated using the remaining 20% of the data. This 20% holdout validation process was repeated 20 times with different random splitting to reach a statistically consistent performance evaluation within datasets.
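The partitioning scheme described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `dedupe_mentions` and `holdout_split`, the lowercase/strip normalization used to detect duplicates, the "first label wins" rule, and all toy records except the "feeling like a zombie" example are assumptions. The seed sequence {2, 4, …, 40} is the one reported in Section 2.5.

```python
import random

def dedupe_mentions(records):
    """Keep one (mention, soc) pair per mention string.

    Assumption: duplicates are detected after lowercasing and stripping
    whitespace, and the first label seen for a string is kept.
    """
    seen = {}
    for mention, soc in records:
        key = mention.strip().lower()
        if key not in seen:
            seen[key] = (key, soc)
    return list(seen.values())

def holdout_split(items, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and return (train, test)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# The 20 seeds {2, 4, ..., 40} reported in the text.
SEEDS = list(range(2, 41, 2))

# Toy records; the labels "SOC_0".."SOC_2" are hypothetical placeholders.
records = [
    ("feeling like a zombie", "10018065"),
    ("Feeling like a zombie ", "10018065"),  # duplicate after normalization
] + [(f"mention {i}", f"SOC_{i % 3}") for i in range(9)]

unique = dedupe_mentions(records)
for seed in SEEDS:
    # Outer 80/20 split: the 20% part is the held-out test set.
    train, test = holdout_split(unique, seed=seed)
    # Inner 80/20 split of the training part, used only to inspect training behavior.
    dev_train, dev_val = holdout_split(train, seed=seed)
```

Because deduplication happens before the split, no mention string can appear on both sides of a partition, which is the leakage safeguard the text describes.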

2.2. SMM4H Dataset

The SMM4H dataset, a valuable resource in health informatics, contains human expert annotations of adverse drug events identified in social media posts. The annotated adverse drug events are mapped to MedDRA SOCs. This dataset is a useful data source for developing language models to automatically normalize adverse drug events identified from unstructured textual data such as social media posts. The SMM4H dataset includes 1717 adverse drug events which are mapped to 25 MedDRA SOCs (Table 1).

2.3. CADEC Dataset

The CADEC dataset contains human expert annotated adverse drug events extracted from online health forums (https://www.askapatient.com, forum from 2001 to 2013, accessed on 18 February 2026) where patients discuss their experiences with medications. This dataset contains 5959 annotated adverse drug events which are mapped to 22 MedDRA SOCs (Table 1).

2.4. Datasets for Model Development

The frequency of adverse drug events differs significantly across various MedDRA SOCs in both SMM4H and CADEC datasets, as shown in Table 1. The top six SOCs with most adverse drug events in the SMM4H dataset are SOC codes 10018065, 10037175, 10029205, 10017947, 10028395, and 10022891 (Table 1). In addition to the 25-class SMM4H and 22-class CADEC datasets, the following seven datasets were derived from the SMM4H and CADEC datasets. The 3-class and 6-class datasets were constructed by selecting the most frequent SOCs from the SMM4H and CADEC datasets. These subsets were designed to evaluate model performance under reduced class cardinality. They represent simplified classification scenarios and should not be considered equivalent to the full SOC normalization task.
Both: A 25-class dataset that merges SMM4H and CADEC datasets. When constructing the ‘Both’ models, we adopted a merge-then-deduplicate strategy. This process identified 146 unique adverse drug event terms shared between the SMM4H and CADEC corpora, which were consolidated prior to dataset partitioning. The resulting combined dataset contained 4307 unique adverse drug event mentions. This combined dataset reflects the union of SOCs present in both SMM4H and CADEC (Table 1). Specifically, SMM4H contains 25 SOCs and CADEC contains 22 SOCs; their integration yields 25 unique SOCs due to partial overlap between the two datasets. To ensure consistency and comparability, all SOC labels were standardized according to MedDRA version 26.1.
SMM4H-3: A three-class dataset that consists of 1200 adverse drug events categorized under the three SOCs with codes 10018065, 10037175, and 10029205, derived from the SMM4H dataset.
CADEC-3: A three-class dataset comprising 2541 adverse drug events belonging to the three SOCs with codes 10018065, 10037175, and 10029205, derived from the CADEC dataset.
Both-3: A three-class dataset combining SMM4H-3 and CADEC-3.
SMM4H-6: A six-class dataset containing 1462 adverse drug events, including SMM4H-3 dataset and additional 262 adverse drug events in the three SOCs with codes 10017947, 10028395, and 10022891, derived from the SMM4H dataset.
CADEC-6: A six-class dataset incorporating 4953 adverse drug events, including CADEC-3 dataset and additional 2412 adverse drug events in the three SOCs with codes 10017947, 10028395, and 10022891, derived from the CADEC dataset.
Both-6: A six-class dataset merging SMM4H-6 and CADEC-6.
These nine datasets were used to build and evaluate BERT-based models for normalizing adverse drug events in different scenarios.
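The construction of the reduced-class and combined datasets can be sketched as below. This is an illustrative sketch under stated assumptions: the helper names `top_k_subset` and `merge_then_dedupe` are hypothetical, the toy records and the labels "X", "Y", "Z" are invented, and the rule that the first label seen wins when a mention appears in both corpora is an assumption, since the text does not say how shared terms were consolidated.

```python
from collections import Counter

def top_k_subset(records, k):
    """Keep only records whose SOC is among the k most frequent SOCs."""
    counts = Counter(soc for _, soc in records)
    keep = {soc for soc, _ in counts.most_common(k)}
    return [(mention, soc) for mention, soc in records if soc in keep]

def merge_then_dedupe(*corpora):
    """Concatenate corpora, then keep one entry per mention string.

    Assumption: when a mention occurs in several corpora, the label
    from the first corpus listed is kept.
    """
    merged = {}
    for corpus in corpora:
        for mention, soc in corpus:
            merged.setdefault(mention, soc)
    return list(merged.items())

# Toy corpora with hypothetical mentions and labels.
smm4h_toy = [("a", "X"), ("b", "X"), ("c", "Y"), ("d", "Z"), ("e", "Y"), ("f", "X")]
cadec_toy = [("a", "X"), ("g", "Z")]

smm4h_top2 = top_k_subset(smm4h_toy, k=2)        # keeps the "X" and "Y" records
both_toy = merge_then_dedupe(smm4h_toy, cadec_toy)  # "a" is shared and kept once
```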

2.5. Model Architecture

Figure 2 shows the architecture of the BERT-based model for normalizing adverse drug events to SOCs. The model takes text with adverse drug events and predicts the SOCs they belong to. First, the text is broken into tokens (words or subword units). These tokens are converted into vectors that represent their meaning, helping the model understand the text context. The model uses a pretrained BERT model to convert tokens into vectors. The vectors are then input to a fully connected neural network. This network compresses the vectors to a smaller space that matches the number of SOCs. It uses transformations and activation functions to learn the differences in adverse drug events between SOCs. Our model uses a softmax function to convert the raw scores for an adverse drug event into probabilities, and the SOC with the highest probability is selected for the adverse drug event. The final output is the predicted SOCs for the input adverse drug events. For example, the adverse drug event “feeling like a zombie” is assigned to the SOC with code 10018065, “general disorders and administration site conditions.”
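The final decision step (softmax over the raw scores, then selection of the most probable SOC) can be illustrated with a small standalone sketch. It covers only the output layer, not the DistilBERT encoder or the fully connected network; the function names and the raw scores are hypothetical, while the three SOC codes are those of the 3-class datasets.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_soc(logits, soc_codes):
    """Return the SOC code with the highest probability, and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return soc_codes[best], probs[best]

# SOC codes of the 3-class setting; the raw scores below are invented for illustration.
soc_codes = ["10018065", "10037175", "10029205"]
example_logits = [2.1, 0.3, -1.0]
predicted_code, confidence = predict_soc(example_logits, soc_codes)
```

Since softmax is monotonic, the selected SOC is the same as the one with the largest raw score; the probabilities matter only when a confidence value is needed.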
The pretrained BERT model used is distilbert-base-uncased, with a vector dimension of 768. Each token is represented by a 768-dimensional vector, and this size remains the same throughout the model training process. Distilbert-base-uncased is a compact and efficient version of BERT, designed with fewer parameters to enhance speed and reduce memory usage. Despite its smaller size, it retains most of BERT’s language comprehension capabilities, performing well on various natural language processing tasks. This architecture leverages the rich information in BERT’s embeddings, making the developed model effective in predicting SOCs for adverse drug events.
The experiments were conducted using Python 3.10, PyTorch 2.1.0, and the Hugging Face Transformers library (version 4.38.2). All models were based on the distilbert-base-uncased architecture and fine-tuned for multi-class classification. Text inputs were tokenized using the corresponding tokenizer, with sequences truncated or padded to a maximum length of 256 tokens.
Models were trained using the AdamW optimizer. Hyperparameters were tuned using a sequential strategy. The number of training epochs was evaluated up to 100, with 40 epochs selected based on training loss convergence. Learning rates were evaluated over the set {1 × 10−6, 1 × 10−5, 5 × 10−5, 1 × 10−4, 5 × 10−4}, and batch sizes over {8, 16, 32, 64, 128}. The final configuration (40 epochs, learning rate 5 × 10−5, batch size 16) was applied across all datasets.
To ensure reproducibility, the 20% holdout evaluation was repeated 20 times using a fixed sequence of random seeds {2, 4, 6, …, 40}, ensuring consistent data partitioning and performance evaluation across runs.
In this study, the input to the model is the isolated adverse drug event mention string rather than the full post or document context. Although exact duplicate strings were removed prior to splitting, the evaluation remains at the mention level. As a result, near-duplicate expressions and contextual dependencies within posts or users may still exist across training and test sets. Therefore, this setup should be interpreted as a controlled mention-level evaluation and does not fully eliminate potential information leakage present in real-world scenarios.
To provide a reference for model performance, we implemented a traditional machine learning classifier using TF-IDF features and Logistic Regression. Adverse drug event mention strings were represented using TF-IDF with a vocabulary size capped at 2500 and an n-gram range of 1–2. These features were used to train a multi-class Logistic Regression model with a maximum of 1000 iterations.
The baseline was evaluated on the same dataset variants as the BERT-based models, including SMM4H, CADEC, and the combined dataset under 3-class, 6-class, and full SOC settings. Data was split into 80% training and 20% testing sets using stratified sampling. Each experiment was repeated with three random seeds, and performance was reported as mean accuracy and standard deviation.
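A baseline with the reported settings could be assembled as in the following sketch, assuming scikit-learn is the implementation used (the text does not name the library). The TF-IDF and Logistic Regression parameters come from the description above; the function name `build_baseline`, the toy mentions, and the labels "SOC_A" to "SOC_C" are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_baseline():
    """TF-IDF (vocabulary capped at 2500, unigrams and bigrams) + Logistic Regression."""
    return make_pipeline(
        TfidfVectorizer(max_features=2500, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )

# Toy mention strings with placeholder labels, for illustration only.
texts = [
    "cant sleep at all", "so sleepy all day",
    "stomach hurts bad", "my stomach is killing me",
    "feel dizzy", "dizzy and lightheaded",
]
labels = ["SOC_A", "SOC_A", "SOC_B", "SOC_B", "SOC_C", "SOC_C"]

baseline = build_baseline()
baseline.fit(texts, labels)
predictions = baseline.predict(["stomach hurts"])
```

In an actual experiment the pipeline would be fitted on the 80% training portion of a stratified split and scored on the remaining 20%, as described in the text.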

2.6. Selection of Algorithmic Parameters

As shown in Figure 2, for a set of algorithmic parameters, a model was trained to predict SOCs for the training adverse drug events. The predicted SOCs were then compared with the true SOCs, and the difference was calculated as LOSS, which measures performance of the model. A parameter setting associated with stable training-loss convergence was selected to train the final BERT-based model.
The algorithmic parameters explored in this study (number of epochs, learning rate, and batch size) are important in training the BERT-based model. In an epoch, the model processes all training adverse drug events and adjusts the weights of its neurons based on the computed LOSS to improve performance. The number of epochs is vital for the model’s generalization, as too few epochs may lead to underfitting while too many could cause overfitting. The learning rate controls how much the model’s weights change in response to the LOSS. Batch size determines how many samples the model processes before updating the weights. Hyperparameters, including number of epochs, learning rate, and batch size, were selected based on training loss convergence behavior. Specifically, we identified parameter settings under which the training loss converged. The identified hyperparameter settings (40 epochs, learning rate 5 × 10−5, batch size 16) were then applied across all datasets and class settings to ensure comparability. We did not perform validation-based hyperparameter selection or early stopping, which may limit generalization performance, particularly for smaller or imbalanced datasets.
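The text does not state a numerical criterion for "converged" training loss, so the following is only one possible way to make the idea concrete. The function `plateau_epoch`, the tolerance `tol`, the window length `patience`, and the synthetic loss curve are all assumptions introduced for illustration.

```python
from typing import Optional, Sequence

def plateau_epoch(losses: Sequence[float], tol: float = 1e-3, patience: int = 5) -> Optional[int]:
    """Return the 1-indexed epoch from which the loss stays flat, or None.

    "Flat" is a hypothetical criterion: each of the next `patience`
    epoch-to-epoch changes in loss is smaller than `tol` in absolute value.
    """
    for start in range(len(losses) - patience):
        window = losses[start:start + patience + 1]
        changes = [abs(window[i] - window[i + 1]) for i in range(patience)]
        if all(change < tol for change in changes):
            return start + 1
    return None

# Synthetic, geometrically decaying loss curve over 100 epochs (not real training data).
synthetic_losses = [0.8 ** epoch for epoch in range(100)]
converged_at = plateau_epoch(synthetic_losses)
```

A rule like this considers only the training loss, so, like the procedure described above, it says nothing about generalization to held-out data.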

2.7. Performance Metrics

Multiple metrics were used in this study to assess the performance of the BERT-based model for normalizing adverse drug events to SOCs. The first metric is accuracy, which measures the overall performance of the multi-class classification models. Accuracy was calculated as the proportion of adverse drug events whose SOCs were correctly predicted among all adverse drug events predicted, using Equation (1).
$$\mathrm{Accuracy} = \frac{\text{Number of adverse drug events correctly predicted}}{\text{Total number of adverse drug events predicted}} \tag{1}$$
We used the mean value and standard deviation of the 20 accuracy values obtained from the 20 iterations of 20% hold-out validation (Figure 1). The mean accuracy measures the overall performance of the models. The standard deviation indicates how consistent the performance is across the repeated splits of a dataset.
The second metric is the normalized confusion matrix. For an N-class model, this metric is an N-by-N matrix whose element $C_{i,j}$ indicates the likelihood that adverse drug events in SOC$_i$ are predicted as SOC$_j$. To calculate this metric, a confusion matrix was generated from each of the 20 hold-out validations. An overall confusion matrix was then generated by adding up the elements of the 20 confusion matrices. Each element of the overall matrix was divided by the sum of its row (20 times the number of adverse drug events in each SOC), resulting in the normalized confusion matrix, which shows the average probabilities that adverse drug events in SOC$_i$ (i = 1, 2, …, N) are predicted as SOC$_j$ (j = 1, 2, …, N).
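The aggregation described above (sum the per-run confusion matrices, then divide each row by its total) can be sketched as follows. The function names, the two toy runs, and the labels "A" and "B" are hypothetical; leaving an empty row as zeros is an assumption for SOCs that never occur in any test split.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with true classes as rows and predicted classes as columns."""
    index = {label: i for i, label in enumerate(labels)}
    size = len(labels)
    matrix = [[0] * size for _ in range(size)]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[pred_label]] += 1
    return matrix

def normalized_overall(matrices):
    """Sum the per-run matrices element-wise, then normalize each row to sum to 1."""
    size = len(matrices[0])
    total = [[sum(m[i][j] for m in matrices) for j in range(size)] for i in range(size)]
    normalized = []
    for row in total:
        row_sum = sum(row)
        # Assumption: a class with no test instances yields a row of zeros.
        normalized.append([count / row_sum if row_sum else 0.0 for count in row])
    return normalized

# Two toy hold-out runs with hypothetical labels.
labels = ["A", "B"]
run1 = confusion_matrix(["A", "A", "B", "B"], ["A", "B", "B", "B"], labels)
run2 = confusion_matrix(["A", "A", "B", "B"], ["A", "A", "B", "A"], labels)
overall = normalized_overall([run1, run2])
```

Each row of `overall` sums to 1, so the diagonal entry of a row is the per-class recall pooled over all runs.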
To account for class imbalance, we computed precision, recall, and F1-score for each SOC using Equations (2)–(4).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$
where TP, FP, and FN represent true positives, false positives, and false negatives, respectively. We computed macro-averaged metrics by averaging across all SOCs using Equations (5)–(7).
$$\mathrm{MacroPrecision} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Precision}_i \tag{5}$$
$$\mathrm{MacroRecall} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Recall}_i \tag{6}$$
$$\mathrm{MacroF1} = \frac{1}{N} \sum_{i=1}^{N} F1_i \tag{7}$$
where N is the number of SOCs. We calculated the mean and standard deviation of these metrics across 20 repeated holdout evaluations. In addition, 95% confidence intervals for macro-F1 were computed using Equation (8).
$$CI_{95\%} = \bar{x} \pm 1.96 \frac{s}{\sqrt{n}} \tag{8}$$
where $\bar{x}$ is the mean macro-F1, $s$ is the standard deviation, and $n = 20$.
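The per-class, macro-averaged, and confidence-interval calculations in Equations (2)–(8) can be sketched in a few lines. The function names and toy inputs are hypothetical. Two conventions here are assumptions not stated in the text: a metric with a zero denominator is set to 0, and the standard deviation is the sample standard deviation (n − 1 in the denominator).

```python
import math
from statistics import mean, stdev

def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for one SOC (Equations (2)-(4))."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumption: 0 when undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_prf(y_true, y_pred, labels):
    """Unweighted average of the per-class metrics (Equations (5)-(7))."""
    rows = [per_class_prf(y_true, y_pred, label) for label in labels]
    return tuple(sum(row[k] for row in rows) / len(labels) for k in range(3))

def ci95(values):
    """Normal-approximation 95% confidence interval of the mean (Equation (8))."""
    half_width = 1.96 * stdev(values) / math.sqrt(len(values))
    return mean(values) - half_width, mean(values) + half_width

# Toy example with hypothetical labels; macro_f1_runs stands in for the 20 per-run values.
macro_p, macro_r, macro_f1 = macro_prf(["A", "A", "B", "B"], ["A", "B", "B", "B"], ["A", "B"])
macro_f1_runs = [0.4, 0.5, 0.6]
lower, upper = ci95(macro_f1_runs)
```

Because every SOC contributes equally to the macro average, a rare SOC with near-zero F1 lowers the macro-F1 as much as a frequent one would, which is why these metrics expose effects that accuracy hides.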

3. Results

3.1. Training-Loss-Based Parameter Selection

Figure 3 shows the results of fine-tuning parameters for the nine BERT-based models. With a fixed learning rate of 1 × 10−5 and a batch size of 16, the training LOSS values over 100 epochs were calculated for each of the nine models. The LOSS value decreases as the number of epochs increases for all models, indicating that the models learned from the training data to improve their prediction performance. Initially, the LOSS starts at a higher value, reflecting the models’ initial inaccuracies. By the 40th epoch, the LOSS values had stabilized for all nine models. Training for more epochs did not significantly improve the models, as the LOSS values did not decrease substantially. Therefore, 40 epochs were used in training the BERT-based models.
Figure 4 presents the training LOSS values at different learning rates for the nine BERT-based models with 40 epochs and a batch size of 16. Results from different learning rates were plotted as curves in different colors: blue for 1 × 10−6, orange for 1 × 10−5, green for 5 × 10−5, red for 1 × 10−4, and purple for 5 × 10−4. A learning rate of 5 × 10−5 resulted in the lowest LOSS values and was used in training all nine BERT-based models.
Figure 5 shows the batch size fine-tuning results, illustrating the training LOSS values over 40 epochs with a learning rate of 5 × 10−5. The LOSS curves for all nine BERT-based models are color-coded according to batch size: blue for 8, orange for 16, green for 32, red for 64, and purple for 128. While nearly all batch sizes achieved their lowest LOSS values by the 40th epoch, we selected a batch size of 16 as the parameter for training. This decision was based on the rapid decrease in loss values at this batch size, leading to a stable convergence earlier than the others.

3.2. Model Performance

For each of the nine datasets, 80% of the samples were randomly selected to train a BERT-based model with the selected training configuration (40 epochs, batch size 16, and a learning rate of 5 × 10−5). The remaining 20% of samples were then used to evaluate model performance. To reach a statistically consistent performance evaluation within datasets, this process was repeated 20 times. Figure 6 displays the mean accuracy and standard deviation from the 20 repeats of 20% holdout evaluations on the nine datasets. Overall, the models performed well. Further examination of the results indicates that models trained on the CADEC datasets outperformed the models constructed on the SMM4H datasets. It is important to note that model performance is evaluated under a mention-level setting using deduplicated adverse drug event strings. This setup isolates the classification task but does not reflect document-level dependencies or deployment conditions.
To evaluate model performance under class imbalance, we calculated macro-averaged precision, recall, and F1-score (Table 2). The results show a consistent decline in performance as the classification task shifts from reduced-class settings to the full SOC label space. In SMM4H, macro-F1 decreases from 0.76 (3-class) and 0.78 (6-class) to 0.45 in the full 25-class setting. A similar trend is observed in CADEC, where macro-F1 decreases from 0.94 and 0.93 to 0.74, and in the combined dataset, where it decreases from 0.85–0.86 to 0.62. Macro-precision and macro-recall exhibit similar trends, with increased variability in the full SOC setting, as indicated by larger standard deviations. The wider confidence intervals in the full setting (e.g., [0.42, 0.49] for SMM4H and [0.69, 0.77] for CADEC) further highlight the impact of class imbalance. Together, these results indicate that model performance on less frequent SOCs is substantially lower in the full classification setting, which is not captured by accuracy alone.
Figure 7 presents the normalized confusion matrices for the nine models. The x-axis represents the predicted SOC categories, while the y-axis corresponds to the actual SOC categories. The color scale indicates the percentage of instances for each prediction outcome. Diagonal elements represent correct classifications, whereas off-diagonal elements denote misclassifications. Overall, the models demonstrate strong performance, as most adverse events were accurately normalized to their correct SOC categories. A closer examination of the confusion matrices reveals that models trained on the SMM4H-3 dataset (average accuracy = 0.76) performed slightly better than those trained on SMM4H-6 (average accuracy = 0.75), both of which outperformed models built using the SMM4H dataset (average accuracy = 0.70). Similar trends were observed for the CADEC datasets (average accuracies of 0.94, 0.92, and 0.89 for CADEC-3, CADEC-6, and CADEC, respectively) and the combined datasets (average accuracies of 0.85, 0.86, and 0.82 for Both-3, Both-6, and Both, respectively).
These results highlight a clear distinction in model performance between reduced-class and full SOC settings. The full SOC models (25 classes for SMM4H, 22 for CADEC, and 25 for the combined dataset) demonstrate the feasibility of applying BERT-based models to SOC-level classification in realistic scenarios characterized by large label spaces and class imbalance. In contrast, the higher accuracies observed in the 3-class and 6-class settings reflect reduced task complexity, where classification is limited to the most frequent SOCs with more concentrated data distributions. Therefore, performance in reduced-class settings should be interpreted as reflecting simplified conditions, whereas results from the full SOC setting provide a more realistic estimate of real-world applicability. Furthermore, models trained on CADEC consistently outperformed those trained on SMM4H. This difference is likely attributable to several factors, including more samples per SOC in CADEC.
To provide a reference for model performance, we evaluated a traditional Term Frequency-Inverse Document Frequency (TF-IDF) and Logistic Regression baseline. The baseline shows lower accuracy than the BERT-based models across all datasets and class settings. In the reduced-class settings, the baseline has accuracies of 0.61 and 0.55 on SMM4H-3 and SMM4H-6, 0.83 and 0.77 on CADEC-3 and CADEC-6, and 0.74 and 0.73 on Both-3 and Both-6. In the full SOC setting, baseline performance dropped further to 0.41 on SMM4H, 0.71 on CADEC, and 0.65 on the combined dataset. In comparison, the BERT-based models have higher performance in each corresponding setting, with the largest gap observed in SMM4H. This pattern is consistent with the characteristics of the datasets. SMM4H contains shorter, noisier, and more informal social media expressions, which are difficult to represent using sparse TF-IDF features alone. In contrast, contextualized embeddings from BERT can capture word meaning in relation to surrounding tokens and are therefore better suited to variable and colloquial adverse drug event expressions. The smaller gap in CADEC may reflect its larger size and more repeated phrasing, which can be captured more effectively by traditional feature-based models. Overall, these results show that contextualized representations provide more effective modeling of adverse drug event mentions than TF-IDF features, particularly in settings with greater variation in expression and larger label spaces.
Per-class precision, recall, and F1-score across SOC categories are presented in Table 3, Table 4 and Table 5. The results show substantial variability across SOCs, particularly in the full SOC setting. Frequently represented SOCs achieve consistently high performance, whereas rare SOCs exhibit low and unstable performance, with some categories showing near-zero recall and F1-scores. This pattern reflects the long-tail distribution of SOC categories where limited training samples constrain model learning. These results complement macro-averaged metrics and demonstrate that overall performance is largely driven by frequent SOCs. Performance on rare SOCs remains a key limitation.
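The gap between overall accuracy and per-class performance can be sketched with a small example. The code below computes per-class precision, recall, and F1 together with accuracy and macro-averaged F1 in plain Python; the function name and the toy labels are hypothetical and do not reproduce the values in Tables 3 to 5.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Per-class precision, recall, and F1, plus accuracy and macro-F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(tp.values()) / len(y_true)
    macro_f1 = sum(r["f1"] for r in report.values()) / len(labels)
    return report, accuracy, macro_f1


# Hypothetical toy labels: one frequent class and one rare class that is always missed.
y_true = ["frequent"] * 9 + ["rare"]
y_pred = ["frequent"] * 10
report, accuracy, macro_f1 = per_class_report(y_true, y_pred)
```

Here accuracy stays high because the frequent class dominates, whereas macro-F1 is pulled down by the rare class with zero recall. This is the same effect that makes per-class reporting necessary in the full SOC setting.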

4. Discussion

This study demonstrates that supervised fine-tuned BERT-based models can achieve strong performance for SOC-level classification of adverse drug event expressions from social media under the evaluated experimental setting. While prior studies have explored a variety of methods for adverse drug event normalization, including rule-based pipelines, word-embedding approaches, semantic similarity techniques, and more recent deep learning and transformer-based frameworks [44,45,46,47,48,49], direct comparisons are limited by differences in datasets, label spaces, task definitions, and evaluation protocols. In particular, many previous studies focus on fine-grained normalization to MedDRA PTs or LLTs, whereas the present study evaluates a coarser SOC-level classification task. Therefore, the results reported here should be interpreted within the scope of the current study rather than as evidence of superiority over existing approaches.
Although the models show consistent performance across SMM4H, CADEC, and the combined dataset, these results reflect learning within the evaluated datasets and do not directly establish cross-domain generalization. Training and testing within the same dataset or pooled datasets do not fully capture variability across platforms, annotation styles, or real-world deployment conditions. More rigorous evaluation strategies, such as training on one dataset and testing on another, are needed to assess robustness across domains and should be explored in future work.
Our study highlights ongoing challenges in adverse drug event normalization. Like prior work, our models were affected by the long-tail distribution of adverse drug event mentions across SOCs. Several SOC categories in existing corpora contain very few examples, limiting the model’s ability to learn discriminative features for rare classes. This issue has been noted in earlier normalization studies, particularly those using social media data, where many adverse drug event categories are sparsely represented. This effect is also evident in our results. The higher accuracy observed in the 3-class and 6-class settings is largely due to reduced task complexity, as these settings focus on the most frequent SOCs. While these findings demonstrate model performance under simplified conditions, they do not fully reflect performance in the full SOC classification task. In contrast, the full SOC setting includes a larger number of classes and substantial imbalance, making it both more difficult and more representative of real-world scenarios. In addition, differences between datasets further influence model performance. The combined dataset integrates SOC annotations from two sources with different distributions and coverage. Although SOC labels were standardized using the same MedDRA version, differences in dataset composition and class distribution may contribute to variation in model performance and confusion patterns. These factors should be considered when interpreting results derived from the combined dataset. Expanding annotated datasets—either through new corpus development or through augmentation and active learning—will be essential to boosting performance, especially for the full 25-SOC classification.
Another challenge is the diversity and rapid evolution of language across platforms. While prior studies have evaluated methods across sources such as X, online forums, and Reddit, cross-platform generalizability remains limited. Future work should consider continued pretraining on recent data or continual learning to mitigate performance degradation as linguistic patterns shift over time.
While the present study evaluates performance using deduplicated mention-level inputs, the results do not establish document-level or deployment-level generalization. Although exact duplicate strings were removed prior to splitting, near-duplicate expressions, shared linguistic patterns, and contextual dependencies within posts or users may still introduce implicit leakage, particularly in datasets such as CADEC. Therefore, the reported results should be interpreted as a mention-level semantic baseline under controlled conditions rather than a fully leakage-resistant evaluation. More rigorous strategies, such as grouped splitting at the post-, review-, or user-level, are needed to better assess real-world generalization and should be explored in future work. Such evaluations may result in lower performance, but would provide a more realistic estimate of model robustness in practical pharmacovigilance applications.
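A grouped split of the kind suggested above can be sketched as follows. This is a minimal illustration in plain Python, not the splitting procedure used in this study; the record fields (`post_id`, `mention`, `soc`), the placeholder label, and the function name are hypothetical.

```python
import random


def grouped_holdout(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that every group lands entirely in train or in test."""
    groups = sorted({rec[group_key] for rec in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # The fraction applies to groups, so the mention-level ratio may differ.
    n_test = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test])
    train = [rec for rec in records if rec[group_key] not in test_groups]
    test = [rec for rec in records if rec[group_key] in test_groups]
    return train, test


# Hypothetical records: five posts with two mentions each and a placeholder label.
records = [
    {"post_id": f"post{i}", "mention": f"mention {i}-{j}", "soc": "SOC_A"}
    for i in range(5)
    for j in range(2)
]
train, test = grouped_holdout(records, group_key="post_id", test_fraction=0.2)
train_posts = {rec["post_id"] for rec in train}
test_posts = {rec["post_id"] for rec in test}
```

Because whole posts are assigned to one side, mentions that share context within a post cannot appear in both partitions. The same function could be keyed on a review or user identifier where such metadata are available.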
This study evaluates SOC-level classification under the assumption that adverse drug event mentions are correctly identified. In practical pharmacovigilance pipelines, adverse drug event detection from raw social media posts is an upstream step that may introduce errors. These errors can propagate to the normalization stage and reduce overall system performance. Therefore, the reported accuracies represent performance under idealized conditions and should be interpreted as an upper bound for the normalization task rather than end-to-end pipeline performance.
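The effect of upstream errors can be made concrete with a small calculation. The sketch below assumes that a missed mention can never be normalized correctly and that accuracy on detected mentions equals the accuracy measured on correctly identified mentions; the function name and both rates are hypothetical and are not results from this study.

```python
def end_to_end_correct_fraction(detection_recall, normalization_accuracy):
    """Fraction of true ADE mentions that are both detected and correctly normalized.

    Assumes a missed mention is never normalized correctly and that the
    normalization accuracy on detected mentions matches the supplied value.
    """
    if not (0.0 <= detection_recall <= 1.0 and 0.0 <= normalization_accuracy <= 1.0):
        raise ValueError("both rates must lie in [0, 1]")
    return detection_recall * normalization_accuracy


# Hypothetical rates chosen only for illustration.
coverage = end_to_end_correct_fraction(detection_recall=0.80, normalization_accuracy=0.90)
```

Under these assumptions, a detector that recovers 80% of mentions combined with a normalizer that is 90% accurate yields correct SOC labels for about 72% of the true mentions, which is lower than either component figure alone.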
Hyperparameters were selected based on training loss convergence rather than validation-based optimization. Therefore, the reported configuration should be interpreted as a controlled baseline setting rather than a performance-optimized model. Future work incorporating validation-based tuning and early stopping may improve generalization.
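Validation-based early stopping, mentioned above as future work, can be sketched as a simple patience rule over a validation-loss curve. The function name, the patience value, and the loss values below are hypothetical; this is not the procedure used to select the epochs reported in this study.

```python
def select_epoch_with_early_stopping(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 1-indexed, for a validation-loss curve."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            # Validation loss improved: remember this epoch as the best so far.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop training.
            return best_epoch, epoch
    return best_epoch, len(val_losses)


# Hypothetical validation losses that start rising after the fourth epoch.
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.44, 0.47, 0.50]
best_epoch, stop_epoch = select_epoch_with_early_stopping(val_losses, patience=3)
```

In this toy curve training would stop at epoch 7 and the checkpoint from epoch 4 would be kept. Training loss alone would keep decreasing and could not signal this point.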
While our study demonstrates that supervised fine-tuning of a general-domain pretrained model can achieve strong performance for adverse drug event normalization, the use of domain-specific pretrained models may further enhance results. Models such as BERTweet, BioBERT, and PubMedBERT are pretrained on social media or biomedical corpora and may better capture domain-specific linguistic patterns, terminology, and contextual nuances. In addition, domain-adaptive pretraining on large-scale unlabeled health-related social media data could further improve model representations by aligning them more closely with the target domain. Such approaches may be particularly beneficial for handling informal expressions, misspellings, and emerging terminology commonly observed in user-generated content. Future work will explore these strategies to assess their potential for improving performance and robustness in real-world applications.
Finally, although SOC-level normalization is a crucial step toward making social media adverse drug event data analyzable for pharmacovigilance, more granular MedDRA mapping (e.g., to PTs or LLTs) remains a key unmet need. Mapping to PT/LLT levels introduces additional challenges, including synonymy (multiple expressions referring to the same concept), polysemy (ambiguous expressions with multiple meanings), and the multi-axial structure of MedDRA, where a single concept may belong to multiple hierarchical categories. In addition, some adverse event mentions may correspond to multiple plausible PTs depending on context, requiring more precise disambiguation. Despite these challenges, SOC-level normalization provides a practical and scalable approach for aggregating adverse events into SOCs, enabling downstream analyses such as frequency comparison and trend analysis across heterogeneous data sources. While recent developments in multi-level classification offer promising directions, robust solutions for fine-grained normalization, particularly in the context of noisy social media data, remain limited.

5. Conclusions

This study demonstrates that BERT-based models can effectively perform System Organ Class (SOC)-level classification of adverse drug event (ADE) mentions from social media under a supervised, mention-level setting. Across multiple datasets and class configurations, the models show strong performance within the evaluated experimental framework, highlighting the ability of transformer-based approaches to handle linguistic variability in informal user-generated content.
However, the findings should be interpreted within the scope of the current study. The evaluation assumes correctly identified ADE mentions and is conducted at the mention level using deduplicated inputs. As a result, the reported performance represents an upper bound for the classification task and does not reflect end-to-end pharmacovigilance performance or deployment conditions. In addition, the results do not establish cross-domain generalization, and further evaluation using leakage-resistant splitting strategies and cross-dataset validation is needed.
Overall, this work provides a systematic assessment of transformer-based models for SOC-level classification and supports their use as a component within broader pharmacovigilance workflows. Future work should focus on improving performance for rare SOC categories, incorporating domain-specific pretraining, and evaluating robustness under more realistic and application-oriented settings.

Author Contributions

Conceptualization, F.D. and H.H.; methodology, F.D., W.G., J.L., A.V.; writing—original draft preparation, F.D., W.G., H.H.; writing—review and editing, H.H., W.T., T.A.P. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the US Food and Drug Administration (FDA).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

This research was supported in part by an appointment to the Research Participation Program at the National Center for Toxicological Research (Ann Varghese), administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and the U.S. Food and Drug Administration.

Conflicts of Interest

The authors declare no conflicts of interest. This manuscript reflects the views of its authors and does not necessarily reflect those of the U.S. Food and Drug Administration. Any mention of commercial products is for clarification only and is not intended as approval, endorsement, or recommendation.

References

  1. Chokkakula, S.; Yang, H.; Al-Masri, A.A.; Zhang, Y.; Naveen, B.; Yang, B. The post-marketing safety of venlafaxine: A real-world two-decade pharmacovigilance study using the FAERS database. Front. Pharmacol. 2026, 17, 1737113.
  2. Zhou, C.; Peng, S.; Lin, A.; Jiang, A.; Peng, Y.; Gu, T.; Liu, Z.; Cheng, Q.; Zhang, J.; Luo, P. Psychiatric disorders associated with immune checkpoint inhibitors: A pharmacovigilance analysis of the FDA Adverse Event Reporting System (FAERS) database. eClinicalMedicine 2023, 59, 101967.
  3. Guo, W.; Pan, B.; Sakkiah, S.; Ji, Z.; Yavas, G.; Lu, Y.; Komatsu, T.E.; Lal-Nag, M.; Tong, W.; Patterson, T.A.; et al. Informing selection of drugs for COVID-19 treatment through adverse events analysis. Sci. Rep. 2021, 11, 14022.
  4. Wessel, D.; Pogrebnyakov, N. Using Social Media as a Source of Real-World Data for Pharmaceutical Drug Development and Regulatory Decision Making. Drug Saf. 2024, 47, 495–511.
  5. Golder, S.; O’Connor, K.; Wang, Y.; Klein, A.; Gonzalez Hernandez, G. The Value of Social Media Analysis for Adverse Events Detection and Pharmacovigilance: Scoping Review. JMIR Public Health Surveill. 2024, 10, e59167.
  6. Golder, S.; O’Connor, K.; Wang, Y.; Gonzalez Hernandez, G. The Role of Social Media for Identifying Adverse Drug Events Data in Pharmacovigilance: Protocol for a Scoping Review. JMIR Res. Protoc. 2023, 12, e47068.
  7. Lee, J.Y.; Lee, Y.S.; Kim, D.H.; Lee, H.S.; Yang, B.R.; Kim, M.G. The Use of Social Media in Detecting Drug Safety-Related New Black Box Warnings, Labeling Changes, or Withdrawals: Scoping Review. JMIR Public Health Surveill. 2021, 7, e30137.
  8. Golder, S.; Smith, K.; O’Connor, K.; Gross, R.; Hennessy, S.; Gonzalez-Hernandez, G. A Comparative View of Reported Adverse Effects of Statins in Social Media, Regulatory Data, Drug Information Databases and Systematic Reviews. Drug Saf. 2021, 44, 167–179.
  9. Dong, F.; Guo, W.; Liu, J.; Patterson, T.A.; Hong, H. Pharmacovigilance in the digital age: Gaining insight from social media data. Exp. Biol. Med. 2025, 250, 10555.
  10. Smith, K.; Golder, S.; Sarker, A.; Loke, Y.; O’Connor, K.; Gonzalez-Hernandez, G. Methods to Compare Adverse Events in Twitter to FAERS, Drug Information Databases, and Systematic Reviews: Proof of Concept with Adalimumab. Drug Saf. 2018, 41, 1397–1410.
  11. Zhang, J.; Wang, X.; Zhou, Y. Comparative analysis of semaglutide induced adverse reactions: Insights from FAERS database and social media reviews with a focus on oral vs subcutaneous administration. Front. Pharmacol. 2024, 15, 1471615.
  12. Sarker, A.; Ginn, R.; Nikfarjam, A.; O’Connor, K.; Smith, K.; Jayaraman, S.; Upadhaya, T.; Gonzalez, G. Utilizing social media data for pharmacovigilance: A review. J. Biomed. Inform. 2015, 54, 202–212.
  13. MacKinlay, A.; Aamer, H.; Yepes, A.J. Detection of Adverse Drug Reactions using Medical Named Entities on Twitter. AMIA Annu. Symp. Proc. 2017, 2017, 1215–1224.
  14. Li, Y.; Jimeno Yepes, A.; Xiao, C. Combining Social Media and FDA Adverse Event Reporting System to Detect Adverse Drug Reactions. Drug Saf. 2020, 43, 893–903.
  15. Oyebode, O.; Orji, R. Identifying adverse drug reactions from patient reviews on social media using natural language processing. Health Inform. J. 2023, 29, 14604582221136712.
  16. Khademi Habibabadi, S.; Delir Haghighi, P.; Burstein, F.; Buttery, J. Vaccine Adverse Event Mining of Twitter Conversations: 2-Phase Classification Study. JMIR Med. Inform. 2022, 10, e34305.
  17. Terry, K.; Yang, F.; Yao, Q.; Liu, C. The role of social media in public health crises caused by infectious disease: A scoping review. BMJ Glob. Health 2023, 8, e013515.
  18. Schellack, N.; Strydom, M.; Pepper, M.S.; Herd, C.L.; Hendricks, C.L.; Bronkhorst, E.; Meyer, J.C.; Padayachee, N.; Bangalee, V.; Truter, I.; et al. Social Media and COVID-19-Perceptions and Public Deceptions of Ivermectin, Colchicine and Hydroxychloroquine: Lessons for Future Pandemics. Antibiotics 2022, 11, 445.
  19. Hussain, Z.; Sheikh, Z.; Tahir, A.; Dashtipour, K.; Gogate, M.; Sheikh, A.; Hussain, A. Artificial Intelligence-Enabled Social Media Analysis for Pharmacovigilance of COVID-19 Vaccinations in the United Kingdom: Observational Study. JMIR Public Health Surveill. 2022, 8, e32543.
  20. Daluwatte, C.; Khromava, A.; Chen, Y.; Serradell, L.; Chabanon, A.L.; Chan-Ou-Teung, A.; Molony, C.; Juhaeri, J. Application of a Language Model Tool for COVID-19 Vaccine Adverse Event Monitoring Using Web and Social Media Content: Algorithm Development and Validation Study. JMIR Infodemiology 2024, 4, e53424.
  21. Dong, F.; Guo, W.; Liu, J.; Patterson, T.A.; Hong, H. BERT-based language model for accurate drug adverse event extraction from social media: Implementation, evaluation, and contributions to pharmacovigilance practices. Front. Public Health 2024, 12, 1392180.
  22. Dai, X.; Karimi, S.; Sarker, A.; Hachey, B.; Paris, C. MultiADE: A Multi-domain benchmark for Adverse Drug Event extraction. J. Biomed. Inform. 2024, 160, 104744.
  23. Li, Y.; Li, J.; He, J.; Tao, C. AE-GPT: Using Large Language Models to extract adverse events from surveillance reports—A use case with influenza vaccine adverse events. PLoS ONE 2024, 19, e0300919.
  24. Tiftikci, M.; Özgür, A.; He, Y.; Hur, J. Machine learning-based identification and rule-based normalization of adverse drug reactions in drug labels. BMC Bioinform. 2019, 20, 707.
  25. Sloane, R.; Osanlou, O.; Lewis, D.; Bollegala, D.; Maskell, S.; Pirmohamed, M. Social media and pharmacovigilance: A review of the opportunities and challenges. Br. J. Clin. Pharmacol. 2015, 80, 910–920.
  26. Yu, D.; Vydiswaran, V.G.V. An Assessment of Mentions of Adverse Drug Events on Social Media with Natural Language Processing: Model Development and Analysis. JMIR Med. Inform. 2022, 10, e38140.
  27. Miftahutdinov, Z.; Kadurin, A.; Kudrin, R.; Tutubalina, E. Medical concept normalization in clinical trials with drug and disease representation learning. Bioinformatics 2021, 37, 3856–3864.
  28. Pappa, D.; Stergioulas, L. Harnessing social media data for pharmacovigilance: A review of current state of the art, challenges and future directions. Int. J. Data Sci. Anal. 2019, 8, 113–135.
  29. Audeh, B.; Bellet, F.; Beyens, M.N.; Lillo-Le Louët, A.; Bousquet, C. Use of Social Media for Pharmacovigilance Activities: Key Findings and Recommendations from the Vigi4Med Project. Drug Saf. 2020, 43, 835–851.
  30. Pérez-Pérez, M.; Igrejas, G.; Fdez-Riverola, F.; Lourenço, A. A framework to extract biomedical knowledge from gluten-related tweets: The case of dietary concerns in digital era. Artif. Intell. Med. 2021, 118, 102131.
  31. Fisher, A.; Young, M.M.; Payer, D.; Pacheco, K.; Dubeau, C.; Mago, V. Automating Detection of Drug-Related Harms on Social Media: Machine Learning Framework. J. Med. Internet Res. 2023, 25, e43630.
  32. Rezaei, Z.; Ebrahimpour-Komleh, H.; Eslami, B.; Chavoshinejad, R.; Totonchi, M. Adverse Drug Reaction Detection in Social Media by Deep Learning Methods. Cell J. 2020, 22, 319–324.
  33. Murphy, R.M.; Klopotowska, J.E.; de Keizer, N.F.; Jager, K.J.; Leopold, J.H.; Dongelmans, D.A.; Abu-Hanna, A.; Schut, M.C. Adverse drug event detection using natural language processing: A scoping review of supervised learning methods. PLoS ONE 2023, 18, e0279842.
  34. Brown, E.G.; Wood, L.; Wood, S. The medical dictionary for regulatory activities (MedDRA). Drug Saf. 1999, 20, 109–117.
  35. Große-Michaelis, I.; Proestel, S.; Rao, R.M.; Dillman, B.S.; Bader-Weder, S.; Macdonald, L.; Gregory, W. MedDRA Labeling Groupings to Improve Safety Communication in Product Labels. Ther. Innov. Regul. Sci. 2023, 57, 1–6.
  36. Kralova, K.; Wilson, C.A.; Richebourg, N.; D’Souza, J. Quality of MedDRA® Coding in a Sample of COVID-19 Vaccine Medication Error Data. Drug Saf. 2023, 46, 501–507.
  37. Revers, A.; Hof, M.H.; Zwinderman, A.H. BAHAMA: A Bayesian Hierarchical Model for the Detection of MedDRA®-Coded Adverse Events in Randomized Controlled Trials. Drug Saf. 2022, 45, 961–970.
  38. Chan, E.; Small, S.S.; Wickham, M.E.; Cheng, V.; Balka, E.; Hohl, C.M. The Utility of Different Data Standards to Document Adverse Drug Event Symptoms and Diagnoses: Mixed Methods Study. J. Med. Internet Res. 2021, 23, e27188.
  39. Narayanan, S.; Mannam, K.; Achan, P.; Ramesh, M.V.; Rangan, P.V.; Rajan, S.P. A contextual multi-task neural approach to medication and adverse events identification from clinical text. J. Biomed. Inform. 2022, 125, 103960.
  40. Li, Y.; Tao, W.; Li, Z.; Sun, Z.; Li, F.; Fenton, S.; Xu, H.; Tao, C. Artificial intelligence-powered pharmacovigilance: A review of machine and deep learning in clinical text-based adverse drug event detection for benchmark datasets. J. Biomed. Inform. 2024, 152, 104621.
  41. Kim, S.; Kang, T.; Chung, T.K.; Choi, Y.; Hong, Y.; Jung, K.; Lee, H. Automatic Extraction of Comprehensive Drug Safety Information from Adverse Drug Event Narratives in the Korea Adverse Event Reporting System Using Natural Language Processing Techniques. Drug Saf. 2023, 46, 781–795.
  42. Zitu, M.M.; Zhang, S.; Owen, D.H.; Chiang, C.; Li, L. Generalizability of machine learning methods in detecting adverse drug events from clinical narratives in electronic medical records. Front. Pharmacol. 2023, 14, 1218679.
  43. Guan, H.; Devarakonda, M. Leveraging Contextual Information in Extracting Long Distance Relations from Clinical Notes. AMIA Annu. Symp. Proc. 2019, 2019, 1051–1060.
  44. Karapetiantz, P.; Audeh, B.; Redjdal, A.; Tiffet, T.; Bousquet, C.; Jaulent, M.C. Monitoring Adverse Drug Events in Web Forums: Evaluation of a Pipeline and Use Case Study. J. Med. Internet Res. 2024, 26, e46176.
  45. Magge, A.; Tutubalina, E.; Miftahutdinov, Z.; Alimova, I.; Dirkson, A.; Verberne, S.; Weissenbacher, D.; Gonzalez-Hernandez, G. DeepADEMiner: A deep learning pharmacovigilance pipeline for extraction and normalization of adverse drug event mentions on Twitter. J. Am. Med. Inform. Assoc. 2021, 28, 2184–2192.
  46. Remy, F.; Scaboro, S.; Portelli, B. Boosting Adverse Drug Event Normalization on Social Media: General-Purpose Model Initialization and Biomedical Semantic Text Similarity Benefit Zero-Shot Linking in Informal Contexts. arXiv 2023, arXiv:2308.00157.
  47. Zhai, Y.; Bao, X.; Chersoni, E.; Portelli, B.; Gu, J.; Huang, C.-R. PolyuCBS at SMM4H 2024: LLM-based Medical Disorder and Adverse Drug Event Detection with Low-rank Adaptation. In Proceedings of the 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks, Bangkok, Thailand, 15 August 2024; pp. 74–78.
  48. Yazdani, A.; Rouhizadeh, H.; Bornet, A.; Teodoro, D. CONORM: Context-Aware Entity Normalization for Adverse Drug Event Detection. medRxiv 2023.
  49. Elbiach, O.; Grissette, H.; Nfaoui, E.H. Leveraging Transformer Models for Enhanced Pharmacovigilance: A Comparative Analysis of ADR Extraction from Biomedical and Social Media Texts. AI 2025, 6, 31.
  50. Magge, A.; Klein, A.; Miranda-Escalada, A.; Ali Al-Garadi, M.; Alimova, I.; Miftahutdinov, Z.; Farre, E.; Lima López, S.; Flores, I.; O’Connor, K.; et al. Overview of the Sixth Social Media Mining for Health Applications (#SMM4H) Shared Tasks at NAACL 2021. In Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task, Mexico City, Mexico, 10 June 2021; pp. 21–32.
  51. Karimi, S.; Metke-Jimenez, A.; Kemp, M.; Wang, C. Cadec: A corpus of adverse drug event annotations. J. Biomed. Inform. 2015, 55, 73–81.
Figure 1. Workflow for model training, parameter selection, and evaluation across multiple datasets. The workflow illustrates the repeated data-splitting and model training procedure applied to the SMM4H, CADEC, and combined datasets (“Both”) with varying annotation levels (3-class, 6-class, or full versions). Each iteration begins with a random split of the full dataset into 80% training and 20% testing partitions. The training portion is further divided into an internal 80/20 split for training-loss-based parameter selection. All candidate parameter sets were evaluated to select a training-loss-based parameter setting. The training-loss-based parameter configuration was then used to generate a BERT model for predicting the held-out test set. This process was repeated 20 times to obtain stable performance estimation across random data splits, with aggregated prediction results summarized at the end.
Figure 2. The schema of the BERT-based pipeline for normalizing adverse drug events to SOC categories. The schema illustrates the end-to-end process for training a BERT model to assign MedDRA System Organ Class (SOC) labels to adverse drug events. Adverse events are first tokenized into subword units and wrapped with the special tokens [CLS] and [SEP]. Tokens are transformed into 768-dimensional embeddings and passed through a pretrained BERT model. The resulting contextualized representation is used to predict an SOC label, which is compared with the true SOC to compute a training loss. Gradients are backpropagated for weight updates, and the process iterates until convergence. The pipeline enables fine-tuning of BERT for accurate classification of adverse drug events.
Figure 3. Training LOSS values across 100 epochs for the nine BERT-based models. The figure shows the training LOSS values for the nine dataset–annotation combinations described in Figure 1: SMM4H, CADEC, and the combined dataset (“Both”), each prepared in the 3-class (–3), 6-class (–6), or full versions. For each model, LOSS values were recorded across 100 training epochs using a fixed learning rate of 1 × 10⁻⁵ and a batch size of 16. LOSS curves showed the model training process and were used to select the number of epochs applied in later analyses. An inset “Epochs 0–16” was used to show the early training phase.
Figure 4. Training LOSS values across learning rates for nine BERT-based models. The figure presents the training LOSS values for the nine dataset–annotation combinations described in Figure 1. Each model was trained for 40 epochs with a batch size of 16. Five learning rates were evaluated: 1 × 10⁻⁶ (blue), 1 × 10⁻⁵ (orange), 5 × 10⁻⁵ (green), 1 × 10⁻⁴ (red), and 5 × 10⁻⁴ (purple). The LOSS curves were used to identify the learning rate applied in subsequent BERT-based model training.
Figure 5. Training LOSS values across batch sizes for nine BERT-based models. The figure presents the training LOSS values for the nine dataset–annotation combinations defined in Figure 1. Each model was trained for 40 epochs using a learning rate of 5 × 10⁻⁵. Five batch sizes were evaluated and are shown in different colors: 8 (blue), 16 (orange), 32 (green), 64 (red), and 128 (purple). These LOSS curves were used to select the batch size applied in later BERT-based model training.
Figure 6. Model performance for holdout evaluations for nine BERT-based models. The figure summarizes model performance for the nine dataset–annotation combinations introduced in Figure 1. For each dataset, 80% of the samples were randomly selected to train a BERT-based model using the selected parameters (40 epochs, batch size 16, and learning rate 5 × 10⁻⁵), and the remaining 20% were used for testing. This process was repeated 20 times to obtain reliable performance estimates. The bars represent the mean accuracy, with error bars indicating the standard deviation.
Figure 7. Normalized confusion matrices for nine BERT-based models. The figure displays the normalized confusion matrices for the nine dataset–annotation combinations defined in Figure 1. For each model, the x-axis represents the predicted SOC categories, and the y-axis represents the true SOC categories. Each cell value represents the percentage of samples belonging to a given true SOC category that were assigned to each predicted category. The color scale represents these percentage values, ranging from white (0%) to blue (100%). Diagonal cells correspond to correct classifications, while off-diagonal cells represent misclassifications. The matrices were generated by training each model on 80% of the data and evaluating on the remaining 20%, and the results were averaged across 20 repeated holdout evaluations.
Table 1. SOC codes and detailed names ordered by decreasing frequency of annotated SOCs for adverse drug events in the SMM4H, CADEC, and combined data.

| SOC Code | SOC Name | ADEs in SMM4H | ADEs in CADEC | ADEs in Both |
|---|---|---|---|---|
| 10018065 | General disorders and administration site conditions | 463 | 1295 | 1758 |
| 10037175 | Psychiatric disorders | 421 | 775 | 1196 |
| 10029205 | Nervous system disorders | 316 | 471 | 787 |
| 10017947 | Gastrointestinal disorders | 89 | 627 | 716 |
| 10028395 | Musculoskeletal and connective tissue disorders | 87 | 1685 | 1772 |
| 10022891 | Investigations | 86 | 100 | 186 |
| 10027433 | Metabolism and nutrition disorders | 58 | 7 | 65 |
| 10040785 | Skin and subcutaneous tissue disorders | 43 | 284 | 327 |
| 10021428 | Immune system disorders | 30 | 5 | 35 |
| 10038738 | Respiratory, thoracic and mediastinal disorders | 22 | 144 | 166 |
| 10022117 | Injury, poisoning and procedural complications | 19 | 44 | 63 |
| 10015919 | Eye disorders | 17 | 97 | 114 |
| 10038604 | Reproductive system and breast disorders | 15 | 86 | 101 |
| 10047065 | Vascular disorders | 10 | 42 | 52 |
| 10041244 | Social circumstances | 8 | 10 | 18 |
| 10007541 | Cardiac disorders | 7 | 166 | 173 |
| 10038359 | Renal and urinary disorders | 6 | 63 | 69 |
| 10021881 | Infections and infestations | 5 | 6 | 11 |
| 10013993 | Ear and labyrinth disorders | 5 | 30 | 35 |
| 10019805 | Hepatobiliary disorders | 2 | 17 | 19 |
| 10029104 | Neoplasms benign, malignant and unspecified (incl cysts and polyps) | 2 | 2 | 4 |
| 10042613 | Surgical and medical procedures | 2 | 0 | 2 |
| 10077536 | Product issues | 1 | 0 | 1 |
| 10014698 | Endocrine disorders | 1 | 3 | 4 |
| 10010331 | Congenital, familial and genetic disorders | 1 | 0 | 1 |
Table 2. Performance of BERT-based models across datasets and class settings. Values are reported as mean ± standard deviation over 20 repeated holdout evaluations. Confidence intervals (95%) are reported for macro-F1.

| Model | Macro-Precision | Macro-Recall | Macro-F1 | 95% CI (Macro-F1) |
|---|---|---|---|---|
| SMM4H-3 | 0.76 ± 0.04 | 0.76 ± 0.03 | 0.76 ± 0.04 | [0.74, 0.78] |
| SMM4H-6 | 0.78 ± 0.04 | 0.77 ± 0.04 | 0.77 ± 0.03 | [0.76, 0.79] |
| SMM4H | 0.45 ± 0.07 | 0.47 ± 0.08 | 0.45 ± 0.07 | [0.42, 0.49] |
| CADEC-3 | 0.94 ± 0.02 | 0.94 ± 0.01 | 0.94 ± 0.01 | [0.93, 0.95] |
| CADEC-6 | 0.93 ± 0.01 | 0.92 ± 0.01 | 0.92 ± 0.01 | [0.92, 0.93] |
| CADEC | 0.74 ± 0.08 | 0.73 ± 0.09 | 0.73 ± 0.08 | [0.69, 0.77] |
| Both-3 | 0.85 ± 0.01 | 0.84 ± 0.01 | 0.84 ± 0.01 | [0.84, 0.85] |
| Both-6 | 0.86 ± 0.02 | 0.86 ± 0.01 | 0.86 ± 0.01 | [0.85, 0.87] |
| Both | 0.62 ± 0.04 | 0.59 ± 0.03 | 0.60 ± 0.03 | [0.58, 0.61] |
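
This section does not state how the 95% confidence intervals for macro-F1 were computed. One common choice for a mean over 20 repeats is a Student-t interval, sketched below as an assumption rather than as the authors' method. The function names are hypothetical, and the critical value 2.093 applies only to 19 degrees of freedom (n = 20).

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 19 degrees of freedom (n = 20).
T_CRIT_95_DF19 = 2.093


def ci_from_summary(mean, sd, n, t_crit=T_CRIT_95_DF19):
    """Confidence interval for a mean from its summary statistics.

    sd is the sample standard deviation of the n per-repeat scores.
    """
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


def ci_from_scores(scores, t_crit=T_CRIT_95_DF19):
    """Same interval, computed directly from the per-repeat scores.
    The default t_crit is only valid when len(scores) == 20."""
    return ci_from_summary(statistics.mean(scores),
                           statistics.stdev(scores),
                           len(scores), t_crit)


low, high = ci_from_summary(0.76, 0.04, 20)
print(f"[{low:.2f}, {high:.2f}]")  # prints [0.74, 0.78]
```

Applied to the SMM4H-3 row (0.76 ± 0.04), this gives approximately [0.74, 0.78], which agrees with the tabulated interval. Because the tabulated means and standard deviations are rounded, agreement on a single row does not confirm which method was actually used.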
Table 3. Per-class recall over 20 repeated holdout evaluations. Values are reported as mean ± standard deviation.

| SOC | Support (ADEs) | SMM4H-3 | CADEC-3 | Both-3 | SMM4H-6 | CADEC-6 | Both-6 | SMM4H | CADEC | Both |
|---|---|---|---|---|---|---|---|---|---|---|
| 10037175 | 1196 | 0.79 ± 0.06 | 0.94 ± 0.02 | 0.84 ± 0.03 | 0.75 ± 0.06 | 0.91 ± 0.03 | 0.82 ± 0.04 | 0.75 ± 0.07 | 0.91 ± 0.04 | 0.80 ± 0.03 |
| 10018065 | 1758 | 0.74 ± 0.08 | 0.95 ± 0.02 | 0.90 ± 0.03 | 0.72 ± 0.08 | 0.88 ± 0.02 | 0.84 ± 0.03 | 0.71 ± 0.10 | 0.87 ± 0.02 | 0.81 ± 0.03 |
| 10029205 | 787 | 0.76 ± 0.06 | 0.93 ± 0.04 | 0.78 ± 0.04 | 0.74 ± 0.06 | 0.91 ± 0.04 | 0.75 ± 0.04 | 0.73 ± 0.07 | 0.89 ± 0.04 | 0.74 ± 0.03 |
| 10017947 | 716 | | | | 0.78 ± 0.11 | 0.90 ± 0.04 | 0.87 ± 0.03 | 0.71 ± 0.16 | 0.87 ± 0.05 | 0.84 ± 0.04 |
| 10028395 | 1772 | | | | 0.84 ± 0.11 | 0.95 ± 0.02 | 0.95 ± 0.02 | 0.78 ± 0.12 | 0.94 ± 0.02 | 0.93 ± 0.01 |
| 10022891 | 186 | | | | 0.82 ± 0.13 | 0.98 ± 0.04 | 0.93 ± 0.05 | 0.78 ± 0.14 | 0.89 ± 0.09 | 0.84 ± 0.08 |
| 10027433 | 65 | | | | | | | 0.73 ± 0.11 | 0.85 ± 0.37 | 0.59 ± 0.16 |
| 10040785 | 327 | | | | | | | 0.72 ± 0.21 | 0.89 ± 0.04 | 0.87 ± 0.04 |
| 10038738 | 166 | | | | | | | 0.89 ± 0.15 | 0.93 ± 0.07 | 0.92 ± 0.08 |
| 10022117 | 63 | | | | | | | 0.77 ± 0.27 | 0.65 ± 0.20 | 0.70 ± 0.11 |
| 10015919 | 114 | | | | | | | 0.68 ± 0.20 | 0.90 ± 0.07 | 0.87 ± 0.07 |
| 10038604 | 101 | | | | | | | 0.28 ± 0.30 | 0.85 ± 0.12 | 0.79 ± 0.11 |
| 10047065 | 52 | | | | | | | 0.10 ± 0.21 | 0.68 ± 0.20 | 0.59 ± 0.19 |
| 10021428 | 35 | | | | | | | 0.72 ± 0.34 | 0.15 ± 0.37 | 0.55 ± 0.27 |
| 10041244 | 18 | | | | | | | 0.10 ± 0.31 | 0.45 ± 0.51 | 0.30 ± 0.38 |
| 10007541 | 173 | | | | | | | 0.40 ± 0.50 | 0.79 ± 0.10 | 0.75 ± 0.09 |
| 10038359 | 69 | | | | | | | 0.50 ± 0.51 | 0.92 ± 0.11 | 0.83 ± 0.12 |
| 10021881 | 11 | | | | | | | 0.20 ± 0.41 | 0.35 ± 0.49 | 0.25 ± 0.26 |
| 10013993 | 35 | | | | | | | 0.05 ± 0.22 | 0.72 ± 0.20 | 0.62 ± 0.17 |
| 10019805 | 19 | | | | | | | 1.00 ± 0.00 | 0.83 ± 0.23 | 0.88 ± 0.13 |
| 10042613 | 2 | | | | | | | 0.05 ± 0.22 | 0.10 ± 0.31 | 0 ± 0 |
| 10029104 | 4 | | | | | | | 0 ± 0 | 0.75 ± 0.44 | 0 ± 0 |
| 10077536 | 1 | | | | | | | 0 ± 0 | | 0 ± 0 |
| 10010331 | 1 | | | | | | | 0 ± 0 | | 0 ± 0 |
| 10014698 | 4 | | | | | | | 0 ± 0 | | 0.38 ± 0.22 |
Table 4. Per-class precision over 20 repeated holdout evaluations. Values are reported as mean ± standard deviation.

| SOC | Support (ADEs) | SMM4H-3 | CADEC-3 | Both-3 | SMM4H-6 | CADEC-6 | Both-6 | SMM4H | CADEC | Both |
|---|---|---|---|---|---|---|---|---|---|---|
| 10037175 | 1196 | 0.78 ± 0.05 | 0.93 ± 0.03 | 0.85 ± 0.03 | 0.78 ± 0.05 | 0.92 ± 0.03 | 0.84 ± 0.03 | 0.77 ± 0.07 | 0.91 ± 0.04 | 0.81 ± 0.03 |
| 10018065 | 1758 | 0.76 ± 0.05 | 0.95 ± 0.01 | 0.88 ± 0.02 | 0.71 ± 0.05 | 0.89 ± 0.03 | 0.84 ± 0.02 | 0.69 ± 0.06 | 0.88 ± 0.04 | 0.81 ± 0.03 |
| 10029205 | 787 | 0.75 ± 0.07 | 0.94 ± 0.04 | 0.81 ± 0.03 | 0.71 ± 0.07 | 0.91 ± 0.04 | 0.79 ± 0.04 | 0.64 ± 0.05 | 0.88 ± 0.04 | 0.74 ± 0.04 |
| 10017947 | 716 | | | | 0.80 ± 0.09 | 0.91 ± 0.03 | 0.87 ± 0.04 | 0.71 ± 0.13 | 0.87 ± 0.04 | 0.86 ± 0.04 |
| 10028395 | 1772 | | | | 0.82 ± 0.12 | 0.94 ± 0.01 | 0.93 ± 0.02 | 0.73 ± 0.09 | 0.93 ± 0.01 | 0.91 ± 0.02 |
| 10022891 | 186 | | | | 0.87 ± 0.10 | 0.99 ± 0.02 | 0.93 ± 0.05 | 0.77 ± 0.10 | 0.90 ± 0.07 | 0.87 ± 0.06 |
| 10027433 | 65 | | | | | | | 0.75 ± 0.13 | 0.69 ± 0.38 | 0.62 ± 0.10 |
| 10040785 | 327 | | | | | | | 0.78 ± 0.15 | 0.90 ± 0.05 | 0.84 ± 0.05 |
| 10038738 | 166 | | | | | | | 0.91 ± 0.14 | 0.94 ± 0.06 | 0.88 ± 0.07 |
| 10022117 | 63 | | | | | | | 0.63 ± 0.23 | 0.65 ± 0.16 | 0.56 ± 0.12 |
| 10015919 | 114 | | | | | | | 0.76 ± 0.21 | 0.96 ± 0.06 | 0.90 ± 0.08 |
| 10038604 | 101 | | | | | | | 0.30 ± 0.37 | 0.87 ± 0.11 | 0.80 ± 0.11 |
| 10047065 | 52 | | | | | | | 0.12 ± 0.28 | 0.69 ± 0.21 | 0.62 ± 0.16 |
| 10021428 | 35 | | | | | | | 0.78 ± 0.34 | 0.15 ± 0.37 | 0.72 ± 0.28 |
| 10041244 | 18 | | | | | | | 0.10 ± 0.31 | 0.38 ± 0.46 | 0.23 ± 0.32 |
| 10007541 | 173 | | | | | | | 0.29 ± 0.41 | 0.77 ± 0.09 | 0.64 ± 0.08 |
| 10038359 | 69 | | | | | | | 0.37 ± 0.42 | 0.87 ± 0.10 | 0.81 ± 0.11 |
| 10021881 | 11 | | | | | | | 0.15 ± 0.33 | 0.25 ± 0.38 | 0.28 ± 0.36 |
| 10013993 | 35 | | | | | | | 0.05 ± 0.22 | 0.92 ± 0.16 | 0.90 ± 0.20 |
| 10019805 | 19 | | | | | | | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.93 ± 0.11 |
| 10042613 | 2 | | | | | | | 0.05 ± 0.22 | 0.10 ± 0.31 | 0 ± 0 |
| 10029104 | 4 | | | | | | | 0 ± 0 | 0.72 ± 0.44 | 0 ± 0 |
| 10077536 | 1 | | | | | | | 0 ± 0 | | 0 ± 0 |
| 10010331 | 1 | | | | | | | 0 ± 0 | | 0 ± 0 |
| 10014698 | 4 | | | | | | | 0 ± 0 | | 0.72 ± 0.44 |
Table 5. Per-class F1 score over 20 repeated holdout evaluations. Values are reported as mean ± standard deviation.

| SOC | Support (ADEs) | SMM4H-3 | CADEC-3 | Both-3 | SMM4H-6 | CADEC-6 | Both-6 | SMM4H | CADEC | Both |
|---|---|---|---|---|---|---|---|---|---|---|
| 10037175 | 1196 | 0.78 ± 0.04 | 0.93 ± 0.02 | 0.84 ± 0.02 | 0.76 ± 0.04 | 0.92 ± 0.02 | 0.83 ± 0.03 | 0.75 ± 0.05 | 0.91 ± 0.03 | 0.80 ± 0.02 |
| 10018065 | 1758 | 0.75 ± 0.05 | 0.95 ± 0.01 | 0.89 ± 0.02 | 0.71 ± 0.05 | 0.88 ± 0.02 | 0.84 ± 0.02 | 0.70 ± 0.07 | 0.87 ± 0.02 | 0.81 ± 0.02 |
| 10029205 | 787 | 0.75 ± 0.05 | 0.94 ± 0.02 | 0.80 ± 0.02 | 0.72 ± 0.06 | 0.91 ± 0.02 | 0.77 ± 0.02 | 0.68 ± 0.04 | 0.88 ± 0.03 | 0.74 ± 0.03 |
| 10017947 | 716 | | | | 0.78 ± 0.09 | 0.91 ± 0.03 | 0.87 ± 0.03 | 0.70 ± 0.13 | 0.87 ± 0.03 | 0.85 ± 0.02 |
| 10028395 | 1772 | | | | 0.82 ± 0.09 | 0.95 ± 0.01 | 0.94 ± 0.01 | 0.75 ± 0.09 | 0.93 ± 0.01 | 0.92 ± 0.01 |
| 10022891 | 186 | | | | 0.83 ± 0.08 | 0.98 ± 0.02 | 0.93 ± 0.03 | 0.77 ± 0.10 | 0.89 ± 0.06 | 0.85 ± 0.04 |
| 10027433 | 65 | | | | | | | 0.73 ± 0.10 | 0.74 ± 0.36 | 0.60 ± 0.12 |
| 10040785 | 327 | | | | | | | 0.73 ± 0.15 | 0.89 ± 0.03 | 0.85 ± 0.03 |
| 10038738 | 166 | | | | | | | 0.88 ± 0.11 | 0.93 ± 0.04 | 0.90 ± 0.05 |
| 10022117 | 63 | | | | | | | 0.68 ± 0.22 | 0.65 ± 0.17 | 0.62 ± 0.10 |
| 10015919 | 114 | | | | | | | 0.69 ± 0.16 | 0.93 ± 0.05 | 0.88 ± 0.06 |
| 10038604 | 101 | | | | | | | 0.27 ± 0.30 | 0.85 ± 0.09 | 0.79 ± 0.08 |
| 10047065 | 52 | | | | | | | 0.11 ± 0.22 | 0.67 ± 0.17 | 0.59 ± 0.15 |
| 10021428 | 35 | | | | | | | 0.71 ± 0.30 | 0.15 ± 0.37 | 0.59 ± 0.22 |
| 10041244 | 18 | | | | | | | 0.10 ± 0.31 | 0.40 ± 0.47 | 0.24 ± 0.30 |
| 10007541 | 173 | | | | | | | 0.32 ± 0.43 | 0.77 ± 0.07 | 0.69 ± 0.06 |
| 10038359 | 69 | | | | | | | 0.41 ± 0.44 | 0.89 ± 0.08 | 0.82 ± 0.09 |
| 10021881 | 11 | | | | | | | 0.17 ± 0.35 | 0.28 ± 0.41 | 0.25 ± 0.27 |
| 10013993 | 35 | | | | | | | 0.05 ± 0.22 | 0.78 ± 0.15 | 0.72 ± 0.15 |
| 10019805 | 19 | | | | | | | 1.00 ± 0.00 | 0.89 ± 0.16 | 0.89 ± 0.08 |
| 10042613 | 2 | | | | | | | 0.05 ± 0.22 | 0.10 ± 0.31 | 0 ± 0 |
| 10029104 | 4 | | | | | | | 0 ± 0 | 0.73 ± 0.44 | 0 ± 0 |
| 10077536 | 1 | | | | | | | 0 ± 0 | | 0 ± 0 |
| 10010331 | 1 | | | | | | | 0 ± 0 | | 0 ± 0 |
| 10014698 | 4 | | | | | | | 0 ± 0 | | 0.49 ± 0.29 |
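
The per-class precision, recall, and F1 values in Tables 3–5, and the macro averages in Table 2, follow the standard one-vs-rest definitions. The sketch below shows how such values can be derived for a single holdout run. The function names are hypothetical, and scoring an undefined ratio as 0 is an assumption; this section does not state which convention the authors used for classes with no predicted or no true samples in a split.

```python
def per_class_metrics(y_true, y_pred, classes):
    """One-vs-rest precision, recall, F1, and support for each class."""
    metrics = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        # Assumption: undefined ratios (zero denominator) are scored as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = {"precision": precision, "recall": recall,
                      "f1": f1, "support": tp + fn}
    return metrics


def macro_average(metrics, key):
    """Unweighted mean of one metric over all classes."""
    return sum(m[key] for m in metrics.values()) / len(metrics)
```

Because the macro average weights every class equally, classes with very few samples can pull it well below the overall accuracy.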
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Dong, F.; Guo, W.; Liu, J.; Varghese, A.; Tong, W.; Patterson, T.A.; Hong, H. BERT-Based Models for Normalization of Adverse Drug Event Expressions in Social Media to Standard Medical Terminology for Drug Safety Analysis. Big Data Cogn. Comput. 2026, 10, 141. https://doi.org/10.3390/bdcc10050141

