Article

Adverse Drug Reaction Detection on Social Media Based on Large Language Models

School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
*
Author to whom correspondence should be addressed.
Information 2026, 17(4), 352; https://doi.org/10.3390/info17040352
Submission received: 18 January 2026 / Revised: 24 March 2026 / Accepted: 25 March 2026 / Published: 7 April 2026
(This article belongs to the Topic Generative AI and Interdisciplinary Applications)

Abstract

Adverse drug reaction (ADR) detection is essential for ensuring drug safety and effective pharmacovigilance. The rapid growth of users’ medication reviews posted on social media has introduced a valuable new data source for ADR detection. However, the large scale and high noise inherent in social media text pose substantial challenges to existing detection methods. Although large language models (LLMs) exhibit strong robustness to noisy and interfering information, they are often limited by issues such as stochastic outputs and hallucinations. To address these challenges, this paper proposes two generative detection frameworks based on Chain of Thought (CoT), namely LLaMA-DetectionADR for Supervised Fine-Tuning (SFT) and DetectionADRGPT for low-resource in-context learning. LLaMA-DetectionADR automatically generates CoT reasoning sequences to construct an instruction tuning dataset, which is then used to fine-tune the LLaMA3-8B model via Quantized Low-Rank Adaptation (QLoRA). In contrast, DetectionADRGPT leverages clustering algorithms to select representative unlabeled samples and enhances in-context learning by incorporating CoT reasoning paths together with their corresponding labels. Experimental results on the Twitter and CADEC social media datasets show that LLaMA-DetectionADR achieves excellent performance, with F1 scores of 92.67% and 86.13%, respectively. Meanwhile, DetectionADRGPT obtains competitive F1 scores of 87.29% and 82.80% with only a few labeled examples, approaching the performance of fully supervised advanced models. The overall results demonstrate the effectiveness and practical value of the proposed CoT-based generative frameworks for ADR detection from social media.

1. Introduction

Adverse Drug Reactions (ADRs) are defined as noxious and unintended responses to a medicinal product that occur at doses normally used in humans for the prophylaxis, diagnosis, or therapy of disease, or for the modification of physiological function [1]. These unanticipated reactions can significantly compromise patients’ physiological states and overall well-being, potentially leading to fatal consequences. Consequently, the timely and accurate detection of potential ADRs is critical for ensuring medication safety and effective pharmacovigilance [2,3].
With the rapid proliferation of social media platforms, increasing numbers of users are sharing their experiences with medication online. The massive volume of medication-related information generated on these platforms offers a viable opportunity for detecting ADR from social media data. In this study, we formulate the ADR detection task as a binary classification problem, as illustrated in Figure 1. The objective is to identify relevant texts containing ADR mentions from vast social media streams, thereby providing early warnings and valuable insights to clinicians and pharmaceutical research institutions to mitigate medication-related incidents.
The irregularity of social media data and the complexity of adverse drug reaction detection pose significant challenges for identifying adverse drug reactions from social media. Li et al. pointed out that social media texts commonly contain ungrammatical expressions, spelling errors, and similar issues that introduce substantial noise and hinder adverse drug reaction detection. Various methods have been proposed to alleviate this textual noise and improve detection accuracy. Although they reduce noise in social media data to a certain extent, they rely on fine-tuned encoders with relatively small parameter counts, which limits their noise robustness. Furthermore, the target symptoms relieved by a medication are often similar to adverse drug reaction symptoms, so whether a social media text describes an adverse drug reaction depends heavily on its textual logic. Fu et al. enhanced models’ understanding of grammatical logic by incorporating syntactic dependencies; however, social media texts are often messy and unstructured, and many user-generated posts do not follow standard grammar, so grammar-based enhancement is of limited effectiveness for understanding textual logic. With the rapid development of large language models (LLMs) represented by GPT [4] and LLaMA [5], which are pre-trained on massive corpora that include social media data, models with large parameter scales and extensive knowledge can greatly alleviate textual noise and deeply understand textual logic. In addition, Kojima et al. [6] proposed Chain-of-Thought (CoT) prompting, a technique to enhance the reasoning ability of large language models. Its core idea is to decompose a task into step-by-step intermediate reasoning, enabling the model to strengthen its understanding of textual logic before deriving the final result.
Relying on these technical advantages, enhancing large language models with CoT to mitigate noise in social media texts and strengthen textual logic understanding shows great potential for improving the performance of adverse drug reaction detection.
Given the scarcity of labeled data for ADR detection in social media, low-resource detection paradigms align more closely with practical application scenarios. Consequently, this study not only proposes a generative model for fully supervised ADR detection but also introduces a research framework for low-resource conditions to evaluate the performance of generative models under limited data availability. It is worth emphasizing that although mining social media data holds great potential for detecting adverse drug reactions, pharmacovigilance is an extremely rigorous task: all adverse drug reactions identified through text detection must be further strictly verified by professional medical practitioners. The main contributions of this paper are summarized as follows:
1.
We propose CoT-based large language model detection models and frameworks for adverse drug reaction detection in social media under both fully supervised and low-resource conditions. The fully supervised model LLaMA-DetectionADR achieves superior performance over previous discriminative models, and the low-resource framework DetectionADRGPT significantly reduces training and annotation costs for adverse drug reaction detection.
2.
In the fully supervised setting, LLaMA-DetectionADR introduces a method to automatically obtain CoT reasoning sequences. Using these automatically constructed chain-of-thought data to enhance fine-tuning of generative models effectively alleviates data noise, strengthens textual logic understanding, and improves detection performance for adverse drug reactions.
3.
In the low-resource setting, DetectionADRGPT employs in-context learning with CoT examples selected by clustering sampling. We verify the feasibility of using generative models to detect adverse drug reactions from social media under low-resource conditions.

2. Related Works

2.1. Advances in the Detection of Adverse Drug Reactions

In traditional practice, ADRs are mainly identified through clinical trials. Although ADRs determined in this manner are rigorous, pre-marketing clinical trials are limited by trial duration and sample size, so only a small fraction of potential adverse reactions can be detected; many adverse events remain undiscovered until drugs reach the market [7]. Guellil et al. [8] noted that user reviews and posts on social media contain abundant texts related to adverse drug reactions. Since these texts are publicly available and their use for research purposes does not involve any invasion of user privacy, they can provide supplementary early signals for pharmacovigilance. Research on pharmacovigilance using social media data often relies on a pipeline of several tasks: first, classifying social media texts; second, extracting drug and adverse-reaction named entities and identifying relation triples between drugs and adverse-reaction symptoms; and finally, mapping colloquial drug and symptom mentions from social media to standardized ontologies to discover potential adverse drug reactions [8]. NLP algorithms have become a common and convenient means of accomplishing these tasks. It should be emphasized that NLP-based ADR detection serves only as an auxiliary tool that points professional medical and pharmaceutical practitioners toward candidate pharmacovigilance signals; final confirmation still requires clinical verification. Nevertheless, the signals mined in this way can greatly improve the efficiency with which medical professionals identify ADRs during clinical trials [9]. Therefore, binary classification of ADR-related texts from social media is a fundamental and essential task for filtering relevant content from massive social media data; only after relevant texts are identified can the subsequent process of assisted pharmacovigilance be carried out effectively.

2.2. Rule-Based Methods for Adverse Drug Reaction Mining

Initial rule-based approaches to ADR detection were largely predicated on identifying specific ADR entities within a text, which became a pivotal standard in the field. Early research utilized dictionary matching techniques [10] to verify the presence of ADRs, alongside pattern matching methodologies [11] that leveraged predefined templates and rules for text classification. However, these dictionaries and rules are typically derived from known ADRs and depend heavily on surface-level lexical matching, which inherently limits the detection of potential or latent ADRs. Furthermore, because rule formulation and dictionary compilation are often tailored to specific datasets, these methods lack robust transferability across heterogeneous data sources and demonstrate constrained generalization capabilities.

2.3. Machine Learning Methods for Adverse Drug Reaction Mining

The proliferation of annotated datasets has catalyzed a distinct shift towards supervised learning methodologies [12,13,14,15] for ADR detection. A prevalent strategy involves the ensemble of multiple machine learning classifiers, trained on diverse manually extracted linguistic features to achieve automated classification. Rastegar-Mojarad et al. [12] designed an ensemble classifier specifically to mitigate the challenge of class imbalance. Their feature set encompassed unigram, bigram, and multi-term hybrid models alongside drug-ADR co-occurrence and sentiment scores. Similarly, Zhang et al. [13] proposed a weighted average ensemble framework consisting of four classifiers. This approach synthesized four distinct feature types—ADR concept matching, TF-IDF weighted word-level n-grams, Naive Bayes (NB) log likelihood ratios, and word embeddings—using a weighted average mechanism for the final prediction. In a different approach, Patki et al. [14] introduced a two-stage drug classification model. This model initially detects the presence of ADR within the text and subsequently evaluates the drug comprehensively to detect potential safety signals, effectively estimating the probability of adverse events. Moreover, Yang et al. [15] integrated text mining with partially supervised learning to automatically harvest ADR-relevant content from social media platforms. They observed that such online surveillance not only refines post-marketing pharmacovigilance strategies but also substantiates the feasibility of leveraging large-scale user-generated content for medical insights.

2.4. Deep Learning Methods for Adverse Drug Reaction Mining

In recent years, deep learning has been widely applied to text classification and ADR detection, and the scope of data sources for ADR detection has progressively expanded from clinical reports to social media platforms. Kim et al. [16] proposed the Text Convolutional Neural Network (TextCNN) model, pioneering the application of Convolutional Neural Networks (CNNs) to text classification tasks. Subsequently, Zhang et al. [17] introduced the Character-level Convolutional Neural Network (CharCNN). By discretizing sentences into character-level representations and utilizing a CNN for feature extraction, this model inherently learns to handle anomalous character combinations—such as misspellings and emoticons—thereby making it well-suited for detection and classification on noisy, informal data. Huynh et al. [18] proposed two novel neural network architectures for ADR detection: a Convolutional Recurrent Neural Network (CRNN) and a CNN incorporating attention mechanisms. Furthermore, Alimova et al. [19] proposed an Interactive Attention Network (IAN) to detect ADR within user reviews. Wu et al. [20] proposed a detection method based on multi-head self-attention and hierarchical tweet representation. By combining BiLSTM (Bidirectional Long Short-Term Memory) with a CNN, this method learns the representational features of tweets at the word level and detects drug adverse reaction information from tweets. Additionally, Zhang et al. [21] introduced an adversarial network model fused with a sentiment-aware attention mechanism. This approach employs both sentiment analysis and adversarial learning for ADR detection.
While the aforementioned models have achieved promising results, their performance remains somewhat constrained due to the absence of pre-trained language models. With the advancement of transfer learning and pre-trained encoders, Sun et al. [22] investigated various fine-tuning strategies for BERT in text classification tasks. By addressing issues such as text preprocessing, layer-wise learning rates, and catastrophic forgetting, they achieved significant performance improvements. Similarly, Qiu et al. [23] proposed a knowledge-enhanced deep–shallow network model incorporating domain keywords, which leverages external domain knowledge to boost the performance of drug adverse reaction detection. Li et al. [24] proposed a multi-feature enhanced model for ADR detection, which employs diverse feature fusion to alleviate data noise and improve domain cognition.
Furthermore, with the evolution of the capabilities of generative models, a number of studies are employing generative approaches for text detection. For instance, in the vertical medical domain, the Taiyi model was developed by fine-tuning the Qwen Base model [25], resulting in a medical LLM with robust detection and classification capabilities [26]. Zitu et al. [27] provided a comprehensive clinical perspective on LLM applications for adverse drug events, reviewing 39 studies that demonstrated excellent performance of LLM-driven approaches. Fu et al. [28] utilized data augmentation techniques based on ADR data to fine-tune large language models, which in turn enhanced the models’ performance. However, they overlooked the practical reality that sufficient existing labeled data are typically unavailable for such tasks.
In summary, in the specific field of ADR detection, research employing generative methods remains scarce. This paper proposes applying large language models with CoT reasoning to adverse drug reaction detection.

3. Proposed Methodologies

In this study, we incorporate the CoT methodology into ADR detection within social media to mitigate the hallucination phenomenon inherent in generative models. Specifically, larger LLMs are prompted to mimic human-like step-by-step reasoning and explicitly articulate the thought process. This mechanism facilitates a more profound comprehension of the related ADR information embedded in the text before generating the final detection results. Consequently, this approach alleviates hallucinations and enhances identification accuracy, effectively achieving a “think before you answer” paradigm.
Building upon this rationale, we propose two generative frameworks based on CoT to address the challenge of ADR detection in social media, which contain the LLaMA-DetectionADR framework based on fully supervised fine-tuning and the DetectionADRGPT framework based on in-context learning. These frameworks are customized specifically for ADR detection tasks, with each optimized for either the fully supervised scenario or the low-resource scenario.

3.1. Fully Supervised Adverse Drug Reaction Detection

In the fully supervised setting, we enhance the supervisory signals for fine-tuning generative models by incorporating Chain-of-Thought (CoT) explanations specifically derived for the Adverse Drug Reaction (ADR) classification task. These CoT explanations are automatically elicited from a large language model based on labeled instances from the training set. The complete construction workflow is illustrated in Figure 2, and the construction pipeline is as follows.
1.
CoT Explanation Generation Triggering: Labeled social media text (e.g., “Took aspirin for a cold last night, and woke up with a skin rash this morning” with label Yes) is fed into a predefined prompt template, which explicitly requires the model to explain the rationale for assigning the label based on the text content.
2.
Larger LLM Reasoning Generation: A Larger LLM is invoked to automatically generate CoT text that explains the detection rationale, based on the input text, label, and prompt.
3.
CoT Explanations Acquisition: The generated reasoning process is extracted from the output of the larger LLM, serving as part of the “Answer” for subsequent instruction fine-tuning.
4.
Instruction–Response Pair Construction: Manually designed high-quality prompts tailored for the ADR detection task are concatenated with the target social media text to form the instruction “Question”. The generated CoT explanations are concatenated with the final detection result (Yes/No) to form the “Answer”, yielding a complete set of instruction–response pairs.
5.
Efficient Fine-Tuning: The efficient fine-tuning algorithm Quantized Low-Rank Adaptation (QLoRA) is employed to perform supervised instruction fine-tuning on a Smaller LLM (e.g., LLaMA3-8B).
6.
Final Model: After fine-tuning, a CoT-enhanced large language model for ADR detection is obtained, designated as LLaMA-DetectionADR.
Figure 2. Demonstration of the entire process for instruction tuning data construction and instruction tuning of the fully supervised LLaMA-DetectionADR model.

3.1.1. Chain of Thought Data Generation

Initially, specific instructions are utilized to elicit CoT explanations from GPT-4o. Figure 3 illustrates the constructed instruction set, which comprises a task prompt and n few-shot examples designed to extract reasoning paths based on the target text and its corresponding detection result. The generated output represents the logical rationale for detecting adverse drug reactions in the text. The primary objective is to leverage GPT-4o to synthesize logical reasoning data that bridges the gap between the source text and the detection outcome. Subsequently, these outputs are aggregated to acquire the specific CoT reasoning processes for ADR detection.
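As a concrete illustration, the prompt assembly in this step might look like the following sketch; the template wording, function name, and example texts are illustrative assumptions rather than the authors' exact GPT-4o instruction.

```python
def build_cot_elicitation_prompt(task_prompt, examples, text, label):
    """Assemble an instruction asking a larger LLM (e.g., GPT-4o) to
    explain why `text` received the ADR detection `label`.
    NOTE: template wording is a hypothetical stand-in."""
    parts = [task_prompt]
    # n few-shot examples: (text, label, reasoning path) triples
    for ex_text, ex_label, ex_cot in examples:
        parts.append(f"Text: {ex_text}\nLabel: {ex_label}\nReasoning: {ex_cot}")
    # The target instance: the model completes the missing reasoning.
    parts.append(f"Text: {text}\nLabel: {label}\nReasoning:")
    return "\n\n".join(parts)

prompt = build_cot_elicitation_prompt(
    "Explain step by step why the given label applies to the text.",
    [("Took aspirin, woke up with a rash.", "Yes",
      "A rash appearing after taking aspirin is an abnormal post-medication symptom.")],
    "This cough syrup finally cleared up my cold.",
    "No",
)
```

The model's completion after the final "Reasoning:" marker is then harvested as the CoT explanation for that instance.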

3.1.2. Instruction Data Construction

To enhance the generalization capability of the instruction prompts for the ADR detection task, we designed five distinct and high-quality prefix prompts:
  • Please identify whether the following text contains information about adverse drug reactions.
  • Kindly verify if the text provided below contains any details related to adverse drug reactions.
  • Ascertain whether the following passage includes information on adverse drug reactions.
  • Judge whether the text that follows contains relevant information regarding adverse drug reactions.
  • Identify if there is any content about adverse drug reactions in the text presented below.
For each data instance, one ADR detection prompt is randomly selected from the aforementioned pool and concatenated with the raw text from the dataset to formulate the “Question” component of the instruction tuning pair. Simultaneously, the elicited CoT reasoning sequence is concatenated with the final detection label to constitute the “Answer”. To ensure linguistic coherence and a smooth transition between the reasoning process and the conclusion, the connective phrase “Therefore, the final determination is:” is inserted between them. An example of the constructed instruction tuning dataset for ADR detection is presented in Figure 4.
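The construction of a single instruction–response pair can be sketched as follows; the helper name and dictionary keys are assumptions for illustration, while the prefix prompts and connective phrase are those given above.

```python
import random

# The five prefix prompts listed in the text above.
PREFIX_PROMPTS = [
    "Please identify whether the following text contains information about adverse drug reactions.",
    "Kindly verify if the text provided below contains any details related to adverse drug reactions.",
    "Ascertain whether the following passage includes information on adverse drug reactions.",
    "Judge whether the text that follows contains relevant information regarding adverse drug reactions.",
    "Identify if there is any content about adverse drug reactions in the text presented below.",
]
CONNECTIVE = "Therefore, the final determination is:"

def build_instruction_pair(text, cot, label, rng=random):
    """Randomly pick a prefix prompt for the Question; join the CoT
    reasoning, connective phrase, and label for the Answer."""
    question = rng.choice(PREFIX_PROMPTS) + "\n" + text
    answer = f"{cot} {CONNECTIVE} {label}"
    return {"Question": question, "Answer": answer}

pair = build_instruction_pair(
    "Took aspirin for a cold last night, and woke up with a skin rash this morning.",
    "A skin rash appearing shortly after taking aspirin is an abnormal symptom after medication.",
    "Yes",
)
```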

3.1.3. QLoRA-Based Instruction Fine-Tuning

In this subsection, we fine-tune the LLaMA3-8B architecture to develop the LLaMA-DetectionADR model, which performs reasoning steps before outputting final detection results. To mitigate computational overhead and reduce training time, we employ QLoRA [29], a highly efficient fine-tuning approach, to train the model.
QLoRA represents a highly efficient fine-tuning methodology specifically tailored for LLMs. Its core philosophy leverages Low-Rank Adaptation (LoRA) and quantization techniques to significantly reduce the computational resources and memory footprint required during fine-tuning while maintaining model performance. This approach effectively addresses the computational resource and storage bottlenecks inherent in traditional fine-tuning.
As a representative parameter-efficient fine-tuning technique, LoRA provides the fundamental theoretical support for QLoRA. The key mechanism of LoRA is low-rank decomposition, which bridges the original high-dimensional model weights and the lightweight fine-tuning paradigm. Specifically, instead of updating the entire weight matrices of the original pre-trained language model, low-rank decomposition reformulates the weight learning objective as optimizing only two small low-rank matrices. This strategy substantially reduces the number of trainable parameters and lowers computational complexity. The derivation rests on the core assumption of LoRA: the weight update ΔW of the pre-trained model has low-rank structure, meaning that the changes in weights during fine-tuning can be approximated in a low-dimensional space. Based on this assumption, the high-dimensional update matrix ΔW can be decomposed into the product of two low-dimensional matrices, A and B; Equations (1) and (2) quantify the resulting weight update.
ΔW = A · B      (1)
W′ = W + ΔW = W + A · B      (2)
where A ∈ ℝ^(m×r) and B ∈ ℝ^(r×n), with r ≪ min(m, n). In this approach, only these two smaller matrices are fine-tuned instead of the entire weight matrix, significantly reducing the number of parameters required for fine-tuning.
The specific fine-tuning process for input data is as follows. First, the preprocessed ADR detection texts, including their CoT explanations, are fed into the frozen LLaMA3-8B pre-trained model to obtain the model’s initial response. Next, with the original weight matrix W fixed, only the parameters of the low-rank matrices A and B are updated: the detection loss for the input text is computed, its gradients are obtained through backpropagation, and A and B are iteratively optimized by gradient descent so that the weight update ΔW = A·B minimizes the loss. Finally, ΔW is added to the original weight matrix W to obtain the updated matrix W′, completing a single fine-tuning iteration. This process is repeated until the preset number of epochs is reached, yielding the fine-tuned LLaMA-DetectionADR model.
Furthermore, QLoRA employs quantization to convert the weights and activations of the original model from high-precision floating-point representations to low-precision integer representations, which reduces storage requirements and accelerates computation. QLoRA adopts a mixed-precision quantization strategy: during fine-tuning, the critical low-rank adapter parameters A and B are maintained in high precision so that training performance remains uncompromised, while the frozen original model parameters W are quantized to low precision to optimize both computation and storage. By synergistically combining low-rank adaptation with quantization, QLoRA significantly reduces the resource consumption and computational cost of fine-tuning while preserving the model’s inherent performance.
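The parameter savings implied by Equations (1) and (2) can be illustrated numerically. The matrix sizes below are arbitrary stand-ins, not LLaMA3-8B’s actual layer dimensions.

```python
import numpy as np

m, n, r = 1024, 1024, 8                 # r << min(m, n)
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))         # frozen pre-trained weight
A = rng.standard_normal((m, r)) * 0.01  # trainable low-rank factor
B = np.zeros((r, n))                    # trainable; zero init keeps ΔW = 0 at start

delta_W = A @ B                         # Equation (1)
W_prime = W + delta_W                   # Equation (2)

full_params = m * n
lora_params = m * r + r * n
print(lora_params / full_params)        # 0.015625: under 2% of the full matrix
```

With B initialized to zero, W′ equals W before any training step, so fine-tuning starts from the unmodified pre-trained behavior and only drifts as A and B are updated.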

3.2. Adverse Drug Reaction Detection in Low-Resource Conditions

This subsection regards the task of ADR detection under low-resource conditions as a generative task of in-context learning for GPT-4o with few-shot prompts. Specifically, given an input context string C and a target text x, we model the probability of the output y, where C consists of a task instruction P and a series of N examples ( x i , y i ) , which is formally expressed as shown in Equation (3). T represents the sequence length of the output result y, and t is an index variable used to iterate over each token in the output sequence y.
p_LM(y | C, x) = ∏_{t=1}^{T} p(y_t | C, x, y_{<t})      (3)
The overall framework of the proposed method is presented in Figure 5. To enhance the generalization capability of LLMs during few-shot in-context learning, and drawing inspiration from the Auto-CoT framework [30], we propose a cluster-based sampling strategy. This strategy stratifies unlabeled texts semantically through clustering and selects the core samples of each semantic cluster via representative sampling. It ensures that the selected samples cover the main semantic types of social media ADR texts, avoids the semantic singularity of few-shot exemplars, and eliminates high-noise edge samples, improving the quality and representativeness of the annotated samples. A small number of manually annotated samples can thus provide a comprehensive and typical reference for the in-context learning of large language models, greatly enhancing the model’s generalization ability and detection accuracy in low-resource scenarios. The framework can be broken down into the following five steps:
1.
Clustering and Representative Sampling: First, we use Sentence-BERT [31] to generate vector representations for the collected unlabeled social media texts, then we partition the dataset into N semantic clusters via the k-means algorithm. Within each cluster, texts are ranked in ascending order of their distance to the cluster centroid, and the representative samples closest to the centroid are selected. In this step, the core role of clustering is to realize semantic clustering of unlabeled data, explore the semantic correlations and category differences among texts, and ensure that the samples cover different scenarios of ADR detection. Representative sampling is designed to screen out the samples that best reflect the core characteristics of each semantic cluster, reduce the cost of subsequent manual annotation, and meanwhile guarantee the typicality of annotated samples and avoid the interference of edge noise samples on model learning.
2.
Manual CoT and Label Annotation: The selected representative samples are manually annotated to generate corresponding CoT explanations and classification labels (Yes/No). For example: Aspirin was taken last night, and a skin rash appeared this morning—this is an abnormal symptom after medication, so the final determination is: Yes.
3.
Context Prompt Template Construction: The ADR detection task instruction, the manually annotated examples with CoT selected from different clusters, and the target text to be analyzed are concatenated to form a complete contextual input prompt.
4.
Large Language Model Inference: The constructed contextual prompt is fed into a large language model to trigger few-shot in-context learning.
5.
CoT Explanation and Detection Result Generation: Based on the contextual examples and the target text, the model automatically generates a reasoning process (CoT explanation) for the target text and outputs the final ADR detection result, thereby completing the adverse drug reaction detection task under low-resource conditions.
Figure 5. Demonstration of the entire construction process for the low-resource DetectionADRGPT framework.
The input to the DetectionADRGPT framework mainly consists of a task description for adverse drug reaction detection, with demonstrations of N examples selected from different clusters and the target text to be analyzed. The output includes an explanation of the detection process for the target text as well as the final detection result. Examples of all input contextual texts and the corresponding outputs are illustrated in Figure 6.
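Step 1 of the framework (clustering and representative sampling) might be sketched as follows; random vectors stand in for Sentence-BERT embeddings, and a hand-rolled k-means loop replaces a library call, so this is an illustration of the selection logic rather than the authors' implementation.

```python
import numpy as np

def select_representatives(embeddings, n_clusters, seed=0, n_iter=10):
    """Cluster sentence embeddings with a few k-means iterations and
    return the index of the sample closest to each centroid."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from random distinct samples.
    centroids = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
        assign = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = embeddings[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    # Representative sampling: the sample nearest each centroid.
    dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
    return [int(dists[:, k].argmin()) for k in range(n_clusters)]

# Stand-in for Sentence-BERT vectors of 60 unlabeled posts.
embeddings = np.random.default_rng(1).standard_normal((60, 16))
reps = select_representatives(embeddings, n_clusters=5)  # 5 posts to annotate
```

The returned indices identify the posts that would then be manually annotated with CoT explanations and labels (Step 2).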

4. Experiment Results and Discussion

4.1. Datasets

To impartially evaluate the performance of the proposed method, we conducted experiments on two social media benchmark datasets designed for ADR detection, using 5-fold cross validation on both. Specifically, each dataset is randomly divided into five equal parts; in each of five independent rounds, one part serves as the test set, one as the validation set, and the remaining three as the training set, and the reported results are averaged over the five rounds. To ensure fair comparison with the baseline models, we adopted the publicly available data from the original papers for both datasets and applied no special preprocessing. The statistics of the two datasets are presented in Table 1.
1.
CADEC [32]: This dataset is derived from users’ reviews on the medical forum AskAPatient. Each post is annotated with drugs, adverse effects, symptoms and diseases. For this study, posts containing annotated adverse effects are labeled as “Yes”, while the remaining posts are labeled as “No”.
2.
Twitter [33]: This dataset constitutes a subset of the TwiMed corpus. TwiMed aggregates drug-related reports from both social media platforms and formal scientific literature. The social media component is specifically referred to as Twitter. The original dataset was annotated at the entity level, encompassing drugs, diseases, symptoms and their interrelationships. A document is classified as containing an ADR if any relationship within it is annotated as an “outcome-negative”.
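The 5-fold protocol described above can be sketched as follows; the seed and the choice of which fold becomes the validation set in each round are illustrative assumptions.

```python
import random

def five_fold_splits(n_samples, seed=42):
    """Yield (train, val, test) index lists for 5 rounds: one fold as
    the test set, the next as validation, the remaining three as training."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::5] for i in range(5)]
    for i in range(5):
        held_out = {i, (i + 1) % 5}
        train = [x for j in range(5) if j not in held_out for x in folds[j]]
        yield train, folds[(i + 1) % 5], folds[i]

splits = list(five_fold_splits(100))  # 5 rounds of 60/20/20 splits
```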

4.2. Experimental Parameters

The model experiments in this paper were conducted on an Ubuntu 24.04 system and implemented in PyTorch 2.7.0. For supervised fine-tuning, the LLaMA-DetectionADR model was trained for 10 epochs. QLoRA employed 4-bit precision quantization, with a dropout rate of 0.05 and a learning rate of 3 × 10−4. The AdamW optimizer was used, and the generation temperature was set to 0.1. Training was performed on an NVIDIA RTX 3090 GPU. These parameters were tuned experimentally. Additionally, for the closed-source model GPT-4o, inference was carried out via the official OpenAI API, also with the generation temperature set to 0.1. More detailed parameter settings, including API parameters under the low-resource setting and QLoRA parameters for fully supervised fine-tuning, are provided in Appendices A–C.
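A QLoRA setup matching the reported settings (4-bit quantization, dropout 0.05, learning rate 3 × 10−4, 10 epochs) might look like the following configuration sketch using the Hugging Face transformers and peft libraries; the LoRA rank, alpha, and target modules are assumptions, since this section does not list them, and the snippet is not meant to be run without the model weights.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the frozen base weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config)

# High-precision low-rank adapters; r, alpha, target_modules are assumed values.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="llama-detection-adr",
    num_train_epochs=10,
    learning_rate=3e-4,
)
```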

4.3. Evaluation Metrics

This section evaluates the model using three common metrics: Precision (P), Recall (R), and F1 score. Here, $TP_{\mathrm{num\_sentence}}$ represents the number of correctly classified adverse drug reaction sentences, $FN_{\mathrm{num\_sentence}}$ represents the number of sentences containing adverse drug reactions that were predicted as not containing them, and $FP_{\mathrm{num\_sentence}}$ represents the number of sentences without adverse drug reactions that were predicted as containing them. The formulas are defined as shown in Equations (4)–(6).

$$P = \frac{TP_{\mathrm{num\_sentence}}}{TP_{\mathrm{num\_sentence}} + FP_{\mathrm{num\_sentence}}} \tag{4}$$

$$R = \frac{TP_{\mathrm{num\_sentence}}}{TP_{\mathrm{num\_sentence}} + FN_{\mathrm{num\_sentence}}} \tag{5}$$

$$F1 = \frac{2 \times P \times R}{P + R} \tag{6}$$
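Equations (4)–(6) translate directly into code; a minimal sketch computing the sentence-level metrics from the three counts:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from sentence-level counts,
    following Equations (4)-(6); guards against zero denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```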

4.4. Baseline Discriminative Models

This subsection introduces the mainstream discriminative models for ADR detection that serve as comparative baselines in this study.
1.
CRNN [18]: A Convolutional Recurrent Neural Network designed for text classification tasks. This architecture effectively stacks convolutional layers atop a recurrent neural network to capture both local and sequential features.
2.
HTR-MSA [20]: A model featuring a hierarchical multi-head self-attention mechanism. It leverages hierarchical representations to learn multi-dimensional textual features specifically for adverse drug reaction detection.
3.
CNN + corpus [34]: This approach augments the training corpus by directly incorporating ADR data into the training set of each specific dataset. Subsequently, a CNN is trained on this augmented dataset to enhance generalization.
4.
CNN + transfer [34]: Utilizing a transfer learning paradigm, this model jointly trains on both source and target datasets. It employs a CNN to extract shared features across both domains while utilizing distinct text classifiers to generate predictions for each respective dataset.
5.
ATL [34]: An adversarial transfer learning framework that employs shared and private feature extractors to learn cross-domain common features and task-specific private features. It incorporates adversarial training to mitigate interference from corpus discrepancies.
6.
ANNSA [21]: A neural network model integrating a sentiment-aware attention mechanism with adversarial training. It uses an attention mechanism to extract fine-grained sentiment features associated with the emotional lexicon and employs adversarial training for robust data augmentation.
7.
CGEM [35]: This model constructs a heterogeneous graph treating words and documents as nodes. It employs Graph Neural Networks (GNNs) such as Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) to learn complex relationships between nodes, thereby effectively improving the accuracy and robustness of ADR detection.
8.
KESDT [23]: A multi-layer Transformer model infused with ADR domain knowledge. It integrates domain-specific keywords across Transformer layers and incorporates external knowledge via data augmentation and specialized loss functions to enhance ADR detection performance in social media contexts.
9.
KnowCAGE [36]: Building upon the CGEM architecture, this model incorporates Unified Medical Language System medical knowledge to augment the representation capability of the graph embedding model. Furthermore, it introduces a concept-aware self-attention mechanism to enhance feature discrimination. This model has achieved state-of-the-art results on ADR detection tasks across multiple public datasets.
10.
DMFE [24]: A multi-feature-enhanced adverse drug reaction detection model that mitigates data noise and augments detection with a domain lexicon.

4.5. Comparison of ADR Detection Results with Discriminative Models

Table 2 and Table 3 compare the performance of the proposed LLaMA-DetectionADR model and the DetectionADRGPT framework with baseline discriminative models on two social media datasets, CADEC and Twitter.
The experimental results demonstrate that the proposed LLaMA-DetectionADR model achieves excellent performance across all datasets. Specifically, on the CADEC dataset, the LLaMA-DetectionADR model attains an F1 score of 92.67%, which represents a 2.27% improvement over KnowCAGE. On the Twitter dataset, the LLaMA-DetectionADR model achieves a precision of 85.99%, a recall of 86.28% and an F1 score of 86.13%. These figures correspond to improvements of 1.19%, 2.18%, and 1.73% over KnowCAGE.
Furthermore, in low-resource settings, the DetectionADRGPT framework achieves F1 scores of 87.29% and 82.80% on the CADEC and Twitter datasets utilizing only five labeled instances. This performance surpasses that of the vast majority of discriminative models and lags behind the state-of-the-art discriminative model KnowCAGE by only 3.11% and 1.60%. Although a performance gap remains between the proposed cluster-based Chain of Thought in-context learning GPT framework and advanced supervised fine-tuning methods, LLMs are rapidly evolving with increasingly potent intrinsic capabilities. Given their ability to yield effective detection results without incurring substantial training resource costs, low-resource detection frameworks represent a promising trend for future ADR detection research and warrant further investigation.
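The prompting step of the low-resource framework can be sketched as follows. This is an illustrative assembly only: the exemplar fields (text, CoT reasoning path, label) mirror the description above, but the field names and prompt wording are hypothetical, not the authors’ actual templates:

```python
def build_icl_prompt(exemplars, target_text):
    """Assemble a few-shot in-context learning prompt in which each
    clustered exemplar carries its Chain-of-Thought reasoning path and
    its final ADR label, followed by the target text to classify."""
    parts = ["Decide whether each post reports an adverse drug reaction (ADR)."]
    for ex in exemplars:
        parts.append(
            f"Post: {ex['text']}\n"
            f"Reasoning: {ex['cot']}\n"
            f"Answer: {ex['label']}"
        )
    # End with the target post so the model continues the reasoning pattern.
    parts.append(f"Post: {target_text}\nReasoning:")
    return "\n\n".join(parts)
```

The trailing “Reasoning:” cue encourages the model to “think before detect”, producing a reasoning path before the final Yes/No label.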

4.6. Comparison of ADR Detection Results with Generative Models

In addition to comparing with discriminative baseline models for ADR detection, this subsection benchmarks the two approaches proposed in this study against mainstream generative models. The descriptions of these comparative models and the associated experimental methodologies are provided below:
1.
GPT-4o [4]: GPT-4o is a large language model built upon a Transformer architecture. In this study, we conducted both zero-shot and few-shot experiments on the GPT-4o model utilizing default parameter settings.
2.
DeepSeek-V3 [37]: Developed by DeepSeek, DeepSeek-V3 is an open-source inference model based on the Mixture of Experts architecture. Similarly, we conducted zero-shot and few-shot experiments using this model.
3.
LLaMA3 [38]: The Base version of the 8B parameter LLaMA3 model has only undergone pretraining. In this study, it is applied to the instruction fine-tuning task on the Adverse Drug Reaction (ADR) detection dataset, yielding LLaMA3-8B-SFT.
4.
Bal-LLaMA [28]: A LLaMA-3-based generative framework with similarity-based augmentation and instruction tuning for social media ADR detection.
The experimental results are presented in Table 4 and Table 5. On the CADEC dataset, the LLaMA-DetectionADR model with CoT enhancement achieves an F1 score of 92.67%, while the standalone LLaMA3-8B-SFT model trained solely via instruction tuning yields an F1 score of 91.78%, corresponding to an improvement of 0.89%. The recall rate of LLaMA-DetectionADR also exhibits a notable increase, rising from 91.44% to 93.21%. On the Twitter dataset, LLaMA-DetectionADR attains an F1 score of 86.13%, compared with 85.17% for LLaMA3-8B-SFT, with an improvement of 0.96%. For this dataset, both precision and recall of LLaMA-DetectionADR show synchronous gains alongside the F1 improvement. Collectively, these results confirm that CoT data effectively strengthens the ability of LLaMA3 to interpret ADR-related textual information, enabling the model to extract ADR-specific details with higher precision. This observation aligns with the conclusions of existing theoretical frameworks in the field.
Moreover, the proposed DetectionADRGPT framework exhibits substantial performance gains over GPT-4o. On the CADEC dataset, DetectionADRGPT improves F1 by 6.75 % and 5.87 % relative to GPT-4o in zero-shot and few-shot settings. On the Twitter dataset, corresponding improvements of 6.34 % and 5.00 % are observed. These findings validate the effectiveness of CoT data in enhancing detection performance under low-resource conditions, where large-scale labeled ADR data is often limited. The significant lead over GPT-4o also underscores the value of task-specific fine-tuning with CoT explanations, which outperforms the general-purpose few-shot prompting of a state-of-the-art closed-source model.
Compared to Bal-LLaMA, which incorporates similarity-based augmentation and instruction tuning specifically for social media ADR detection, the proposed LLaMA-DetectionADR achieves competitive performance. On the CADEC dataset, LLaMA-DetectionADR attains an F1 score of 92.67%, representing a 0.27 percentage point improvement over Bal-LLaMA’s 92.40%, with comparable precision and recall values. On the Twitter dataset, LLaMA-DetectionADR also outperforms Bal-LLaMA, with the F1 score increasing from 85.50% to 86.13%, a gain of 0.63 percentage points. These results suggest that the CoT enhancement strategy employed in LLaMA-DetectionADR provides effective complementary benefits to existing domain-specific augmentation approaches, achieving superior detection accuracy without relying on external similarity-based data augmentation techniques.
Corresponding experiments on DeepSeek-V3 further validate this trend. Evaluated under both zero-shot and few-shot settings for ADR detection, DeepSeek-V3 achieves F1 scores of 77.88 % and 78.60 % on CADEC and 79.04 % and 79.10 % on Twitter. These values are substantially lower than those of both DetectionADRGPT and LLaMA-DetectionADR, reinforcing that CoT-augmented fine-tuning yields more robust ADR detection than zero- or few-shot inference of alternative models.

4.7. Ablation Study of LLaMA-DetectionADR

To further substantiate the efficacy of the individual modules within the proposed LLaMA-DetectionADR model, this subsection presents an ablation study. Specifically, we analyze the model’s performance by systematically removing the CoT component and the multi-prompt instruction strategy, thereby relying solely on a single prompt for inference.
The results of this ablation study are presented in Table 6 and Table 7. It is evident that excluding the CoT explanations leads to a consistent performance degradation across both datasets, with F1 scores decreasing by 0.81% and 0.58% on the CADEC and Twitter datasets, respectively. This validates that, within the instruction fine-tuning paradigm, CoT data facilitates a structured reasoning mechanism for ADR detection, enabling the model to decompose complex textual information and identify nuanced ADR signals that might otherwise be overlooked, thereby enhancing the model’s accuracy in analyzing textual content.
Furthermore, when the multi-prompt strategy is ablated and replaced with a single instruction prompt for the instruction–response pairs, the model’s F1 scores decline by 0.45% and 0.39% on the CADEC and Twitter datasets, respectively. This indicates that employing multiple high-quality prompts can more effectively elicit LLMs’ learning capabilities and improve their generalization. Crucially, this strategy prevents the model from converging to a narrow domain representation, which would otherwise compromise performance on out-of-distribution or noisy inputs.
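The multi-prompt strategy can be illustrated with a minimal sketch of instruction–response pair construction. The template wordings below are hypothetical stand-ins for the paper’s actual prompts:

```python
import random

# Illustrative instruction templates; the paper's real prompt pool is not shown here.
TEMPLATES = [
    "Does the following post describe an adverse drug reaction? Post: {text}",
    "Read the post and answer Yes or No: does it mention an ADR? Post: {text}",
    "Classify the post for adverse drug reactions (Yes/No). Post: {text}",
]

def make_sft_record(text, cot, label, rng):
    """Build one instruction-response pair, drawing the instruction from a
    pool of templates so the fine-tuned model does not overfit a single
    phrasing; the response interleaves the CoT explanation and the label."""
    instruction = rng.choice(TEMPLATES).format(text=text)
    response = f"{cot} Final answer: {label}"
    return {"instruction": instruction, "output": response}
```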

4.8. Ablation Study of DetectionADRGPT

To validate the effectiveness of the constituent modules within the proposed DetectionADRGPT framework, this subsection conducts a parallel ablation study. The experiments systematically analyze the impact of each component by:
1.
Removing the clustering algorithm and substituting it with random text selection for in-context learning to assess the value of semantic sampling;
2.
Eliminating the CoT explanations, thereby restricting GPT to perform standard in-context learning without explicit reasoning paths;
3.
Excluding the relevant few-shot exemplars entirely, effectively requiring GPT to perform reasoning and generate detection results in a zero-shot setting.
The results of the ablation study are presented in Table 8 and Table 9. It is observed that eliminating the clustering module leads to a decrease in F1 scores by 1.40% and 0.58% on the CADEC and Twitter datasets for the DetectionADRGPT framework. This demonstrates that selecting representative exemplars from diverse categories via fine-grained clustering enhances the robustness of the prompting framework. Specifically, this strategy ensures that each target text can retrieve semantically similar cases within the prompt exemplars for in-context learning, thereby improving the framework’s overall performance. Upon ablating the CoT module, the F1 scores declined by 2.99% and 1.16% across the two datasets. This indicates that enabling the generative model to engage in a “think before detect” process can mitigate the hallucination phenomenon inherent in large models, thereby improving the accuracy of ADR detection. Finally, when the few-shot case prompts were removed entirely, the framework’s F1 scores declined by 5.22% and 2.31% on the respective datasets. This underscores the critical role of prompt exemplars in enabling the generative model to perform effective in-context learning for the specific task of ADR detection.

4.9. Error Analysis

To investigate the performance bottlenecks of models in social media ADR detection and to verify the effect of the CoT enhancement strategy, this section presents a targeted error-comparison experiment on the Twitter dataset. Two types of typically error-prone test samples, 100 instances each, were selected through preliminary GPT screening followed by careful manual review: noisy social media text containing substantial interfering information, and logically ambiguous text with vague expressions and loose semantic relevance.
The experimental results are shown in Table 10 and Table 11. It can be observed that for the specially selected noisy text and logically ambiguous text, the performance of both LLaMA-SFT and LLaMA-DetectionADR has decreased to a certain extent compared to the full dataset. However, the performance degradation of LLaMA-DetectionADR is far smaller than that of LLaMA-SFT, which demonstrates that CoT can alleviate the noise in social media text data to a certain degree and enhance the model’s understanding of data logic.

4.10. Impact of Different Foundation Models on Performance Under Full Supervision

To further validate the efficacy of CoT within the fully supervised fine-tuning paradigm, this subsection conducts supplementary experiments on a selection of prominent open-source models beyond the LLaMA3-8B backbone. Specifically, we selected four additional models: LLaMA2-7B, Qwen2.5-7B, Qwen3-8B, and ChatGLM3-6B. Among these, LLaMA2-7B is the predecessor of Meta AI’s current-generation model, Qwen2.5-7B and Qwen3-8B are developed by Alibaba Cloud’s Tongyi Lab, and ChatGLM3-6B is jointly released by Zhipu AI and the KEG Laboratory at Tsinghua University.
The comparative results are illustrated in Figure 7 and Figure 8. Figure 7 presents the F1 score performance on the CADEC dataset, where all models demonstrate clear improvements when fine-tuned with CoT: LLaMA2-7B increased from 88.19% to 90.66%, Qwen2.5-7B from 89.98% to 91.25%, Qwen3-8B from 90.75% to 92.09%, ChatGLM3-6B from 89.77% to 90.18%, and LLaMA3-8B from 91.78% to 92.67%. Similarly, Figure 8 reports the results on the Twitter dataset, where the F1 scores also rose consistently with CoT: LLaMA2-7B improved from 83.2% to 84.12%, Qwen2.5-7B from 82.74% to 83.48%, Qwen3-8B from 83.17% to 85.07%, ChatGLM3-6B from 81.78% to 83.19%, and LLaMA3-8B from 83.18% to 85.17%. Across these diverse model backbones, fine-tuning with CoT explanatory data yields consistent improvements in ADR detection performance. These findings further corroborate the effectiveness of integrating CoT to augment model capabilities within a fully supervised fine-tuning framework.

4.11. Impact of Cluster Quantity on Performance in Low-Resource Conditions

This subsection presents a further analysis of the hyperparameter N, which represents the number of clusters in the DetectionADRGPT framework. Specifically, we configured the number of semantic clusters to 3, 5, and 8 for the Twitter dataset. Annotations for CoT and classification labels were exclusively applied to the representative samples selected after clustering. The corresponding results are depicted in Figure 9.
It can be observed that when the number of clusters is insufficient, the model performs poorly, likely because the limited diversity prevents the target text from retrieving an appropriate reference exemplar for comparison. When the number of clusters becomes excessive, performance tends to plateau or even decline. This can be attributed to over-segmentation, which produces overly fine-grained clusters that may interfere with GPT’s pre-existing knowledge. Furthermore, given the inherent noise in social media data, an excessive number of exemplars may introduce semantic inconsistencies, thereby confusing the model. Consequently, for ADR detection in social media, the model achieves optimal F1 performance when the number of clusters is set to 5.
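The exemplar-selection step that N controls can be sketched with a toy k-means over 2-D points. In the framework itself, clustering operates on sentence embeddings (e.g., Sentence-BERT vectors); the functions below are illustrative stand-ins, not the authors’ implementation:

```python
import math
import random

def kmeans(points, n_clusters, iters=20, seed=0):
    """A minimal k-means over points given as coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, n_clusters)
    groups = [[] for _ in range(n_clusters)]
    for _ in range(iters):
        groups = [[] for _ in range(n_clusters)]
        for p in points:
            i = min(range(n_clusters), key=lambda c: math.dist(p, centroids[c]))
            groups[i].append(p)
        # Recompute centroids; keep the old centroid for an empty cluster.
        centroids = [
            tuple(sum(v) / len(g) for v in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids, groups

def representatives(points, n_clusters):
    """Pick, for each cluster, the point nearest its centroid -- these are
    the samples that would receive manual CoT and label annotations."""
    centroids, groups = kmeans(points, n_clusters)
    reps = []
    for c, g in zip(centroids, groups):
        if g:
            reps.append(min(g, key=lambda p: math.dist(p, c)))
    return reps
```

With N = 5 clusters, exactly five representative samples would be annotated, matching the five labeled instances used by DetectionADRGPT.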

5. Conclusions and Future Work

Social media has evolved into a pivotal platform for daily communication, where user-generated content constitutes a massive volume of textual data. These data contain authentic patient experiences regarding medication, such as tweets about drug effects on Twitter and discussions on therapeutic outcomes in online health communities. Compared with traditional data sources, social media data is characterized by rapid update rates and large scale.
In this paper, we proposed two CoT-based generative frameworks for ADR detection, namely the fully supervised LLaMA-DetectionADR and the in-context learning-based DetectionADRGPT. Specifically, LLaMA-DetectionADR leverages CoT reasoning sequences generated by ChatGPT for data augmentation and integrates multi-instruction prompt templates to perform supervised instruction fine-tuning on LLaMA3-8B. Experimental results on two datasets demonstrate that this approach surpasses both traditional discriminative models and standard instruction-tuned models, confirming that CoT explanations effectively mitigate model hallucinations and enhance ADR detection performance in fully supervised settings. In contrast, DetectionADRGPT incorporates clustering algorithms to screen representative samples automatically and introduces manually annotated CoT explanations and detection labels, validating that combining CoT with in-context exemplars significantly improves the ADR detection capabilities of generative models in low-resource scenarios.
Although this study has achieved substantial progress in ADR detection from social media, several emerging research questions warrant further investigation. The results of adverse drug reaction mining based on social media texts can only serve as auxiliary early warning signals, and their medical rigor needs to be further verified. Final conclusions must be confirmed through professional clinical studies and medical examinations. With the rapid ascent of novel social media platforms such as Instagram and TikTok, the modality of social media content has transitioned from unimodal text to complex structures encompassing images, videos and audio. The proliferation of such multimodal data presents both new opportunities and challenges for pharmacovigilance. Future research will focus on developing social media analysis models that process multimodal information, thereby enhancing the diversity and comprehensiveness of ADR detection research.

Author Contributions

Conceptualization, H.L. (Hao Li) and H.L. (Hongfei Lin); Methodology, H.L. (Hao Li); Software, H.L. (Hao Li); Validation, H.L. (Hongfei Lin); Formal analysis, H.L. (Hao Li); Investigation, H.L. (Hao Li) and H.L. (Hongfei Lin); Resources, H.L. (Hongfei Lin); Data curation, H.L. (Hao Li); Writing—original draft preparation, H.L. (Hao Li); Writing—review and editing, H.L. (Hongfei Lin); Visualization, H.L. (Hao Li); Supervision, H.L. (Hongfei Lin); Project administration, H.L. (Hongfei Lin). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Closed-source model inference parameters.

Model Name    Model ID & Release Date    Temperature   Top_p   Max Tokens
GPT-4o        gpt-4o-2025-03-26          0.1           1.0     5000
DeepSeek-V3   deepseek-chat-v3-0324      0.1           1.0     5000
Table A2. CoT generation (GPT-4o) parameters.

Item          Configuration
Model         gpt-4o-2025-03-26
Temperature   0.3
Top_p         1.0
Max Tokens    10,000

Appendix B

Table A3. LLaMA3-8B QLoRA supervised fine-tuning configuration.

Parameter                Value
LoRA Rank (r)            16
LoRA Alpha (α)           32
LoRA Dropout             0.05
Target Modules           q_proj, v_proj
Quantization             4-bit NF4
Optimizer                AdamW
Learning Rate            3 × 10−4
Precision                BF16
Random Seed              42
Generation Temperature   0.1

Appendix C

The total API call expenses for this work are approximately $200. All fully supervised fine-tuning experiments were conducted on an RTX 3090 server and completed after about 170 h of training. Inference with the trained LLaMA-DetectionADR model requires one NVIDIA RTX 3090 GPU.

Appendix D

This study strictly adheres to academic research ethics and data privacy regulations. All social media data (CADEC and Twitter datasets) employed are publicly available, anonymized benchmarks that contain no personal identifiable information (PII), private user data, or sensitive personal health information (PHI). These data are used exclusively for academic research on adverse drug reaction detection, in full compliance with platform terms of service and biomedical research norms. Notably, the ADR detection results derived from social media texts serve only as auxiliary signals for pharmacovigilance and must be further verified by professional medical practitioners before any clinical application.

References

  1. Baber, N. International conference on harmonisation of technical requirements for registration of pharmaceuticals for human use (ICH). Br. J. Clin. Pharmacol. 1994, 37, 401. [Google Scholar] [CrossRef]
  2. Khemani, B.; Malave, S.; Shinde, S.; Shukla, M.; Shikalgar, R.; Talwar, H. AI-Driven Pharmacovigilance: Enhancing Adverse Drug Reaction Detection with Deep Learning and NLP. MethodsX 2025, 15, 103460. [Google Scholar] [CrossRef]
  3. Wei, Y.; Li, R.; Sun, C.; Zhu, C.; Chen, T.; Yang, H.; Liu, H. The Role of Artificial Intelligence in Adverse Drug Reaction Monitoring: Current Status and Challenges. Med. J. Peking Union Med. Coll. Hosp. 2025, 16, 1363–1370. [Google Scholar] [CrossRef]
  4. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  5. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  6. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213. [Google Scholar]
  7. Hsu, D.; Moh, M.; Moh, T.S.; Moh, D. Drug side effect frequency mining over a large twitter dataset using apache spark. In Handbook of Artificial Intelligence in Biomedical Engineering; Apple Academic Press: Palm Bay, FL, USA, 2021; pp. 233–259. [Google Scholar] [CrossRef]
  8. Golder, S.; Xu, D.; O’Connor, K.; Wang, Y.; Batra, M.; Hernandez, G.G. Leveraging natural language processing and machine learning methods for adverse drug event detection in electronic health/medical records: A scoping review. Drug Saf. 2025, 48, 321–337. [Google Scholar] [CrossRef]
  9. Murphy, R.M.; Klopotowska, J.E.; de Keizer, N.F.; Jager, K.J.; Leopold, J.H.; Dongelmans, D.A.; Abu-Hanna, A.; Schut, M.C. Adverse drug event detection using natural language processing: A scoping review of supervised learning methods. PLoS ONE 2023, 18, e0279842. [Google Scholar] [CrossRef]
  10. Kuhn, M.; Campillos, M.; Letunic, I.; Jensen, L.J.; Bork, P. A side effect resource to capture phenotypic effects of drugs. Mol. Syst. Biol. 2010, 6, MSB200998. [Google Scholar] [CrossRef]
  11. Regev, Y.; Finkelstein-Landau, M.; Feldman, R.; Gorodetsky, M.; Zheng, X.; Levy, S.; Charlab, R.; Lawrence, C.; Lippert, R.A.; Zhang, Q.; et al. Rule-based extraction of experimental evidence in the biomedical domain: The KDD Cup 2002 (task 1). ACM Sigkdd Explor. Newsl. 2002, 4, 90–92. [Google Scholar] [CrossRef]
  12. Rastegar-Mojarad, M.; Elayavilli, R.K.; Yu, Y.; Liu, H. Detecting signals in noisy data-can ensemble classifiers help identify adverse drug reaction in tweets. In Proceedings of the Social Media Mining Shared Task Workshop at the Pacific Symposium on Biocomputing, Kohala Coast, HI, USA, 4–8 January 2016. [Google Scholar]
  13. Liu, J.; Zhao, S.; Zhang, X. An ensemble method for extracting adverse drug events from social media. Artif. Intell. Med. 2016, 70, 62–76. [Google Scholar] [CrossRef]
  14. Patki, A.; Sarker, A.; Pimpalkhute, P.; Nikfarjam, A.; Ginn, R.; O’Connor, K.; Smith, K.; Gonzalez, G. Mining adverse drug reaction signals from social media: Going beyond extraction. Proc. Biolinksig 2014, 2014, 1–8. [Google Scholar]
  15. Yang, M.; Wang, X.; Kiang, M.Y. Identification of Consumer Adverse Drug Reaction Messages on Social Media. In Proceedings of the Pacific Asia Conference on Information Systems (PACIS), Jeju Island, Republic of Korea, 18–22 June 2013; p. 193. Available online: https://aisel.aisnet.org/pacis2013/193 (accessed on 23 March 2026).
  16. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 26–28 October 2014; pp. 1746–1751. [Google Scholar] [CrossRef]
  17. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 2015, 28, 649–657. Available online: https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html (accessed on 23 March 2026).
  18. Huynh, T.; He, Y.; Willis, A.; Rueger, S. Adverse Drug Reaction Classification With Deep Neural Networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers; The COLING 2016 Organizing Committee: Osaka, Japan, 2016. [Google Scholar]
  19. Alimova, I.; Solovyev, V. Interactive attention network for adverse drug reaction classification. In Proceedings of the Conference on Artificial Intelligence and Natural Language; Springer: Cham, Switzerland, 2018; pp. 185–196. [Google Scholar] [CrossRef]
  20. Wu, C.; Wu, F.; Liu, J.; Wu, S.; Huang, Y.; Xie, X. Detecting tweets mentioning drug name and adverse drug reaction with hierarchical tweet representation and multi-head self-attention. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 34–37. [Google Scholar] [CrossRef]
  21. Zhang, T.; Lin, H.; Xu, B.; Yang, L.; Wang, J.; Duan, X. Adversarial neural network with sentiment-aware attention for detecting adverse drug reactions. J. Biomed. Inform. 2021, 123, 103896. [Google Scholar] [CrossRef] [PubMed]
  22. Sun, C.; Qiu, X.; Xu, Y.; Huang, X. How to fine-tune bert for text classification? In Proceedings of the China National Conference on Chinese Computational Linguistics; Springer: Cham, Switzerland, 2019; pp. 194–206. [Google Scholar] [CrossRef]
  23. Qiu, Y.; Zhang, X.; Wang, W.; Zhang, T.; Xu, B.; Lin, H. Kesdt: Knowledge enhanced shallow and deep transformer for detecting adverse drug reactions. In Proceedings of the CCF International Conference on Natural Language Processing and Chinese Computing; Springer: Cham, Switzerland, 2023; pp. 601–613. [Google Scholar] [CrossRef]
  24. Li, H.; Qiu, Y.; Lin, H. Multi-Feature Enhanced Adverse Drug Reaction Detection for Social Media. J. Chin. Inf. Process. 2025, 39, 148–156. [Google Scholar]
  25. Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen technical report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
  26. Luo, L.; Ning, J.; Zhao, Y.; Wang, Z.; Ding, Z.; Chen, P.; Fu, W.; Han, Q.; Xu, G.; Qiu, Y.; et al. Taiyi: A bilingual fine-tuned large language model for diverse biomedical tasks. J. Am. Med. Inform. Assoc. 2024, 31, 1865–1874. [Google Scholar] [CrossRef]
  27. Zitu, M.M.; Owen, D.; Manne, A.; Wei, P.; Li, L. Large Language Models for Adverse Drug Events: A Clinical Perspective. J. Clin. Med. 2025, 14, 5490. [Google Scholar] [CrossRef]
  28. Fu, W.; Lin, H.; Xu, G.; Qiu, Y.; Wang, J.; Diao, Y.; Zheng, P. Data Augmentation and Instruction Fine-Tuning for ADR Detection. In Proceedings of the China Health Information Processing Conference; Springer: Singapore, 2024; pp. 3–20. [Google Scholar]
  29. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Adv. Neural Inf. Process. Syst. 2023, 36, 10088–10115. [Google Scholar] [CrossRef]
  30. Shum, K.; Diao, S.; Zhang, T. Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023; Association for Computational Linguistics: Singapore, 2023; pp. 12113–12139. [Google Scholar] [CrossRef]
  31. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Association for Computational Linguistics: Hong Kong, China, 2019; pp. 3980–3990. [Google Scholar] [CrossRef]
  32. Karimi, S.; Metke-Jimenez, A.; Kemp, M.; Wang, C. Cadec: A corpus of adverse drug event annotations. J. Biomed. Inform. 2015, 55, 73–81. [Google Scholar] [CrossRef]
  33. Alvaro, N.; Miyao, Y.; Collier, N. TwiMed: Twitter and PubMed comparable corpus of drugs, diseases, symptoms, and their relations. JMIR Public Health Surveill. 2017, 3, e6396. [Google Scholar] [CrossRef]
  34. Li, Z.; Yang, Z.; Luo, L.; Xiang, Y.; Lin, H. Exploiting adversarial transfer learning for adverse drug reaction detection from texts. J. Biomed. Inform. 2020, 106, 103431. [Google Scholar] [CrossRef]
  35. Gao, Y.; Ji, S.; Zhang, T.; Tiwari, P.; Marttinen, P. Contextualized graph embeddings for adverse drug event detection. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2022; pp. 605–620. [Google Scholar] [CrossRef]
  36. Gao, Y.; Ji, S.; Marttinen, P. Knowledge-augmented graph neural networks with concept-aware attention for adverse drug event detection. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024); ELRA and ICCL: Torino, Italy, 2024; pp. 9787–9798. [Google Scholar]
  37. Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. Deepseek-v3 technical report. arXiv 2024, arXiv:2412.19437. [Google Scholar] [CrossRef]
  38. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
Figure 1. Examples of adverse drug reaction detection in social media.
Figure 3. Automatic construction of chain of thought inputs and outputs.
Figure 4. Adverse drug reaction detection question–answer pairs constructed for SFT.
Figure 6. Demonstration of templates for guiding model outputs under low-resource conditions.
Figure 7. Performance comparison of different base models on the CADEC dataset with and without CoT.
Figure 8. Performance comparison of different base models on the Twitter dataset with and without CoT.
Figure 9. The impact of cluster quantity on DetectionADRGPT on the Twitter dataset.
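DetectionADRGPT selects representative unlabeled samples by clustering their sentence embeddings and taking, from each cluster, the sample nearest its centroid (the cluster quantity being the hyperparameter varied in Figure 9). The following is an illustrative sketch, not the authors' exact pipeline: a plain NumPy k-means stands in for the paper's clustering step, random vectors stand in for Sentence-BERT embeddings, and the function names `kmeans` and `select_representatives` are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: returns (centroids, labels). Illustrative only."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids, keeping the old one if a cluster empties.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def select_representatives(X, k):
    """Per cluster, pick the index of the sample closest to the centroid."""
    centroids, labels = kmeans(X, k)
    picks = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue
        d = np.linalg.norm(X[members] - centroids[j], axis=1)
        picks.append(int(members[d.argmin()]))
    return sorted(picks)

# Random vectors as stand-ins for Sentence-BERT embeddings of unlabeled posts.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 32))
demos = select_representatives(X, k=8)
print(demos)  # indices of the selected demonstration samples
```

The selected indices would then be annotated with CoT reasoning paths and labels to serve as in-context demonstrations.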
Table 1. Introduction to adverse drug reaction detection datasets from social media.

| Dataset | ADR | Non-ADR | Total | Max Length |
|---------|------|---------|-------|------------|
| CADEC   | 2478 | 4996    | 7474  | 46         |
| Twitter | 232  | 393     | 625   | 241        |
Table 2. Performance comparison between the proposed model and different discriminative models on the CADEC dataset.

| Model | P (%) | R (%) | F1 (%) |
|-------|-------|-------|--------|
| CRNN | 61.26 | 65.96 | 63.52 |
| HTR + MSA | 60.67 | 61.70 | 61.18 |
| CNN + corpus | 52.75 | 61.28 | 56.69 |
| CNN + transfer | 61.84 | 60.00 | 60.91 |
| ATL | 63.68 | 63.40 | 63.54 |
| ANNSA | 82.73 | 83.52 | 83.06 |
| KESDT | 88.16 | 87.63 | 87.82 |
| KnowCAGE | 87.10 | 93.90 | 90.40 |
| DMFE | 88.40 | 89.63 | 89.01 |
| DetectionADRGPT | 86.37 (0.21) | 88.24 (0.34) | 87.29 (0.29) |
| LLaMA-DetectionADR | 92.13 (0.59) | 93.21 (0.26) | 92.67 (0.41) |
Table 3. Performance comparison between the proposed model and different discriminative models on the Twitter dataset.

| Model | P (%) | R (%) | F1 (%) |
|-------|-------|-------|--------|
| CRNN | 68.52 | 66.43 | 67.46 |
| HTR + MSA | 66.58 | 63.62 | 65.07 |
| CNN + corpus | 60.51 | 61.50 | 61.00 |
| CNN + transfer | 69.58 | 61.74 | 65.42 |
| ATL | 70.84 | 65.02 | 67.81 |
| ANNSA | 58.82 | 73.34 | 64.18 |
| CGEM | 84.20 | 83.70 | 83.90 |
| KnowCAGE | 84.80 | 84.10 | 84.40 |
| DetectionADRGPT | 81.39 (0.51) | 84.25 (0.89) | 82.80 (0.72) |
| LLaMA-DetectionADR | 85.99 (0.65) | 86.28 (0.43) | 86.13 (0.52) |
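The F1 scores reported throughout Tables 2–11 follow the standard definition as the harmonic mean of precision and recall. As a quick sanity check, using the LLaMA-DetectionADR row of Table 3:

```python
def f1(p, r):
    """F1 score: harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

# LLaMA-DetectionADR on Twitter (Table 3): P = 85.99, R = 86.28
print(round(f1(85.99, 86.28), 2))  # → 86.13, matching the table
```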
Table 4. Performance comparison between the proposed method and mainstream generative models on the CADEC dataset.

| Model | P (%) | R (%) | F1 (%) |
|-------|-------|-------|--------|
| GPT4o (zero shot) | 81.92 | 79.21 | 80.54 |
| GPT4o (few shot) | 84.17 | 78.84 | 81.42 |
| DeepseekV3 (zero shot) | 75.29 | 80.66 | 77.88 |
| DeepseekV3 (few shot) | 76.22 | 81.13 | 78.60 |
| LLaMA3-8B-SFT | 92.12 | 91.44 | 91.78 |
| Bal-LLaMA | 92.00 | 93.00 | 92.40 |
| DetectionADRGPT | 86.37 (0.21) | 88.24 (0.34) | 87.29 (0.29) |
| LLaMA-DetectionADR | 92.13 (0.59) | 93.21 (0.26) | 92.67 (0.41) |
Table 5. Performance comparison between the proposed method and mainstream generative models on the Twitter dataset.

| Model | P (%) | R (%) | F1 (%) |
|-------|-------|-------|--------|
| GPT4o (zero shot) | 80.22 | 79.37 | 79.79 |
| GPT4o (few shot) | 82.19 | 80.11 | 81.13 |
| DeepseekV3 (zero shot) | 76.99 | 81.22 | 79.04 |
| DeepseekV3 (few shot) | 75.57 | 82.98 | 79.10 |
| LLaMA3-8B-SFT | 84.16 | 86.20 | 85.17 |
| Bal-LLaMA | 85.90 | 86.30 | 85.50 |
| DetectionADRGPT | 81.39 (0.51) | 84.25 (0.89) | 82.80 (0.72) |
| LLaMA-DetectionADR | 85.99 (0.65) | 86.28 (0.43) | 86.13 (0.52) |
Table 6. Ablation experiments of the LLaMA-DetectionADR model on the CADEC dataset.

| Model | P (%) | R (%) | F1 (%) | ΔF1 |
|-------|-------|-------|--------|-----|
| LLaMA-DetectionADR | 92.13 | 93.21 | 92.67 | – |
| w/o CoT | 91.17 | 92.56 | 91.86 | 0.81 |
| w/o Multiple Prompts | 92.25 | 92.19 | 92.22 | 0.45 |
Table 7. Ablation experiments of the LLaMA-DetectionADR model on the Twitter dataset.

| Model | P (%) | R (%) | F1 (%) | ΔF1 |
|-------|-------|-------|--------|-----|
| LLaMA-DetectionADR | 85.99 | 86.28 | 86.13 | – |
| w/o CoT | 84.85 | 86.27 | 85.55 | 0.58 |
| w/o Multiple Prompts | 85.79 | 85.71 | 85.74 | 0.39 |
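The ΔF1 column in the ablation tables is simply the F1 drop of each ablated variant relative to the full model. Using the Twitter figures from Table 7:

```python
# ΔF1 = F1(full model) − F1(ablated variant), in percentage points
full_f1 = 86.13  # LLaMA-DetectionADR on Twitter (Table 7)
for name, f1_score in [("w/o CoT", 85.55), ("w/o Multiple Prompts", 85.74)]:
    print(name, round(full_f1 - f1_score, 2))  # → 0.58 and 0.39
```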
Table 8. Ablation experiments of the DetectionADRGPT architecture on the CADEC dataset.

| Model | P (%) | R (%) | F1 (%) | ΔF1 |
|-------|-------|-------|--------|-----|
| DetectionADRGPT | 86.37 | 88.24 | 87.29 | – |
| w/o clustering | 84.69 | 87.12 | 85.89 | 1.40 |
| w/o CoT | 82.49 | 86.17 | 84.29 | 2.99 |
| w/o prompt | 84.17 | 80.08 | 82.07 | 5.22 |
Table 9. Ablation experiments of the DetectionADRGPT architecture on the Twitter dataset.

| Model | P (%) | R (%) | F1 (%) | ΔF1 |
|-------|-------|-------|--------|-----|
| DetectionADRGPT | 81.39 | 84.25 | 82.80 | – |
| w/o clustering | 81.71 | 82.74 | 82.22 | 0.58 |
| w/o CoT | 82.16 | 81.04 | 81.60 | 1.16 |
| w/o prompt | 81.67 | 79.36 | 80.49 | 2.31 |
Table 10. Model performance comparison on noisy text from the Twitter dataset.

| Model | P (%) | R (%) | F1 (%) |
|-------|-------|-------|--------|
| LLaMA-SFT | 79.15 | 84.31 | 81.65 |
| LLaMA-DetectionADR | 83.56 | 84.92 | 84.23 |
Table 11. Model performance comparison on logically ambiguous text from the Twitter dataset.

| Model | P (%) | R (%) | F1 (%) |
|-------|-------|-------|--------|
| LLaMA-SFT | 74.81 | 75.33 | 75.07 |
| LLaMA-DetectionADR | 81.27 | 82.64 | 81.95 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Li, H.; Lin, H. Adverse Drug Reaction Detection on Social Media Based on Large Language Models. Information 2026, 17, 352. https://doi.org/10.3390/info17040352