Adverse Drug Event Detection Using a Weakly Supervised Convolutional Neural Network and Recurrent Neural Network Model

Social media and health-related forums, including customer drug reviews, have recently provided data sources for adverse drug reaction (ADR) identification research. However, existing methods neglect noisy data and require manually labeled data, which reduces the accuracy of the prediction results and greatly increases manual labor. We propose a novel architecture named the weakly supervised mechanism (WSM) convolutional neural network (CNN) long short-term memory model (WSM-CNN-LSTM), which combines the strengths of a CNN and bi-directional long short-term memory (Bi-LSTM). The WSM applies weakly labeled data to pre-train the parameters of the model and then uses labeled data to fine-tune the initialized network parameters. The CNN employs a convolutional layer to learn the characteristics of the drug reviews and active features at different scales, and the feed-forward and feed-back neural networks of the Bi-LSTM then utilize these salient features to output the regression results. The experimental results demonstrate that our model outperforms the comparison models in ADR identification and that a small quantity of labeled samples yields optimal performance, which decreases the influence of noise and reduces the manual data-labeling requirements.


Introduction
Adverse drug reactions (ADRs) are among the leading causes of morbidity and mortality in public health. Research has indicated that deaths and hospitalizations due to ADRs number in the millions (up to 5% of hospitalizations, 28% of emergency treatments, and 5% of deaths), and the related cost is approximately 75 billion dollars annually [1][2][3]. Post-marketing drug safety monitoring is therefore essential for pharmacovigilance. Regulatory agencies (e.g., the Food and Drug Administration (FDA)) establish and support spontaneous reporting systems (SRSs) to monitor pharmacovigilance activities in the United States. Suspected ADRs may be reported by patients and healthcare providers through these surveillance systems. However, biased and underreported events limit the effectiveness of these systems, which capture an estimated 10% of ADRs [4].
Social media, especially health-related social networks (e.g., DailyStrength (http://www.dailystrength.org) and AskaPatient (https://www.askapatient.com/)), enable both patients and nursing staff to share and obtain comments regarding drug safety. Drug reviews of patient feedback on social media are a potential and timely source for ADR identification [5,6]. User reviews contain sentiment information (i.e., positive, negative, or neutral expressions) that provides important features for ADR identification [7], and sentiment features can measurably improve ADR detection in health-related forum reviews [8].
In this study, based on the intuition that patient reviews about adverse drug reactions (ADRs) express negative sentiments, we aim to recognize ADRs through sentiment classification, which is commonly used to complete ADR identification through social media reviews [9]. The current sentiment classification methods are typically divided into three categories: (1) lexicon-based methods, (2) traditional machine learning methods, and (3) deep learning methods. Lexicon-based methods implement a string-matching approach that matches the detected terms to predefined drug adverse event lexicons [10,11]. However, lexicon matching cannot easily distinguish whether a drug-related event refers to an ADR or to an indication for a medication. In addition, the characteristics of social media language (e.g., informal, vernacular, abbreviations, symbols, misspellings, and irregular grammar) further limit the precision of the lexicon matching method in ADR identification.
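The lexicon-matching idea can be illustrated with a minimal sketch; the tiny lexicon below is hypothetical, and real systems would match against curated ADR vocabularies:

```python
# A minimal sketch of lexicon-based ADR matching. The toy lexicon is
# hypothetical; real systems match against curated ADR term lists.
ADR_LEXICON = {"headache", "nausea", "dizziness", "dry mouth"}

def find_adr_mentions(review: str) -> list:
    """Return lexicon terms found in a lowercased review by string matching."""
    text = review.lower()
    return sorted(term for term in ADR_LEXICON if term in text)

mentions = find_adr_mentions("Felt constant nausea and a dull headache after day 2.")
```

As the text notes, such matching cannot tell whether "nausea" here is an ADR or the indication the drug was taken for, and misspellings ("nausia") would be missed entirely.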
Traditional machine learning classifiers (e.g., conditional random fields (CRFs)) [12,13] combine knowledge bases with sentiment-related text features. However, the fixed-width window mechanism of CRFs only considers the target word and its neighbouring words within the input scope; therefore, important information associated with more distant words may be excluded.
Deep learning models (e.g., convolutional neural networks (CNNs)) [14][15][16] can overcome these limitations of CRFs. Hierarchical CNNs specialize in extracting position-invariant features. Given the specificity of social media user reviews, an entire sentence may describe a positive sentiment even though phrases containing a negative sentiment (e.g., "don't" and "miss") appear. Thus, the long short-term memory (LSTM) network (specifically, a class of recurrent neural networks (RNNs)) [17,18] with a sequential architecture can be used to correctly process long sentences. The LSTM's memory mechanism, which is well suited for tagging tasks, uses a hidden state to remember previous labeling decisions and then labels the current token. However, LSTM does not perform well in the sentiment classification of social media when completing a key-phrase recognition task [19].
Furthermore, a deep learning model is an end-to-end model, allowing the computer to automatically learn sentiment features, thereby reducing feature-extraction complexity and incompleteness. However, a successful deep learning model depends on large-scale labeled data, and obtaining massive labeled training data manually is time-consuming and expensive. The lack of large-scale labeled data has become a bottleneck for deep learning in ADR identification-related research [20].
To reduce the limitations of deep learning, researchers mine information from the data generated by users (e.g., sentiment ratings, tweets, reviews, and emoticons), which is helpful for training sentiment classifiers. However, the labeling behaviour, in which users assign predefined labels to each review, is arbitrary and has no uniform standard. These labeled data are noisy (e.g., a high score paired with a negative review) and are called weakly labeled data [21]. A classification model influenced by the noise in weakly labeled data will have lower accuracy [22].
In this work, we propose a deep learning framework for the sentiment classification of drug reviews. The framework utilizes a weakly supervised mechanism (WSM) that applies weakly labeled data to pre-train the parameters of the model and then uses labeled data to fine-tune the initialized parameters. First, we leverage a large quantity of weakly labeled data to pre-train a deep neural network so that the sentiment distribution of the drug reviews is reflected in the network. Second, we utilize a small quantity of labeled data to fine-tune the network and learn the target prediction function. In contrast, previous training methods, usually based on weakly labeled data, directly learn the target prediction function, which can be impaired by the noise in the data. The CNN is better at classifying sentences with a simple syntactic structure, whereas the LSTM can capture long-distance dependencies in comment statements and is better at "understanding" the semantics of sentences as a whole. Through the training framework of "weakly supervised pre-training + supervised fine-tuning", the influence of noise on the model training process is reduced, and a large amount of useful information in the weakly labeled data is better "remembered" by the deep model. The time efficiencies of the CNN, LSTM, and CNN-LSTM do not differ greatly on our small datasets. Our method performs well in ADR recognition.
We propose a model that applies the WSM and combines the strengths of the CNN and bi-directional long short-term memory (Bi-LSTM) [23][24][25] (named WSM-CNN-LSTM) to complete the sentiment classification task of ADR reviews. The WSM-CNN-LSTM model includes two parts: the CNN employs a convolutional layer to learn and extract the characteristics and active features of different scales within the drug reviews; then, the Bi-LSTM captures past and future information through its forward and backward networks, respectively, and utilizes the sentence sequence information to compose features sequentially and output the regression results.
To effectively train the WSM-CNN-LSTM model, we collect drug reviews identified as weakly labeled datasets, containing 61,263 comments from the AskaPatient.com forum, to pre-train a deep neural network. Additionally, a manually labeled dataset containing 11,083 comments is used to fine-tune the network to learn the target prediction function. Sufficient experiments are designed and implemented to validate the effectiveness of the WSM-CNN-LSTM model.
In this work, our contributions are as follows: We propose a novel method that uses a WSM for the sentiment analysis of ADR reviews to avoid a large amount of manually labeled data. The WSM greatly reduces the influence of noise in the weakly labeled data on the model. To our knowledge, this is the first such work on health forums, particularly in the field of drug review sentiment analysis.
We propose a novel architecture named WSM-CNN-LSTM to complete the task of ADR identification. Our results show that a stand-alone CNN model performs poorly on the long texts characteristic of most drug reviews, while adding the feed-forward and feed-back neural networks dramatically improves the classification results.
We validate through experiments that the WSM-CNN-LSTM model presents superior performance in ADR identification, in which a large amount of weakly labeled data is utilized to pre-train a deep neural network and a small quantity of labeled data is used to fine-tune the network and learn the target prediction function. Our proposed training method avoids the direct use of weakly labeled data to train the target prediction function, which partly reduces the influence of noise on the prediction function.
This paper is organized as follows. The weakly supervised multi-channel CNN-LSTM model proposed in this paper is introduced in Section 2. In Section 3, the experimental process and results are discussed. Finally, Section 4 concludes and presents directions for future work.

Related Work
In recent years, some researchers have used potential resources from social media to detect ADRs. Leaman et al. [26] applied a lexicon-based approach and used 450 comments to develop a concept/relation extraction system. Akhtyamova et al. [27] proposed a CNN model based on varied structural parameters, where a majority vote determines the model's prediction. Santiso et al. [28] proposed a deep model based on the LSTM to discover ADRs from electronic health records (EHRs); embeddings are created using lemmas to reinforce the lexical variability of EHRs. However, due to the lack of labeled data, the accuracy of the prediction results needs to be improved.
Fortunately, although large-scale labeled data are lacking, there is still a large amount of weakly labeled data on social networks, such as comments containing sentiment orientation. Tutubalina et al. [29] proposed a method based on ADR review scores to predict demographics, in which a weakly tagged text corpus is used to generate a dictionary. However, the lexicon generated from weakly labeled data in their work still does not escape the limitations of domain knowledge.

Word Embedding
As the input to our model, we generate high-dimensional word vectors that capture morphological, syntactic, and semantic information in the word embedding layer. We trained every word as a k-dimensional (300-dimension) word vector using the publicly available GloVe toolkit [30], where k represents the dimension of the word vector. The sentence matrix is obtained by connecting the word vectors after pre-training. Let W_i ∈ R^k be the i-th k-dimensional word vector in a sentence; then, a drug review with n word vectors is encoded as the sentence matrix W ∈ R^(n×k), which is composed of a sequence of word vectors denoted as:

W = [W_1, W_2, …, W_n]. (1)
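The construction of the sentence matrix W can be sketched as follows, with toy 4-dimensional vectors standing in for the 300-dimensional GloVe embeddings:

```python
import numpy as np

# Illustrative construction of the sentence matrix W ∈ R^(n×k) from
# per-word embeddings. The toy k=4 vectors stand in for the paper's
# 300-dimensional GloVe vectors.
k = 4
embeddings = {
    "drug": np.array([0.1, 0.2, 0.3, 0.4]),
    "works": np.array([0.5, 0.1, 0.0, 0.2]),
    "well": np.array([0.3, 0.3, 0.1, 0.0]),
}

def sentence_matrix(tokens, embeddings, k):
    """Stack word vectors row-wise; unknown words map to zero vectors."""
    rows = [embeddings.get(t, np.zeros(k)) for t in tokens]
    return np.stack(rows)  # shape (n, k)

W = sentence_matrix(["drug", "works", "well"], embeddings, k)
```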

Framework of the WSM-CNN-LSTM Model
We propose a novel architecture named WSM-CNN-LSTM, which introduces a WSM combining the strengths of the CNN and LSTM to complete the three-label (positive, neutral, and negative) classification task for drug reviews; it is a variation of the CNN-LSTM model in [31,32].
Figure 1 indicates the architecture of the WSM-CNN-LSTM model. There were six varieties of layers in this model: input layer, convolutional layer, max-pooling and dropout layer, Bi-LSTM layer, fully connected layer, and softmax layer. First, the pre-trained word vectors were input into the convolutional layer to perform a convolution via linear filters with different lengths; the effect of a convolution is to extract features from word vectors and generate feature maps. Second, the max-pooling layer extracted salient features from the feature maps generated by the convolution and then input them into the forward and backward LSTM networks. In the LSTM layer, these salient features were used to output the regression results. Finally, the fully connected and softmax layers took the regression results from the LSTM and output the final classification results.


Convolutional Layer
The convolutional layer was used to effectively extract features from the sentence matrix through a set of convolution filters F ∈ R^(h×k), where h is the length of the filter. This method convolves the sentence matrix W input by the word embedding layer to obtain the feature map M ∈ R^(n−h+1), a vector with one column. Different sizes of feature maps are produced by different filter sizes. The i-th output element of each filter, m_i, is generated as:

m_i = f(F ⊗ w_(i:i+h−1) + b), (2)

and the feature map M ∈ R^(n−h+1) is produced as:

M = [m_1, m_2, …, m_(n−h+1)], (3)

where b is a bias, ⊗ is the convolutional operator, and f is a nonlinear function (e.g., tanh). We used the activation function ReLU [33] for fast calculation, and w_(i:i+h−1) denotes the word vectors, represented as:

w_(i:i+h−1) = [w_i, w_(i+1), …, w_(i+h−1)]. (4)
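A minimal sketch of this narrow convolution, assuming toy dimensions (n = 4 words, k = 3, filter length h = 2), a zero bias, and the ReLU activation the text mentions:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv_feature_map(W, F, b):
    """Slide a filter F (shape h×k) over sentence matrix W (shape n×k),
    producing the feature map M of length n-h+1, with
    m_i = ReLU(sum(F * W[i:i+h]) + b)."""
    n, k = W.shape
    h = F.shape[0]
    return np.array([relu(np.sum(F * W[i:i + h]) + b) for i in range(n - h + 1)])

W = np.arange(12, dtype=float).reshape(4, 3)  # n=4 words, k=3 dims
F = np.ones((2, 3))                           # one filter of length h=2
M = conv_feature_map(W, F, 0.0)               # feature map of length 3
```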

Max-Pooling and Dropout Layer
The max-pooling layer down-sampled the features learned in the convolutional layer, further extracting the most salient feature from each of the previous filters by taking the maximum value and discarding the non-maximal values, which reduced the computation. Because the maximum value represents the most distinguishing salient feature of a drug review in a filter, we chose max-pooling rather than average pooling. In this layer, we applied multiple convolutional filters to extract the different features that were fed into the Bi-LSTM layer.
At the same time, in our model, a dropout layer [34] was introduced after the max-pooling layer because of the inevitable over-fitting in the CNN.

Bi-LSTM Layer
The RNN is well suited to processing sequence data: the hidden layer's input combines the output of the input layer with the output of the hidden layer at the preceding moment, giving the neuron a memory ability. However, the vanishing gradient problem produces very small gradients in a simple RNN [35]. The Bi-LSTM, with the capacity to capture long-term dependencies, introduces a gate mechanism to effectively address this problem.
LSTM (long short-term memory) is specially designed to solve the long-term dependence problem of a general RNN by adding memory units to the neurons of the hidden layer. As shown in Figure 2, the LSTM cell consists of three gates, namely, the input gate i, the forget gate f, and the output gate o, which control the memory length. At each time step t, the three gates, input vector, and state update of a memory cell are calculated as follows.

Three gates:

i_t = σ(W_i x_t + V_i h_(t−1) + b_i), (5)
f_t = σ(W_f x_t + V_f h_(t−1) + b_f), (6)
o_t = σ(W_o x_t + V_o h_(t−1) + b_o). (7)

Input vector:

g_t = tanh(W_g x_t + V_g h_(t−1) + b_g). (8)

State update:

c_t = f_t ⊗ c_(t−1) + i_t ⊗ g_t, (9)
h_t = o_t ⊗ tanh(c_t), (10)

where x_t is the input vector; W and V represent the weight matrices of the input x_t and hidden state h_(t−1), respectively; b is the bias for the input cell and the three gates; g_t is the candidate input of the memory cell; i_t, f_t, and o_t denote the input gate, forget gate, and output gate, respectively; c_t is the memory cell; h_t is the hidden state; ⊗ is element-wise multiplication; and σ is the sigmoid activation function.
In the bi-directional LSTM, the model learns from the output weights of the previous moment and the input of each sequence at the current time. Additionally, a forward network and a backward network simultaneously capture the past (forward direction) and future (backward direction) information of sentence sequences, providing the contextual information needed for many sequential tagging tasks during sentence sequence modeling. Therefore, this approach was utilized to capture all the information during sentence sequence modeling [36].

Fully Connected Layer
Fully connected layers, playing the role of classifiers, mapped the distributed feature representation to the sample space, yielding feature vectors that contained the combined information of the characteristics of the input reviews. Finally, these vectors were input to the output layer to complete the classification task.

Softmax Layer
In the softmax layer, we used the softmax activation function [37] to compute the classification from the outputs of the fully connected layer. A vector is output in this layer, calculated by (11), where N is the number of classes, z is the input vector from the previous layer, and w is the parameter vector:

y_j = exp(w_j^T z) / Σ_(n=1..N) exp(w_n^T z). (11)

The final classification labels, namely, positive, neutral, and negative, were output in this layer. The classification result ĉ is calculated by (12):

ĉ = argmax_j y_j. (12)
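A minimal sketch of this softmax classification step, with a toy 3-class parameter matrix standing in for the learned weights (labels 0/1/2 correspond to negative/neutral/positive):

```python
import numpy as np

def softmax_classify(z, Wp):
    """Compute class probabilities y_j = exp(w_j·z) / Σ_n exp(w_n·z)
    and return (probabilities, argmax label)."""
    scores = Wp @ z
    scores = scores - scores.max()  # subtract max for numerical stability
    y = np.exp(scores) / np.exp(scores).sum()
    return y, int(np.argmax(y))

# Toy parameter matrix for N=3 classes over a 2-dimensional input z.
Wp = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.5, 0.5]])
y, label = softmax_classify(np.array([2.0, 0.5]), Wp)
```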

Weakly Supervised Mechanism
The WSM-CNN-LSTM model was trained by a scheme of weakly supervised pre-training followed by supervised fine-tuning: the model was first pre-trained by a large amount of weakly labeled data and then fine-tuned by a small amount of manually labeled data.
First, our model was pre-trained by a considerable amount of weakly labeled data from the drug rating reviews obtained from the AskaPatient forum. Second, to improve the accuracy of the model pre-trained on a large amount of noisy weakly labeled data, we manually labeled a small amount of data that was used to fine-tune the pre-trained model. The parameters of the pre-trained model were used as the initial parameters of the supervised training. The labeled data were used to supervise the training and testing of the model, and finally, the classification model was trained.
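The pre-train-then-fine-tune scheme can be sketched end to end with a stand-in classifier. This is an illustrative toy, not the paper's network: a linear (logistic-regression) model and synthetic data replace the CNN-LSTM and the AskaPatient reviews, but the training order is the same — fit on plentiful noisy weak labels first, then continue training the same parameters on a small clean set:

```python
import numpy as np

def train(X, y, w, lr, epochs):
    """Plain logistic-regression gradient descent on parameters w."""
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
        w = w - lr * X.T @ (p - y) / len(y)  # gradient step
    return w

rng = np.random.default_rng(1)
true_w = np.array([1.0, -2.0, 0.5, 0.0, 1.5])

# Plentiful weakly labeled data: correct labels flipped with 20% noise.
X_weak = rng.standard_normal((500, 5))
y0 = (X_weak @ true_w > 0).astype(float)
y_weak = np.where(rng.random(500) < 0.2, 1.0 - y0, y0)

# Small clean (manually labeled) set.
X_clean = rng.standard_normal((50, 5))
y_clean = (X_clean @ true_w > 0).astype(float)

w = train(X_weak, y_weak, np.zeros(5), lr=0.5, epochs=200)  # weak pre-training
w = train(X_clean, y_clean, w, lr=0.1, epochs=100)          # supervised fine-tuning
acc = float(np.mean(((X_clean @ w) > 0) == (y_clean > 0.5)))
```

The key design point carried over from the WSM is that fine-tuning starts from the pre-trained parameters rather than from scratch, so the clean data correct the noise-induced bias instead of learning the task alone.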

Dataset
In our work, a dataset was collected from the drug ratings and health care opinions forum named AskaPatient, where actual patients who have previously taken a drug share their treatment experience. The drug reviews were gathered from 1 May 2012 to 31 December 2017. The drug reviews from this forum are shown as eight fields, namely, the rating of the drug, the review, the reason for taking the drug, and the side effects experienced with the drug, along with gender, age, duration/dosage, and date added. The general meaning of the ratings is displayed in Table 1. Our target was a multi-classification problem for the sentiment classification of the drug reviews on the AskaPatient forum. We regarded the reviews with ratings of 4 and 5 as positive weakly labeled data and assigned them to class 2; reviews with a rating of 3 were regarded as neutral and assigned to class 1; finally, reviews with ratings of 1 and 2 were regarded as negative and assigned to class 0.
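The rating-to-class mapping described above is a simple rule; as a sketch:

```python
def rating_to_class(rating: int) -> int:
    """Map an AskaPatient 1-5 rating to a weak sentiment label:
    1-2 -> 0 (negative), 3 -> 1 (neutral), 4-5 -> 2 (positive)."""
    if rating <= 2:
        return 0
    if rating == 3:
        return 1
    return 2

labels = [rating_to_class(r) for r in [1, 2, 3, 4, 5]]
```

Labels derived this way are "weak" precisely because a user may, for example, attach a high rating to a review whose text is negative.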
From the AskaPatient.com forum, we captured 63,782 reviews on 2000 publicly available drugs, including prescription medicines currently approved by the FDA along with many over-the-counter medicines. Of these, 61,263 reviews were non-null comments. The labeled data, containing 11,083 drug reviews, took one month for two authors to manually label. The proportions of weakly labeled data and labeled data are shown in Figure 3. We note that the datasets were roughly balanced and that the labeled data were approximately one-fifth of the weakly labeled data.


Experimental Setup
Seventy percent of the weakly labeled data was randomly leveraged to pre-train the deep neural network, and 30% of the data were utilized for testing. Every drug review was trained as an embedding matrix by the publicly available GloVe toolkit with 300 dimensions, using the TensorFlow model of the Python module [38]. The matrix was composed of a sequence of word embeddings. We prepared an embedding matrix and initialized the words that were not found in the embedding index to all-zeros. Then, the pre-trained word embeddings were loaded into the embedding layer. The batch size was 64, the dropout rate was 0.5, and the activation function was softmax. The output of the one-dimensional (1D) CNN with global max-pooling was the input of the Bi-LSTM.
According to the characteristics of the drug reviews, and for implementation convenience, we restricted the number of words in each drug review to 100. For a drug review with k words, if k < 100, we padded it to 100 with zero vectors; when a drug review had more than 100 words, the model truncated the vector, leaving only the first 100 words. In our dataset, no drug review contained more than 100 words.
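The padding/truncation rule can be sketched as follows; the pad symbol is a stand-in for the zero vector used in the embedding matrix:

```python
def pad_or_truncate(tokens, max_len=100, pad="<zero>"):
    """Fix each review to max_len tokens: truncate long reviews and
    right-pad short ones (the pad symbol maps to a zero vector)."""
    return tokens[:max_len] + [pad] * max(0, max_len - len(tokens))

short = pad_or_truncate(["the", "drug", "helped"], max_len=5)
long_review = pad_or_truncate(["w"] * 120)  # truncated to 100 tokens
```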

Comparison Models
In our experiments, we compared the performance of our model, WSM-CNN-LSTM, with SVM [39,40] and with the CNN-rand, LSTM-rand, and CNN-LSTM-rand models and the WSM-CNN and WSM-LSTM models. The compared models were as follows:
• SVM. Support vector machine. We used trigrams and the Liblinear classifier;
• CNN-rand. We trained the CNN on the labeled dataset and randomly initialized the network parameters;
• Weakly supervised mechanism CNN model (WSM-CNN). The weakly labeled data were utilized to train the network model based on the CNN, and the labeled data were used to fine-tune the initialized network parameters;
• LSTM-rand. We trained the LSTM on the labeled dataset and randomly initialized the network parameters;
• Weakly supervised mechanism LSTM model (WSM-LSTM). The weakly labeled data were utilized to train the network model based on the LSTM, and the labeled data were used to fine-tune the initialized network parameters;
• CNN-LSTM-rand. We trained the combined CNN and LSTM on the labeled dataset and randomly initialized the network parameters.

Weakly Supervised Model Performance
Table 2 shows the preliminary experimental results of WSM-CNN-LSTM and the comparison baseline models for the dataset. In addition to the overall accuracy, we employed micro-F1 [40], precision, and recall as evaluation metrics. They are computed as follows:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 × Precision × Recall / (Precision + Recall),

where TP, FP, and FN denote the true-positive, false-positive, and false-negative counts pooled over all classes. Importantly, the experimental results demonstrate that WSM-CNN-LSTM improved on the comparison models with regard to accuracy and F1 during classification. It is likely that the CNN is good at classifying simple sentence structures, and the LSTM layer can capture the long-distance dependencies in the drug reviews. The WSM-CNN-LSTM model, utilizing the WSM and combining the strengths of both the CNN and LSTM, understood the semantics of the sentence as a whole and improved the classification performance of the model in the sentiment analysis of drug reviews. In the comparison experiments, we deliberately used two mechanisms for the same model. The *-rand mechanism trained the network model with randomly initialized network parameters on the labeled datasets, and the WSM-* mechanism was a WSM in which weakly labeled data were used to pre-train the network model and parameters; then, the small amount of labeled data was used to fine-tune the pre-trained model. Clearly, all WSM-* model results are slightly higher than the *-rand model results in Table 2. This increase is likely due to the WSM, which uses pre-training to record prior knowledge of the sentiment distribution, and fine-tuning the parameters of the model reduces the effect of noise data on the model training process.
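As an illustration of the metric definitions above, the following sketch pools true positives, false positives, and false negatives over the three classes to compute the micro-averaged scores (for single-label multi-class data, micro precision, recall, and F1 all coincide with accuracy):

```python
def micro_metrics(y_true, y_pred, classes=(0, 1, 2)):
    """Micro-averaged precision/recall/F1: pool TP, FP, FN over classes."""
    tp = fp = fn = 0
    for c in classes:
        tp += sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp += sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn += sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: one of five predictions is wrong (class 2 predicted as 1).
p, r, f1 = micro_metrics([0, 1, 2, 2, 1], [0, 1, 2, 1, 1])
```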

Macro-F1 Result of Our Model
Macro-F1 is the average of the F1 scores of each class. To verify the performance of our model on each class, we present the per-class F1 in Table 3; the resulting macro-F1 is 86.64. As can be seen from the results in Table 3, the F1 of the negative and positive classes is higher than that of the neutral class, which shows that our model is more effective at capturing negative and positive words in drug reviews. It was important to identify the sensitivity of the weakly supervised model to the data sample, especially the influence of sample size on the model. To investigate this issue, we examined the influence of the labeled training data size on each model. D% of the labeled data, where D ranged between 10 and 90, was chosen for fine-tuning in our experiments. The model learning curves are shown in Figure 4. Our model reached more than 80% accuracy and F1 score from the 30% training set and appeared to be stable from the 70% training set. The experimental results show that our model was not strongly influenced by the size of the manually labeled data. It is therefore likely that a small amount of labeled data, used to fine-tune the WSM-CNN-LSTM model, suffices for the sentiment analysis of drug reviews. Furthermore, this finding reflects the advantage of a small amount of manual labor in our work. Although 90% of the labeled data can achieve a better result, a 70% partition ratio is common and reasonable; in our experiments, we chose 70% of the labeled data as the training set. The cross-validation results further demonstrate that the *-rand models exhibited no substantial improvement in accuracy and precision due to the influence of noise data on the model functions.
The WSM-CNN-LSTM model was relatively effective at avoiding the impact of noise and statistically discriminating long and short sentences to improve the accuracy.

Conclusions and Future Work Discussion
In this work, we proposed a weakly supervised deep learning model named WSM-CNN-LSTM for identifying ADRs, utilizing the drug reviews of customers on health forums through multiple classification.Our model was an effective combination of a CNN and LSTM, along with a WSM that employed both weakly labeled data to pre-train the model and the use of labeled data to fine-tune the initialized network parameters.Experiments on the drug reviews collected from the AskaPatient forum indicated that the effect of our model on ADR identification was significantly superior to the contrast model in accuracy and F1 performance, which reflects the effectiveness of our model for the sentiment classification of drug review data.ADR identification through drug reviews by customers on health forums was remarkably enhanced by our model.We also observed that the WSM only required a small amount of labeled samples to attain optimal performance, which decreased the influence of noise and reduced the manual data-labeling requirements.
Drug review data in social media and health forums offer us valuable resources. In future work, our continuing research will focus on investigating the potential relationships among the drug reviews and exploring the impact of other features of the drug reviews on ADR identification, so that the considerable online review data can better serve the healthy lives of individuals.

Conclusions and Future Work Discussion
In this work, we proposed a weakly supervised deep learning model named WSM-CNN-LSTM for identifying ADRs, utilizing the drug reviews of customers on health forums through multiple classification.Our model was an effective combination of a CNN and LSTM, along with a WSM that employed both weakly labeled data to pre-train the model and the use of labeled data to fine-tune the initialized network parameters.Experiments on the drug reviews collected from the AskaPatient forum indicated that the effect of our model on ADR identification was significantly superior to the contrast model in accuracy and F1 performance, which reflects the effectiveness of our model for the sentiment classification of drug review data.ADR identification through drug reviews by customers on health forums was remarkably enhanced by our model.We also observed that the WSM only required a small amount of labeled samples to attain optimal performance, which decreased the influence of noise and reduced the manual data-labeling requirements.
Drug review data in social media and health forums offer valuable resources. In future work, our research will focus on investigating the potential relationships among drug reviews and exploring the impact of other review features on ADR identification, so that the large volume of online review data can better serve the health of individuals.

Figure 1
Figure 1 shows the architecture of the WSM-CNN-LSTM model. The model contains six types of layers: an input layer, a convolutional layer, a max-pooling and dropout layer, a Bi-LSTM layer, a fully connected layer, and a softmax layer. First, the pre-trained word vectors are input into the convolutional layer, which performs convolutions via linear filters of different lengths; the convolutions extract features from the word vectors and generate feature maps. Second, the max-pooling layer extracts salient features from these feature maps and feeds them into the forward and backward LSTM networks, where they are used to produce the regression results. Finally, the fully connected and softmax layers take the regression results from the LSTM and output the final classification.
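As a rough illustration of the convolution and max-over-time pooling stage described above, the following pure-NumPy sketch shows how multi-scale filters turn a word-vector matrix into one salient feature per filter. The filter widths, embedding size, and function names here are illustrative choices of ours, not taken from the paper, and the full model would feed such features into a Bi-LSTM rather than stop here.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Valid 1-D convolution of one filter over the token axis.
    x: (seq_len, emb_dim) word-vector matrix; kernel: (width, emb_dim)."""
    width = kernel.shape[0]
    return np.array([np.sum(x[i:i + width] * kernel)
                     for i in range(x.shape[0] - width + 1)])

def extract_salient_features(x, kernels):
    """Max-over-time pooling: keep the strongest activation per filter,
    mirroring the salient features passed on to the Bi-LSTM."""
    return np.array([conv1d_valid(x, k).max() for k in kernels])

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 8))                            # 20 tokens, 8-dim embeddings
kernels = [rng.normal(size=(w, 8)) for w in (3, 4, 5)]  # filters at different scales
features = extract_salient_features(x, kernels)         # shape (3,): one per filter
```

Each filter width captures n-gram-like patterns at a different scale, which is the motivation for using several filter lengths in parallel.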
Our target was a multi-class problem for the sentiment classification of the drug reviews on the AskaPatient forum. We regarded reviews with ratings of 4 and 5 as positive weakly labeled data and assigned them to class 2, reviews with a rating of 3 as neutral (class 1), and reviews with ratings of 1 and 2 as negative (class 0). From the AskaPatient.com forum, we captured 63,782 reviews on 2000 publicly available drugs, comprising prescription medicines currently approved by the FDA along with many over-the-counter medicines. Of these, 61,263 reviews were non-null comments. The labeled data, containing 11,083 drug reviews, took two of the authors one month to label manually. The proportions of weakly labeled and labeled data are shown in Figure 3. We note that the datasets were roughly balanced and that the labeled data amounted to approximately one-fifth of the weakly labeled data.
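The rating-to-class mapping above can be expressed as a small helper (a sketch; the function name is ours, not from the paper):

```python
def rating_to_class(rating):
    """Map a 1-5 AskaPatient rating to a sentiment class:
    4-5 -> 2 (positive), 3 -> 1 (neutral), 1-2 -> 0 (negative)."""
    if rating >= 4:
        return 2
    return 1 if rating == 3 else 0
```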

Figure 3 .
Figure 3. Sizes of the weakly labeled and manually labeled datasets.


suitable for the sentiment analysis of drug reviews. Furthermore, this finding clearly reflects the advantage of the small amount of manual labor required in our work. Although 90% labeled data achieved a better result, a 70% partition ratio is common and reasonable, so we chose 70% of the labeled data as the training set in our experiments.
(a) Accuracy curves; (b) F1 curves.
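The 70%/30% labeled-data partition described above can be sketched as a per-class split so that the class proportions are preserved. This is a pure-Python illustration under our own assumptions (helper name, seed handling); the paper does not specify its exact splitting code.

```python
import random

def stratified_split(labels, train_frac=0.7, seed=0):
    """Split sample indices into train/test at train_frac per class,
    preserving the class balance of the labeled data."""
    rng = random.Random(seed)
    train, test = [], []
    for c in sorted(set(labels)):
        idxs = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idxs)
        cut = int(len(idxs) * train_frac)   # 70% of this class to training
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return train, test
```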


Author Contributions: Conceptualization, G.G.; methodology, Z.M.; data collection, Z.M.; experiments, Z.M.; writing-original draft preparation, Z.M.; writing-review and editing, G.G.; supervision, G.G.

Funding: This research was funded by the National Natural Science Foundation of China, grant numbers 61731015, 61673319 and 61802311, and the National Key Research and Development Program of China, grant

Figure 4.
Figure 4. Impact of labeled training data size on each model.

Stratified 10 × 10-fold cross validation results. Experiments with stratified 10 × 10-fold cross validation were conducted to further verify the statistical significance of the improvements. We combined the training and test data and then distributed them randomly into 10 folds, ensuring that all folds had approximately the same proportions of positive, negative, and neutral drug reviews. We repeatedly used randomly generated folds for training and validation, each time training on nine folds and testing on the remaining fold. The average results over 100 runs are shown in Table 3. The cross validation results further demonstrate that the *-rand models exhibited no substantial improvement in accuracy and precision due to the influence of noise data on the model functions. The WSM-CNN-LSTM model was relatively effective at avoiding the impact of noise and statistically discriminating long and short sentences to improve the accuracy.
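The stratification step of this procedure can be sketched in pure Python: indices are grouped by class, shuffled, and dealt round-robin into folds so that every fold keeps roughly the same class proportions. The helper name and seed handling are our illustrative assumptions, not the paper's code.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=10, seed=0):
    """Assign sample indices to n_splits folds so that each class's
    proportion is roughly equal across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_splits)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % n_splits].append(i)   # round-robin within each class
    return folds

# Repeating this with 10 different seeds, each time training on 9 folds
# and testing on the held-out fold, gives the 10 x 10 = 100 runs averaged.
```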

Table 1 .
In our work, the dataset was collected from AskaPatient, a drug ratings and healthcare opinions forum where patients who have previously taken a drug share their treatment experiences. The drug reviews were gathered from 1 May 2012 to 31 December 2017. Each patient review from this forum comprises eight fields, namely the review text, the rating of the drug, the reason for taking the drug, and the side effects experienced with the drug, along with gender, age, duration/dosage, and date added. The general meaning of the ratings is displayed in Table 1 (Review Rating for AskaPatient.com).

Table 2 .
ADR identification performance percentages when testing different comparison models.

Table 3 .
ADR identification performance percentages when testing different comparison models. Stratified 10 × 10-fold cross validation results. Note: statistically significant improvements over comparison models are bolded and marked with an asterisk (*).

Table 4 .
F1 percentages of each class individually.