# A Novel Machine Learning Approach for Sentiment Analysis on Twitter Incorporating the Universal Language Model Fine-Tuning and SVM

## Abstract

## 1. Introduction

## 2. Related Work

#### 2.1. Transfer Learning with ULMFiT

#### 2.2. Language Model

#### 2.3. AWD-LSTM

#### 2.4. Support Vector Machine (SVM)

#### 2.5. Long Short-Term Memory (LSTM)

## 3. Methodology

#### 3.1. ULMFiT-SVM Model

- General-domain LM pre-training;
- Target task LM fine-tuning;
- Target task classifier.

#### 3.2. Pretrained Phase

#### 3.3. Fine-Tuning the Language Model

#### 3.3.1. Slanted Triangular Learning Rates (STLR)

- $T$ is the total number of training iterations (the number of epochs times the number of updates per epoch).
- $cut\_frac$ is the fraction of iterations during which the LR increases.
- $cut$ is the iteration at which the schedule switches from increasing to decreasing the LR.
- For $t < cut$, $p$ is the number of iterations for which the LR has increased, divided by the total number of increasing iterations.
- For $t \ge cut$, $p$ is one minus the number of iterations for which the LR has decreased, divided by the total number of decreasing iterations, so it falls from 1 back to 0.
- $ratio$ specifies how much smaller the lowest LR is than the maximum LR, $\eta_{max}$.
- $\eta_t$ is the learning rate at iteration $t$.
- The values used are $cut\_frac = 0.1$, $ratio = 32$, and $\eta_{max} = 0.01$.
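
The schedule described by these symbols can be sketched in a few lines of Python. This is a minimal sketch, assuming the slanted triangular formula from the original ULMFiT description; the function name `stlr` and the 0-based iteration index are choices made here for illustration and are not taken from the paper's code.

```python
import math


def stlr(t, T, cut_frac=0.1, ratio=32, eta_max=0.01):
    """Return the learning rate at iteration t out of T total iterations.

    Assumption: t is 0-based and T is large enough that cut >= 1.
    """
    cut = math.floor(T * cut_frac)  # iteration where the LR stops rising
    if cut == 0:
        raise ValueError("T * cut_frac must be at least 1")
    if t < cut:
        # Increasing phase: p grows linearly from 0 to 1.
        p = t / cut
    else:
        # Decreasing phase: p falls linearly from 1 back to 0.
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    # Interpolate between eta_max / ratio (p = 0) and eta_max (p = 1).
    return eta_max * (1 + p * (ratio - 1)) / ratio


if __name__ == "__main__":
    T = 1000
    for t in (0, 50, 100, 550, 1000):
        print(t, stlr(t, T))
```

With the default values, the rate starts at $\eta_{max}/ratio$, peaks at $\eta_{max}$ after the first 10% of iterations, and then decays linearly over the remaining 90%.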

#### 3.3.2. Discriminative Fine-Tuning (DFT)

#### 3.4. Model Training

#### 3.4.1. Concat Pooling

#### 3.4.2. Gradual Unfreezing

- The last LSTM layer is unfrozen first, and the model is fine-tuned for one epoch.
- Subsequently, the next lower frozen layer is unfrozen.
- This procedure is repeated until all layers are unfrozen and fine-tuned to convergence.
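
The order in which layers become trainable can be illustrated with a small framework-independent sketch. The layer names and the helper `gradual_unfreezing_schedule` are hypothetical and only show the schedule; in a real model, each step would be followed by one epoch of fine-tuning on the currently unfrozen layers.

```python
def gradual_unfreezing_schedule(layer_names):
    """Yield the trainable layers for each successive fine-tuning stage.

    layer_names is ordered from the lowest layer to the last layer.
    Stage 1 unfreezes only the last layer; each later stage adds the
    next lower layer until every layer is trainable.
    """
    n = len(layer_names)
    for n_unfrozen in range(1, n + 1):
        yield layer_names[n - n_unfrozen:]


if __name__ == "__main__":
    # Hypothetical layer names for a three-layer LSTM encoder.
    layers = ["embedding", "lstm_1", "lstm_2", "lstm_3"]
    for stage, trainable in enumerate(gradual_unfreezing_schedule(layers), start=1):
        print(f"stage {stage}: train {trainable}")
```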

#### 3.4.3. BPTT for Text Classification (BPT3C)

#### 3.4.4. Bidirectional Language Model

#### 3.5. Dataset Overview

#### 3.6. Dataset Preprocessing

#### 3.7. Word Embedding

- Extra spaces, tab characters, newline characters, and other irregular characters are removed or replaced with regular characters.
- To tokenize the data, we use the spaCy library. Because spaCy does not provide a parallel (multicore) tokenizer, the fast.ai package is used to supply one. This parallel version of the spaCy tokenizer uses all available CPU cores and is significantly faster than the serial version.
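
The cleaning step can be sketched with the standard library alone. This is an illustrative sketch, not the paper's pipeline: `normalize_whitespace` and `simple_tokenize` are hypothetical helpers, and the regular-expression tokenizer is only a stand-in for the spaCy tokenizer named above.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens.

    Assumption: a regex split is used here instead of spaCy so that the
    example has no third-party dependencies.
    """
    cleaned = normalize_whitespace(text).lower()
    return re.findall(r"\w+|[^\w\s]", cleaned)


if __name__ == "__main__":
    tweet = "  Great\tflight!\n\nThanks  for the help. "
    print(normalize_whitespace(tweet))
    print(simple_tokenize(tweet))
```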

- Making a list of all the words that appear, in a fixed order.
- Replacing each word with its index in that list.
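
These two steps amount to building a vocabulary and mapping tokens to integer indices. The sketch below assumes a frequency-ordered vocabulary and an `<unk>` entry for unseen words; both are common conventions chosen here for illustration, and the function names are hypothetical.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_freq=1, unk="<unk>"):
    """Return (itos, stoi): the ordered word list and its inverse mapping.

    Assumption: index 0 is reserved for unknown words, and the remaining
    words are ordered by descending frequency.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    itos = [unk] + [tok for tok, c in counts.most_common() if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def numericalize(doc, stoi, unk="<unk>"):
    """Replace each token with its index; unseen tokens map to unk."""
    return [stoi.get(tok, stoi[unk]) for tok in doc]


if __name__ == "__main__":
    docs = [["the", "flight", "was", "late"], ["the", "crew", "was", "great"]]
    itos, stoi = build_vocab(docs)
    print(itos)
    print(numericalize(["the", "crew", "was", "rude"], stoi))
```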

#### 3.8. Evaluation Metrics

## 4. Performance Evaluation

#### 4.1. Evaluation Based on Testing Data

#### 4.2. Effect of Hyper-Parameters and Hidden Units Number Setting in Our Model Efficiency

## 5. Discussion and Additional Comparisons

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
ULMFiT | Universal Language Model Fine-tuning |
SVM | Support Vector Machine |
SLT | Statistical Learning Theory |
NLP | Natural Language Processing |
AWD | ASGD Weight-dropped |
LSTM | Long Short-term Memory Networks |
RNN | Recurrent Neural Network |
RBF | Radial Basis Function |
DFT | Discriminative Fine-tuning |

## References

Sentiment | Split | Twitter US Airlines [30] | IMDB [31] | GOP Debate [32] |
---|---|---|---|---|
Positive | Train | 1773 | 18,750 | 1665 |
Positive | Test | 590 | 6250 | 555 |
Negative | Train | 6884 | 18,750 | 6357 |
Negative | Test | 2294 | 6250 | 2104 |
Neutral | Train | 2325 | – | 2393 |
Neutral | Test | 774 | – | 797 |
Total | – | 14,640 | 50,000 | 13,871 |

Method | Accuracy |
---|---|
Support Vector Machine (SVM) | 78.5% |
Bag-of-words SVM | 78.5% |
Deep Learning Model with Dropouts in Keras | 77.9% |
SIS-ULMFiT [7] | 84.1% |
ULMFiT-SVM (Ours) | 99.78% |

Hyper-Parameter Name | Meaning | Best Value |
---|---|---|
em-sz | Embedding vector size | 0.77 |
nh | Number of hidden activations | 0.000005 |
nl | Number of layers | 3 |
bs | Batch size | 32 |
$\beta_1$ | Optimal bias | 0.8 |
$\beta_2$ | Optimal bias | 0.99 |
C-GAMMA | SVM parameters | 5.6569–1.0667 |

**Table 4.** Performance comparisons for ULMFiT-SVM on the Twitter US Airlines, IMDB, and GOP Debate datasets with several related approaches.

Dataset | Used Model | Accuracy |
---|---|---|
Twitter US Airlines [30] | SVM only [33] | 78% |
Twitter US Airlines [30] | RNN/LSTM (ULMFiT) [34] | 77.8% |
Twitter US Airlines [30] | LSTM, CNN [35] | 79.64% |
Twitter US Airlines [30] | MultinomialNB [36] | $\pm 80$% |
Twitter US Airlines [30] | ABCDM [37] | $\pm 92.75$% |
Twitter US Airlines [30] | ULMFiT-SVM (Ours) | 99.78% |
IMDB [31] | ToWE-SG [38] | 90.8% |
IMDB [31] | ULMFiT [7] | 95.4% |
IMDB [31] | BERT large fine-tune UDA [39] | 95.8% |
IMDB [31] | RCNN [40] | 84.70% |
IMDB [31] | ULMFiT-SVM (Ours) | 99.71% |
GOP Debate [32] | SIS-ULMFiT [41] | 55.034% |
GOP Debate [32] | ULMFiT-SVM (Ours) | 95.78% |

**Table 6.** Training time, testing time, and number of support vectors (nSV) of ULMFiT-SVM in comparison with SVM for binary classification.

Technique | Training Time (s) | Testing Time (s) | nSV |
---|---|---|---|
SVM | 901.095 | 18.533 | 3649 |
ULMFiT-SVM | 682.10 | 4.321 | 3448 |

