Automatic Classification of National Health Service Feedback

Abstract: Text datasets come in an abundance of shapes, sizes and styles. However, determining what factors limit classification accuracy remains a difficult task which is still the subject of intensive research. Using a challenging UK National Health Service (NHS) dataset, which contains many characteristics known to increase the complexity of classification, we propose an innovative classification pipeline. This pipeline switches between different text pre-processing, scoring and classification techniques during execution. Using this flexible pipeline, a high level of accuracy has been achieved in the classification of a range of datasets, attaining a micro-averaged F1 score of 93.30% on the Reuters-21578 “ApteMod” corpus. An evaluation of this flexible pipeline was carried out using a variety of complex datasets compared against an unsupervised clustering approach. The paper describes how classification accuracy is impacted by an unbalanced category distribution, the rare use of generic terms and the subjective nature of manual human classification.


Introduction
The quantity of digital documents generally available is ever-growing. Classification of these documents is widely accepted as being essential, as this reduces the time spent on analysis. However, manual classification is both time consuming and prone to human error. Therefore, the demand for techniques to automatically classify and categorize documents continues to increase. Automatic classification supports the process of enabling researchers to carry out deeper analysis on text corpora. Practical applications of classification include library cataloguing and indexing [1], email spam detection and filtration [2] and sentiment analysis [3].
Since text documents and datasets exhibit such a wide variety of differences and combinations of features, it is impossible to adopt a standardized classification approach. One such feature is the length of the text available as input to the classification process. For example, a newspaper article is likely to contain significantly more text than a tweet, thus providing a larger vocabulary which will aid classification. Another important feature of text is its intended purpose. The intended use of text can significantly affect the author's style and choice of vocabulary. Let us suppose you are required to determine Amazon product categories based on the descriptions of the products. Clearly, certain keywords are highly likely to appear on similar products, and those keywords are likely to have the same intended meaning wherever they are used. In contrast, consider the scenario where you are required to determine whether a tweet exhibited a positive or negative sentiment. In this case, the same keyword used in multiple tweets may have completely different meanings based on the context and tone of the author. The intended sentiment could vary considerably.
Of course, objectivity and subjectivity can also affect the accuracy of the classification process. If the same text, based on Amazon products and tweets, was used for manual classification, then there is likely to be more consensus on the category of a product description than on the sentiment of a tweet. This is due to the inherent objectivity of the categories. Aside from these examples, there are numerous other features of a dataset which can limit classification accuracy [4].
This paper describes the analysis of a complex UK National Health Service (NHS) patient feedback dataset which contains many of the elements known to restrict the accuracy of automatic classification. Throughout experimentation, several pre-processing and machine learning techniques are used to investigate the complexities in the NHS dataset and their effect on classification accuracy. Subsequently, an unsupervised clustering approach is applied to the NHS dataset to explore and identify underlying natural classifications in the data.
Section 2 describes existing work on automatic text classification and provides a theoretical background of the approaches used. Section 3 establishes our research problem statement and introduces the datasets used, incorporating both the NHS dataset and the benchmarking datasets used for evaluation. Section 4 details the pre-processing and classification pipeline, followed by the results of our experiments in Section 5. Finally, the findings and conclusions are discussed in Section 6.

Related Work and Theoretical Background
The field of automatic text classification incorporates many differing approaches which vary depending on the type of document and how it needs to be categorized.
The early work in this field focused on the manual extraction of features from text, which were then applied to a classifier. This process of feature extraction has also been refined through feature weighting [5] and feature reduction [6] to extract a more detailed representation of input text. The structure of these features when used for classification can also take many forms, with the most common approaches being the bag-of-words (BoW) model [7] or word-vector representations [8].
A range of classification models have been used to produce high-accuracy text classification, with the most successful approaches being support vector machines (SVM) [9], naïve Bayes classifiers [10] and, more recently, deep learning neural network architectures [11,12].
Although there is a broad variation in the specific processes used in automatic text classification, many of these processes can be summarized into four stages shown in Figure 1.
The first stage of the pipeline, text pre-processing, primarily focuses on techniques to extract the most valuable information from raw text data [13]. Typically, this involves reducing the amount of superfluous text, minimizing duplication and tagging words based on their type or meaning. This is achieved by techniques such as:

•	Tokenizing. This technique splits text into sentences and words to identify parts of speech (POS) such as nouns, verbs and adjectives. This creates options for selective text removal and further processing.

•	Stop word removal. This technique removes commonly occurring words which are unlikely to give extra value or meaning to the text. Examples include the words "the", "and", "for" and "of". There are many different open-source stop word lists [14], which have been used in multiple applications with differing levels of success.

•	Stemming or lemmatization. In this technique, words with differing suffixes are reduced to a common stem. For example, the words "thanking", "thankful" and "thanks" would all be stemmed to the word "thank". Some of the most popular stemming algorithms, such as the Porter Stemmer algorithm [15], use a truncation approach, which although fast can often result in mistakes as it considers only the syntax of the word whilst ignoring the semantics. A slightly more robust, but slower, approach is lemmatization, which uses POS to infer context and thus reduces words to a more meaningful root. For example, the words "am", "are" and "is" would all be lemmatized to the root verb "be".

•	Further cleaning can also occur depending on the raw text data, such as removing URLs, specific unique characters or words identified by POS tagging.
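The pre-processing steps above can be sketched in a few lines. This is a minimal illustration only: the tiny stop word list and crude suffix rules are placeholders for the full resources (e.g. NLTK's stop word corpus and Porter Stemmer) a real pipeline would use.

```python
import re

# Placeholder resources, not the full NLTK lists a real pipeline uses.
STOP_WORDS = {"the", "and", "for", "of", "is", "a", "to"}
SUFFIXES = ("ing", "ful", "s")  # crude Porter-style truncation rules

def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Naive truncation stemmer: strip the first matching suffix."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in remove_stop_words(tokenize(text))]

print(preprocess("Thanking the team for the thankful messages"))
# ['thank', 'team', 'thank', 'message']
```

Note how "thanking" and "thankful" collapse to the same stem "thank", as in the example above, while the stop words "the" and "for" are discarded.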
The second stage of the pipeline, Word Scoring, involves transforming the text into a quantitative form. The aim is to increase the weighting of words or phrases which are deemed more important to the meaning of a document. Different scoring measures can be applied. These include:

•	Term Frequency Inverse Document Frequency (TF-IDF), a measure which scores a word within a document based on the inverse proportion in which it appears in the corpus [16]. Therefore, a word will be assigned a higher score if it is common in the scored document, but rare in all the other documents in the same corpus. The advantage of this measure is that it is quick to calculate. The disadvantage is that synonyms, plurals and misspelled words are all treated as completely different words.

•	TextRank is a graph-based text ranking metric, derived from Google's PageRank algorithm [17]. When used to identify keywords, each word in a document is deemed to be a vertex in an undirected graph, where an edge exists for each occurrence of a pair of words within a given sentence. Subsequently, each edge in this graph is deemed to be a "vote" for the vertex linked to it. Vertices with higher numbers of votes are deemed to be of higher importance, and the votes which they cast are weighted higher. By iterating through this process, the value for each vertex will converge to a score representing its importance. Note that, in contrast to TF-IDF, which scores relative to a corpus, TextRank only evaluates the importance of a word within the given document.

•	Rapid Automatic Keyword Extraction (RAKE) is an algorithm used to identify keywords and multi-word key-phrases [18]. RAKE was originally designed to work on individual documents, focusing on the observation that most key-phrases contain multiple words, but very few stop words. Through the combination of a phrase delimiter, word delimiter and stop word list, RAKE identifies the most important keywords/key-phrases in a document and weights them accordingly.
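As an illustration of this scoring stage, the following sketch computes one common formulation of TF-IDF (term frequency multiplied by the logarithm of the inverse document frequency); the exact weighting used in [16] and in practical libraries may differ slightly.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: score} dict per tokenized document.

    A word scores highly when it is frequent in its own document
    but appears in few other documents of the corpus.
    """
    n_docs = len(corpus)
    doc_freq = Counter()            # number of documents containing each word
    for doc in corpus:
        doc_freq.update(set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in tf.items()
        })
    return scores

docs = [["staff", "kind", "thank"], ["staff", "quick"], ["kind", "ward"]]
scores = tf_idf(docs)
# "staff" appears in 2 of the 3 documents, so in the first document it
# scores lower than "thank", which is unique to that document.
```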
The third stage of the pipeline is Feature Generation. It is essential to produce an input which can be used for the machine learning classifier. In general, these inputs need to be fixed-length vectors containing normalized real numbers. The common approach is the BoW, which creates an n-length vector representing every unique word in a corpus. This vector can then be used as a template to generate a feature mask for each document. This resultant feature mask would also be an n-length vector. To produce the feature mask, each word in the document would be identified within the BoW vector. Subsequently, its corresponding index in the feature mask vector would be set, whilst all other positions in the vector would be reset to 0. For example, suppose there is a corpus of solely the following two sentences: "This is good" and "This is bad". Its corresponding BoW vocabulary would consist of four words {This, is, good, bad}. The first sentence would be represented by the vector {1, 1, 1, 0}, whilst the second sentence would be represented by the vector {1, 1, 0, 1}. In this example, the value used to set the feature vector is simply the binary representation (0 or 1) of whether the word exists in the given document. Alternatively, the values used to set the vector could be any scoring metric, such as those discussed above. The BoW model is limited by the fact that it does not represent the ordering of words in their original document. BoW can also be memory intensive on large corpora, since the feature mask of each document has to represent the vocabulary of the entire corpus. Therefore, for a vocabulary of size n and a corpus of m documents, a matrix of size n × m is required to represent all feature masks. Further, as the size of n increases, each feature mask will also contain more 0 values, making the data increasingly sparse [19]. There are alternatives to the BoW model which attempt to resolve the issue of word ordering. The most common is the n-gram representation, where n represents the number of words or characters in a given sequence [20]. The BoW model could be considered an n-gram representation where n is set to 1, also known as a unigram. For example, given the sentence "This is an example", an n-gram of n = 2 (bigram) would be the set of ordered words {"This is", "is an", "an example"}. This enables sequences of words to be represented as features. In some text classification tasks, bigrams have proved to be more efficient than the BoW model [21]. However, it also follows that as n increases, n-gram approaches are increasingly affected by the size of the corpus and the corresponding memory required for processing [22].
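The BoW and n-gram representations described above can be sketched in a few lines, reusing the two-sentence example corpus from this section.

```python
def build_vocab(corpus, n=1):
    """Collect every unique n-gram across the tokenized corpus, in order."""
    vocab = []
    for doc in corpus:
        for i in range(len(doc) - n + 1):
            gram = tuple(doc[i : i + n])
            if gram not in vocab:
                vocab.append(gram)
    return vocab

def feature_mask(doc, vocab, n=1):
    """Binary vector: 1 if the n-gram occurs in the document, else 0."""
    grams = {tuple(doc[i : i + n]) for i in range(len(doc) - n + 1)}
    return [1 if gram in grams else 0 for gram in vocab]

corpus = [["This", "is", "good"], ["This", "is", "bad"]]
vocab = build_vocab(corpus)            # [('This',), ('is',), ('good',), ('bad',)]
print(feature_mask(corpus[0], vocab))  # [1, 1, 1, 0]
print(feature_mask(corpus[1], vocab))  # [1, 1, 0, 1]
bigrams = build_vocab(corpus, n=2)     # [('This', 'is'), ('is', 'good'), ('is', 'bad')]
```

The two printed masks match the {1, 1, 1, 0} and {1, 1, 0, 1} vectors in the example above; setting n = 2 yields the bigram vocabulary, which grows faster than the unigram one as the corpus expands.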
Word embedding is an alternative approach which combines the word scoring (second) and feature generation (third) stages of the generalized pipeline, by representing words directly in vector space [8]. A popular model for generating these vectors is word2vec [23], which consists of two similar techniques to perform these transformations: (i) continuous BoW and (ii) continuous skip-gram. Both these processes manage sequences of words and support encoding words of any length to a fixed-length vector, whilst maintaining some of the original similarities between words [24]. Similar techniques can be used on word vectors to produce sentence vectors, and then on sentence vectors to produce document vectors. These vectors result in high-level representations of the document, with a less sparse representation and a smaller memory footprint than n-gram and BoW models. They often outperform the n-gram and BoW models in classification tasks [25], but they have a much greater computational complexity which increases processing time.
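To illustrate how the continuous skip-gram variant frames its training data, the sketch below builds the (center word, context word) pairs from which embeddings are learned. A real implementation (e.g. word2vec as in [23]) then trains a shallow network on these pairs to produce the vectors; that training step is omitted here.

```python
def skip_gram_pairs(tokens, window=2):
    """Yield (center, context) pairs used to train skip-gram embeddings.

    Each word is paired with every neighbor within `window` positions,
    so words sharing contexts end up with similar vectors after training.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skip_gram_pairs(["the", "staff", "were", "kind"], window=1)
# [('the', 'staff'), ('staff', 'the'), ('staff', 'were'),
#  ('were', 'staff'), ('were', 'kind'), ('kind', 'were')]
```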
The fourth and final stage of the pipeline, Classification, uses the feature masks to train a classification model. There are many different viable classifiers available; some of the most widely used approaches in text classification are k-nearest-neighbor (KNN) [26], naïve Bayes (NB) [27], neural networks (NN) [28] and support vector machines (SVM) [9]. Each of these classifiers has numerous variations, each with their own advantages and disadvantages. The specific variations used in our proposed pipeline are discussed in further detail in Section 4.

Research Problem Statement
We plan to investigate how different dataset characteristics can affect the accuracy of automatic text classification. We propose to develop a novel, modular text classification pipeline, so that different combinations of text pre-processing, word scoring and classification techniques can be compared and contrasted. Our research will primarily focus on the complex NHS patient feedback dataset, but also consider other benchmark datasets which share some of the same dataset characteristics. Through experimentation on these datasets with our novel pipeline, we aim to answer the following questions: (R1) Can our automatic text classification pipeline reduce the workload of NHS staff by providing an acceptable accuracy compared to manual classification? (R2) Can the same pipeline improve the automatic text classification accuracy on other benchmark datasets?

Input Data
This paper focuses on a challenging text classification dataset provided by University Hospitals Plymouth NHS Trust. This dataset is known as the "Learning from Excellence" (LFE) dataset. It is composed solely of positive written feedback given to staff by patients and colleagues. These data were organized into 24 themes: categories where the same sentiment is expressed using slightly different terminology. Subsequently, each item of text (phrase, sentence) was manually classified into one or more themes. Each item may be associated with multiple themes, which are ordered; the first theme can be considered the primary theme. As this paper focuses on single-label classification, only the primary themes will be used. Note that the full list of the themes and their descriptions is available in Appendix A, Table A1. The LFE dataset has several characteristics which intrinsically make its automatic classification a difficult task:
1.	The dataset consists of 2307 items. Due to the text or theme being omitted, only 2224 items were deemed viable for classification. This is relatively small compared to most datasets used in text classification.

2.	The length of each text item is short; the average item contained 49.7 words. The shortest text item is 2 words long and the longest text item is 270 words long.

3.	The number of themes is large with respect to the size of the dataset. Even if the themes were evenly distributed, this would result in an average of fewer than 93 text items in each category.

4.	The distribution of the themes is not balanced. For example, the largest theme "Supportive" is the primary theme for 439 items (19.74%). The smallest theme "Safe Care" is the primary theme for solely 1 item (0.04%). The number of items per category has a standard deviation of 111.23 items. The distribution for the remaining theme categories is also uneven, see Figure 2.

5.	Since all the text is positive feedback, many of the text items share a similar vocabulary and tone regardless of the theme category to which they belong. For example, the phrase "Thank you" appears in 807 items (36.29%). However, only 61 items (2.74%) belong to the primary theme of "Thank You".
6.	The themes are of a subjective nature, dependent on individual interpretation, so they could be viewed in different ways. For example, the theme "Teamwork" is not objectively independent of the theme "Leadership". Thus, there may be some abstract overlap between these themes. Furthermore, there is no definitive measure to determine which theme is more important than another for a given text item, making the choice of the primary theme equally subjective.
Given the classification challenges posed by the LFE dataset, it was important to benchmark results. Thus, all experiments are compared to both well-known text classification datasets and other datasets which share one or more of the characteristics of the LFE dataset.
The first benchmark dataset was the "ApteMod" split of the Reuters-21578 dataset (Reuters). This consists of short articles from the Reuters financial newswire service published in 1987. This split solely contains documents which have been manually classified as belonging to at least one topic, making this dataset ideal for text classification. This dataset is already sub-divided into a training set and a testing set. Since k-fold cross validation was used, the datasets were combined. Finally, since multiple themes were not assigned with any order of precedence, items which had been assigned to more than one topic were removed. Although this dataset does not share many of the classification challenges of the LFE dataset, it is widely used in text classification [29,30]. Thus, it provided indirect comparisons with other work in this field.
Three other datasets were chosen since each shares one of the characteristics of the LFE dataset.

•	The "Amazon Hierarchical Reviews" dataset is a sample of reviews from different products on Amazon, along with the corresponding product categories. Amazon uses a hierarchical product category model, so that items can be categorized at different levels of granularity. Each item within this dataset is categorized in three levels. For example, at level 1 a product could be in the "toys/games" category. At level 2, it could be in the more specific "games" category. At level 3, it could be in the more specific "jigsaw puzzles" category. This dataset was selected as it provides a direct comparison of classification accuracy, when considering the relative dataset volume compared to the number of categories.

•	The "Twitter COVID Sentiment" dataset is a curation of tweets from March and April 2020 which mentioned the words "coronavirus" or "COVID". This dataset was manually classified within one of the following five sentiments: extremely negative, negative, neutral, positive or extremely positive. The source dataset had been split into a training set and a testing set. As with the Reuters dataset, these two subsets were combined.

•	The "Twitter Tweet Genre" dataset is a small selection of tweets which have been manually classified into one of the following four high-level genres: sports, entertainment, medical and politics.
Each of these datasets shares some of the complex characteristics of the LFE dataset described at the start of this section. Table 1 presents and compares these similarities. The full specification of all the datasets is available in Table 2.

Amazon Hierarchical Reviews
Since all the texts are reviews, there are common words in the vocabulary which have no relation to determining the product category. For example, "great" appears in 9762 reviews (24.41%). However, "great" bears no relation to the product category.

Twitter COVID Sentiment
The average number of words in a tweet is 27.8, with the shortest containing 1 word and the longest containing 58 words.
Sentiment analysis in general has a subjective nature to the classifications given [31].This dataset also has some specific cases where two very similar tweets have been given opposing sentiments, an example can be found in Appendix C.
Twitter Tweet Genre
This dataset consists of only 1161 documents.
The average number of words in a tweet is 16, with the shortest containing 1 word and the longest containing 27 words.

Currently, the LFE dataset is manually classified by hospital staff, who have to read each text item and assign it to a theme. Therefore, we are the first to experiment with applying automatic text classification to this dataset. The Amazon Hierarchical Reviews, Twitter COVID Sentiment and Twitter Tweet Genre datasets were primarily selected for their similar characteristics to the LFE dataset. However, another advantage they provided was that they contained extremely current data, having all been published in 2020 (April, September and January, respectively). Although these datasets were useful for our investigation into how dataset characteristics affect classification accuracy, it was difficult to draw direct comparisons with related work in this field due to the dataset originality. The Reuters dataset was selected because of its wide use in this field as a benchmark, allowing direct comparisons of our novel pipeline results to other well-documented work.
Some of the seminal work in automatic text classification on the Reuters dataset was by Joachims [32]. Through the novel use of support vector machines, a micro-averaged precision-recall breakeven score of 86.4 was achieved across the 90 categories which contained at least one training and one testing example. Since then, researchers have used many different configurations of the Reuters dataset for their analysis. Some have used the exact same subset but applied different feature selection methods [33], while other work has focused on only the top 10 largest categories [34,35]. Unfortunately, the wide range of feature selection and category variation limits reliable comparison. However, by selecting related work with either (i) similar pre-processing and feature selection methods, or (ii) similar category variation, we aim to ensure our proposed pipeline is performing with comparable levels of accuracy.

Methodology
For this research, a software package was developed using the Python programming language (Version 3.7), making use of the NumPy (Version 1.19.5) [36] and Pandas (Version 1.2.3) [37] libraries for efficient data structures and file input and output. The core concept was to develop an intuitive data processing and classification pipeline based on flexibility, thus enabling the user to easily select different pre-processing and classification techniques each time the pipeline is executed. As discussed in Section 2, classification often uses a generalized flow of data through a document classification pipeline. The approach presented in this paper follows this model. Figure 3 shows an overview of the pipeline developed.
Since the LFE dataset contained a range of proper nouns which provided no benefit to the classification task, they were removed to optimize the time required for each experiment. The Stanford named entity recognition system (Stanford NER) [38] was used to tag any names, locations and organizations in the raw text. Subsequently, these were removed from the dataset. In total, 3126 proper nouns were removed. A manual scan was performed to confirm that most cases were covered. Some notable exceptions were the name "June" (which was most likely mistaken for the month) and the word "trust" when used in the phrase "NHS trust". Neither of these was successfully tagged by Stanford NER. The dataset used in this work is the final version with the names, locations and organizations removed.
To maintain similarity in the pre-processing approaches, the Stanford core NLP pipeline [39], provided through the Python Stanza toolkit (Version 1.2) [40], was used where possible. This provided the tools for tokenizing the text and lemmatizing words. However, Stanford core NLP did not provide any stemming options, so an implementation of Porter Stemmer [41] from the Python natural language toolkit (NLTK) (Version 3.5) [42] was used. NLTK also provided the list of common English stop words for stop word removal. The remaining pre-processing techniques (removal of numeric characters, removal of single-character words and punctuation removal) were all developed for this project. The final pre-processing component was used to specifically clean the text of the Twitter collections. This consisted of removing URLs, "#" symbols from Twitter hashtags, "@" symbols from user handles and retweet text ("RT"). These components form part of the project's source code.
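The Twitter-specific cleaning step can be sketched as follows. The regular expressions here are illustrative assumptions, not the project's exact implementation, but they perform the same removals described above.

```python
import re

def clean_tweet(text):
    """Remove URLs, the retweet marker, and hashtag/handle symbols."""
    text = re.sub(r"https?://\S+", "", text)       # remove URLs
    text = re.sub(r"^RT\s+", "", text)             # remove leading retweet text
    text = text.replace("#", "").replace("@", "")  # strip hashtag/handle symbols
    return " ".join(text.split())                  # normalize whitespace

print(clean_tweet("RT @nhs: Thank you #NHS staff https://example.com"))
# nhs: Thank you NHS staff
```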
A selection of four word scoring metrics is made available in the pipeline. RAKE [18] was used via an implementation available in the rake-nltk (Version 1.0.4) [43] Python module. A TextRank [17] implementation was designed and developed, derived from an article and source code by Liang [44]. An implementation of TF-IDF and a term frequency model were also developed.
Within the stage of feature extraction, the BoW model was developed using standard Python collections. The vocabulary was originally stored as a list and subsequently converted to a dictionary to optimize look-up speed when generating feature masks. Feature masks were represented in NumPy arrays to reduce memory overhead and execution time. All classifiers used in this publication originate from the scikit-learn (Version 0.24.1) [45] machine learning library. This library was selected since (i) it provided tested classifier builds, (ii) it provided a range of statistical scoring methods and (iii) it is a popular library used in similar literature, thereby enabling direct comparison with other work in this field. For this project, a set of wrapper classes was designed for the scikit-learn classifiers. All classifier wrappers were developed upon an abstract base class to increase code reuse, speed up implementation of new classifiers and ensure a standardized set of method calls through overriding. The base class and all child classes are available in the "Classifiers" package within the source code. A link to the full source code can be found in the Supplementary Materials Section.
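The wrapper-class design described above can be sketched with Python's `abc` module. The class and method names here are illustrative assumptions, not the project's actual API; in the real pipeline each concrete subclass would wrap a scikit-learn classifier rather than the trivial stand-in shown.

```python
from abc import ABC, abstractmethod

class BaseClassifier(ABC):
    """Abstract base class: the standardized interface every wrapper overrides."""

    @abstractmethod
    def train(self, feature_masks, labels): ...

    @abstractmethod
    def classify(self, feature_mask): ...

class MajorityClassifier(BaseClassifier):
    """Trivial stand-in: always predicts the most common training label."""

    def train(self, feature_masks, labels):
        self.prediction = max(set(labels), key=labels.count)

    def classify(self, feature_mask):
        return self.prediction

clf = MajorityClassifier()
clf.train([[1, 0], [1, 1], [0, 1]], ["Supportive", "Supportive", "Teamwork"])
print(clf.classify([1, 0]))  # Supportive
```

Because `BaseClassifier` is abstract, it cannot be instantiated directly, which enforces the standardized `train`/`classify` calls across all wrappers.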
Within the final stage of Classification, four of the most common text classifiers are provided. These are k-nearest neighbor (KNN), complement weighted naïve Bayes (CNB), multi-layer perceptron (MLP) and support vector machine (SVM). Tuning the hyperparameters of each of these classifiers, for every dataset, would have produced too much variability in the results. Therefore, each classifier was tuned to the LFE dataset; the same hyperparameters were used on all datasets. When tuning was performed, only one variable was tuned at a time; the remainder of the pipeline remained constant, see Figure 4.

The first classifier, KNN, determines the category of an item based on the categories of the nearest neighbors in feature space. The core parameter to set is the value of k: the number of neighbors which should be considered when classifying a new item. To define k, a range of values was tested, and their accuracy was assessed based on their F1 score. See the full results in Appendix C, Table A2. The value of k was defined as 23. Work by Tan [46] suggested that weighting neighbors based on their distance may improve results when working with unbalanced text corpora. Thus, this parameter was also tuned. However, when this was applied to the LFE dataset, uniform weighting produced better results. The results of these tests are shown in Appendix C, Table A3.
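A tuning loop of this shape can be sketched with scikit-learn. The synthetic feature matrix below is an assumption standing in for the pipeline's TF-IDF-weighted BoW features, and the candidate values of k are illustrative rather than the full grid from Table A2.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real document-feature matrix.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# Grid over k and the two weighting schemes discussed in the text,
# scored by cross-validated micro-averaged F1.
results = {}
for k in (3, 11, 23):
    for weights in ("uniform", "distance"):
        clf = KNeighborsClassifier(n_neighbors=k, weights=weights)
        score = cross_val_score(clf, X, y, cv=5, scoring="f1_micro").mean()
        results[(k, weights)] = score

best_k, best_weights = max(results, key=results.get)
```

On the LFE features the same loop selected k = 23 with uniform weighting, per Tables A2 and A3.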
The second classifier, complement weighted naïve Bayes (CNB), is a specialized version of multinomial naïve Bayes (MNB). This approach is reported to perform better on imbalanced text classification datasets by improving some of the assumptions made in MNB. Specifically, it focuses on correcting the assumption that features are independent, and it attempts to improve the weight selection of the MNB decision boundary. This approach did not require any hyper parameter tuning, and the scikit-learn CNB was implemented as described by Rennie et al. [27].
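Using the scikit-learn class with its defaults, as described above, requires no tuning. The toy term-count matrix below is a fabricated illustration of an imbalanced two-class corpus, not NHS data.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

# Toy imbalanced corpus: 30 documents dominated by the first two
# vocabulary terms versus only 5 dominated by the last two.
X = np.vstack([
    np.tile([3, 2, 0, 0], (30, 1)),  # majority class term counts
    np.tile([0, 0, 3, 2], (5, 1)),   # minority class term counts
])
y = np.array([0] * 30 + [1] * 5)

clf = ComplementNB()  # scikit-learn defaults, no hyper parameter tuning
clf.fit(X, y)
```

Despite the 6:1 imbalance, the complement weighting lets the minority class compete on its own characteristic terms, which is the behavior Rennie et al. designed it for.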
The third classifier provided is based on the multi-layer perceptron (MLP). There are multiple modern text classification approaches which use deep learning variants of neural networks. Some notable examples are convolutional neural networks (CNN) [47] and recurrent neural networks (RNN) [11], both of which have been used extensively in this field. These approaches have a substantial computational overhead for feature creation. Therefore, deep learning would have been too unwieldy for some of the datasets used in this work. Furthermore, scikit-learn does not provide an implementation of CNN or RNN neural network architectures. Therefore, their use would require another library, reducing the quality of any comparisons made between classifiers. For these reasons, a more traditional MLP architecture, with a single hidden layer, was used instead. The main parameter to tune for this model was the number of neurons used in the hidden layer. There is much discussion on how to optimize selection of this parameter, but the general rule of thumb is to select the floored mean between the number of input neurons and output neurons, as defined below:

n_hidden = ⌊(n_input + n_output) / 2⌋

The remaining MLP hyper parameters are the default values from scikit-learn, and a full list of these can be found in Appendix C, Table A4. The MLP was also set to stop training early if there was no change in the validation score, within a tolerance bound of 1 × 10⁻⁴, over ten epochs.
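This configuration maps directly onto scikit-learn's MLPClassifier. The feature and class counts below are illustrative placeholders, not the LFE dataset's actual dimensions.

```python
from sklearn.neural_network import MLPClassifier

def hidden_layer_size(n_inputs, n_outputs):
    """Floored mean of the input and output neuron counts,
    the rule of thumb described in the text."""
    return (n_inputs + n_outputs) // 2

n_features, n_classes = 1000, 16  # placeholder sizes

clf = MLPClassifier(
    hidden_layer_sizes=(hidden_layer_size(n_features, n_classes),),
    early_stopping=True,   # hold out a validation split and...
    tol=1e-4,              # ...stop if the score changes less than this...
    n_iter_no_change=10,   # ...for ten consecutive epochs
)
```

All other parameters are left at the scikit-learn defaults, matching the configuration listed in Table A4.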
For the fourth classifier, SVM, it has been reported that the selection of a linear kernel is more effective for text classification problems than non-linear kernels [48]. Four of the most commonly used kernels were tested, confirming that this was also the case with the LFE dataset. Therefore, a linear kernel was selected for use in this classifier. The results of the tests are found in Appendix C, Table A5. To account for the class imbalance in the LFE dataset, each class is weighted proportionally in the SVM to reduce bias.
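A kernel comparison in the same spirit might look as follows. The synthetic imbalanced data is an assumption, and `class_weight="balanced"` is one way to realize the proportional class weighting the text describes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic imbalanced two-class problem (roughly 80/20 split).
X, y = make_classification(n_samples=200, n_features=30,
                           weights=[0.8, 0.2], random_state=0)

# The four commonly used kernels, each with classes weighted
# inversely to their frequency to reduce bias.
scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, class_weight="balanced")
    scores[kernel] = cross_val_score(clf, X, y, cv=5,
                                     scoring="f1_micro").mean()
```

On the LFE features this comparison favored the linear kernel, per Table A5.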
Aside from these supervised classification approaches, an unsupervised model was also developed using scikit-learn and the same classification wrapper class structure. The purpose of this was to examine whether any natural clusters form within the LFE dataset, to enable a wider range of comparisons. K-means [49] was selected as the unsupervised approach, where k represents the number of groups the data should be clustered into. To tune this parameter, two metrics were recorded for a range of potential values of k: the j-squared error and the silhouette score [50]. A lower j-squared error represents a smaller average distance from any given data point to the centroid of its cluster, and a higher silhouette score represents an item exhibiting a greater similarity to its own cluster compared to other clusters. Therefore, an optimal k value should minimize the j-squared error whilst maximizing the silhouette score. However, the j-squared error is likely to trend lower as more clusters are added, leading to diminishing returns for larger values of k. So, it is better suited to examining where the benefit starts to drop off; this is often referred to as finding the "elbow" in the graph.
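The two tuning metrics can be collected as follows; scikit-learn exposes the within-cluster sum of squared distances (the "j-squared error" above) as `inertia_`. The blob data is a synthetic stand-in for the document features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the document-feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertia, silhouette = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = km.inertia_                       # "j-squared" error
    silhouette[k] = silhouette_score(X, km.labels_)
# The "elbow" is where inertia stops dropping sharply while the
# silhouette score remains high.
```

Plotting both dictionaries against k reproduces the shape of the tuning graph shown in Figure A1.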
Appendix C, Figure A1 shows the graph comparing the j-squared error and the average silhouette score for all clusters. From this analysis it was difficult to define the optimal value of k, since the j-squared error trended downwards almost linearly, and the average silhouette score was low for all values of k. Therefore, the LFE dataset was clustered using different small values of k, {2, 8, 13, 16, 20}, which performed better.

Results
This research evaluates how fundamental differences in database volume, category distribution and subjective manual classification affect the accuracy of automatic document classification. All experiments were performed on the same computer, which had the following hardware specification: Intel Core i5-8600K CPU, 6 cores at 3.6 GHz; 32 GB DDR4 RAM; Gigabyte Nvidia GeForce GTX 1060 GPU with 6 GB VRAM. The Stanza toolkit for Stanford CoreNLP supported GPU parallelization, and all experiments exploited this feature. The scikit-learn library did not have any GPU enabled options, so all classification was processed by the CPU.
During experiments, each dataset was tested for the given variables. A fivefold cross validation was used, and the mean score for each validation is reported. If not otherwise stated, all other elements of the pipeline are identical to the constant processing pipeline described in Section 4. The core metric used for evaluating accuracy was the F1 score, which combines both precision and recall into a single measure. This was recorded as both a micro average and a macro average. Tables 3-5 contain the experimental results yielded by the evaluation of changes to the different sections of the proposed pipeline (pre-processing, word scoring and classification, respectively). Based on the results from Tables 3-5, the optimum pipeline for each dataset was tested and the results can be found in Table 6. To benchmark the accuracy of our pipeline against other related work on automatic text classification, Table 7 presents our results on the Reuters corpus compared to the works mentioned in Section 3.2. As stated in that section, it should be noted that a direct comparison of these results is difficult due to the differences in document/category reduction, pre-processing approaches and feature selection. However, the results presented suggest that the approach outlined in this paper produces accuracy comparable to other state-of-the-art approaches.
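The evaluation protocol above reduces to a few lines with scikit-learn; the data and classifier below are placeholders for whichever pipeline configuration is under test.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import ComplementNB

# Placeholder features and labels for one pipeline configuration.
X, y = make_classification(n_samples=250, n_features=20, random_state=0)
X = np.abs(X)  # ComplementNB requires non-negative feature values

# Fivefold cross validation, reporting the mean micro- and
# macro-averaged F1 score, as in Tables 3-5.
micro = cross_val_score(ComplementNB(), X, y, cv=5, scoring="f1_micro").mean()
macro = cross_val_score(ComplementNB(), X, y, cv=5, scoring="f1_macro").mean()
```

Micro averaging pools all decisions before computing F1, while macro averaging computes F1 per class and then averages, which is why the two diverge on imbalanced data.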
Table 7. Comparison of the highest achieved micro-averaged score of our pipeline (shown in bold), compared to other published automatic text classification results on the "ApteMod" split of the Reuters-21578 corpus. Accuracy metrics are all F1 scores, except Joachims which is the precision-recall breakeven point.

Automatic Text Classification Approach
Overall Accuracy (Micro-Averaged)
1 This score is not stated explicitly but was calculated as the average of the F1 testing scores provided in the referenced paper.

Practical Implications
Based on the results of our experiments we will discuss the two research questions introduced in Section 3.1.
(R1) The NHS is likely to adopt our approach to automatically classify feedback. This means we have successfully reduced the workload of NHS staff by providing a tool which can be used in place of manual classification. Therefore, the answer to (R1) is positive. Although our proposed classification pipeline attained a lower micro-averaged F1 score on the LFE dataset compared to the benchmark datasets, given the limitations of the dataset, the NHS has found this better than the alternative of manually classifying future datasets.

(R2) The performance of the classification pipeline published in this paper is evaluated by comparing its results on the Reuters dataset against other published work. In this research, a micro-averaged F1 score of 93.30% was achieved. As shown in Table 7, that accuracy outperforms the seminal SVM approach of Joachims [32], which achieved a micro-averaged breakeven point of 86.40%. Furthermore, the classification pipeline performed in line with or surpassed more recent approaches [33][34][35], demonstrating that this classification pipeline produces high accuracy results on other datasets. Therefore, the answer to (R2) is positive.

Theoretical Implications
Despite the classification pipeline performing very well, the LFE dataset attained a lower micro-averaged F1 score than the benchmark datasets. This discussion will outline the factors which may have caused this result. The four comparison datasets all outperformed the LFE dataset for almost all potential pipeline setups. This suggests that there is an underlying limiting factor, or factors, within the dataset itself. To break down this comparison, each of the characteristics (see Section 3) will be discussed.

1.
The dataset is relatively small. The overall size of the items in the dataset may have resulted in an advantage to the Reuters and Amazon Hierarchical Reviews results, as it is widely accepted that a larger and more varied dataset will produce better classification results [51,52]. However, the much smaller Twitter Tweet Genre dataset also achieved a high level of accuracy, with a micro-averaged F1 score of 82.80%.
Considering the LFE dataset had almost double the number of items, this characteristic alone is unlikely to be the sole cause of the low accuracy results.

2.
The length of each text item is short. Both Twitter datasets attained vastly different results to the LFE dataset despite the fact they are similarly characterized as being short in length. These Twitter datasets also had considerably shorter average word counts than the LFE dataset and still outperformed it overall. In conclusion, the average length of each text item is unlikely to be a discriminatory characteristic.

3.
The number of text items per category is small. The average distribution of items per category did not limit performance on the Reuters dataset. However, that could be attributed to its larger overall size, which would have provided more samples for each category in comparison to the LFE dataset.

4.
The distribution of categories is not balanced. In terms of category distribution, all classification techniques for both the Reuters and LFE datasets suffered from the same issue, where the smallest categories were never applied when classifying the test dataset. Specifically, nine of the Reuters categories and five of the LFE categories never appeared in any of the test classifications. Although this did not impact the overall results of the Reuters classification, the percentage of small categories was much greater in the LFE dataset. In the LFE dataset, 25% of categories comprised less than 1% of the dataset, compared with 13.8% in the Reuters dataset. These tiny categories are almost certainly a contributing factor to the lower accuracy of the LFE results.

5.
All the text is positive. The use of common terms across all categories did not have a significantly negative impact on the classification accuracy of the Amazon Hierarchical Reviews dataset. However, the use of common terms did significantly impact the LFE results. This could be attributed to the fact that each Amazon review had an average word count more than 50% greater than that of the average LFE item, resulting in a diluting effect on the repeated common words. Due to the overall larger size of the Amazon Hierarchical Reviews dataset, it had a much larger vocabulary in comparison to the LFE dataset, which may explain why TF-IDF was the optimal scoring method for this dataset.

6.
The categories are subjectively defined. The subjective nature of both the manual classification and the categories themselves is likely to have played a role in the lower accuracy scores for both the LFE and the Twitter COVID Sentiment datasets. However, if accuracy alone is considered, it is not possible to establish a direct link.
Based on these comparisons, the limiting factors in the LFE classification results are most likely to be (i) the imbalanced category distribution, (ii) the use of repeated common terms across different categories and (iii) the subjective nature of the manual classification. To explore these factors, a manual analysis was performed on the K-means clustering result, to see whether these same factors were limiting when the LFE dataset was treated as an unsupervised clustering problem rather than a supervised classification problem.
The first test was used to evaluate how evenly the text items are distributed for different values of k (2, 8, 13, 16, 20 and 150). For any number of clusters, a similar trend emerged, where one cluster would account for between 51% and 98% of all the items. The remaining items were thinly spread between the remaining categories. Figures 5 and 6 depict how the data are unevenly distributed with k values of 20 and 150, respectively. Therefore, there is no evidence of a natural separation for most of the text items in the LFE dataset. Thus, they are either sufficiently generic that they are clustered into one large group, or they are overly similar, leading to the formation of limited smaller clusters.

To investigate this clustering and to evaluate other limiting dataset characteristics, a manual comparison of the text entries in the small and medium sized clusters was performed. When k = 8, a cluster emerged with only 29, out of the total 2307, items assigned to it. This cluster had a lot of similarities in its text items, almost always congratulating a member of staff on completing a course or gaining a qualification. Words such as "course", "success", "level", "pass" and "congratulations" appeared in this cluster at rates orders of magnitude higher than across the rest of the clusters. As the value of k varied, this cluster appeared with a 96.6% overlap with a cluster in k = 2, and a 79.3% overlap with a cluster in k = 13.
Furthermore, when k = 8, there was an even smaller cluster identified which contained only 9, out of the total 2307, items. This tiny cluster came from sequential items in the dataset, which share almost exactly the same words. It appears that someone submitted multiple "excellence texts" for a range of different staff. They copied and pasted the same text framework, just changing the name or organization, or slightly rewording the text. So, after using the NER, cleaning, lemmatization and removal of stop words, all these items are virtually identical. What is also interesting in this cluster is how the word "fantastic" appeared in every entry, whereas it only appears in 5.98% of the whole corpus. This shows one of the downsides of TF-IDF in this case, as words which have no bearing on the classification are scored highly due to their rarity across the rest of the corpus. This also supports the argument that common terms, unrelated to the category, could be limiting the classification accuracy. A full breakdown of the occurrence of the most common words in each cluster when k = 8, shown in Table 8, shows that a general theme can be manually identified for most of the clusters.
• Placement. Cluster ID 5 has a high prevalence of the words: "placement", "mentor", "support" and "team".

• General Support. Cluster ID 3 has a high prevalence of the words: "thank", "support" and "help".
Consider how the large remaining cluster has a similar distribution of words when compared to the full dataset.This suggests that this large cluster is a 'catch-all' for all the items not specific enough to be classified elsewhere.This reinforces the conclusion that rarely used but generic terms in the LFE dataset are biasing the accuracy of classification.
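The per-cluster word breakdown in Table 8 amounts to comparing, for each word, the fraction of items containing it inside a cluster against the fraction across the whole corpus. A minimal sketch, using invented example texts since the LFE data is not public:

```python
from collections import Counter

def word_prevalence(items):
    """Fraction of items in which each word appears at least once
    (case-folded, matching the note under Table 8)."""
    counts = Counter()
    for item in items:
        counts.update(set(item.lower().split()))
    return {w: c / len(items) for w, c in counts.items()}

# Hypothetical cluster vs. corpus texts, for illustration only.
cluster = ["congratulations on passing the course",
           "well done passing your course"]
corpus = cluster + ["thank you for your support",
                    "great team support today",
                    "thank you for the help"]

in_cluster = word_prevalence(cluster)
in_corpus = word_prevalence(corpus)
# "course" is far more prevalent inside the cluster than overall,
# which is how a theme such as "course completion" is identified.
```

Words whose in-cluster prevalence greatly exceeds their corpus prevalence are the ones highlighted in bold in Table 8.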
This could also explain why the simple scoring metric of term count was optimal for the LFE data. The other single word scoring methods (TextRank and TF-IDF) both give higher weight to words which are common in a given item, in comparison to the rest of the corpus. However, in this dataset the most commonly used words are actually those that most closely represent the categories:
• "Thank" appears in 43.44% of items.
When you consider there are categories specifically for "Thank you", "Supportive" and "Hard Work", it is clear that the underweighting of these terms could be another limiting factor of the LFE dataset. The limiting factor of subjective manual classification is evident in this same analysis. Although "Thank" appears in 43.44% of items, only 2.74% of the items have a primary theme of "Thank you". A specific example of this can be seen in one of the text items, after it has been lemmatized and stop words have been removed. Consider the text "ruin humor wrong sort quickly good day much always go help support thank". This seems quite generic and contains many keywords which might suggest "Supportive" or "Thank you" as the category. However, this text item was manually classified with a primary theme of "Positive Attitude" and a secondary theme of "Hard Work", despite it not having any of the common keywords associated with these themes.
Overall, the data suggest that the common limiting factors in classifying the LFE dataset are also present when it is clustered. Indeed, this means that there is an intrinsic limitation on the ability to classify this specific dataset.

Future Research
A number of open issues offer opportunities for future work. For example, it would be interesting to evaluate our pipeline with the latest iteration of the LFE dataset, as new entries are added every month. A larger dataset would hopefully provide more instances of different themes and reduce the imbalanced theme distribution.
An alternative option would be to see if the accuracy of our pipeline could be improved on the same dataset if the number of themes was reduced. For instance, some similar themes could be combined, such as "Kindness" and "Positive Attitude", which have a high degree of overlap. Some of the more generic, larger themes could also be removed entirely, for example, "Supportive" and "Hard Work". Based on the discussion above, it would be expected that this would reduce the imbalanced theme distribution and increase the ratio of text items to themes.
A separate area of research would be improvements to the novel pipeline software. Currently, it is a useful tool to test a range of different text pre-processing, word scoring and classification methods to determine which is the most suitable for a given dataset. However, it could be improved if this process was automated, so that the pipeline would test different combinations, rank them and automatically select the most efficient one. To achieve this, the novel pipeline would require a high level of optimization and structural reordering. However, this addition would make the tool more accessible to researchers outside the field, as it would require less inherent knowledge of the processes used.

Table A3.
KNN tuning: F1 score (micro and macro averaged) using uniform weighting compared to distance weighting. Although the F1 macro average score is higher for distance weighting, micro averaging is less susceptible to fluctuations from class imbalance, therefore this was chosen as the deciding factor. Tests performed using the constant processing pipeline. Selected weighting method shown in bold.

Figure 1 .
Figure 1. Procedural diagram of the processes used in automatic text classification approaches, where rhomboids represent data and rectangles represent processes.


Figure 2 .
Figure 2. Chart of LFE theme category distributions, where the size of a bubble denotes the number of occurrences of each text item for a particular theme category. Note that the position of the bubbles is synthetic, solely used to portray that overlap occurs between themes.


Figure 3 .
Figure 3. Representation of the text processing and classification pipeline, showing each stage described in Section 2. Rhomboids represent data, rectangles represent processes and diamonds represent decisions which can be modified within the software parameters.


Figure 4 .
Figure 4. Representation of the processing pipeline used for hyper parameter tuning. From the text input, all the way through the processes of tokenizing, lemmatizing, additional text cleaning, TF-IDF scoring, BoW modeling and creation of feature masks, to the final stage of using the complement weighted naïve Bayes classifier.


Figure 6 .
Figure 6. Distribution of the LFE dataset when clustered using K-means, where k = 150.


Figure A1 .
Figure A1. K-Means Tuning Graph. Comparison of how the j-squared error and average silhouette score vary for differing numbers of clusters (k).

Table 1 .
Comparison of the datasets based on the complexity of their classification characteristics.The numbered characteristics refer to the list at the start of this section.

Table 2 .
Full specification of datasets.All values presented in this table represent the raw datasets prior to removal of any invalid entries, pre-processing or text cleaning.
1 Assuming items were evenly distributed between categories, this is the minimum number of items assigned to each category. 2 Multiple values are presented for the Level 1, Level 2 and Level 3 hierarchy of categories, respectively.

Table 3 .
Evaluates the effect of different pre-processing techniques on the accuracy of classification. The same processing pipeline is maintained aside from pre-processing (TF-IDF, BoW, CNB).

Table 4 .
Evaluates the effect of using different word scoring techniques on the accuracy of classification. The same processing pipeline is maintained aside from word scoring (stop words removal, word lemmatization, additional text cleaning, BoW, CNB).

Table 5 .
Evaluates the use of different classifiers. The same processing pipeline is maintained aside from the classifier (stop words removed, words lemmatized, additional text cleaning, TF-IDF, BoW).

Table 6 .
Optimum pipeline result for each dataset with F1 micro averaged score.

Table 8 .
The percentage of times words appeared in each text item, in each cluster, when k = 8. Only the five largest clusters are shown, since the other three clusters only contained a single item each. Bold values are statistically significant (p < 0.05) in a given cluster when compared to their occurrence in the entire dataset. The case (upper or lower) of the words was not considered.

Table A5 .
SVM tuning: F1 score (micro and macro averaged) for different kernels. Although the F1 macro average score is higher for the linear kernel, micro averaging is less susceptible to fluctuations from class imbalance, therefore this was chosen as the deciding factor. Tests performed using the constant processing pipeline. Selected kernel shown in bold.