Sehaa: A Big Data Analytics Tool for Healthcare Symptoms and Diseases Detection Using Twitter, Apache Spark, and Machine Learning

Smartness, which underpins smart cities and societies, is defined by our ability to engage with our environments, analyze them, and make decisions, all in a timely manner. Healthcare is the prime candidate needing the transformative capability of this smartness. Social media could enable a ubiquitous and continuous engagement between healthcare stakeholders, leading to better public health. Current works are limited in their scope, functionality, and scalability. This paper proposes Sehaa, a big data analytics tool for healthcare in the Kingdom of Saudi Arabia (KSA) using Twitter data in Arabic. Sehaa uses Naive Bayes, Logistic Regression, and multiple feature extraction methods to detect various diseases in the KSA. Sehaa found that the top five diseases in Saudi Arabia in terms of the actual afflicted cases are dermal diseases, heart diseases, hypertension, cancer, and diabetes. Riyadh and Jeddah need to do more in creating awareness about the top diseases. Taif is the healthiest city in the KSA in terms of the detected diseases and awareness activities. Sehaa is developed over Apache Spark allowing true scalability. The dataset used comprises 18.9 million tweets collected from November 2018 to September 2019. The results are evaluated using well-known numerical criteria (Accuracy and F1-Score) and are validated against externally available statistics.


Introduction
Smart cities and societies are driving unparalleled technological growth manifested in our daily lives [1]. We are witnessing a rapid evolution, rather a transformation, of our societies. Novel solutions are being developed and adopted in work and life, benefitting from the growing ability to monitor and analyze our environments in near real-time. A range of devices and technologies are being used for monitoring purposes, including the Internet of Things (IoT), GPS, cameras, radio-frequency identification (RFID) tags, smartphones, smartwatches, other smart wearables, and social media. These devices produce diverse data that are analyzed using artificial intelligence (AI) and other computational intelligence methods and used for decision-making purposes.

The rest of this paper consists of six sections. Section 2 presents a brief overview of the background material and Section 3 presents a review of the relevant literature. The methodology, design, and architecture of the Sehaa tool are described in Section 4. Section 5 presents a detailed analysis of the results obtained through Sehaa. Section 6 discusses the numerical evaluation and external validation of the Sehaa system. Section 7 concludes and proposes future directions.

Background
This section briefly describes important background concepts related to this study, including big data, Apache Spark, the machine learning algorithms, and the feature extraction techniques.

Big Data
Recently, the term big data has been widely used to describe a specific type of dataset. Big data is defined as "the datasets that could not be perceived, acquired, managed, and processed by traditional IT and software/hardware tools within a tolerable time" [24]. Big data also refers to the "emerging technologies that are designed to extract value from data having the four Vs characteristics; volume, variety, velocity and veracity" [25]. Hadoop, Apache Spark, and Tableau are examples of technologies that provide solutions for big data.

Apache Spark
Apache Spark is "a unified analytics engine for large-scale data processing" [26]. Spark has many features that make it an optimal choice for big data analytics. For example, parallel applications can be easily constructed using the more than 80 high-level operators that Spark provides. Spark also offers an interactive shell in different programming languages such as Python, Scala, or R. Moreover, Spark contains most of the libraries required for the steps of big data analytics. Figure 1 shows the programming libraries provided by Spark: Spark SQL, Spark Streaming, MLlib, and GraphX.
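As an illustrative sketch only (not taken from the Sehaa implementation), the following PySpark snippet creates a Spark session and loads a CSV file of parsed tweets into a DataFrame that later MLlib steps could consume; the file name tweets.csv is a hypothetical placeholder.

```python
# Minimal sketch (not the Sehaa implementation): start a Spark session and
# load a CSV file of parsed tweets into a DataFrame for later MLlib processing.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SehaaSketch")      # hypothetical application name
         .getOrCreate())

# "tweets.csv" is a hypothetical placeholder for the parsed tweet files.
tweets_df = spark.read.csv("tweets.csv", header=True, inferSchema=True)
tweets_df.printSchema()
```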

Machine Learning
The term machine learning can be defined as the process of learning from input data in order to build adequate experience and then generate the required output [27]. The learning process can be supervised or unsupervised. In supervised learning, the task is to predict missing information (usually a class or a label) for some data; the prediction is made after learning from the information provided in the training data. By contrast, in unsupervised learning, the data is not divided into training and testing sets, and the learning process is achieved by grouping data into subsets of similar objects according to various features.

Logistic Regression (LR)
In logistic regression, the learner uses the logistic sigmoid function (Equation (1)) to calculate a probability value for the testing data. The appropriate label or class is then assigned according to the resulting probability value [28]:

s(z) = 1 / (1 + e^(-z))    (1)

In Equation (1), s(z) is the output between 0 and 1 (the probability value), z is the input to the function, and e is the base of the natural logarithm.

Naïve Bayes (NB)
For the Naïve Bayes classifier, the learner applies the Bayes theorem assuming independence between the extracted features of the training data. Hence, the NB algorithm scales well with the number of features included [29]. This algorithm is commonly used in a wide range of big data analytics research.
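For reference, the decision rule that a Naïve Bayes learner applies under this conditional-independence assumption can be written as follows (a standard textbook formulation, not reproduced from [29]):

```latex
% Standard Naive Bayes decision rule under the conditional-independence assumption:
% choose the class c that maximizes the prior times the product of per-feature likelihoods.
\hat{c} \;=\; \arg\max_{c}\; P(c)\,\prod_{i=1}^{n} P(x_i \mid c)
```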

Feature Extraction
Feature extraction is a crucial step in constructing a machine learning classifier. It aims to reduce the raw data to a manageable number of variables (a set of features) while maintaining the data accuracy. This allows us to select the significant features when building the classification models. There are several techniques for feature extraction; N-Gram and TF-IDF, which have been used in this study, are defined below.

N-Gram
In machine learning, one of the feature extraction methods is N-Gram. It converts the input text into a sequence of n contiguous tokens, which are usually words. N is an integer: one for the uni-gram method, two for the bi-gram method, and three for the tri-gram method. In practice, the PySpark library implements this method through instances of the NGram class [30].
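The following short sketch (illustrative only; the example sentence is hypothetical) shows the NGram transformer from pyspark.ml.feature, assuming the SparkSession created in the earlier sketch:

```python
# Sketch of PySpark's NGram transformer; input must already be tokenized.
# Assumes an existing SparkSession named `spark` (see the sketch above).
from pyspark.ml.feature import NGram

df = spark.createDataFrame(
    [(0, ["the", "patient", "has", "a", "fever"])],   # hypothetical tokens
    ["id", "tokens"])

bigram = NGram(n=2, inputCol="tokens", outputCol="bigrams")
bigram.transform(df).select("bigrams").show(truncate=False)
# -> [the patient, patient has, has a, a fever]
```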

TF-IDF
The term TF-IDF refers to Term Frequency-Inverse Document Frequency. It measures the importance of a term to a document in a corpus by considering the term's frequency [30]. If a term appears frequently across the corpus, it does not carry special information about a particular document (for example, "and", "a", "the", and "of"). For a given corpus D containing |D| documents, the TF-IDF value of a term t in a document d is calculated as shown in Equation (2):

TFIDF(t, d, D) = TF(t, d) x IDF(t, D)    (2)

where TF(t, d) is the frequency of the term t in the document d, and DF(t, D) is the number of documents in the corpus D that contain the term t. The term IDF(t, D) in Equation (2) is calculated using Equation (3):

IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))    (3)

CountVectorizer
The idea of CountVectorizer is to transform a collection of text documents to vectors of token counts. In cases where an a priori dictionary does not exist, CountVectorizer can be used to extract the vocabulary and construct the required dictionary [30].
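A minimal illustrative sketch of CountVectorizer in PySpark is shown below (the toy documents are hypothetical, and the SparkSession from the earlier sketch is assumed):

```python
# Sketch: CountVectorizer learns a vocabulary from tokenized text and
# outputs token-count vectors.
from pyspark.ml.feature import CountVectorizer

df = spark.createDataFrame(
    [(0, ["cough", "fever", "cough"]), (1, ["fever", "headache"])],
    ["id", "tokens"])

cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=1000)
model = cv.fit(df)              # learns the vocabulary (the dictionary)
model.transform(df).show(truncate=False)
```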

HashingTF
For each sentence (bag of words), HashingTF can be used to hash the sentence into a feature vector. We use IDF to rescale the feature vectors; this is done in order to improve performance when using text as features [30].
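The following sketch (illustrative, with hypothetical toy documents, again assuming the SparkSession from the earlier sketch) combines HashingTF and IDF as described above:

```python
# Sketch: HashingTF hashes tokens into a fixed-size term-frequency vector,
# and IDF rescales it to produce TF-IDF features.
from pyspark.ml.feature import HashingTF, IDF

df = spark.createDataFrame(
    [(0, ["fever", "cough", "fever"]), (1, ["headache", "fever"])],
    ["id", "tokens"])

tf = HashingTF(inputCol="tokens", outputCol="rawFeatures", numFeatures=1 << 18)
tf_df = tf.transform(df)

idf_model = IDF(inputCol="rawFeatures", outputCol="features").fit(tf_df)
idf_model.transform(tf_df).select("features").show(truncate=False)
```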

Literature Review
In this section, we review the literature related to the use of Twitter sentiment analysis in healthcare. Section 3.1 focuses on healthcare-related analytics in English, while Section 3.2 focuses on the same for the Arabic language. Section 3.3 describes the research gap.

Twitter Data Analytics in Healthcare
The data generated by social media, such as Twitter, provides unexpected opportunities to enrich the health care sector (see, e.g., [62]). Fields of applications in health care that have benefitted from social media data include, for example, building surveillance tools to track a certain disease, studying side effects of medications, and exploring healthcare-related habits. The most common methodologies found in the literature are statistical analysis and text mining (including sentiment analysis) of Twitter data using machine learning algorithms. The notable works are reviewed below.
Parker et al. [63] analyzed tweets without aiming to detect a particular illness. The target was to produce "interest curves" that document the generation of hypotheses regarding which health-related conditions/topics occur frequently. Unlike other studies, this approach is not dedicated to discovering one specific disease. The main contribution was to convert the stream of tweets into a list of health-related topics. Paul and Dredze developed a model, called the Ailment Topic Aspect Model (ATAM), to link each disease with its possible symptoms, medications, and related words [64]. The ATAM model was constructed using Twitter data and trained with a support vector machine (SVM) algorithm. The experiments demonstrated the efficiency of ATAM in aligning different groups of diseases with their symptoms, medications, and related words. The authors mentioned that ATAM could be used as a surveillance tool for general health topics. In [65], the authors extended the original ATAM to discover mentions of additional illnesses, including allergies, obesity, and insomnia.
The idea of detecting influenza cases from Twitter data was explored by Aramaki et al. [66]. Using a support vector machine (SVM)-based classifier, they detected influenza patients. The experimental results demonstrated the practicability of the proposed approach, which showed acceptable correlation with medical report statistics, especially at the outbreak and early spread (early epidemic) stage. The authors extended their work in [67] and implemented a robust influenza prediction model that enabled the use of direct and indirect information from tweets originating in urban and rural areas of Japan. This work was further extended in [68], where the authors constructed a more generalized disease surveillance tool that performs multi-label and cross-language tasks. The multi-label approach was used to classify tweets into eight different disease symptoms and label them appropriately with patients' symptoms. The tool is cross-language and works in three languages: English, Japanese, and Chinese.
Lamb et al. [69] found that deeper content analysis of tweets leads to a remarkable improvement in influenza surveillance performance. More precisely, flu-related tweets do not reflect infections alone; rather, a tweet might express suspicion of a flu infection, worry about infection, or a discussion about the disease itself. Hence, they relied on trained classifiers that can distinguish between awareness and infection tweets using a pre-annotated training set. Their results demonstrated that, by distinguishing between types of flu tweets to identify reports of infection, reasonable surveillance could be recovered. The obtained results brought to the attention of the NLP community that deeper content analysis of tweets is worth investigating. Smith et al. [70] developed a real-time surveillance tool for disease awareness rather than monitoring infection cases. This system offered an opportunity not only for public health officials to identify awareness trends, for which they often have no other data sources, but also to study what drives awareness of influenza in a population.
Studies are not limited to tracking the spread of specific epidemics; they have also examined the side effects of certain medications. For example, Bian et al. [71] reported that sentiment analysis of Twitter data revealed some of the unreported side effects of five cancer-related drugs. The researchers first identified drug users' Twitter accounts using a support vector machine (SVM) classifier. Then, they developed an analytic framework that integrates natural language processing and machine learning methods to capture drug-related adverse events from Twitter messages. Their findings showed the possibility of supporting pharmacovigilance by extracting knowledge from tweets.
In [72], Mayslin and Zhu explored the role of Twitter data analysis in tobacco consumption surveillance. The extracted sentiment was linked in complex ways with social image, personal experience, and recently popular products such as the hookah and electronic cigarettes. The study also showed the need for public awareness of their health effects. Taken together, these findings suggest a role for machine classification of tobacco-related posts, over strictly keyword-based approaches, in enhancing tobacco surveillance applications. Jashinsky et al. [73] investigated whether Twitter data can be a promising source for researchers to identify suicide-related risk factors. They indicated in their study that Twitter could be a useful tool for the early detection of individuals who are at risk of suicide. Their findings demonstrated a strong correlation between Twitter data and actual suicide data for a particular state in the United States.
Achrekar et al. [74] implemented a novel approach to estimating influenza statistics using Twitter data and real data provided by the Centers for Disease Control and Prevention (CDC). Their model showed that, to some extent, Twitter data could be considered a reliable source for the real-time assessment of epidemic conditions and could offset the lack of real statistics. They showed that text mining significantly improves the correlation between Twitter data and the influenza-like illness (ILI) rates provided by the CDC. They built a model using a support vector machine algorithm and simple bag-of-words text processing to detect flu cases. Due to the high correlation of their model results with real statistics, they then built an estimation tool using the same techniques that could compensate for the absence of real statistical cases.
Furthermore, for a certain disease such as influenza, surveillance approaches have proved their efficiency when tested on larger-scale regions [75]. However, influenza is not the only public health concern; asthma is another concern that requires readiness from hospital emergency rooms. In [76], a robust model was developed using different sources of real data to predict the number of asthma-related emergency department (ED) visits in a specific area. Unlike traditional surveillance tools, which rely on EMRs (electronic medical records), the study relied on applying machine learning to social media and environmental sensor data. In practice, for a specific geographic area and time period, they collected Twitter data and examined the relationships between the Twitter data, internet users' search interests from Google, asthma-related ED visit data, and pollution sensor data. The associations among these different data sources were then used to build a prediction tool that estimates the number of asthma-related ED visits. The tool successfully estimated the rate of asthma ED visits using a combination of independent variables from the above-mentioned data sources.
Culotta in [77] developed a flu tracking system using a supervised learning approach for the analysis of flu-related tweets. The results of the tracking system showed a high correlation with the real statistics generated by the national health authorities. That provided further confidence in the usefulness of Twitter as a data resource for health-related research and in the robustness of natural language processing (NLP) algorithms.

Twitter Data Analytics in Healthcare (Arabic)
We have found only three works on Twitter data analytics for healthcare in the Arabic language. Alayba et al. [15] introduced an annotated Arabic dataset about health services, which they state is a necessary component of sentiment analysis studies. The dataset consists of about 2026 tweets in the Arabic language. The tweets were collected over a six-month period using the four most popular hashtags about health services. Pre-processing, including normalization steps, was applied. The cleaned tweets were annotated by three annotators as either positive or negative. The classification of tweets was limited to two classes only, due to the difficulty of rating opinion in Arabic compared to English. Their experiments were performed with various machine learning algorithms and deep neural networks using different settings. They reported obtaining the best classifier results using SVM with linear support vector classification and stochastic gradient descent. Alkouz and Aghbari [16] detected influenza in the UAE from Arabic tweets. They classified tweets and used them to predict the number of future hospital visits using a linear regression model. Their work focused on analyzing tweets in Arabic MSA and in the UAE dialect. The authors reported correlations between their results and those obtained from the UAE Ministry of Health. This work did not use any big data technologies. Ilyas and Alowibdi [17] used tweets in Arabic to track diseases in the Gulf Cooperation Council (GCC) countries. This work used a small number of tweets and did not use any AI or big data technologies.

Research Gap
The discussions on the related works provided above clearly establish the immense potential of Twitter data analytics in healthcare. Two major challenges are evident. First, the current state of research on the topic is limited, both in terms of the scope of the works [10] and in terms of the investigation into, and comparison of, methods for healthcare analytics, including machine learning methods [11]. A number of works exist on the analysis of tweets in the English language; however, much more is needed to develop robust analytics methods, tool functionalities, and usability. Further works are needed in languages other than English, particularly cross-language works. There are several challenges that hinder the development of tools for Twitter data analytics in the Arabic language, the greatest being the complexity of the language itself. Research on Twitter data analytics in Arabic has begun to appear in recent years in various application domains (detecting authors' genders [12], detecting traffic-related events [18,20,38], finding restaurants' reputations [13]), but the progress has been slow. Moreover, some works are available in Modern Standard Arabic (MSA), but in general (not specific to healthcare), the works on Arabic dialects are very limited in number and scope [10,14]. The three works in Arabic specific to healthcare [15][16][17] that we discussed in the previous section are limited in scope, depth, and/or functionalities. There is no known work in data analytics specifically on the Saudi Arabic dialect in healthcare.
The second challenge related to Twitter data analytics in healthcare concerns the scalability and interoperability of the Twitter data analytics systems. The challenges in this respect include management, integration, and distributed computation of data including the difficulties related to managing the 4V characteristics of big data, i.e., volume, velocity, variety, and veracity. There have been some works on the use of big data platforms in Twitter data analytics in various languages but in different application domains [18][19][20][21][22][23]37,78,79]. To the best of our knowledge, no work has been reported that uses big data technologies for data analytics in healthcare using tweets in the Arabic language.
Healthcare is among the most data-intensive sectors, both generating and consuming data. The use of big data distributed computing technologies is important for scalability and for integration with other sources of healthcare information. Motivated by these research gaps, we have attempted the design of a tool that provides healthcare analytics capabilities from tweets in the Arabic language using big data technologies.

Sehaa Tool: Methodology and Design
In this section, we describe the methodology and design of our proposed Sehaa system. We have built a software tool based on the design of the proposed Sehaa system, and we will refer to it as a tool or system interchangeably. The tool comprises four modules, and these are described in separate subsections (Sections 4.2 to 4.5) subsequent to the first Section (4.1), which provides an overview. The dataset is described in Section 4.2.

Sehaa: An Overview
We built the Sehaa system in order to detect the most frequent health symptoms and diseases. The system methodology and design are generic and can be adopted globally. However, our focus in this work is on Saudi Arabia and therefore the tool currently works with the Arabic language. The architecture of the Sehaa tool is illustrated in Figure 2. Sehaa is composed of four main modules. The overall process of the Sehaa system can be summarized as follows. First, we capture and download public tweet messages from Saudi Arabia using a Twitter streaming API according to a set of predefined parameters. These parameters are the (Arabic) language, the location, and a set of search keywords which represent health symptoms and diseases (see Section 4.2, [80], and Table 1).
It is well known that social media contents are unstructured and contain many errors (i.e., the veracity of big data). Therefore, in the Pre-Processing module, the acquired data are cleaned and preprocessed to ensure their readiness for the actual learning, classification, and prediction stages. This forms our dataset. We then divide the dataset into training and testing datasets: 60% of the tweets are included in the training set and the rest in the testing set. There is no lexicon or library available in the Arabic language for machine learning, particularly for healthcare, and, therefore, we have manually labeled the tweets; see details of the Pre-Processing Module in Section 4.3. Subsequently, we build six classification models (classifiers) to be used in two classification stages, first to classify related and unrelated tweets, and then to detect symptoms and diseases; see Section 4.4. Finally, the results are validated using multiple numerical criteria and external sources (news media) and visualized; see Section 4.5.

Data Collection Module
The main purpose of this module is to collect relevant tweets. Our aim is to capture tweets in the Arabic language that are related to healthcare in Saudi Arabia. We have used a stream listener for the purpose.

Keywords and Geolocation
Firstly, we collected the tweets using a filter based on a list of keywords. These keywords were partly taken from the latest statistics published by the Saudi Ministry of Health (MOH) on their official website [80]. Some of the other keywords were acquired from domain experts through personal discussions, and from common knowledge and vocabularies in the Saudi dialect. Table 1 lists common vocabularies for the most common symptoms and diseases in Saudi Arabia. Column 1 lists the set of keywords that could be used by Twitter users to mention symptoms, diseases, or medications and hence were included in the search keywords. These keywords are grouped based on particular representative diseases in English and Arabic, and are listed in Column 2. The abbreviations for the diseases are provided in Column 3. The next three columns provide similar details for other diseases.
Secondly, we filtered the tweets based on the geographical location of the tweets to ensure that the tweets are generated from Saudi Arabia. This was achieved by specifying the top-right and bottom-left geographical coordinates and defining a bounding square box around the country. The tweets that originated from within the defined bounding box were extracted by the stream listener. This is different from the paid Twitter Search API, where the user's location field is used to filter tweets. A Python function using the Tweepy library was written that incorporated the two filters described above into the stream listener.
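A hedged sketch of such a listener is shown below. It assumes Tweepy version 3.x; the credentials, keyword subset, bounding-box coordinates, and output directory are illustrative placeholders rather than the exact values used in Sehaa, and the keyword and language checks are applied on the client side inside the listener.

```python
# Hedged sketch (Tweepy 3.x) of a stream listener that keeps Arabic tweets
# mentioning a search keyword, restricted to a bounding box around Saudi Arabia.
import json
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")   # placeholders
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")    # placeholders

KEYWORDS = ["سكر", "الضغط", "سرطان"]        # illustrative subset of Table 1
BOUNDING_BOX = [34.5, 16.3, 55.7, 32.2]      # approximate [west, south, east, north]

class SehaaListener(tweepy.StreamListener):
    def on_data(self, raw_data):
        tweet = json.loads(raw_data)
        text = tweet.get("text", "")
        # Keep Arabic tweets that mention at least one search keyword;
        # each tweet object is stored as one JSON file (assumes a tweets/ directory).
        if tweet.get("lang") == "ar" and any(k in text for k in KEYWORDS):
            with open("tweets/{}.json".format(tweet["id_str"]), "w") as f:
                f.write(raw_data)
        return True

    def on_error(self, status_code):
        return status_code != 420            # stop on rate-limit warnings

stream = tweepy.Stream(auth, SehaaListener())
# The bounding box restricts the stream to tweets geotagged inside Saudi Arabia.
stream.filter(locations=BOUNDING_BOX)
```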
We understand that some tweets originating outside Saudi Arabia could also be relevant to this work and should have been collected. However, we used a free streaming Twitter API and were limited in resources. Moreover, the tweets originating from outside Saudi Arabia would form a small proportion of the overall tweets and we plan to consider this in future work.

The Data Set
The data were collected between 20 November 2018 and 9 September 2019. However, the collection script could not be run continuously throughout this period due to technical difficulties and personal circumstances. The periods when the tweets were collected are: 20 November 2018 to 8 January 2019, 13 February to 29 May 2019, and 31 July to 9 September 2019. This makes a total of 195 days of data. A total of 18.9 million tweets were collected.

The JSON Parser
The tweets from the streaming API as described above are tweet objects in the JSON (JavaScript Object Notation) format. The structured JSON format is the default response of the Twitter API. A part of a tweet object in the JSON format is illustrated in Figure 3. The JSON format consists of pairs of attributes and values for different objects. Tweets and Users are the two main objects, and each object has its own attributes. The values of the attributes can be accessed using appropriate indexing. There are a number of difficulties related to handling tweets in the JSON format, including programming and computational complexities. Moreover, the existing parsers are dedicated to the English language and many encoding issues are encountered when they are used for Arabic. Therefore, we built a parser to extract the required attributes of the tweets in the JSON format and to store them in the CSV format. Each tweet in the JSON format is stored as a separate file. The parser takes these files containing JSON tweets, extracts all the required attributes of each tweet, and stores them in CSV files. This time, however, each tweet is not stored as a separate file; rather, all the tweets related to a specific keyword (see Table 1) are stored in a single file. A new file is used when the maximum file size limit is reached.
The parser algorithm is given as Algorithm 1. It starts by iterating over all the tweet objects in the JSON format. The necessary attributes, such as tweet ids, text, time, and location are extracted and stored in RAM in multiple lists. Finally, these lists are exported to CSV files in the secondary storage (see Figure 4). The generated CSV files are relatively easy to modify for annotation and labeling purposes.
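The following sketch illustrates the spirit of Algorithm 1 under stated assumptions: the file paths, the selected attributes, and the output file name are illustrative, and the per-keyword file grouping and maximum-file-size handling are omitted for brevity.

```python
# Hedged sketch of the JSON-to-CSV parsing step: iterate over stored tweet
# files, extract the attributes needed later, and write them to a CSV file.
import csv
import glob
import json

rows = []
for path in glob.glob("tweets/*.json"):               # one JSON file per tweet
    with open(path, encoding="utf-8") as f:
        tweet = json.load(f)
    rows.append([
        tweet["id_str"],                               # tweet id
        tweet.get("text", ""),                         # tweet text
        tweet.get("created_at", ""),                   # posting time
        (tweet.get("user") or {}).get("location") or "",  # user-declared location
    ])

# utf-8-sig keeps the Arabic text readable when the CSV is opened elsewhere.
with open("parsed_tweets.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "text", "created_at", "location"])
    writer.writerows(rows)
```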

Data Pre-Processing Module
Data pre-processing or preparation is a crucial step within a data analytics pipeline, due to the unstructured and informal nature of the data generated by social media. It involves applying several techniques to the acquired data set in order to clean the data. Data pre-processing should be performed to ensure the data readiness for the subsequent steps wherein the actual analytics will be performed. It also enhances the quality and accuracy of data analytics. Some libraries are available for the pre-processing of text in various languages. The NLTK (natural language toolkit) library is an example. It is used to pre-process text in a wide range of languages including English. However, it does not provide satisfactory support for the Arabic language.
Based on our design preferences and the nature of our data set, we have created a specific pre-processing algorithm appropriate for the Arabic language (see Algorithm 2). It starts by removing all advertisement tweets; then incomplete and non-Arabic tweets are removed. Duplicate tweets and unwanted characters, such as emoticons, are also removed.
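A minimal sketch of these cleaning steps is given below; the advertisement markers and the regular expressions are illustrative assumptions rather than the exact rules of Algorithm 2.

```python
# Hedged sketch of the cleaning steps: drop advertisement-like, incomplete, and
# non-Arabic tweets, strip emoticons/unwanted characters, and remove duplicates.
import re

AD_MARKERS = ["إعلان", "للبيع", "خصم"]               # hypothetical advertisement markers
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")            # at least one Arabic letter
NON_TEXT_RE = re.compile(r"[^\u0600-\u06FF0-9\s]")    # emoticons, Latin letters, symbols

def clean_tweets(tweets):
    seen, cleaned = set(), []
    for text in tweets:
        if any(marker in text for marker in AD_MARKERS):
            continue                                   # advertisement tweet
        if len(text.split()) < 3 or not ARABIC_RE.search(text):
            continue                                   # incomplete or non-Arabic tweet
        text = NON_TEXT_RE.sub(" ", text)              # remove emoticons and symbols
        text = re.sub(r"\s+", " ", text).strip()
        if text in seen:
            continue                                   # duplicated tweet
        seen.add(text)
        cleaned.append(text)
    return cleaned
```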

Labeling the Tweets
Subsequent to the data cleaning phase, the dataset needs to be labeled for training and testing purposes. We randomly selected a part of the data set and divided it into training (60%) and testing (40%) datasets. We manually labeled all the selected tweets using two levels of labeling. At the first level, we label tweets to distinguish between "related" and "unrelated" tweets, using the labels "R" and "U", respectively. Tweets that express sickness cases and include information about a specific disease or news about awareness events regarding certain health phenomena are considered as related. However, if the text contains one of the search keywords but the context does not reflect a health concern, such as supplications, jokes, and poems, it is labeled as unrelated. A few examples of tweets and their labels are shown in Table 2.
The second level of labeling is performed on the related tweets to distinguish between the tweets that communicate to create "awareness" about diseases and those which are reporting actual cases of being "inflicted by" a disease. The former category of the related tweets is labeled as "A" and the latter category of tweets as "I".

Classification Module
In machine learning, classification aims to predict a category, a class, or a label of a given input data based on the rules generated during the learning or training phase. The label values are already known, and the class boundaries are well defined in the training data. In the learning phase, the classifier uses the labeled data (training data) to generate the rules of classification in order to learn how to predict the labels for the data provided in the future.
Several classification techniques have been suggested and implemented in various programming language libraries, such as Python and PySpark. Support vector machines, decision trees, and deep learning are examples of machine learning techniques [81]. In our study, we have used the Naïve Bayes (NB) and Logistic Regression (LR) algorithms in different combinations with four feature extraction methods: BiGram, TriGram, HashingTF, and CountVectorizer.
The Classification Module receives data from the Pre-Processing module containing labeled and unlabeled tweets and classifies the tweets using the NB and LR algorithms in different combinations with the four above-mentioned extraction techniques. At the first level, the tweets are classified into related and unrelated tweets. Subsequently, the related tweets undergo a second level of classification where we separate the tweets into the "awareness" and the "afflicted by" tweets (see Section 4.3 and Table 2).
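As an illustration of one such combination (not the exact Sehaa pipeline), the following sketch builds a first-level classifier using trigram features, TF-IDF weighting, and Naïve Bayes; train_df and test_df denote the labeled training and testing splits described in Section 4.3, and the column names are assumptions.

```python
# Hedged sketch of one classifier configuration for the first level
# (related "R" vs. unrelated "U"): tokenize, build trigrams, weight with
# TF-IDF, index the string labels, and fit a Naive Bayes model.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, NGram, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import NaiveBayes

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    NGram(n=3, inputCol="tokens", outputCol="trigrams"),
    HashingTF(inputCol="trigrams", outputCol="rawFeatures"),
    IDF(inputCol="rawFeatures", outputCol="features"),
    StringIndexer(inputCol="label", outputCol="labelIndex"),   # "R"/"U" -> 0/1
    NaiveBayes(featuresCol="features", labelCol="labelIndex"),
])

# train_df and test_df are the manually labeled 60%/40% splits described above.
model = pipeline.fit(train_df)
predictions = model.transform(test_df)
```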

Validation Module
The validation of the classification results is a crucial task in data analytics. The task involves evaluating the classification models using a set of criteria. We have used two widely used numerical evaluation criteria, Accuracy and F1-Score. Using the standard counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), these are calculated as follows: Accuracy = (TP + TN)/(TP + TN + FP + FN); Precision = TP/(TP + FP); Recall = TP/(TP + FN); F1-Score = 2/(1/Precision + 1/Recall). We use the Accuracy and F1-Score criteria to select the best algorithm for the classification of the tweets.
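For illustration, these criteria can be computed directly from the confusion-matrix counts, as in the following sketch; the counts shown are arbitrary example values, not Sehaa results.

```python
# Sketch: evaluation criteria computed from confusion-matrix counts.
def evaluate(tp, tn, fp, fn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1_score  = 2 / (1 / precision + 1 / recall)   # as in the formula above
    return accuracy, f1_score

# Arbitrary illustrative counts.
print(evaluate(tp=70, tn=65, fp=20, fn=18))
```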

Visualization
An important function in the Sehaa system is to prepare various statistics such as the numerical evaluation metrics mentioned above and to visualize them.

External Validation
We also validated the results obtained through the Sehaa system against various external sources. These sources include the Institute for Health Metrics and Evaluation (IHME), which is an independent global health research center at the University of Washington [82], the official World Health Organization (WHO) website [83], the Centers for Disease Control and Prevention [84], and the Saudi Ministry of Health (MOH). We looked at these sources and compared the relative occurrence of various diseases in Saudi Arabia. The results will be discussed in the next section.

Sehaa: Results and Discussions
This section summarizes the main findings of this research. In Section 5.1 we analyze the pre-classification results. Section 5.2 presents the post-classification results using the first-level classification. The results differentiating the awareness and afflicted cases from the second-level classifier are discussed in Section 5.3. See Section 4.4 for the classification methodology.
An important point about the reporting of the results in this section should be noted here. We understand that a tweet could mention multiple diseases (though our experience suggests that tweets mentioning multiple diseases are rare). However, in this work, we made the decision to associate each tweet with only one disease, the one mentioned first in the tweet. Future work will explore this further and improve the quality of data analytics in Sehaa.

Pre-Classification Results
Figure 5 shows the frequency of symptoms, medications, and diseases in Saudi Arabia detected by the Sehaa system; see Table 1 for the list of keywords and their representative diseases. The x-axis represents symptoms, medications, and diseases in the Saudi dialect. The y-axis shows the number of tweets that mention each keyword. We have used the log-10 scale for the y-axis in order to show both the low-frequency and high-frequency keywords. These are pre-classification results, and they also include irrelevant tweets such as advertisements. The first-level classification has not been performed yet, and hence the tweets may include various healthcare-related terms used in contexts other than healthcare. See Table 2, example 4, where the Arabic word (transliterated into English as) "araqq" could mean insomnia but is being used to mean delicate. Other examples include supplications, listed as example 5 in Table 2. The keywords for eczema and vitiligo are among the least frequent. Looking at the uses of various keywords for a specific disease, most of the tweets that mentioned heart disease used the keywords (قلب "heart", جلطة "stroke"). Similarly, for dermal diseases, the most frequently used keywords were (جر) and (اكزيما "eczema"). For cancer, the most frequently used keywords were (الخبيث "malignant", سرطان "cancer"). For hypertension, there is only one word used in Arabic, (الضغط "pressure"), and the figure shows that it is one of the most common diseases in Saudi Arabia. Diabetes is represented by multiple keywords (see Table 1); Figure 5 shows that a number of these keywords have been used to talk about diabetes with fairly high frequencies, implying that it is one of the most common diseases in Saudi Arabia. Finally, we note that scabies appears as one of the most common diseases in Saudi Arabia according to the figure. However, it is in fact not that common in Saudi Arabia based on our external validation (see Section 6.2). The high occurrence of scabies in Figure 5 is due to a number of epidemic cases that were reported in Saudi Arabia in late 2018 (the period when this data was collected). Figure 6 plots the same data as Figure 5 but as a pie chart, depicting the numerical proportions of various diseases. The symptoms, diseases, and medications are aggregated using the terminology given in Table 1. Note that the four most common diseases in the figure are DBT (diabetes) with 31% of total mentions, HYT (hypertension) with 28%, DRD (dermal diseases) with 21%, and HRD (heart disease) with 16% of total mentions. Diabetes is the most common disease in Saudi Arabia based on the unfiltered results.

Post-Classification Results (First-Level)
This section presents and analyzes the results obtained from the first-level classifier, i.e., the classifier that removed unrelated (or irrelevant) tweets (for classifier details, see Section 4.4). These results differ from those presented in the previous section as follows. Firstly, they do not include irrelevant tweets, as explained above. Secondly, the results are aggregated based on the common terminologies provided in Table 1. Thirdly, we provide results for major cities in addition to the whole of Saudi Arabia, together with statistics normalized by the population sizes of the examined cities.

Figure 7 illustrates the post-classification (first-level) distribution of the diseases detected by the Sehaa system. Each set of bars in the figure represents a unique disease, and the detected numbers of occurrences are plotted using the log-10 scale on the y-axis. For detailed analysis, we have extracted the tweets for the five major cities in Saudi Arabia and plotted them for each unique disease. There are a total of 14 diseases. The cities are Riyadh, Jeddah, Dammam, Makkah, and Taif. After the classification stages, we filtered the tweets according to their locations by retrieving the locations of the users. The location of a tweet is available in the location attribute of the user object, which is part of the tweet object. Note that we were only able to find locations of tweets where these were enabled by the user. The sixth bar in the figure for each disease is the total number of occurrences in the whole of Saudi Arabia. The absence of a bar implies that no cases were detected for a disease in a city. The three-lettered abbreviations of the diseases are provided in Table 1. For further exploration, and to take into account the fact that bigger cities, such as Riyadh, have larger populations and hence potentially more tweeters, the frequency of each disease was normalized against the relevant population. That is, the frequency of a disease for a city was divided by the population of the city, and the frequency of a disease for the whole of Saudi Arabia was divided by the population of the country. These normalized values for the 14 diseases, the five cities, and the whole of Saudi Arabia are plotted in Figure 8.

Figures 7 and 8 show that the top five prevalent diseases in Saudi Arabia are hypertension (HYT), dermal diseases (DRD), heart diseases (HRD), diabetes (DBT), and cancer (CNR). This pattern is somewhat different from the pattern of the five major cities (Riyadh, Jeddah, Dammam, Makkah, and Taif), for which the top five prevalent diseases are heart diseases, hypertension, dermal diseases, cancer, and diabetes. However, there are clear differences in the precise frequencies of the top five diseases across the five cities.
Note in the normalized values (Figure 8) that Riyadh and Jeddah have the highest frequency among the five cities for most of the diseases. One exception is Makkah, which is the only city where influenza has been detected; this could be due to the large number of visitors from all over the world coming into close proximity, which could spread influenza in Makkah more than in other cities. This is generally known in Saudi Arabia, and therefore the Sehaa findings have verified the common knowledge in Saudi Arabia. We would like to note here that we are not suggesting that influenza does not exist in other cities, but that perhaps people do not talk about it as much as they do in Makkah.
Note also in Figure 8 that the overall national normalized number of occurrences of most diseases is lower than the frequencies for one or more of the five cities, except for asthma (AST), hypertension (HYT), dermal diseases (DRD), diabetes (DBT), and thyroid diseases (THD). Riyadh, followed by Jeddah, has the highest normalized values among the five cities and the national value. Asthma (AST) and cholesterol (CHT) were only detected in Riyadh and Jeddah and not in the other three cities. For Makkah, there was an almost equal number of detected cases for diabetes and cancer. The most intriguing finding is that Taif appears to be the healthiest city. It had the lowest number of detected diseases; only seven diseases were detected out of a total of fourteen. It also had the lowest normalized value for all the detected diseases.
Finally, note that the top five diseases detected by the pre-classification (Figure 6) and post-classification (Figures 7 and 8) results are different. A careful look at the numbers reveals that the pattern is more or less the same for all diseases except for diabetes. The reason for these differences is firstly the removal of irrelevant tweets, which makes the post-classification results a better indication of the reality. Secondly, in Saudi Arabia, the most common word used by the public for diabetes is "سكر", which literally means "sugar". People also use the word "سكر" to show love for each other, to (positively) make jokes to each other, and to greet each other to have a sweet morning, evening, etc. (see Table 2, Row 6). There were a large number of such tweets that were filtered out by the classifier, and therefore diabetes was not detected as the top disease.
Moreover, the words "سكري" and "السكري" also refer to a type of date fruit (see Table 2, Row 7). These words in the fruit connotation are also found very frequently in tweets, and such tweets were filtered out as advertisements and other types of unrelated tweets. Nevertheless, a deeper look into these results and the classifiers is needed and forms our future work.

Awareness vs. Afflicted Tweets
The tweets detected by the first-level classifiers comprise both the "afflicted by" and "awareness" tweets (see Section 4.3). The "afflicted by" tweets include cases of sickness, suffering, and medication. The awareness tweets communicate to create awareness about diseases; they are usually posted by medical practitioners tweeting to raise public awareness. The purpose of the second-level classifier is to separate the actual cases from the awareness tweets in order to compute the actual number of disease cases. Moreover, this classification also helps in finding the level of awareness-related activities for particular diseases across various cities in the Kingdom of Saudi Arabia (KSA). Figure 9 depicts the total number of awareness and afflicted cases for the set of fourteen diseases in the KSA. The top five diseases in terms of awareness tweets were hypertension (HYT), dermal diseases (DRD), heart diseases (HRD), diabetes (DBT), and cancer (CNR). The top five diseases in terms of the actual "afflicted by" cases were dermal diseases (DRD), heart diseases (HRD), hypertension (HYT), cancer (CNR), and diabetes (DBT). The rankings of the diseases using the awareness tweets are the same as for the first-level classification. However, the ranking of the diseases using the "afflicted by" cases is different from the first-level classification (see Figure 9). Although this appears surprising, it is expected, because the number of awareness tweets is much higher than the number of tweets reporting actual cases, and these larger numbers dominated the first-level classification trends in Figure 7. Compare these results with the pre-classification results in Figure 6 and note that the trend for the diseases is also different from the second-level classification results in Figure 9. We conclude that the most accurate top five diseases detected by the Sehaa system are the ones depicted in Figure 9 for the "afflicted by" cases, that is, dermal diseases (DRD), heart diseases (HRD), hypertension (HYT), cancer (CNR), and diabetes (DBT).

Figure 10 depicts the ratio of awareness to afflicted cases for Saudi Arabia and its five major cities. We infer from these values the level of awareness activities being carried out in various cities in comparison to the occurrence of each of the fourteen diseases. We expect bigger cities to carry out a higher number of awareness activities, as is usually the case due to their larger populations, higher education levels, and use of technologies (Twitter). However, we reiterate that Figure 10 depicts the ratio and not the number of awareness activities. Note in Figure 10 that Riyadh has the highest ratio of awareness to afflicted cases for six of the fourteen diseases: AST, HYT, CNR, CGH, DBT, and THD. Interestingly, the top two diseases, dermal diseases (DRD) and heart diseases (HRD), are not included in these six, implying that Riyadh should do more in creating awareness for these diseases.
Jeddah is considered the second major city in Saudi Arabia. It has the highest ratio of awareness to afflicted cases for three of the fourteen diseases: CHT (cholesterol), CLN (colon), and TBC (tuberculosis). None of these are among the top five diseases in Saudi Arabia. Jeddah needs to do more in creating awareness for the top five national diseases. Taif is the fifth major city, with a population that is one-eighth of that of the largest city, Riyadh. However, it is commendable that it has a comparable ratio of awareness to afflicted cases for several diseases. Note in Figure 7 that seven out of 14 diseases were not detected in Taif and hence these diseases have zero values in Figure 10. Considering the two facts together, Taif has the lowest number of disease cases in Saudi Arabia while maintaining a high number of awareness activities.
Compare further Figures 7 and 10 and note that the national ratios for the fourteen diseases are significantly lower than those for the five major cities (which is not the case in Figure 7). This shows that most of the awareness activities are happening in the five major cities and more effort is needed in the other cities of Saudi Arabia.

Results Validation
We now discuss the numerical evaluation results of the classifiers and the external validation for the Sehaa system.

Numerical Evaluation
The Classification module provides critical functionality for the Sehaa system (see Section 4.4). The classification is accomplished in two levels. The first level is to distinguish the health-related tweets from the unrelated ones, while the second level is to classify the related tweets into awareness or afflicted tweets (see Table 2). At each level, the classification relies on machine learning algorithms.
In order to select the most efficient algorithm, we used different machine learning algorithms in the Spark ML packages, using Python with different feature extraction techniques to train the classification models. We split the manually labeled data into training sets (60%) and testing sets (40%). We built different model pipelines and trained these models using the Naïve Bayes (NB) and logistic regression (LR) algorithms. To find the best algorithm, we numerically evaluated them using the testing data set using the well-known evaluation criteria Accuracy and F1-score (see Section 4.5).
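A hedged sketch of how such combinations could be compared with Spark's evaluator is shown below; candidate_pipelines, train_df, and test_df are assumed to be defined (e.g., as in the pipeline sketch of Section 4.4) and are not part of the original implementation.

```python
# Hedged sketch: compare classifier/feature-extraction combinations on the
# 40% test split using Accuracy and F1 from Spark ML's evaluator.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

acc_eval = MulticlassClassificationEvaluator(
    labelCol="labelIndex", predictionCol="prediction", metricName="accuracy")
f1_eval = MulticlassClassificationEvaluator(
    labelCol="labelIndex", predictionCol="prediction", metricName="f1")

# candidate_pipelines is an assumed dict, e.g. {"NB+Trigram": ..., "LR+HashingTF": ...}
for name, pipeline in candidate_pipelines.items():
    model = pipeline.fit(train_df)
    predictions = model.transform(test_df)
    print(name, acc_eval.evaluate(predictions), f1_eval.evaluate(predictions))
```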
These scores for the first-level and second-level classifiers are plotted in Figures 11 and 12, respectively. Figure 11 shows that Naïve Bayes with Trigram feature extraction provided the highest Accuracy (78.2%) compared to any other combinations of feature extraction techniques with Naïve Bayes or Logistic Regression. However, Logistic Regression with Trigram feature extraction provided the highest F1-score among all the classification and feature extraction methods. We had selected Naïve Bayes with Trigram (due to the higher Accuracy score) for the results presented in the previous sections. Figure 12 shows the results for the second-level classifier. Note that it provides higher accuracies than the first-level classifier. For both Accuracy (86.7%) and F1-score (85.6%), Logistic Regression with HashingTF provided the best results, and therefore it was selected for the second-level classification.

External Validation
We have attempted to validate the results obtained from the Sehaa system against external sources, including reports from various relevant organizations, news media, and results reported in research articles. This involved comparing the statistics reported in various sources with the statistics produced by Sehaa. Unfortunately, we found limited information related to health and diseases in Saudi Arabia in the various media. The information is either incomplete or old. Moreover, the existing information is limited to communicable diseases only. The sources we have looked at include the official website of the Saudi Ministry of Health [80], published articles from various organizations including the Institute for Health Metrics and Evaluation (IHME) (an independent global health research center at the University of Washington) [82], the official website of the World Health Organization (WHO) [83], and the Centers for Disease Control and Prevention (CDC).
According to the latest report published by the CDC [84], heart disease was considered among the top causes of death in Saudi Arabia in 2018. The Institute for Health Metrics and Evaluation reported that the risk of death from heart diseases increased by 36% from 2007 to 2017 [82]. These data confirm the findings of the Sehaa tool, where heart disease was detected as the second top disease in Saudi Arabia (see Section 5.3).
Al-Nozha et al. [85] reported in 1997 that more than a quarter of Saudi adults were suffering from hypertension. Aljohani reported, based on a study considering all hospitalized patients in one of Jeddah's hospitals, that hypertension was the sixth most widely prevalent disease in 2010 [86]. The Saudi Ministry of Health reported in 2017 that there was a remarkable increase in hypertension cases among Saudi adults [87]. These findings from as early as 1997 until recently show the high prevalence of hypertension in Saudi Arabia and hence a good correlation with the Sehaa findings.
The MOH announced 1452 reported cases of tuberculosis (TBC) in 2018 [88]; 451 of these cases were from Jeddah, 338 from Riyadh, and 21 from Taif. While this news item does not relate directly to the findings of the Sehaa tool, the low TBC numbers from Taif are in agreement with our findings of low levels of diseases in Taif.

Conclusions
Smartness, which underpins smart cities and societies, is defined by our ability to engage with our environments, analyze them, and make decisions, all in a timely manner. Healthcare is the prime candidate needing the transformative capability of this smartness due to reasons including healthcare spending reaching a significant proportion of national GDPs (around one-fifth of the GDP in the US), an aging population, gross inefficiencies, and bad eating habits around the world, giving rise to the prevalence of lifelong diseases. With half of the world's population connected to social networks, social media provides a vital solution for ubiquitous and timely engagement among healthcare stakeholders. Twitter is one of the most popular social media platforms today, with 500 million tweets sent every day, and Saudi Arabia has the fifth largest number of Twitter users in the world.
Our focus in this research has been on the use of Twitter media for healthcare in Saudi Arabia with the aim to develop technologies that provide enhanced healthcare in the country. We provided an extensive review of the relevant literature and identified two major challenges: first, the rudimentary level of the existing research (in terms of scope, functionalities, and usability) on Twitter data analytics in healthcare in English, and particularly in other languages, including Arabic; second, the scalability and interoperability of the analytics tools for healthcare, such as the management, integration, and distributed computations of big data.
We proposed Sehaa, a big data analytics tool for healthcare in Saudi Arabia using Twitter data in Arabic. Sehaa used Naive Bayes, Logistic Regression, and multiple feature extraction methods to detect various diseases in Saudi Arabia. Sehaa was able to successfully detect various diseases. The top five diseases in Saudi Arabia in terms of the actual afflicted cases are dermal diseases, heart diseases, hypertension, cancer, and diabetes. Riyadh and Jeddah need to do more in creating awareness about the top diseases. Taif is the healthiest city in the KSA in terms of the detected diseases and awareness activities. Sehaa is developed over Apache Spark, allowing true scalability. The results were evaluated using the well-known numerical criteria Accuracy and F1-Score, obtaining 83.9% and 86.7% scores for the two classification stages. The Sehaa results were validated against externally available statistics and shown to have a good correlation with them. For example, heart disease was found to be one of the top causes of death in Saudi Arabia, as reported in external sources, and this was in agreement with Sehaa, which detected heart diseases among the top five diseases in the country. Taif was shown to have low disease occurrence in external media and this was also in agreement with the results obtained through Sehaa.
Sehaa is an excellent example of integrating artificial intelligence (AI), distributed big data computing, and human cognition, brought together as a convenient tool for the betterment of public health and the economy. The system methodology and design are generic and can be extended to other countries in the Arab world as well as globally. Our focus in this work is on Saudi Arabia and therefore the tool currently works with tweets only in the Arabic language (it can be used in other Arabic speaking countries, such as UAE, Kuwait, and Egypt). Potential users of this tool are hospitals and other healthcare organizations, ministries of health, pharmaceutical companies, and other healthcare stakeholders.
This study is the first of its kind in Saudi Arabia using Apache Spark and tweets in the Arabic language. Sehaa is an important step in developing data analytics tools for Twitter (and other social media) in Arabic. The use of a scalable distributed computing platform for big data (Apache Spark) in this paper is also an important step in the right direction. The future will see the integration of more and more disparate systems to allow global system optimization [89][90][91][92]; the use of open-source scalable distributed computing platforms is very important for this purpose. The integration of social media data with other smart city systems for real-time healthcare analytics, planning, and operations is a grand challenge with unimaginable benefits and applications. Another important challenge in Twitter data analytics is the labeling of data or tweets. In this paper, we have used manual labeling, which is an extremely time- and resource-consuming task. An alternative is to use semi-supervised or unsupervised machine learning to automatically label large amounts of data (see, e.g., [93]). These methods are in their infancy and are limited by low accuracies. Further investigation is planned for developing automatic labeling methods. Future work will proceed in these directions and improve the scope, functionality, scalability, analytics, data management, productivity, usability, and accuracy of the tool.
Modern living includes ubiquitous use of smartphones, wearables such as smartwatches, and other mobile devices. The concept of Smartness that we have discussed in this paper, i.e., our ability to engage with our environments, analyze them, and make decisions, requires embedding mobile devices and sensors in our environments. These sensor-rich environments undoubtedly have disadvantages in terms of the security and privacy risks they pose to us [94]. Twitter data, which is the focus of this research, is already public and our analysis only reports results on the population level; thus we believe that the privacy risks would either be non-existent or would be of a minor nature. However, this is an important concern and should be properly investigated. We have some background in developing privacy-preserving technologies [59,[95][96][97], and we plan to look further into the privacy issues related to Twitter data and propose solutions to minimize these privacy risks.