IoT Based Health-Related Topic Recognition from Emerging Online Health Community (Med Help) Using Machine Learning Technique

Abstract: The experiences that patients and physicians share spontaneously on online health communities (OHCs) contain a wealth of unexploited knowledge. Med Help and eHealth are among the online health communities offering new insights and solutions for health issues.


Introduction
The metabolic disease diabetes mellitus, the contagious infection tuberculosis and thyroid disease are major chronic diseases which affect billions of people every year. These chronic diseases have rapidly increased death rates over past decades, and they act as a gateway to several other diseases by weakening the human immune system. According to the World Health Organization, 422 million people are affected by diabetes and 1.6 million deaths occur each year due to diabetes and tuberculosis [1]. A survey in the BioMed Central (BMC) Public Health journal indicates that lower levels of thyroid hormones increase the risk of diabetes mellitus.
Diabetes mellitus is a metabolic disease in which blood glucose levels are abnormally high. Insulin is a hormone produced by the pancreas and is responsible for lowering the glucose level in blood. Insufficient production of insulin, absence of insulin and an inability of the body to properly utilize insulin are major causes of diabetes [2]. Diabetes mellitus is categorized as type 1 (insulin-dependent or juvenile-onset diabetes) and type 2 (non-insulin-dependent or adult-onset diabetes) [3]. In the United States, diabetes is the seventh most common cause of death.
Tuberculosis is an infectious disease caused by a bacterium called Mycobacterium tuberculosis (MTB). Tuberculosis (TB) primarily affects the lungs and can also invade other organs. It spreads from one person to another through coughs, sneezes and saliva. TB is categorized into active TB or extrapulmonary TB and latent TB infection. The BCG vaccine acts as a barrier against tuberculosis. The WHO describes TB as an "epidemic" and states that tuberculosis is one of the leading causes of death from a single infectious agent [4].
The thyroid gland is a butterfly-shaped endocrine gland present in the neck. It is responsible for producing thyroid hormones that control various metabolic activities in the human body. An abnormal increase or decrease of the thyroid hormone leads to thyroid disease. Thyroid disease is classified into hyperthyroidism, or an overactive thyroid, and hypothyroidism, or an underactive thyroid. Hashimoto disease, Graves' disease, thyroid nodules and goiter are the most prominent disorders of the thyroid. Thyroid disease is a stubborn condition that is almost impossible to eradicate and persists throughout a patient's lifetime [5].
Social media platforms support interactive computer-mediated technologies that enable users to share new information, ideas and opinions [6] with their communities. Online health communities (OHCs) and health care professionals (HCPs) are an emerging phenomenon in social media, connecting groups of individuals with similar health-related issues and interests [7][8][9][10][11]. Using this persuasive platform, HCPs clarify public health-related problems, illustrate the use of health care policy and practice issues, promote public health programs, motivate patients and educate individuals by providing continuous support and service.
The information collected from Med Help, e-Health, WebMD, Healthline, Medscape, Everyday Health and Health Central is helpful in identifying inter-relationships among commonly arising acute diseases [12]. The keywords from the collected information help patients and physicians to explore information about these chronic diseases. The knowledge gathered from these keywords can help to reduce the possible death rate.
An analysis of 750 messages collected for four different chronic diseases reveals the diverse range of activities carried out by moderators [13,14]. Community development and the strengthening of local networks help to improve the quality of life (QOL) of older people and of patients with self-harming behavior affected by various diseases [15][16][17].
In [18], data are collected from a rural community in Zambia, and the analyzed results explain the experience and responsibility of the mother, who satisfies cultural and health expectations during new-born care. Through community content and thematic relationships, the effects of climatic changes on human physical and mental health are explained in [19]. Text mining and science mapping techniques are used to analyze and interpret the results [20]. A systematic pharmacological method is combined with other data mining techniques for the evaluation of drug similarity [21].
Data mining techniques are applied to a dataset of MTA (Metropolitan Transportation Authority) customer feedback to enhance QOS (quality of service) and identify customer satisfaction levels [22]. The tool in [23] uses text mining techniques to interpret and identify diagnostic patterns in a large, free dataset of clinical patient notes. The study in [24] used the K-means++ algorithm to increase the accuracy of a recommendation system.
An improved K-means algorithm and dimensionality reduction were used to cluster Arabic text [25]. A K-means text clustering algorithm was used efficiently in spam detection [26]. In another study, text clustering was performed with a weighted K-means algorithm [27].
The analysis reports of trusted health care organizations are an important source from which to find relationships between the mentioned chronic diseases. The online health community platform is recommended by physicians to obtain accurate knowledge about all diseases. The OHC texts play a vital role in the extraction of keywords and in finding inter-relationships between all three chronic diseases.
The objectives are designed to emphasize social values and to eradicate lingering diseases, namely diabetes mellitus (DM), tuberculosis (TB) and thyroid disorders. The three main objectives are as follows:

• To extract important keywords of each disease from each cluster.
• To find inter-relationships among the three chronic diseases.
• To measure the accuracy of extracted keywords by comparing them with reports from the world's trusted organizations.

Materials and Methods
Figure 1 portrays the overall architecture of the system. The key points of our contribution are listed below:

1. The comments discussed by both patients and physicians in the healthcare forum, Med Help, are collected for the three chronic diseases.
2. The datasets are pre-processed using NLP techniques such as tokenization, stopword removal and punctuation removal. The term frequency-inverse document frequency (TF-IDF) measure is used to collect the most important words from the pre-processed datasets.
3. The most important feature words are filtered from the three pre-processed datasets using the chi-square test.
4. The K-means++ algorithm is applied to the reduced feature datasets. Given the resulting cluster groups, LDA is used to identify the most frequently occurring meaningful keywords.
5. Keywords identified from each cluster of all three diseases are compared with reports from the world's trusted healthcare organizations to measure their accuracy.

Data Set Gathering and Preprocessing
In this study, the online health community Med Help is used as the platform to collect the dataset. Powerful web APIs are helpful in the translation of information in the connected world (IoT) [10]. The precautions, remedies and knowledge about the three chronic diseases are discussed in comments by physicians and patients in the online health community [33]. These comments are stored in the Med Help cloud for numerous diseases and disorders. The comments about the three diseases are collected as a dataset covering 2018, 2019 and 2020 (up to January) using a web API; the results are then stored in a local database.
In NLP, pre-processing is an essential step in which raw text is transformed into a simpler form. It underlies the performance of machine learning (ML) algorithms. Tokenization is a pre-processing step where paragraphs are split into sentences and sentences are split into individual words. Stop words are connecting words in a sentence which carry little meaning on their own. Stopwords are removed in the pre-processing step by utilizing a manually created stopword dictionary or prebuilt libraries based on sensitivity. Term frequency-inverse document frequency, known as TF-IDF [34], is an important statistical measure which calculates the importance of a word in a document or in a corpus.
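The pre-processing steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the NLTK and Scikit-learn pipeline used in the study; the stopword list, the sentence splitter and the sample comment are hypothetical stand-ins.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# the study relies on prebuilt library resources instead.
STOPWORDS = {"a", "an", "and", "are", "for", "i", "in", "is", "my", "of", "the", "to", "with"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    sentences, current = [], []
    for ch in text:
        current.append(ch)
        if ch in ".!?":
            sentences.append("".join(current).strip())
            current = []
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return [s for s in sentences if s]


def preprocess(text):
    """Tokenize, lowercase, strip punctuation and drop stopwords."""
    strip_punct = str.maketrans("", "", string.punctuation)
    tokens = []
    for sentence in split_sentences(text):
        for word in sentence.lower().split():
            word = word.translate(strip_punct)
            if word and word not in STOPWORDS:
                tokens.append(word)
    return tokens


# Hypothetical forum comment.
comment = "My thyroid levels are low. Is fatigue a side effect of the medication?"
print(preprocess(comment))
```

A production pipeline would normally replace the naive splitter and the small stopword set with a library tokenizer and a full stopword dictionary.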
The term frequency-inverse document frequency (TF-IDF) calculation is described here. The TF value of a term "t" in a document "d" is given by the frequency "f" of that term in the document, divided by the number of words in that document, as mentioned in Equation (1). The IDF value for a word refers to its importance within the whole dataset, considering its occurrence in every document, as given in Equation (2). The TF-IDF value is merely the product of these values, represented in Equation (3). The algorithm for text preprocessing is given in Algorithm 1.

Algorithm 1 Text preprocessing
2. Until all words in the document in a file are exhausted, calculate the TF value from Equation (1).
3. Calculate the inverse document frequency from Equation (2).
4. Calculate the TF-IDF value using Equation (3) and set a minimum threshold value.
5. If the TF-IDF score > the threshold value (0.53)
• Append the word into a document.
6. End
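The filtering loop of Algorithm 1 can be sketched as follows. TF follows the description in the text (term count divided by document length). The IDF form used here, log(N / df) with N documents and df documents containing the term, is an assumption, since the exact form of Equation (2) is not reproduced in this section. The toy documents are hypothetical, and only the 0.53 threshold comes from the text.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        doc_scores = {}
        for term, freq in counts.items():
            tf = freq / total                        # TF as described for Equation (1)
            idf = math.log(n_docs / doc_freq[term])  # assumed form of Equation (2)
            doc_scores[term] = tf * idf              # product, as in Equation (3)
        scores.append(doc_scores)
    return scores


def filter_by_threshold(documents, threshold):
    """Keep only the words whose TF-IDF score exceeds the threshold."""
    kept = []
    for tokens, doc_scores in zip(documents, tf_idf(documents)):
        kept.append([t for t in tokens if doc_scores[t] > threshold])
    return kept


# Hypothetical tokenized comments.
docs = [["insulin", "sugar"], ["sugar", "cough"], ["sugar", "goiter"]]
print(filter_by_threshold(docs, threshold=0.53))
```

In this toy corpus "sugar" occurs in every document, so its IDF is zero and it is dropped, while each document's distinctive word scores about 0.549 and passes the threshold. Scores depend on the IDF variant and any normalization, so a threshold tuned for one implementation may not transfer to another.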

Chi-Square Test
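Algorithm 2 portrays the execution of the chi-square test. Feature selection, or attribute selection, is the process of extracting the most relevant features from a dataset. The chi-square test is a statistical measure used for feature selection based on the dependency of words in a document. The chi-square statistic is calculated using the following formula:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed frequency of word $i$ and $E_i$ is its expected frequency under the assumption that there is no relationship between features.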
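As a worked illustration, the statistic can be computed for a single word from a small contingency table. This sketch assumes a 2x2 layout (word present or absent, by document group) with non-zero row and column totals; the counts are hypothetical and the way the study builds its tables is not specified here.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows = word present / absent,
# columns = documents of one group / all other documents.
table = [[30, 10],
         [20, 40]]
print(round(chi_square(table), 3))
```

A larger statistic indicates a stronger dependency between the word and the grouping, so words can be ranked by this value and kept when they exceed a chosen cut-off.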

K-Means++
K-means++ is an unsupervised iterative clustering algorithm. Algorithm 3 gives the pseudocode for the K-means++ technique. K-means++ is an extended version of the popular K-means algorithm that chooses the initial centroids more carefully, which improves the quality of clustering [35]. The K-means++ algorithm is limited to numerical values and groups similar documents in a corpus into a single group. Doc2Vec is a neural-network-based unsupervised algorithm which generates feature vectors for documents. The generated feature vectors are used to find the similarity between documents.

Algorithm 3 K-means++
Input: Essential feature dataset extracted based on the chi-square test.
Output: Seven clustered documents.

2. Choose one center k randomly from the data points d.
For x in d
• Find the nearest centroid (C2 ... Cn) using the distance formula.
Selection of the next centroid is based on a probability that depends on the distance from the first initialized centroid.
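The seeding step can be sketched as below. This follows the standard K-means++ rule, in which each new centroid is drawn with probability proportional to the squared distance to the nearest centroid chosen so far; the surviving pseudocode is incomplete, so treating it as this standard rule is an assumption. The sample points are hypothetical, and the sketch expects at least k distinct points.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """Pick k initial centroids with distance-weighted (K-means++) seeding."""
    # First centroid: chosen uniformly at random from the data points.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from every point to its nearest chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Far-away points are more likely to become the next centroid;
        # already chosen points have weight zero and are never re-drawn.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Hypothetical 2-D document vectors forming three loose groups.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0), (9.2, 0.3)]
seeds = kmeans_pp_init(points, k=3, rng=random.Random(0))
print(seeds)
```

The seeds returned here would then be handed to the usual K-means assignment and update iterations.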

LDA
LDA (latent Dirichlet allocation) is an unsupervised machine learning model used for topic modelling [30]. In LDA, each document is considered a mixture of topics and each topic a mixture of words. Several words describe the same topic and several topics construct the same document. LDA captures the meaning of words in topic modelling more faithfully than LSA (latent semantic analysis) and provides better results and accuracy [36,37]. In [38], spectral clustering is used to cluster documents and a PNN classifier is used to identify the label of each cluster. Different classifier algorithms and machine learning techniques are used to classify datasets and text documents [39][40][41][42]. LDA and LSA are used to identify the topics of a given text document without clustering.
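To make the "mixture of topics, mixture of words" idea concrete, the following is a small collapsed Gibbs sampler for LDA written with the standard library only. It is a teaching sketch, not the implementation used in the study; the hyperparameters `alpha` and `beta`, the iteration count and the toy documents are all assumptions.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, top_n=3, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                           # tokens assigned to topic k
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    # Report the most frequent words of each topic.
    top_words = []
    for t in range(n_topics):
        ranked = sorted(range(n_vocab), key=lambda w: topic_word[t][w], reverse=True)
        top_words.append([vocab[w] for w in ranked[:top_n]])
    return top_words, doc_topic


# Hypothetical tokenized comments.
docs = [
    ["insulin", "sugar", "diet", "insulin"],
    ["sugar", "diet", "exercise"],
    ["cough", "lungs", "vaccine"],
    ["lungs", "cough", "fever", "cough"],
]
topics, doc_topic = lda_gibbs(docs, n_topics=2)
for t, words in enumerate(topics):
    print(t, words)
```

Each row of `doc_topic` holds the number of tokens of a document assigned to each topic, which can be normalized to obtain the document's topic mixture. With so little data the topics are unstable, so the example is meant only to show the mechanics.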

Data Set Gathering and Preprocessing
The dataset of diabetes mellitus consists of 74, 233 and 311 documents from 2020, 2019 and 2018, respectively. The tuberculosis dataset consists of 625 documents, which comprise 117, 276 and 232 documents from 2020, 2019 and 2018, respectively. The thyroid dataset is composed of 591 documents, which contain 116, 219 and 256 documents from 2020, 2019 and 2018, respectively. From all three diseases, a total of 1824 documents are collected from the online health community.
The collected dataset of each disease is pre-processed using the Python NLTK and Scikit-learn packages, which provide tokenization, stop word removal and punctuation removal methods. The TF-IDF measure is used to collect the most important words in a dataset and to remove low-frequency terms from the corpus. After the pre-processing step, the dataset of each disease contains the most meaningful words.

Chi-Square Test
The chi-square test is applied to each pre-processed dataset to extract the most important features based on a threshold value. The extracted important features are recorded. Figure 2 shows the results of the chi-square test for each dataset. In this figure, the top 30 words of each disease are represented in graph format based on the threshold value of 0.47.

K-Means++
The Python Gensim package includes Doc2Vec. Using the Gensim package, Doc2Vec is applied to the datasets obtained from the chi-square test for each disease. Doc2Vec generates feature vector values based on the similarity between documents. The K-means++ algorithm is applied to the Doc2Vec feature vectors of each disease. For each disease, the clustering process is repeated with seven clusters. Each cluster document is collected individually for all diseases. The clusters of the three diseases are depicted in Figure 3.
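Once document vectors and initial centroids are available, the clustering itself alternates between assigning vectors to the nearest centroid and recomputing centroids. The sketch below shows these plain K-means iterations with the seeds passed in explicitly; it does not call Gensim or perform K-means++ seeding, and the four 2-D vectors are hypothetical stand-ins for Doc2Vec output.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, centroids, n_iter=20):
    """Plain K-means iterations starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(n_iter):
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        new_labels = assign(points, centroids)
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
    return labels, centroids


# Hypothetical document vectors and two hand-picked seeds.
vectors = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
labels, centers = kmeans(vectors, [vectors[0], vectors[2]])
print(labels)
```

For the setting described above, the same loop would run with seven centroids per disease on the document vectors.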

LDA
LDA is applied to each cluster of all three diseases to retrieve the top ten topics. As a result, the most important keywords are extracted, and each cluster is manually labelled based on these keywords for all three chronic diseases. A word cloud is a visualization technique that displays the high-frequency terms of each cluster for all diseases. A count vectorizer is used to obtain the most frequently occurring words of each cluster of all three diseases, which are shown in Figure 4.
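The counting behind such a frequency view can be sketched with the standard library. This is a simplified stand-in for a count vectorizer, and the tokenized cluster documents are hypothetical.

```python
from collections import Counter


def top_terms(cluster_docs, n=3):
    """Most frequent words across the tokenized documents of one cluster."""
    counts = Counter()
    for tokens in cluster_docs:
        counts.update(tokens)
    return counts.most_common(n)


# Hypothetical tokenized documents belonging to one cluster.
cluster = [["insulin", "diet", "sugar"], ["insulin", "sugar"], ["insulin"]]
print(top_terms(cluster))
```

The resulting (word, count) pairs are the kind of input a word cloud or bar chart needs.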

Discussion
The clusters are labelled manually for all three diseases. The most prominent keywords of each cluster, extracted as a result of the LDA process, are tabulated for all three diseases. The sample terms of each cluster are selected based on their inter-relationships between all three diseases. Keywords about the diseases are listed in Tables 1-3.
Side effects, Habits and Healthy Lifestyle are the clusters found in common across all three diseases. Based on this inference, the relationship between the three chronic diseases is found.

Side effects: The Side effects cluster groups the problems faced by the patients of each disease. The interpretation is that the patients of all three diseases face common health problems even though the causes of the three diseases differ.
Habits: The Habits cluster describes the preventive activities that patients should carry out to prevent all three chronic diseases.
Healthy Lifestyle: The Healthy Lifestyle cluster describes the activities that patients should carry out to recover from the diseases and to prevent death.
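Finding the clusters shared across diseases amounts to a set intersection over the cluster labels, which is also what a Venn diagram visualizes. In the sketch below only "Side effects", "Habits" and "Healthy Lifestyle" are taken from the text; the other labels are hypothetical placeholders added so that the sets differ.

```python
# Cluster labels per disease. Only the three shared labels come from the
# text; the "placeholder-*" entries are hypothetical.
cluster_labels = {
    "diabetes mellitus": {"Side effects", "Habits", "Healthy Lifestyle",
                          "placeholder-1", "placeholder-2"},
    "tuberculosis": {"Side effects", "Habits", "Healthy Lifestyle",
                     "placeholder-3"},
    "thyroid disease": {"Side effects", "Habits", "Healthy Lifestyle",
                        "placeholder-2"},
}


def common_themes(labels_by_disease, diseases):
    """Cluster labels shared by every disease in the given list."""
    return set.intersection(*(labels_by_disease[d] for d in diseases))


shared_by_all = common_themes(cluster_labels, list(cluster_labels))
# Labels shared by diabetes mellitus and thyroid disease but not by all three.
dm_thyroid_only = (
    common_themes(cluster_labels, ["diabetes mellitus", "thyroid disease"])
    - shared_by_all
)
print(sorted(shared_by_all))
print(sorted(dm_thyroid_only))
```

The first set corresponds to the central region of a three-way Venn diagram and the second to the region shared by only two of the diseases.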
Side effects, Habits and Healthy Lifestyle are three clusters found in common among all three chronic diseases. These three clusters and their respective keywords depict the prominent inter-relationships between diabetes mellitus, tuberculosis and thyroid disease. A Venn diagram, shown in Figure 5, is used to analyze the common themes among all three diseases; it also illustrates the similarity between diabetes mellitus and thyroid disease. The causes and impacts of the three chronic diseases are different, but the cluster similarity among them indicates inter-relationships between the three diseases.
The accuracy score of a keyword is measured based on the number of keywords extracted and is mapped against reports from the world's trusted organizations. Keywords of each cluster extracted from all diseases are compared with these reports. The World Health Organization (WHO), the National Health Service (NHS), the National Institutes of Health (NIH), the Centers for Disease Control and Prevention (CDC), the European Centre for Disease Prevention and Control (ECDC), the National Centre for Disease Control and Prevention (NCDC), the American Diabetes Association (ADA), the American Thyroid Association (ATA), Women's Health, MedlinePlus, WebMD and Healthline are the twelve trusted organizations considered. Accuracy scores of each cluster's keywords, compared with the reports of these organizations, are tabulated.
The comparison results illustrate that the keywords of all clusters extracted from all disease datasets are accurate and can be interpreted to have a factual meaning. The sample keywords from each cluster are compared with the reports of the 12 organizations mentioned. Based on the occurrence of the keywords, the accuracy of each cluster is measured as a percentage, and the results are tabulated in Tables 4-6. Comparison with the American Diabetes Association reports gives 94.8% accuracy for the keywords of diabetes mellitus (DM); the majority of diabetes mellitus keywords match ADA reports. The thyroid keywords map well onto the American Thyroid Association reports, producing 92.3% overall accuracy. The tuberculosis sample keywords mostly match the National Institutes of Health reports, with an overall accuracy of 90.5%.
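One plausible reading of this occurrence-based accuracy is the percentage of a cluster's keywords that appear in an organization's report. The sketch below implements that reading with simple case-insensitive substring matching; the matching rule, the keywords and the report text are all hypothetical, and the study's exact scoring procedure may differ.

```python
def keyword_accuracy(keywords, report_text):
    """Percentage of keywords that occur in the report text."""
    text = report_text.lower()
    matched = [kw for kw in keywords if kw.lower() in text]
    return 100.0 * len(matched) / len(keywords)


# Hypothetical cluster keywords and report excerpt.
keywords = ["insulin", "glucose", "exercise", "goiter"]
report = "Insulin helps control blood glucose. Regular exercise is advised."
print(keyword_accuracy(keywords, report))
```

Substring matching is deliberately crude: it ignores word boundaries, synonyms and inflected forms, so a real comparison would likely need tokenization or stemming first.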

Conclusions
This framework helps general users and patients to obtain knowledge about all three chronic diseases in the form of causes, medications, side effects and remedies. The lack of effective analysis tools to discover hidden relationships and trends among these diseases led us to propose a model that uses technological advancements in text mining to develop a prediction, detection and treatment model for the chronic diseases problem. The completed analysis and findings reduce the burden on physicians and encourage physicians to conduct health programs that raise awareness in society. The dataset is collected for only three years from

Figure 5. A Venn diagram summarizing the inter-relationships between the three chronic diseases.

Chi-square test: $\chi^2 = \sum \frac{(\text{Observed frequency of words} - \text{Expected frequency of words})^2}{\text{Expected frequency of words}}$, where:
• Observed frequency is the number of observations of words in a document.
• Expected frequency is the number of expected observations of words in a document if there is no relationship between features.

Table 4. Comparison of DM keywords with different healthcare organizations.

Table 5. Comparison of thyroid keywords with different healthcare organizations.

Table 6. Comparison of TB keywords with different healthcare organizations.