Classification supporting COVID-19 diagnostics based on patient survey data

Distinguishing COVID-19 from other flu-like illnesses can be difficult because the symptoms are ambiguous and doctors still have limited experience with the disease. At the same time, it is crucial to filter out those sick patients who do not need to be tested for SARS-CoV-2 infection, especially when the number of cases is growing overwhelmingly. As part of the presented research, logistic regression and XGBoost classifiers that allow for effective screening of patients for COVID-19 were generated. Each of the methods was tuned to achieve an assumed acceptable threshold of negative predictive value during classification. Additionally, an explanation of the obtained classification models is presented. The explanation enables users to understand the basis of the decision made by the model. The obtained classification models provided the basis for the DECODE service (decode.polsl.pl), which can serve as support in screening patients for COVID-19. Moreover, the data set constituting the basis for the analyses performed is made available to the research community. This data set, consisting of more than 3,000 examples, is based on questionnaires collected at a hospital in Poland.


Introduction
The outbreak of the COVID-19 pandemic has created new challenges for the physicians around the world. A new disease requires the development of new drugs, procedures and approaches to treating patients. Additionally, diagnostics may be a challenge, especially when there are few obvious and unambiguous symptoms differentiating the disease from other infections. The list of symptoms associated with COVID-19 does not allow for unequivocal distinction of COVID-19 from other influenza-like illnesses. Limited testing capacity is one of the biggest problems the healthcare systems are facing. Hence, it is crucial to filter out those sick patients who do not need to be tested for SARS-CoV-2 infection.
The diagnosis of the patient's condition requires experience supported by reliable, unambiguous and straightforward diagnostic methods. As long as rapid, inexpensive and reliable testing is not available to support physicians in their work, experience is imperative. Experience can be gained by analysing the cases of successive patients arriving at the hospital. It can also be enriched by a comprehensive analysis presenting conclusions for a larger set of data.
Literature studies show that machine learning methods have been verified and have given good results for a number of diagnostic tasks [1]. Various types of classification models were studied in the past for the diagnosis of patients with diseases such as Ebola [2], HIV [3], heart disease [4], cancer [5] or diabetes [6].
Referring to the listed applications of machine learning in diagnostics, it can be concluded that prognostic models can provide a valuable summary of clinical knowledge and can be useful when such expertise is unavailable [2].
The choice of methods supporting patient classification and diagnosis may depend on the type of data analysed. For example, diagnostics based on image analysis currently uses mainly deep learning methods, e.g. [5,7]. Other types of data, including data collected from questionnaires, are analysed using various methods, which can be divided [8] into statistical approaches, represented by logistic regression, and data-driven machine learning methods (e.g. decision tree based methods, SVM, Naïve Bayes, kNN). Many of the works, however, verify and compare the quality of several different models, e.g. [4,9,10].
The main motivation for this work was the interest of the medical community in what characterizes COVID-19 (e.g. as opposed to influenza) and whether it is possible to create classifiers of acceptable quality that can support diagnostics based on survey questions to identify people suffering from COVID-19. Such solutions can be especially valuable for new diseases, when there is little experience available and when the number of patients is growing rapidly and becoming overwhelming for the healthcare system, as in the case of COVID-19.
Another motivation was the need to learn and understand the COVID-19 disease from the broadest possible perspective. As research results show [11], the course of the disease may, for various reasons, differ between populations, and therefore the analysis of data from Poland may be valuable.
The study aims to provide the medical community, especially family doctors, with a tool to distinguish patients suffering from other common infections from individuals who are suspected of having COVID-19 and need to be further diagnosed with more advanced molecular methods. The main intention was thus to limit the number of patients referred for genetic testing while missing as few COVID-19(+) cases as possible.
Therefore, the aim of this study is to verify two classification approaches as screening methods that can support the diagnosis of sick patients with COVID-19. The classification is intended to be based on the symptoms identified during the initial assessment of the patient condition. The methods utilised in the research are the classifiers that are the most popular and efficient in medical and tabular data analysis. Another goal of this work is to provide a new data set enriching the knowledge on the COVID-19 disease and to share the results of analysis concerning both the data and the generated classification models.
The contribution of this work is threefold. First, it consists of generating classifiers that are tuned to an assumed acceptable threshold of negative predictive value so that the results allow for effective screening of patients for COVID-19. Second, the contribution contains an explanation of the obtained classification models, enabling users to understand the basis of the decision made by the model. Finally, in conjunction with this analysis, a new data set is shared, based on questionnaires collected in a hospital in Poland. This last part of the contribution consists of the data set preparation, processing, characterization and visualization leading to the identification of COVID-19 characteristics. It is assumed that providing a data set along with its characterization and the predictive analysis performed on this set will be a valuable starting point for further analysis and meta-analysis.
The performed analysis resulted in an online service (decode.polsl.pl) available to anyone who needs support in COVID-19 diagnostics.
The structure of this paper is as follows. Section 2 presents an overview of previous research related to the presented topics. Section 3 outlines the characteristics of the shared data set being the basis for classification models. Section 4 presents data preparation steps and two applied approaches to classifier generation and tuning. Section 5 focuses on the evaluation of the models created. Section 6 presents the discussion and explanation of the results obtained. Section 7 outlines the developed web application that makes an initial diagnosis based on the given symptoms. Section 8 concludes the paper.

Related work
Due to the great involvement of the scientific community in the research aimed at understanding the SARS-CoV-2 virus and the COVID-19 disease, many studies have recently been published dealing with this issue from different perspectives. Many works on the use of machine learning methods to diagnose COVID-19 have been covered in review articles such as [12], [7] and [13].
The work [12] presents the analysis of the research dynamics in the field between January and May 2020. In this work, based on the frequency of occurrence of various methods, the following classes of solutions were distinguished: Deep Learning approaches (CNN, LSTM and others), Mathematical and Statistical methods, Random Forest, SVM, and Others (e.g. Linear Regression, XGBoost). The described approaches were applied to various data types with a predominance of X-ray images and achieved good results. However, the lack of real data was explicitly highlighted in the work.
An extensive meta-analysis (107 studies with 145 models) of the works published between January and April 2020 is presented in [7]. In this study, three classes of predictive models are distinguished, and these are models for use in the general population, for COVID-19 diagnosis and prognosis. Among 91 diagnostic models, 60 focus on image analysis, while 9 predict the disease severity.
Out of the remaining 22 works presenting diagnostic models not based on imaging, only two studies were based on the data sets containing more than 1000 examples.
The review [13] discusses, among other topics, machine learning in COVID-19 screening and treatment. The examples of approaches presented in this work focus on diagnostics based on image and clinical data (e.g. blood test results).
Among the studies presenting diagnostic machine learning-based models [7], the works presented below can be distinguished as the most interesting due to the size of a data set, questionnaire-based data features and utilised analytical approaches.
The work [14] presents a statistical analysis of data attributes and the generation of a multivariate logistic regression model on a data set consisting of 1702 individuals (579 were SARS-CoV-2 positive and 1123 negative), whose data were collected through an online application. The attributes describing each example included personal characteristics (sex, age, BMI) and flu-like symptoms (e.g. fever, persistent cough, loss of taste and smell, etc.).
The study [15] presents an approach in which a multiple logistic regression model was generated based on routinely collected surveillance data. The data set that was the basis of the analysis consisted of 5739 patient records (1468 were SARS-CoV-2 positive and 4271 negative) collected in Brazil.
There are numerous reported data sets related to COVID-19 [16]. Another data set [18] was collected at the Hospital Israelita Albert Einstein in Sao Paulo, Brazil. Samples were collected from hospital patients to perform the SARS-CoV-2 RT-PCR and additional laboratory tests. The created data set contains 5644 examples, of which 10% were SARS-CoV-2 positive. The attributes consist of virus, blood and urine test results and internal assignment to a hospital ward. This data set was analysed in [19] and part of it was analysed in [20]. The study [19] [...] patients that are likely to require intensive care. Additionally, within each direction, the feature importance was identified.
In response to the growing demand for information on modern IT services developed in the healthcare sector, and in order to coordinate activities undertaken in this regard in Europe, the mHealth Innovation and Knowledge Hub was created on the initiative of WHO ITU/Andalusian Regional Ministry of Health [21]. It was established to collect and share experiences on modern e-medicine solutions and to support countries and regions in implementing large-scale activities in this regard. mHealth initiatives are especially important now that the world is facing SARS-CoV-2. Many governments, companies, and citizens' movements have developed various mHealth solutions to keep the population informed and help manage the crisis situation. A repository of such solutions developed in Europe can be found in [22]. It is a dynamic resource that is updated as additional tools are reported.
Among the many services created around the world, some help users find out whether they may have been exposed to the coronavirus, in order to reduce the spread of the virus [23,24,25]. Others help patients better understand the mechanisms of the disease or serve as channels of updated information on regional regulations, including territory-specific restrictions [26,27]. Some of the solutions provide the most up-to-date research findings and information, including the latest data on COVID-19 diagnosis and treatment, thereby helping medical personnel make informed clinical decisions [28].
There are also applications that, similarly to the service presented in this study, support the diagnosis of sick patients with SARS-CoV-2 infection based on the symptoms of the disease. For example, the solutions presented in [29,30,31,32,33,34,35,36] allow users to self-assess the possible symptoms of this infectious disease and to learn about the recommendations to be followed. Some of them report an estimated risk of SARS-CoV-2 infection as a result; however, they do not disclose the methods used in these estimates. In addition, some of the mentioned applications are available in only one language version (e.g. in Italian [35]) or are intended for use in a given region or country and require selecting the name of a specific locality/province when completing the questionnaire [36].
Others are not publicly available and require a special social security number assigned by the local Health Service to activate the application [37]. Some, in turn, are targeted only at medical staff [38].
A relatively small group of applications [39,40] are those developed as a result of published research [41,42]. The corresponding works concerned COVID-19 diagnostics using various machine learning methods. A common feature of the data analysed in these works was the presence of attributes describing blood indices.
The state-of-the-art survey presented above shows that there exist numerous approaches proving the usability of classification models in COVID-19 diagnostics. However, to the best of the authors' knowledge, no approach reported as a research paper relies solely on questionnaire data, which can be valuable when the healthcare system is overloaded or clinical tests are not available. There are numerous online applications performing symptom-based diagnosis; however, no information on the applied methodology is given.
Another conclusion is that there are freely available data sets on the Internet; however, sets containing data of more than 1000 patients are scarce and it is still important to enrich the available data representation. Additionally, to the best of the authors' knowledge, no questionnaire-based data sets collected in Poland have been reported, and it is worth filling this gap.

Data set
The data set was collected at Specialised Hospital No. 1 in Bytom, Poland between 21st February 2020 and 30th September 2020 thanks to the cooperation of hospital staff and data scientists involved in the project.
Data acquisition was a process starting at the hospital where people arrived in order to be diagnosed. Before any examination or testing, each person was asked to fill in a questionnaire containing questions relevant to COVID-19 diagnosis. Initially, the survey contained a few questions about the basic symptoms, namely the occurrence of a temperature exceeding 38 °C, cough and dyspnoea. With time and increasing knowledge about the COVID-19 characteristics, the survey was enriched with further questions and options. When the final diagnosis resulting from the SARS-CoV-2 test was known, the survey was anonymised, scanned, tagged with the SARS-CoV-2 test result and sent to the database of survey images. Next, the information encapsulated in each survey was transformed into a feature vector, i.e. an example in our data set.
The questionnaires were filled in by patients themselves. People waiting to be admitted to the hospital for further tests may feel unwell and anxious, which affects the consistency and quality of completing the questionnaires. Therefore, a synthetic and unambiguous form of the survey was developed, which is now utilised within the web application (see Fig. 5) being one of the results of this research. Nevertheless, due to such manual nature of data collection, the data was subject to additional inspection in order to remove errors introduced while completing the survey.
The collected data set consisted of 3114 patient records. Each patient recorded in the collected data was described by 32 attributes. Within these attributes the following classes of patient characteristics can be identified:
• 18 attributes describing symptoms,
• 7 attributes listing comorbidities,
• 3 attributes representing the patient's condition.
The other attributes include epidemiological attributes such as: age, sex, blood group and contact with infection. Each patient was classified by two conditions:
• Symptoms: Healthy / Sick
The quantitative characteristics of the collected sample from the population, on the basis of the classification listed above, are presented in Table 1.
In the presented study patients who experienced disease symptoms (Sick) were taken into consideration, i.e. 1941 people. From this group, 577 patients,

Data preparation
The first step of the initial data preparation was related to the aim of the conducted research, which was the screening of sick patients to help identify SARS-CoV-2 infections and, more importantly, to filter out only patients with other infections, who do not require molecular tests. Having the defined goal in mind, the subset of patients identified as sick was selected from the collected data set presented in Table 1 and this subset was used in further analysis.
Next, because there were many missing values in the collected data, it was necessary to select the attributes and cases used to generate the best possible classification model. In order to perform data selection, the data set was transformed into a binary representation where missing records were marked with the value of 1 and complete records with the value of 0. For the "contact with infected person" variable, a missing value was treated as a negative answer due to the survey construction: patients were asked to tick the "Yes" option only if they knew they had been exposed to SARS-CoV-2. Next, hierarchical clustering with the Hamming distance [43] and McQuitty agglomeration [44] was performed for features and patients separately. Finally, the cluster of patients with the highest level of missing information was removed from the data set and the cluster of features with the highest level of missing data was not considered in further analysis. The data subjected to the selection are illustrated in Fig. 1, with missing and complete information marked in red and bright green, respectively. The dendrograms used for the selection are shown at the sides of the figure. Branches highlighted in red correspond to patients or features excluded from further analysis due to high levels of missing data.
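The selection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the toy missingness matrix, the function names and the choice of cutting each dendrogram into two clusters are all hypothetical. McQuitty agglomeration is implemented here as the WPGMA update, in which the distance from a merged cluster to any other cluster is the plain average of the two previous distances.

```python
from itertools import combinations


def hamming(u, v):
    """Fraction of positions at which two equal-length 0/1 vectors differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def mcquitty_clusters(vectors, n_clusters):
    """Merge vectors bottom-up until n_clusters remain (McQuitty / WPGMA linkage)."""
    clusters = {i: [i] for i in range(len(vectors))}
    dist = {frozenset(p): hamming(vectors[p[0]], vectors[p[1]])
            for p in combinations(range(len(vectors)), 2)}
    next_id = len(vectors)
    while len(clusters) > n_clusters:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        for k in clusters:
            if k != i and k != j:
                # McQuitty update: average of the two old distances
                dist[frozenset((next_id, k))] = (
                    dist[frozenset((i, k))] + dist[frozenset((j, k))]) / 2
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return [sorted(members) for members in clusters.values()]


def most_missing_cluster(vectors, n_clusters=2):
    """Return the cluster whose members have the highest share of missing values.

    n_clusters=2 is an assumed cut of the dendrogram, chosen for illustration.
    """
    def missing_rate(members):
        return sum(sum(vectors[m]) for m in members) / (len(members) * len(vectors[0]))
    return max(mcquitty_clusters(vectors, n_clusters), key=missing_rate)


# Toy missingness matrix (hypothetical): rows = patients, columns = features,
# 1 = missing answer, 0 = complete answer.
missing = [
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1],
]

patients_to_drop = most_missing_cluster(missing)
features_to_drop = most_missing_cluster([list(col) for col in zip(*missing)])
print("patients to drop:", patients_to_drop)
print("features to drop:", features_to_drop)
```

Clustering patients and features separately on the same matrix mirrors the description in the text. With SciPy available, `scipy.cluster.hierarchy.linkage` with `method="weighted"` and `metric="hamming"` should build the same kind of WPGMA dendrogram.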

Classification approaches
Within the conducted research, two classifiers developed by two separate research groups were verified. It was decided that one of the model generation methods would be logistic regression [45], representing statistical approaches, while the second would be the XGBoost method [46], implementing the Gradient Boosting model, which is a leading data-driven machine learning approach. The initial data set was common to both approaches; however, the models were generated in separate processes. Therefore, different data preparation and feature selection steps could be applied, and hence different splits into training and test data within the model generation process were possible. However, both generated models were evaluated on a common test data set and on an additional collection of data available online [17].
In a diagnostic test, the weight can be equally distributed between NPV and PPV. However, a screening test is applied here; thus NPV should carry more importance than PPV in order to minimise the number of undetected COVID-19(+) patients (false negatives). This should be reflected in the value of the weight w ∈ [0, 1].
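The role of the weight w can be made concrete with a weighted harmonic mean of the two predictive values. The form below is an assumption made for illustration: it reads the abbreviation WHM used in this paper as a weighted harmonic mean and places w on the NPV term, which may differ in detail from the study's own equation.

```latex
\mathrm{WHM}_w(\mathrm{NPV}, \mathrm{PPV})
  = \left( \frac{w}{\mathrm{NPV}} + \frac{1 - w}{\mathrm{PPV}} \right)^{-1},
  \qquad w \in [0, 1]
```

Under this assumed form, w = 0.5 gives the ordinary harmonic mean 2 · NPV · PPV / (NPV + PPV), while values of w closer to 1 make the score depend mostly on NPV, as required for a screening test.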

Logistic regression based approach
For the features remaining after the initial data selection, the effect size was calculated (Cramér's V and Rank Biserial Correlation for discrete and continuous features, respectively) [49,50]. Only features with at least a small effect were retained. Next, the logistic regression model was built with the forward feature selection method. In each step, a new attribute was added to the model based on the selection criterion, which in this case was the Bayes Factor. New attributes were added until the Bayes Factor value decreased below 1, which is described as "barely worth mentioning evidence" [51], [52]. The candidate attributes were either the five previously selected features or their pairwise interactions.
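The two effect sizes named above can be computed directly from their textbook definitions. The sketch below is illustrative only: the toy vectors, the function names and the sign convention of the rank-biserial correlation (positive when the COVID-19(+) group tends to have larger values) are assumptions, and the threshold of 0.1 for a "small" effect is a common convention rather than a value taken from the study.

```python
from collections import Counter
from math import sqrt

SMALL_EFFECT = 0.1  # assumed conventional threshold for a "small" effect


def cramers_v(x, y):
    """Cramér's V between two discrete variables given as equal-length lists."""
    n = len(x)
    joint, row, col = Counter(zip(x, y)), Counter(x), Counter(y)
    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    chi2 = sum((joint[(a, b)] - row[a] * col[b] / n) ** 2 / (row[a] * col[b] / n)
               for a in row for b in col)
    return sqrt(chi2 / (n * k))


def rank_biserial(values, labels):
    """Rank-biserial correlation of a continuous feature against a 0/1 label.

    Computed as (favourable pairs - unfavourable pairs) / all pairs, where a
    pair joins one positive-label value with one negative-label value.
    """
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    balance = sum((p > q) - (p < q) for p in pos for q in neg)
    return balance / (len(pos) * len(neg))


# Hypothetical toy data: test result, one binary symptom, one continuous feature.
covid = [1, 1, 1, 0, 0, 0, 1, 0]
smell_loss = [1, 1, 1, 0, 0, 0, 0, 1]
days_of_symptoms = [5, 7, 9, 2, 3, 6, 4, 1]

v = cramers_v(smell_loss, covid)
r = rank_biserial(days_of_symptoms, covid)
kept = [name for name, effect in [("smell_loss", v), ("days_of_symptoms", abs(r))]
        if effect >= SMALL_EFFECT]
print(v, r, kept)
```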
As the logistic regression model provides the probability that an observation belongs to a particular category (in this case COVID-19(+)), a cut-off value of the probability must be determined to classify each patient. The cut-off probability was chosen to maximize the WHM (1) with a weight w equal to 0.85. The enhanced importance of NPV reduces the risk of false negative observations, in line with the purpose of a screening test. Moreover, the applied weight reflects standards in the design of medical screening tests, where NPV > 90% and PPV > 30% are expected [53]. For medical reasons, it is crucial to avoid excluding COVID-19(+) patients from further diagnostic procedures, while molecular testing for SARS-CoV-2 infection of some COVID-19(-) patients is both acceptable and inevitable. The cut-off value could only be selected from the interval 0.1-0.9. Finally, the quality of prediction for the training and validation sets was calculated.
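The cut-off selection can be sketched as a simple grid search. The example below is a hedged illustration rather than the study's implementation: it assumes that WHM is a weighted harmonic mean of NPV and PPV with weight w on NPV, it uses a hypothetical grid step of 0.01 over the interval 0.1-0.9, and the predicted probabilities and labels are invented toy values.

```python
def predictive_values(probs, labels, cutoff):
    """Return (NPV, PPV) when patients with probability >= cutoff are flagged positive."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        if p >= cutoff:
            tp, fp = tp + (y == 1), fp + (y == 0)
        else:
            tn, fn = tn + (y == 0), fn + (y == 1)
    npv = tn / (tn + fn) if tn + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return npv, ppv


def whm(npv, ppv, w):
    """Weighted harmonic mean of NPV and PPV (assumed form, weight w on NPV)."""
    if npv == 0 or ppv == 0:
        return 0.0
    return 1.0 / (w / npv + (1 - w) / ppv)


def best_cutoff(probs, labels, w=0.85):
    """Pick the cut-off in [0.1, 0.9] that maximises WHM; the 0.01 step is an assumption."""
    grid = [round(0.1 + 0.01 * i, 2) for i in range(81)]
    return max(grid, key=lambda c: whm(*predictive_values(probs, labels, c), w))


# Hypothetical model outputs and true labels (1 = COVID-19(+)).
probs = [0.05, 0.15, 0.20, 0.30, 0.35, 0.45, 0.60, 0.70, 0.85, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

cutoff = best_cutoff(probs, labels)
npv, ppv = predictive_values(probs, labels, cutoff)
print(cutoff, npv, ppv)
```

With w = 0.85 the search favours cut-offs that leave no COVID-19(+) patient below the threshold, even at the cost of some false positives, which matches the screening purpose described in the text.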
After the MRCV procedure, the feature ranking was prepared based on the feature significance in the model and the model's quality characterized by the WHM. The parameters of the final model are presented in Table 3 and its performance calculated on the whole training data set is presented in Table 4.

Gradient Boosting based approach
This approach was designed to use all available attributes as well as pairwise interactions between binary features. If two features a and b were binary, their interaction was computed as the logical OR of a and b. As a result, a feature set comprising the features listed in Table 5 and their (OR) interactions between binary features was obtained. The value of the sum of symptoms feature was calculated on the basis of 11 disease symptoms included in the training set. After selecting the features, the next step in the analysis was the optimisation of the XGBoost algorithm parameters. It was carried out using the autoxgboost [54] R package. This time the F1 measure was used as the optimization criterion in order to increase the Sensitivity and PPV of the final classifier.
The performance of the model obtained in this way, calculated on the whole training data set, is presented in Table 6.
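The feature construction used in this approach (OR interactions between binary features and a symptom count) can be illustrated as follows. This is a sketch under assumptions: the feature names, the record layout and the helper functions are hypothetical, and only three symptoms are used instead of the 11 mentioned in the text.

```python
from itertools import combinations


def add_or_interactions(record, binary_features):
    """Return a copy of record extended with a logical OR for every pair of binary features."""
    out = dict(record)
    for a, b in combinations(binary_features, 2):
        out[f"{a}_OR_{b}"] = int(bool(record[a]) or bool(record[b]))
    return out


def add_symptom_sum(record, symptom_features):
    """Return a copy of record extended with the number of reported symptoms."""
    out = dict(record)
    out["sum_of_symptoms"] = sum(record[s] for s in symptom_features)
    return out


# Hypothetical patient record; 1 = symptom reported, 0 = not reported.
symptoms = ["cough", "fever", "dyspnoea"]
patient = {"cough": 1, "fever": 0, "dyspnoea": 0, "age": 54}

features = add_symptom_sum(add_or_interactions(patient, symptoms), symptoms)
print(features)
```

The extended records could then be passed to any gradient boosting implementation; the parameter optimisation itself is left out here because it depends on the tuning package being used.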

Evaluation of created models
Both classification models presented in Section 4 were finally evaluated on additional data sets. The characteristics of the first test data set were presented in Section 3. This data set was created as a result of a data split into training and test data (Table 2) and it consists of 577 examples. This data set will be referred to hereafter as PL.
The second data set was presented in Section 2 as the data set available online [17]. This data set required an initial transformation, which consisted of removing all patients who did not show any of the diagnostic symptoms used by the generated models and available in this collection (contact with infected person, number of days of symptoms, temperature > 38 °C, cough, dyspnoea, muscle aches, loss of smell or taste, sore throat, headache). The characteristics of the obtained data set are presented in Table 7. This data set will be referred to hereafter as US.
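The transformation of the external data set amounts to a simple filter. The sketch below assumes a hypothetical record layout (one dictionary per patient, with missing answers stored as None or absent) and hypothetical field names derived from the symptom list above.

```python
# Field names are hypothetical; they follow the symptom list given in the text.
DIAGNOSTIC_FIELDS = [
    "contact_with_infected_person", "days_of_symptoms", "temperature_above_38",
    "cough", "dyspnoea", "muscle_aches", "loss_of_smell_or_taste",
    "sore_throat", "headache",
]


def shows_any_symptom(record, fields=DIAGNOSTIC_FIELDS):
    """True if at least one diagnostic field is present and non-zero."""
    return any(record.get(name) for name in fields)


# Toy records (hypothetical): the second patient reports nothing and is removed.
records = [
    {"id": 1, "cough": 1, "headache": 0},
    {"id": 2, "cough": 0, "headache": 0, "days_of_symptoms": None},
    {"id": 3, "days_of_symptoms": 4},
]

symptomatic = [r for r in records if shows_any_symptom(r)]
print([r["id"] for r in symptomatic])
```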
The results of applying the XGBoost classifier to the PL test data set are presented in Tables 8 and 11. The final logistic regression model, presented in Table 3, was tested on the reduced PL data set. In this set, 538 examples had full information for the model features, which is required by the logistic regression classifier. The logistic regression model performance on this data set is presented in Tables 9 and 11. Further results for the US data set are presented in Tables 13 and 15. The results of XGBoost tested on the reduced US data set are presented in Tables 14 and 15.

Discussion
It is known from the literature [56,57] that loss of taste or smell is frequently used as an early indicator of SARS-CoV-2 infection, and therefore it is not surprising that having it results in the positive decision (COVID-19(+)) in almost all cases for both classifiers. The only exception to this is a case of the tree generated for the logistic regression model, where having more than 13 days of symptoms together with loss of taste or smell and temperature be- [...] In general, the decision tree approximating the logistic regression model is

DECODE service
DECODE is a symptom checker tool to assist patients and family doctors in preliminary screening and early detection of COVID-19.
In the DECODE service, a patient can fill out a questionnaire regarding his/her health condition and obtain a preliminary assessment of the possibility of being sick with COVID-19 and, consequently, of the need to be tested for SARS-CoV-2.
The questions (Fig. 5) cover many problem areas related to possible COVID- The link that allows the patient to return to the survey is active for 14 days.
Using this link the patient can, for example, report the next symptoms, if they occur. In particular, when the patient knows what the PCR test results are, he/she can share this information and help improve the service.
DECODE is available in two language versions (Polish and English) and is integrated with the CIRCA diagnostic service, which was also developed at the Silesian University of Technology to support and accelerate COVID-19 imaging diagnostics. [...] disease (referred to in the paper as COVID-19(+)), as well as symptomatic patients without SARS-CoV-2 infection (suffering from other diseases, e.g. influenza, denoted as COVID-19(-)). In addition to information about symptoms, the data set includes information on comorbidities (e.g. hypertension, diabetes, etc.). However, comorbidities were not analysed in the studies described in this paper. The data set is available to a wide group of researchers and it is a significant data repository describing COVID-19 symptoms in the Slavic population.
Further work will focus on improving the classifiers to increase their specificity (PPV and Specificity). This goal will be achieved by gathering a larger set of examples obtained from the DECODE webservice users and through cooperation with a greater number of hospitals. The developed classification models will be periodically tuned and verified on newly emerging data. Further intended research, in cooperation with hospitals, includes an analysis of the severity of the course of the COVID-19 disease, as well as a survival analysis of patients with a specific set of comorbidities. In addition, it is intended to study the course of the COVID-19 disease depending on the medications taken by patients (the DECODE webservice enables gathering this information).