# The Hierarchical Classifier for COVID-19 Resistance Evaluation


## Abstract


## 1. Introduction

- The number of tests in Ukraine is insufficient to give a real picture of the spread of the disease [3], which casts doubt on the adequacy of the data, especially at the beginning of the epidemic, during the first quarantine period. For comparison, according to the WHO report dated 3 September 2020, 1,621,697 tests had been performed in Ukraine, which is 0.48 tests per 1000 population. In the United States, 2.19 tests were performed per 1000 population. In the UAE, testing reaches almost 8 tests per 1000 population. In Germany, 1.68 tests were performed per 1000 population.
- Many people who recognize well-known symptoms do not rush to report them to their doctor but treat themselves, continuing to spread the disease; if they recover, they never appear in the official statistics.
- The rate of increase or decrease in daily morbidity should depend primarily on the actual number of active patients. However, the insufficient number of laboratory tests, which prevents timely diagnosis, and the insufficient effectiveness of diagnostic methods must also be taken into account, as both reduce the reliability of official data [4].

- a dataset from three countries (Ukraine, Germany, and Belarus) was collected, which allowed a more in-depth analysis and generalization;
- the hypothesis that patients with blood group II are more vulnerable to COVID-19 was formulated;
- the features associated with COVID-19 cases were selected using machine learning algorithms and a comparison of their results;
- the proposed hierarchical classifier, based on the combined use of unsupervised and supervised machine learning algorithms, provides accuracy 4% higher than the random forest and XGBoost algorithms.

## 2. Literature Review

## 3. Dataset Description

- Age (categorical): 1: <15, 2: 16–22, 3: 23–40, 4: 41–65, 5: >66;
- Sex (categorical): male, female;
- Region (string): Lviv (Ukraine), Chernivtsi (Ukraine), Belarus, Germany, Other;
- Do you smoke (Boolean): 2: yes, 0: no;
- Have you had COVID (categorical): 2: yes, 0: no, 1: maybe;
- IgM level (numerical): [0..0.9) (negative), [0.9..1.1) (indefinite), ≥1.1 (positive);
- IgG level (numerical): [0..0.9) (negative), [0.9..1.1) (indefinite), ≥1.1 (positive);
- Blood group (categorical): 1, 2, 3, 4;
- Are you vaccinated against influenza? (categorical): 2: yes, 0: no, 1: maybe;
- Are you vaccinated against tuberculosis? (categorical): 2: yes, 0: no, 1: maybe;
- Have you had influenza this year? (categorical): 2: yes, 0: no, 1: maybe;
- Have you had tuberculosis this year? (categorical): 2: yes, 0: no, 1: maybe.
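
The coding scheme in the list above can be sketched in a few lines. This is only an illustration: the names `ANSWER_CODES` and `antibody_status` and the sample readings are hypothetical, while the thresholds (0.9 and 1.1) and the answer codes (2, 0, 1) are the ones listed for the dataset.

```python
# Answer codes used by the questionnaire features (2: yes, 0: no, 1: maybe).
ANSWER_CODES = {"yes": 2, "no": 0, "maybe": 1}


def antibody_status(index: float) -> str:
    """Map an IgM or IgG level to the category given in the feature list.

    [0, 0.9) -> negative, [0.9, 1.1) -> indefinite, >= 1.1 -> positive.
    """
    if index < 0:
        raise ValueError("antibody level must be non-negative")
    if index < 0.9:
        return "negative"
    if index < 1.1:
        return "indefinite"
    return "positive"


# Hypothetical readings, not taken from the dataset.
for value in (0.0, 0.9, 1.05, 1.1, 4.8):
    print(value, antibody_status(value))
```

Keeping the raw numerical level alongside the derived category leaves both representations available to the classifiers.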

## 4. Materials and Methods

#### 4.1. Data Preprocessing

#### 4.2. Cluster Analysis

#### 4.2.1. COVID-19 Dataset Clustering

#### 4.2.2. Analysis of Each Cluster

#### 4.3. Classification

- 1 hidden layer with 7 neurons,
- biases are used,
- backpropagation as the learning algorithm,
- logistic activation function.
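
A network with these four properties can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `TinyNet`, the squared-error loss, the learning rate, the weight initialisation, the five inputs, and the sample vector are all hypothetical. Only the single hidden layer of 7 neurons, the biases, backpropagation, and the logistic activation come from the list above.

```python
import math
import random


def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class TinyNet:
    """One hidden layer, logistic activations, biases on both layers."""

    def __init__(self, n_inputs: int, n_hidden: int = 7, seed: int = 0):
        rng = random.Random(seed)
        # Each hidden neuron stores its bias in slot 0, then one weight per input.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        # The output neuron stores its bias in slot 0, then one weight per hidden neuron.
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [logistic(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
                  for w in self.w_hidden]
        out = logistic(self.w_out[0]
                       + sum(w * h for w, h in zip(self.w_out[1:], hidden)))
        return hidden, out

    def backprop_step(self, x, target, lr=0.5):
        """One gradient step on the squared error 0.5 * (out - target) ** 2."""
        hidden, out = self.forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas are computed from the output weights before they change.
        delta_hidden = [delta_out * self.w_out[j + 1] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        self.w_out[0] -= lr * delta_out
        for j, h in enumerate(hidden):
            self.w_out[j + 1] -= lr * delta_out * h
        for j, d in enumerate(delta_hidden):
            self.w_hidden[j][0] -= lr * d
            for i, xi in enumerate(x):
                self.w_hidden[j][i + 1] -= lr * d * xi
        return 0.5 * (out - target) ** 2


net = TinyNet(n_inputs=5)
sample = [0.2, 0.4, 0.6, 0.8, 1.0]  # hypothetical scaled feature vector
_, before = net.forward(sample)
for _ in range(50):
    net.backprop_step(sample, target=1.0)
_, after = net.forward(sample)
print(before, after)
```

Repeated steps on one sample only show that the update moves the output toward the target; a real experiment would iterate over the training set and evaluate on held-out data.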

- The support vector machine with a linear kernel achieves an accuracy of 60.5%.
- Logistic regression on the numerical data yields an Akaike information criterion (AIC) of 37.471 and an accuracy of 55.3%.
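
For reference, the two quantities reported in this list follow their standard definitions. Here $k$ denotes the number of estimated model parameters, $\hat{L}$ the maximized likelihood, and $N_{\text{correct}}$, $N_{\text{total}}$ the counts of correctly classified and all evaluated objects; these symbols are introduced only for illustration and do not appear in the source.

$$
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}}
$$

A lower AIC indicates a better trade-off between goodness of fit and model complexity.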

- whole dataset,
- dataset by countries,
- selected features,
- each cluster separately.

#### 4.4. Hierarchical Classifier

- Using the gap statistic, the appropriate number of clusters is found; it is equal to four;
- k-means divides the objects into 4 groups, and the density of distribution is calculated;
- XGBoost and random forest are applied to each cluster separately;
- Hard voting is applied to the obtained results: the class with the highest number of votes is selected. If the votes are tied, the result of the classifier with the minimal depth value is selected.
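
The voting rule in the last step can be sketched as follows. This is a hedged illustration: the `Vote` structure, the function name `hard_vote`, the classifier names, and the depth numbers are hypothetical, and the sketch assumes that each classifier reports a single depth value that can be compared in the tie-break.

```python
from collections import Counter
from typing import NamedTuple, Sequence


class Vote(NamedTuple):
    classifier: str  # hypothetical identifier, e.g. "xgboost"
    label: int       # class predicted by this classifier
    depth: float     # depth value of this classifier (assumed to be available)


def hard_vote(votes: Sequence[Vote]) -> int:
    """Return the majority label; break ties by the smallest depth value."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(v.label for v in votes)
    top = max(counts.values())
    tied = {label for label, n in counts.items() if n == top}
    if len(tied) == 1:
        return next(iter(tied))
    # Tie: follow the classifier with the minimal depth among the tied labels.
    return min((v for v in votes if v.label in tied), key=lambda v: v.depth).label


# Two classifiers disagree, so the one with the smaller depth decides.
print(hard_vote([Vote("xgboost", 2, 6), Vote("random_forest", 0, 4)]))
```

With only two classifiers per cluster, every disagreement is a tie, so the depth-based rule decides all contested cases.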

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A

Statistic | Age | Sex | Region | Smoke |
---|---|---|---|---|
Min. | 1.000 | 1.00 | Length: 198 | 0.000 |
Max. | 5.000 | 2.00 | Class: character | 2.000 |
1st Qu. | 2.000 | 1.00 | Mode: character | 0.000 |
Median | 2.000 | 1.00 | | 0.000 |
Mean | 2.207 | 1.46 | | 0.474 |
3rd Qu. | 3.000 | 2.00 | | 0.000 |

Statistic | Covid | IgM | IgG | Blood group |
---|---|---|---|---|
Min. | 0.000 | 0.000 | 0.000 | 1.000 |
Max. | 2.000 | 9.800 | 123.300 | 4.000 |
1st Qu. | 0.000 | 0.000 | 0.000 | 2.000 |
Median | 1.000 | 2.250 | 2.625 | 2.000 |
Mean | 1.106 | 2.731 | 41.760 | 2.145 |
3rd Qu. | 2.000 | 4.825 | 99.675 | 3.000 |

Statistic | Vaccinated influenza | Vaccinated tuberculosis | Had influenza |
---|---|---|---|
Min. | 0.000 | 0.00 | 0.0000 |
Max. | 2.0000 | 2.00 | 2.0000 |
1st Qu. | 0.000 | 1.00 | 0.0000 |
Median | 0.000 | 2.00 | 0.0000 |
Mean | 0.424 | 1.46 | 0.4949 |
3rd Qu. | 0.000 | 2.00 | 1.0000 |

Model | Accuracy | Parameters |
---|---|---|
Logistic regression | 0.553 | Coefficients: listed in the coefficient table below |
Support vector machine | 0.605 | SVM-Type: C-classification; SVM-Kernel: linear; cost: 10; beta0 = −0.5491702; Number of Support Vectors: 144; `svmfit$coefs` and detailed performance results listed below |
Naive Bayes | 0.670 | 95% CI: (0.4829, 0.7658); No Information Rate: 0.6327; p-Value [Acc > NIR]: 0.5639 |
XGBoost | 0.898 | booster = "gbtree", objective = "binary:logistic", eta = 0.3, gamma = 0, max_depth = 6, min_child_weight = 1, subsample = 1, colsample_bytree = 1; test error mean: 0.1263 |
Random Forest | 0.897 | Type of random forest: classification; Number of trees: 500; No. of variables tried at each split: 2; OOB estimate of error rate: 31.82% |
Neural network | 0.820 | 1 hidden layer and 7 neurons in hidden layer with backpropagation and logistic activation function; `$neurons` output listed below |
Decision tree | 0.513 | 95% CI: (0.4829, 0.7658); No Information Rate: 0.6327; p-Value [Acc > NIR]: 0.5639 |

Logistic regression coefficients:

Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 1.7055 | 20.7048 | 0.082 | 0.938 |
Age | 1.1441 | 3.1016 | 0.369 | 0.731 |
IgG | −1.8642 | 6.9744 | −0.267 | 0.802 |
Blood group | −0.3521 | 14.4088 | −0.024 | 0.982 |
Had influenza | 0.9687 | 0.5028 | 1.927 | 0.126 |
IgM | −0.1306 | 9.2942 | −0.014 | 0.989 |

Support vector machine coefficients (`svmfit$coefs`):

Row | [,1] |
---|---|
[1,] | −0.0351190476 |
[2,] | 0.0009065892 |
[3,] | −0.3970516219 |
[4,] | −0.0062360063 |

Support vector machine, detailed performance results:

# | cost | error | dispersion |
---|---|---|---|
1 | 1e-03 | 0.25 | 0.120185 |
2 | 1e-02 | 0.25 | 0.120185 |
3 | 1e-01 | 0.25 | 0.120185 |
4 | 1e+00 | 0.25 | 0.120185 |
5 | 5e+00 | 0.25 | 0.120185 |
6 | 1e+01 | 0.25 | 0.120185 |
7 | 1e+02 | 0.25 | 0.120185 |

Neural network, `$neurons[[1]]` (the unnamed first column holds the constant 1):

Row | | x1 | x2 | x3 | x4 | x5 |
---|---|---|---|---|---|---|
2 | 1 | 0.83886256 | 0.00 | 0.02793296 | 0.03582645 | 0.1794872 |
8 | 1 | 0.22748815 | 0.82 | 0.54748603 | 0.35181777 | 0.4487179 |
11 | 1 | 0.09004739 | 0.92 | 0.56145251 | 0.30235390 | 0.7179487 |
13 | 1 | 0.12322275 | 0.34 | 0.83798883 | 0.50076039 | 0.9358974 |

Neural network, `$neurons[[2]]`:

Row | [,1] | [,2] | [,3] | [,4] | [,5] | [,6] | [,7] | [,8] |
---|---|---|---|---|---|---|---|---|
2 | 1 | 0.4205161 | 0.9336224 | 0.6383656 | 0.054238905 | 0.9657629 | 0.6551819 | 0.9310445 |
8 | 1 | 0.5645997 | 0.9994024 | 0.9742761 | 0.002255419 | 0.9983305 | 0.9231495 | 0.9656552 |
11 | 1 | 0.5289330 | 0.9996923 | 0.9852911 | 0.001609122 | 0.9989139 | 0.9578591 | 0.9696302 |
13 | 1 | 0.6656213 | 0.9998608 | 0.9908708 | 0.001052506 | 0.9994693 | 0.9543330 | 0.9684360 |

Neural network, `$neurons[[3]]`:

Row | [,1] | [,2] |
---|---|---|
2 | 1 | 1.330679e-04 |
8 | 1 | 3.060782e-05 |
11 | 1 | 2.867471e-05 |
13 | 1 | 2.672559e-05 |

## References

1. Roser, M.; Ritchie, H.; Ortiz-Ospina, E.; Hasell, J. Coronavirus Pandemic (COVID-19). Our World in Data. **2020**. Available online: https://ourworldindata.org/coronavirus?utm_campaign=Optimizando&utm_medium=email&utm_source=Revue%20newsletter (accessed on 15 January 2021).
2. News. Available online: https://nszu.gov.ua/en/novini/oficijnij-sajt-nacionalnoyi-sluzhbi-zdorovya-ukrayini-staye-19 (accessed on 27 October 2020).
3. Coronavirus Tests: More Than a Million PCR Tests Performed in Ukraine. Slovo i Dilo (in Ukrainian). Available online: https://www.slovoidilo.ua/2020/09/04/infografika/suspilstvo/pandemiya-koronavirusu-skilky-testiv-zrobyly-ukrayini-ta-inshyx-krayinax-svitu (accessed on 5 January 2021).
4. Vyklyuk, Y.; Manylich, M.; Škoda, M.; Radovanović, M.M.; Petrović, M.D. Modeling and Analysis of Different Scenarios for the Spread of COVID-19 by Using the Modified Multi-Agent Systems – Evidence from the Selected Countries. Results Phys. **2020**, 103662.
5. Izonin, I.; Tkachenko, R.; Verhun, V.; Zub, K. An Approach towards Missing Data Management Using Improved GRNN-SGTM Ensemble Method. JESTECH. in press.
6. Jiang, C.; Yao, X.; Zhao, Y.; Wu, J.; Huang, P.; Pan, C.; Liu, S.; Pan, C. Comparative Review of Respiratory Diseases Caused by Coronaviruses and Influenza A Viruses during Epidemic Season. Microbes Infect. **2020**, 22, 236–244.
7. Charpentier, C.; Ichou, H.; Damond, F.; Bouvet, E.; Chaix, M.-L.; Ferré, V.; Delaugerre, C.; Mahjoub, N.; Larrouy, L.; Le Hingrat, Q.; et al. Performance Evaluation of Two SARS-CoV-2 IgG/IgM Rapid Tests (Covid-Presto and NG-Test) and One IgG Automated Immunoassay (Abbott). J. Clin. Virol. **2020**, 132, 104618.
8. Muhammad, L.J.; Islam, M.M.; Usman, S.S.; Ayon, S.I. Predictive Data Mining Models for Novel Coronavirus (COVID-19) Infected Patients’ Recovery. SN Comp. Sci. **2020**, 1.
9. Ivorra, B.; Ferrández, M.R.; Vela-Pérez, M.; Ramos, A.M. Mathematical Modeling of the Spread of the Coronavirus Disease 2019 (COVID-19) Taking into Account the Undetected Infections. The Case of China. Commun. Nonlinear Sci. Numer. Simul. **2020**, 88, 105303.
10. Caruana, G.; Croxatto, A.; Coste, A.T.; Opota, O.; Lamoth, F.; Jaton, K.; Greub, G. Diagnostic Strategies for SARS-CoV-2 Infection and Interpretation of Microbiological Results. Clin. Microb. Infect. **2020**, 26, 1178.
11. Ghosal, S.; Sengupta, S.; Majumder, M.; Sinha, B. Linear Regression Analysis to Predict the Number of Deaths in India Due to SARS-CoV-2 at 6 Weeks from Day 0 (100 Cases - March 14th 2020). Diabetes Metab. Syndr. Clin. Res. Rev. **2020**, 14, 311–315.
12. Yang, Q.; Wang, J.; Ma, H.; Wang, X. Research on COVID-19 Based on ARIMA Model – Taking Hubei, China as an Example to See the Epidemic in Italy. J. Infect. Public Health **2020**, 13, 1415–1418.
13. Petukhova, T.; Ojkic, D.; McEwen, B.; Deardon, R.; Poljak, Z. Assessment of Autoregressive Integrated Moving Average (ARIMA), Generalized Linear Autoregressive Moving Average (GLARMA), and Random Forest (RF) Time Series Regression Models for Predicting Influenza A Virus Frequency in Swine in Ontario, Canada. PLoS ONE **2018**, 13, e0198313.
14. Adhikari, R.; Agrawal, R. An Introductory Study on Time Series Modeling and Forecasting. arXiv **2013**, arXiv:1302.6613.
15. Ez, M.; Ea, S.; Al, F. A SARIMA Forecasting Model to Predict the Number of Cases of Dengue in Campinas, State of São Paulo, Brazil. Rev. Soc. Bras. Med. Trop. **2011**, 44, 436–440.
16. Dehesh, T.; Mardani-Fard, H.A.; Dehesh, P. Forecasting of COVID-19 Confirmed Cases in Different Countries with ARIMA Models. medRxiv **2020**.
17. Martinez, E.Z.; Silva, E.A.S.D. Predicting the Number of Cases of Dengue Infection in Ribeirão Preto, São Paulo State, Brazil, Using a SARIMA Model. Cadernos de Saúde Pública **2011**, 27, 1809–1818.
18. Anastassopoulou, C.; Russo, L.; Tsakris, A.; Siettos, C. Data-Based Analysis, Modelling and Forecasting of the COVID-19 Outbreak. PLoS ONE **2020**, 15, e0230405.
19. Silva, P.C.L.; Batista, P.V.C.; Lima, H.S.; Alves, M.A.; Guimarães, F.G.; Silva, R.C.P. COVID-ABS: An Agent-Based Model of COVID-19 Epidemic to Simulate Health and Economic Effects of Social Distancing Interventions. Chaos Solitons Fract. **2020**, 139, 110088.
20. Sakai, H.; Okuma, A. An Algorithm for Checking Dependencies of Attributes in a Table with Non-Deterministic Information: A Rough Sets Based Approach. In Proceedings of the PRICAI 2000 Topics in Artificial Intelligence; Mizoguchi, R., Slaney, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2000; pp. 219–229.
21. Shakhovska, N.; Izonin, I.; Melnykova, N. Dataset for Covid’19 Resistance Evaluation from Ukraine, Germany and Belarus. 2020. Available online: https://www.researchgate.net/publication/344954442_Dataset_for_Covid19_resistance_evaluation_from_Ukraine_Germany_and_Belarus?channel=doi&linkId=5f9aedc8458515b7cfa7ef90&showFulltext=true (accessed on 15 January 2021).
22. Stop Covid’19 Project. Available online: https://covid-72b6d.web.app/results (accessed on 29 October 2020).
23. Markopoulos, A.P.; Georgiopoulos, S.; Manolakos, D.E. On the Use of Back Propagation and Radial Basis Function Neural Networks in Surface Roughness Prediction. J. Ind. Eng. Int. **2016**, 12, 389–400.
24. Mbuvha, R.; Marwala, T. Bayesian Inference of COVID-19 Spreading Rates in South Africa. PLoS ONE **2020**, 15.
25. CoronaTracker: World-Wide COVID-19 Outbreak Data Analysis and Prediction. Available online: https://www.researchgate.net/publication/340032869_CoronaTracker_World-wide_COVID-19_Outbreak_Data_Analysis_and_Prediction (accessed on 27 October 2020).
26. Alok, A.K.; Saha, S.; Ekbal, A. A New Semi-Supervised Clustering Technique Using Multi-Objective Optimization. Appl. Intell. **2015**, 43, 633–661.
27. Shirkhorshidi, A.S.; Aghabozorgi, S.; Wah, T.Y.; Herawan, T. Big Data Clustering: A Review. In Proceedings of the Computational Science and Its Applications – ICCSA 2014; Murgante, B., Misra, S., Rocha, A.M.A.C., Torre, C., Rocha, J.G., Falcão, M.I., Taniar, D., Apduhan, B.O., Gervasi, O., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 707–720.

**Figure 5.** Cluster objects distribution: (**a**) by age; (**b**) by blood group; (**c**) by sex; (**d**) by vaccinated influenza.

**Figure 10.** Naive Bayes plot of density by parameters: (**a**) had influenza; (**b**) vaccinated tuberculosis; (**c**) vaccinated influenza; (**d**) smoke.

# | Age | n |
---|---|---|
1 | 23–40 | 124 |
2 | 40–65 | 84 |
3 | 16–22 | 82 |
4 | >65 | 19 |
5 | <15 | 4 |

# | Sex | n |
---|---|---|
1 | Male | 178 |
2 | Female | 135 |

# | Region | n |
---|---|---|
1 | Ukraine, Lviv | 159 |
2 | Ukraine, Chernivtsi | 67 |
3 | Belarus | 56 |
4 | Germany | 27 |
5 | others | 4 |

# | COVID | n |
---|---|---|
1 | yes | 105 |
2 | no | 100 |
3 | maybe | 78 |

Class | 0 | 1 | 2 | Class Error |
---|---|---|---|---|
0 | 153 | 9 | 17 | 0.14525139 |
1 | 8 | 88 | 12 | 0.18518518 |
2 | 1 | 3 | 20 | 0.16666666 |
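
The Class Error column can be reproduced from the counts in the confusion matrix above. The sketch below assumes that each row holds the objects of one class and that the diagonal entry is the correctly classified count, which is the reading that matches the reported errors; the function names are hypothetical.

```python
# Counts copied from the confusion matrix: one row per class 0, 1, 2.
confusion = {
    0: [153, 9, 17],
    1: [8, 88, 12],
    2: [1, 3, 20],
}


def class_errors(matrix):
    """Share of each row that falls off the diagonal."""
    errors = {}
    for label, row in matrix.items():
        total = sum(row)
        errors[label] = (total - row[label]) / total
    return errors


def overall_error(matrix):
    """Share of all objects that fall off the diagonal."""
    total = sum(sum(row) for row in matrix.values())
    correct = sum(row[label] for label, row in matrix.items())
    return (total - correct) / total


for label, err in class_errors(confusion).items():
    print(f"class {label}: {err:.8f}")
print(f"overall: {overall_error(confusion):.4f}")
```

For class 0 this gives (9 + 17) / 179, which agrees with the reported 0.14525139 up to rounding.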

# | Variable | Mean_Min_Depth | Times_a_Root |
---|---|---|---|
1 | Age | 1.511688 | 112 |
2 | Blood group | 1.620000 | 111 |
3 | Sex | 1.723688 | 53 |
4 | Had influenza | 1.727688 | 92 |
5 | Smoke | 2.030000 | 79 |
6 | Vaccinated influenza | 2.372752 | 25 |
7 | Vaccinated tuberculosis | 2.164000 | 28 |

# | Region |
---|---|
1 | Age |
2 | IgG |
3 | Blood group |
4 | Had influenza |
5 | IgM |

Model | Full Dataset | Filtered by Ukraine | Filtered by Belarus | Filtered by Germany | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
---|---|---|---|---|---|---|---|---|
Logistic regression | 0.553 | 0.572 | 0.534 | 0.544 | 0.601 | 0.592 | 0.610 | 0.589 |
Support vector machine | 0.605 | 0.6327 | 0.570 | 0.584 | 0.621 | 0.694 | 0.635 | 0.637 |
Naive Bayes | 0.670 | 0.693 | 0.655 | 0.655 | 0.674 | 0.693 | 0.672 | 0.692 |
XGBoost | 0.898 | 0.932 | 0.860 | 0.942 | 0.941 | 0.945 | 0.899 | 0.957 |
Random forest | 0.897 | 0.924 | 0.859 | 0.940 | 0.932 | 0.944 | 0.961 | 0.925 |
Neural network | 0.820 | 0.849 | 0.828 | 0.79 | 0.830 | 0.849 | 0.8204 | 0.849 |
Decision tree | 0.513 | 0.542 | 0.517 | 0.492 | 0.553 | 0.631 | 0.612 | 0.642 |

Model | Age, IgG, Blood_Group, Had_Influenza, IgM | Age, Sex, Blood_Group, Had_Influenza |
---|---|---|
Logistic regression | 0.633 | 0.671 |
Support vector machine | 0.671 | 0.722 |
Naive Bayes | 0.674 | 0.732 |
XGBoost | 0.935 | 0.945 |
Random forest | 0.945 | 0.934 |
Neural network | 0.832 | 0.845 |
Decision tree | 0.553 | 0.631 |

Model | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
---|---|---|---|---|
Hierarchical classifier | 0.941 | 0.945 | 0.961 | 0.957 |
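
The hierarchical classifier row can be compared with the per-cluster accuracies of XGBoost and random forest in the model comparison table. The short check below is only an observation about the tabulated numbers, not a description of how the authors computed them; the variable names are hypothetical.

```python
# Per-cluster accuracies (clusters 1 to 4) copied from the comparison table.
xgboost = [0.941, 0.945, 0.899, 0.957]
random_forest = [0.932, 0.944, 0.961, 0.925]
reported = [0.941, 0.945, 0.961, 0.957]  # hierarchical classifier row

# Better of the two base models in each cluster.
best_per_cluster = [max(x, r) for x, r in zip(xgboost, random_forest)]
winners = ["XGBoost" if x >= r else "random forest"
           for x, r in zip(xgboost, random_forest)]

for cluster, (acc, name) in enumerate(zip(best_per_cluster, winners), start=1):
    print(f"Cluster {cluster}: {acc:.3f} ({name})")
print(best_per_cluster == reported)
```

In the tabulated data, the reported hierarchical accuracies coincide with the better of the two base models in every cluster.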


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Shakhovska, N.; Izonin, I.; Melnykova, N. The Hierarchical Classifier for COVID-19 Resistance Evaluation. *Data* **2021**, *6*, 6.
https://doi.org/10.3390/data6010006
