1. Introduction
The Aedes aegypti mosquito transmits the dengue virus (DENV), the causative agent of dengue fever, from person to person. There is currently no vaccine that protects against all serotypes of the virus. As a direct consequence, reducing the number of mosquitoes in an area has become the primary focus of the fight against the disease. Researchers are using machine learning (ML) and deep learning (DL) to forecast dengue cases and assist governments in their fight against the disease [1].
Dengue virus belongs to the genus Flavivirus of the family Flaviviridae [2,3]. Arthropods are the primary vectors for the spread of the dengue virus. The virus has four serotypes, referred to as DEN 1, DEN 2, DEN 3, and DEN 4. According to the World Health Organization (WHO), dengue fever poses a considerable risk to public health all over the world, especially in tropical and subtropical countries. Approximately 2.5 billion people live in dengue-endemic areas [4], with about 400 million infections annually and a mortality rate that can vary from 5 to 20 percent in specific locations [5]. Dengue fever is found all over the world and has also been reported in regions such as Europe and the United States of America (USA) [6]. The first recorded case of dengue fever in India occurred in Madras in 1780, and the first dengue epidemic in India confirmed by virological testing occurred in 1963–1964 [7].
Dengue fever is a dangerous illness that presents similarly to the flu and can affect persons of any age, including infants, children, adolescents, and adults [8]. The Aedes aegypti mosquito transmits the disease to humans, most often during the wetter months [9]. The WHO distinguishes between two levels of severity for dengue fever: moderate and severe [10,11]. Severe cases are characterized by abnormal bleeding, impairment of organ function, or significant plasma leakage, while other cases are considered relatively uncomplicated [11]. According to the 1997 classification, dengue can be divided into three categories: undifferentiated fever, dengue fever (DF), and dengue hemorrhagic fever (DHF) [12]. DHF is further divided into grades I–IV. DF can follow primary as well as secondary infections and is most common in adults and older children. The onset of symptoms is typically accompanied by a high, biphasic fever that can last from one to three weeks [13]. Other symptoms include severe headache, particularly retrobulbar pain, fatigue, myalgia, and painful joints, as well as a metallic taste, loss of appetite, diarrhea, nausea, and stomachache. Dengue fever is also referred to as break-bone fever [10,14] because the condition is often accompanied by myalgia and joint pain.
The worldwide public health system has a continuing need for the early detection of dengue fever, and machine learning algorithms may help medical professionals recognize and prevent infection at an earlier stage, saving time, money, and the discomfort of pathology testing [15,16]. In medical diagnostics, machine learning algorithms have been used to diagnose conditions from clinical and laboratory signs, with the aim of making the outcome as accurate as possible, as stated in [17]. The authors of [17] also note that artificial neural networks (ANN) are among the most prominent methods for addressing medical diagnostic problems and that support vector machines (SVM) give accurate results when evaluating a single ailment.
In another publication [18], the authors use an artificial neural network to assess meteorological data obtained from Singapore’s National Environment Agency (SNEA). To predict dengue illness in Thailand, the authors of [19] collected meteorological data and applied feature selection techniques. SVM is another well-known method for this problem: applied to a Singapore meteorological dataset, it can help with the prediction of dengue fever [20,21]. On a small sample from Brazil, Ref. [22] uses gene expression data and an RBF kernel. The authors of [23] use support vector regression (SVR) and compare several machine learning algorithms on data from the Guangdong region of China. Ref. [24] uses climatic factors to investigate the incidence of dengue fever in the Philippines, comparing random forest, gradient boosting, generalized additive modeling, and seasonal autoregressive integrated moving average with exogenous variables [25,26].
Deep learning is currently receiving considerable attention as a potential solution to a variety of challenges, particularly in medical imaging. Such systems can suggest differential diagnoses, compose preliminary radiology reports, and automatically detect and quantify lesions in medical images. This does not imply that radiologists will be replaced; rather, these tools help physicians provide more accurate diagnoses to their patients. Deep learning is an advanced subfield of machine learning that is widely applied in computer vision, whose tasks include image detection and recognition, image analysis, and similar activities. Over the past few years, interest in computer vision has grown substantially across a variety of academic domains. The convolutional neural network (CNN) is a type of artificial neural network developed specifically for processing image and video data, and it is used in most computer vision tasks, particularly the classification, recognition, and segmentation of medical images. A CNN takes images as input, extracts and learns features from them, and classifies the images according to the learned features. Several CNN-based architectures have been proposed, including AlexNet, SPP-Net, VGGNet, ResNet, and GoogLeNet. Deep CNN-based algorithms have shown promising results in the processing of medical images. An introduction to CNNs in medical imaging analysis, as well as a general discussion of machine learning and deep learning applied to medical images, is included in this work.
The contribution of work:
The purpose of this paper is to develop an early diagnostic model that helps doctors in the prompt prognosis and diagnosis of dengue disease by using machine learning algorithms. The key steps are as follows:
Applying machine learning techniques such as the KNN classifier, decision tree, random forest, Gaussian naive Bayes, and support vector classifier (SVC).
Creating a diagnostic model based on machine learning for fast detection and prognosis of dengue disease to aid medical professionals in making decisions.
Validating the results with the K-fold method.
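The steps listed above can be illustrated with a minimal sketch that compares the five named classifiers under K-fold cross-validation. It assumes scikit-learn; the synthetic dataset, the number of folds, and all hyperparameters are illustrative assumptions rather than the settings used in this work.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Distance- and margin-based models are scaled inside a pipeline so that
# the scaler is refit on the training part of every fold.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "GNB": GaussianNB(),
    "SVC": make_pipeline(StandardScaler(), SVC()),
}

# Five shuffled folds; the fold count is an illustrative choice.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_accuracy[name] = scores.mean()

for name, acc in mean_accuracy.items():
    print(f"{name}: mean accuracy = {acc:.3f}")
```

Reporting the mean accuracy across folds, rather than a single train/test split, reduces the dependence of the comparison on one particular partition of the data.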
2. Related Work
Machine learning (ML) is a field of computer science that enables computers to learn from data without being explicitly programmed [27,28]. ML has become pervasive and essential for resolving intricate problems in many scientific disciplines, especially in the field of illness diagnostics [29,30]. As technology continues to improve, machine learning algorithms and techniques are expected to predict and differentiate between a wide variety of illnesses in healthcare [30,31,32]. Machine learning is often cited as one of the most productive research approaches, particularly for predicting disease occurrence, and several distinct kinds of ML algorithms can be applied to disease forecasting [33,34]. The findings of an investigation into several machine learning approaches are shown in Table 1, along with the research pertinent to the topic. According to the reviews in [35,36], several machine learning methods, including SVM, KNN, R.F., D.T., and SVC, have been used and evaluated for dengue prediction.
3. Materials and Methods
In this work, we constructed a diagnostic and prognostic model for dengue fever. We divided the task into three steps: data collection, data preprocessing, and the application of ML classifiers, whose output is evaluated by the mean accuracy of disease prediction (see Figure 1).
3.1. Data Collection
The objective is to accurately predict the total number of dengue cases for each city, year, and week of the year in the test set. This study uses data from the DengAI competition (DengAI: Predicting Disease Spread, an open-data dengue competition hosted on drivendata.org). The competition provides data for two cities, San Juan and Iquitos, extending from three to five years. Each city has its own set of records and forecasts. The data are separated into training and test datasets, as shown in Table 2.
3.2. Data Preprocessing
The most significant component of the machine learning pipeline is data preprocessing, which converts raw data into processed (meaningful) data. The dataset needs to be cleaned, normalized, and freed of noise before it can be used for analysis (see Figure 2).
3.3. Features Selection
Feature selection is one of the most critical steps in building a prediction model. In this phase, the number of input variables is reduced to lower the computational cost of modeling and, in some cases, to improve the overall performance of the model. Some attributes in the dataset have missing values, which we replace with the mean of the corresponding attribute. After that, we use the fit and transform methods to normalize and standardize the data.
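The imputation and standardization described above can be sketched as follows. This is a minimal example assuming scikit-learn's SimpleImputer and StandardScaler; the small array of values is invented for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with missing entries (values are illustrative only).
X = np.array([
    [297.7, 14.0, 12.4],
    [298.4, np.nan, 22.8],
    [np.nan, 16.8, 34.5],
    [299.2, 16.7, 15.4],
])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

# Standardize every column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_filled)

print(X_scaled.round(3))
```

In practice, the imputer and scaler would be fitted on the training data only and then applied to the test data with `transform`, so that no information from the test set leaks into preprocessing.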
We can see that there are several different features that have extreme values by looking at
Figure 3. After investigating the data, it became clear that they are neither outliers nor errors; hence, we are unable to disregard them and will have to take them into consideration. The values of precipitation are taken into consideration here, and given that these are estimates of the amount of rain, it is reasonable to anticipate that the weather can vary significantly depending on the location.
The features reanalysis_avg_temp_k and reanalysis_specific_humidity_g_per_kg appear to be pretty similar in shape; nonetheless, the question that arises here is whether or not they are correlated with one another.
By looking at
Figure 4, we can come to the conclusion that certain features are perfectly associated with one another (1), while other features are practically perfectly correlated with one another (0.9). The same information is presented in
Table 3.
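A correlation analysis of this kind can be sketched with pandas. The example below is an assumption-laden illustration: the data are synthetic, the 0.9 threshold is chosen to match the value quoted in the text, `correlated_pairs` is a hypothetical helper, and the column names are used only as illustrative labels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data (assumption): humidity is built to track dew point closely.
dew_point = rng.normal(295.0, 1.5, n)
df = pd.DataFrame({
    "reanalysis_dew_point_temp_k": dew_point,
    "reanalysis_specific_humidity_g_per_kg": 0.9 * dew_point + rng.normal(0.0, 0.1, n),
    "precipitation_amt_mm": rng.gamma(2.0, 20.0, n),
})
# An exact copy of a column gives a correlation of 1.
df["reanalysis_sat_precip_amt_mm"] = df["precipitation_amt_mm"]

# Pearson correlation matrix of all numeric columns.
corr = df.corr()


def correlated_pairs(corr, threshold=0.9):
    """Return (feature_a, feature_b, r) for each pair with |r| >= threshold."""
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


pairs = correlated_pairs(corr, threshold=0.9)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

Only the upper triangle of the matrix is scanned, so each pair is reported once and the trivial self-correlations on the diagonal are skipped.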
Because the goal of this manuscript is to detect dengue, a feature whose values are well separated between the two cities (for example, reanalysis_tdtr_k) is suitable for ML classification; a feature whose values overlap between the cities and give mixed information is not considered suitable for prediction or classification.
From
Figure 5 and
Figure 6, and from the dataset, we can create a new data frame, i.e., X_train plus the total_cases column of y_train.
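Combining the features with the labels can be sketched as a keyed merge in pandas. The key column names (`city`, `year`, `weekofyear`), the city codes, and all values below are assumptions made for illustration.

```python
import pandas as pd

# Columns assumed to identify one weekly record in both tables.
keys = ["city", "year", "weekofyear"]

# Tiny illustrative stand-ins for the feature and label tables.
X_train = pd.DataFrame({
    "city": ["sj", "sj", "iq", "iq"],
    "year": [1990, 1990, 2000, 2000],
    "weekofyear": [18, 19, 26, 27],
    "reanalysis_tdtr_k": [2.6, 2.4, 8.9, 10.3],
})
y_train = pd.DataFrame({
    "city": ["sj", "sj", "iq", "iq"],
    "year": [1990, 1990, 2000, 2000],
    "weekofyear": [18, 19, 26, 27],
    "total_cases": [4, 5, 0, 0],
})

# Attach total_cases to the features; validate guards against duplicate keys.
train = X_train.merge(y_train, on=keys, how="left", validate="one_to_one")
print(train)
```

Merging on the keys, instead of concatenating columns by position, keeps each label attached to the correct record even if the two tables are ordered differently.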
After applying all the above-mentioned steps, we deduce the features from the dataset, as shown in
Table 4.
In this feature selection step, we drop two features, reanalysis_sat_precip_amt_mm and reanalysis_specific_humidity_g_per_kg, because each is almost perfectly correlated (0.9) with another feature and therefore adds little information. However, when machine learning algorithms (KNN, D.T., R.F., and GNB) are applied to the data in this form, the accuracy is very low, because the total number of cases varies widely, from 0 to more than 400. To improve the accuracy, we divide the dataset by city, as shown in Table 4.
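Dropping the two features and splitting the data by city can be sketched as follows. The city codes `sj` and `iq`, the `city` column name, and the values are illustrative assumptions; only the two dropped feature names come from the text.

```python
import pandas as pd

# Illustrative records for two cities (values are invented).
train = pd.DataFrame({
    "city": ["sj", "sj", "sj", "iq", "iq"],
    "reanalysis_sat_precip_amt_mm": [12.4, 22.8, 34.5, 25.4, 60.6],
    "reanalysis_specific_humidity_g_per_kg": [14.0, 15.4, 16.8, 16.7, 17.2],
    "reanalysis_dew_point_temp_k": [292.4, 293.9, 295.4, 295.2, 295.8],
    "total_cases": [4, 120, 400, 0, 12],
})

# The two features named in the text as nearly redundant.
dropped = [
    "reanalysis_sat_precip_amt_mm",
    "reanalysis_specific_humidity_g_per_kg",
]
reduced = train.drop(columns=dropped)

# One data frame per city, so each city can be modeled on its own scale.
by_city = {
    city: part.drop(columns="city").reset_index(drop=True)
    for city, part in reduced.groupby("city")
}

for city, part in by_city.items():
    print(city, part.shape, "max cases:", part["total_cases"].max())
```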
After this, we will find the correlation for two different cities, i.e., San Juan and Iquitos, separately, shown in
Figure 7.
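Computing the correlation with the target separately for each city can be sketched as below. The data are synthetic and are generated so that humidity drives the case counts; the generator `synthetic_city`, its parameters, and the city codes are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)


def synthetic_city(n_weeks, humidity_effect):
    """Generate fake weekly records in which humidity drives case counts."""
    humidity = rng.normal(17.0, 1.5, n_weeks)
    tdtr = rng.normal(5.0, 2.0, n_weeks)
    noise = rng.normal(0.0, 2.0, n_weeks)
    cases = np.clip(20.0 + humidity_effect * (humidity - 17.0) + noise, 0.0, None)
    return pd.DataFrame({
        "reanalysis_specific_humidity_g_per_kg": humidity,
        "reanalysis_tdtr_k": tdtr,
        "total_cases": cases.round(),
    })


cities = {"sj": synthetic_city(300, 10.0), "iq": synthetic_city(200, 6.0)}

# For each city, rank the features by their correlation with total_cases.
target_corr = {
    city: frame.corr()["total_cases"].drop("total_cases").sort_values(ascending=False)
    for city, frame in cities.items()
}

for city, ranking in target_corr.items():
    print(city)
    print(ranking.round(3))
```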
From the per-city correlations, we can deduce that both cities show promising results for
reanalysis_specific_humidity_g_per_kg
reanalysis_dew_point_temp_k
reanalysis_min_air_temp_k
These features are perfectly correlated with each other (value 1), which is consistent with mosquitoes living in areas with high humidity. Since temperature plays a vital role in the spread of mosquitoes, the temperature features are correlated both with each other and with the total number of cases. Surprisingly, the week of the year is also highly correlated with the number of cases in San Juan, so we keep a close eye on it. In addition, plotting the reported cases of each year against the week of the year shows an outbreak at the end of the year in both cities. The number of reported cases grows, and outbreaks often last a few weeks, as illustrated in Table 5, Figure 8, and Figure 9.
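The seasonal pattern described above can be examined by averaging the reported cases per week of the year for each city. The sketch below uses a handful of invented records; the column names and city codes are assumptions.

```python
import pandas as pd

# Invented weekly records: two years, three sampled weeks, two cities.
train = pd.DataFrame({
    "city": ["sj"] * 6 + ["iq"] * 6,
    "year": [1991, 1991, 1991, 1992, 1992, 1992] * 2,
    "weekofyear": [10, 30, 48] * 4,
    "total_cases": [5, 20, 60, 7, 25, 80, 1, 3, 12, 0, 4, 15],
})

# Mean cases per week of the year, one column per city.
weekly = (
    train.groupby(["city", "weekofyear"])["total_cases"]
    .mean()
    .unstack("city")
)

# Week of the year with the highest average case count in each city.
peak_week = weekly.idxmax()

print(weekly)
print(peak_week)
```

Averaging across years smooths out individual outbreaks, so a peak that persists in the averaged curve points to a recurring seasonal effect and not to a single unusual year.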
5. Discussion and Conclusions
Owing to its popularity and widespread application in image segmentation, deep learning has become a crucial tool that achieves ever-higher levels of precision. However, the primary concern is the optimization of deep learning, which spans multiple levels: refining the deep network architecture and applying ensemble learning; tuning hyperparameters, which is an empirical process; optimizing the loss function in accordance with the evaluation metrics; and choosing an appropriate optimizer and activation functions.
The purpose of this research is to develop a diagnostic model for dengue by using machine learning techniques such as KNN, D.T., R.F., SVR, and GNB. The model is intended to predict the progression of the disease and to allow for its early diagnosis. Future work should prioritize cause–effect models for the diagnosis of disease: it is vital not only to diagnose the disease but also to analyze the factors that most strongly influence the infection. A deeper understanding of the etiology of the disease, together with more accurate diagnostic models, would greatly assist the fight against dengue fever and reduce the complications and fatalities it causes. Modeling to minimize the impact of data uncertainty is another vital area, since the poor quality of epidemiological data about dengue is one of the primary obstacles to improving existing models. Finally, independent loops of data analysis can help automate the decision-making process in disease control. Although the R.F. method requires significantly more computation time than the D.T., KNN, SVR, and GNB methods, it generates superior results. Based on these findings, the RF-based diagnostic model is the best suited of the evaluated machine learning algorithms for accurately diagnosing dengue fever at an early stage.
The primary obstacles in this work were the substantial number of optimization factors and schemes that had to be explored empirically to arrive at our final design. Even though we reduced the number of trainable parameters of the network to better suit the hardware, training still requires a significant amount of CPU power.
In conclusion, reason-based models can help with the analysis and interpretation of dengue disease data. Because there is a severe lack of high-quality data in the field of healthcare, machine learning models that can deal with ambiguity can be highly valuable. Finally, data decentralization, in conjunction with aggregated learning, may make it possible to cut the costs of computer modeling without compromising the data’s integrity.