Improving Classification Accuracy Using Hybrid Machine Learning Algorithm on Malaria Dataset

Abstract: Machine learning algorithms are integrated into Computer-Aided Diagnosis (CAD) systems to support medical practitioners in diagnosing patient disorders. This research seeks to improve the accuracy of classifying malaria-infected erythrocytes (red blood cells, RBCs) by combining machine learning algorithms into a hybrid classifier. The main phases are data pre-processing, segmentation, feature extraction, and RBC classification. Two combinations of supervised algorithms are proposed: the first comprises Stochastic Gradient Descent (SGD), Logistic Regression, and Decision Tree, and the second comprises SGD, XGBoost, and Random Forest. The approach is implemented in Python. A comparative analysis between the individual algorithms and the proposed hybrids shows higher accuracy in classifying malaria data, thus aiding medical practitioners in diagnosis. SGD, Logistic Regression, and Decision Tree yield individual accuracies of 90.63%, 92.23%, and 93.43% respectively, while their hybrid achieves 95.64% on the same dataset. The second hybrid, combining SGD, XGBoost, and Random Forest, outperforms the first: individually these algorithms achieve accuracies of 90.63%, 95.86%, and 96.11%, and the hybrid reaches 96.22% on the same dataset.


Introduction
Malaria is a severe illness caused by blood parasites of the genus Plasmodium and is a major cause of death worldwide [1]. The parasites are transmitted from person to person through the bites of infected female Anopheles mosquitoes: the parasite enters the bloodstream through the mosquito's saliva and infects the red blood cells (RBCs). RBCs can be infected by several species of malaria parasite. Malaria can also be transmitted by blood transfusion. The incubation period is eight to twelve days for the most severe type of malaria, and may be as long as ten months in some rare types. Standard malaria detection relies on manual examination under a light microscope, which increases the chance of incorrect identification and late diagnosis. As a result, many researchers have proposed automated malaria diagnosis based on image processing to provide timely parasite detection and improve diagnostic accuracy.
The authors of [2] proposed a methodology for extracting cell features that uses K-means clustering to segment cell nuclei, yielding a segmentation error rate of only 6.46%. In [3], Sobel edge detection was used to distinguish the borders of the malaria parasite from the surrounding blood cells. In [4], parasite components were differentiated from white blood cells (WBCs) by adaptive thresholding on the histogram of the V channel in the HSV colour space; 20 test images were used, yielding an accuracy of 60%. In [5], adaptive histogram thresholding was applied to slide images to detect malaria-infected RBCs morphologically. The authors of [6] combined morphological segmentation with colour information to identify parasites in densely populated blood smears, and evaluated their method at the patch level on a dataset of 75 images.
In recent decades, malaria parasite detection has become an active research field, with several new automated detection approaches proposed. Because the malaria parasite directly infects the RBCs, an automatic detection scheme must first separate the RBCs from the artefacts and background in a microscopic image [7]. A hybrid of Support Vector Machine (SVM) and K-means was proposed for breast cancer classification, and the experimental results showed the algorithm to be effective [8]. Similarly, [9] concluded that a hybrid of Decision Tree and Artificial Neural Network performs better than either algorithm individually. Learning-based techniques are more effective for larger datasets. In this paper, we propose a hybrid machine learning algorithm, implemented in Python, and compare it with the individual algorithms. The comparison shows higher accuracy in classifying malaria data, thus aiding medical practitioners in diagnosis.

Methodology
Hybridization can improve the performance of individual algorithms. The proposed scheme follows these steps: image acquisition, image preprocessing, image segmentation, feature extraction, and image classification using five machine learning algorithms, i.e. Stochastic Gradient Descent (SGD), Decision Tree, Logistic Regression, XGBoost, and Random Forest. The main objective of the hybrid classifier is to increase classification accuracy compared with the individual classifiers. In the hybrid classifier, the misclassifications of an individual classifier are often overcome by the others, thus increasing the accuracy rate. The classifiers are combined using the majority voting technique for the final classification of infected and uninfected erythrocytes (RBCs). This is an important and novel area of research compared with discrete (individual) learning approaches.

Data acquisition
The data for malaria parasite detection were collected by the National Institutes of Health (NIH), and the dataset is available on kaggle.com. It contains 27,558 cell images, with equal numbers of parasitized and uninfected cells, and is a well-suited labeled image dataset for training and testing our model. The malaria cell dataset was divided in a 9:1 ratio into a training set of 24,802 images and a testing set of 2,756 images for training and validating the proposed algorithm. A level-set based algorithm was implemented to detect and segment the red blood cells.
The dataset consists of images of normal red blood cell smear samples, as shown in Figure 1, and of infected red blood cells, as shown in Figure 2.
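The 9:1 division can be illustrated with a minimal sketch. It assumes a random shuffle with a fixed seed and rounding down of the training count; the function name `split_indices` and the seed are hypothetical, since the paper does not state how the split was drawn.

```python
import random


def split_indices(n_samples, train_parts=9, test_parts=1, seed=0):
    """Shuffle sample indices and split them in a train_parts:test_parts ratio."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed is an assumption, for repeatability
    # Integer arithmetic avoids floating-point rounding of the training count.
    n_train = n_samples * train_parts // (train_parts + test_parts)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(27558)
print(len(train_idx), len(test_idx))  # 24802 2756
```

With 27,558 images this reproduces the reported set sizes of 24,802 and 2,756.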

Preprocessing
Preprocessing is applied primarily to enhance image quality and reduce image variations that would unnecessarily complicate subsequent processing steps. First, the input images are resized to 256 × 256 pixels. The resized image is then converted to a grayscale image. Next, contrast stretching with suitable bounds is applied to the grayscale image to make the infected cells or parasites darker and the unwanted background lighter. Contrast stretching methods, and histogram equalization in particular, are the most popular enhancement methods. Colour normalization methods, including the common use of grayscale conversion, are applied to compensate for illumination and staining differences.
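The two enhancement operations named in this step can be sketched on a tiny grayscale image stored as a list of rows. This is an illustration under stated assumptions, not the authors' implementation: the stretching bounds `low` and `high` are hypothetical values, and the function names are chosen here for readability.

```python
def contrast_stretch(gray, low, high):
    """Linearly map intensities in [low, high] to [0, 255], clipping values outside."""
    return [
        [min(255, max(0, round((p - low) * 255 / (high - low)))) for p in row]
        for row in gray
    ]


def equalize_histogram(gray):
    """Histogram equalization for an 8-bit grayscale image given as a list of rows."""
    hist = [0] * 256
    for row in gray:
        for p in row:
            hist[p] += 1
    # Cumulative distribution of pixel intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:  # constant image: nothing to equalize
        return [row[:] for row in gray]
    return [
        [round((cdf[p] - cdf_min) * 255 / (total - cdf_min)) for p in row]
        for row in gray
    ]


image = [[50, 100], [150, 200]]            # toy 2 x 2 image, not from the dataset
print(contrast_stretch(image, 50, 200))    # bounds 50 and 200 are assumed values
print(equalize_histogram(image))
```

In both cases the darkest pixel is mapped to 0 and the brightest to 255, which widens the gap between parasite and background intensities.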

RGB to Gray-scale Conversion
Converting an RGB image to a grayscale image involves simple matrix manipulation and reduces the amount of information involved in learning and detection. The RGB colour model is an additive colour model that combines red, green, and blue light in different proportions.
In the RGB colour system, a colour is described by the amount of red, green, and blue it contains. To transform an RGB image into a grayscale image of its luminance, the values of its red, green, and blue components must first be obtained (Fig. 3).
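A minimal sketch of the luminance conversion follows. The paper does not state which channel weights were used; the weights 0.299, 0.587, and 0.114 below are the widely used ITU-R BT.601 luma coefficients and are an assumption here.

```python
def rgb_to_gray(image):
    """Convert rows of (R, G, B) tuples to rows of 8-bit luminance values.

    The weights are the ITU-R BT.601 luma coefficients (an assumption;
    the paper does not specify its weights).
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


pixels = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]
print(rgb_to_gray(pixels))
```

Green contributes most to the result and blue least, which matches the eye's relative sensitivity to the three channels.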

Feature Extraction
Feature extraction is the process of identifying and classifying specific points in an image [10].

Texture Based feature
Texture-based features help distinguish the Haralick textures of infected and uninfected red blood cells. The texture features are determined from matrix-based elements [11].
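Haralick features are conventionally derived from a grey-level co-occurrence matrix (GLCM). The sketch below builds a normalised GLCM and computes the angular second moment (ASM), which the paper names, together with contrast as a second common Haralick feature. The pixel offset of one step to the right, the number of grey levels, and the toy image are assumptions made for illustration.

```python
def glcm(gray, levels, dx=1, dy=0):
    """Normalised co-occurrence matrix for pixel pairs separated by (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(gray), len(gray[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                counts[gray[y][x]][gray[ny][nx]] += 1
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("offset leaves no pixel pairs inside the image")
    return [[c / total for c in row] for row in counts]


def angular_second_moment(p):
    """ASM: sum of squared GLCM entries; larger for more uniform textures."""
    return sum(v * v for row in p for v in row)


def contrast(p):
    """Contrast: GLCM entries weighted by the squared grey-level difference."""
    return sum((i - j) ** 2 * v for i, row in enumerate(p) for j, v in enumerate(row))


# Toy 4 x 4 image with four grey levels (values are illustrative only).
image = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 2, 2, 2], [2, 2, 3, 3]]
p = glcm(image, levels=4)
print(angular_second_moment(p), contrast(p))
```

In practice the image would be quantised to a small number of grey levels first, and features from several offsets are often averaged.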

Hu Moment shape based features
In 1962, Hu proposed seven region-based moment features that are invariant to rotation, translation, and scaling (RTS), also known as algebraic moment invariants. Moment invariants are region descriptors used to characterize the shape and size of an object in an image. Moments describe the properties of an object with respect to its area, position, and orientation.
For an N×N image f(x,y), the moment of order (i, j) is $m(i,j) = \sum_{x}\sum_{y} x^{i}\, y^{j}\, f(x,y)$, where the sums run over all N×N pixel coordinates.
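The moment definition can be made concrete with a short sketch that computes raw moments, central moments, and the first of Hu's seven invariants. It assumes x indexes columns and y indexes rows, both starting at 0; the function names are chosen here and are not from the paper.

```python
def raw_moment(image, i, j):
    """m(i, j) = sum over pixels of x**i * y**j * f(x, y)."""
    return sum(
        (x ** i) * (y ** j) * value
        for y, row in enumerate(image)
        for x, value in enumerate(row)
    )


def central_moment(image, p, q):
    """Moment taken about the centroid, which makes it translation invariant."""
    m00 = raw_moment(image, 0, 0)
    x_bar = raw_moment(image, 1, 0) / m00
    y_bar = raw_moment(image, 0, 1) / m00
    return sum(
        ((x - x_bar) ** p) * ((y - y_bar) ** q) * value
        for y, row in enumerate(image)
        for x, value in enumerate(row)
    )


def hu_first_invariant(image):
    """First Hu invariant: eta20 + eta02, using scale-normalised central moments."""
    mu00 = central_moment(image, 0, 0)
    eta20 = central_moment(image, 2, 0) / mu00 ** 2
    eta02 = central_moment(image, 0, 2) / mu00 ** 2
    return eta20 + eta02


shape = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
shifted = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 0]]
print(hu_first_invariant(shape), hu_first_invariant(shifted))
```

The same shape at two positions gives the same value, which is the translation invariance the section describes.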

Classification
Classification is the final and most important phase, in which the features of the input image are categorized and compared with the features of infected red blood cell images [12]. In this research, five machine learning algorithms are used as classifiers, i.e. SGD, Random Forest, Decision Tree, Logistic Regression, and XGBoost. The classification accuracies of Decision Tree, SGD, Logistic Regression, XGBoost, Random Forest, and the hybrids of three classifiers each were compared. The accuracy of the hybrid classifiers is higher than that of the other classifiers.
The classification results obtained in this work show that the diagnostic accuracy of an ensemble of classifiers is higher than that of the individual classifiers. The classifiers are therefore combined using majority voting for the final classification of infected and uninfected red blood cells. The boosting method then further improves the accuracy of the hybrid machine learning algorithm.

Classification based on hybrid classifier
The hybrid classifier combines the decisions of several classifiers to produce the final classification outcome. Its primary objective is to enhance classification accuracy. In the hybrid classifier, the misclassifications of an individual classifier are often overcome, thus improving the accuracy rate.
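How a vote among three classifiers can overcome individual misclassifications is easiest to see on a toy example. The sketch below is illustrative only: the label lists are made up, not outputs of the trained models, and the label 1 is assumed to mean "infected".

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one list by majority vote.

    With three voters and two classes a tie cannot occur.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels for eight cells (1 = infected, 0 = uninfected).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
sgd    = [1, 0, 0, 1, 0, 1, 1, 0]  # wrong on samples 2 and 5
logreg = [1, 1, 1, 1, 0, 0, 1, 0]  # wrong on sample 1
tree   = [0, 0, 1, 1, 0, 0, 1, 1]  # wrong on samples 0 and 7

hybrid = majority_vote([sgd, logreg, tree])
for name, pred in [("SGD", sgd), ("LogReg", logreg), ("Tree", tree), ("Hybrid", hybrid)]:
    print(name, accuracy(y_true, pred))
```

Because the three classifiers err on different samples here, each error is outvoted by the other two. The gain shrinks when the classifiers tend to make the same mistakes.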

Results
This study set out to improve classification accuracy using a hybrid machine learning algorithm. This section presents the analysis, results, and discussion of the accuracies of the individual and proposed hybrid classifiers on the selected malaria dataset. Two combinations of supervised machine learning algorithms were selected: the first is SGD, Logistic Regression, and Decision Tree, and the second is SGD, XGBoost, and Random Forest. The results shown in the tables are the outcomes of both implementations of the proposed hybrid. For the combination of SGD, XGBoost, and Random Forest, the hybrid uses the average-of-probabilities combination rule, as it gives better performance on the selected dataset.
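The average-of-probabilities rule can be sketched as follows. Each classifier is assumed to output a probability that a cell is infected; the probabilities are averaged and thresholded. The probability values and the 0.5 threshold are assumptions for illustration, not figures from the paper.

```python
def average_probabilities(prob_lists, threshold=0.5):
    """Average per-classifier P(infected) values and threshold the mean.

    prob_lists holds one list of probabilities per classifier, aligned by sample.
    Returns (mean probabilities, predicted labels).
    """
    means = [sum(p) / len(p) for p in zip(*prob_lists)]
    labels = [1 if m >= threshold else 0 for m in means]
    return means, labels


# Made-up probabilities for two cells from three classifiers.
sgd_p = [0.90, 0.10]
xgb_p = [0.80, 0.60]
rf_p  = [0.70, 0.55]

means, labels = average_probabilities([sgd_p, xgb_p, rf_p])
print(means, labels)
```

For the second cell two of the three classifiers lean towards "infected", so a hard majority vote would output 1, while the averaged probability falls below 0.5 because one classifier is confidently negative. Averaging therefore takes each classifier's confidence into account rather than only its vote.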

Comparative Analysis
The classification accuracies of Decision Tree, SGD, Logistic Regression, XGBoost, Random Forest, and the hybrid classifiers were compared. Each hybrid classifier achieved higher accuracy than any of its constituent classifiers: 95.64% for the first hybrid against at most 93.43%, and 96.22% for the second against at most 96.11%.
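The classifiers are compared in terms of accuracy, precision, and recall. A minimal sketch of how these three metrics are computed from true and predicted labels is given below; the label lists are made up, and treating label 1 (infected) as the positive class is an assumption.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of predicted positives that are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of actual positives that are found
    return accuracy, precision, recall


# Made-up labels for eight cells (1 = infected, 0 = uninfected).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
```

Recall is especially relevant for diagnosis, since a missed infected cell (a false negative) is usually more costly than a false alarm.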

Conclusion
In this study, the proposed hybrid algorithms demonstrate that hybrid machine learning techniques perform better than the individual algorithms on the selected malaria cell dataset. The proposed system uses two combinations of classifiers: the first combines SGD, Logistic Regression, and Decision Tree, and the second combines SGD, Random Forest, and XGBoost, using the majority voting technique for the final classification of infected and uninfected red blood cells. Individually, SGD, Logistic Regression, and Decision Tree detect the parasites with accuracies of 90.63%, 92.23%, and 93.43%; applying the first hybrid algorithm to the same dataset increases the accuracy to 95.64%. The second hybrid algorithm, combining SGD, XGBoost, and Random Forest, outperforms the first. Individually, these algorithms achieve accuracies of 90.63%, 95.86%, and 96.11%, and the hybrid increases the accuracy to 96.22% on the same dataset. The results show that hybrid machine learning algorithms are key to improving classification accuracy, and that the second hybrid combination performs better than the first.

Supplementary Materials: Not Applicable
Author Contributions: Conceptualization, methodology and editing by Shahzad Alam, and software, validation, data curation, writing-original draft preparation by Rashke Jahan.

Figure 1. Microscopic images of normal red blood cells.

Figure 2. Microscopic images of red blood cells infected with malaria parasites.

Figure 3. RGB and grayscale images.

ASM
The statistical values of the image are also known as second-order histogram statistics. The Haralick statistical features are determined using the following formulas.

Figure 5. Comparison of individual classifiers in terms of accuracy, precision, and recall.

Figure 6. Comparison of individual classifiers in terms of accuracy, precision, and recall.

Funding
[Chart labels: Hybrid of SGD, Logistic Regression and Decision Tree; Hybrid of SGD, XGBoost and Random Forest. Axes: values, classification accuracy.]

Table 1. Results of the hybrid of SGD, Logistic Regression, and Decision Tree algorithms.

Table 2. Results of the ensemble of Random Forest, XGBoost, and SGD algorithms.

Table 3. Comparison of classification accuracy.