Article

A Comprehensive Study on Healthcare Datasets Using AI Techniques

1 School of Mathematics and Big Data, Anhui University of Science and Technology, Huainan 232000, China
2 Anhui Province Engineering Laboratory for Big Data Analysis and Early Warning Technology of Coal Mine Safety, Huainan 232001, China
3 School of Physics and Electronics, Central South University, Changsha 410083, China
4 Department of Computer Science, University of Oregon, Eugene, OR 97403, USA
* Author to whom correspondence should be addressed.
Electronics 2022, 11(19), 3146; https://doi.org/10.3390/electronics11193146
Submission received: 31 August 2022 / Revised: 26 September 2022 / Accepted: 26 September 2022 / Published: 30 September 2022
(This article belongs to the Section Artificial Intelligence)

Abstract

Due to greater accessibility, healthcare databases have grown over the years. In this paper, we practice locating and associating data points or observations that pertain to similar entities across several public healthcare datasets. Based on the methods proposed in this study, all sources are aligned using AI-based approaches that consider non-unique features and calculate similarity indices. Critical components discussed include accuracy assessment, blocking criteria, and linkage processes. For accurate measurement, we develop methods for manually evaluating and validating matched pairs in order to refine the linking parameters and boost the efficacy of the process. This study aims to assess and raise the standard of healthcare datasets that aid doctors' comprehension of patients' physical characteristics, by using NARX to detect errors and machine learning models for the decision-making process. Consequently, our findings on the mortality rate of patients with COVID-19 revealed a gender bias: female 15.91% and male 22.73%. We also found a gender bias with mild symptoms such as shortness of breath: female 31.82% and male 32.87%. With congestive heart disease symptoms, the bias was as follows: female 5.07% and male 7.58%. Finally, with typical symptoms, the overall mortality rate for both males and females was 13.2%.

1. Introduction

Data records present real-world observations, entities, and ideas in numerical format [1]. Across healthcare data points, therefore, records belonging to a single entity or event are essentially connected [2,3,4]. A record may lack a corresponding real-world counterpart, and an entity or event may possess more than one record in the database [5]. More often than not, databases must be restored to a clean state by removing duplicate records [6].
Several earlier studies provide a diverse array of sequence- and set-based similarity functions that can deliver precise findings [7,8,9,10]. In contrast, no presumptive norms may be used to evaluate the correctness of the objects identified in Section 2. Because generated data events or entities may be domain specific and affected by a variety of circumstances, including data quality [11,12] and data collection history [13,14], manual evaluation is routinely utilized in such situations, depending on knowledge of the database. The observer's experience, however, might be prone to misunderstanding and subjectivity. Thus, the concern raised here regarding the choice of appropriate techniques is addressed through various experiments with different AI-based methods, as presented in Section 3. These methods provided valuable but different results, each relevant to the situations to which it is best suited [15]. Therefore, the system consists of a dynamics simulation using the nonlinear autoregressive exogenous (NARX) network [16,17,18,19,20,21]. Faults produced by inserting several error types into the distillation column were detected and predicted using the NARX network [21,22]. In this study, a neural network time-series model is applied for error identification in the distillation column of datasets [23]. The prediction model was created using a pilot-scale distillation column, as presented in several studies [22,24,25,26]. The proposed methods, combining NARX and machine learning models, may thus be able to explore and make understandable a complex healthcare database:
  • Data monitoring and fault detection using neural network models are performed with three different algorithms: Levenberg–Marquardt, Bayesian regularization, and scaled conjugate gradient, as shown in Table 1. Figure 1 presents the fault detection process, and the prediction model shown in Figure 2 has one input node, one output node, and ten neurons in the hidden layer. Two delay orders indicate how the results of the previous two time steps predict incoming data. The number of hidden neurons chosen is crucial, since it determines the capacity of the NARX neural network. This configuration is central to accessing the NARX hidden layer when predicting erroneous data. Additionally, it serves as an alternative method of normalizing the data.
  • The decision-making process uses a machine learning approach to check the linking factors between common symptoms and cross-check them with the mortality rate. The symptoms, however, must be encoded as numerical values so that it is meaningful to perform mathematical operations on the attribute values.
In addition, the rest of the paper is structured in the following sections. In Section 2, we look at relevant background studies and describe the healthcare system's challenges. Several AI models are theoretically discussed in Section 3. The comparative analysis in Section 4 gives details about the experimental models. The results of the experiments are presented in Section 5. Finally, we summarize the results with a statistical discussion in Section 6, and the conclusions of our findings are presented in Section 7.

2. Related Work

In this section, we discuss various background research on healthcare issues. For instance, AI models are considered for improving complex healthcare databases.

2.1. Transformation

Managing healthcare records for various purposes and needs is a key way to create enriched datasets for analysis [27]. Accordingly, an AI methodology is used to assemble data from different sources that have been acknowledged as belonging to the same entity [28,29]. This data management can be carried out using deterministic or probabilistic approaches, depending on how the data exist in the first place, because there are no common identifying attributes across all of these data sources. In both cases, the attributes that are present in the data sources are matched with similarity functions that determine whether objects match or not [30]. Consider the following scenario: a patient with symptoms could undergo a physical history to help guide their clinical care [31,32]. The present COVID-19 epidemic is special, though. Compared to earlier epidemic outbreaks, such as the SARS epidemic of 2003 or the H1N1 epidemic of 2009, its effects have been even more severe, varied, and dynamic [33]. The global effort to combat the COVID-19 outbreak and upcoming pandemics should include information systems and technology [34].
The pandemic has highlighted the urgent need to transform the public health system from reactive to proactive and create innovations that will give real-time information for proactive decision making. In addition, at the general level, machine learning and deep learning algorithms, among other artificial intelligence technologies, can be employed for the early detection and diagnosis of infection more quickly than drug development for generating novel treatments [35,36,37,38,39,40,41]. However, making a machine analyze all accessible data is challenging.

2.2. Healthcare Challenges

Statistically, USD 3 billion has been spent in US hospitals to hasten the implementation of electronic medical records, and in US nonfederal hospitals, adoption rates grew from 9.4% in 2008 to 96% in 2017 [42]. The use of EHRs (Electronic Health Records) in German hospitals increased from 39.9% in 2007 to 68.4% in 2017. In 2015, the Chinese central government committed more than USD 3.5 billion, according to an earlier study [42]. However, since the COVID-19 pandemic outbreak, doctors have become more reliant on electronic healthcare records. In addition, research has taken a revolutionary turn with machine learning and deep learning algorithms to explore complex healthcare databases, which involves a learning curve compared to obtaining patients' physical history.
In order to create successful interventions, more predictive and prognostic traits should be identified, as well as the underlying molecular mechanisms of disease [43]. The variable grouping relationships and the variable outcome associations with an average period from exposure to symptoms are used to match the probable confounders in the propensity score model [44,45]. Several earlier studies explain that long COVID-19 symptoms can change or relapse over time. Although patients with lengthy COVID-19 have reported a variety of symptoms, weariness (about 58%), shortness of breath (24%), joint pain (19%), and chest pain (16%) are the most common [46,47]. Also prevalent are headache (44%), palpitations (11%), physical limitations, depression (12%), and insomnia (11%), as presented in [47]. These symptoms may develop during the initial recovery from an acute COVID-19 episode or they may remain after the original COVID-19 illness and do not go away. Moreover, there are many symptom profiles that are associated with post-COVID syndrome [48]. The shift in the COVID-19 mortality rate can be effectively understood by examining this relationship between two or more common symptoms [49,50].
Consequently, our study presented models that assist healthcare data processing. We studied the mortality rate of COVID-19 patients admitted to hospitals with cardiac, lung, and kidney diseases. We have analyzed survival patterns and deaths due to cardiac, lung, and kidney diseases per the COVID-19 symptoms, as presented in Section 6.

Data Matching

Regarding the significance of comparing variables to link two files, a separate matching variable is required that can relate every entry in the master variable to those in the file of interest. Each matching field must be distinct and straightforward to locate in both variables. Then, in the merge phase, the matching scenario between the two sets is defined by assigning codes while comparing data against the master variable. The first digit of this matching code indicates whether the first field agrees or disagrees, and the second digit indicates agreement on the second field. The observer is assumed to hold relatively well-informed views. Suppose the observer chooses only those records whose variables agree at the beginning and the end; in that case, the result will be a purely matched dataset identical to the one obtained by deterministic linkage [51].
Therefore, all such methods assume that the master variable can be split into two partitions, with some common elements eliminated from both sets [52]. When a set of possible record pairs is formed, it can theoretically be partitioned into true matches [53], indicated by $M_j$, and truly unmatched records, denoted by $U_j = 1 - M_j$, where $j$ indexes the comparisons from $1$ to $J$ and separate codes are kept for matched and unmatched records. This is best understood through an example of matching two files on first and last name, where the agreement pattern for the $j$th comparison is defined by $\gamma_j$. The binary agreement indicator for the $i$th linkage field of the $j$th comparison is denoted by $\gamma_{ij}$, with 1 assigned for agreement and 0 for disagreement. In this case, the agreement indicators for the first name ($i = 1$) and last name ($i = 2$) yield the agreement pattern $\gamma_j = (\gamma_{1j}, \gamma_{2j})$. The conditional probability that a record pair with agreement pattern $\gamma_j$ is a true match is denoted by $m_j = P(\gamma_j \mid M_j = 1)$. Similarly, the conditional probability that a pair of records has agreement pattern $\gamma_j$, given that they are truly unmatched, is denoted by $u_j = P(\gamma_j \mid U_j = 1)$. The ratio $m_j / u_j$ is a likelihood ratio and forms the basis of the match weight. Under the conditional-independence assumption, these probabilities can be written as $m_j = P(\gamma_{1j} \mid M_j)\, P(\gamma_{2j} \mid M_j)$ and $u_j = P(\gamma_{1j} \mid U_j)\, P(\gamma_{2j} \mid U_j)$. Conditional independence between the linkage fields of the records of interest is the key assumption of the Fellegi and Sunter formulation [54].
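To make the match weights above concrete, the following Python sketch (with hypothetical field names and m/u probabilities, not the study's actual linkage parameters) sums per-field log-likelihood ratios under the conditional-independence assumption:

```python
import math

# Hypothetical m- and u-probabilities for two linkage fields (first and last name).
# m[i]: P(field i agrees | records truly match); u[i]: P(field i agrees | truly unmatched).
m = {"first_name": 0.95, "last_name": 0.92}
u = {"first_name": 0.10, "last_name": 0.05}

def match_weight(agreement):
    """Sum of per-field log2 likelihood ratios under conditional independence."""
    weight = 0.0
    for field, agrees in agreement.items():
        if agrees:                      # gamma_ij = 1
            weight += math.log2(m[field] / u[field])
        else:                           # gamma_ij = 0
            weight += math.log2((1 - m[field]) / (1 - u[field]))
    return weight

# Example: first names agree, last names disagree.
print(match_weight({"first_name": True, "last_name": False}))  # low weight => likely non-match
```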

3. Methods Employed

The methods employed and implemented through non-linear autoregressive exogenous (NARX) models followed several related studies in [55,56,57,58]. Section 3.2 explains the applicable machine learning models.
In addition, categorization is performed using counts of comparable properties in the training data. The trainable model employed in this case was created using supervised classification methods. As presented in Figure 1, the model outputs under normal and fault conditions are compared, which allows defects to be detected.

3.1. Neural Network Models

In this subsection, a non-linear autoregressive model with exogenous input (NARX) is used as the predictive model. This model is commonly used to identify fault data in dynamic systems and has significant applications in the simulation, monitoring, and analysis control of various systems [59,60]. The model is able to predict the next value of the input variable [17]. The NARX model is a nonlinear discrete-time model structure that can be used for univariate as well as multivariate analysis [21,22,61]. In addition, the classification aspect of model structure selection is crucial for the effectiveness of the identification process; a more effective strategy minimizes the simulation model's prediction error [21,22]. The desired NARX outcomes can be achieved by adjusting the number of layers, the neurons in each layer, the learning mechanism, and the activation function [62,63]. Activation mechanisms link layers following the mathematical model represented by Equations (1) and (2), which express the relationship between input and output as well as the link between the most recent value of the time series and previous values of the external input. An earlier study [64] provides the following equations:
$$ y(t) = f\big(u(t-1),\, u(t-2),\, \ldots,\, u(t - n\,\Delta t),\; y(t-1),\, \ldots,\, y(t - n\,\Delta t)\big) \qquad (1) $$
$$ y(t) = f_2\!\left( b_2 + \sum_{i=1}^{m} LW_i\, f_1\!\left( b_1 + \sum_{j=1}^{m} IW_{ij}\, u_j(t) \right) \right) \qquad (2) $$
where $f_1$ and $f_2$ are the activation functions of $y(t)$, $b_1$ is the bias of the first layer (the hidden layer), $b_2$ is the bias of the second layer (the output layer), $IW$ is the input weight, $LW$ is the output weight, and $t$ is the time step.
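As an illustration of Equations (1) and (2), the following minimal Python sketch (with arbitrary random weights; the study's network was trained in MATLAB) computes a one-step-ahead NARX output from lagged inputs and outputs using a tanh hidden layer and a linear output layer:

```python
import numpy as np

rng = np.random.default_rng(0)

n_hidden, n_lags = 10, 2                       # ten hidden neurons, delay order 2 (as in the paper)
IW = rng.normal(size=(n_hidden, 2 * n_lags))   # input weights over [u(t-1), u(t-2), y(t-1), y(t-2)]
LW = rng.normal(size=n_hidden)                 # layer (output) weights
b1 = rng.normal(size=n_hidden)                 # hidden-layer bias
b2 = 0.1                                       # output-layer bias

def narx_step(u_lags, y_lags):
    """One-step-ahead prediction y(t) = f2(b2 + LW . f1(b1 + IW @ p)), cf. Eq. (2)."""
    p = np.concatenate([u_lags, y_lags])       # regressor of past inputs and outputs, cf. Eq. (1)
    hidden = np.tanh(b1 + IW @ p)              # f1: hyperbolic tangent (hidden layer)
    return b2 + LW @ hidden                    # f2: linear (output layer)

print(narx_step(u_lags=[0.4, 0.3], y_lags=[1.2, 1.1]))
```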
The data were produced using an APD-MATLAB co-simulation. The experimental public data are described in Section 5.2. The simulation yielded 20,000 data points: 60% of the total data set (12,000 points) was used for training, 20% for validation, and 20% for testing. The validation set further optimizes the model by modifying it, whereas the training set is used to fit the model repeatedly. The test set provides an additional check on the model's effect without taking part in the training process. In addition, the data used for training, testing, and verification were selected at random. Figure 3 shows that the activation function of the output layer is linear, whereas that of the hidden layer is a hyperbolic tangent, following Equations (1) and (2).
Initially, the nonlinear model is run using oversampled input and output data, and the structure is subsequently refined using a decimated version of the original datasets. Figure 3 shows the response of elements for time-series, correlation, error target output, and observed prediction error. The efficiency and test performance are displayed in Table 1.

3.2. Machine Learning Models

A decision tree displays learned if-then facts as disjunctions of conjunctions over the attribute values and classifies observations or occurrences by splitting on their characteristics from the root to the leaf nodes. Let $f_i$ be the frequency of a particular class in a given node and let $C$ represent the number of classes. Following [65], Equations (3)-(5) are given by:
$$ \mathrm{Gini} = \sum_{i=1}^{C} f_i \,(1 - f_i) \qquad (3) $$
This indicates the probability of certain samples being incorrectly categorized. The entropy is given by:
$$ \mathrm{Entropy} = -\sum_{i=1}^{C} f_i \log_2 f_i \qquad (4) $$
This measures how the impurity of particular groups of examples influences efficacy. Splitting the training data on particular characteristics with this technique may help to minimize sample impurity. The information gain of a certain property $A$ over a sample $S$ is computed as follows, where $\mathrm{Imp}$ can be either the Gini or the entropy impurity measure of $S$, $\mathrm{Values}(A)$ represents all possible values of $A$, and $S_v$ is the subset of $S$ for which attribute $A$ has the value $v$. As a result, the knowledge gain can be obtained via:
$$ IG(S, A) = \mathrm{Imp}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Imp}(S_v) \qquad (5) $$
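A minimal Python sketch of Equations (3)-(5) on a toy label array (illustrative values only, not the study's data) could look as follows:

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return float(np.sum(f * (1 - f)))              # Eq. (3)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return float(-np.sum(f * np.log2(f)))          # Eq. (4)

def information_gain(labels, attribute, impurity=entropy):
    """Eq. (5): impurity of S minus the weighted impurity of the subsets S_v."""
    total = impurity(labels)
    for v in np.unique(attribute):
        mask = attribute == v
        total -= mask.mean() * impurity(labels[mask])
    return total

# Toy example: does a binary "smoker" attribute separate survived (0) from expired (1)?
y = np.array([0, 0, 1, 1, 1, 0])
smoker = np.array([0, 0, 1, 1, 0, 0])
print(gini(y), entropy(y), information_gain(y, smoker))
```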
A gradient descent tree (GDT) builds an ensemble of decision trees and reduces a loss function by iteratively training on different random subsets of the training data. Let $y_i$ be the label of an instance and let $N$ be the number of occurrences in a subsample. The features of an instance are stored in $x_i$, and $F(x_i)$ is the label predicted for instance $i$ by the model, following [66]. Thus, we obtain the following equation:
$$ \mathrm{logloss} = 2 \sum_{i=1}^{N} \log\!\left(1 + \exp\!\big(-2\, y_i\, F(x_i)\big)\right) \qquad (6) $$
The log loss function is used by GDT for classification problems. According to the naive Bayes theorem, a target value can be obtained from the product of the probabilities of the individual attributes because their values are assumed to be conditionally independent. Random forests identify the most common class of observations or occurrences using a collection of tree-structured classifiers. To decide the vote, these classifiers are trained on independent, identically distributed random observations or events drawn from the training data. This randomization reduces overfitting and delivers competitive classification results compared with other approaches.
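As a toy illustration of the GDT log loss in Equation (6) (with made-up labels and ensemble scores), the quantity can be evaluated as follows:

```python
import numpy as np

def gdt_log_loss(y, F):
    """Eq. (6): 2 * sum(log(1 + exp(-2 * y_i * F(x_i)))), with labels y_i in {-1, +1}."""
    y = np.asarray(y, dtype=float)
    F = np.asarray(F, dtype=float)
    return float(2.0 * np.sum(np.log1p(np.exp(-2.0 * y * F))))

# Toy ensemble scores: confident correct, confident wrong, uncertain.
print(gdt_log_loss(y=[+1, -1, +1], F=[2.0, 1.5, 0.1]))
```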
The support vector machine (SVM) seeks a hyperplane that separates points with different values of $y$, given a training data set of $n$ pairs $(x_1, y_1), \ldots, (x_n, y_n)$, where $y_i$ may assume the value $1$ or $-1$ to indicate the class to which the point $x_i$ belongs, and each $x_i$ is a $p$-dimensional real vector in $\mathbb{R}^p$.
Regression modeling is based on the values of the independent variables $x$. A regression classifier of this kind is able to forecast the probability that an event $E$ will occur. Following [67], Equations (7) and (8) are as follows:
$$ p(x) = \Pr\{E \mid x\} = \frac{1}{1 + \exp\!\big(-(\alpha + \beta x)\big)} \qquad (7) $$
This categorizes a probabilistic data point $x$ using a vector of independent variables $w$, with $\alpha$ and $\beta$ estimated from the training data. Let $z$ be the log-odds of a positive versus a negative outcome class given $x$ and $w$. If $F(z) > 0.5$, the outcome class is positive; otherwise, it is negative:
$$ F(z) = \frac{1}{1 + e^{-z}} \qquad (8) $$
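A minimal sketch of Equations (7) and (8), using hypothetical coefficients that are not fitted to the study's data, is shown below:

```python
import numpy as np

def sigmoid(z):
    """Eq. (8): F(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, alpha, beta, threshold=0.5):
    """Eq. (7): p(x) = Pr{E | x}; classify as positive when the probability exceeds 0.5."""
    p = sigmoid(alpha + np.dot(beta, x))
    return p, p > threshold

# Hypothetical coefficients for two binary symptom indicators (illustration only).
alpha, beta = -1.2, np.array([0.8, 1.5])
print(predict(np.array([1, 1]), alpha, beta))   # both symptoms present
```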

4. Comparative Analysis

Considering all of the approaches discussed above, one may presume that these contrasting methods have distinct pros and cons when applied to different situations and when applying the NARX neural network to fault or error detection [18,24,25,57]. Decision trees are simple to evaluate, build, and modify, and they may not require a massive training framework for datasets. Gradient descent trees may give high-quality performance [68], but they operate sequentially and need more work and time to understand. They are typically prone to overfitting, so it is critical to be watchful and cautious throughout the preprocessing step. Random forests, on the other hand, provide superior performance and are stronger than single decision trees in terms of accuracy [69]. They are also less affected by overfitting issues and can process hundreds of input variables without variable elimination. However, random forests may have a biased structure for features with many levels when dealing with categorical data with more than one level [70].
Among these techniques, naive Bayes classification is a fast and easy-to-apply method [71]. This classifier guarantees good results and is easy to implement. It is well suited to both real-valued and discrete data [72] and may not be influenced by irrelevant attributes within the dataset [73]. One of its primary downsides is that this classifier assumes the independence of characteristics in the training data, which means it may not be able to learn relationships between features. However, Sequential Minimal Optimization (SMO) is an effective technique for dealing with the convex optimization problem [74].
Finally, logistic regression, which is a basic procedure that is quick to apply [75,76,77], usually requires larger datasets than other methods to achieve stability and perform best; if the dataset has only a single decision boundary, logistic regression is less likely to overfit [66].

5. Experiments and Results

In this section, we describe how the experiment followed two steps, as mentioned earlier. In Section 5.3, we discuss equality, dissimilarity, and related measures used to indicate and begin this verification. Both qualitative and numerical attributes provide a binary answer. In Section 5.4, different methods are applied to the nominal values for which a measure of dissimilarity is meaningful for patients hospitalized with mild symptoms.

5.1. Trainable Model

The data inputs for the trainable model have specific characteristics that can imitate what a statistician commonly uses to assess linking outcomes. The technique includes the creation of a dataset that demonstrates how distinct the nominal values are, as well as the equality of the categorical or numerical properties needed by the linking algorithm.
However, to provide some data balance, the first phase uses a descriptive analysis of the dataset to find standard behavior between pairings and to reject the missing values shown in Figure 3. The second phase is data normalization, which provides the needed attributes, often the same characteristics employed by the linking algorithm, and the third is data classification during the accuracy evaluation stage, shown in Table 1. Even when pairs already have identical values, each pair is checked for discrepancies between characteristics. In addition, median categorization is applied to improve the data balance and the distance between two variables. To select the most effective classifier for our dataset, cross-validation is one of the best techniques for evaluating and choosing classifiers. The next step is the model execution phase, which enables the use of a tested technique with a fresh data mart.
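As a hedged illustration of this classifier-selection step, the following scikit-learn sketch (toy data and hypothetical feature columns; the study itself used IBM SPSS) normalizes the inputs and compares candidate classifiers by five-fold cross-validation:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 6)).astype(float)   # toy binary symptom indicators
y = rng.integers(0, 2, size=200)                       # toy mortality labels

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "naive Bayes": GaussianNB(),
}

for name, clf in candidates.items():
    pipe = make_pipeline(StandardScaler(), clf)        # normalization step before classification
    scores = cross_val_score(pipe, X, y, cv=5)         # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```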

5.2. Data Source

In this study, all experimental public datasets were collected from healthdata.gov regarding the mortality rates of cardiac, lung, and kidney diseases and other symptoms. Patients were hospitalized for COVID-19 symptoms with a certain physical history, such as a history of smoking or asthma. An example of the experimental dataset is available in Google Drive.

5.3. Data Monitoring and Detection

The effectiveness of the three distinct training algorithms is compared by choosing various numbers of neurons. Table 1 displays the network's performance (mean square error) for the different learning techniques and hidden-layer neuron counts. The best outcomes are obtained when the hidden layer has ten neurons; the performance (mean square error) of the Levenberg–Marquardt algorithm and the error histograms of the tested network are shown in Table 1.
NARX has been trained offline utilizing both accurate and inaccurate detection patterns. Time series of input and output data were collected through simulation, and the target series was predicted using the historical values of these time series. Three different target time-step types (training, validation, and testing) were employed for the validation and testing of the datasets. The input and target vectors are partitioned as follows (a minimal sketch of this split appears after the list):
  • Set 1: For training: 60% of the data points;
  • Set 2: For network generalization: 20% of the data was used to test the network; this dataset must be independent of everything else;
  • Set 3: To prevent the network from overtraining, training must be interrupted; in total, 20% of the data was used to validate the network.
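A minimal sketch of this random 60/20/20 partition (the actual split was produced by the MATLAB training routine) might look as follows:

```python
import numpy as np

def split_indices(n_points, seed=0):
    """Randomly partition indices into 60% training, 20% testing, 20% validation."""
    idx = np.random.default_rng(seed).permutation(n_points)
    n_train = int(0.6 * n_points)
    n_test = int(0.2 * n_points)
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]

train, test, val = split_indices(20000)     # 20,000 simulated data points
print(len(train), len(test), len(val))      # 12000 4000 4000
```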
During the network training process, Levenberg–Marquardt [78], Bayesian regularization [79], and scaled conjugate gradient [80] are the three learning techniques used to normalize the data concerning the common symptoms presented in Section 6. Therefore, we present the significance of this study following Equations (9)-(15), which are visualized in Figure 4.
Consequently, the accumulation of previous values of the sequences $u(t)$ and $y(t)$ is performed by NARX using tapped delay lines. The network's output $y(t)$ is fed back to the input through delays, following an earlier study [81]. The system may be written as:
$$ Y = f(Y_1,\, Y_2,\, \ldots,\, Y_n,\, U_1,\, U_2,\, \ldots,\, U_m) \qquad (9) $$
where,
$$ Y = \begin{bmatrix} y_1(k+1) \\ y_2(k+1) \\ \vdots \\ y_n(k+1) \end{bmatrix} \qquad (10) $$
$$ \begin{aligned} Y_1 &= \big[\, y_1(k),\, y_1(k-1),\, \ldots,\, y_1(k-n_y) \,\big] \\ Y_2 &= \big[\, y_2(k),\, y_2(k-1),\, \ldots,\, y_2(k-n_y) \,\big] \\ &\;\;\vdots \\ Y_n &= \big[\, y_n(k),\, y_n(k-1),\, \ldots,\, y_n(k-n_y) \,\big] \end{aligned} \qquad (11) $$
where the model outputs and inputs are $y_1(k), y_2(k), \ldots, y_n(k) \in \mathbb{R}$ and $u_1(k), u_2(k), \ldots, u_m(k) \in \mathbb{R}$, respectively, at discrete time step $k$; $n_u, n_y \geq 1$ are the memory orders for the input and output, respectively; and $m$ and $n$ are the numbers of input and output variables, respectively.
The following can be used to present the general form of a neural network for a single layer:
$$ \hat{y} = f^{1}\!\left( IW^{1,1}\, p + b^{1} \right) \qquad (12) $$
where $IW^{1,1}$ is the weight matrix of layer 1, $p$ is the vector of inputs to the neural network, $\hat{y}$ is the network output, $f^{i}$ is the activation function of layer $i$, and $b^{1}$ is the bias value at layer 1. By multiplying the bias values and the input-layer matrix with the hidden-layer matrix, the $f^{1}$ function can be simplified. Utilizing the idea of backward shift operators, the above form for a linear activation function can be simplified in vector form, i.e.,
$$ y_1(k - i) = y_1(k)\, q^{-i}, \qquad y_1(k-1) = y_1(k)\, q^{-1}, \qquad y_1(k-2) = y_1(k)\, q^{-2} \qquad (13) $$
$$ Y_1 = \big[\, y_1(k),\, y_1(k)\, q^{-1},\, y_1(k)\, q^{-2},\, \ldots,\, y_1(k)\, q^{-n_y} \,\big] = y_1(k)\, \big[\, 1,\, q^{-1},\, q^{-2},\, \ldots,\, q^{-n_y} \,\big] \qquad (14) $$
The condensed vector form is as follows:
$$ \hat{y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix} = f^{1}\!\left( IW^{1,1}\, p + b^{1} \right) \qquad (15) $$
where $IW^{1,1}$ is the weight matrix at layer 1 and $b^{1}$ is the bias value at layer 1. According to Equations (1), (2), and (9)-(15), the system can forecast both present and future output values, so it is also capable of monitoring and detecting errors. This method is suitable for the early identification of errors in the input and output values.
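To illustrate the tapped-delay-line regressors of Equations (11), (13), and (14), the following sketch (toy series, not the simulation data) builds the lagged vector $Y_1$ at a given time step:

```python
import numpy as np

def tapped_delay(series, k, n_lags):
    """Y_1 = [y(k), y(k-1), ..., y(k-n_y)]: the regressor of Eq. (11)/(14) at step k."""
    return np.array([series[k - i] for i in range(n_lags + 1)])

y = np.sin(0.1 * np.arange(100))             # toy output series
print(tapped_delay(y, k=50, n_lags=2))       # [y(50), y(49), y(48)]
```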
Figure 4 shows the correlation coefficients between the model's predicted values and the actual values, whether for the training set, the validation set, or the test set; each regression mode shows a value very close to 1. Moreover, in Section 5.4, we discuss several machine learning approaches for correlating data and analyzing relationships between variables, considering patients admitted with COVID-19 symptoms.
We forecast the current and future values of the system's outputs using the corresponding input and output values. However, as in Figure 3, the accuracy of the predictions is directly proportional to the amount of training data used and inversely related to how far in advance the forecast is made.
Figure 4 displays the tested network's correlation performance, where the correlation between the output data and the target data is measured and R = 1 indicates that the output and target are closely related, as mentioned earlier. In the experimental simulation process, the obtained mean square error value is $5.867 \times 10^{-2}$. Therefore, a value can be forecast based on this mean square error. In contrast, this approach is chosen to forecast the attractor phase route of this method with the best training effect and the corresponding number of hidden-layer neurons.
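For reference, the reported mean square error and the correlation coefficient R between network output and target can be computed as in the following sketch (toy series only, not the simulation data):

```python
import numpy as np

def mse(target, output):
    return float(np.mean((np.asarray(target) - np.asarray(output)) ** 2))

def correlation_r(target, output):
    """Pearson correlation between target and output; R close to 1 means a close fit."""
    return float(np.corrcoef(target, output)[0, 1])

t = np.linspace(0, 10, 200)
target = np.sin(t)
output = target + np.random.default_rng(0).normal(scale=0.05, size=t.size)  # toy prediction
print(mse(target, output), correlation_r(target, output))
```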

5.4. Machine Learning Approach

This subsection was implemented using IBM SPSS Statistics Campus Edition to run the machine learning models discussed theoretically in Section 3.2. In addition, we present evidence relating mild symptoms of congestive heart failure, shortness of breath, chronic obstructive lung disease, acute cardiac injury, and acute kidney injury to COVID-19 patients. As mentioned earlier, this stage generates a categorization depending on the selected learned model, as shown in Table 2.
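As a sketch of this cross-checking step (with hypothetical column names and toy records; the study used SPSS on data from healthdata.gov), mortality percentages per symptom group can be obtained with a simple cross-tabulation:

```python
import pandas as pd

# Toy records with hypothetical column names (illustration only).
df = pd.DataFrame({
    "mortality": ["expired", "survived", "expired", "survived", "expired", "survived"],
    "smoking_history": [1, 0, 1, 0, 1, 1],
    "shortness_of_breath": [1, 1, 0, 0, 1, 0],
})

# Cross-tabulate each symptom against mortality, as row percentages.
for symptom in ["smoking_history", "shortness_of_breath"]:
    table = pd.crosstab(df[symptom], df["mortality"], normalize="index") * 100
    print(f"\n{symptom} (% per group):\n{table.round(1)}")
```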
The statistic of 85.8% for smoking history contributing to the mortality rate shows that lung disease accounts for a higher percentage of mortality among COVID-19 patients admitted to hospital, and that many patients have severe damage due to smoking. A total of 92.1% of patients expired when admitted to hospitals suffering from chronic obstructive lung disease.
In Table 3, model fitting criteria also suggest the algorithm due to its higher contribution percentage among other algorithms and a higher percentage of linkage success.
In Table 4, the chi-square statistic quantifies the difference in −2 log-likelihoods between the final model and a reduced model. The reduced model is formed by removing an effect from the final model. The null hypothesis is that all parameters of that effect are 0. The reduced model is equivalent to the complete model when omitting the effect does not change the result.
Smoking history had a significance of 0.496 with respect to the mortality rate of patients, which suggests that many patients have severe damage due to smoking, as shown in Table 5. In contrast, acute cardiac injury had a significance of 0.000 with respect to mortality, which shows that a higher percentage of mortality is attributable to the symptoms of COVID-19.
Smoking history made a full 1.00 contribution towards the mortality rate of patients; Table 6 exhibits the damage associated with smoking history.
The cluster analysis algorithm gives less weight to smoking history and to almost every other factor, as shown in Table 7; however, it gives a higher weight to shortness of breath.

6. Discussion

This paper has discussed the subject using a variety of prediction methods, as we mentioned earlier in Section 3.1, demonstrating a close connection between the target and the outcome. The measured mean square error during the experimental simulation of intelligent detection methods can also be applied to several algorithms regarding this study purpose, as we presented in Table 1, where there were noise and measurement interruptions. The use of diagnostics tools to isolate and identify plant problems may be added to this strategy. Research indicates that machine learning algorithms considerably enhance decision-making, giving them an edge over alternative approaches. This study shows that people with COVID-19 symptoms who also have other illnesses, including kidney, heart, or lung problems, are more likely to perish.
The fact that 85.8% of COVID-19 patients with a smoking history died indicates that COVID-19 patients are more likely to die from lung disease. Additionally, our utilization of multiple methodologies and processes resulted in logistic regression outperforming other classification algorithms when utilizing the supplied dataset and a cross-validation strategy, as shown in Table 7. In spite of preprocessing, modification, and categorization techniques, other models might produce superior results. This study involved patients admitted to hospitals due to COVID-19 symptoms and with a major disease. Their mortality and survival rates are discussed with the contribution of different diseases, and symptoms.
Figure 5a differentiates survived and expired patients with corresponding congestive heart disease symptoms among the 572 total patients. The results show that 89.22% of females had no congestive heart disease symptoms; in contrast, only 10.78% of females and 14.12% of males showed heart disease. Several study reviews treat this subject [34,47,50,52,82]. This trend indicates that heart disease was not very common in patients diagnosed with COVID-19. Among females, 32.34% depicted no shortness of breath symptoms, while 182 showed this symptom among their COVID-19 symptoms. A total of 115 of the males described no shortness of breath symptom, whereas 188 had this COVID-19 symptom. In Figure 5c, out of the total, 182 females survived while 85 expired; of the males, 174 survived and 129 expired due to COVID-19 symptoms. Figure 5d shows that 226 of the females survived while 43 expired due to COVID-19 symptoms, while 236 of the males survived and 67 expired due to acute cardiac injury with COVID-19 symptoms.
Figure 6a shows ICU admission with a gender-based chart, which indicates that 227 out of the total females survived while 42 expired with COVID-19 symptoms, while 235 out of the total males survived and 66 expired with COVID-19 symptoms. In Figure 6b, 236 female patients with chronic kidney disease survived while 12.26% expired with COVID-19 symptoms, and 252 males survived and 16.83% expired. In Figure 6c, 251 females with chronic obstructive lung disease survived while 18 expired, whereas 276 out of the total males survived and 27 expired due to COVID-19 symptoms. In Figure 6d, exhibiting mortality rate with a gender-based chart, 178 out of the total females survived while 91 expired with COVID-19 symptoms, while 173 out of the total males survived and 130 expired.
The mortality rate for females was comparable to that of males. The magnitude of the data may have an impact on the results. Such models and their methods contain initial stages that allow us to create a dataset with features that can be utilized to develop and assess models. In contrast, the fault detection method proved quite successful at tracking the column's unusual behavior. All measurements included the addition of simulated zero-mean, normally distributed sensor noise, as shown in Figure 4. The neural network model was trained using the generated data, and the network's performance was verified using the error autocorrelation in Figure 3, employing different machine learning classifiers followed by their configuration for discussion and validation of new data marts. However, a cluster algorithm will have a more biased opinion than a multinomial logistic algorithm, as shown in Table 7. Increasing the number of iterations of gradient boosted trees, random forests, and SVM might potentially produce beneficial outcomes, as mentioned earlier. New characteristics and the phenomenon of dissimilarity may hold the key to more precise results.

7. Conclusions

Our findings establish that higher accuracy comes with time-consuming operations that become impractical when dealing with large healthcare datasets. This validation procedure is a binary classification problem, and manual evaluations of the process might be removed by employing the trainable models presented in Figure 3. According to the results, 60% of the data was utilized to train the neural network model, and the remaining 40% was used for verification. The proposed model was found to be effective at accurately representing system behavior and ideal for identifying internal and external errors in the extract columns of health datasets.
However, the results outperformed expectations. Regarding the mortality rate of COVID-19-affected patients, 15.91% of females and 22.73% of males expired. A total of 31.82% of females and 32.87% of males had mild symptoms of shortness of breath, and 5.07% of females and 7.58% of males had congestive heart disease symptoms. Overall, 13.2% of males and females expired with common symptoms such as congestive heart disease, shortness of breath, and chronic obstructive lung disease.
Future work should explore the reliability of matching in different data generating circumstances and outcome kinds, as this study followed earlier work [83]. There may be additional ways to divide and evaluate risk factors, and we are considering future work confounders for the simulation [84]. Additionally, more investigation into the effectiveness of these methods is required when there is significant unmeasured confounding or uncertainty in model specification and the accuracy of standard error estimators for these methods is uncertain.

Author Contributions

Direction by L.W.; original manuscript, S.M., L.W. and Y.I.; formal analysis, F.A.J.O. The study's design; the gathering, analysis, and interpretation of the datasets; the preparation of the article; and the decision to submit it for publication were under the purview of all authors. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation, China (No. 61572035, 61402011), the Leading Backbone Talent Project in Anhui Province, China (No. 2020-1-12), the Natural Science Foundation of Anhui Province, China (No. 2008085QD178), Anhui Province Academic and Technical Leader Foundation (No. 2019H239), and Anhui Province College Excellent Young Talents Fund Project of China (No. gxyqZD2020020), Open Research Fund of Anhui Province Engineering Laboratory for Big Data Analysis and Early Warning Technology of Coal Mine Safety (NO. CSBD2022-ZD03).

Data Availability Statement

The article contains the original contributions made for this study; further questions should be addressed to the corresponding author.

Acknowledgments

The authors are grateful to Lili Wang, Lei Pengfei, Mohammad Arfan Ali, Md. Jalal Uddin, Abdulkadir Abdulahi Hasan, Abdiaziz Omar Hassan and Atira Cesare Mutia, for encouragement and inspiration.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Berger, M.L.; Sox, H.; Willke, R.J.; Brixner, D.L.; Eichler, H.G.; Goettsch, W.; Madigan, D.; Makady, A.; Schneeweiss, S.; Tarricone, R.; et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: Recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Value Health 2017, 20, 1003–1008. [Google Scholar] [CrossRef] [PubMed]
  2. Franz, L.; Shrestha, Y.R.; Paudel, B. A deep learning pipeline for patient diagnosis prediction using electronic health records. arXiv 2020, arXiv:2006.16926. [Google Scholar]
  3. Xu, J.; Glicksberg, B.S.; Su, C.; Walker, P.; Bian, J.; Wang, F. Federated learning for healthcare informatics. J. Healthc. Inform. Res. 2021, 5, 1–19. [Google Scholar] [CrossRef] [PubMed]
  4. Pavlopoulou, N.; Curry, E. PoSSUM: An Entity-centric Publish/Subscribe System for Diverse Summarization in Internet of Things. ACM Trans. Internet Technol. TOIT 2022, 22, 1–30. [Google Scholar] [CrossRef]
  5. Liu, X.; Xu, L.Q. Knowledge Graph Building from Real-world Multisource “Dirty” Clinical Electronic Medical Records for Intelligent Consultation Applications. In Proceedings of the 2021 IEEE International Conference on Digital Health (ICDH), Chicago, IL, USA, 5–10 September 2021. [Google Scholar]
  6. Steorts, R.C.; Ventura, S.L.; Sadinle, M.; Fienberg, S.E. A comparison of blocking methods for record linkage. In Proceedings of the International Conference on Privacy in Statistical Databases, Ibiza, Spain, 17–19 September 2014. [Google Scholar]
  7. Pérez-Moraga, R.; Forés-Martos, J.; Suay-García, B.; Duval, J.L.; Falcó, A.; Climent, J. A COVID-19 drug repurposing strategy through quantitative homological similarities using a topological data analysis-based framework. Pharmaceutics 2021, 13, 488. [Google Scholar] [CrossRef]
  8. Hung, T.N.K.; Le, N.Q.K.; Le, N.H.; Van Tuan, L.; Nguyen, T.P.; Thi, C.; Kang, J.H. An AI-based Prediction Model for Drug-drug Interactions in Osteoporosis and Paget’s Diseases from SMILES. Mol. Inform. 2022, 41, 2100264. [Google Scholar] [CrossRef]
  9. Ouyang, D.; He, B.; Ghorbani, A.; Yuan, N.; Ebinger, J.; Langlotz, C.P.; Heidenreich, P.A.; Harrington, R.A.; Liang, D.H.; Ashley, E.A.; et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature 2020, 580, 252–256. [Google Scholar] [CrossRef]
  10. Rahman, M.M.; Saha, T.; Islam, K.J.; Suman, R.H.; Biswas, S.; Rahat, E.U.; Hossen, M.R.; Islam, R.; Hossain, M.N.; Mamun, A.A.; et al. Virtual screening, molecular dynamics and structure–activity relationship studies to identify potent approved drugs for COVID-19 treatment. J. Biomol. Struct. Dyn. 2021, 39, 6231–6241. [Google Scholar] [CrossRef]
  11. Persson, R.; Vasilakis-Scaramozza, C.; Hagberg, K.W.; Sponholtz, T.; Williams, T.; Myles, P.; Jick, S.S. CPRD Aurum database: Assessment of data quality and completeness of three important comorbidities. Pharmacoepidemiol. Drug Saf. 2020, 29, 1456–1464. [Google Scholar] [CrossRef]
  12. Schmidt, M.; Schmidt, S.A.J.; Adelborg, K.; Sundbøll, J.; Laugesen, K.; Ehrenstein, V.; Sørensen, H.T. The Danish health care system and epidemiological research: From health care contacts to database records. Clin. Epidemiol. 2019, 11, 563. [Google Scholar] [CrossRef]
  13. Singh, R.P.; Javaid, M.; Haleem, A.; Suman, R. Internet of things (IoT) applications to fight against COVID-19 pandemic. Diabetes Metab. Syndr. Clin. Res. Rev. 2020, 14, 521–524. [Google Scholar] [CrossRef] [PubMed]
  14. Xiao, W.; Jing, L.; Xu, Y.; Zheng, S.; Gan, Y.; Wen, C. Different Data Mining Approaches Based Medical Text Data. J. Healthc. Eng. 2021, 2021, 11. [Google Scholar] [CrossRef] [PubMed]
  15. Ramadan, B.; Christen, P.; Liang, H.; Gayler, R.W. Dynamic sorted neighborhood indexing for real-time entity resolution. J. Data Inf. Qual. JDIQ 2015, 6, 1–29. [Google Scholar] [CrossRef]
  16. Rad, M.A.A.; Yazdanpanah, M.J. Designing supervised local neural network classifiers based on EM clustering for fault diagnosis of Tennessee Eastman process. Chemom. Intell. Lab. Syst. 2015, 146, 149–157. [Google Scholar]
  17. Nozari, H.A.; Shoorehdeli, M.A.; Simani, S.; Banadaki, H.D. Model-based robust fault detection and isolation of an industrial gas turbine prototype using soft computing techniques. Neurocomputing 2012, 91, 29–47. [Google Scholar] [CrossRef]
  18. Lin, T.; Horne, B.G.; Giles, C.L.; Kung, S.Y. What to remember: How memory order affects the performance of NARX neural networks. In Proceedings of the IEEE International Joint Conference on Neural Networks Proceedings, IEEE World Congress on Computational Intelligence (Cat. No. 98CH36227), Anchorage, AK, USA, 4–9 May 1998. [Google Scholar]
  19. Isqeel, A.A.; Eyiomika, S.M.J.; Ismaeel, T.B. Consumer Load Prediction Based on NARX for Electricity Theft Detection. In Proceedings of the International Conference on Computer and Communication Engineering (ICCCE), Kuala Lumpur, Malaysia, 26–27 July 2016. [Google Scholar]
  20. Dzielinski, A. NARX Models Application to Model Based Nonlinear Control. 1999. Available online: https://mathweb.ucsd.edu/~helton/MTNSHISTORY/CONTENTS/2000PERPIGNAN/CDROM/articles/SI20A_2.pdf (accessed on 31 August 2022).
  21. Lin, T.N.; Giles, C.L.; Horne, B.G.; Kung, S.Y. A delay damage model selection algorithm for NARX neural networks. IEEE Trans. Signal Process. 1997, 45, 2719–2730. [Google Scholar]
  22. Menezes, J.M.P., Jr.; Barreto, G.A. Long-term time series prediction with the NARX network: An empirical evaluation. Neurocomputing 2008, 71, 3335–3343. [Google Scholar] [CrossRef]
  23. Rusinov, L.A.; Rudakova, I.V.; Remizova, O.A.; Kurkina, V.V. Fault diagnosis in chemical processes with application of hierarchical neural networks. Chemom. Intell. Lab. Syst. 2009, 97, 98–103. [Google Scholar] [CrossRef]
  24. Diaconescu, E. The use of NARX neural networks to predict chaotic time series. Wseas Trans. Comput. Res. 2008, 3, 182–191. [Google Scholar]
  25. Inaoka, H.; Kobayashi, K.; Nebuya, S.; Kumagai, H.; Tsuruta, H.; Fukuoka, Y. Derivation of NARX models by expanding activation functions in neural networks. IEEJ Trans. Electr. Electron. Eng. 2019, 14, 1209–1218. [Google Scholar] [CrossRef]
  26. Banihabib, M.E.; Ahmadian, A.; Valipour, M. Hybrid MARMA-NARX model for flow forecasting based on the large-scale climate signals, sea-surface temperatures, and rainfall. Hydrol. Res. 2018, 49, 1788–1803. [Google Scholar] [CrossRef]
  27. Gale, N.K.; Heath, G.; Cameron, E.; Rashid, S.; Redwood, S. Using the framework method for the analysis of qualitative data in multi-disciplinary health research. BMC Med. Res. Methodol. 2013, 13, 1–8. [Google Scholar] [CrossRef]
  28. Sun, W.; Cai, Z.; Li, Y.; Liu, F.; Fang, S.; Wang, G. Data processing and text mining technologies on electronic medical records: A review. J. Healthc. Eng. 2018, 2018, 9. [Google Scholar] [CrossRef] [PubMed]
  29. Zeng, Z.; Deng, Y.; Li, X.; Naumann, T.; Luo, Y. Natural language processing for EHR-based computational phenotyping. IEEE/ACM Trans. Comput. Biol. Bioinform. 2018, 16, 139–153. [Google Scholar] [CrossRef] [PubMed]
  30. Jutte, D.P.; Roos, L.L.; Brownell, M.D. Administrative record linkage as a tool for public health research. Annu. Rev. Public Health 2011, 32, 91–108. [Google Scholar] [CrossRef]
  31. Hellewell, J.; Abbott, S.; Gimma, A.; Bosse, N.I.; Jarvis, C.I.; Russell, T.W.; Munday, J.D.; Kucharski, A.J.; Edmunds, W.J.; Sun, F.; et al. Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts. Lancet Glob. Health 2020, 8, e488–e496. [Google Scholar] [CrossRef]
  32. Nakano, J.; Hashizume, K.; Fukushima, T.; Ueno, K.; Matsuura, E.; Ikio, Y.; Ishii, S.; Morishita, S.; Tanaka, K.; Kusuba, Y. Effects of aerobic and resistance exercises on physical symptoms in cancer patients: A meta-analysis. Integr. Cancer Ther. 2018, 17, 1048–1058. [Google Scholar] [CrossRef]
  33. Koonin, L.M. Novel coronavirus disease (COVID-19) outbreak: Now is the time to refresh pandemic plans. J. Bus. Contin. Emerg. Plan. 2020, 13, 298–312. [Google Scholar]
  34. Ågerfalk, P.J.; Conboy, K.; Myers, M.D. Information systems in the age of pandemics: COVID-19 and beyond. Eur. J. Inf. Syst. 2020, 29, 203–207. [Google Scholar] [CrossRef]
  35. Mosavi, N.S.; Santos, M.F. How prescriptive analytics influences decision making in precision medicine. Procedia Comput. Sci. 2020, 177, 528–533. [Google Scholar] [CrossRef]
  36. Harper, A.; Mustafee, N. Proactive service recovery in emergency departments: A hybrid modelling approach using forecasting and real-time simulation. In Proceedings of the SIGSIM Principles of Advanced Discrete Simulation, Chicago, IL, USA, 29 May 2019. [Google Scholar]
  37. Mbuh, M.J.; Metzger, P.; Brandt, P.; Fika, K.; Slinkey, M. Application of real-time GIS analytics to support spatial intelligent decision-making in the era of big data for smart cities. EAI Endorsed Trans. Smart Cities 2019, 4, e3. [Google Scholar] [CrossRef]
  38. Liu, J.; Khattak, A. Informed decision-making by integrating historical on-road driving performance data in high-resolution maps for connected and automated vehicles. J. Intell. Transp. Syst. 2020, 24, 11–23. [Google Scholar] [CrossRef]
  39. Konchak, C.W.; Krive, J.; Au, L.; Chertok, D.; Dugad, P.; Granchalek, G.; Livschiz, E.; Mandala, R.; McElvania, E.; Park, C.; et al. From testing to decision-making: A data-driven analytics COVID-19 response. Acad. Pathol. 2021, 8, 23742895211010257. [Google Scholar] [CrossRef] [PubMed]
  40. Bousdekis, A.; Lepenioti, K.; Apostolou, D.; Mentzas, G. A review of data-driven decision-making methods for industry 4.0 maintenance applications. Electronics 2021, 10, 828. [Google Scholar] [CrossRef]
  41. Lugaresi, G.; Matta, A. Real-time simulation in manufacturing systems: Challenges and research directions. In Proceedings of the Winter Simulation Conference (WSC), Gothenburg, Sweden, 9–12 December 2018. [Google Scholar]
  42. Liang, J.; Li, Y.; Zhang, Z.; Shen, D.; Xu, J.; Zheng, X.; Wang, T.; Tang, B.; Lei, J.; Zhang, J. Adoption of Electronic Health Records (EHRs) in China during the past 10 years: Consecutive survey data analysis and comparison of sino-american challenges and experiences. J. Med. Internet Res. 2021, 23, e24813. [Google Scholar] [CrossRef]
  43. Wynberg, E.; van Willigen, H.D.; Dijkstra, M.; Boyd, A.; Kootstra, N.A.; van den Aardweg, J.G.; van Gils, M.J.; Matser, A.; de Wit, M.R.; Leenstra, T.; et al. Evolution of coronavirus disease 2019 (COVID-19) symptoms during the first 12 months after illness onset. Clin. Infect. Dis. 2022, 75, e482–e490. [Google Scholar] [CrossRef]
  44. Austin, P.C. A critical appraisal of propensity—score matching in the medical literature between 1996 and 2003. Stat. Med. 2008, 27, 2037–2049. [Google Scholar] [CrossRef]
  45. Brookhart, M.A.; Schneeweiss, S.; Rothman, K.J.; Glynn, R.J.; Avorn, J.; Stürmer, T. Variable selection for propensity score models. Am. J. Epidemiol. 2006, 163, 1149–1156. [Google Scholar] [CrossRef]
  46. Carfì, A.; Bernabei, R.; Landi, F. Persistent symptoms in patients after acute COVID-19. JAMA 2020, 324, 603–605. [Google Scholar] [CrossRef]
  47. Lopez-Leon, S.; Wegman-Ostrosky, T.; Perelman, C.; Sepulveda, R.; Rebolledo, P.A.; Cuapio, A.; Villapol, S. More than 50 long-term effects of COVID-19: A systematic review and meta-analysis. Sci. Rep. 2021, 11, 16144. [Google Scholar] [CrossRef]
  48. Salamanna, F.; Veronesi, F.; Martini, L.; Landini, M.P.; Fini, M. Post-COVID-19 syndrome: The persistent symptoms at the post-viral stage of the disease. A systematic review of the current data. Front. Med. 2021, 8, 653516. [Google Scholar] [CrossRef] [PubMed]
  49. Law, T.H.; Ng, C.P.; Poi, A.W.H. The sources of the Kuznets relationship between the COVID-19 mortality rate and economic performance. Int. J. Disaster Risk Reduct. 2022, 81, 103233. [Google Scholar] [CrossRef] [PubMed]
  50. Sze, S.; Pan, D.; Nevill, C.R.; Gray, L.J.; Martin, C.A.; Nazareth, J.; Minhas, J.S.; Divall, P.; Khunti, K.; Abrams, K.R.; et al. Ethnicity and clinical outcomes in COVID-19: A systematic review and meta-analysis. E Clin. Med. 2020, 29, 100630. [Google Scholar] [CrossRef]
  51. Sayers, A.; Ben-Shlomo, Y.; Blom, A.W.; Steele, F. Probabilistic record linkage. Int. J. Epidemiol. 2016, 45, 954–964. [Google Scholar] [CrossRef]
  52. Zhao, Y.J.; Xing, X.; Tian, T.; Wang, Q.; Liang, S.; Wang, Z.; Cheung, T.; Su, Z.; Tang, Y.L.; Ng, C.H.; et al. Post COVID-19 mental health symptoms and quality of life among COVID-19 frontline clinicians: A comparative study using propensity score matching approach. Transl. Psychiatry 2022, 12, 1–7. [Google Scholar] [CrossRef] [PubMed]
  53. Jaro, M.A. Probabilistic linkage of large public health data files. Stat. Med. 1995, 14, 491–498. [Google Scholar] [CrossRef] [PubMed]
  54. Fellegi, I.P.; Sunter, A.B. A theory for record linkage. J. Am. Stat. Assoc. 1969, 64, 1183–1210. [Google Scholar] [CrossRef]
  55. Yassin, I.M.; Zabidi, A.; Ali, M.S.; Baharom, R. PSO-Optimized COVID-19 MLP-NARX Mortality Prediction Model. In Proceedings of the IEEE Industrial Electronics and Applications Conference (IEACon), Penang, Malaysia, 22–23 November 2021. [Google Scholar]
  56. Peng, C.C.; Yeh, C.W.; Wang, J.G.; Wang, S.H.; Huang, C.W. Prediction of LME lead spot price by neural network and NARX model. In Proceedings of the 2nd Eurasia Conference on Biomedical Engineering, Healthcare and Sustainability (ECBIOS), Tainan, Taiwan, 29–31 May 2020. [Google Scholar]
  57. Bhattacharjee, U.; Chakraborty, M. NARX-Wavelet Based Active Model for Removing Motion Artifacts from ECG. In Proceedings of the International Conference on Computer, Electrical & Communication Engineering (ICCECE), Kolkata, India, 17–18 January 2020. [Google Scholar]
  58. Wei, H.L. Sparse, interpretable and transparent predictive model identification for healthcare data analysis. In Proceedings of the International Work-Conference on Artificial Neural Networks, Gran Canaria, Spain, 12–14 June 2019. [Google Scholar]
  59. Chen, S.; Billings, S.A.; Grant, P.M. Non-linear system identification using neural networks. Int. J. Control 1990, 51, 1191–1214. [Google Scholar] [CrossRef]
  60. Kumpati, S.N.; Kannan, P. Identification and control of dynamical systems using neural networks. IEEE Trans. Neural Netw. 1990, 1, 4–27. [Google Scholar]
  61. Ljung, L.; Söderström, T. Theory and Practice of Recursive Identification; MIT Press: Cambridge, MA, USA, 1983. [Google Scholar]
  62. Zemouri, R.; Gouriveau, R.; Zerhouni, N. Defining and applying prediction performance metrics on a recurrent NARX time series model. Neurocomputing 2010, 73, 2506–2521. [Google Scholar] [CrossRef]
  63. Gao, Y.; Liu, S.; Li, F.; Liu, Z. Fault detection and diagnosis method for cooling dehumidifier based on LS-SVM NARX model. Int. J. Refrig. 2016, 61, 69–81. [Google Scholar] [CrossRef]
  64. Kong, S.; Li, C.; He, S.; Çiçek, S.; Lai, Q. A memristive map with coexisting chaos and hyperchaos. Chin. Phys. B 2021, 30, 110502. [Google Scholar] [CrossRef]
  65. Polato, M.; Lauriola, I.; Aiolli, F. A novel boolean kernels family for categorical data. Entropy 2018, 20, 444. [Google Scholar] [CrossRef] [PubMed]
  66. Bisong, E. Logistic regression. In Building Machine Learning and Deep Learning Models on Google Cloud Platform; Apress: Berkeley, CA, USA, 2019; pp. 243–250. [Google Scholar]
  67. Du Toit, C.F. The numerical computation of Bessel functions of the first and second kind for integer orders and complex arguments. IEEE Trans. Antennas Propag. 1990, 38, 1341–1349. [Google Scholar] [CrossRef]
  68. Flynn, J.; Broxton, M.; Debevec, P.; DuVall, M.; Fyffe, G.; Overbeck, R.; Snavely, N.; Tucker, R. Deepview: View synthesis with learned gradient descent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  69. Al Hamad, M.; Zeki, A.M. Accuracy vs. cost in decision trees: A survey. In Proceedings of the International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT), Sakhier, Bahrain, 18–20 November 2018. [Google Scholar]
  70. Krauss, C.; Do, X.A.; Huck, N. Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500. Eur. J. Oper. Res. 2017, 259, 689–702. [Google Scholar]
  71. Bhuvaneswari, R.; Kalaiselvi, K. Naive Bayesian classification approach in healthcare applications. Int. J. Comput. Sci. Telecommun. 2012, 3, 106–112. [Google Scholar]
  72. Vembandasamy, K.; Sasipriya, R.; Deepa, E. Heart diseases detection using Naive Bayes algorithm. Int. J. Innov. Sci. Eng. Technol. 2015, 2, 441–444. [Google Scholar]
  73. Jadhav, S.D.; Channe, H.P. Comparative study of K-NN, naive Bayes and decision tree classification techniques. Int. J. Sci. Res. IJSR 2016, 5, 1842–1845. [Google Scholar]
  74. Sornalakshmi, M.; Balamurali, S.; Venkatesulu, M.; Navaneetha Krishnan, M.; Ramasamy, L.K.; Kadry, S.; Manogaran, G.; Hsu, C.H.; Muthu, B.A. Hybrid method for mining rules based on enhanced Apriori algorithm with sequential minimal optimization in healthcare industry. Neural Comput. Appl. 2020, 34, 10597–10610. [Google Scholar] [CrossRef]
  75. Jothi, N.; Husain, W. Data mining in healthcare—A review. Procedia Comput. Sci. 2015, 72, 306–313. [Google Scholar] [CrossRef] [Green Version]
  76. Manogaran, G.; Lopez, D. Health data analytics using scalable logistic regression with stochastic gradient descent. Int. J. Adv. Intell. Paradig. 2018, 10, 118–132. [Google Scholar] [CrossRef]
  77. Demir, E. A decision support tool for predicting patients at risk of readmission: A comparison of classification trees, logistic regression, generalized additive models, and multivariate adaptive regression splines. Decis. Sci. 2014, 45, 849–880. [Google Scholar] [CrossRef]
  78. Khan, N.; Gaurav, D.; Kandl, T. Performance evaluation of Levenberg-Marquardt technique in error reduction for diabetes condition classification. Procedia Comput. Sci. 2013, 18, 2629–2637. [Google Scholar] [CrossRef]
  79. McCormick, M.; Rubert, N.; Varghese, T. Bayesian regularization applied to ultrasound strain imaging. IEEE Trans. Biomed. Eng. 2011, 58, 1612–1620. [Google Scholar] [CrossRef]
  80. Paul, B.; Karn, B. Heart Disease Prediction Using Scaled Conjugate Gradient Back Propagation of Artificial Neural Network. 2022. Available online: https://www.researchsquare.com/article/rs-1490110/latest.pdf (accessed on 31 August 2022).
  81. Taqvi, S.A.; Tufa, L.D.; Zabiri, H.; Maulud, A.S.; Uddin, F. Fault detection in distillation column using NARX neural network. Neural Comput. Appl. 2020, 32, 3503–3519. [Google Scholar] [CrossRef]
  82. Nishiga, M.; Wang, D.W.; Han, Y.; Lewis, D.B.; Wu, J.C. COVID-19 and cardiovascular disease: From basic mechanisms to clinical perspectives. Nat. Rev. Cardiol. 2020, 17, 543–558. [Google Scholar] [CrossRef]
  83. Mistry, S.; Wang, L. Efficient Prediction of Heart Disease Using Cross Machine Learning Techniques. In Proceedings of the IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC), Dalian, China, 14–16 April 2022. [Google Scholar]
  84. Shima, D.; Ii, Y.; Yamamoto, Y.; Nagayasu, S.; Ikeda, Y.; Fujimoto, Y. A retrospective, cross-sectional study of real-world values of cardiovascular risk factors using a healthcare database in Japan. BMC Cardiovasc. Disord. 2014, 14, 120. [Google Scholar] [CrossRef] [Green Version]
Figure 1. NARX network-based fault detection scheme.
Figure 2. The NARX neural network model.
Figure 3. Performance of the neural network: (a) input error correlation, (b) error autocorrelation, (c) correlation plot response, and (d) error histogram.
Figure 4. The correlation of target and output value.
Figure 5. Differentiating survived and expired COVID-19 patients with corresponding: (a) congestive heart disease symptoms, (b) shortness of breath symptoms, (c) acute kidney injury, and (d) acute cardiac injury.
Figure 6. Differentiating survived and expired COVID-19 patients: (a) ICU admission, (b) chronic obstructive lung disease, (c) chronic kidney disease, (d) mortality rate.
Table 1. The effectiveness of the network using different learning algorithms and hidden layer neurons.

Algorithm | Number of Neurons in the Hidden Layer | Test Performance (MSE)
Levenberg–Marquardt | 5 | 6.248 × 10^−2
Levenberg–Marquardt | 10 | 7.247 × 10^−2
Bayesian Regularization | 5 | 7.687 × 10^−2
Bayesian Regularization | 10 | 8.667 × 10^−2
Scaled Conjugate Gradient | 5 | 2.471 × 10^−2
Scaled Conjugate Gradient | 10 | 2.891 × 10^−2
Table 2. Case processing summary.

Case | N | Marginal Percentage
Mortality: Survivor | 351 | 61.6%
Mortality: Expired | 219 | 38.4%
Congestive Heart Disease: CHF− | 498 | 87.4%
Congestive Heart Disease: CHF+ | 72 | 12.6%
Chronic Obstructive Lung Disease: COPD(−) | 525 | 92.1%
Chronic Obstructive Lung Disease: COPD(+) | 45 | 7.9%
Asthma: Asthma− | 523 | 91.8%
Asthma: Asthma+ | 47 | 8.2%
Smoking History: SmokeHx− | 489 | 85.8%
Smoking History: SmokeHx+ | 81 | 14.2%
Acute cardiac injury: MyocardioInj− | 460 | 80.7%
Acute cardiac injury: MyocardioInj+ | 110 | 19.3%
Acute Kidney Injury: AKI(−) | 356 | 62.5%
Acute Kidney Injury: AKI(+) | 214 | 37.5%
Valid | 570 | 100.0%
Missing | 2 | -
Total | 572 | -
Subpopulation | 42 a | -
a. In 17 (40.5%) of the subpopulations, the dependent variable had just one value.
Table 3. Model fitting details.

Model | Model Fitting Criteria (−2 Log Likelihood) | Likelihood Ratio Tests (Chi-Square) | df | Sig.
Intercept Only | 184.687 | - | - | -
Final | 107.986 | 76.701 | 6 | 0.000
Table 4. Likelihood ratio tests.

Effect | −2 Log Likelihood of Reduced Model | Chi-Square | df | Sig.
Intercept | 107.986 a | 0.000 | 0 | -
CHF | 109.536 | 1.551 | 1 | 0.213
COPD | 110.377 | 2.391 | 1 | 0.122
ASTHMA | 107.986 | 0.000 | 1 | 0.988
Smoking | 108.453 | 0.467 | 1 | 0.494
Acute cardiac injury | 131.434 | 23.449 | 1 | 0.000
AKI | 146.448 | 38.463 | 1 | 0.000
a indicates intercept values.
Table 5. Parameter estimates.

Mortality a (Survivor) | B | Std. Error | Wald | df | Sig. | Exp(B) | 95% CI for Exp(B): Lower Bound | Upper Bound
Intercept | −1.131 | 0.525 | 4.649 | 1 | 0.031 | - | - | -
[CHF = 0.00] | −0.366 | 0.298 | 1.515 | 1 | 0.218 | 0.693 | 0.387 | 1.242
[CHF = 1.00] | 0 b | - | - | 0 | - | - | - | -
[COPD = 0.00] | 0.549 | 0.355 | 2.391 | 1 | 0.122 | 1.732 | 0.863 | 3.473
[COPD = 1.00] | 0 b | - | - | 0 | - | - | - | -
[ASTHMA = 0.00] | 0.005 | 0.338 | 0.000 | 1 | 0.988 | 1.005 | 0.518 | 1.949
[ASTHMA = 1.00] | 0 b | - | - | 0 | - | - | - | -
[Smoking = 0.00] | −0.195 | 0.286 | 0.463 | 1 | 0.496 | 0.823 | 0.470 | 1.442
[Smoking = 1.00] | 0 b | - | - | 0 | - | - | - | -
[Acute cardiac injury = 0.00] | 1.124 | 0.236 | 22.720 | 1 | 0.000 | 3.077 | 1.938 | 4.884
[Acute cardiac injury = 1.00] | 0 b | - | - | 0 | - | - | - | -
[AKI = 0.00] | 1.152 | 0.188 | 37.693 | 1 | 0.000 | 3.164 | 2.190 | 4.569
[AKI = 1.00] | 0 b | - | - | 0 | - | - | - | -
There are differences in intercept values for different symptoms given by a and b.
Table 6. Initial cluster centers.

Case | Cluster 1 | Cluster 2
Shortness of Breath | 0.00 | 1.00
Chronic Obstructive Lung Disease | 1.00 | 0.00
Asthma | 0.00 | 1.00
Smoking History | 1.00 | 0.00
Acute cardiac injury | 0.00 | 1.00
Ventilation | 0.00 | 1.00
ICU Admission | 0.00 | 1.00
Clinical status on last day (5/12) | 0.00 | 5.00
Table 7. Final cluster centers.

Case | Cluster 1 | Cluster 2
Shortness of Breath | 0.63 | 0.80
Chronic Obstructive Lung Disease | 0.00 | 0.00
Asthma | 0.08 | 0.11
Smoking History | 0.14 | 0.11
Acute cardiac injury | 0.00 | 0.00
Ventilation | 0.22 | 0.48
ICU Admission | 0.17 | 0.50
Clinical status on last day | 0.42 | 5.00
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
