4. Data Classification
We used a dataset collected under the Health Insurance Portability and Accountability Act (HIPAA), a USA law designed to set privacy standards that guard patients' medical records and other health information given to health plans, doctors, hospitals, and other health care providers [47]. The dataset consists of over 1600 recorded cases of data breaches from October 2009 to November 2017, specifying the location of the breach, the name of the covered entity (CE), the state the entity is located in, the number of individuals affected, the date of submission of the breach, the type of the breach, whether a business associate was present, and a description of the breach. To stay within the objective of predicting how human factors can lead to data breach incidents at the data locations of an organization, only a selected number of parameters are considered: the date of submission of the breach, the data location breached, and the description. The descriptive parameter narrates what led to the breach. Some records had missing values in all columns except the year (date of submission of the breach); such records were removed and not considered in this study. To clean the data in a way that supports quantitative analysis, the descriptive column, which is in string format, was examined record by record, case by case; where it indicated human factors, such that the underlying cause of the breach was directly due to human error or behavior, a score of 1 was assigned, otherwise 0. For example, a breach on a desktop in 2009 has the description:
“The covered entity (CE) changed the business associate (BA) it used as its information technology vendor. During the transition, a workforce member of the outgoing BA entered the CE’s computer system, changed the passwords, disabled all accounts, and removed drive mappings on the computer server for all of the workstations. The BA also removed the CE’s backup program and deactivated all of its antivirus software. The breach affected approximately 2,000 individuals. The protected health information (PHI) involved in the breach included patients’ names, addresses, dates of birth, social security numbers, appointments, insurance information, and dental records. The CE provided breach notification to affected individuals, HHS, and the media. Following the breach, the CE implemented security measures in its computer system to ensure that its information technology associates do not have access to the CE’s master system and enabled direct controls for the CE. A new server was installed with no ties to the previous BA. The new BA corrected the CE’s passwords and settings, mitigating the issues caused by the previous vendor. The CE provided OCR with copies of its HIPAA security and privacy policies and procedures, and its signed BA agreements that included the appropriate HIPAA assurances required by the Security Rule. As a result of OCR’s investigation, the CE improved its physical safeguards and retrained employees.”
The events that preceded the breach and the Office for Civil Rights (OCR) investigation indicate that the breach was aided by a human factor problem, so a score of 1 was assigned to the desktop computer location for the year 2009. This process was performed for each recorded breach. The data were then extracted according to the data breach location, the year the breach happened, and the number of human factors associated with it for that particular year. We acknowledge that undetected and unreported data breach incidents may be significant to the findings of this study, but we are confident that the reported data breach cases typify data breach incidents in general.
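As an illustration of this labeling and aggregation step, the sketch below shows how such binary scores could be assigned and tallied per location and year. The file name, column names, and keyword matcher are assumptions for illustration only; in the study itself each description was read and scored manually, case by case.

```python
import pandas as pd

# Hypothetical indicator terms; the study scored each narrative manually,
# so this keyword matcher is only an illustrative stand-in.
HUMAN_FACTOR_TERMS = ["error", "mistake", "lost", "misdirected", "disposed"]

def score_human_factor(description: str) -> int:
    """Return 1 if the narrative suggests a human-factor cause, else 0."""
    text = description.lower()
    return int(any(term in text for term in HUMAN_FACTOR_TERMS))

# Assumed file and column names for the breach-report export.
breaches = pd.read_csv("hhs_breaches.csv")

# Drop records that are empty in every column except the submission year.
breaches = breaches.dropna(subset=["location", "description"])

breaches["hf"] = breaches["description"].apply(score_human_factor)

# Human-factor breach counts per data location per year.
hf_counts = (
    breaches.groupby(["location", "year"])["hf"].sum().unstack(fill_value=0)
)
```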
The experiment in this study has two major parts.
First, an analysis of variance (ANOVA) for linear regression is used for the analysis of the study, and we employed Pearson's r, which measures the linear relationship between two continuous variables. The regression line used is $\hat{y}_i = b_0 + b_1 x_i$, so that the deviation of each observation from the mean decomposes as:

$$ y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i) \qquad (1) $$

where the left-hand side is the total variation of the dependent variable $y_i$ about its mean, the first term on the right is the variation of the fitted value about the mean, and the second is the residual. We now square each of the given terms in Equation (1) and add them over all the observations $n$, which gives the equation

$$ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2) $$
Equation (2) can be rewritten as $\mathrm{SST} = \mathrm{SSM} + \mathrm{SSE}$, where SST is the notation for the total sums of squares, SSE the error sums of squares, and SSM the model sums of squares. The ratio of the model sums of squares to the total sums of squares gives the coefficient of determination, $R^2 = \mathrm{SSM}/\mathrm{SST}$, which explains the fraction of the variability in the data that is explained by the regression model. The variance of $y$ is given by:

$$ \mathrm{MST} = \frac{\mathrm{SST}}{\mathrm{DFT}} \qquad (3) $$

where DFT $= n - 1$ is the total degrees of freedom. The mean square model is

$$ \mathrm{MSM} = \frac{\mathrm{SSM}}{\mathrm{DFM}} \qquad (4) $$

where DFM is the model degrees of freedom. In Equation (4), DFM $= 1$ because the regression model has one explanatory variable $x$. The corresponding mean square error (MSE), the estimate of the variance about the population regression line, is

$$ \mathrm{MSE} = \frac{\mathrm{SSE}}{\mathrm{DFE}} \qquad (5) $$

where DFE $= n - 2$ is the error degrees of freedom.
The ANOVA calculations for the regression are shown in Table 1.
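A minimal sketch of these ANOVA quantities for a simple regression of one breach-location series y on the human-factor counts x (variable names are illustrative):

```python
import numpy as np

def regression_anova(x: np.ndarray, y: np.ndarray) -> dict:
    """ANOVA quantities for the simple linear regression of y on x."""
    n = len(y)
    slope, intercept = np.polyfit(x, y, 1)   # least-squares fit
    y_hat = intercept + slope * x

    sst = np.sum((y - y.mean()) ** 2)        # total sums of squares (SST)
    ssm = np.sum((y_hat - y.mean()) ** 2)    # model sums of squares (SSM)
    sse = np.sum((y - y_hat) ** 2)           # error sums of squares (SSE)

    dfm, dfe = 1, n - 2                      # one explanatory variable
    msm, mse = ssm / dfm, sse / dfe          # Equations (4) and (5)
    return {"R2": ssm / sst, "F": msm / mse, "df": (dfm, dfe)}
```

The F statistic MSM/MSE with (DFM, DFE) degrees of freedom is the quantity the Table 1 computations report.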
Equation (6) is used to compute the correlation matrix of all the dependent variables. It is a Pearson correlation matrix between the variables $x$ and $y$:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (6) $$
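Equation (6) translates directly into code; for the full matrix over all variables at once, a library routine gives the same result (the hf_counts frame from the earlier sketch is assumed):

```python
import numpy as np

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation coefficient, Equation (6)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

# Full Pearson correlation matrix across all locations' yearly series,
# assuming hf_counts has locations as rows and years as columns:
# corr_matrix = hf_counts.T.corr(method="pearson")
```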
Next, we ranked the data locations from most to least susceptible in the event of a breach due to human factors, using collaborative filtering to perform the ranking. We first determined the data locations similar to a target data location $L$, then calculated the number of breaches $B$ that $L$ recorded for a certain year $Y$. The ranking $R$ for data location $L$ is close to the average of the rankings given to the similar data locations. The average ranking given by $n$ data locations is

$$ R_L = \frac{\sum_{i=1}^{n} R_i}{n} \qquad (7) $$

that is, the sum of the rankings given by the $n$ data locations, divided by the number of data locations $n$; a sketch of this baseline follows.
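Equation (7) is the plain, unweighted mean of the neighbours' rankings; Equation (9) below refines it with similarity weights:

```python
import numpy as np

def average_ranking(rankings: np.ndarray) -> float:
    """Equation (7): unweighted mean of the rankings of n similar locations."""
    return float(np.sum(rankings) / len(rankings))
```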
The next step is to find the similarity of the data locations using angles. We use a computation that returns a higher similarity (smaller distance) for a lower angle and a lower similarity (larger distance) for a higher angle, as illustrated in Equation (8). The cosine of an angle decreases from 1 to −1 as the angle increases from 0° to 180°, and it is used to find the similarity between two data locations represented as vectors $A$ and $B$:

$$ \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} \qquad (8) $$

The higher the angle, the lower the cosine and hence the lower the similarity of the data locations. The cosine distance between the data locations is obtained by subtracting the cosine similarity from 1.
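A sketch of Equation (8) and the derived distance, treating each data location as a vector of yearly breach counts (an assumed representation):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Equation (8): cosine of the angle between two location vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance is the complement of the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)
```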
To obtain the final ranking, a weighted average approach is used, multiplying each ranking by a similarity factor. In this way, weights are added to the rankings: the heavier the weight, the more the ranking matters. The similarity factor, which serves as the weight, should be the inverse of the distance explained above, because less distance implies higher similarity; for example, the cosine distance can be subtracted from 1 to recover the cosine similarity. Using the similarity factor $S_i$ for each data location $i$ similar to the target data location $L$, we calculate the weighted average using this formula:

$$ R_L = \frac{\sum_{i=1}^{n} S_i R_i}{\sum_{i=1}^{n} S_i} \qquad (9) $$

In Equation (9), every ranking is multiplied by the similarity factor of the data location that was breached. The final predicted ranking for data location $L$ is equal to the sum of the weighted rankings divided by the sum of the weights.
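A sketch of the final weighted prediction in Equation (9), reusing the cosine similarity above as the weight S:

```python
import numpy as np

def predicted_ranking(rankings: np.ndarray, similarities: np.ndarray) -> float:
    """Equation (9): similarity-weighted average of the neighbours' rankings."""
    return float(np.sum(similarities * rankings) / np.sum(similarities))

# Example with three hypothetical neighbours:
# predicted_ranking(np.array([1.0, 3.0, 2.0]), np.array([0.9, 0.4, 0.7]))
```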
We then evaluated the accuracy of the predicted rankings using the root mean square error (RMSE), computed as the square root of the mean of the squared differences between the true and the predicted values:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(R_i - \hat{R}_i\right)^2} \qquad (10) $$

where $R_i$ is the true rank in the $i$th year and $\hat{R}_i$ is the predicted rank. RMSE values greater than or equal to 0.5 reflect a poor ability of a model to accurately predict the data [48].
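Equation (10) in code, with the 0.5 threshold from [48] as a simple check:

```python
import numpy as np

def rmse(true_ranks: np.ndarray, predicted_ranks: np.ndarray) -> float:
    """Equation (10): root mean square error of the predicted ranks."""
    return float(np.sqrt(np.mean((true_ranks - predicted_ranks) ** 2)))

def acceptable(error: float) -> bool:
    """RMSE values of 0.5 or more indicate poor predictive ability [48]."""
    return error < 0.5
```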
6. Results
The results in Table 1 are linear regression computations between HF (the predictor) and the different dependent variables. HF statistically significantly predicts the dependent variable NS, with F(1, 7) = 42.492 and a probability of observing a value greater than or equal to this F statistic of less than 0.01. DC is likewise statistically significantly predicted by HF, with F(1, 7) = 6.059, p < 0.05. The dependent variable LAP is statistically significantly predicted by HF with F(1, 7) = 6.145, p < 0.05, and OPD with F(1, 7) = 5.757, p < 0.05. EMR and EMAIL, both non-hardware locations, are statistically significantly predicted by HF, with F(1, 7) = 13.705 and F(1, 7) = 15.474, respectively, and p < 0.01 for both. Last but not least, the dependent variable OTHERS is statistically significantly predicted by HF, with F(1, 7) = 8.079, p < 0.05.
The proportion of the variation of the dependent variables explained by the independent variable is shown in Table 2, which reports the R² of HF for each of NS, DC, LAP, OPD, EMR, EMAIL, PF and OTHERS. While these results imply that non-human factors account for the remaining share of the explained variability in each of these locations, an empirical study would be required to better understand how they affect breached locations. The results in Table 2 establish that breached locations are strongly influenced by human factors.
Table 3 provides the analysis of the linear regression between each of the dependent variables and HF; the fitted regression equations and their statistics are reported in the table. The predictions show that an increase or change in human factors projects a mean change of 0.286 in network server breaches, of 0.139 in desktop computers, and of 0.275 in laptops. They also show mean changes of 0.129, 0.111 and 0.187 in other portable devices, electronic medical records and emails, respectively, for increases in human factors. Lastly, an increase or change in human factors projects mean changes of 0.385 and 0.283 in paper-films and other locations, respectively.
Computation of the Pearson correlation coefficients in Table 4 indicates the strength of the relationship between HF and the dependent variables when a location is breached. There is a very strong positive correlation between HF and NS and between HF and PF, with p significant at 0.000 for both. HF also has a strong positive correlation with EMR and with EMAIL, with p significant at 0.008 and 0.006, respectively. A moderate positive correlation exists between HF and each of DC, LAP, OPD and OTHERS, with p significant at 0.043, 0.042, 0.048 and 0.025, in that order; the coefficients themselves are given in Table 4.
The correlation matrix in Table 4 also shows how close some of the data locations are. For example, the network server correlates well with all the electronic data locations. This can especially be seen with the two non-hardware electronic data locations (EMR and EMAIL), which have high degrees of correlation with network servers (NS). Network servers provide multiple resources to workstations and other servers on the network; the shared resources can be hardware, such as disk space or hardware access, and application access (i.e., email services).
Table 5 shows the ranking results of the most susceptible data locations in a data breach incident as a result of human factors, using the collaborative filtering algorithm. The dataset extracted only included breaches on data locations that had human factor problems; the result might differ if non-human-factor breaches were added or analyzed separately. The ranking shows laptops to be the most susceptible data location and electronic medical records the least. The ranking of network servers is quite intriguing. Network servers are mostly managed by IT professionals, whom we assume to be well positioned, in terms of knowledge, not to compromise the security of the system, especially as a result of human factors; yet network servers rank second. Although an empirical study may be needed to ascertain why network servers rank this high, we believe it is reasonable to conclude from these results that the other data locations have an indirect effect on a network server being breached due to human factors. The results also show that human factors affect data locations differently when it comes to data breach incidents.
The root mean square error (RMSE) was used to evaluate the accuracy of the ranking results, as illustrated in Figure 2. The differences between the true rankings and the predicted rankings range from 0.22 to 0.39, indicating that the ranking obtained from the collaborative filtering has a high degree of accuracy and is therefore reliable.