A Study on Machine Vision Techniques for the Inspection of Health Personnel's Protective Suits for the Treatment of Patients in Extreme Isolation

Abstract: The examination of Personal Protective Equipment (PPE) to assure the complete integrity of health personnel in contact with infected patients is one of the most necessary tasks when treating patients affected by infectious diseases, such as Ebola. This work focuses on the study of machine vision techniques for the detection of possible defects on the PPE that could arise after contact with such patients. A preliminary study on the use of image classification algorithms to identify blood stains on PPE after the treatment of an infected patient is presented. To produce training data for these algorithms, a synthetic dataset was generated from a simulated model of a PPE suit with blood stains. The study then proceeded with images of the PPE carrying a physical emulation of blood stains, taken by a real prototype. The datasets exhibit a great imbalance between positive and negative samples; therefore, classification algorithms able to manage this kind of data were selected. Classifiers range from Logistic Regression and Support Vector Machines to bagging and boosting techniques such as Random Forest, Adaptive Boosting, Gradient Boosting and eXtreme Gradient Boosting. All these algorithms were evaluated on accuracy, precision, recall and F1 score; additionally, execution times were considered. The obtained results are promising for all the classifiers; in particular, Logistic Regression proved to be the most suitable classification algorithm in terms of F1 score and execution time on both datasets.


Introduction
Highly infectious diseases are treated by very strict procedures following the advice of the World Health Organization (WHO) [1]. In the case of the Ebola virus, which has an approximately 90% fatality rate, the protocol for both the patient and the medical staff is even more severe [2]. These security measures must guarantee adequate protection of the worker and of everyone else susceptible to direct or indirect contact with the patient and/or the worker. Several analyses have been performed to certify the efficiency of Personal Protective Equipment (PPE) in preventing contamination from the infected patient [3], for example by utilizing fluorescent markers [4,5]. Furthermore, Kang et al. [6] reported how important it is to know and adopt standard procedures, demonstrating that frequent contaminations are related to lack of knowledge and carelessness of the health personnel. Other factors that could affect the performance of the PPE are a prolonged time wearing the equipment [7] or physical constraints such as the dimensions of the changing room [8]. One of the most delicate tasks is the removal of the PPE in the changing room after visiting the patient [9]. The PPE used for treating this category of patients is composed of: (a) a protective suit that covers the entire body except face, hands, and feet; (b) an FFP3 mask [10]; (c) waterproof glasses; (d) a head cover; (e) three pairs of gloves; (f) two pairs of boot covers; (g) a facial screen and body apron; and (h) areas of layer overlap (such as glove and suit on the forearm, and boot cover and suit on the legs) that are sealed with wide insulating adhesive tape.
The removal task involves several steps. During this activity, the actuation protocol states [11]: "We will always act slowly, calmly, being aware of our body and proceeding with slow but precise movements. Even feeling that we are accustomed to this activity, we will never stop listening and attending to the indications of the instructor-observer, which will indicate to the personnel the sequence for the removal of the PPE". The first pair of gloves, as well as the first pair of boot covers, are discarded in the patient's room. Gloves are always disposed of, while other components are stored for their posterior sterilization and reuse. Once in the changing room, the steps that must be followed are, in order: (1) remove the second pair of boot covers; (2) remove the second pair of gloves; (3) open the front closure of the suit; (4) remove the suit head cover, remove the suit arms and legs, remove it completely and roll it up for storage; (5) remove the third pair of gloves; (6) perform a first hand hygiene; (7) put on a new pair of gloves; (8) remove the waterproof glasses; (9) remove the FFP3 mask; and (10) remove the new pair of gloves [11]. Figure 1 illustrates the removal of the second pair of gloves once inside the changing room. After contact with the infected patient, the suit may have deteriorated and/or become contaminated with traces of blood, vomit, urine and, in general, various fluids. This can lead to worker contamination; therefore, a safe inspection of the protective equipment is necessary before proceeding with the removal task. Anomalies of the suit can be detected either by the health worker or by the instructor-observer (outside the changing room). However, visual inspection performed by humans is subjective, and may be affected by the degree of expertise as well as by circumstances such as distraction caused by fatigue. For this reason, an objective solution is necessary to validate the integrity of the PPE used for the health care of patients with Ebola and other highly infectious diseases.
The goal of this work was to study machine vision algorithms for a real prototype that ensures the PPE protective suit is neither broken nor contaminated with fluids. The system consists of a robotic structure that displaces a camera in Cartesian coordinates, providing different levels of zoom and effectively acting as a full-body scanner. This camera provides monocular images that are analyzed by computer vision algorithms to detect undesired traits on the suit, such as blood stains from interaction with the patient. We studied the accuracy, precision, recall, and F1 score of the set of computer vision algorithms used for this system.

Materials and Methods
The problem can be addressed through machine vision classification algorithms, where the purpose is to decide whether a particular image taken by the camera shows a suit with or without certain traits, such as blood stains. An assumption made is that the number of clean areas will significantly outnumber those presenting blood stains, which in classification problems is known as class imbalance [12]. Previous studies analyze different approaches to treat unbalanced datasets in several real-world conditions, such as medical diagnosis [13], customer churn prediction [14], fraud detection in banking operations, creditworthiness of a bank's customers [15], detection of oil spills [16], and data mining [17]. Common techniques can be categorized into three groups: data pre-processing, algorithmic approaches, and cost-sensitive learning [18]. Data pre-processing involves adjustments to obtain a more balanced distribution, by over-sampling the minority class and/or under-sampling the majority class [19][20][21]. Algorithmic approaches involve the development of algorithms, such as classification algorithms, ensemble techniques, decision trees, and neural networks, that take the class imbalance into account [22][23][24]. Cost-sensitive learning combines both of the aforementioned techniques, considering different types of costs [25,26]. Several classification algorithms, as well as performance metrics that can be used in the presence of class imbalance, are explained in this section.
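As a minimal illustration of the data pre-processing group of techniques, the sketch below applies random over-sampling of the minority class on toy data. The data, class sizes, and variable names are hypothetical and only stand in for the clean/stained pixel samples of this work; numpy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the pixel data: 990 "clean" samples (class 0)
# against 10 "blood stain" samples (class 1).
X = rng.normal(size=(1000, 3))
y = np.array([0] * 990 + [1] * 10)

# Random over-sampling: duplicate minority samples until the classes match.
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
keep = np.concatenate([majority, minority, extra])

X_bal, y_bal = X[keep], y[keep]
print((y_bal == 0).sum(), (y_bal == 1).sum())  # 990 990
```

Under-sampling would instead discard majority samples; both only change the training distribution, not the underlying classifier.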

Classification Algorithms
A preliminary study was performed to identify machine vision classification algorithms suitable for blood stain detection. In 2002, Melody et al. [27] studied the performance of different classifiers, among which Logistic Regression and Neural Networks presented higher accuracy with respect to others such as Multivariate Discriminant Analysis, Decision Trees, and k-Nearest Neighbors. Brown and Mues [28] based their comparison on an imbalanced credit scoring dataset, from which it emerges that Decision Trees, k-Nearest Neighbors and Quadratic Discriminant Analysis are not appropriate in the case of strong imbalance, whereas Random Forests and Gradient Boosting reported better outcomes. Regarding decision making in the clinical field, Support Vector Machines (SVM) showed greater diagnostic accuracy compared to Multilayer Perceptron Neural Networks, Combined Neural Networks, Mixture of Experts, Probabilistic Neural Networks, and Recurrent Neural Networks [29]. More recent research in the field of agricultural environments indicated a greater performance of Support Vector Machines and Random Forests; moreover, Adaptive Boosting (AdaBoost) exhibited good generalization capability, specifically in the case of large sample sizes [30]. Li et al. [31] compared 15 classification algorithms and determined that most of the supervised algorithms could achieve high accuracies given adequate parameters and appropriate training samples. Since most standard classification algorithms assume a balanced training dataset [32], to avoid as much as possible the misclassification of the minority class (images with blood stains), it is necessary to adjust the model used by the classifier. Based on previous research, our study focused on six different classifiers.

Logistic Regression
Logistic Regression (Logit) is one of the most used classification algorithms in machine learning. It is very similar to Linear Regression, with the difference that Linear Regression is used for regression rather than for classification; thus, their loss functions are typically different. The logistic function, also called sigmoid, is an S-shaped curve that can take any real-valued number and map it to a value between 0 and 1, but never exactly at those limits [33]. Hence, a Logistic Regression model determines the probability that the input variables belong to one of the two classes:

P(y = 1 | x) = 1 / (1 + e^(−α^T x)),

where x are the input variables and α is the parameter vector. Logit is a very efficient and widely used method, due to its low computational complexity and minimal risk of overfitting [34].
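The sigmoid and the resulting class probability can be sketched in a few lines; this is a plain restatement of the formula above, not the paper's implementation, and the example inputs are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, alpha):
    # P(y = 1 | x) for input variables x and parameter vector alpha.
    return sigmoid(np.dot(x, alpha))

print(sigmoid(0.0))  # 0.5, the decision boundary between the two classes
print(predict_proba(np.array([1.0, 2.0]), np.array([0.5, -0.25])))  # 0.5
```

Note that the output approaches, but never reaches, 0 or 1, exactly as stated for the S-shaped curve.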

Support Vector Machine
Support Vector Machines (SVM) are supervised learning algorithms that can be utilized both for discriminative classification and for regression problems. They are based on the definition of an optimal hyperplane (Figure 2). For classification, this hyperplane is identified by the maximum margin between the vectors of the two classes [35]. The margin of the optimal hyperplane is defined considering only a small subset of the training data, called support vectors [35]. The SVM classifier works very well with a clear margin of separation and in high-dimensional spaces, but also exhibits prolonged execution times on large datasets [37]. Furthermore, this classifier has been applied in several areas such as object detection [38], digital handwriting recognition [39], and text categorization [40,41].
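A minimal sketch of an SVM on imbalanced toy data follows, using scikit-learn as an assumed implementation (the paper does not name its library); the clusters, sizes and seed are hypothetical stand-ins for clean/stained feature vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated clusters: 95 "clean" samples vs 5 rare "stained" ones.
X = np.vstack([rng.normal(-2.0, 0.3, size=(95, 2)),
               rng.normal(2.0, 0.3, size=(5, 2))])
y = np.array([0] * 95 + [1] * 5)

# class_weight="balanced" raises the penalty on the rare positive class.
clf = SVC(kernel="linear", class_weight="balanced").fit(X, y)
print(clf.predict([[2.0, 2.0], [-2.0, -2.0]]))  # [1 0]
```

Only the support vectors (available as `clf.support_vectors_`) define the separating hyperplane, which is what keeps the decision function compact.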

Random Forest
Random Forest (RF) is a decision tree classifier belonging to the ensemble algorithms of supervised learning. This ensemble technique creates a set of decision trees from randomly selected subsets of the training data [42]. The selected training data are the only ones used to find the best split for a node; in this way, each split is based on the best feature among a random subset of features [43]. This kind of approach is called bagging, and it involves training the different models in parallel. Moreover, considering the significant imbalance of our classes, it can be necessary to additionally adjust the class weight parameter to increase the weight of the minority class and obtain an equal effective contribution of the two classes. The advantages of the Random Forest algorithm are that it can handle high-dimensional data, grow large numbers of trees, and avoid overfitting [44,45].
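The class weight adjustment described above can be sketched as follows, again assuming scikit-learn and hypothetical toy data in place of the PPE pixel features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.3, size=(95, 2)),
               rng.normal(2.0, 0.3, size=(5, 2))])
y = np.array([0] * 95 + [1] * 5)

# class_weight="balanced" weights classes inversely to their frequencies,
# so the 5 stained samples count as much as the 95 clean ones.
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            random_state=0).fit(X, y)
print(rf.predict([[2.0, 2.0]]))  # [1]
```

Each of the 100 trees is trained on a bootstrap sample in parallel, which is the bagging behavior described above.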

Adaptive Boosting
Adaptive Boosting (AdaBoost) is one of the most efficient and widely used classifiers; moreover, it was the first successful boosting algorithm developed for binary classification, introduced by Freund and Schapire [46]. The boosting technique involves the formation of a strong classifier from several weak classifiers, but here the weak classifiers are created sequentially and not "in parallel" as in the bagging methods. Furthermore, AdaBoost adjusts the weight of the instances at each iteration, giving more weight to the misclassified samples and less weight to the correctly classified ones. The number of weak learners can be regulated to obtain the best performance, and usually the higher the number of estimators, the better the results. The order of magnitude of the number of weak learners used is typically between tens and hundreds of estimators. Based on previous studies, this classifier shows very good generalization (the ability to classify new data), although some literature indicates it cannot prevent overfitting when dealing with very noisy data [47].
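A hedged sketch of the sequential re-weighting scheme, using scikit-learn's AdaBoost over hypothetical toy data; the 30 estimators match the order of magnitude mentioned above.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 0.3, size=(95, 2)),
               rng.normal(2.0, 0.3, size=(5, 2))])
y = np.array([0] * 95 + [1] * 5)

# Weak learners (decision stumps by default) are trained one after another;
# samples misclassified by earlier stumps gain weight in later iterations.
ada = AdaBoostClassifier(n_estimators=30, random_state=0).fit(X, y)
print(ada.predict([[2.0, 2.0]]))  # [1]
```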

Gradient Boosting
A different boosting approach is called Gradient Boosting (GB) [48], which over the last few years has attracted great interest. This learning procedure progressively fits new models to define the strong classifier [49]. Each new model gradually minimizes the loss function; in this way, the initial base learner is grown and every tree in the series is fit to the pseudo-residuals of the prediction from the earlier trees [28]. The resulting model is shown below:

G_n(x) = G_0(x) + β_1 T_1(x) + ... + β_n T_n(x),

where G_0 is the first value of the series, T_1, ..., T_n are the trees fit to the pseudo-residuals, and β_1, ..., β_n are coefficients for the tree nodes computed by the algorithm [28]. The parameter that can be set is the number of estimators, i.e., the number of boosting stages to execute. The order of magnitude of the number of estimators used is also typically between tens and hundreds. The GB classifier allows fast execution times and high accuracy [50]. The most common problem with these estimators is overfitting, which could arise depending on the choice of the weak learners or if their number reaches large values [51].
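The staged additive model above can be sketched with scikit-learn's implementation (an assumption, as before, over hypothetical toy data); each of the `n_estimators` stages adds one tree fit to the pseudo-residuals of the current prediction.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2.0, 0.3, size=(95, 2)),
               rng.normal(2.0, 0.3, size=(5, 2))])
y = np.array([0] * 95 + [1] * 5)

# n_estimators = number of boosting stages; each stage fits a tree to the
# pseudo-residuals of the model built so far, shrinking the loss step by step.
gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)
print(gb.predict([[2.0, 2.0]]))  # [1]
```

`gb.staged_predict(X)` exposes the intermediate models G_1, ..., G_n if one wants to watch the series converge.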

eXtreme Gradient Boosting
The last classifier of this study is eXtreme Gradient Boosting (XGB), a variation of the above-mentioned Gradient Boosting [52]. It involves a parallel tree boosting procedure to solve the classification problem and adds a few levels of regularization to prevent overfitting [53]. Likewise, the order of magnitude of the number of estimators used with this classifier is typically between tens and hundreds. XGB proves to be highly efficient, flexible, simple and very fast, due to a parallel implementation over the features [53].

Performance Metrics Analyzed
All the aforementioned classification algorithms were evaluated on four performance metrics: accuracy, precision, recall and F1 score. These metrics can be obtained from the parameters reported in a confusion matrix (Figure 3). Accuracy is defined as the percentage of correctly classified positive and negative samples over the total observations [54]:

Accuracy = (TP + TN) / (TP + TN + FP + FN).

The second metric is called precision, and it is determined as the number of true positive predictions divided by the total number of samples predicted as positive [55]:

Precision = TP / (TP + FP).

Recall, instead, is used to measure the fraction of positive samples that are correctly classified, thus true positives divided by true positives plus false negatives [56]:

Recall = TP / (TP + FN).

The last metric is known as the F1 score, which is the harmonic mean of precision and recall [54]:

F1 = 2 · (Precision · Recall) / (Precision + Recall).

These metrics are provided because they jointly give more insight on the performance of the classifiers. Specifically, accuracy may provide misleading results in the case of unbalanced data (e.g., a classifier that treats all samples as clean would report a high accuracy) [57], but its values can be better interpreted in the context of the other metrics.
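The four formulas above can be computed directly from confusion matrix counts; the counts in this sketch are hypothetical, chosen only to show how accuracy can look excellent on imbalanced data while recall and F1 stay modest.

```python
def metrics(tp, tn, fp, fn):
    # The four evaluation metrics, computed from confusion matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical imbalanced run: 993 of 1000 samples classified correctly,
# yet only 8 of the 13 actual stains were found.
acc, pr, re, f1 = metrics(tp=8, tn=985, fp=2, fn=5)
print(round(acc, 3), round(pr, 3), round(re, 3), round(f1, 3))
# 0.993 0.8 0.615 0.696
```

Accuracy reads 99.3% while F1 is under 70%, which is exactly why F1 is the metric emphasized in this work.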

Experimental Setup
The study involved an initial analysis on a synthetic image dataset, followed by an additional study on an image dataset of the PPE with a physical emulation of blood stains, taken by a real prototype. Figure 4 illustrates the simulated PPE suit and the physical PPE that were used for our investigation. The datasets have been openly published as a specific complement of this work [58].

Synthetic Dataset
The synthetic dataset was generated in a simulated environment that includes the protective suit with blood stains. The simulated environment additionally allows different points of illumination and zoom levels, as shown in Figure 5.
This dataset includes 30 images of 960 × 480 pixels, taken with a fixed simulated focal length of 4.8 mm. These images represent the whole protective suit subdivided into six different areas (Figure 6): chest, abdomen, pelvis, thighs, lower legs and ankles. Five different synthetic samples were generated for each of the six areas, each with a different point of illumination. The total time required for rendering these 30 simulated images was 15.75 s. To represent the data distribution of the physically emulated dataset, the synthetic dataset is also affected by a strong imbalance between the two class samples: the stained areas represent only 0.0006% of the total of 13,824,000 pixels, while the remainder are clean areas.

Physical Emulated Dataset
The acquisition of images with a physical emulation of blood stains was performed on a real PPE suit, using a proprietary blood simulant and the actual full-body scanner mechanism of Figure 4. Images were obtained utilizing the system's single monocular camera, surrounded by four high-intensity LED lamps, at different camera focal lengths from 4.8 mm (0% zoom) to 57.6 mm (100% zoom) (Figure 7). The total time required for the acquisition was 19.23 min, dominated by camera movement times, such that the actual camera acquisition times are negligible. This second dataset is composed of a larger number of images at higher resolution and with a larger proportion of stains: 47 images of 1280 × 960 pixels, where stained areas represent 2.25% of the total of 57,753,600 pixels. The high-intensity LED illumination technology provides information in a wide spectrum, ranging from UV to IR; in particular: (a) white light is used for blood stains, bruises and bites; (b) UV light for body fluids and drug residues; (c) violet light for splashing blood and hair; (d) orange and red light for general contrast search; and (e) IR light for blood splashes, fibers, etc. [59]. Since only blood stains were to be detected at this stage, white light was the only one utilized.

Experimental Results
In our study, six classification algorithms were trained on a synthetic and a physically emulated dataset. In particular, classifiers able to manage the high imbalance between positive and negative samples were selected. Initially, Logistic Regression and a discriminative classification algorithm, the Support Vector Machine, were applied. The investigation then proceeded with the application of bagging and boosting approaches, in particular Random Forest for the bagging techniques, and Adaptive Boosting, Gradient Boosting and eXtreme Gradient Boosting for the second approach. The performances of these classification algorithms were evaluated on the four metrics of Section 2.2, presented for the synthetic dataset in Table 1 and for the physically emulated dataset in Table 2.
Table 1. Outcomes obtained on simulated PPE images from the six classifiers: Logistic Regression (Logit), Support Vector Machine (SVM), Random Forest (RF), Adaptive Boosting (AdaBoost), Gradient Boosting (GB) and eXtreme Gradient Boosting (XGB). All these learners were evaluated on four parameters: Accuracy (Acc), Precision (Pr), Recall (Re) and F1 score; moreover, the execution time was taken into account (t_e in minutes).

From the results presented in Tables 1 and 2, it is possible to observe that all the classifiers exhibit positive results. To determine the classification algorithm that best approximates the desired outcomes, several aspects of each metric were taken into account. Firstly, for accuracy, which is the ratio of correctly predicted observations with respect to the total observations, the best values were reported by the Support Vector Machine with 99.99% for the synthetic dataset, and by eXtreme Gradient Boosting with 99.23% using 200 estimators for the physically emulated dataset. Regarding this metric, it is important to specify that, in our particular condition, it is not the appropriate measure for model performance evaluation. In fact, when facing strongly imbalanced datasets, accuracy values may be misleading, as explained in Section 2.2. Considering the precision, the highest values for both datasets were reported by Adaptive Boosting, with, respectively, 100% on the synthetic PPE images utilizing 30 estimators, and 95.77% on the physically emulated blood stains dataset with 50 estimators.

Precision takes false positives into account, i.e., instances in which our model incorrectly recognized pixels as blood stains that are actually clean areas. On the other hand, recall considers false negatives, cases in which our model labels pixels as clean areas where blood stains are instead present. Typically, when precision increases, recall decreases, and vice versa. This can be observed in the case of Adaptive Boosting, which presented the maximum precision values but, in terms of recall, was one of the worst classification algorithms for both datasets. For this reason, a trade-off between these two parameters is necessary, and this is captured by the F1 score, which combines precision and recall. Combining these two terms, F1 proves to be the best measure for our performance evaluation. The highest outcomes were obtained by the Support Vector Machine with 98.82% for the synthetic dataset, and by eXtreme Gradient Boosting utilizing 200 estimators with 80.68% for the physically emulated dataset. Execution time was the last measurement compared among all the algorithms; furthermore, in our study, it is important to consider that the two datasets were processed on two different machines. The first dataset, which includes synthetic images, was executed on a standard 3.00 GHz dual-core CPU machine, while the second, larger dataset, composed of larger real PPE images, was processed on a 4.00 GHz quad-core CPU machine with an NVIDIA Titan X GPU. The fastest algorithms on the first dataset were Logistic Regression and eXtreme Gradient Boosting (with 20 estimators), both completing within 2 min. Processing the physically emulated PPE images requires more time; here, both Logistic Regression and eXtreme Gradient Boosting executed in approximately 5 min. Moreover, it is possible to observe in Table 2 that the SVM did not achieve any results after more than one week of processing the emulated blood stains (over 10,080 min).

Conclusions
In this study, several classification algorithms were compared, from conventional Logistic Regression to more modern algorithms such as eXtreme Gradient Boosting, in order to detect undesired blood stains on the personal protective suit of medical care personnel after contact with infected patients. The analysis involves the evaluation of the algorithms on two datasets: the first includes synthetic images of the PPE reconstructed in a simulated environment, and the second contains images of the PPE with physically emulated blood stains acquired through a real prototype. All the classifiers were evaluated on four performance metrics, taking their execution time into account. From the obtained results, it is possible to observe that all the selected algorithms report satisfying outcomes (above 50%); furthermore, as expected, the physically emulated images are more challenging to process, which may be due to the irregular blood stains or to the larger dataset. For the definition of the most suitable algorithm, it was necessary to consider the great imbalance between the positive and negative class samples. In this condition, the F1 score proves to be the most appropriate metric. In particular, the highest values were reported by the Support Vector Machine for the first dataset and by eXtreme Gradient Boosting for the second; in both cases, prolonged execution times were needed. Therefore, the best solution was a compromise between F1 score and execution time t_e, where Logistic Regression achieved the most balanced outcomes considering both datasets.

Figure 1. Removal and disposal of gloves inside the changing room (during a training session).

Figure 2. Support Vector Machine hyperplane, in which blue spots are positive samples, pink spots correspond to the negative samples, the rimmed ones are the support vectors, and the central thicker line is the obtained optimal hyperplane [36].

Figure 3. The confusion matrix compares the predicted values with respect to the actual real values. In particular, the matrix elements are defined as: TP, true positive, the cases in which we predicted that there is blood and there is actually blood; TN, true negative, where we predicted there is no blood and there is no blood; FP, false positive, where we predicted there is blood and instead there is no blood; and FN, false negative, where we predicted there is no blood but there actually is blood.

Figure 4. Comparison between the PPE suit of the simulated environment (left) and the physical PPE suit in the actual full-body scanner mechanism (right).

Figure 5. Possible points of illumination that can be chosen within the simulated environment.

Figure 7. Physically emulated dataset at different zoom levels: the (upper left) image represents a 4.8 mm focal length (0% zoom), the (upper right) image corresponds to a 31.92 mm focal length (60% zoom), and the (bottom) image has a focal length of 57.6 mm (100% zoom).

Table 2. Results acquired on real PPE images from the six classifiers: Logistic Regression (Logit), Support Vector Machine (SVM), Random Forest (RF), Adaptive Boosting (AdaBoost), Gradient Boosting (GB) and eXtreme Gradient Boosting (XGB). All these learners were evaluated on four parameters: Accuracy (Acc), Precision (Pr), Recall (Re) and F1 score; moreover, the execution time was taken into account (t_e in minutes).