MasPA: A Machine Learning Application to Predict Risk of Mastitis in Cattle from AMS Sensor Data

Abstract: Mastitis is a common disease in cattle, caused mainly by environmental pathogens; it is also the most expensive disease for cattle on dairy farms. Several prevention and treatment methods are available, although most of these options are quite expensive, especially for small farms. In this study, we utilized a dataset of 6600 cattle records containing several sensor parameters (collected via inexpensive sensors) together with the mastitis status of the cattle. Supervised machine learning approaches were deployed to determine the most effective parameters that could be utilized to predict the risk of mastitis in cattle. To achieve this goal, 26 classification models were built, among which the best performing model (the highest accuracy in the shortest time) was selected. Hyperparameter tuning and K-fold cross validation were applied to further boost the top model's performance while avoiding bias and overfitting. The model was then utilized to build a GUI application that can be used online as a web application. The application can predict the risk of mastitis in cattle from the inhale and exhale limits of their udder and their temperature with an accuracy of 98.1% and sensitivity and specificity of 99.4% and 98.8%, respectively. The full potential of this application can be utilized via the standalone version, which can be easily integrated into an automatic milking system to detect the risk of mastitis in real time.


Introduction
The global dairy industry was valued at around 720 billion USD in 2019, contributing to 54% of the global liquid milk share, and it is projected to grow to 1032 billion USD by 2024 [1]. However, the industry is not invincible, as cattle, like any other animal, can develop diseases. Among them is clinical mastitis, which is the single most expensive disease in the dairy industry, resulting in a loss of around 6% of the production value annually owing to factors such as reduced production, treatment expenses, and milk discard, while also being among the top reasons for permanent removal of cattle from the herd or even cattle mortality [2][3][4]. While 6% is not a high overall amount, the loss is drastic for small farms as the loss per cow can be significant, around 100-500 kg/cow/lactation or around a 5-7% decrease in milk yield per lactation [4,5]. Aside from lower milk yield, mastitis places a financial burden on farmers, as each clinical mastitis case involves therapeutic expenses, veterinary expenses, labor expenses, premature culling loss, non-saleable milk losses, future reproductive loss, replacement loss, and/or death loss, all of which could add up to 444 USD; this amount could be massive for smaller farms or those in low-income countries [6].
Antibiotics, alone or in combination with non-steroidal anti-inflammatory drugs (NSAIDs), are often used for preventing mastitis, and they do work efficiently in preventing most of the economic loss due to clinical mastitis. However, such drugs should be used strictly for treatment: long-term use for prevention results in antibiotic/drug residues reaching the end consumers, which contributes to drug/antibiotic resistance-related health issues. These issues themselves cost the human health care industry around 55 billion USD annually in the United States alone. In other words, the use of antibiotics for preventive care magnifies the burden and transfers it to other sectors rather than solving the core issue [7][8][9].
Mastitis is often caused by microbial infections (mainly bacterial) from the environment, either directly or through feed, eventually causing pathological lesions and inflammation of the mammary glands that could result in progressive fibrosis or even severe toxemia in the cattle. The severity of the symptoms is determined mainly by the type of pathogen and the resistance of the cattle's mammary gland [10,11]. The cattle's mammary gland is not entirely defenseless against these pathogens; the humoral and acquired immune responses of the cattle can, for the most part, successfully prevent these pathogens from causing any damage. Moreover, the enzyme lysozyme found in the cattle's milk can digest the peptidoglycan layer of Gram-positive and Gram-negative bacteria, causing their death. The glycoprotein lactoferrin, found in milk and other secretions of the cattle, can also kill some bacteria by hindering their iron intake pathways. Furthermore, animal breeding activities also consider mastitis risk when breeding cattle. One of the genetic traits considered is the somatic cell count (SCC): as somatic cells contribute to the cattle's immune system and low SCC levels are directly correlated with a higher risk of environmental mastitis, breeding programs tend to favor cattle with high SCCs; however, this approach is limited in practice [12].
As eradication of mastitis risk is quite difficult, preventive measures have been studied extensively, with antibiotics and probiotics at the forefront of most of the studies [13]. In a recent study on 108 Dutch dairy cattle, it was estimated that the cost of preventive measures against mastitis in cattle was around €120/cow/year, of which €81.6 (or 68%) went towards labor expenses, with another €301/cow/year on average in the case of failure or clinical mastitis [14].
Several sensors based on factors such as milk color, temperature, SCC, electrical conductivity, and thermal imaging, and/or on enzyme-based methods such as L-lactate dehydrogenase (LDH), N-acetyl-beta-D-glucosaminidase (NAGase), and haptoglobin (Hp), have been developed for commercial use (commonly as biosensors or immunosensors) in automatic milking systems to detect and flag cattle at risk of mastitis before its onset. However, relying on such single parameters can limit the sensitivity and specificity of the results, not to mention their expense, the cost of specialized labor and equipment, and their restriction to automatic milking systems, all of which could further limit their use in small farms and organic farms that do not utilize such advanced systems [15][16][17].
A promising emerging approach in the early diagnosis of mastitis in cattle is the use of biosensors, as such devices are effective at detecting pathogens even at low concentrations; however, the pathogens found in cattle milk that could cause mastitis are quite diverse, and research on the development of a multi-pathogen detecting biosensor is yet to deliver promising results. Among the largest projects aimed at delivering such technology was the Pathomilk project (Grant agreement ID: 30392), which received €1.7 million of funding for the development of a rapid biosensor that could potentially detect multiple pathogens commonly found in milk. The initial technology was based on DNA hybridization coupled to surface plasmon resonance detection; however, more than a decade after the project's initiation, no significant outcomes have been reported. The use of immunoassays for the detection of pathogens in milk is also limited owing to the heterogeneous content of the milk, which could hinder the antibody binding mechanisms involved in such assays. Likewise, the use of standard PCR-based methods is limited, as the presence of ions (mainly calcium), plasmin, fats, and somatic cells makes it necessary to perform several filtrations before the PCR reaction, eventually increasing the overall cost of the diagnosis [18].
Developments in artificial intelligence and machine learning have revolutionized many fields in recent years, with biotechnology not left behind. Especially with Industry 4.0 initiatives, the internet of things (IoT) has made the data acquisition needed for such analyses more feasible than ever. Recently, the "Sack for Data" approach was proposed, which included four flex sensors and a temperature sensor to collect eight udder parameters along with the temperature of the udder using only Arduino and Raspberry Pi boards, which are extremely cheap to purchase and easy to use [19]. The authors also utilized cloud technology to automate data maintenance and applied K-nearest neighbor (KNN) and support vector machine (SVM) algorithms, with which they achieved 73% and 86% mastitis prediction rates, respectively. While these percentages are not perfect, they serve as a proof of concept for a cheap mastitis detection method based on affordable technology and a data analytic approach [19].
By the end of 2009, 8000 dairy farms utilized automatic milking systems (AMS), with the number growing continuously as AMS provide several benefits such as reduced labor cost, more time flexibility, and overall higher milk yield, as cattle within AMS can be milked multiple times per day. Reports have also shown that cattle are calmer in such systems as they can be milked whenever they are most comfortable [20][21][22]. Another advantage of AMS is that they are automated systems that can collect consistent data; hence, different sensors can be integrated into them to collect specific data including the amount of milk, heat, milking time, and so on.
The aim of this study is to utilize the latest trends in data science and machine learning to develop semi-automated pipelines that could relieve farmers of the cost of mastitis preventive measures, which could add up to €120/cow/year, or at least reduce the contribution of labor expenses to €0. This could be done by using affordable Raspberry Pi kits to collect data manually and predict the risk of mastitis through an online webserver, or by integrating such kits into AMS for fully automated data collection that feeds an open-source application predicting the risk of mastitis in real time. Such a solution could give small farmers, or farmers with limited technical background, great advantages in monitoring their cattle's mastitis status without any external expenses, allowing them to retain more of their revenue [19,23].

Materials and Methods
The dataset used to train and build the machine learning model for predicting the risk of mastitis in this study was obtained from recent research led by Ankitha (2020) [24]. This dataset contains 6600 entries (three entries per cattle) with 15 attributes: cow ID, date, breed, months since giving birth, previous occurrence of mastitis, front left udder inhale limit (IUFL), front left udder exhale limit (EUFL), front right udder inhale limit (IUFR), front right udder exhale limit (EUFR), rear left udder inhale limit (IURL), rear left udder exhale limit (EURL), rear right udder inhale limit (IURR), rear right udder exhale limit (EURR), temperature of the cow, the hardness of an udder (from user input via a switch), pain due to swelling of the udder (manual user input), photographs of the cow's milk, and a binary class label (healthy or mastitis). Among the attributes, only those with significant variance across the dataset and those that can be measured in a cost-effective manner were selected.
The raw dataset was preprocessed via SciPy tools and the Scikit-learn library's feature selection function; unnecessary attributes such as ID, breed, hardness, and pain (which requires labor) were removed, and parameters that constituted less than 50% variance among all the entries were also removed [25,26]. The raw dataset contained 6600 samples; however, it was imbalanced, with 3961 healthy cows (60.02%) and 2639 cows with mastitis (39.98%). To overcome the bias that might arise owing to this imbalance, the RandomOverSampler function from the imbalanced-learn library (a Scikit-learn-compatible package) was utilized. This function takes the underrepresented class (here, the cows with mastitis) and adds samples to it by randomly re-drawing existing entries of that class with replacement until the classes are balanced.
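The balancing step can be illustrated without any third-party dependency. The sketch below is not the imbalanced-learn implementation; it is a minimal standard-library illustration of random oversampling, in which the helper name `random_oversample` is hypothetical and the feature rows are placeholders. Only the class counts are taken from the text.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class has
    as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Sampling with replacement: existing rows are repeated,
        # no new measurements are synthesized.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Class counts from the raw dataset; each row is a placeholder feature vector.
labels = ["healthy"] * 3961 + ["mastitis"] * 2639
rows = [[i] for i in range(len(labels))]

balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts, len(balanced_rows))
```

Because the added rows are copies of existing mastitis entries, the balanced set grows to twice the size of the majority class (2 × 3961 = 7922 samples) without introducing any new sensor readings.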
The balanced dataset was divided into training (6337 entries, 80%) and testing (1585 entries, 20%) subsets, and a total of 26 classification algorithms were utilized to build 26 classification models (RandomForestClassifier, XGBClassifier, LGBMClassifier, BaggingClassifier, DecisionTreeClassifier, ExtraTreeClassifier, KNeighborsClassifier, AdaBoostClassifier, LabelPropagation, LabelSpreading, SupportVectorClassifier, QuadraticDiscriminantAnalysis, NuSupportVectorClassifier, SGDClassifier, RidgeClassifier, LogisticRegression, LinearDiscriminantAnalysis, RidgeClassifierCV, CalibratedClassifierCV, LinearSupportVectorClassifier, GaussianNB, BernoulliNB, PassiveAggressiveClassifier, NearestCentroid, DummyClassifier, and Perceptron). They were built without any hyperparameter tuning (default parameters) and their accuracies were compared. The top performing classifier was selected and hyperparameter tuning via the grid search method was performed. The tuned model was then subjected to 10-fold cross validation and its mean accuracy was calculated. The sensitivity and specificity of the model were calculated using Equation (1) and Equation (2):
sensitivity = TP / (TP + FN) (1)
specificity = TN / (TN + FP) (2)
With mastitis treated as the positive class, true positives (TP) were the cows at risk of mastitis that were correctly predicted, false positives (FP) were the healthy cows predicted to be at risk of mastitis, true negatives (TN) were the healthy cows that were predicted correctly, and false negatives (FN) were the cows at risk of mastitis that were predicted to be healthy.
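As an illustration of how sensitivity and specificity follow from the four confusion counts, the sketch below computes them in plain Python. It assumes mastitis is the positive class, so that a false positive is a healthy cow flagged as at risk and a false negative is a missed mastitis case. The function names and the ten toy labels are hypothetical and are not taken from the study's data.

```python
def confusion_counts(y_true, y_pred, positive="mastitis"):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # at-risk cow correctly flagged
            else:
                fp += 1  # healthy cow flagged as at risk
        else:
            if truth == positive:
                fn += 1  # at-risk cow predicted healthy
            else:
                tn += 1  # healthy cow correctly predicted
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Share of actual positives that were detected: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of actual negatives that were cleared: TN / (TN + FP)."""
    return tn / (tn + fp)


# Made-up labels for illustration only.
y_true = ["mastitis"] * 4 + ["healthy"] * 6
y_pred = ["mastitis", "mastitis", "mastitis", "healthy",
          "healthy", "healthy", "healthy", "healthy", "healthy", "mastitis"]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(tp, fp, tn, fn, sens, spec)
```

Reporting both metrics alongside accuracy matters here because a missed mastitis case and a false alarm carry different costs for the farmer.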

Data Preprocessing
Following the preprocessing and feature selection steps, the eight udder parameters (IUFL, EUFL, IUFR, EUFR, IURL, EURL, IURR, and EURR) and the temperature of the cattle were sufficient to generate a functional model with significant accuracy. The remaining attributes were dropped as their contributions were insignificant; photographs of the milk and attributes such as pain and hardness were also dropped as they are open to bias (through the laborers' interpretation) and would contribute to higher labor cost. This dataset was further balanced with RandomOverSampler, and the final dataset contained 50% healthy and 50% mastitis samples, totaling 7922 samples. The final curated dataset is provided in Supplementary Material 1 (S1).

Model Fitting and Hyperparameter Optimization
The accuracy scores and time taken to build the models with the selected algorithms are summarized in Table 1. The best performing model was the random forest classifier, with an initial accuracy of 99.117%; the parameters suggested by hyperparameter tuning and used to build the final model are summarized in Table 2.
Notes to Table 1: All models were generated with default parameters using their respective Scikit-learn classifier/algorithm (via the lazypredict library). All models were built using the same training and testing sets (same random 80-20 split). Accuracy percentages are from the respective model's performance on the testing set.
Table 2. Parameters used to build the best performing model.
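The tuning and validation procedure described above (grid search followed by 10-fold cross validation on a random forest) can be sketched with Scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' script: the data are synthetic stand-ins for the nine selected features, and the parameter grid is illustrative only, since the values of Table 2 are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for the nine features (eight udder limits + temperature).
X, y = make_classification(
    n_samples=400, n_features=9, n_informative=5, random_state=0
)

# Same 80-20 split proportion as in the study; the random seed is arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Illustrative grid only; these are NOT the parameters of Table 2.
param_grid = {"n_estimators": [25, 50], "max_depth": [4, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)

# 10-fold cross validation of the tuned model on the training subset.
best_model = search.best_estimator_
cv_scores = cross_val_score(best_model, X_train, y_train, cv=10, scoring="accuracy")

# Held-out accuracy on the untouched testing subset.
test_accuracy = best_model.score(X_test, y_test)
print(search.best_params_, cv_scores.mean(), test_accuracy)
```

Keeping the testing subset out of both the grid search and the cross validation is what lets the final accuracy act as a check against overfitting.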

Web App Usage and Local Deployment
The web application developed from the top model can be accessed at https://share.streamlit.io/naeemmrz/maspa.py/main/MasPA.py (accessed on 2 August 2021). The user needs to input the eight udder inhale and exhale limits and the temperature of the cattle (with an optional identifier for each row). A sample input file is provided in Supplementary Material 2 (S2) and can also be downloaded from the web interface; the general interface of the web app is explained in Figure 2. For real-time usage or integration with automatic milking systems, both the model "RndmForest_mastistis.pkl" and the application source code "MasPA.py" are available as open source at the author's GitHub page, along with a step-by-step guide: https://github.com/naeemmrz/MasPA.py (accessed on 2 August 2021).
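For local or AMS integration, the input described above (eight udder limits, temperature, and an optional identifier per row) has to be parsed into numeric feature rows before a model can score it. The sketch below shows one way to do this with the standard library. The column names follow the abbreviations used in this paper but are an assumption, as are the helper name `read_feature_rows` and the sample values; the actual header of the sample input file (S2) may differ.

```python
import csv
import io

# Assumed column names; the real input file may use a different header.
FEATURES = ["IUFL", "EUFL", "IUFR", "EUFR", "IURL", "EURL", "IURR", "EURR",
            "Temperature"]


def read_feature_rows(handle, id_column="ID"):
    """Return (identifiers, feature_rows) from a CSV file-like object."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in FEATURES if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    ids, rows = [], []
    for line_no, record in enumerate(reader, start=2):
        # The identifier is optional; fall back to the CSV line number.
        ids.append(record.get(id_column) or f"row{line_no}")
        rows.append([float(record[name]) for name in FEATURES])
    return ids, rows


# Made-up values for illustration only.
sample = io.StringIO(
    "ID,IUFL,EUFL,IUFR,EUFR,IURL,EURL,IURR,EURR,Temperature\n"
    "cow-1,150,180,151,182,149,179,152,181,38.6\n"
    ",160,190,161,192,159,189,162,191,39.4\n"
)
ids, rows = read_feature_rows(sample)
print(ids, rows)
```

Each entry of `rows` has the nine-value shape that a Scikit-learn style `predict` call expects. Loading the published model file is left out here because the feature order it was trained with is not stated in the text.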

Discussion
Mastitis single-handedly costs the dairy industry around 6% of its production value, and this contribution is expected to grow as the demand for more milk production per cattle increases [4]. While several preventive measures are available for early diagnosis of mastitis in cattle, most of these measures are either too expensive or impractical and inaccessible for small farmers or farmers from low-income countries. The cost of preventive measures for early diagnosis of mastitis in cattle is up to €120/cow/year, which could account for a significant share of a farm's budget [14].
Different farms around the world opt for different preventive measures depending on their geographic region, herd size, and revenue; these measures range from classical inexpensive methods such as temperature monitoring and forestripping, which can be performed by any laborer, to more sophisticated methods such as LDH, NAGase, immunoassays, and biosensors, which provide much higher accuracy but have higher costs and may be accessible only in some regions [17,18].
Recent developments in the field of data science and artificial intelligence have opened many opportunities for developing new methods of diagnosis and detection of diseases by applying sophisticated algorithms to these problems. Several attempts have been made for diseases affecting humans as well as many species of plants and animals [19,27,28,29,30].
The aim of this study was to integrate these recent developments in the field of data science to derive a solution in predicting the risk of mastitis in cattle before it occurred so as to reduce the high cost of treatment, encourage farmers to avoid using antibiotics as a preventive measure, and reduce unnecessary veterinary expenses by providing an open-source tool accessible online free-of-cost.
The integration of machine learning and deep learning technologies for the prediction of clinical and sub-clinical mastitis in cattle is not a novel approach. Indeed, quite recently, Ebrahimi (2019) analyzed parameters such as milk volume, lactose concentration, electrical conductivity, protein concentration, peak flow, and milking time for 364,249 milking instances in cattle, and applied several deep learning and machine learning algorithms to determine the best statistical model that could predict the risk of sub-clinical mastitis. Their study concluded that the gradient-boosted tree algorithm provided the best accuracy of 84.9% from the former parameters, with the random forest algorithm ranking as the worst performing algorithm, with an accuracy of 82.3% [31]. Following a similar path, Fadul-Pacheco (2021) investigated the efficiency of naïve Bayes, random forest, and extreme gradient boosting on the dataset from the Dairy Brain project for early prediction of clinical mastitis. Their study, however, concluded with random forest being the best performing algorithm, with an accuracy of 71% for the first lactation and 85% for the continuous (follow up) model, respectively [32]. These results indicate that different algorithms perform differently when applied to different parameters/attributes. As shown in Table 1, the random forest algorithm performed the best on the attributes in the dataset used for this study.
MasPA is an ML-based solution that can predict the risk of mastitis in cattle from the inhale and exhale limits of each of the four udder quarters and the cattle's temperature, which can be collected via highly affordable sensors either manually (for farms with conventional milking systems) or by integrating such sensors into AMS. As shown in Figure 1, MasPA is based on the random forest algorithm and can predict the risk of mastitis in cattle with a near-perfect accuracy of 98.10%. The application is available as a web application, free of any cost and/or limitations.
To address the lack of internet access on some farms, we also provide a standalone package, MasPA.py (Supplementary Material 3 (S3)), that can run locally on almost any computer while providing the same web interface. Considering the potentially limited technical literacy of some farmers, the interface of the package was made as simple as possible (Figure 2); furthermore, the source code of the application is available as open source, so anyone can modify, optimize, or integrate it into their local system (such as AMS) and/or adapt it to different datasets to predict different diseases.

Conclusions
The proposed web application, MasPA, which is based on the random forest algorithm, can predict the risk of mastitis in cattle from the inhale and exhale limits of their udder and their temperature with an accuracy of 98.10% and sensitivity and specificity of 99.4% and 98.8%, respectively. The full potential of this application can be utilized via the standalone version (available as open source), which can be easily integrated into AMS to detect the risk of mastitis in real time.