Clustering and Characterization of the Lactation Curves of Dairy Cows Using K-Medoids Clustering Algorithm

Simple Summary A lactation curve (LC) provides valuable insights in planning appropriate management strategies related to health, nutrition, and breeding in dairy cows. A clustering based approach on LC patterns analysis is presented. The k-medoids algorithm is adopted for the clustering. This approach generates several clusters which have similar milking characteristics of total milk yield, peak milk yield, and days in milk at peak yield. The LCs of some groups represent characteristics of atypical milking patterns which are not considered much in previous approaches, whereas LCs of the other groups show the typical LC patterns similar to the results of previous methods. This approach could be used as a tool to manage an abnormal herd of cows. Abstract The aim of the study was to group the lactation curve (LC) of Holstein cows in several clusters based on their milking characteristics and to investigate physiological differences among the clusters. Milking data of 330 lactations which have a milk yield per day during entire lactation period were used. The data were obtained by refinement from 1332 lactations from 724 cows collected from commercial farms. Based on the similarity measures, clustering was performed using the k-medoids algorithm; the number of clusters was determined to be six, following the elbow method. Significant differences on parity, peak milk yield, DIM at peak milk yield, and average and total milk yield (p < 0.01) were observed among the clusters. Four clusters, which include 82% of data, show typical LC patterns. The other two clusters represent atypical patterns. Comparing to the LCs generated from the previous models, Wood, Wilmink and Dijsktra, it is observed that the prediction errors in the atypical patterns of the two clusters are much larger than those of the other four cases of typical patterns. The presented model can be used as a tool to refine characterization on the typical LC patterns, excluding atypical patterns as exceptional cases.


Introduction
Lactation curves (LC) provide invaluable information that could be used to evaluate the genetic and management status of lactating dairy cows. An LC is a graphical representation of change in milk yield during the lactation period following calving [1]. It typically increases rapidly until the peak yield is achieved and then decreases slowly until the drying period. The shape of an LC, described based on peak yield, peak time, and persistency, is used to determine the milking potential of a dairy cow and is an index that facilitates feed and health management, in addition to breeding decision [2,3]. For a large

Backgrounds
An LC can be thought as a time-series of daily milk production for an individual dairy cattle. Therefore, grouping LCs can be handled by time-series clustering. Time-series clustering and its application have been widely studied in various fields [28][29][30][31][32]. Liao T. W. [28] provided guidance on how to apply clustering algorithms to time-series data. Aghabozorgi et al. [31] also classified the time-series clustering into three categories: whole time-series clustering, subsequence time-series clustering and time point clustering. In the study, the authors only focused to whole time-series clustering because subsequence time-series clustering was considered meaningless by Keogh and Lin [33] and time point clustering is similar to time-series segmentation [31].
There are two most extensively used time-series clustering algorithms: the connectivity-based hierarchical clustering and the centroid-based k-means clustering. Hierarchical clustering repeatedly merges to form a larger cluster (i.e., agglomerative clustering) or divides a larger cluster into smaller clusters until the cluster is not partitioned (i.e., divisive clustering). This clustering algorithm is often used as a primary prescription for time-series clustering, because it requires no prior understanding of the datasets and clusters [31]. However, due to its high computational complexity, the hierarchical clustering method is not recommended for large datasets [28,32]. Conversely, k-means algorithm is a non-trivial algorithm, which implies that it relies heavily on the initial center of the cluster [31,34]. In addition, k-means algorithm has a slow initial convergence rate [34]. To address the challenge, a weighted probability distribution can be adopted to initialize the center of the cluster, which is introduced in the k-means++ algorithm [34]. In the method, the initial center is selected with probability P(x) = D (x) 2 / ∑ x ∈X D (x ) 2 , where D(x) is the shortest distance from the data point to the nearest center already selected. Therefore, the method increases the rate of convergence, and evenly distributes the initial centers.
Although the k-means algorithm is the most extensively applied, the k-medoids algorithm has promising applications because it is more robust against noise and outliers [23,35]. For example, Sauder et al. [36] compared the performance of hierarchical clustering and two non-hierarchical clustering algorithms (i.e., k-means and k-medoids) in the classification of Holstein dairy cows based on their growth curves, and reported that the k-medoids method was the most appropriate for grouping cows. They also observed significant differences in milking performance between the clustered groups.
K-medoids algorithm is a k-means variant that complements the noise vulnerabilities of k-means algorithm. K-means algorithm adopts the mean value as the center of the cluster. Therefore, it naturally suffers from noise [32]. Conversely, k-medoids algorithm is less sensitive to outliers because it uses the median for the selection of center values, which makes the computation procedure in k-medoids more complex than the procedure in k-means algorithm; however, k-medoids is still a more powerful tool for clustering large datasets than hierarchical clustering [31]. In addition, unlike k-means clustering, k-medoids does not require additional calculation for the inter-LC distances whenever the center is updated.

Dataset
The datasets were collected from January 2016 to December 2018 from four commercial farms in Chungcheong Province, South Korea. An automatic milking system (Astronaut A4; Lely Industries NV, Maassluis, the Netherlands) was installed in three farms, while a conventional milking parlor system (DeLaval international AB, Tumba, Sweden) was used in one farm. Milk yield data were stored individually using radio-frequency identification (RFID) tags in both milking systems. A total of 1332 lactations (i.e., records of milk production events during a single lactation period) were recorded from 724 cows (Table 1). The milking records from 10 to 280 days-in-milk (DIM; calving day was considered as 0 DIM) were used. This was because the data before 10 DIM and after 280 DIM were error-prone due to differences in management practices among the four farms. In addition, an ideal LC is formed using a concave function and the peak is often achieved within 10 weeks [37]. Because the peak information is critical source of information in LCs, we set the non-interpolation period within the 10 through 70 DIM. Lactation records were preprocessed according to three criteria for incomplete data ( Figure 1).

Noise Filtering Standardization
Type-I Criterion

Initial observation is missing
First observation is occurred before 10 DIM First of all, data recorded prior to 10 DIM were filtered out (Type-I). Datasets which had any missing records between 10 to 70 DIM were also discarded (Type-II). Finally, lactations that contained missing data for more than 10 consecutive days between 70 to 280 DIM were also discarded dataset (Type-III).

Any observation in critical section is missing
Out of the total 1332 lactation data, 228 (22.8%), 42 (4.2%), and 732 (73.1%) records were excluded via the Type-I, II and III criteria, respectively. Consequently, a total of 330 (24.74%) lactation records remained in the final dataset (Table 2). One-hundred and one (33.6%) records were from primiparous cows (i.e., cows at first lactation), with an average parity of 2.3, which is close to the Korean national average parity [38].
Data preprocessing was performed on the final dataset to fill gaps and minimize noise in the data. Missing values in lactation records were compensated for via the linear interpolation method [39]. Table 2 presents lactation performance values within the dataset. The milking days, average daily milk yield per head, and total milk yield in a lactation increased with an increase in parity number. After the compensation step, noise filtering was also performed. The moving average was used based on a 10-day window.
Before the clustering, the datasets were normalized using Z-score transformation. The normalization process minimizes clustering divergence caused by differences in dynamic range and emphasizes inter-cluster phenotypic homogeneity.

Clustering
A metric for similarity measures is required for clustering [28,32]. Root mean square (RMS) of Euclidean distance denoted as 'd' is used to measure the similarity between two records as, where: K-medoids clustering, a center-based clustering method, was used. It is advantageous for large datasets, although a pre-defined number of clusters is necessary [28,31,32]. Unlike the k-means algorithm, the k-medoids algorithm considers the median as the center of a cluster. Using a median instead of mean enhances robustness against outliers [40]. In the k-medoids algorithm, a cluster S is given as: where:

between lactation datasets X and µ A
The elbow method was adopted for the selection of the pre-defined number of clusters. The elbow method is a heuristic approach for determining the point where the local optimum is observed [41]. Six was the most appropriate cluster size with a low root mean square error (RMSE) in our experiments.

Characterization and Comparison of the Clusters
After clustering, differences in lactation and physiological characteristics (i.e., parity, peak milk yield, DIM at peak milk yield, total milk yield from 10 to 280 DIM, the day at 70 DIM in a lactation) were compared among the six clusters. The mean LCs of each cluster and the entire dataset were generated. Subsequently, the mean LCs were regressed to the three of the most popular LC models to investigate differences in estimated parameters and lactation characteristics (i.e., peak milk yield, DIM at peak milk yield, and persistency of milk production) among clusters. The models included the Wood model, an incomplete gamma function proposed by Wood [5], the Wilmink model, a model that combines exponential and linear decline function presented by Wilmink [6], and the Dijkstra model, a mechanical model presented by Dijkstra et al. [7], as shown in Table 3. The residuals of the models did not follow the normal distribution, rather shaped as bell curves.
An HPE Proliant Gen10 DL380 server equipped with 16 core CPUs (2.10 GHz) and 32 GB ECC DDR4 RAM was used for the computation. It took several minutes to group and train data. The least-square method was applied for the regression. Clustering algorithm was implemented in Python (v3.6). For the regression, the "SciPy" package (v1.3.0) was used.

Statistical Analysis
Analysis of variance was performed to evaluate the differences in milking characteristics among clusters. Differences in means were tested using Fisher's Least Significant Difference. Statistical significance was set at p < 0.05, and 0.05 ≤ p < 0.1 were considered trends.
Two fitting errors, f and c were measured to test fitting performance. f is the difference between the regression curve and a representative LC of a cluster measured in RMSE. The mean value of the cluster is chosen as the representative LC. c is the difference measured in RMSE between the regression curve and all the LCs in a cluster. When calculating c , all curves are each Z-normalized to compare shapes. Table 3. Traditional regression models.

Model
Lactation Curve [7] y: daily milk yield; t: time from parturition. All parameters a, b, c and d define the scale and shape of the LC. Wood model [5]: 'a' represents the general production level, 'b' represents the increasing period of the lactation before the peak yield, and 'c' represents the decreasing period of the lactation after the peak yield. Wilmink [6] 'a' represents the general production level, 'b' represents the decreasing period of the lactation after the peak yield, and 'c' represents the increasing period of the lactation before the peak yield, and 'd' represents the peak day. Dijsktra [7] 'a' represents the cell population at parturition (theoretical initial milk production), 'b' represents the cell proliferation rate at birth, 'c' represents the rate of decrease in cell proliferation during lactation, and 'd' represents the cell death rate during lactation. All parameters are non-zero positive except b and c from Wilmink model. b and c in Wilmink model are non-zero negative.

Results
The k-medoids clustering algorithm (k = 6) grouped 330 datasets into six clusters of 119, 64, 50, 47, 38, and 12 datasets. The clusters were denoted (a)-(f) ( Table 4). The largest cluster (a), had 36% of the datasets, while the smallest cluster (f) had 4% of the datasets, which represented the smallest dataset. Excluding the 70th DIM in a lactation, there were significant differences in parity, average daily milk yield, total milk yield from 10 to 280 DIM, peak DIM, and peak milk yield (p < 0.01) among the clusters. Clusters (a) and (d), which had high proportions of multiparous cows, had a parity of 2.6, which was significantly higher than those of other clusters (p < 0.05). In contrast, clusters (e) and (f) contained more primiparous cows than multiparous cows, and had lower average parities when compared with the other clusters (p < 0.05). The daily average milk yield and total milk yield were the highest in cluster (a) (39.5 L and 10,713 L) and the lowest in cluster (e) (35 L and 9509 L) when compared to other clusters (p < 0.05). Peak milk yield was the highest in cluster (d) (54 L), with an 11.6-L difference from cluster (e), which had the lowest peak yield (p < 0.05). Clusters (a), (b), and (d) had peak yields at the early lactation period, while clusters (c), (e), and (f) had peak yields at the mid-lactation period. In cluster (e), which had the lowest peak yield, DIM at peak yield was 144 days, which was the latest among the six clusters (p < 0.05).
The regression graphs and model parameters of the three conventional models for the average lactation data in the clusters are presented in Figure 2 and Table 5, respectively. Clusters (a) and (b) curves displayed typical shape similar to that of the average LC of the 330 lactations, while clusters (d) and (f) curves had a different shape (Figure 2). Clusters (c) and (e) curves were gentle with lower peak milk yield than clusters (a) and (b). Cluster (d) curves exhibited rapid declines in milk production in the mid-lactation period after the high peak yields, and cluster (f) curves displayed an abnormal shape rather than a general LC. Cluster (f) curve did not fit properly in all three model. Distorted parameter estimates were also derived from the Wilmink model (Table 5). When the f s of the three models for the clusters were compared, the model with the lowest f varied across the clusters (    The estimated values of the parameters are listed in Table 6, except Wilmink model. Predicted values of peak milk yield were close between the Wood model and the Dijkstra model, but were underestimated by 6 L when compared to the actual lactation data. The greatest difference was 8 L, which was observed in cluster (d). Excluding in cluster (a), the calculated peak DIM value varied between the two models, and there was a gap of 15 days on average. The greatest difference between the models was 20 days, which was observed in cluster (d). Compared to the actual lactation value, the difference in the predicted peak DIM value from two models was relatively low, at 3 days in cluster (a) and (e), but high in cluster (d), at 34 days. Persistency, the relative declining rate at the half point between peak milk yield and the end of lactation, was the lowest in cluster (a) and the highest in cluster (e). Peak yield: peak milk yield; Peak DIM: days in milk at the peak milk yield; Persistency: relative rate of decline at the point halfway between peak milk yield and end of lactation [42]; N/A: not applicable; 1 The features of LC were calculated by the equations as follows [5,42]:

Discussion
Clustering of LC shapes yielded clusters (a), (b), (c), and (e), which accounted for 82% of the total individuals, and had typical LC shapes (Table 4). In particular, clusters (a) and (e) showed typical LC shapes of multiparous and primiparous cows, respectively. Primiparous cows generally have a flatter LC and have relatively high persistency, whereas multiparous cows exhibit rapid increases in daily milk yield from calving to the peak milk yield, followed by a significant decline [5,43]. Our study revealed significant differences in milking characteristics such as peak milk yield, peak time, and persistency (p < 0.01). The results are consistent with the findings of previous studies [44][45][46][47][48][49], which reported that total milk yield in a 305-d lactation increases with an increase in the parity of Holstein cows. Particularly, some studies reported that peak milk yield increased dramatically from parity 1 to parity 2, and peak milk yield was observed at later periods in primiparous cows and earlier in multiparous cows [47,48]. Other studies have reported that milk yield is generally higher in multiparous cows than in primiparous cows, while persistency is often greater in the primiparous cows with less developed mammary glands [50][51][52]. In addition, most primiparous cows are not physically mature [53]. Since primiparous cow require nutrients for their own growth, their metabolic status is different from that of multiparous cows [54]. Such results support the observation that the clustering method used in the present study can discriminate milking characteristics such as total milk yield, peak milk yield, and DIM at peak yield, which vary depending on parity.
Clusters (d) and (f), which accounted for approximately 18% of the cows (Table 4), exhibited abnormal LC shapes ( Figure 2). Cluster (f), which consisted of only 12 individuals (4%), had an LC shape with no peaks and no significant changes in milk production. The LC of cluster (d), which had high and rapid peaks, had an undulating shape due to a sharp decrease in milk yield during the mid-lactation period. Previous studies have reported that LCs with atypical shapes, such as no peak in LC, account for approximately 20∼30% individuals in datasets [2,9,[13][14][15], which is consistent with the observations of the present study. Some studies have also reported very high peaks in the early lactation stages are accompanied by sharp decreases in milk yield during subsequent lactation periods, which could be due to metabolic stress and negative energy balance in some instances [55,56]. In addition, previous studies have reported that cows with high milk production and relatively early peak milk yields in early-lactation periods exhibit slow rates of recovery of body condition score, increased physiological stress, and have high risks of udder disease during mid-and late lactation periods [57][58][59]. The results suggest that the undulating shape of cluster (d) indicated energy unbalance or metabolic disorder. However, such atypical shapes (e.g., non-peak or undulating) of LCs could be caused by individual characteristics [2,9,60]. Arnal, et al. [61] classified LCs of dairy goats using principal component analysis and reported that LCs that were undulating in the mid lactation (120 DIM) period after peak yield were not associated with health problems or environmental factors. To facilitate precision dairy farming, it is critical to determine when such atypical LC shapes emerge, and whether they are attributable to variations, or health problems. Therefore, future studies should develop algorithms that can distinguish such variations or statuses.
The best fitting models are not the same in each clusters as shown in Table 5. The results are consistent with the those of previous studies [3,19]. Previous studies performed goodness of fit tests of the LC model for lactation data by grouping the data according to the period in which the milking data was obtained, the level of milk production, and parity, with the aim of finding a model that optimally describes lactation. Direct comparison on fitting accuracy of presented method to the previous studies is somewhat difficult to perform, since data collection duration, number of cows breed, metric of average milk yield and the data grouping method were not unified among the previous studies. The LC models in previous studies [3,19] were established from the selected groups considering ability of milk production and parity from a large breeding group. Abnormal populations were not considered for the LC model and excluded as exceptional cases. The previous approaches are not appropriate for studying the atypical cases of LC. The LC characteristics of atypical cases would be taken into account using the clustering method.
The three conventional LC models applied in the present study had high fitting errors for clusters with atypical shapes (Table 5). Cluster (d), which had the undulating shape, had the highest fitting error among the clusters, and cluster (f), which exhibited an abnormal shape without peaks, had distorted parameter estimates. Consequently, it was not possible to calculate peak yield, DIM, and persistence in cluster (f) using parameters derived from the model ( Table 6). In addition, in cluster (d), the peak milk yield and DIM calculated using model parameters were significantly different from the actual milking data when compared with other clusters. Cluster (f) individuals accounted for only 4% of the individuals in the entire dataset and were less similar between individual LCs in the group. Therefore, the validation of individual data should be performed before further analyses (Table 5; c ). However, cluster (d) individuals accounted for a considerable proportion of the total data (47 out of 330 cows). In addition, since the similarity between individual LCs in the group was comparable to the similarity observed within other clusters (Table 5; c ), it can be considered as one of the LC types that can emerge in the commercial farms. In previous studies, LCs with atypical shapes have been omitted from initial datasets or their influence on the overall dataset has been diluted using mean values [16][17][18]. As mentioned above, such LC types are attributable to diverse factors ranging from simple individual differences to health problems [2,9,55,56,60], so more detailed investigations are required to determine their specific origins. However, in the present study, the conventional model had a high fitting error for LCs with atypical shapes, which is consistent with the findings of previous studies [9,15,62]. Therefore, it is necessary to develop a high-resolution model that can distinguish detailed features with high goodness of fit for diverse LC shapes.
In summary, clusters delineated from LC shape have different milking characteristics due to the varying proportions of multiparous and primiparous cows in each cluster. In addition, 18% of the individuals had atypical LC shapes. One of the clusters had an undulating LC shape and accounted for 14% of the individuals. This observation is presumably due to metabolic problems caused by rapid peaks and high production at the early stages of lactation. The fit of the model varied for each cluster following the fitting the three popular LC models to average lactation data clusters. We confirmed, however, that the conventional model is not suitable for clusters with atypical shapes due to high fitting errors. The method proposed in the present study may facilitate the identification and management of cows that require attention in herds. One of the practical applications of this study is to predict the daily milk production of individual cows using the cluster information. If the LC of currently milking cow follows a typically shaped cluster, then the regression models of the cluster can be thought as a predictor. The present study, however, was carried out using a limited amount of lactation data, so more commercial farm data should be collected to validate the various LC shapes. In addition, the correlations between LC shapes and the factors influencing such correlations should be investigated further based on biological and environmental data from individuals linked to lactation data. It is also necessary to investigate how many clusters of lactation data are appropriate for application in herd management. Finally, further studies should be conducted to develop a model that can mathematically explain LCs with various shapes.

Conclusions
This study described the development of a pattern-based clustering method for LCs using k-medoids algorithms. This methodology was able to successfully group the individual lactation data with similar milking characteristics (i.e., total milk yield, peak milk yield, and days in milk at peak yield). These results suggest that our methodology can facilitate the identification and management of cows that require attention in the herd.