Unsupervised Few Shot Key Frame Extraction for Cow Teat Videos

A novel method of monitoring the health of dairy cows in large-scale dairy farms is proposed via image-based analysis of cows on rotary-based milking platforms, where deep learning is used to classify the extent of teat-end hyperkeratosis. The videos can be analyzed to segment the teats for feature analysis, which can then be used to assess the risk of infections and other diseases. This analysis can be performed more efficiently by using the key frames of each cow as they pass through the image frame. Extracting key frames from these videos would greatly simplify this analysis, but there are several challenges. First, conditions during data collection in the farm setting are harsh, resulting in unpredictable temporal key frame positions; empty, obfuscated, or shifted images of the cow's teats; frequently empty stalls due to challenges with herding cows into the parlor; and regular interruptions and reversals in the direction of the parlor. Second, supervised learning requires expensive and time-consuming human annotation of key frames, which is impractical in large commercial dairy farms housing thousands of cows. Unsupervised learning methods rely on large frame differences and often suffer from low performance. In this paper, we propose a novel unsupervised few-shot learning model which extracts key frames from large (∼21,000 frames) video streams. Using a simple L1 distance metric that combines both image and deep features between each unlabeled frame and a few (32) labeled key frames, a key frame selection mechanism, and a quality check process, key frames can be extracted with sufficient accuracy (F score of 63.6%) and speed (<10 min per 21,000 frames) to meet the demands of the commercial dairy farm setting.


Introduction
Monitoring the dairy cows' health is critical in ensuring quality milk production. In the commercial dairy farm setting, monitoring the health of thousands of cows is a time-consuming and expensive task. During the milking process, cows are moved toward large parlors for machine milking as shown in Figure 1. These systems consist of a large and slowly rotating set of stalls, where a cow is guided into a stall, a milking unit is manually attached to the cow's teats, machine milking commences via vacuum, and the milking unit automatically detaches and retracts from the teats. Thereafter, the cow exits the rotating parlor.
Within a milking session, the opportunity for a veterinarian to assess the health of the dairy cows' teats is limited to immediately before or after the milking unit attaches to or detaches from the teats. Mastitis, or bacterial infection of the udders and/or teats, poses one of the greatest health concerns for dairy cows. The risk of mastitis increases with changes in the callosity (hyperkeratosis) of the teat end, and this can be assessed via manual inspection. While it is possible to assess the extent of hyperkeratosis in a large proportion of the herd during a milking session, the total time available for the veterinarian to conduct this assessment is limited by the finite amount of time that the cow is in the stall (typically tens of seconds). It is thus impractical to manually conduct health assessments of the entire herd in this way.

Recently, we proposed a digital framework for evaluating the extent of hyperkeratosis by using a digital camera and software [2]. This approach allows remote assessment of the entire population of cows that enter the parlor. A digital approach also permits several experts to conduct assessments of hyperkeratosis independently, mitigating the influence of inter-rater variability. We have also shown that it is feasible to use deep learning to classify the extent of hyperkeratosis [3]. These innovations permitted the opportunity to explore whether such health assessments can be conducted remotely using video-based imaging systems. Later, we proposed a separable confident transductive learning [4] model to minimize the difference between training and test datasets, and we improved the hyperkeratosis recognition accuracy from 61.8 to 77.6%.
While a video-based analysis might seem like a simple extension of this work, analyzing the entire video frame by frame is inefficient since only a small number of frames contain useful diagnostic information. Many vision-based tasks (classification, segmentation) can be performed more efficiently using key frames (KFs) instead of the full video; thus, one option is to select KFs from these cow teat videos for analysis. Most existing key frame extraction (KFE) methods use supervised or unsupervised learning. Supervised learning requires the manual labeling of KFs from large-scale training data to train a model. In the dairy farm setting, it is neither practical nor economical to manually label all video images; thus, unsupervised or semi-supervised learning models are preferred. Unsupervised learning models for detecting KFs rely on significant changes between image frames. The cognitive goal of our problem is to extract key teat frames from video sequences in which changes in objects between frames are less obvious. The utilitarian goal is to efficiently and accurately extract key frames given only a few labeled key frames. Therefore, existing supervised methods (which require massive labels) and unsupervised methods (which require significant frame changes) are ineffective for our problem.
We propose a modified few-shot learning approach that leverages knowledge from several (N = 32) support KFs to identify KFs in unlabeled video image frames (Figure 2). Figure 3 shows 6 of the 32 KFs used in this study. This paper provides three specific contributions:
• The CowTeatVideo Benchmark. We provide a new, publicly available dataset of dairy cow teat videos for key frame extraction, which can be used for the testing and evaluation of different KFE models.
• Few-shot generalized learning. We run few-shot learning without a base training dataset, using only unlabeled query datasets (cow teat videos). The key frames are detected using the distance between the unlabeled query datasets and the support key frame images.
• UFSKFE model. We describe a novel unsupervised few-shot learning key frame extraction (UFSKFE) model for our problem. We combine the L1 distances of raw RGB images and extracted deep features to form a robust fusion distance. After selecting key frame candidates, we further propose a quality check process to remove noisy key frames.

Figure 3. Six sample key frames (KFs) in the cow teat video. These KFs should provide a clean, unambiguous, and high-resolution image of the dairy cow teats for clinical diagnosis, suppress similar frames, and be diverse enough to reduce redundancy.

Related Work
Extracting correct KFs has been a long-standing problem with many applications, such as managing, storing, transmitting, and retrieving video data. Both traditional and deep learning-based methods have been explored.

Traditional Methods
Traditional KFE models can be divided into two categories: unsupervised learning and supervised learning. Unsupervised KFE often relies on computing relevance, diversity, and representativeness from traditional hand-crafted features such as optical flow [5,6], SIFT [7,8], and SURF [9,10]. Clustering is one representative unsupervised KFE approach [11]. Mendi and Bayrak [12] developed a dynamic KFE method in three steps: color histogram differences, self-similarity modeling, and unsupervised k-means clustering. Priya and Dominic [13] utilized inter-cluster similarity analysis to extract KFs. Vázquez-Martín and Bandera [14] computed similarity by building an auxiliary graph of frame features and then applied spectral clustering to extract KFs. Later, Ioannidis et al. [15] extracted KFs by applying spectral clustering to a composite similarity matrix computed as a weighted sum of all similarity matrices of video frames. Supervised KFE models rely on human-annotated data to train a machine learning model and generate KFs from test videos. Ghosh et al. [16] and Gygli et al. [17] treated KF extraction as a regression scoring problem, where frames with higher scores are selected as KFs. Yao et al. [18] proposed a multi-feature fusion method, which can capture complicated and changeable dancer motions, to extract KFs from dance videos.

Deep Learning Models
Recently, deep learning approaches have attracted interest in KFE. Both supervised and unsupervised deep learning models have been proposed to boost the performance of KFE from videos. Supervised deep KFE models usually estimate a frame's importance via deep neural networks with the aid of ground truth KFs. Zhang et al. [19] first applied long short-term memory (LSTM) units to model variable-range temporal dependency among video frames and predicted each frame's importance via a multi-layer perceptron. Later, Zhao et al. [20] proposed a two-layer LSTM to estimate the key fragments of a video; they further developed a tensor-train embedding layer in a hierarchical architecture of recurrent neural networks to model the long temporal dependency among video frames [21]. Based on [19], Casas and Koblents introduced an attention mechanism to estimate frame importance and select the video KFs. Fajtl et al. [22] utilized self-attention with a two-layer fully connected network to predict the frame importance score. Li et al. [23] developed a global diverse attention mechanism based on a pairwise similarity matrix that contains diverse attention weights, which can be further transformed into frame importance scores. Jian et al. [24] extracted the KFs of sports videos by considering the neighboring probability differences of frames, where these probabilities were estimated by a CNN applied to extracted regions of interest. Yuan et al. [25] introduced a global motion model to extract candidate KFs; spatial-temporal consistency and hierarchical clustering were then used to extract the final KFs.
There are also several unsupervised deep learning models for KFE. Yuan et al. [26] introduced a bidirectional LSTM model to automatically extract KFs. Mahasseni et al. [27] applied generative adversarial networks (GANs) to KFE: they employed an LSTM as a frame selector and trained it to confuse the discriminator, which aims to distinguish the original video from the reconstructed video. Yuan et al. [28] utilized a bidirectional LSTM as a frame selector to model the temporal dependency among frames, and the KFs were evaluated by two GANs. Yan et al. [29] proposed an automatic self-supervised learning model to detect KFs in videos, generating pseudo labels for each frame from optical flow and RGB image features. Li [30] proposed an end-to-end network embedding for unsupervised KFE in person re-identification, designing a KFE module by training a CNN with pseudo labels generated by hierarchical clustering. Recently, Elahi and Yang [31] proposed an online learnable module for KFE, and the extracted KFs were used for action recognition with deep learning-based classification models.
Our goal is to devise an effective strategy to extract KFs that contain a clear, unambiguous, and high-resolution image of the dairy cow teats for clinical diagnosis. Unsupervised learning models rely on sharp differences between consecutive frames to determine the KFs, but this is not the case in our problem. Unsupervised clustering models can also lead to low performance in our situation, since KFs are similar to each other and may easily be assigned to the same cluster (see sample KF images in Figure 3).
Few-shot learning aims to accomplish a learning task using very few training examples, typically recognizing different categories of images in a query dataset given a base training dataset and a support dataset [32][33][34]. Oreshkin et al. [35] trained a normal global classifier on the base dataset to form an auxiliary task, which can co-train the few-shot classifier and create a regularization effect. Gidaris et al. [36] combined self-supervision with few-shot learning, which can learn rich and transferable visual representations with few annotated samples. Hong et al. [37] utilized reinforcement learning to train an attention agent that generates discriminative representations in few-shot learning. Wei and Mahmood [38] optimized few-shot learning tasks by generating new samples using variational autoencoders for face recognition. However, current few-shot models are mostly supervised and rely on labeled examples, and current attempts at unsupervised few-shot learning [39,40] are not suitable for our problem, where only a few KFs (the support dataset) and unlabeled cow teat videos are available for learning.

Motivation
Given the unique nature of our dataset and problem, we propose to apply few-shot learning in an unsupervised manner for KFE. We design a framework that takes the knowledge from the few support KF images and finds their nearby neighbors using both raw RGB image distances and pre-trained deep feature distances, as shown in Figure 4.

Figure 4. The scheme of our proposed unsupervised few-shot key frame extraction (UFSKFE) model. We first calculate the raw distance $d_{raw}$ between each video frame image and the few support key frame (KF) images. Secondly, we employ a pre-trained CNN (ResNet-101) to extract deep features for the video frame images $\Phi(V)$ and the 32 support key frames $\Phi(S)$, and then calculate the deep distance $d_{deep}$. Lastly, we perform a quality check (QC) to select KFs for each video with a smaller fusion distance $d$.

Key Frame Extraction
Given a video $V = \{v_i\}_{i=1}^{n_v}$, where $v_i$ is the $i$-th frame image and $n_v$ is the number of frames in video $V$, the goal of video KFE is to fetch the KF numbers

$$Y = S(V), \quad (1)$$

where $Y = \{y_j\}_{j=1}^{n_y}$ ($n_y$ is the number of predicted KFs, and $n_y \ll n_v$) and $S$ is an automatic KF selection function. In supervised KFE, the KF numbers $F = \{f_i\}_{i=1}^{n_f}$ of video $V$, or the importance of each frame image, are provided, where $n_f$ is the number of KFs and typically $n_f \ll n_v$. We aim to minimize the error between $Y$ and $F$ during training and generalize the trained model to new video data. In unsupervised KFE, no KFs are known (i.e., $F = \varnothing$); the aim is to predict the $Y$ that best describes the content of video $V$.

Few-Shot Learning
In supervised few-shot learning, we have a labeled base training dataset $D_{base}$ that contains $n_d$ labeled training images from $A$ base classes, i.e., $z_i \in \{1, 2, \cdots, A\}$. In addition, we are given a support dataset $D_S$ of labeled images from $C$ novel classes, and each class has $K$ examples. The goal of few-shot learning is to train a model that can accurately recognize the $C$ novel classes in another query dataset $D_Q$. This learning paradigm is called $C$-way $K$-shot learning. In unsupervised few-shot learning, there are no labels for the base training dataset, i.e., the class labels $z_i$ are unknown. In our KFE problem, the base training dataset is also unavailable, i.e., $D_{base} = \varnothing$. We treat the full video as the query dataset, and it has no labels. In the next section, we discuss how we construct tasks in unsupervised KFE with few-shot learning.

Unsupervised Few-Shot KFE
In traditional unsupervised KFE, poor performance is often the result of having no labeled KFs. In our videos, there are no distinctive changes between frames as there are in, for example, sports videos. To improve the learning of these KFs, we start with a few KFs (i.e., a support dataset $D_S$ exists). Since we only have one class (KFs) and $K$ KFs ($K$ images, $K = 32$ in our case), our problem can be treated as a one-way 32-shot problem from a few-shot learning perspective. However, the aforementioned base training dataset is not provided. Furthermore, the query dataset is our unlabeled cow teat video ($D_Q = V$). A key question then is how to obtain key frames for each cow in all unlabeled videos with only a few prior KFs. Inspired by few-shot learning, we consider measuring the distance between each video frame image and the support KFs.

Raw Distance Representation
To select KFs from the unlabeled videos, we propose to calculate the distance between the support KF images $D_S = \{s_k\}_{k=1}^{K=32}$ and each frame image of a video. Frames with the lowest distances could be potential KFs. First, we calculate a distance based on each raw frame image and support KF image via the distance matrix $M_{raw} \in \mathbb{R}^{n_v \times K}$, which represents the L1 difference between each raw video frame image and the $K$ support KF raw images. An element of the distance matrix is defined as

$$M_{raw}^{ik} = |s_k - v_i|_1, \quad (2)$$

where $|\cdot|_1$ is the L1 norm of the difference between one support KF image and one video frame ($k \in \{1, \cdots, K\}$ and $i \in \{1, \cdots, n_v\}$), $|s_k - v_i|_1 \in \mathbb{R}^{1 \times 1}$, and hence $M_{raw} \in \mathbb{R}^{n_v \times K}$. We then define the raw distance as

$$d_{raw} = \min_r(M_{raw}), \quad (3)$$

where $\min_r$ returns the minimum of each row of the matrix $M_{raw}$. For each frame $v_i$, its associated raw distance $d_{raw}^i = \min_k \{M_{raw}^{ik}\} \in \mathbb{R}^{1 \times 1}$ denotes the distance to the closest support KF image. Since a video contains many images of a cow, and many cows, several KFs to compare against an analyzed image are necessary. We aim to have a diverse set of support KFs $s_k$ from which at least one image closely resembles the current frame. For all frames in any video $V$, we can calculate the raw distance $d_{raw} \in \mathbb{R}^{n_v \times 1}$. Note, however, that the raw distance is computed using the original images and might not capture all of the important features of a key frame. We therefore also extract deep features from both the video frame images and the support KF images and calculate a deep feature distance, as described in the next section.
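To make the computation concrete, the following is a minimal NumPy sketch of the raw-distance step described above; the array shapes and function names are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def raw_distance(frames, support_kfs):
    """Compute d_raw: for each video frame, the L1 distance to its closest
    support key frame, using raw pixel values (Equations (2) and (3)).

    frames:      float array of shape (n_v, H, W, 3) -- all video frames
    support_kfs: float array of shape (K, H, W, 3)   -- K labeled support KFs
    returns:     float array of shape (n_v,)         -- d_raw
    """
    n_v, K = frames.shape[0], support_kfs.shape[0]
    M_raw = np.empty((n_v, K))
    for i in range(n_v):
        for k in range(K):
            # L1 norm of the pixel-wise difference (one element of M_raw)
            M_raw[i, k] = np.abs(support_kfs[k] - frames[i]).sum()
    # Row-wise minimum: distance to the closest support KF
    return M_raw.min(axis=1)
```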

Deep Distance Representation
There is no deep model for cow teat video classification or segmentation; thus, our approach is to extract deep features from a pre-trained ImageNet model. Let $\Phi$ represent feature extraction from a pre-trained ImageNet model. Similar to the raw distance matrix in Equation (2), an element of the deep distance matrix is denoted as

$$M_{deep}^{ik} = |\Phi(s_k) - \Phi(v_i)|_1, \quad (4)$$

where $\Phi(\cdot) \rightarrow \mathbb{R}^D$ represents the feature vector for a given frame image with dimensionality $D$ (we extract deep features from the layer prior to the last fully connected layer), $M_{deep}^{ik} \in \mathbb{R}^{1 \times 1}$, and $M_{deep} \in \mathbb{R}^{n_v \times K}$. The deep distance is then defined as

$$d_{deep} = \min_r(M_{deep}). \quad (5)$$

Again, $d_{deep}$ has size $n_v \times 1$. This deep distance represents the feature difference between the current video frame and its closest support KF. Both $d_{raw}$ and $d_{deep}$ denote the distance between one video frame and the support KFs. Next, we form a robust fusion distance by combining these two distances for KFE.
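A hedged PyTorch/torchvision sketch of the deep-distance step is shown below. It assumes a recent torchvision with the pre-trained-weights API, and replaces the final fully connected layer of ResNet-101 with an identity mapping to expose the penultimate-layer features (our $\Phi$); the preprocessing, batching, and function names are illustrative assumptions.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained ImageNet ResNet-101; replacing the final fully connected layer
# with an identity exposes the penultimate-layer features (our Phi).
resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(images, batch_size=64):
    """images: iterable of uint8 (H, W, 3) frames -> (N, D) feature array."""
    feats, batch = [], []
    for img in images:
        batch.append(preprocess(img))
        if len(batch) == batch_size:
            feats.append(resnet(torch.stack(batch)).cpu().numpy())
            batch = []
    if batch:
        feats.append(resnet(torch.stack(batch)).cpu().numpy())
    return np.concatenate(feats, axis=0)

def deep_distance(frame_feats, support_feats):
    """d_deep: L1 distance from each frame's feature vector to its closest
    support KF feature (Equations (4) and (5))."""
    d_deep = np.full(frame_feats.shape[0], np.inf)
    for s in support_feats:
        # L1 distance of every frame to this support KF; keep the row minimum
        d_deep = np.minimum(d_deep, np.abs(frame_feats - s).sum(axis=1))
    return d_deep
```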

Fusion Distance
We combine the raw and deep distances in a new distance function to improve the performance of KF detection in our problem:

$$d = \alpha \, \hat{d}_{raw} + (1 - \alpha) \, \hat{d}_{deep}. \quad (6)$$

Since the raw distance $d_{raw}$ and the deep distance $d_{deep}$ have different magnitudes, we rescale them by dividing by the maximum distance within their respective matrices, i.e., $\hat{d}_{raw} = d_{raw} / \max(d_{raw})$ and $\hat{d}_{deep} = d_{deep} / \max(d_{deep})$. The parameter $\alpha$ controls the weight between the rescaled raw distance $\hat{d}_{raw}$ and the rescaled deep distance $\hat{d}_{deep}$. With this new fusion distance $d$ defined, the next step is to design a KF selection function $S$ that correctly retrieves KFs.
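Under the same assumptions as the sketches above, the fusion step reduces to a few lines; the weighted combination with weight α mirrors Equation (6), and α = 0.4 is the setting reported in this paper.

```python
def fusion_distance(d_raw, d_deep, alpha=0.4):
    """Rescale each distance by its maximum value and blend them (Equation (6))."""
    d_raw_hat = d_raw / d_raw.max()
    d_deep_hat = d_deep / d_deep.max()
    return alpha * d_raw_hat + (1.0 - alpha) * d_deep_hat
```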

Key Frame Selection Mechanism
One straightforward way of extracting KFs is to return frames whose distance score is below a threshold. However, establishing an (arbitrary) threshold is prone to errors and redundancy: some true KFs could have large distances to the support KFs and be missed, while several frames with distances below the threshold can belong to the same cow (redundancy). For example (Figure 5), the fusion distance when analyzing a video suggests four frames (circles) would be selected as KFs. However, each circle represents one cow, and only one frame is needed for the best view of the key cow teat frame. To address the redundancy problem, we propose to first sort $d$ in ascending order, then iteratively take the frame with the smallest distance as a KF and remove the window of ±R potentially redundant frames around it. This process is summarized in Algorithm 1. This key frame selection $S$ allows us to uniquely obtain KFs for each cow in the video.
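The selection mechanism can be sketched as follows, assuming the fusion distance has already been computed; the greedy loop and the optional cap on the number of selected frames are illustrative choices, not a verbatim copy of Algorithm 1.

```python
import numpy as np

def select_key_frames(d, R=500, n_max=None):
    """Greedy key frame selection (the selection function S).

    d:     (n_v,) fusion distance of every frame to its closest support KF
    R:     redundancy window; frames within +/-R of an accepted KF are skipped
           (R = 500 in this paper)
    n_max: optional cap on the number of returned key frames
    """
    order = np.argsort(d)                  # ascending: most KF-like frames first
    suppressed = np.zeros(len(d), dtype=bool)
    selected = []
    for idx in order:
        if suppressed[idx]:
            continue
        selected.append(int(idx))
        lo, hi = max(0, idx - R), min(len(d), idx + R + 1)
        suppressed[lo:hi] = True           # suppress redundant frames of the same cow
        if n_max is not None and len(selected) >= n_max:
            break
    return sorted(selected)
```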

Predicted KFs Quality Check
After generating several KF candidates $Y_S$ with Algorithm 1, we conduct a quality check (QC) of the predicted KFs. The most common issue with an incorrect KF candidate is that the milking unit is still attached to the dairy cow or obstructs visualization of the dairy cow teats (as shown in Figure 6a). To ensure that selected KFs have a clear view of the teat area, we calculate the structural similarity index (SSIM) [41] score between the approximate teat area of the support KFs and the same area of the selected KFs (the position and size of the teat area is x coordinate = 130, y coordinate = 80, width = 170, height = 190, and remains constant since the camera is in a fixed position). If the SSIM score between the most similar support KF and the selected KF is smaller than the threshold (O = 0.45), the selected KF is excluded. This threshold is determined empirically. Let $L$ be the number of selected KF candidates $Y_S$ and $Y_S^l$ be its $l$-th KF number. We then calculate the SSIM between each selected KF and each support KF at the teat position to form a similarity matrix $H \in \mathbb{R}^{L \times K}$. An element of $H$ is defined as

$$H^{lk} = \mathrm{SSIM}\big(p(v_{Y_S^l}),\, p(s_k)\big), \quad (7)$$

where $p$ extracts the sub-region of the image of greatest clinical relevance (the teat area), and $v_{Y_S^l}$ is the selected KF image. Finally, we determine the KF numbers with the following equation:

$$Y = \big\{\, Y_S^l \;\big|\; \max_k H^{lk} \geq O \,\big\}. \quad (8)$$

Figure 6. Quality check between a detected key frame (a), which shows the milking apparatus still attached to the dairy cow, and its closest support frame (b). SSIM is computed over a fixed region of interest within the frame (red and green rectangles). Key frame (a) does not pass the quality check since its SSIM score is lower than the predetermined threshold.

Figure 6a displays a candidate KF image from $S$ in which the milking unit is still attached to the dairy cow teats. To mitigate this issue, we calculate the SSIM between the current KF and the support KFs within the sub-region using Equation (7), and take the highest SSIM score among all $K$ support frames to obtain the most similar support KF (Figure 6b). The SSIM score is 0.41, which is lower than the threshold O = 0.45. Using this method, we are able to exclude the detected KF in Figure 6a.
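A possible implementation of the quality check is sketched below, assuming scikit-image's structural_similarity, grayscale crops, and the fixed teat region given above; the helper names are hypothetical.

```python
from skimage.color import rgb2gray
from skimage.metrics import structural_similarity as ssim

# Fixed teat region (the camera position is constant): x, y, width, height
TEAT_BOX = (130, 80, 170, 190)

def crop_teat(img, box=TEAT_BOX):
    """Grayscale crop of the clinically relevant teat region p(.)."""
    x, y, w, h = box
    return rgb2gray(img[y:y + h, x:x + w])

def quality_check(candidates, frames, support_kfs, O=0.45):
    """Keep a candidate KF only if its teat region reaches SSIM >= O with at
    least one support KF (Equations (7) and (8))."""
    support_crops = [crop_teat(s) for s in support_kfs]
    kept = []
    for idx in candidates:
        cand_crop = crop_teat(frames[idx])
        # Highest SSIM over all K support key frames
        best = max(ssim(cand_crop, c, data_range=1.0) for c in support_crops)
        if best >= O:
            kept.append(idx)
    return kept
```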

UFSKFE Model
The overall UFSKFE model can be written as

$$Y = QC(S(d)), \quad (9)$$

where QC is the quality check, $S$ is the selection mechanism in Algorithm 1, and $d$ is the fusion distance. The overall learning algorithm is shown in Algorithm 2.
Algorithm 2 Unsupervised few-shot key frame extraction
1: Input: cow teat video V, K = 32 support KFs, weight balance factor α, redundant frame number R, and similarity threshold O
2: Output: predicted KFs Y
3: for i = 1 to n_v do
4:   for k = 1 to K do
5:     Compute M_raw^{ik} and M_deep^{ik} according to Equations (2) and (4)
6:   end for
7: end for
8: Calculate d_raw and d_deep according to Equations (3) and (5), and form d using Equation (6)
9: Select KF candidates Y_S using Algorithm 1
10: for l = 1 to len(Y_S) do
11:   for k = 1 to K do
12:     Compute similarity matrix H according to Equation (7)
13:   end for
14: end for
15: Return predicted key frame numbers Y using Equation (8)
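For illustration, the whole pipeline of Algorithm 2 can be composed from the helper functions sketched in the previous sections (raw_distance, extract_features, deep_distance, fusion_distance, select_key_frames, and quality_check); this composition is an assumption about how the steps fit together, not the authors' released code.

```python
import numpy as np

def ufskfe(frames, support_kfs, alpha=0.4, R=500, O=0.45):
    """End-to-end unsupervised few-shot key frame extraction (Algorithm 2).

    frames:      all video frames as a uint8 array (n_v, H, W, 3)
    support_kfs: the K = 32 labeled support key frames, (K, H, W, 3)
    returns:     list of predicted key frame indices Y
    """
    # Lines 3-8: raw and deep distances, then the fusion distance d
    d_raw = raw_distance(frames.astype(np.float32), support_kfs.astype(np.float32))
    d_deep = deep_distance(extract_features(frames), extract_features(support_kfs))
    d = fusion_distance(d_raw, d_deep, alpha=alpha)
    # Line 9: key frame candidates via the selection mechanism S
    candidates = select_key_frames(d, R=R)
    # Lines 10-15: SSIM-based quality check over the teat region
    return quality_check(candidates, frames, support_kfs, O=O)
```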

Data Collection
Approximately eight hours of video footage of dairy cow teats on a commercial dairy farm were obtained using a GoPro 10 camera mounted on a tripod, with two adjustable LED lights directed towards the teats. The farm's 1691 Holstein dairy cows were housed in free-stall pens and milked three times per day in a 60-stall rotary parlor. Cows were in the first (697, 41.2%), second (446, 26.4%), and third or greater lactation (548, 32.4%) and between 1 and 738 days in milk (mean 185, standard deviation 113). All procedures were reviewed and approved by the Cornell University Institutional Animal Care and Use Committee (protocol no. 2013-0064). Videos were sampled at 1080 × 1920 × 3 pixels and 59.94 frames per second and saved in MP4 format. The camera was set to default settings, and external lighting was used. The images were acquired immediately after removal of the milking cluster.
The rotational speed of the milking rotary parlor was 8.5 s/stall, leading to a rotation time of 510 s (i.e., 8.5 min). This resulted in a theoretical throughput of 423 cows per hour. The average duration to milk the 1600 cows was approximately five hours. The rotational speed of the milking parlor platform does not affect the accuracy of the camera measurements, provided that the video feed is sampled at a sufficiently high rate; our data were sampled at approximately 60 frames per second. Four milking technicians operated the milking parlor and were assigned to four different positions with the following tasks: position 1, manual forestripping of teats and application of pre-milking teat disinfection; position 2, cleaning and drying of teats with a clean cloth towel; position 3, attachment and alignment of the milking unit; and position 4, application of post-milking teat disinfectant with a dip-applicator cup. Post-milking teat disinfectant was applied by an automatic teat spray robot. Cows were led to the holding area by one farm technician.
Plastic covers protected the tripod and lights and were mounted around the camera to minimize contamination from feces and other contaminants. The camera feed was displayed continuously and regularly checked to ensure that the lens was not obfuscated by such contaminants, and the camera lens itself was regularly inspected and cleaned throughout the data collection. Table 1 shows the statistics of the cow teat videos analyzed in this study. There are only a few KFs in each cow teat video, which adds to the difficulty of KFE. Note that cows do not always occupy all the stalls in the rotating parlor, which explains why fewer key frames are detected in videos 1-10. Note also that the videos are relatively large in file size (2.47 gigabytes on average), with approximately 21,191 frames per video. The number of KFs was checked with an expert for evaluation purposes. There are usually about 500 frames between two successive KFs unless the parlor rotation is interrupted, the parlor stall is empty, or the milking system obfuscates the teats; for these reasons, the redundant frame number R is set to 500. The computation time should be as short as reasonably possible, as long computation may delay the assessment of a cow's teat health. We expect the computation time of any KFE algorithm to be less than an hour per video, which is reasonable in a commercial dairy farm setting.

Evaluation Metric
We use the F score to evaluate the performance of KFE models [27,29]. The F score uses recall (Re) and precision (Pr) to measure how much the predicted KFs overlap with the ground truth KFs, as in Equation (10). The higher these metrics, the better the model:

$$Pr = \frac{N_{corr}}{len(Y)}, \qquad Re = \frac{N_{corr}}{n_f}, \qquad F = \frac{2 \times Pr \times Re}{Pr + Re}, \quad (10)$$

where $n_f$ is the number of ground truth KFs (third column in Table 1), $len(Y)$ is the number of predicted KFs, and $N_{corr}$ is the number of correctly detected KFs. The closer the F score is to 1 (or 100% in Table 2), the better the model. Since the frames near an annotated KF are similar to it and also contain a clear view of the teat area, we treat a prediction within ±20 frames (approximately 0.3 s) of an annotated KF number as correct (e.g., a predicted KF number of 120 is correct if the annotated KF number is 100). This tolerance will vary with the video frame rate and the rotation rate of the parlor.
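As a concrete reference, the tolerant F score can be computed as in the sketch below; the greedy one-to-one matching of predictions to annotated KFs within the ±20-frame window is an assumption about how correct detections are counted.

```python
def f_score(predicted, ground_truth, tol=20):
    """F score with a +/-tol frame tolerance (Equation (10)).

    predicted:    list of predicted KF frame numbers (len(Y) entries)
    ground_truth: list of annotated KF frame numbers (n_f entries)
    """
    unmatched = set(ground_truth)
    n_corr = 0
    for p in sorted(predicted):
        # A prediction is correct if an unmatched annotated KF is within tol frames
        match = next((g for g in sorted(unmatched) if abs(p - g) <= tol), None)
        if match is not None:
            n_corr += 1
            unmatched.discard(match)
    precision = n_corr / len(predicted) if predicted else 0.0
    recall = n_corr / len(ground_truth) if ground_truth else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```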

Implementation Details
In our UFSKFE model, we utilize ResNet-101 [42] as the pre-trained model and extract deep features from the layer prior to the last fully connected layer. We conducted experiments with 12 different ImageNet models in order to justify selecting ResNet-101 for feature extraction; the performance of the different ImageNet models can be found in Appendix A. Frame image features are extracted on an NVIDIA RTX A6000 GPU with 48 GB of memory. The three hyperparameters are set at α = 0.4, R = 500, and O = 0.45. We also conduct a parameter analysis in Section 4.6. Since there are no existing KFE models designed for our problem, we compare several existing models with different frame image extraction methods. In Section 3.3.2, $\Phi$ refers to the feature extractor of an ImageNet model. We can also extract other features, such as SURF features [9,10], binary image features [43], and Sobel edge detection image features [44], and then calculate $d_{SURF}$, $d_{Binary}$, and $d_{Sobel}$. We replace the fusion distance $d$ with these distances in Algorithm 1 to predict KFs. The details of feature extraction can be found in Appendix A.1. Results in Table 2 are reported with the additional quality check.

Table 2 shows the performance on all 18 cow teat videos. Compared with all other baselines, our UFSKFE model achieves the highest average F score over all videos. Note that $d_{raw}^{crop}$ has the lowest F score; it only calculates the distance between each video frame and the support KF images within the small teat area and ignores other informative areas, e.g., the cow leg area. We find that the performance of extracted AlexNet [45] features and NASNetLarge [46] features is similar and lower than that of ResNet-101 [42] features. One reason is that AlexNet is not a high-performance ImageNet model, and its extracted features may focus on shallow characteristics; in comparison, the NASNetLarge model performs well on ImageNet, suggesting its extracted features may be tailored toward ImageNet image characteristics rather than our cow teat frames. The F score of the SURF features is lower than those of the AlexNet and NASNetLarge features, likely because SURF detects only a few important points while ignoring background features. $d_{raw}$ achieves the second-best results, which demonstrates that the raw images also contain important features that deep neural networks do not capture. The performance when using binarized images of the videos is similar to that of $d_{raw}$, since the binarized frames retain similar important features to the raw frame images. Both perform better than the Sobel edge detection images, likely because only edge features are analyzed in the latter.

In terms of computation time, deep feature extraction with ResNet-101 is faster than with all other models. Although our UFSKFE model takes longer, since it combines the feature extraction time of ResNet-101 and the raw images, the total average time to extract KFs is less than nine minutes, which is faster than extracting SURF or NASNetLarge features or computing $H_{SSIM}$ (details are shown in Appendix A.2). The SSIM similarity selection is not an efficient method, with computation times of more than 1.7 h per video. These extensive results demonstrate that our proposed UFSKFE model can quickly and accurately extract KFs. Figure 7 shows KF detection by our model on the GH060066 video using the fusion distance $d$. There are five true KFs (green dots), while our model detected six points as KFs (red dots).
Although there are differences between the green and red dots, those differences are within ±20 frames and are considered correct predictions. Figure 8 compares the detected KF images with the ground truth. In our UFSKFE model, only one wrong prediction (frame 2611) is detected. This is likely due to the milking apparatus still being attached to the dairy cow's teats and the low field of view; the quality check process does not remove this detected KF (its similarity score of 0.62 exceeds the threshold O). The other two methods, $d_{SURF}$ and $d_{Binary}$, also incorrectly identify this cow's images as a key frame. Compared with the predicted KF images of the other models, UFSKFE has a higher F score, and its frames are closer to the ground truth KF images.

Figure 8. Key frame comparisons of different methods on the GH060066 video, with correct and incorrect predictions marked. The number below each image is the video frame number, and the F score is reported for each method. UFSKFE achieves the highest F score.

Ablation Study
To demonstrate the effectiveness of the different components on the final F score, we conduct an ablation study of each component of our proposed UFSKFE model (Table 3) on four randomly selected videos (GH060066, GH030072, GH010066, and GH050066). Since the KF selection function $S$ is always required, we conduct the ablation study with $d_{raw}$, $d_{deep}$, and QC: $d_{raw}$ selects the KFs using the raw distance with $S$, $d_{deep}$ selects the KFs using the deep distance with $S$, and $d_{raw}$ + QC conducts a quality check after selecting the KFs with the raw distance. We find that the F score for the fusion distance $d$ is higher than when using $d_{raw}$ or $d_{deep}$ alone. The quality check process is also effective in improving the F score. Therefore, all proposed components are effective and important for this KFE task.

Parameter Analysis
There are three hyperparameters in our model: the weight balance factor α, the redundant frame number R, and the similarity threshold O. To determine the best parameters, we report the F score on three randomly selected videos while these hyperparameters are varied. α is selected from {0.1, 0.

Discussion of Results and Limitations of UFSKFE Model
Our UFSKFE achieves the highest average F score when compared with other methods. There are three possible reasons why this model performs well. First, the proposed unsupervised few-shot learning paradigm leverages knowledge from a few support KFs to all of the video frames. Second, our proposed fusion distance takes advantage of both raw and deep distances from support frames that represent a diverse range of possible key frames. Third, the quality check process acts as an effective method for removing noisy KF candidates, resulting in a substantially improved overall performance.
A limitation of our proposed UFSKFE model is that it cannot remove some incorrect KFs, primarily those images in which the milking apparatus remains attached to the dairy cow. Although our quality check can remove some of these images (Figure 6a), the removal becomes more challenging when the camera's field of view does not adequately image the cow's teats. This could be circumvented by positioning the camera in portrait rather than landscape mode when collecting the videos. In addition, we only extract one key frame per cow, whereas a clear view of each teat of the same cow may come from different frames; we could therefore consider extracting key frames for each teat. Exploring other methods for extracting features from video frames with few-shot learning may also be of value in our efforts to improve performance. Furthermore, our process could be performed in real time if we could directly store the recorded video on a cloud console.
Future work focuses on the use of other machine learning approaches to assess the extent of hyperkeratosis and the risk of mastitis.
The performance of our key frame extraction methodology may also be influenced by farm- and cow-related factors. With regard to the farm itself, the lighting conditions and the cleanliness of the farm, stalls, and parlors could affect the performance. In our case, the rotary parlor is housed inside a large complex, which mitigates the effects of weather, lighting, and other environmental factors that could affect the quality of the video data. Deviations from best milking practices, such as inconsistent cleaning of the teat ends, could similarly affect performance. The occurrence of key frames with the milking unit still attached to the cow will depend on the settings of the milking system (vacuum pressure and detachment), the parlor rotational speed, and the location of the camera. Finally, the performance of the key frame extraction method will depend on the size of the dataset. Our model was developed using only 32 labeled key frames. While additional labeled data could improve overall performance, our findings suggest that in the commercial dairy farm setting where such rotating parlors are used, key frames from only a very small fraction of the herd are necessary when using our automated key frame extraction technique.

Conclusions
In this paper, we propose a novel unsupervised few-shot learning key frame extraction model for cow teat videos. We combine the raw and deep distances between each video frame and the support key frame images to form a fusion distance that better captures the differences between each video frame and the support key frame images. An efficient key frame selection mechanism is proposed to first determine the key frame candidates, followed by a quality check procedure to refine the predicted key frames. Extensive experimental results demonstrate that the proposed UFSKFE model can accurately and efficiently extract the key cow teat frames. Our approach provides an opportunity to reduce the redundancy of processing large videos, and the extracted key teat-end frames can be collected to monitor the health status of dairy cows.

Appendix A
To calculate the deep distance, we need to extract deep features for video frames from pre-trained ImageNet models. To select the best ImageNet model for feature extraction from our cow teat videos, we conducted extensive experiments with 12 frequently used ImageNet models, using the layer prior to the last fully connected layer to extract deep features. These 12 ImageNet models are AlexNet [45], VGG16 [47], VGG19 [47], GoogLeNet [48], DenseNet-201 [49], ResNet-18 [42], ResNet-50 [42], ResNet-101 [42], Inception-V3 [50], Xception [51], InceptionResNet-V2 [52], and NASNetLarge [46].
We also utilize t-SNE [53] to visualize the extracted deep features in 2D space, as shown in Figure A1, but it is still difficult to select the best pre-trained ImageNet model from the visualization alone. We thus plot the projection loss from the high-dimensional space to the 2D space of the t-SNE model in Figure A2. We observe that ResNet-101 has the smallest projection loss among the 12 models, which suggests that ResNet-101 is a suitable ImageNet model for extracting deep features. However, the projection loss alone does not tell us whether ResNet-101 features perform better than other deep features on our key frame extraction problem. We thus report the performance of all 12 models in Table 2: we first calculate these deep distances and use the key frame selection $S$ to select the key frame candidates, and then the quality check is performed to remove noisy key frames. We find that the deep ResNet-101 distance indeed achieves a higher F score than the other models.

Figure A1. t-SNE visualization of the extracted features of 12 ImageNet models on the GH060066 video. Blue dots represent video frames, while green dots mark the key frame image positions.

Figure A2. t-SNE projection loss of the different ImageNet models. The y-axis denotes the projection loss from the high-dimensional space to the 2D space.
Analogous to Equations (2) and (3), the binary distance is defined as

$$M_{Binary}^{ik} = |B(s_k) - B(v_i)|_1, \qquad d_{Binary} = \min_r(M_{Binary}),$$

where $B$ refers to obtaining the binary image. In Figure A3d, we calculate the distance between the Sobel edge detection images of each video frame and the support key frame images. The Sobel distance is defined as

$$M_{Sobel}^{ik} = |E(s_k) - E(v_i)|_1, \qquad d_{Sobel} = \min_r(M_{Sobel}),$$

where $E$ refers to obtaining an edge detection image using the Sobel algorithm. After obtaining the SURF distance $d_{SURF}$, the binary distance $d_{Binary}$, and the Sobel distance $d_{Sobel}$, we use the key frame selection $S$ (with the fusion distance $d$ replaced by each of these three distances, respectively) and perform the quality check process to obtain the final extracted key frames.

We also visualize the extracted SURF features using t-SNE (Figure A4). These frame features (blue dots) are indistinguishable from the non-key frames, as was also the case in Figure A1. SURF features also have a higher projection loss (1.819) than the ImageNet models, which implies that the performance of SURF features might be lower than that of the ImageNet models. The average F score of SURF is 32.3 (Table 1 of the main paper), which is lower than that of most ImageNet models.

In Table 1 of the main paper, we also present the result of $H_{SSIM}$, which uses the SSIM similarity matrix to determine key frames. Specifically, we calculate the SSIM score between the cropped teat area of each video frame image and each support key frame image, and then select the highest score to detect the key frames before performing the quality check process. The SSIM similarity matrix is defined as

$$h_{SSIM}^{ik} = \mathrm{SSIM}\big(p(v_i),\, p(s_k)\big), \qquad H_{SSIM} = \max_r(h_{SSIM}),$$

where $p$ extracts the teat position area, and $\max_r$ returns the maximum of each row of the similarity matrix $h_{SSIM} \in \mathbb{R}^{n_v \times K}$. Hence, $H_{SSIM} \in \mathbb{R}^{n_v \times 1}$. We then use a partially modified key frame selection function $S'$ to determine the key frame candidates, as in Algorithm A1. There are two changes: first, the input is not the fusion distance $d$ but the SSIM similarity matrix $H_{SSIM}$; second, we sort $H_{SSIM}$ in descending order, since a more similar teat area is more likely to be a key frame. Figure A5 shows the process of detecting key frames with the new key frame selection function $S'$ on the GH060066 video using the similarity matrix $H_{SSIM}$.
Algorithm A1 Key frame selection mechanism ($S'$): identical to Algorithm 1, except that the input is the similarity matrix $H_{SSIM}$, which is sorted in descending order.

The $H_{SSIM}$ selection achieves a lower average F score (52.1). The reason is that the small teat area tends to ignore other important background features. The performance of the L2 norm (58.15) is also lower than that of the simple L1 norm (63.6 in Table 2). In addition, the F score of the raw L2 distance (55.2) is slightly lower than that of the raw L1 distance (55.4 in Table 2). We can conclude that all proposed strategies in our UFSKFE model are effective in improving the accuracy of key frame extraction in cow teat videos.