EnsemblePigDet: Ensemble Deep Learning for Accurate Pig Detection

Automated pig monitoring is important for smart pig farms; thus, several deep-learning-based pig monitoring techniques have been proposed recently. In applying automated pig monitoring techniques to real pig farms, however, practical issues such as detecting pigs in overexposed regions, caused by strong sunlight through a window, should be considered. Another practical issue in applying deep-learning-based techniques to a specific pig monitoring application is the annotation cost for pig data. In this study, we propose a method for managing these two practical issues. Using annotated data obtained from training images without overexposed regions, we first generated augmented data to reduce the effect of overexposure. Then, we trained YOLOv4 with both the annotated and augmented data and combined the test results from two YOLOv4 models at the bounding-box level to further improve the detection accuracy. We also propose accuracy metrics for pig detection in a closed pig pen that evaluate the detection accuracy without box-level annotation. Our experimental results with 216,000 “unseen” test images from overexposed regions in the same pig pen show that the proposed ensemble method can significantly improve the detection accuracy of the baseline YOLOv4, from 79.93% to 94.33%, at the cost of additional execution time.


Introduction
The health and well-being of group-housed pigs can be maintained by detecting or managing problems regarding their health and welfare in the early stages [1][2][3][4][5]. The reduction of practical problems (e.g., infectious diseases, hygiene deterioration) with individual pigs is essential, as pigs that roam around in an enclosed pen have a high possibility of being infected by diseases or developing stress [6]. However, the farm workforce is generally very small compared to the number of pigs. For example, at the pig farm from which the video monitoring data were obtained, each worker cared for more than 1000 pigs. It is nearly impossible for such a small workforce to manage so many pigs. Therefore, the main objective of this study was to identify the number of pigs in a pig pen and to help prevent deaths of individual pigs due to health and welfare problems by detecting irregularities.
Many studies have reported the use of monitoring techniques to solve problems in pig pens. It is important to detect individual pigs in each video frame to analyze their behavior, as object detection is the first process for various vision-based high-level analyses. While many researchers have reported the detection of pigs using typical learning and image processing techniques, the detection accuracy for highly occluded images may not be at an acceptable level. Recently, end-to-end deep learning techniques have been proposed for object detection, and various pig-detection methods based on deep learning (along with the typical learning and image processing techniques) have been reported [11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29][30]. YOLOv4 [31] is a recently released detector that can detect pigs with a good tradeoff between speed and accuracy.

Another practical issue in applying deep-learning-based techniques to pig detection is the annotation cost for large-scale training and test data. Traditional metrics, such as average precision (AP)/average recall (AR), which are widely used in COCO [32] and VOC [33], are employed to evaluate the detection accuracy of group-housed pigs. These require box-level annotation for each pig in an image. Because of this annotation cost (which is typically 5 min for one image of group-housed pigs), previous studies on pig detection have reported detection accuracy with a small number of test images.
For example, if one image takes about five minutes to annotate, then annotating the 216,000 test images would take about 18,000 h (roughly 2 years of continuous work), and annotating the 13,997 key frame images would take about 1166 h (roughly 1.5 months). For large-scale evaluation, accuracy metrics without box-level annotation are therefore needed.
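The annotation estimate above can be reproduced with a few lines of arithmetic. The sketch below only restates the figures quoted in the text (5 min per image, 216,000 raw frames, 13,997 key frames); the function name is illustrative and not part of the original work.

```python
MINUTES_PER_IMAGE = 5  # per-image annotation time quoted in the text


def annotation_hours(num_images, minutes_per_image=MINUTES_PER_IMAGE):
    """Total box-level annotation time in hours for num_images images."""
    return num_images * minutes_per_image / 60


for name, count in [("raw frames", 216_000), ("key frames", 13_997)]:
    hours = annotation_hours(count)
    # Days are computed for continuous (24 h/day) annotation work.
    print(f"{name}: {hours:,.0f} h, about {hours / 24:.0f} days of nonstop work")
```

Even the reduced key-frame set would require well over a month of uninterrupted annotation, which is the motivation for metrics that avoid box-level ground truth.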

This study proposes a performance testing method and performance metrics for comparing the accuracy of the proposed ensemble detector and the baseline detector on all 216,000 test frames, without box-level annotations. The number of test images is reduced by extracting key frames that contain movement, considering that pigs often lie down for a long time. In addition, this study proposes an accuracy measurement method that requires no additional annotation cost by modifying CorLoc [34], which is used in weakly supervised object detection (WSOD). The validity of the proposed metric can be verified by comparing the pig detection performance it reports on all video frames, without annotation, with the performance obtained by visually verifying the much smaller set of 13,997 key frame images. The contributions of the proposed method are summarized as follows:
• For real-time deployment, a deep-learning-based pig detector should handle unseen data. An ensemble-based pig detection method is proposed in this study to improve the detection accuracy in overexposed regions (as an example of unseen data), presumably for the first time. The detection of pigs from such overexposed regions is very challenging because the pixel distribution of regions exposed to strong sunlight through a window is different from that of other regions. Without using training data from overexposed regions, image preprocessing for diversity and a model ensemble with differently preprocessed images can robustly detect pigs from overexposed regions.
• Another practical issue in applying deep-learning-based techniques to pig detection is the annotation cost for large-scale data. Experimental results for pig detection with large-scale test data have not yet been reported because the box-level annotation cost for such data is very high. Accuracy metrics for pig detection in a closed pig pen are proposed to evaluate the accuracy of detection, without box-level annotation. Presumably, this is the first report of large-scale pig detection in a pig pen with 216,000 test images, without any box-level annotation. The results also indicate that the detection accuracy with 216,000 raw video frames is very similar to that with 13,997 key frames. Thus, reducing the number of test images using key frame extraction is effective in reducing both the evaluation cost (with very large test data) and the inference time (with the model ensemble).
This paper is organized as follows: Section 2 summarizes previous pig detection methods. Section 3 describes the proposed method to efficiently detect pigs using the model ensemble method. Section 4 explains the details of the experimental results, along with the new accuracy metric, and the paper is concluded in Section 5.

Background
The main objective of this study was to automatically monitor and analyze the behavior of an individual pig for 24 h using computer vision methods. Many studies have analyzed the behavior of group-housed pigs. For example, research on pig behavior analysis [7,8], weight measurement [9], environment control [10], pig detection [11][12][13][14][15], tracking [16][17][18], and segmentation [19] using image processing has been reported. In addition, pig detection [20][21][22][23][24], behavior analysis [25], and posture detection [27,28] using deep learning have also been reported. Seo, J. et al. [29] proposed a method to address processing time and accuracy when detecting individual pigs in a resource-limited embedded environment. Considering the high complexity of deep learning for individual pig detection, parallel pipeline processing and filter clustering methods were used to reduce the processing time so that the detector could run on an embedded system, and an image processing method was applied to recover the accuracy lost through filter clustering. Cowton, J. et al. [30] proposed a method of tracking individual pigs in various environments by combining deep learning detection and a tracking algorithm. More specifically, Faster RCNN and DeepSORT were used after organizing a dataset of 1646 images for individual tracking in a low-light environment. However, since 24 h individual pig monitoring depends on individual pig detection, there is a need for a deep learning method that improves individual pig detection accuracy under the special conditions of a pig pen environment. In other words, special conditions such as overexposed regions of the pig room caused by sunlight have to be considered to automatically monitor and analyze the behavior of an individual pig for 24 h. Our study focused on deep learning technology that improves individual pig detection accuracy.
This study proposes a method to improve accuracy by combining two models (CLAHE SFB and CLAHE ET) that consider the difficult overexposed regions in pig detection.
Infrared input images were used to detect individual pigs during the day and night. Figure 2 shows the results of deep learning (i.e., YOLOv4 [31]) during the day, night, and daytime with sunlight. During the day, pigs tend to move more actively than at night, as shown in Figure 2a. In contrast, the pigs tend to sleep at night, leading to less movement than during the day, as shown in Figure 2b. During both day and nighttime environments, deep learning methods can provide high detection accuracy for detecting individual pigs. Note that YOLOv4 [31] is a recently released detector that can detect pigs with a good tradeoff between speed and accuracy. However, in environments with strong sunlight, as shown in Figure 2c, the detection accuracy can be degraded by an overexposed environment.
As mentioned in the introduction, it is very difficult to generate training images from overexposed regions for different types of pig pens. Instead of this approach, the overexposed regions with gray pixel values higher than 240 were considered as occlusion (i.e., invalid data), and a method is proposed to improve the detection accuracy with valid data only (i.e., gray pixel values lower than 240). From the training images without gray pixel values higher than 240, two different parameters for image preprocessing were derived to maximize the diversity of the given images. Then, a model ensemble method was developed for combining the test results from the two YOLOv4 models in a bounding box level to further improve the detection accuracy.
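As a rough illustration of the valid/invalid split described above, the following sketch marks pixels above the gray level of 240 as overexposed. It is a minimal sketch under stated assumptions: the image is a plain list of rows of 8-bit gray values, the function names are hypothetical, and treating a pixel of exactly 240 as valid is an assumption because the text does not specify that boundary case.

```python
OVEREXPOSED_LEVEL = 240  # gray-level threshold stated in the text


def valid_pixel_mask(gray_image, level=OVEREXPOSED_LEVEL):
    """Return True for valid pixels and False for overexposed ones.

    Assumption: a pixel equal to `level` is treated as valid.
    """
    return [[pixel <= level for pixel in row] for row in gray_image]


def overexposed_fraction(gray_image, level=OVEREXPOSED_LEVEL):
    """Fraction of pixels regarded as occluded by overexposure."""
    mask = valid_pixel_mask(gray_image, level)
    total = sum(len(row) for row in mask)
    invalid = sum(1 for row in mask for ok in row if not ok)
    return invalid / total if total else 0.0


# Tiny 2 x 2 example: the right-hand column is saturated by sunlight.
frame = [[10, 250],
         [240, 255]]
print(valid_pixel_mask(frame))
print(overexposed_fraction(frame))
```

Such a mask could be used, for example, to check that a candidate training image contains no overexposed pixels before it is used for parameter selection.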
Previous studies on pig detection have reported detection accuracy with a small number of test images because of annotation cost. Table 1 shows the pig detection results with 100-1792 test images during the last 10 years.
For large-scale evaluation, accuracy metrics without box-level annotation are required. As the number of pigs in a closed pig pen is known, two accuracy metrics are proposed based on the number of pigs in the pen and the observed number of pigs in images. The effectiveness of the accuracy metrics was also evaluated by defining "key frames" (frames that are considered to have captured meaningful movements of pigs) and "hard frames" (key frames that have located pigs in overexposed regions, that is, frames where accurate detection of pigs is difficult due to overexposed regions). The accuracy metrics were then compared with all frames, key frames, and hard frames, without expensive box-level annotation costs.

Proposed Method
This study proposes a solution to the problem of decreasing accuracy caused by overexposed regions with strong sunlight by applying image preprocessing to images that are not exposed to strong sunlight. In addition, a model ensemble method of combining detection results using the detection boxes of two models to improve detection accuracy is proposed, along with a method for extracting key frames that have effective movement in a test video and new accuracy metrics that measure accuracy, without additional box-level annotation cost. The entire structure of the proposed method is shown in Figure 3.



Image Preprocessing
In this study, detection accuracy is considered to decrease due to occlusion when image regions exposed to strong sunlight have pixel values higher than 240. To solve this problem, image preprocessing on training data that does not include overexposed regions is proposed. It treats overexposed regions with pixel values higher than 240 as invalid regions and increases accuracy by improving image quality through image preprocessing in the remaining regions. This study proposes image preprocessing methods that combine contrast-limited adaptive histogram equalization (CLAHE) [35], a Gaussian filter [36], and a sharpening filter [36], organized into two preprocessing pipelines.
CLAHE divides the input image into small blocks of uniform size and equalizes the histogram of each block. The main parameter, TilesGridSize, determines the size of the blocks, and ClipLimit is the threshold used in the histogram equalization process: pixels that exceed the threshold are redistributed before the histogram is equalized. As TilesGridSize decreases, the local contrast between the object and background increases; as it increases, the overall features are strengthened, which emphasizes the object's texture. To find an effective parameter combination for the first model, the average pixel value in the ground truth box regions and the average pixel value outside the ground truth boxes were computed for each combination. The difference between these averages reflects the contrast between the object and background, and the parameter combination with the largest difference was chosen. For the second model, the entropy value, which is one of the metrics that shows the amount of information in an image, was calculated to find the parameter combination that highlights the texture intensity of an image. Table 2 shows the difference in average pixel value of the object and background (difference) and the entropy value (entropy) for each parameter combination. In this study, CLAHE SFB (i.e., the separation of foreground and background) denotes the CLAHE parameter combination used when the results of CLAHE image processing serve as data augmentation for the first model. This combination is deemed to separate the foreground from the background most effectively because the difference between the foreground and background pixel averages is the highest. On the other hand, CLAHE ET (i.e., enhancement of texture) denotes the parameter combination that is deemed to emphasize the texture of an image, as its entropy value is the largest (see Table 2).
A Gaussian filter was then applied to the result of CLAHE SFB to emphasize low-frequency features (image preprocessing A), whereas a sharpening filter was applied to the result of CLAHE ET to emphasize high-frequency features (image preprocessing B). Finally, "model A" was created by training YOLOv4 using training data to which image preprocessing A was applied, and "model B" was created by training YOLOv4 using training data to which image preprocessing B was applied.
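The two selection criteria used to pick the CLAHE parameters (foreground/background mean difference and entropy) can be sketched as follows. This is a minimal pure-Python sketch, not the authors' implementation: images are assumed to be lists of rows of gray values that have already been processed with each candidate parameter combination, boxes are assumed to be (x1, y1, x2, y2) tuples with exclusive right and bottom edges, and all function names are hypothetical.

```python
import math


def mean_difference(gray_image, boxes):
    """Absolute difference between mean gray level inside and outside the boxes."""
    inside, outside = [], []
    for y, row in enumerate(gray_image):
        for x, pixel in enumerate(row):
            in_box = any(x1 <= x < x2 and y1 <= y < y2 for x1, y1, x2, y2 in boxes)
            (inside if in_box else outside).append(pixel)
    if not inside or not outside:
        return 0.0
    return abs(sum(inside) / len(inside) - sum(outside) / len(outside))


def shannon_entropy(gray_image):
    """Shannon entropy (in bits) of the gray-level histogram."""
    counts, total = {}, 0
    for row in gray_image:
        for pixel in row:
            counts[pixel] = counts.get(pixel, 0) + 1
            total += 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pick_parameters(candidates, boxes):
    """candidates maps a parameter label to the image preprocessed with it.

    Returns (label with the largest mean difference, label with the largest entropy).
    """
    sfb = max(candidates, key=lambda k: mean_difference(candidates[k], boxes))
    et = max(candidates, key=lambda k: shannon_entropy(candidates[k]))
    return sfb, et
```

In practice the candidate images would come from a CLAHE implementation such as the one in OpenCV; the sketch deliberately leaves that step out and only shows how the two scores would be compared across parameter combinations.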

Model Ensemble with Two Models
Although the model ensemble can increase accuracy, it also increases the execution time. Therefore, this study proposes a model ensemble method using information from the detection box of two models at the post-processing level to increase detection accuracy. Initially, a union (AB_box) of the detection box set is acquired from model A and model B using two sets, A_box and B_box, which are sets of detection boxes from models A and B, respectively. After boxes with a lower confidence score for each set of boxes in A_box, B_box, and AB_box are removed, non-maximum suppression (NMS) is performed for each set. The thresholds for confidence and NMS are set differently with a set of model boxes, A_box and B_box, and a union set of model box AB_box, for diversity. For example, the boxes from set A_box and B_box can be removed aggressively (in Figure 4, Confidence1 and Threshold1 are set to high and low, respectively), whereas the boxes from set AB_box can be removed conservatively (in Figure 4, Confidence2 and Threshold2 are set to low and high, respectively). The effects of diverse combinations of thresholds are explained in Section 4. After the NMS process, the NMS results of A_box and AB_box are combined as A/AB_box using the box merging algorithm (explained later), and the NMS results of B_box and AB_box are combined as B/AB_box. Finally, the detection boxes of A/AB_box and B/AB_box are combined as final boxes, using the same box merging algorithm. Hence, this method merges the detection boxes produced from two models in two steps, with a union set of the model box, to maximize the effect of the model ensemble. The overall structure is shown in Figure 4.
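The first stage of this pipeline (confidence filtering followed by NMS on A_box, B_box, and AB_box with two different threshold pairs) can be sketched as below. This is an illustrative sketch only: detections are assumed to be (box, score) pairs with boxes given as (x1, y1, x2, y2), greedy NMS is assumed as the suppression scheme, and the function names and numeric thresholds in the usage example are hypothetical rather than the values used in the experiments.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(dets, conf_thresh, nms_thresh):
    """Drop detections below conf_thresh, then apply greedy NMS."""
    confident = sorted((d for d in dets if d[1] >= conf_thresh),
                       key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in confident:
        # Keep a box only if it does not overlap a stronger kept box too much.
        if all(iou(box, k[0]) <= nms_thresh for k in kept):
            kept.append((box, score))
    return kept


def candidate_sets(a_box, b_box, conf1, nms1, conf2, nms2):
    """Return the filtered A_box, B_box, and AB_box sets.

    (conf1, nms1) is the aggressive pair for the single-model sets;
    (conf2, nms2) is the conservative pair for the union set.
    """
    ab_box = a_box + b_box
    return (filter_and_nms(a_box, conf1, nms1),
            filter_and_nms(b_box, conf1, nms1),
            filter_and_nms(ab_box, conf2, nms2))


a_box = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.6), ((20, 20, 30, 30), 0.4)]
b_box = [((0, 0, 10, 10), 0.8), ((20, 20, 30, 30), 0.55)]
a_kept, b_kept, ab_kept = candidate_sets(a_box, b_box, 0.5, 0.3, 0.3, 0.7)
print(len(a_kept), len(b_kept), len(ab_kept))
```

With the aggressive pair, a high confidence cutoff and a low NMS overlap threshold remove more boxes from the single-model sets, while the conservative pair leaves more candidates in the union set for the later merging steps.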
The box merging algorithm proposed in this study assumes that the number of pigs in a closed pig pen (i.e., no_pigs) is known. For continuous pig monitoring applications, note that the number of pigs in a pen does not generally change for a long time (i.e., one month) except during some events (such as the death of a pig). Furthermore, the proposed algorithm can be modified by introducing thresholds to select the final boxes if the reasonable assumption cannot be satisfied. Another point to note is that the number of pigs (by manual inspection of each video frame) can be less than no_pigs because of possible occlusion. Therefore, the notation max_pigs was used in this study to represent the known number of pigs in a pen.
Each video frame is deemed to be correct when the number of bounding boxes detected by one of the two models matches the number of pigs that can be detected (i.e., no_pigs). If neither of the two models matches, each box produced by one model is paired with the box from the other model with which it has the largest intersection over union (IOU), and this pair is considered to be a matching box. If this IOU value is higher than iou_thresh, the value above which two boxes are deemed to be the same box (set to 0.7 in this work), the matching box is considered to have correctly detected the pig. Finally, among the remaining boxes from each model that were not chosen as correct boxes, those with the highest confidence values are added until the total number of correct boxes is equal to no_pigs. Hence, this method merges the detection boxes produced by the two models at the postprocessing level, using the number of pigs that can be detected within a pen. The proposed box-merging algorithm is summarized as Algorithm 1.
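The merging steps described above can be sketched in Python as follows. Because Algorithm 1 itself is not reproduced here, this is one possible reading of the prose rather than the authors' algorithm: detections are assumed to be (box, score) pairs, the higher-scoring copy of a matched pair is kept (an assumption), and the names merge_boxes and iou are hypothetical. Only no_pigs and iou_thresh come from the text.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_boxes(set_a, set_b, no_pigs, iou_thresh=0.7):
    """Merge two detection sets using the known number of pigs."""
    # Step 1: if either set already has exactly no_pigs boxes, accept it.
    for dets in (set_a, set_b):
        if len(dets) == no_pigs:
            return list(dets)

    # Step 2: boxes that agree across the two sets (IOU above iou_thresh)
    # are accepted as correct boxes.
    merged, leftovers, used_b = [], [], set()
    for box_a, score_a in set_a:
        best_j, best_iou = None, 0.0
        for j, (box_b, _) in enumerate(set_b):
            if j in used_b:
                continue
            value = iou(box_a, box_b)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j is not None and best_iou > iou_thresh:
            used_b.add(best_j)
            # Assumption: keep whichever copy has the higher confidence.
            merged.append(max((box_a, score_a), set_b[best_j], key=lambda d: d[1]))
        else:
            leftovers.append((box_a, score_a))
    leftovers += [d for j, d in enumerate(set_b) if j not in used_b]

    # Step 3: fill up with the most confident unmatched boxes.
    leftovers.sort(key=lambda d: d[1], reverse=True)
    merged.extend(leftovers[:max(0, no_pigs - len(merged))])
    return sorted(merged, key=lambda d: d[1], reverse=True)[:no_pigs]
```

One limitation of this sketch is that two unmatched boxes covering the same pig with an IOU below iou_thresh can both be selected in step 3; a fuller implementation would need an additional overlap check there.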

Key Frame Extraction and Accuracy Metrics
To measure the detection accuracy of the test dataset, an expensive "box-level" annotation to create ground truth is usually required. Therefore, a method to measure the detection accuracy at an inexpensive cost is necessary for continuous pig monitoring applications. Raw video frames have a lot of redundant information for pig monitoring because pigs sleep frequently for a long time. In this study, key frames that show significant changes in movement (hence, pigs that show movement) were chosen to reduce the test dataset.
After reducing the test dataset through key frame extraction, CorLoc [34] used in weakly supervised object detection (WSOD) was modified to measure the detection accuracy of key frames, under the assumption of knowing the number of pigs to be detected (max_pigs). This study proposes a method to measure the detection accuracy without the box-level annotation cost.
To extract key frames, the number of pigs that show movement in the current video frame is estimated by comparing the previous and current frames. YOLOv4 is used only to calculate the average size of the bounding boxes; note that it is applied to a sample frame from the training dataset, not to the input video data. The pixel difference between the previous and current frames is then computed to decide whether a pig is moving. Finally, the number of pigs that are moving is estimated, and the current frame is set as a key frame if at least one moving pig exists.
Initially, the object detector YOLOv4 is applied to the sample frame, and bounding boxes are acquired. Bounding boxes with a confidence value below a certain level (set to 0.7 in this study) are removed to reduce false positives, so that only boxes that can be trusted remain. The average detection box size S for the sample frame is then calculated. Following this, the number of pixels D for which the difference in pixel value is higher than a certain threshold (i.e., TH pixeldiff) is calculated for regions in individual detection boxes of the current frame. Subsequently, if the ratio of D to S × N is higher than a certain threshold (TH keyframe), where N is the number of pigs in the pig pen, the current frame is designated as a key frame. For this study, the threshold values needed to extract key frames were set to TH pixeldiff = 1 and TH keyframe = 1, and the algorithm to extract the key frames is shown as Algorithm 2.
After extracting the key frames, the detection accuracy is evaluated without box-level annotation. As explained, the maximum number of pigs that can be detected within an enclosed pig pen (i.e., max_pigs) is known. With max_pigs, the accuracy ACC max_pigs is defined for n test frames. For each test frame i, the number of detection boxes C i is compared with max_pigs. If they are equal, the frame is considered a correct frame (i.e., F i = 1; otherwise, F i = 0). Then, ACC max_pigs is computed as the ratio of the total number of correct frames w.r.t. max_pigs to the total number of test frames n.
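Returning to the key-frame rule (D / (S × N) compared with TH keyframe), a minimal sketch is given below. It makes several simplifying assumptions that are not in the original: frames are lists of rows of gray values, D is counted over the whole frame rather than only inside detection boxes, and the threshold values in the usage example are purely illustrative because the units of the reported thresholds are not stated. All function names are hypothetical.

```python
def average_box_area(boxes):
    """Average area S of trusted (x1, y1, x2, y2) boxes from the sample frame."""
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
    return sum(areas) / len(areas)


def changed_pixel_count(prev_frame, curr_frame, th_pixeldiff):
    """Number of pixels D whose absolute gray-level change exceeds th_pixeldiff."""
    return sum(1
               for prev_row, curr_row in zip(prev_frame, curr_frame)
               for p, c in zip(prev_row, curr_row)
               if abs(c - p) > th_pixeldiff)


def is_key_frame(prev_frame, curr_frame, avg_box_area, num_pigs,
                 th_pixeldiff, th_keyframe):
    """Designate the current frame as a key frame if D / (S * N) > th_keyframe."""
    d = changed_pixel_count(prev_frame, curr_frame, th_pixeldiff)
    return d / (avg_box_area * num_pigs) > th_keyframe


# Toy 2 x 4 frames: three pixels change between the two frames.
prev = [[0, 0, 0, 0], [0, 0, 0, 0]]
curr = [[5, 5, 0, 0], [5, 0, 0, 0]]
s = average_box_area([(0, 0, 2, 1), (0, 0, 2, 1)])
print(is_key_frame(prev, curr, s, num_pigs=1, th_pixeldiff=1, th_keyframe=1))
```

Because S is measured once on a sample frame, the per-frame cost of this test is a single frame difference, which is what makes key-frame extraction cheap relative to running the detector.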
However, even when the number of detection boxes is equal to max_pigs, there is a possibility of false-positive and/or false-negative errors. As explained, the number of pigs found by manual inspection (GT i) for each test frame i can be less than max_pigs because of possible occlusion. Therefore, with GT i, an accuracy ACC manual_inspection for n test frames is defined. For each test frame i, the number of detection boxes C i is compared to the value of GT i. If they are equal, the frame is considered a correct frame (i.e., M i = 1; otherwise, M i = 0). Then, ACC manual_inspection is computed as the ratio of the total number of correct frames for manual inspection to the total number of test frames n.
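Both count-based metrics reduce to the fraction of frames whose box count matches a reference count. The sketch below assumes the per-frame box counts C i and manual counts GT i are available as plain lists; the function names are hypothetical, and the counts in the usage example are made up for illustration.

```python
def acc_max_pigs(box_counts, max_pigs):
    """Fraction of frames whose number of detection boxes equals max_pigs."""
    correct = sum(1 for c in box_counts if c == max_pigs)  # sum of F_i
    return correct / len(box_counts)


def acc_manual_inspection(box_counts, manual_counts):
    """Fraction of frames whose box count equals the manually inspected count."""
    if len(box_counts) != len(manual_counts):
        raise ValueError("one manual count is required per frame")
    correct = sum(1 for c, gt in zip(box_counts, manual_counts) if c == gt)  # sum of M_i
    return correct / len(box_counts)


# Five illustrative frames in a pen with nine pigs; one pig is fully
# occluded in the third frame, so manual inspection counts eight there.
counts = [9, 9, 8, 9, 10]
manual = [9, 9, 8, 9, 9]
print(acc_max_pigs(counts, max_pigs=9))
print(acc_manual_inspection(counts, manual))
```

The third frame shows why the two metrics can differ: it is an error under ACC max_pigs but correct under ACC manual_inspection, since the detector found exactly the eight visible pigs.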
Because the cost of manual inspection is still high for a relatively large number of key frames, a key frame with at least one pig in overexposed regions is also defined as a hard frame, and ACC max_pigs and ACC manual_inspection are evaluated on these difficult frames. In other words, the detection accuracy for hard frames can represent the lower bound of the detection accuracy for key frames. In Section 4, ACC max_pigs is compared between raw video frames, key frames, and hard frames, and ACC manual_inspection is compared between key frames and hard frames to evaluate the effectiveness of ACC max_pigs and the accuracy relationship between key frames and hard frames.

Experimental Setup and Resources for the Experiment
For the purpose of comparison, individual pig detection experiments were conducted in the following environment: Intel Core i5-9400F 2.90 GHz (Intel, Santa Clara, CA, USA), NVIDIA GeForce RTX2080 Ti (NVIDIA, Santa Clara, CA, USA), 32 GB RAM, Ubuntu 16.04.2 LTS (Canonical Ltd., London, UK), and OpenCV 3.4 [36] for image processing.
The experiment was conducted in a 3.2 m tall, 2.0 m wide, and 4.9 m long pigsty at Chungbuk National University, and a low-cost Intel RealSense camera (D435 model, Intel, Santa Clara, CA, USA) [37] was installed on the ceiling to obtain the images. A total of nine pigs (Duroc × Landrace × Yorkshire) were raised in the pig pen, and the average initial body weight of each pig was (92.5 ± 5.9) kg. Color, infrared, and depth images were acquired using this camera, and each image had a resolution of 1280 × 720 at 30 frames per second (fps). Figure 5 shows a pig pen with a camera installed on the ceiling. To exclude the unnecessary region of the pig pen, the region of interest (RoI) was set to 608 × 288.
From the camera, 2904 training images were acquired, and image preprocessing A and B were applied with the basic image augmentation method (horizontal flip, vertical flip, horizontal/vertical flip). Following this, models A and B were trained (0.0001 for learning rate, 0.0005 for decay, 0.9 for momentum, Mish as the activation function, default anchor parameter, and 6000 for the iterations) to obtain EnsemblePigDet. Then, 216,000 test images were extracted from surveillance videos between 8:30 and 10:30 in the morning, when the pigs were most active. From these, 13,997 key frames were extracted, along with 4193 hard frames that were verified as being exposed to strong sunlight. The reported accuracy was the average of the five-fold cross-validation. The proposed method was implemented based on YOLOv4 [31]. With the COCO data set [32], YOLOv4 exhibited a better tradeoff between speed and accuracy than other detectors, and thus YOLOv4 was selected as the baseline. Table 3 shows the experimental results with YOLOv4 as the baseline, the single model with the proposed image preprocessing applied as augmentation, and the ensemble model that combines image preprocessing A and image preprocessing B.
The detection accuracy metric ACCmax_pigs counts a frame as an error frame when the number of detected boxes is not nine, given that the number of pigs that can be detected within the pen is nine in total (max_pigs = 9). If a falsely detected box and an omitted box are both present in a frame, the frame is not considered an error frame. Conversely, a frame is considered an error frame when a pig is completely occluded and only eight pigs can be visually identified.
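As a minimal sketch of this count-based metric, the function below reads ACCmax_pigs as the percentage of frames whose number of detected boxes equals max_pigs. The function name `acc_max_pigs`, its signature, and the sample counts are hypothetical and not taken from the original implementation.

```python
def acc_max_pigs(box_counts, max_pigs=9):
    """Percentage of frames whose detected-box count equals max_pigs.

    box_counts: one integer per frame (number of boxes the detector produced).
    """
    if not box_counts:
        raise ValueError("box_counts must not be empty")
    correct_frames = sum(1 for n in box_counts if n == max_pigs)
    return 100.0 * correct_frames / len(box_counts)

# Illustrative counts for five frames: three frames have exactly nine boxes.
counts = [9, 9, 8, 10, 9]
print(acc_max_pigs(counts))  # 60.0
```

No box-level annotation is needed for this computation, only the per-frame number of boxes and the known number of pigs in the pen.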
These are the current limitations of the proposed error frame metric; nevertheless, being able to estimate the accuracy on a large number of test data without visual verification holds special significance. As shown in Table 3, ACCmax_pigs is better for the proposed single model using image preprocessing than for the baseline YOLOv4, and the ensemble model shows the best result. In addition, the difference between the accuracy on the total frames and that on the key frames is minuscule. With the proposed method, the key frames amount to only about 7% of the total frames, yet a conclusion similar to that of the experiment on the total frames can be drawn from the key frames, which indicates that the total number of frames to be processed can be reduced effectively. Hence, instead of monitoring all the frames, the key frames extracted from a pig pen can be monitored to reduce the processing time and to offset the overhead of running two models in the ensemble. Key frames are decided dynamically based on movement; however, as only about 7% of all frames captured in the morning show pigs with significant movement, processing the key frames with the ensemble model is practically faster than processing all frames with a single model. In addition, experiments were conducted to validate how well the proposed method suppresses the decrease in accuracy caused by pigs being exposed to sunlight. Table 4 shows the results of the experiments on the extracted hard frames, which contain pigs exposed to strong sunlight. Compared with Table 3, the overall accuracy in Table 4 decreased for all of the models when exposed to strong sunlight, but the decrease is relatively small for the proposed single model and the ensemble model compared with the baseline YOLOv4.
Therefore, models trained on data processed with image preprocessing A and image preprocessing B are shown to be effective in the presence of strong sunlight, and this effect can be increased through the model ensemble. However, even if the number of detected boxes equals max_pigs, some of the detected boxes may be false detections. Therefore, the validity and accuracy of the results cannot be confirmed from the number of detections alone. Figure 6a shows a case in which the number of detection boxes equals max_pigs, but the frame is actually an error frame because of a falsely detected box. Furthermore, while the maximum number of pigs that could be detected was nine in the video used in this experiment, there were cases in which the number of visible pigs was not nine because of severe occlusion. Figure 6b shows a case in which the frame should be considered correct, because the number of detection boxes was equal to the actual number of pigs when visually checked, even though the number of detection boxes differed from max_pigs. Therefore, the actual detection results were checked visually on the extracted key frames and hard frames. If all test frames were checked visually instead, a total of 864,000 images would have to be inspected to acquire ACCmanual_inspection for the four models (216,000 images each). For this reason, the visual test was performed on the key frames and hard frames instead of the total frames. If the ACCmax_pigs and ACCmanual_inspection values on the key frames and hard frames are not significantly different, it can be inferred that they also do not differ significantly on the 216,000 total test frames.
Figure 6. Limitations of ACCmax_pigs when max_pigs = 9. (a) An "error" frame with one false-positive error and one false-negative error based on ground-truth (although it is regarded as a "correct" frame based on ACCmax_pigs); (b) a "correct" frame with no false positive/negative error based on ground-truth (although it is regarded as an "error" frame based on ACCmax_pigs).
Table 5 shows the result of actual counting with visual checking (hence, ACCmanual_inspection), which is considerably similar to the counting result using the number of boxes (ACCmax_pigs, also shown in Table 5). As explained in Figure 6, there are cases in which ACCmax_pigs does not consider a frame to be an error frame when it actually is, and cases in which it deems a frame to be an error frame when it is not. Overall, ACCmax_pigs and ACCmanual_inspection give similar results. Hence, if the test data are too large for box annotation of each individual pig, or if checking the frames visually is too difficult, ACCmax_pigs can serve as a practical performance metric representing the detection accuracy of the deep learning model when the number of pigs within a pig pen (max_pigs) is known.
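The two limitation cases of Figure 6 can be made concrete with a small sketch that compares the count-based decision with a ground-truth-based decision. The `FrameResult` class and both helper functions are hypothetical names introduced here; the per-frame tallies are illustrative values matching the two cases described in the caption, not measured data.

```python
from dataclasses import dataclass

MAX_PIGS = 9

@dataclass
class FrameResult:
    """Per-frame tallies as a human inspector would record them."""
    true_positives: int
    false_positives: int
    false_negatives: int

    @property
    def num_boxes(self):
        # Every detected box is either a true positive or a false positive.
        return self.true_positives + self.false_positives

def correct_by_count(frame, max_pigs=MAX_PIGS):
    """Decision used by the count-based metric: box count equals max_pigs."""
    return frame.num_boxes == max_pigs

def correct_by_inspection(frame):
    """Decision used by manual inspection: no false positive or negative."""
    return frame.false_positives == 0 and frame.false_negatives == 0

# Case (a): one false positive and one false negative cancel out in the count.
case_a = FrameResult(true_positives=8, false_positives=1, false_negatives=1)
# Case (b): one pig is fully occluded, and all eight visible pigs are detected.
case_b = FrameResult(true_positives=8, false_positives=0, false_negatives=0)

for name, frame in (("a", case_a), ("b", case_b)):
    print(name, correct_by_count(frame), correct_by_inspection(frame))
```

The two decisions disagree in opposite directions for the two cases, which is why such frames partly offset each other when the metrics are aggregated over many frames.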
In addition, unlike the general detection accuracy metrics AP (average precision) and AR (average recall), which are based on box annotation of individual pigs, ACCmax_pigs and ACCmanual_inspection count a frame as an error frame even if it contains only one error (for example, if eight pigs are correctly detected but one pig is omitted, the frame is considered to have an error). Thus, these metrics yield lower values overall than AP/AR.
To test the effect of image preprocessing and the model ensemble, a comparison experiment was performed, and Table 6 shows the results. For the single models, as shown in Table 6, a model whose training data included raw input showed higher accuracy than a model trained with image preprocessing applied. In addition, when image preprocessing A, image preprocessing B, and raw input were all used for training, dividing them into image preprocessing A + raw input and image preprocessing B + raw input showed higher accuracy. In the case of the model ensemble, the ensemble of two models that each had high accuracy as a single model showed the best accuracy.
Tables 7 and 8 show the results for combinations of Confidence1 and 2 and Threshold1 and 2 on the 13,997 key frames. Table 7 shows the result of setting Confidence1 and Threshold1 conservatively and Confidence2 and Threshold2 aggressively. As the Confidence1 and 2 values increased, the accuracy showed a decreasing trend. In addition, regardless of the values of Confidence1 and 2, the accuracy tended to decrease as the Threshold1 value increased. Table 8 shows the results of setting Confidence1 and Threshold1 aggressively and Confidence2 and Threshold2 conservatively. The results are similar to those in Table 7 in that the accuracy decreased as the Confidence1 value increased. Nevertheless, regardless of the confidence value, the accuracy increased when the Threshold1 and 2 values increased. The overall accuracy was shown to increase when Confidence1 and Threshold1 were set aggressively and Confidence2 and Threshold2 were set conservatively. When the ensemble model is compared with the single model in Tables 7 and 8, the accuracy was shown to increase. Note, however, that depending on the combination of confidence and threshold, the ensemble model can display lower accuracy than a single model.
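To give a feel for how confidence and overlap settings interact in a box-level ensemble, the sketch below merges the detections of two models with a greedy cross-model non-maximum suppression. This is a swapped-in, generic technique and not the merging rule of the proposed method: the parameters `conf_a`, `conf_b`, and `iou_thresh` are hypothetical and are not claimed to correspond to Confidence1/2 or Threshold1/2, and the boxes and scores are made-up values.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def merge_detections(dets_a, dets_b, conf_a=0.5, conf_b=0.5, iou_thresh=0.5):
    """Merge two lists of (box, score) detections (assumed scheme).

    1. Drop boxes below each model's confidence cutoff.
    2. Visit the remaining boxes in descending score order.
    3. Keep a box only if it does not overlap an already kept box
       by iou_thresh or more.
    """
    candidates = [d for d in dets_a if d[1] >= conf_a]
    candidates += [d for d in dets_b if d[1] >= conf_b]
    candidates.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in candidates:
        if all(iou(box, kept_box) < iou_thresh for kept_box, _ in kept):
            kept.append((box, score))
    return kept

# Model A finds pigs 1 and 2; model B finds pig 2 (slightly shifted) and pig 3.
model_a = [((0, 0, 10, 10), 0.90), ((20, 0, 30, 10), 0.80)]
model_b = [((21, 0, 31, 10), 0.85), ((40, 0, 50, 10), 0.70)]
merged = merge_detections(model_a, model_b)
print(len(merged))  # 3
```

Under this sketch, a pig missed by both models stays missed, and a false positive with a high confidence score survives the merge, so the result remains sensitive to the chosen cutoffs.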
Although many errors of the single models could be resolved by the ensemble model, some errors still remain. Figure 8 shows cases in which not all pigs within a frame could be detected, even after applying the model ensemble. In the case of Figure 8a, false negatives occurred in both models A and B for the same pig, so detection of all pigs within the frame failed even after merging the results with the model ensemble. In addition, as shown in Figure 8b, when a false negative occurred in model A and a false positive occurred in model B, the detection boxes could not be merged correctly. This was because the false positive of model B, which had a high confidence score, was accepted as a pig in place of the pig missed by model A, even after applying the model ensemble. To summarize, the detection results might not merge correctly, even after applying the model ensemble, in the following cases: the same pig not being detected by both models, a false positive having the highest confidence score, and a false negative and a false positive occurring at the same time.

Discussion
Although the proposed method could improve the accuracy of baseline YOLOv4 significantly for pig detection from unseen overexposed regions, the limitations of this study are as follows:

• While research on unseen data that include strong sunlight in the same farm was performed, the development of a more robust model (through semi-supervised or self-supervised learning) using unseen data from other farms might be necessary in future research. In addition, the remaining errors in each key frame could be resolved by exploiting the temporal information among the key frames; this issue could also be addressed in future research.

• As shown in Table 9, the accuracy improvement of EnsemblePigDet was strongly dependent on the accuracy of the baseline model used. Because all 13,997 key frames were considered difficult images due to overexposed regions, ACCmax_pigs of the light-weight model was significantly degraded. Note that EmbeddedPigDet [26] modified TinyYOLOv2 for embedded board implementations, and 13,962 key frames (of the 13,997 total key frames) typically produced one or two errors (i.e., missing and/or false pig errors) per key frame with EmbeddedPigDet. That is, EmbeddedPigDet, which targets embedded board implementations, cannot be used for a hard scenario including strong sunlight, and the accuracy improvement of EnsemblePigDet based on EmbeddedPigDet was limited. Ensemble techniques for light-weight baseline models need to be studied further.

• In this study, even though the fast and accurate YOLOv4 was applied, the ensemble model was slower than the single models: the total time on a PC for processing one input image was 29.84 ms (5.22 ms for two preprocessing executions, 24.24 ms for two YOLOv4 executions, and 0.38 ms for one postprocessing execution). However, when detection was applied to the key frames that captured movements through pig pen monitoring using the proposed method, an average 20-fold reduction in computational complexity was verified (13,997 key frames were extracted from 216,000 frames using the key frame extraction method). Therefore, with the RTX2080 Ti GPU, the detection speed for the video composed of key frames was 17 times faster than that for the raw video (i.e., 7 min were required for processing the 13,997 key frames obtained from the two-hour raw video); thus, the proposed method with key frames could be executed in real time even on an embedded board.
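The timing figures in the last item can be checked with simple arithmetic. The sketch below uses only numbers stated in the text (stage times, frame counts, and the 30 fps frame rate); the variable names are ours, and the speedup is computed relative to the two-hour duration of the raw video, which is our reading of the "17 times faster" statement.

```python
# Per-image stage times reported for the ensemble model (milliseconds).
PREPROCESS_MS = 5.22    # two preprocessing executions
YOLO_MS = 24.24         # two YOLOv4 executions
POSTPROCESS_MS = 0.38   # one postprocessing execution

TOTAL_FRAMES = 216_000  # test frames in the raw video
KEY_FRAMES = 13_997     # extracted key frames
FPS = 30                # camera frame rate

per_image_ms = PREPROCESS_MS + YOLO_MS + POSTPROCESS_MS
video_minutes = TOTAL_FRAMES / FPS / 60
key_frame_minutes = KEY_FRAMES * per_image_ms / 1000 / 60
key_frame_fraction = KEY_FRAMES / TOTAL_FRAMES
speedup_vs_video_length = video_minutes / key_frame_minutes

print(f"per image: {per_image_ms:.2f} ms")
print(f"raw video length: {video_minutes:.0f} min")
print(f"key-frame processing: {key_frame_minutes:.2f} min")
print(f"key-frame share: {100 * key_frame_fraction:.1f}%")
print(f"speedup vs. video length: {speedup_vs_video_length:.1f}x")
```

The results agree with the stated values of about 7 min of processing for a two-hour video and a key-frame share of roughly 7%.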

Conclusions
Automated pig monitoring is important for smart pig farms; thus, several deep-learning-based pig monitoring techniques have been proposed recently. In applying automated pig monitoring techniques to real pig farms, however, practical issues such as detecting pigs from overexposed regions, caused by strong sunlight through a window, should be considered. Another practical issue in applying deep-learning-based techniques to a specific pig monitoring application is the annotation cost for pig data.
In this study, a method for managing these two practical issues is proposed. Using annotated data obtained from training images without such overexposed regions, augmented data were generated to reduce the effect of the overexposed condition. Then, YOLOv4 was trained with the annotated as well as augmented data. The test results from two YOLOv4 models were combined in a bounding box level to further improve the detection accuracy. Finally, accuracy metrics were proposed for pig detection in a closed pig pen to evaluate its accuracy with no box-level annotation.
The experimental results with 216,000 "large-scale unseen" test data with overexposed regions in the same pig pen showed that the proposed ensemble method could significantly improve the detection accuracy of the baseline YOLOv4, from 79.93% to 94.33%. In addition, the accuracy for the 216,000 raw video frames was consistent with that for the 13,997 key frames, and the accuracy for the 4193 hard frames could provide a lower bound on the accuracy for the 13,997 key frames. One limitation of the proposed method is the increased execution time. Although the proposed method with key frame extraction could be executed in real time (i.e., 2 h of test data could be processed without any delay) even on an embedded board, a method for reducing the increased execution time needs to be studied further.