Depth-Based Detection of Standing-Pigs in Moving Noise Environments

In a surveillance camera environment, the detection of standing-pigs in real-time is an important issue towards the final goal of 24-h tracking of individual pigs. In this study, we focus on depth-based detection of standing-pigs with “moving noises”, which appear every night in a commercial pig farm, but have not been reported yet. We first apply a spatiotemporal interpolation technique to remove the moving noises occurring in the depth images. Then, we detect the standing-pigs by utilizing the undefined depth values around them. Our experimental results show that this method is effective for detecting standing-pigs at night, in terms of both cost-effectiveness (using a low-cost Kinect depth sensor) and accuracy (i.e., 94.47%), even with severe moving noises occluding up to half of an input depth image. Furthermore, without any time-consuming technique, the proposed method can be executed in real-time.


Introduction
The early detection of management problems related to health and welfare is an important aspect of caring for group-housed livestock. In particular, caring for individual animals is necessary to minimize the possible damage caused by infectious diseases or other health and welfare problems. However, it is almost impossible for individual animals to be cared for by a small number of farm workers who work on a large-scale livestock farm. For example, the pig farm from which we obtained video monitoring data in Korea had more than 2000 pigs per farm worker.
Several studies using surveillance techniques have recently been conducted to automatically monitor livestock, in what is known as "precision livestock farming" (PLF) [1]. Several attached sensors, such as accelerometers, gyro sensors, and radio frequency identification (RFID) tags, are used to automate the management of livestock farms in examples of PLF [2]. However, such approaches increase costs, and require additional manual labor for activities such as the attachment and detachment of sensors to and from individual animals by farm administrators. To circumvent this, studies have been conducted that analyze data from non-attached (i.e., non-invasive) sensors (such as cameras) [2][3][4][5]. In this study, we focus only on video-based pig monitoring applications [6].
In fact, video-based pig monitoring applications have been reported since 1990 [7,8]. However, because of the practical difficulties (e.g., light fluctuation, shadowing, cluttered background, varying floor status caused by urine/manure, etc.) presented by commercial farms, even the accurate detection of pigs in commercial environments has remained a challenging problem until now. To address these practical difficulties, it is reasonable to employ a topview-based depth sensor. However, the depth values obtained from a low-cost sensor such as Microsoft Kinect may not be accurate enough to classify a weaning pig as standing or lying. Furthermore, in many monitoring applications, the input video stream data needs to be processed in real-time for an online analysis.
In this study, we propose a low-cost, practical, and real-time method for detecting standing-pigs at night, with the final goal of achieving 24 h individual pig tracking in a commercial pig farm. In particular, caring for weaning pigs (25 days old) is the most important issue in pig management, because of their weak immunity. Therefore, we aim to develop a method for detecting standing-pigs in a pig pen during the one-month period after weaning (i.e., 25-55 days old). Compared with previous work, the contributions of the proposed method can be summarized as follows:
• Standing-pigs are detected at night (i.e., with the light turned off) with a low-cost depth camera. It is well known that most pigs sleep at night [44][45][46]. For the purpose of 24 h individual pig tracking, we only need to detect standing-pigs (i.e., we do not need to detect the majority of lying-pigs at night). Recently, low-cost depth cameras, such as Microsoft Kinect, have been released, and thus we can detect standing-pigs using depth information. However, a 20-kg weaning pig is much smaller than a 100-kg adult pig. Furthermore, the accuracy of the depth data measured from a topview Kinect degrades significantly, because depth values are covered only within a limited distance (e.g., a maximum range of 4.5 m) and field-of-view (e.g., 70.6 degrees horizontally and 60 degrees vertically). If we install a Kinect at 3.8 m above the floor to cover the entire area of a pen (i.e., 2.4 m × 2.7 m), thus minimizing the installation cost for a large-scale farm, then it is difficult to classify a weaning pig as standing or lying. To increase the accuracy, we consider the undefined depth values around standing-pigs.
• A practical issue caused by moving noises is resolved. For example, in a commercial pig farm with a harsh environment (i.e., disturbances from dust and dirt), there are many moving noises (i.e., undefined depth values varying across frames) at night. Because these moving noises occlude pigs (i.e., even up to half of a scene can be occluded by moving noises), we need to recover the depth values that are occluded by the moving noises. Because we utilize the undefined depth values around standing-pigs to increase the detection accuracy, we need to classify undefined depth values as useful ones (i.e., caused by standing-pigs) and useless ones (i.e., caused by moving noises). We apply spatial and temporal interpolation techniques to reduce the moving noises. In addition, we combine the detection results of standing-pigs from the interpolated images and the undefined depth values around standing-pigs to detect standing-pigs more accurately.
• A real-time solution is proposed. Detecting standing-pigs is a basic low-level vision task for intermediate-level vision tasks such as tracking and/or high-level vision tasks such as aggression analysis. To complete the entire set of vision tasks in real-time, we need to decrease the computational workload of the detection task. Without any time-consuming techniques to improve the accuracy of depth values, we can detect standing-pigs accurately at a processing speed of 494 frames per second (fps).
The remainder of this paper is structured as follows. Section 2 summarizes topview-based pig monitoring results, targeted for commercial farms. Section 3 describes the proposed method for detecting standing-pigs in various noise environments, including with moving noises. The experimental results are presented in Section 4, and conclusions are presented in Section 5.

Background
As explained in Section 1, the accurate detection of pigs in commercial environments has been a challenging problem since 1990, because of the practical difficulties (e.g., light fluctuation, shadowing, cluttered background, varying floor status caused by urine/manure, etc.) presented by commercial farms. Table 1 summarizes the topview-based pig monitoring results introduced recently. Two-dimensional gray-scale or color information has been used to detect a single pig in a pen or a specially built facility (i.e., in "constrained" environments) [9][10][11]. However, even with advanced techniques applied to 2D gray-scale or color information, it remains challenging to detect multiple pigs accurately in a "commercial" farm environment. For example, images from a gray-scale or RGB camera are affected by the varying illumination in a pig pen. Thus, a monitoring system based on a gray-scale or RGB camera cannot detect objects in low- to no-light conditions. Although some monitoring results at night have been reported using infrared cameras [34][35][36], problems caused by a cluttered background cannot be perfectly solved. Although some researchers have utilized a thermal camera to resolve the cluttered background problem [37], this is an expensive solution for large-scale farms.
To solve the cluttered background problem for 2D information, some researchers have utilized a stereo camera [38]. However, the accuracy measured from a stereo camera is far from a level at which 24 h individual pig tracking is possible, even with many pigs in a pen. Recently, low-cost depth cameras such as Kinect have been released. Compared with typical stereo-camera-based solutions, a Kinect can provide more accurate depth information at a much lower cost, without a heavy computational workload [39][40][41][42][43]. In principle, Kinect cameras can recognize whether pigs are lying or standing based on the depth data measured. However, a low-cost Kinect camera has a limited distance range (i.e., up to 4.5 m), and the accuracy of the depth data measured by a Kinect decreases quadratically as the distance increases [47]. Thus, the accuracy of the depth data measured by a Kinect degrades significantly when the distance between it and a pig is larger than 3.8 m. Furthermore, the slate-based floor of a pig pen generates many undefined depth values, because of the field-of-view of the installed Kinect. A further issue is that a greater number of undefined depth values appear at the top of a depth image (see Figure 1). Because of the ceiling structure of the pig pen in a commercial farm in which we installed a Kinect, the Kinect could not be installed at the center of the pig pen. Considering these difficulties, it is challenging to classify a 20-kg weaning pig as standing or lying using a Kinect camera installed 3.8 m above the floor. Figure 1 shows the limitations caused by the characteristics of the Kinect camera and the pig pen.
In this study, we further consider moving noises at night (see Figure 2). In a commercial farm, we observed many moving noises every night, and up to half of a scene was occluded by them. For 24 h individual pig tracking in a commercial pig farm, we need to resolve this type of practical problem. To the best of our knowledge, this is the first report on handling these types of moving noises obtained from a commercial pig farm at night through a Kinect.
A final comment regarding previous research concerns real-time monitoring. Although online monitoring applications should satisfy the real-time requirement, many previous results did not specify the processing speed, or could not satisfy the real-time requirement (see Table 1). By carefully balancing the tradeoff between the computational workload and accuracy, we propose a light-weight detection method with an acceptable accuracy for the final goal of achieving a real-time "complete" vision system, consisting of intermediate- and high-level vision tasks, in addition to low-level vision tasks.

Proposed Approach
We initially define the terms used in the proposed method, to enhance the readability. Table 2 explains the main terms for each process. To detect standing-pigs at night in a pig pen, it is desirable to utilize a depth sensor, such as a Kinect camera. This allows the sensor to gain depth information on pigs (i.e., the distance from a pig to the camera) without light influences, such as the light being turned on or off in a pig pen. However, because much dirt or dust may be generated at night in the pen, many moving noises appear in a video stream obtained from the depth sensor. These noises make it difficult to detect standing-pigs due to occlusions on them. Therefore, we propose a method to effectively remove the noises generated from dirt or dust in the video, and to precisely detect standing-pigs using undefined depth values (e.g., outlines) of standing-pigs. Figure 3 presents the overview of our detection method for standing-pigs at night.


Table 2. Main terms for each process.
Category: Types of images
Definition: Explanation
I_input: Depth input image
I_background: Background image
I_interpolate: Image to which spatiotemporal interpolation is applied
I_subtract: Image to which background subtraction is applied
I_candidate: Image of candidates detected
I_edge: Image of candidate edges
[…]: Image of outlines detected around standing-pigs
[…]: Image overlapped between […] and […]
[…]: Image to which dilation operator is applied
[…]: Image combining […] with […]
[…]: Result image of standing-pigs

Noise Removal and Outline Detection
Using depth values from a 3D Kinect camera, information on pigs can be obtained at night without a light in the pen. However, undefined depth values corresponding to moving noises (i.e., UDF_moving) emerge in this process owing to the dirt or dust generated by the pigs, and these disturb the accurate detection of pigs. To remove these noises, an interpolation technique using spatiotemporal information is applied to the input video.
Initially, an interpolation technique using a 2 × 2 window is applied to the current image in I_input, together with two consecutive images (i.e., using temporal information). As shown in Figure 4a, the 2 × 2 window provides the spatial information. The window moves within I_input and performs the interpolation on every pixel. The interpolation distinguishes three cases according to the pixel attributes in the window. In the first case, if two or more pixels in the 2 × 2 window have defined depth values (right of Figure 4a), then the interpolated pixel is their average. In the second case, if only one pixel in the window has a defined depth value (left of Figure 4a), then that pixel is used as the interpolated pixel. In the third case, if all pixels in the window are undefined (middle of Figure 4a), then the interpolated pixel is assigned an undefined depth value (i.e., a noise pixel). The three interpolated pixels obtained in this way, one from each image, are then merged into a definitive interpolated pixel by averaging them; undefined depth values are excluded from the average. I_interpolate is produced by collecting the interpolated pixels for all pixels of the input image. That is, UDF_moving can be removed by repeating the interpolation for all of the images in I_input.
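The three-case window rule and the temporal merge described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes 8-bit depth frames in which 255 marks an undefined pixel (the value given later in this section), it pads the bottom and right image borders with undefined pixels (the text does not specify border handling), and the function names are hypothetical.

```python
import numpy as np

UNDEFINED = 255  # value the text gives for undefined depth pixels


def average_defined(stack):
    """Average over axis 0, ignoring undefined pixels.

    A position with no defined pixel stays undefined."""
    defined = stack != UNDEFINED
    count = defined.sum(axis=0)
    total = np.where(defined, stack, 0.0).sum(axis=0)
    out = np.full(stack.shape[1:], float(UNDEFINED))
    np.divide(total, count, out=out, where=count > 0)
    return out


def spatial_interpolate(frame):
    """Apply the 2 x 2 window rule to every pixel of one depth frame."""
    h, w = frame.shape
    # Assumption: pad bottom/right with undefined pixels so that every
    # pixel has a full 2 x 2 window.
    padded = np.pad(frame, ((0, 1), (0, 1)), constant_values=UNDEFINED)
    window = np.stack([padded[dy:dy + h, dx:dx + w]
                       for dy in (0, 1) for dx in (0, 1)]).astype(np.float64)
    # Two or more defined pixels: their average. One defined pixel: that
    # pixel (it is its own average). None: undefined.
    return average_defined(window)


def spatiotemporal_interpolate(frames):
    """Merge the spatially interpolated pixels of three frames."""
    return average_defined(np.stack([spatial_interpolate(f) for f in frames]))
```

Calling `spatiotemporal_interpolate` on the current frame and its two companion frames yields one interpolated frame; which two frames accompany the current one (preceding or following) is left to the caller, since the text only says "two consecutive images".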
Although most UDF_moving areas move fast (see the bold boxes in Figure 4a), there are relatively slow-moving UDF_moving areas in certain consecutive images. In contrast with Figure 4b, some of these relatively slow UDF_moving areas are not entirely removed by applying the spatiotemporal interpolation once (see Figure 5b). This problem arises because the noises occupy the same coordinates in consecutive images, and thus the interpolated pixels at such coordinates are repeatedly calculated as undefined.
To resolve this problem, the remaining noises in I_interpolate are removed by applying the interpolation one more time. The pixel at the same coordinate in the preceding image is checked, and it is mapped into I_interpolate if it has a defined depth value. However, if that pixel also has an undefined depth value, the procedure is repeated until the value at that coordinate is no longer undefined. Figure 5 illustrates the problem and its solution for relatively slow-moving noises, which are entirely removed by applying the spatiotemporal interpolation technique one more time.
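The second pass can be read as a backward search through earlier frames. The sketch below is one possible reading, not the authors' code: it assumes that the preceding frames are supplied from the most recent to the oldest, that 255 marks an undefined pixel, and that a pixel with no defined value in any supplied frame simply stays undefined. The function name is hypothetical.

```python
import numpy as np

UNDEFINED = 255  # value the text gives for undefined depth pixels


def fill_from_preceding(interpolated, preceding_frames):
    """Fill the remaining undefined pixels from earlier frames.

    `preceding_frames` is ordered from the most recent frame to the oldest.
    """
    result = interpolated.copy()
    for previous in preceding_frames:
        remaining = result == UNDEFINED
        if not remaining.any():
            break  # nothing left to fill
        # Copy only where the current result is undefined and the earlier
        # frame has a defined value at the same coordinate.
        usable = remaining & (previous != UNDEFINED)
        result[usable] = previous[usable]
    return result
```

Bounding the search by the length of `preceding_frames` is a design choice of this sketch; it keeps the per-frame cost predictable, which matters for the real-time goal.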
Furthermore, depth values are not consistent for all pigs, owing to different growth rates. For example, even if all of the pigs in a pig pen are weaning pigs (25 days old), a well-grown pig may often be larger than the others. In the depth image, the larger weaning pig may appear to be a standing-pig when it is actually sitting on the floor. To resolve this difficulty, we exploit the UDF_outline values generated around standing-pigs. Because the distance between a weaning lying-pig and the floor is small, UDF_outline values are not observed around a lying-pig. However, even for weaning pigs, UDF_outline values are observed around standing-pigs. Figure 6 shows that standing-pigs have UDF_outline values, but lying-pigs do not. Note that Figure 6 displays both color and depth images at daytime, to verify that the undefined outlines are generated around standing-pigs only.
Therefore, UDF_outline can be used as beneficial information to detect standing-pigs, even though UDF_outline occurs in I_input because of the limitation of the Kinect camera. However, because UDF_outline areas have the same value as other undefined values (i.e., 255), they are also removed by the interpolation technique. Thus, it is necessary to distinguish between UDF_outline and other undefined values. To do so, we exploit the difference between the widths of UDF_outline and those of other undefined areas. For example, most areas with undefined values have widths that are greater than three, whereas UDF_outline areas have widths of less than two. These attributes help to accurately distinguish UDF_outline from the others. First, 3 × 3 neighboring pixel values are compared to confirm whether they are UDF_outline or not. Then, if the pixels contain fewer than two undefined values, they are regarded as UDF_outline. Figure 7 shows that fewer than two undefined values in I_input are detected as UDF_outline.
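The width-based test can be illustrated with a small sketch. The text leaves the exact 3 × 3 rule open, so the code below adopts one possible reading and should not be taken as the authors' rule: an undefined pixel is kept as an outline pixel when its undefined run is a single pixel wide either horizontally or vertically inside its 3 × 3 neighbourhood. The function name and the treatment of image borders (neighbours outside the image count as defined) are assumptions.

```python
import numpy as np

UNDEFINED = 255  # value the text gives for undefined depth pixels


def detect_thin_undefined(frame):
    """Mark undefined pixels that are one pixel wide in some direction."""
    undefined = frame == UNDEFINED
    # Assumption: pixels outside the image count as defined.
    p = np.pad(undefined, 1, constant_values=False)
    left, right = p[1:-1, :-2], p[1:-1, 2:]
    up, down = p[:-2, 1:-1], p[2:, 1:-1]
    thin_horizontally = ~left & ~right  # defined on both sides in the row
    thin_vertically = ~up & ~down       # defined above and below
    return undefined & (thin_horizontally | thin_vertically)
```

Wide undefined regions (for example, those caused by moving noises or by the floor) fail both tests and are left out of the returned mask.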

Detection of Standing-Pigs
After removing UDF_moving using the spatiotemporal interpolation technique, the depth values in I_interpolate are subtracted from I_background. Because the distance from each pig to the camera differs depending on the location of the pig, the depth values of pigs obtained from the Kinect camera need to be subtracted from I_background. Ideally, the depth values obtained from a location under the same condition should be consistent; however, the depth values obtained by a low-cost Kinect are not consistent. For example, for the same location, different depth values of 76, 112 and 96 are obtained as time progresses. To solve this inconsistency problem, I_background is generated carefully as follows. Initially, a depth video of the empty pen is acquired for ten minutes. Then, the spatial interpolation is applied to I_input to remove undefined values such as UDF_floor and UDF_limitation. Furthermore, we compute the most frequent depth value of each pixel in I_input over the ten minutes. However, for certain pixel locations within the floor, the resulting values may not be similar to those of adjacent pixels. To resolve this problem, we apply line-filling, which replaces such a value with the average of the adjacent values in the same row, in order to obtain I_background. Figure 8 shows the result of the background subtraction for depth values in I_interpolate.
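The background generation described above (per-pixel most frequent value, followed by line-filling) can be sketched as below. This is an illustrative sketch under stated assumptions: frames are 8-bit and already spatially interpolated, ties between equally frequent values resolve to the smaller value, and the `tolerance` parameter that decides when a value is "not similar" to its row neighbours is hypothetical, since the text gives no criterion. The function names are also hypothetical.

```python
import numpy as np


def most_frequent_per_pixel(frames):
    """Per-pixel mode over a stack of 8-bit depth frames, shape (t, h, w)."""
    frames = np.asarray(frames, dtype=np.uint8)
    t, h, w = frames.shape
    flat = frames.reshape(t, -1)
    # counts[v, i] = how often pixel i took the value v.
    counts = np.zeros((256, flat.shape[1]), dtype=np.int32)
    cols = np.arange(flat.shape[1])
    for frame in flat:
        counts[frame, cols] += 1
    return counts.argmax(axis=0).reshape(h, w).astype(np.uint8)


def line_fill(background, tolerance):
    """Replace row-wise outliers with the average of their row neighbours."""
    bg = background.astype(np.float64)
    out = bg.copy()
    left, centre, right = bg[:, :-2], bg[:, 1:-1], bg[:, 2:]
    # Assumption: a value is an outlier when it differs from both of its
    # row neighbours by more than `tolerance`.
    outlier = (np.abs(centre - left) > tolerance) & (np.abs(centre - right) > tolerance)
    out[:, 1:-1][outlier] = ((left + right) / 2.0)[outlier]
    return out
```

Because the background is computed once from the empty-pen recording, its cost does not affect the per-frame processing speed.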
From I_subtract, candidates for standing-pigs are detected by applying a thresholding technique to the depth values. By analyzing I_subtract images, we found that the depth values for standing- and lying-pigs have some overlapping ranges. If the depth values did not overlap, we could simply set a threshold to distinguish between standing- and lying-pigs. To resolve the overlapping problem, we instead generate standing-pig candidates I_candidate, and then verify these with the edge information I_edge from the candidates and the outline information UDF_outline for standing-pigs. First, we obtain I_candidate by setting a threshold to detect the candidates in I_subtract that may be considered as standing-pigs. In addition, the thresholding technique removes some undefined values resulting from limitations of the monitoring environment. That is, undefined values such as UDF_floor and UDF_limitation are removed through the thresholding technique. Figure 9 shows the candidates detected as standing-pigs, as well as the unnecessary undefined values removed through the thresholding in I_subtract.
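A minimal sketch of the subtraction and thresholding step follows. The threshold value is not given in the text, so `threshold` is a caller-supplied, hypothetical parameter, and the function name is an assumption. The sketch also shows one plausible way in which thresholding can discard undefined pixels: with 255 as the undefined value, the signed difference at such a pixel is negative and therefore falls below any positive threshold.

```python
import numpy as np


def detect_candidates(interpolated, background, threshold):
    """Return a boolean mask of standing-pig candidates."""
    # Use a signed/floating type so that the subtraction cannot wrap
    # around, as it would with unsigned 8-bit values.
    subtract = background.astype(np.float64) - interpolated.astype(np.float64)
    # A pixel is a candidate when it is at least `threshold` closer to the
    # camera than the background at the same location.
    return subtract >= threshold
```

Because the depth ranges of standing- and lying-pigs overlap, the mask returned here is only a set of candidates that still needs the verification described in the text.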
Based on both I_candidate and I_outline, if UDF_outline is applied to I_candidate, then standing-pigs in the pig pen can be identified more accurately. First, the candidates' edges (i.e., I_edge) are derived using a Canny operator. In fact, I_outline, explained in Section 3.1, includes not only UDF_outline but also other undefined values. To derive a more accurate set of UDF_outline, the candidates' edges in I_edge are overlapped onto I_outline (yielding I_overlap). Then, a dilation operator is applied to the candidates in I_candidate (yielding I_dilate), so that they can eventually be detected as standing-pigs using the more accurate UDF_outline in I_outline. Finally, the more accurate UDF_outline values in I_overlap are combined with I_dilate (yielding I_merge). In I_merge, standing-pigs are detected by calculating an overlapping ratio between the dilated candidates and the more accurate UDF_outline. In other words, if the boundaries of a dilated candidate overlap with the pixels of the more accurate UDF_outline by more than 50%, then the candidate is identified as a standing-pig in I_output. Figure 10 summarizes the procedures for detecting standing-pigs using both UDF_outline in I_outline and the candidates in I_candidate, and Figure 11 shows the detection result for standing-pigs in the pig pen.
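The 50% overlap test can be illustrated with a small sketch. It is only an approximation of the verification step under stated assumptions: pixels are represented as sets of (row, col) pairs, the "boundary" of a dilated candidate is taken to be the ring of pixels added by the dilation, the structuring element is a square, and the function names are hypothetical. The Canny edge step is omitted.

```python
def dilate(pixels, radius=1):
    """Binary dilation of a set of (row, col) pixels with a square element."""
    offsets = range(-radius, radius + 1)
    return {(r + dr, c + dc) for r, c in pixels for dr in offsets for dc in offsets}


def outline_overlap_ratio(candidate, outline, radius=1):
    """Fraction of the dilation ring that falls on undefined-outline pixels."""
    # Assumption: the boundary is the ring added by dilation. Image borders
    # are not clipped in this sketch.
    ring = dilate(candidate, radius) - candidate
    if not ring:
        return 0.0
    return len(ring & outline) / len(ring)


def is_standing(candidate, outline, min_ratio=0.5, radius=1):
    """A candidate is accepted only if the overlap exceeds min_ratio (50%)."""
    return outline_overlap_ratio(candidate, outline, radius) > min_ratio


candidate = {(1, 1)}                                        # one-pixel candidate
strong_outline = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 2)}   # 5 of 8 ring pixels
weak_outline = {(0, 0), (0, 1)}                             # 2 of 8 ring pixels
```

With these toy inputs, the first outline covers 62.5% of the ring and the candidate is accepted, whereas the second covers 25% and the candidate is rejected as noise or a lying pig.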
Figure 11. Results of standing-pigs detection from I_input to I_output.
Finally, the proposed method is summarized in Algorithm 1, given below.

Algorithm 1 Standing-pigs detection algorithm
Input: Depth image
Output: Detected image
Step 1:
    While moving noise remains:
        Apply spatiotemporal interpolation;
    Subtract I_interpolate from I_background;
Step 2:
    If width of undefined values ≤ 2:
        Determine as an outline;
    Else:
        Determine as noise and remove it from the area;
Step 3:
    If threshold1 ≤ subtracted pixel value ≤ threshold2:
        Determine as a candidate for standing-pigs;
    Else:
        Determine as noise and remove it from the area;
    Detect edges of candidates;
Step 4:
    Overlap I_edge onto I_outline;
    If outline and edge are on the same area:
        Determine as an outline;
    Else:
        Determine as noise and remove it from the area;
Step 5:
    Merge I_overlap with I_candidate;
    If candidate pigs touch outlines:
        Detect standing-pigs;
    Else:
        Determine as noise and remove it from the area;
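Step 2 of Algorithm 1 can be sketched for a single image row as below. This is a hedged illustration rather than the published procedure: it assumes that undefined depth is encoded as 0 and that the width of an undefined region is measured as a horizontal run length, neither of which is stated in this section. The names `UNDEFINED` and `classify_undefined_runs` are hypothetical.

```python
UNDEFINED = 0  # assumption: undefined depth values are encoded as 0


def classify_undefined_runs(row, max_outline_width=2):
    """Split undefined pixels of one row into outline and noise columns.

    Runs of undefined pixels no wider than max_outline_width are kept as
    outline; wider runs are treated as noise.
    """
    outline, noise = [], []
    c = 0
    while c < len(row):
        if row[c] != UNDEFINED:
            c += 1
            continue
        start = c
        while c < len(row) and row[c] == UNDEFINED:
            c += 1
        run = list(range(start, c))
        (outline if len(run) <= max_outline_width else noise).extend(run)
    return outline, noise


row = [50, 0, 0, 40, 0, 0, 0, 0, 45, 0]
outline_cols, noise_cols = classify_undefined_runs(row)
```

Here the runs at columns 1 to 2 and at column 9 are narrow enough to count as outline, while the four-pixel run at columns 4 to 7 is classified as noise.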

Experimental Environments and Dataset
In our experiment, the proposed method was evaluated on an Intel Core i7-7700K 4.20 GHz processor. In the pig pen, we simultaneously obtained color and depth videos of 13 weaning pigs (i.e., 25 days old) through the Kinect camera. The color video had a resolution of 960 × 540 at 30 frames per second (fps), while the depth video had a resolution of 512 × 424 at 30 fps.
As described in Section 3, it was impossible to detect standing-pigs in the color video, because the light in the pig pen was turned off at night. Therefore, we only exploited the depth video, which could be used to monitor pigs at night. We used 8 h of depth video, including daytime (07:00, 10:00, 13:00 and 16:00) and nighttime (01:00, 04:00, 19:00 and 22:00), from which we took 480 depth images (one image per minute). Because creating ground truth data was highly time-consuming, especially for nighttime images (i.e., when the light was turned off), we selected one image per minute as a representative image. We then applied the proposed method to all the images to detect standing-pigs in the pen.

Detection of Standing-Pigs under Moving Noise Environment
Before detecting standing-pigs in the pig pen, we removed moving noises using the spatiotemporal interpolation technique. As explained in Section 3.1, we sequentially exploited spatial information to remove the moving noises. Moreover, we used temporal information to remove certain problematic noises, such as relatively slow moving noises. Then, 480 I_interpolate images were obtained by applying the interpolation technique to 1440 I_input images. From I_interpolate, we obtained 480 I_subtract images by using background subtraction with I_background, and then obtained I_candidate by applying the thresholding technique to I_subtract.
For detecting the candidates, the defined depth values for standing- and lying-pigs in I_subtract were measured as 9-30 and 4-15, respectively. The two ranges overlap, so a lying-pig in the overlapping interval might be detected as a standing-pig. However, because our final goal is to implement a 24-h tracking system for pigs in the pen, it is not a serious problem to detect some lying-pigs as standing-pigs. Thus, we set threshold1 to 9, to detect all the standing-pigs without missing any. In addition, we set threshold2 to 30 to remove the remaining undefined values. That is, depth values of at least threshold1 and at most threshold2 were detected as candidates for standing-pigs (as in Step 3 of Algorithm 1), whereas depth values greater than threshold2 were treated as remaining undefined values and removed. Figure 12 shows how the detection of standing-pigs varies with threshold1. As shown in Figure 12c,d, all the standing-pigs could be detected by setting threshold1 to 9.
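The thresholding with the reported values can be written as a short sketch. The thresholds 9 and 30 come from the text; the inclusive band follows Step 3 of Algorithm 1, and the function name and list-of-lists image representation are assumptions for illustration only.

```python
THRESHOLD1 = 9   # lower bound of the measured standing-pig range (from the text)
THRESHOLD2 = 30  # upper bound; larger values are treated as remaining undefined values


def candidate_mask(subtracted, low=THRESHOLD1, high=THRESHOLD2):
    """Mark pixels whose background-subtracted depth lies in [low, high]."""
    return [[1 if low <= value <= high else 0 for value in row]
            for row in subtracted]


# Floor (3), range boundaries (9, 30), a value in the standing/lying
# overlap (12), and values above threshold2 (31, 200).
mask = candidate_mask([[3, 9, 12, 30, 31, 200]])
```

Note that the value 12 lies in the 9-15 interval shared by standing- and lying-pigs, so it is kept as a candidate; this is why the later outline-based verification is needed.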
To identify the standing-pigs among the detected candidates, UDF_outline in I_input was overlapped with the edges of the candidates. This was conducted to identify the more accurate UDF_outline of a standing-pig where the edges in a candidate region matched UDF_outline in I_input. If the candidates overlapped with the actual UDF_outline, then we finally identified the standing-pigs in these regions. Figure 13 displays the results for the detection of standing-pigs during the daytime and nighttime.

Evaluation of Detection Performance
To evaluate the detection performance of the proposed method, we compared the number of standing-pigs detected using our method with that of existing methods for object detection, which included the Otsu algorithm [49] (i.e., a well-known method for object detection) and YOLO9000 [50] (i.e., a recently-used method for object detection based on deep learning).
In the case of the Otsu algorithm, a background image was created by using the average and minimum values of each pixel in the input images acquired for ten minutes from the empty pig pen. Background subtraction was applied to the test images, and then the Otsu algorithm was performed. It is well known that the background subtraction method using the minimum value can detect typical foregrounds accurately with a Kinect camera [51]. However, as explained in Sections 2 and 3, there are many difficulties in detecting standing-pigs after weaning. Indeed, we confirmed that standing-pigs in the pen could not be detected at all, because the Otsu algorithm binarized the images into undefined regions and defined regions (such as pigs, floor, and side-walls).
In the case of YOLO9000, we generated a model using the training data, which consisted of 600 depth images. We set some parameters of YOLO9000 as follows: 0.001 for learning rate, 0.9 for momentum, 0.0005 for decay, leaky ReLU as the activation function, and 10,000 for the epoch. From each test image, YOLO9000 produced bounding boxes to represent standing-pigs, and the confidence score was calculated to measure the similarity between the training model and the bounding boxes produced from YOLO9000. This score was used to detect the target objects (i.e., standing-pigs) among the bounding boxes, by using a threshold in YOLO9000. We exploited the default threshold of 0.24 to detect standing-pigs in YOLO9000.
It is well known that YOLO9000 can detect typical foregrounds accurately in real-time [52]. However, YOLO9000 produced many false-positive and false-negative bounding boxes in detecting standing-pigs. Figure 14 displays the results of the standing-pigs detection for each method.
As shown in Figure 14, the Otsu method could not detect standing-pigs at all, and thus we did not compute its accuracy. The Otsu algorithm uses the histogram distribution of an input image to classify pixels into background and objects. However, in our case, the depth values of the background and the objects were similar, whereas the depth values of the noises differed from those of the objects. As a result, the Otsu algorithm binarized the background and the objects into the same group, and the pigs could not be detected. Meanwhile, YOLO9000 is a recent method for object detection. As YOLO9000 imitates the process in which the human brain receives visual information, it learns feature vectors optimized for the training samples by itself, and uses these to improve the performance of object classification. Therefore, we compared the detection accuracy of the proposed method with that of YOLO9000.
For the proposed method and YOLO9000, we calculated the detection accuracy for standing-pigs to compare the performance of each method. The detection accuracy was calculated for each method using the equation below:
$$\text{Accuracy} = \left(1 - \frac{FP + FN}{TP + FN}\right) \times 100 \quad (1)$$
where true positive (TP) is "standing-pigs" identified as "standing-pigs", true negative (TN) is "lying-pigs or noises" identified as "not standing-pigs", false positive (FP) is "lying-pigs or noises" identified as "standing-pigs", and false negative (FN) is "standing-pigs" identified as "lying-pigs or noises". In particular, for each standing-pig, if the detected result had more than 50% intersection-over-union (IoU) [53] with the ground truth, then it was regarded as TP. Otherwise, it was regarded as FN. In Equation (1), the denominator (i.e., TP + FN) represents the number of standing-pigs, and the numerator (i.e., FP + FN) represents the number of detection failures. That is, the accuracy reflects how many detections failed (lying-pigs or noises detected as standing-pigs, or standing-pigs missed) relative to the number of actual standing-pigs. Based on the experimental results, the detection accuracies for standing-pigs were measured as 94.47% (proposed method) and 86.25% (YOLO9000 method), as shown in Table 3. In Table 4, the number of undefined pixels means the average percentage of undefined pixels relative to the total number of pixels of I_input. Even when undefined pixels comprised more than 20% of the input image, it was possible to detect standing-pigs with a higher accuracy using the proposed method. Because we set threshold1 to 9, we could detect all the standing-pigs using the proposed method. As shown in Figure 14c,d, we could even detect standing-pigs occluded by moving noises, by applying the spatiotemporal interpolation. Furthermore, all the false standing-pigs detected were lying-pigs (having distance values that overlapped with those of standing-pigs).
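The evaluation rule can be sketched as follows. The accuracy function is reconstructed from the stated numerator (FP + FN) and denominator (TP + FN), so its exact form should be read as an assumption; the box format (x1, y1, x2, y2) and the function names are likewise hypothetical, and the counts in the example are made up rather than taken from the experiments.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0


def is_true_positive(detected_box, truth_box, min_iou=0.5):
    """A detection counts as TP only if its IoU with the ground truth exceeds 50%."""
    return iou(detected_box, truth_box) > min_iou


def detection_accuracy(tp, fp, fn):
    """Accuracy in percent: failures (FP + FN) relative to standing-pigs (TP + FN)."""
    return (1 - (fp + fn) / (tp + fn)) * 100


half_shifted = iou((0, 0, 10, 10), (5, 0, 15, 10))   # intersection 50, union 150
example_accuracy = detection_accuracy(tp=80, fp=10, fn=20)  # illustrative counts
```

Because false positives appear in the numerator but not in the denominator, this measure penalizes lying-pigs or noises reported as standing-pigs even when no standing-pig is missed.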
On the contrary, with YOLO9000, some standing-pigs were missed, and thus 24-h individual pig tracking might not be possible with this method. In addition, the false standing-pigs detected by YOLO9000 consisted of the floor or moving noises as well as lying-pigs (see Figure 14).

Table 3. Accuracy of standing-pig detection.

Method            Accuracy (%)
Proposed method   94.47
YOLO9000          86.25

Furthermore, we measured the execution time of each method, in order to confirm the real-time performance of standing-pig detection. As a result, the proposed method detected standing-pigs faster than YOLO9000. Table 5 presents the processing speed of each method for detecting standing-pigs. As explained in Section 1, our final goal is to develop a complete monitoring system, including both intermediate- and high-level vision tasks, in real-time. Considering the further procedures in both intermediate- and high-level vision tasks, the detection of standing-pigs needs to be executed as fast as possible. Without time-consuming techniques such as those in [54,55] (which require at least a few seconds to process a single depth image in order to improve inaccurate depth values), it is possible to develop a real-time pig monitoring system including both intermediate- and high-level vision tasks.

Table 5. Average processing speed for standing-pigs detection.

Method            Frames per Second
Proposed method   494.7
YOLO9000          87.0

Conclusions
The automatic detection of standing-pigs in a surveillance camera environment is an important issue for the efficient management of pig farms. However, standing-pigs could not be detected accurately at night on a commercial pig farm, even using a depth camera, owing to moving noises.
In this study, we focused on detecting standing-pigs in real-time in a moving noise environment to analyze individual pigs with the ultimate goal of 24-h continuous monitoring. That is, we proposed a method to detect standing-pigs at night without any time-consuming techniques. In the preprocessing step, the noise in the depth image was removed by applying a spatiotemporal interpolation technique, to alleviate the limitations of a low-cost depth camera such as Kinect. Then, we detected the standing-pigs by carefully generating a background image and then applying a background subtraction technique. In particular, we utilized undefined outline information (i.e., the undefined depth values around standing-pigs) to detect standing-pigs in a moving noise environment.
Based on the experimental results for 480 video images (including 1186 standing-pigs) over eight hours (i.e., obtained during 01:00-10:00 and 13:00-22:00 at intervals of three hours), we could correctly detect all 1186 standing-pigs in real-time (while the ground truth-based accuracy was 94.47%). In future work, we will use the infrared information obtained from a Kinect sensor to improve the detection accuracy further. In addition, we will consider the case of monitoring a large pig room by using multiple Kinect sensors. By extending this study, we will develop a real-time 24-h individual pig tracking system for the final goal of individual pig care.