Panoptic Segmentation of Individual Pigs for Posture Recognition

Behavioural research of pigs can be greatly simplified if automatic recognition systems are used. Systems based on computer vision in particular have the advantage that they allow an evaluation without affecting the normal behaviour of the animals. In recent years, methods based on deep learning have been introduced and have shown excellent results. Object and keypoint detectors have frequently been used to detect individual animals. Despite promising results, bounding boxes and sparse keypoints do not trace the contours of the animals, so a lot of information is lost. Therefore, this paper follows the relatively new approach of panoptic segmentation and aims at the pixel-accurate segmentation of individual pigs. A framework consisting of a neural network for semantic segmentation as well as different network heads and postprocessing methods is discussed. The method was tested on a data set of 1000 hand-labeled images created specifically for this experiment and achieves detection rates of around 95% (F1 score) despite disturbances such as occlusions and dirty lenses.


Introduction
There are many studies that show that the health and welfare of pigs in factory farming can be inferred from their behaviour. It is therefore extremely important to observe the behaviour of the animals in order to be able to intervene quickly if necessary. A good overview of the studies, the indicators found and the possibility of automated monitoring is provided by [1]. Similarly, there are studies in which the various environmental factors (housing, litter, enrichment) are examined and how these factors affect behaviour [2][3][4].
Observing the behaviour of the animals over long periods of time cannot be done manually, so automated and sensor-based systems are usually used. Classical ear tags or collars can be located in their position, but have the disadvantage that the transmitter cannot provide information about the orientation of the remaining parts of the animal's body. In addition, the sensor must be purchased and maintained for each individual animal. This is why computer vision is increasingly used, where the entire barn with all animals can be monitored with a few cameras. An overview of different applications with computer vision in the pig industry can be found in [5].
Based on 2D or 3D images, the position of the individual animals and their movements can be detected. From the positions alone, a lot of information can be extracted. By means of defined areas, the position can be used to identify e.g. food or water intake [6]. Furthermore, interactions and aggression between the animals can be detected if they touch each other in certain ways (mounting, chasing) [7][8][9]. The behaviour of the entire group can also be evaluated. Certain patterns when lying down can reveal information about the temperature in the barn [10]. Alternatively, the changes in position over time can be converted into an activity index [11] or a locomotion analysis [12].
Even though camera recording has many advantages due to its low-cost operation and non-invasiveness, the task of detecting animals reliably, even in poor lighting conditions and with contamination, is difficult. Previous work used classical image processing such as contrast enhancement and binary segmentation using thresholds or difference images to separate the animals from the background [6,9,13,14]. Later, the advantages of more sophisticated detection methods based on learned features or optimization procedures were presented [15,16]. With the recent advances in the field of deep learning, the detection of pigs with neural networks has also been addressed. Either the established object-detector networks were applied directly to the pigs, or the detections found were post-processed to visually separate touching pigs [17][18][19]. Although the detection rate with these object detection methods is very good, the resulting bounding boxes are suboptimal, because depending on the orientation of the animal, the bounding box may contain large areas of background or even parts of other animals (see Figure 1). Therefore, Psota et al. [20] proposed a method that avoids the use of bounding boxes and tries to directly detect the exact pose of the animal with keypoints on specific body parts (e.g. shoulder and back).
In this work we close the gap between the too large bounding boxes and the sparse keypoints and try to identify the animals' bodies down to pixel level. We believe that the exact body outlines can help to classify the animals' behaviour even better. The movement of individual animals can be depicted much better than with a bounding box and the body circumference resulting from the segmentation can also be used to draw conclusions about the size and weight of the animals.
The main contribution of this work is the presentation of a versatile framework for different segmentation tasks on pigs together with the corresponding metrics.
The remainder of this work is organized as follows. In Section 2 the basic concepts of object detection based on bounding boxes, pixel-level segmentation and key-points are listed. The proposed method is described in Section 3 followed by the evaluation in Section 4. The findings are discussed and concluded in Sections 5 and 6.

Background
In recent years, methods based on neural networks have gained enormous importance in the field of image processing. Based on the good results in classification tasks, adapted network architectures were soon presented that can also be used for the detection of objects [21,22]. The current generation of detection networks uses a combination of region proposals (bounding boxes) and classification parts, which evaluate the proposed regions [23][24][25]. With Deepmask [26,27] and Mask-RCNN [28], object detectors have even been presented that generate a pixel-level segmentation mask for each region found. Although these detectors provide very good results, the generated region proposals have the problem that only one object can be found at each position. This limitation is usually irrelevant, because in a projective image each pixel is assigned to exactly one object anyway and two objects at the same position cannot be seen. However, if two elongated objects overlap orthogonally, the centers of the objects may fall on the same pixel, which cannot be mapped by such a region proposal network.
Another area in which neural networks are very successful is (semantic) segmentation, in which each pixel is assigned a class (pixelwise classification) [29][30][31]. However, the classic semantic segmentation does not distinguish between individual objects, but only assigns a class to each pixel. In order to separate the individual objects, an instance segmentation must be performed. For this purpose, the semantic segmentations are extended, for example, such that the output of the network is position-sensitive in order to identify the object boundaries [32]. Another solution is to count and recognize the animals in a recursive way. For this purpose, one object after the other is segmented and stored until no more objects can be found [33,34]. Since the networks are designed to predict certain classes, the classes can also be chosen to help distinguish the instances. For example, Uhrig et al. [35] use the classes to encode the direction to the center of the corresponding object for each pixel. Since the direction to the center of the object is naturally different at object boundaries, the individual instances can be separated. An embedding can also be used to assign the pixels to individual instances. As described by De Brabandere et al. [36], a high-dimensional feature space is formed and for each pixel in the image the network predicts the position in this space. Via a discriminative loss, pixels belonging to the same object are pushed together in the embedding space and pixel clusters of different objects are pushed apart. With a subsequent clustering operation the instances in the embedding can then be separated.
Figure 2. Visualization of the different experiments presented in this work. The binary segmentation distinguishes only between foreground and background (b). A categorical segmentation can be used to separate the individual animals (c) or to classify body parts (e). Or the network is trained to directly tell the affiliation of the pixels to the individual animals (d).
Another approach for the segmentation of individual instances is the detection of certain key points, which are then meaningfully combined into the individual instances using skeleton models [37][38][39].
As described in the introduction, detection with bounding boxes and detection via key points has already been demonstrated on pigs. This work follows the relatively new definition of a panoptic segmentation [40] and aims at the pixel accurate segmentation of the individual pigs.

Proposed Method
The goal of the proposed method is a panoptic segmentation [40] of all pigs in images of a downward-facing camera mounted above the pen. Panoptic segmentation is defined as a combination of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance). The semantic segmentation part thus differentiates between the two classes background and pig, while the instance segmentation part is used to distinguish the individual pigs (see Figure 2b and 2d).
The proposed method for the panoptic segmentation is an extension of classical semantic segmentation. Therefore, in this paper the complexity of segmentation is increased step by step, resulting in four separate experiments. First a simple binary segmentation is tested (see Figure 2b). In the second experiment the individual animals are extracted from a semantic (or categorical) segmentation (see Figure 2c). The third experiment shows a pixel-precise instance segmentation based on a combination of the binary segmentation and a pixel embedding (see Figure 2d). In the last experiment the embedding is combined with a body part segmentation (see Figure 2e) for an additional orientation recognition.
Figure 3. Schematic representation of the proposed framework. The auto-encoder is a U-Net architecture (depiction adapted from [41]). The individual stages consist of several blocks, each with several layers. Scaling down or up is done between the stages. Skip connections are used to combine the results of the encoder and decoder stages. The network is equipped with different heads for the different experiments. The output is processed afterwards to yield the desired results.
All experiments are based on the same network architecture. Only the last layers are adjusted to get the required output. This way, the presented framework can be easily adapted to each of the experiments. An overview of the framework is given in Figure 3.

Representation of the pigs
In order to perform panoptic segmentation instance by instance and as accurately as possible, the manual annotation should contain such instance information with pixel accuracy. The choice of the annotation method always requires a trade-off between effort and accuracy: a pixel-accurate annotation is preferable, but also very costly. In contrast, bounding boxes can be drawn quickly, but would contain large background areas in addition to the marked pig, especially if the pig is standing diagonally to the image axes (see Figure 1c). Based on existing work [6,10,13], ellipses were therefore chosen as annotations. They are also very easy to draw (due to the two main axes) and adequately reproduce the pigs' bodies on the images of a downward-facing camera. Except for small mistakes (e.g. when the animal turns its head to the side), the pixels belonging to the individual animals can thus be easily captured. By aligning the ellipse (first main axis), the orientation of the animals is also stored. If animals overlap, the order in which the pixels in the label image are drawn must correspond to the reversed order of the animals along the camera's visual axis. This ensures that the pixels of the animals on top overwrite the pixels of the covered animals (see Figure 2c and 2d). Since the area of the ellipses correlates approximately with the volume of the animals, the ellipses have the further advantage of allowing conclusions to be drawn about the volume and thus the weight of the animals.
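The depth-ordered drawing of the label image can be illustrated with a small sketch. It is not the annotation tool used for the data set; the function names, the ellipse tuple layout (center, semi-axes, angle in radians) and the tiny image size are assumptions chosen for illustration only.

```python
import math

def point_in_ellipse(x, y, cx, cy, a, b, theta):
    """Return True if (x, y) lies inside the rotated ellipse."""
    dx, dy = x - cx, y - cy
    # Rotate the point into the ellipse's own coordinate frame.
    u = dx * math.cos(theta) + dy * math.sin(theta)
    v = -dx * math.sin(theta) + dy * math.cos(theta)
    return (u / a) ** 2 + (v / b) ** 2 <= 1.0

def draw_label_image(width, height, ellipses):
    """Paint instance ids; ellipses must be ordered bottom-most animal first."""
    label = [[0] * width for _ in range(height)]  # 0 = background
    for instance_id, (cx, cy, a, b, theta) in enumerate(ellipses, start=1):
        for y in range(height):
            for x in range(width):
                if point_in_ellipse(x, y, cx, cy, a, b, theta):
                    # Animals drawn later (lying on top) overwrite covered ones.
                    label[y][x] = instance_id
    return label

# Two overlapping animals: the second one lies on top of the first.
ellipses = [(6, 5, 5, 2, 0.0), (6, 5, 2, 4, 0.0)]
label = draw_label_image(12, 10, ellipses)
```

In the overlap region the label image carries the id of the upper animal, so every pixel belongs to exactly one instance.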

Network architecture
The typical network architecture for semantic segmentation consists of an encoder and a decoder part. The encoder transforms the input image into a low-dimensional representation, which the decoder then converts into the desired output representation (this combination is often also called auto-encoder). The encoder is structured similarly to a classification network whereas the decoder is a combination of layers symmetrical to the encoder, but with upsampling steps instead of the downsampling steps. To further improve the segmentation results, skip connections are often added. To make the information from the different resolution levels usable, these connections merge the intermediate results of the encoder with the corresponding upsampling levels in the decoder. Famous versions of such networks are for example U-Net [30] and LinkNet [42]. Another approach to use the information from the different downsampled layers from the encoder to obtain a dense segmentation in the original resolution is a feature pyramid network (FPN) [43]. Here the predictions are made at the different scales and merged afterwards.
In this work a U-Net was used as it gave the best results. In addition, its modular design allows the use of different classification architectures as encoders. Thus it is possible to benefit from the latest developments in this field. With ResNet34 [44] and Inception-ResNet-v2 [45], two established classification networks were used as encoder backbones. They both consist of single blocks that combine different convolution operations with a shortcut connection. With these shortcut connections, the optimizer does not have to learn the underlying mapping of the data, but simply a residual function [44]. The blocks are organized in different stages and each stage is followed by a downscaling. The decoder part imitates the stages but uses an upscaling layer instead of downscaling. Via the skip connections the stages of the encoder are connected to the stages of the decoder, where they are combined with the results of the preceding decoder stage (see Auto-encoder in Figure 3).
More details on the implemented architecture can be found in subsection 4.4. In subsection 4.6, an ablation study evaluates additional backbones and hyper parameters as well as the FPN architecture.

Binary segmentation
A binary segmentation is the basis for many of the classical approaches to pig detection [6,10,13,14]. At the same time, it is a comparably simple task for a neural network. Once solved, however, foreground segmentation can also be used to simplify more complex procedures, e.g. to apply them only to the important areas of the image (see subsection 3.5).
For the binary segmentation, the network learns which pixels belong to the pigs and which to the background. So for each pixel $x_i$ it predicts a probability $p(x_i)$ with which the pixel belongs to a pig (with the corresponding opposite probability $1 - p(x_i)$ the pixel belongs to the background). The training data consist of binary label images based on the manually annotated ellipses (see Figure 2b), where each pixel in the label image is a binary variable $y_i$, indicating whether the pixel belongs to the background (value 0) or to a pig (value 1).
The network is set up with the architecture described in section 3.2 but with only one output layer. The output has the same spatial dimension as the input, but with only one channel and a sigmoid activation function that generates the probability estimate for each pixel. The loss function is the cross-entropy loss, averaged over all $N$ pixels:
$$L_{bce} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p(x_i) + (1 - y_i) \log\left(1 - p(x_i)\right) \right]$$
During inference, the predicted probability values are thresholded to create the final binary segmentation.
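As a minimal sketch of the per-pixel cross-entropy and the subsequent thresholding, the following plain-Python functions operate on flattened lists of probabilities and labels. The function names and the clipping constant `eps` are illustrative assumptions; the threshold of 0.5 is the value used in the evaluation.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over all pixels (flattened lists)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)

def threshold_mask(probs, threshold=0.5):
    """Turn predicted pig probabilities into a binary foreground mask."""
    return [1 if p > threshold else 0 for p in probs]

loss = binary_cross_entropy([0.9, 0.2, 0.7], [1, 0, 1])
mask = threshold_mask([0.9, 0.2, 0.7])
```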

Categorical segmentation
In the second experiment a semantic or categorical segmentation is applied to be able to separate the individual instances. Based on the direction-based classes described in Uhrig et al. [35], the semantic segmentation is set up with the classes background, outer edge of an animal and inner core of an animal (see Figure 2c) to recognize the outer boundaries of the animals. In other words, it defines a distance-based classification which encodes the distance to the pig's center in discrete steps, where the inner-core area is just a scaled-down version of the original manually annotated ellipse. With these three classes, the training data are categorical label images with a one-hot vector $t_i$ at each pixel, indicating one positive class and two negative classes.
In the existing network architecture, only the last layer is adapted, such that the number of channels corresponds to the number of classes ($C = 3$) defined in the experiment. Since each pixel can only belong to one of the $C$ classes, the vector $x_i$ along the channel axis at each pixel location is interpreted as a probability distribution over the $C$ classes. Such a probability distribution can be generated with the softmax activation function on the output layer:
$$p_c(x_i) = \frac{e^{x_{i,c}}}{\sum_{j=1}^{C} e^{x_{i,j}}}$$
The loss function is the categorical cross-entropy loss over all $N$ pixels and the $C$ classes:
$$L_{cce} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} t_{i,c} \log p_c(x_i)$$
While in the binary segmentation the individual instances blend when they overlap, the centers of the animals and thus the individual instances can still be reconstructed with this method. A detailed description of the extraction process follows in subsection 4.3.
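The softmax activation and the categorical cross-entropy can be sketched in a few lines of plain Python. This is only an illustration on per-pixel lists of raw network outputs; the function names and the clipping constant `eps` are assumptions and do not reflect the implementation used in the experiments.

```python
import math

def softmax(logits):
    """Convert the C raw outputs of one pixel into class probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def categorical_cross_entropy(pixel_logits, one_hot_targets, eps=1e-7):
    """Mean categorical cross-entropy over all pixels."""
    total = 0.0
    for logits, target in zip(pixel_logits, one_hot_targets):
        probs = softmax(logits)
        total += sum(t * math.log(max(p, eps)) for p, t in zip(probs, target))
    return -total / len(pixel_logits)

# Two pixels, three classes: background, outer edge, inner core.
logits = [[2.0, 0.5, 0.1], [0.1, 0.3, 3.0]]
targets = [[1, 0, 0], [0, 0, 1]]
loss = categorical_cross_entropy(logits, targets)
```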

Instance segmentation
The categorical segmentation is a rather naive approach, in which the boundaries are meant to prevent the individual animals from blending together. In the third experiment, each pixel in the image is therefore assigned to a specific animal (or the background). For this task De Brabandere et al. [36] introduced a discriminative loss function which uses a high-dimensional feature space into which the pixels of the input image are projected (pixel embedding). The network learns to place the pixels belonging to one object in this space as close together as possible, while pixels belonging to other objects are placed as far away as possible (see Figure 4).
Figure 4. Illustration of the forces acting on the pixels to form the clusters (image adopted from [36]). With the variance term (yellow arrows) the pixels are drawn in the direction of the cluster mean (crosses). The distance term (red arrows) pushes the different clusters apart. Both forces are only active as long as the threshold values are not reached (inner circle for the cluster variance and outer circle for the distance).
The loss function is a weighted combination of three terms, which act based on the individual instances given by the annotated data:
1. Variance term: The variance term penalizes the spatial variance of the pixel embeddings belonging to the same instance. For all pixels that belong to the object (according to the annotated data), the mean is calculated and then the distance of every object pixel to this mean is evaluated. This forces the points in the feature space to cluster.
2. Distance term: The distance term keeps the calculated means of the clusters at a distance.
3. Regularization term: The regularization term keeps the expansion of all points in the feature space within limits and prevents them from drifting apart.
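Following the definition from [36], for each training example there are $C$ objects (or classes) to segment (the pigs plus the background). $N_c$ is the number of pixels covering object $c$ and $x_i$ is one pixel embedding in the feature space. For each object $c$ there is a mean $\mu_c$ of all its pixel embeddings, and $\|\cdot\|$ is the L1 norm.
In addition, the loss is hinged to be less constrained in the representation. The pixel embeddings of the objects do not need to converge to exactly one point but should reach a distance below a threshold $\delta_v$. In the same way, the distance between two different mean embeddings only needs to reach a margin determined by the threshold $\delta_d$ ($2\delta_d$ in the formulation of [36]). This is mapped with the hinge function $[x]_+ = \max(0, x)$. Following [36], the three terms can be formally defined as follows:
$$L_{var} = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} \left[ \|\mu_c - x_i\| - \delta_v \right]_+^2$$
$$L_{dist} = \frac{1}{C(C-1)} \sum_{c_A=1}^{C} \sum_{\substack{c_B=1 \\ c_B \neq c_A}}^{C} \left[ 2\delta_d - \|\mu_{c_A} - \mu_{c_B}\| \right]_+^2$$
$$L_{reg} = \frac{1}{C} \sum_{c=1}^{C} \|\mu_c\|$$
The final loss function $L$ with weights $\alpha$, $\beta$ and $\gamma$ is given as:
$$L = \alpha \cdot L_{var} + \beta \cdot L_{dist} + \gamma \cdot L_{reg}$$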
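To make the interplay of the three terms concrete, the following plain-Python sketch evaluates the discriminative loss for a handful of embedded pixels. It follows the formulation of [36], including the margin of $2\delta_d$ in the distance term, which is an assumption about the exact form used here. The function names are illustrative; the default thresholds and weights are the values reported in the implementation details.

```python
def l1_distance(a, b):
    """L1 distance between two embedding vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def discriminative_loss(embeddings, instance_ids, delta_v=0.1, delta_d=1.5,
                        alpha=1.0, beta=1.0, gamma=0.001):
    """Weighted sum of variance, distance and regularization terms."""
    clusters = {}
    for emb, inst in zip(embeddings, instance_ids):
        clusters.setdefault(inst, []).append(emb)
    means = {c: [sum(col) / len(pts) for col in zip(*pts)]
             for c, pts in clusters.items()}
    n_clusters = len(means)
    origin = [0.0] * len(embeddings[0])

    # Variance term: pull pixels towards their cluster mean (hinged at delta_v).
    l_var = sum(
        sum(max(0.0, l1_distance(means[c], x) - delta_v) ** 2 for x in pts) / len(pts)
        for c, pts in clusters.items()) / n_clusters

    # Distance term: push cluster means apart (hinged at 2 * delta_d, as in [36]).
    l_dist = 0.0
    if n_clusters > 1:
        for ca in means:
            for cb in means:
                if ca != cb:
                    gap = 2.0 * delta_d - l1_distance(means[ca], means[cb])
                    l_dist += max(0.0, gap) ** 2
        l_dist /= n_clusters * (n_clusters - 1)

    # Regularization term: keep the cluster means close to the origin.
    l_reg = sum(l1_distance(means[c], origin) for c in means) / n_clusters
    return alpha * l_var + beta * l_dist + gamma * l_reg

# Two well-separated, fully collapsed clusters: only the regularization remains.
loss = discriminative_loss([[0.0, 0.0], [0.0, 0.0], [4.0, 0.0], [4.0, 0.0]],
                           [1, 1, 2, 2])
```

Once the clusters are tight and far enough apart, the hinged variance and distance terms vanish and only the small regularization term contributes.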

Postprocessing
After the network has been used to create the pixel embedding on an input image, the individual instances must be extracted from it. De Brabandere et al. [36] propose the use of the mean-shift algorithm to identify cluster centers and afterwards assign all pixels belonging to the cluster (in terms of the $\delta_v$ threshold) to the same object. In this work the hierarchical clustering algorithm HDBSCAN [46] is used instead, as it shows improved performance in high-dimensional embedding spaces. HDBSCAN is a density-based hierarchical clustering and therefore well suited for the required clustering. It starts with a thinning of the non-dense areas. Then the dense areas are linked to a tree, which is converted into a hierarchy of linked components. Using the minimum cluster size parameter, a condensed cluster tree is created, and from this tree the final flat clusters are extracted.

Combined segmentation
Since each pixel is mapped in the embedding, there are many data points that have to be clustered. At normal HD camera resolutions, this quickly adds up to a million data points. To accelerate clustering, a combined solution of discriminating and binary segmentation was designed. With the binary segmentation, a mask is created that contains only the pixels that belong to the animals. Thus only those pixels are fed into the clustering process that are relevant for the differentiation of the individual animals. Figure 5 shows an example of the distribution of pixels in a two-dimensional embedding and the clustering applied to the binary segmentation.
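The masking step can be sketched as follows: only the embedding vectors of foreground pixels are collected for clustering, and the resulting cluster labels are written back to their image positions afterwards. The clustering itself is left out here; the function names, the background value of -1 and the fixed example labels are assumptions for illustration.

```python
def foreground_points(embedding, mask):
    """Collect embedding vectors and image coordinates of foreground pixels."""
    coords, points = [], []
    for y, row in enumerate(mask):
        for x, is_pig in enumerate(row):
            if is_pig:
                coords.append((y, x))
                points.append(embedding[y][x])
    return coords, points

def backproject(coords, cluster_labels, height, width, background=-1):
    """Write per-point cluster labels back into an image-shaped grid."""
    instance_map = [[background] * width for _ in range(height)]
    for (y, x), label in zip(coords, cluster_labels):
        instance_map[y][x] = label
    return instance_map

mask = [[0, 1, 1],
        [0, 0, 1]]
embedding = [[[0.0, 0.0], [0.1, 0.2], [0.1, 0.3]],
             [[0.0, 0.0], [0.0, 0.0], [2.0, 2.1]]]
coords, points = foreground_points(embedding, mask)
# The labels would come from the clustering step; fixed here for illustration.
instance_map = backproject(coords, [0, 0, 1], height=2, width=3)
```

Only three of the six pixels reach the clustering in this toy example; on real images the reduction depends on how much of the pen is covered by animals.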
The network architecture only needs to be adapted slightly, since the architectures of the two experiments only differ in the last layer. In order to generate both outputs simultaneously, the network is equipped with two heads, which generate the corresponding outputs from the outputs of the autoencoder. The two heads are trained with the appropriate loss functions and feed the gradient updates equally weighted into the auto-encoder network.

Orientation recognition
If the found pixel segmentation approximately equals an ellipse shape, fitting the final ellipses aligns them so that the major axis corresponds to the orientation of the animal. However, since ellipses are symmetric under a rotation by 180 degrees, the orientation of the animals can only be detected up to this 180 degree ambiguity. Since the correct orientation was captured during manual annotation, this ambiguity can also be resolved. For this, the combined method described in the previous section uses a categorical segmentation with the classes background, body and head instead of a binary segmentation (see Figure 2e). In the postprocessing the classes can then be used to determine the orientation of the animals as described in subsection 4.3.

Dataset
The data used in this work are images from a conventional piglet rearing house. Five cameras were installed, with each camera covering two 5.69 m² pens, each with a maximum of 13 animals. The animals were housed at the age of 27 days and remained in the facility for 40 days. The recordings of this dataset covered a period of four months. From all available videos 1000 frames with a resolution of 1280 × 800 pixels were randomly selected and manually annotated. The images from one of the five cameras were declared as a test set, so that the evaluation is based on images of pens that the network never saw during the training. The images of the remaining four cameras make up the training and validation set. The data sets contain normal color images from the daytime periods and night vision images with active infrared illumination from the night periods. In addition, the cameras occasionally switched to night vision mode during the day due to dirty sensors. In the evaluation, however, a distinction is only made between color images and active night vision, regardless of the time of day. An overview can be found in Table 1.
In Figure 6 some example images from the data set are shown. Some of the challenges of working in pigsties can be clearly seen. For one thing, the camera position cannot always be chosen optimally, so that occlusions cannot be avoided. Furthermore, the lighting and the natural incidence of light cannot be controlled, so that the exposure conditions are sometimes difficult. Finally, the cameras get dirty over time, resulting in disturbances and malfunctions (such as the erroneously active night vision).

Evaluation metrics
For the task of panoptic segmentation Kirillov et al. [40] also proposed a metric called panoptic quality (PQ). It is very similar to the well known F1 Score, but takes into account the special characteristic that each pixel can only be assigned to exactly one object. It first matches the predicted segments with the ground truth segments and afterwards calculates a score based on the matches.
Since each pixel can only be assigned to one object, the predicted segments cannot overlap. Therefore it can be shown that there can be at most one predicted segment for each ground truth segment with an intersection over union (IoU) of strictly greater than 0.5 [40]. Each ground truth segment for which there is such a matching predicted segment counts as a true positive (TP). Predicted segments that do not sufficiently overlap any ground truth segment count as false positives (FP) and uncovered ground truth segments count as false negatives (FN). For all the predicted segments $p$ and the ground truth segments $g$, the PQ is defined as:
$$PQ = \frac{\sum_{(p,g) \in TP} IoU(p, g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}$$
For better comparability with other work, the F1 Score, precision and recall are also evaluated in the experiments (see Subsection 4.5). F1, precision and recall are based on the same TP, FP, and FN as the PQ.
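The matching and scoring procedure can be sketched in plain Python if each segment is represented as a set of pixel coordinates. The function names and this set-based representation are assumptions for illustration; the matching rule (IoU strictly greater than 0.5) and the formulas follow the definitions above.

```python
def iou(a, b):
    """Intersection over union of two segments given as sets of pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def panoptic_scores(predicted, ground_truth):
    """Match segments with IoU > 0.5 and derive PQ, F1, precision, recall."""
    matched = set()
    iou_sum = 0.0
    for g in ground_truth:
        for j, p in enumerate(predicted):
            if j in matched:
                continue
            overlap = iou(p, g)
            if overlap > 0.5:  # unique match for non-overlapping segments
                matched.add(j)
                iou_sum += overlap
                break
    tp = len(matched)
    fp = len(predicted) - tp
    fn = len(ground_truth) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return {
        "PQ": iou_sum / denom if denom else 0.0,
        "F1": tp / denom if denom else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

gt = [{(0, 0), (0, 1), (0, 2), (0, 3)}, {(5, 5), (5, 6)}]
pred = [{(0, 0), (0, 1), (0, 2)}, {(9, 9)}]
scores = panoptic_scores(pred, gt)
```

In this toy example one animal is matched with an IoU of 0.75, one prediction is spurious and one animal is missed, so PQ is lower than F1 because it additionally weights the match by its segmentation quality.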

Ellipse extraction
To capture all pixels belonging to an animal in the manual annotation, the ellipses must be able to overlap (see Figure 7a). While the depth sorting described in subsection 3.1 ensures that each pixel is uniquely assigned to a single animal (see Figure 7c), the pixel-level segmentations cannot be compared directly to the originally annotated ground truth ellipses. If the animals overlap, the original ellipses and the found segmentations differ in size. To solve this issue and to generate comparable data, new ellipses were extracted from the label images by fitting ellipses to the segmentations (see Figures 7c and 7d). These new ground truth ellipses are then compared to the ellipses extracted from the segmentation output of the networks.
Depending on the experiment the ellipses are extracted differently from the predicted outputs of the network. For the categorical segmentation all pixels of the class inner core of an animal (see section 3.4) are searched first using a blob search. The individual separate blobs are then interpreted as individual animals. For this an ellipse is fitted to the segmented pixel with the algorithm of Fitzgibbon [47]. Since the core of an animal was generated from the scaled-down version of the manually annotated ellipse, the ellipse adapted from the blob can then simply be scaled up accordingly.
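The blob search on the inner-core class can be illustrated with a simple connected-component search. This sketch uses 4-connectivity and illustrative class ids (0 = background, 1 = outer edge, 2 = inner core), both of which are assumptions; the ellipse fitting and the subsequent upscaling are not shown.

```python
from collections import deque

def find_blobs(class_map, target_class):
    """Return connected pixel groups (4-connectivity) of the target class."""
    height, width = len(class_map), len(class_map[0])
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for y in range(height):
        for x in range(width):
            if class_map[y][x] != target_class or seen[y][x]:
                continue
            queue = deque([(y, x)])
            seen[y][x] = True
            blob = []
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and class_map[ny][nx] == target_class):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(blob)
    return blobs

# Two touching animals: the outer edges merge, the inner cores stay separate.
class_map = [
    [0, 1, 1, 1, 0, 1, 1, 1],
    [0, 1, 2, 1, 1, 1, 2, 1],
    [0, 1, 2, 1, 0, 1, 2, 1],
    [0, 1, 1, 1, 0, 1, 1, 1],
]
blobs = find_blobs(class_map, target_class=2)
```

Each returned blob corresponds to one animal candidate, to which an ellipse can then be fitted.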
When using the segmentation with the discriminative loss and the clustering, the ellipses can simply be fitted to the pixels of the individual clusters, after backprojecting the pixels from the embedding into image-space. As described in subsection 3.5, the binary mask of the combined approach is used thereby to process only the pixels that belong to the animals while masking out the background. If the orientation of the animals is also detected, the classes body and head can be combined to achieve the binary segmentation. Once the ellipses are fitted, the original categorical segmentation can be used to identify the side of the ellipse where the head was detected.
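One possible way to identify the head side of a fitted ellipse is to project the centroid of the head pixels onto the major axis and flip the angle by 180 degrees if the projection is negative. The sketch below assumes this projection rule, an angle in radians measured along the major axis, and (x, y) pixel coordinates; none of these details are specified in the text.

```python
import math

def resolve_orientation(center, theta, head_pixels):
    """Return the major-axis angle pointing towards the head (radians)."""
    if not head_pixels:
        return theta % (2.0 * math.pi)  # no head found: ambiguity remains
    cx, cy = center
    hx = sum(x for x, _ in head_pixels) / len(head_pixels)
    hy = sum(y for _, y in head_pixels) / len(head_pixels)
    # Project the head centroid onto the major-axis direction.
    projection = (hx - cx) * math.cos(theta) + (hy - cy) * math.sin(theta)
    if projection < 0:
        theta += math.pi  # head lies on the opposite side: flip by 180 degrees
    return theta % (2.0 * math.pi)

# The fitted axis points to the right, but the head was segmented on the left.
angle = resolve_orientation((10.0, 10.0), 0.0, [(3, 10), (4, 10), (3, 11)])
```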

Implementation Details
The network was implemented with the segmentation models library [41]. As described in subsection 3.2, a U-Net with ResNet34 and Inception-ResNet-v2 encoder backbones was used. The backbones were initialized with weights pretrained on ImageNet [48]. Four skip connections were added for both backbones, one after each major resolution reduction. In the case of ResNet34, these are placed after each of the four stages [44]. For Inception-ResNet-v2, two are placed directly in the Stem block, one after the ten repetitions of the Inception-A block and one after the 20 repetitions of the Inception-B block [45]. Exact details on the structure of the blocks in the encoder backbones can be found in the corresponding papers. The decoders are assembled from similar blocks, but instead of MaxPooling layers, they use UpSampling layers between blocks to reproduce the original resolution. For all experiments the Adam optimizer [49] with an initial learning rate of 1e-4 was used.
To speed up the calculation of the network and any subsequent clustering, the images were scaled down to a resolution of 640 × 512 pixels. Additionally, the training images were augmented during the training with the imgaug library [50] to achieve a better generalization. The augmentation included different distortions, affine transformations and color changes (e.g. grayscale to simulate active infrared illumination) and increased the number of training images by a factor of 10. For all the image-related pre- and post-processing tasks (such as the ellipse fitting) the OpenCV library [51] was used.
For the pixel embedding an eight-dimensional space was used. The thresholds in the discriminative loss in Equation 4 and 5 were set to $\delta_v = 0.1$ and $\delta_d = 1.5$. The weights in the final loss term in equation 6 were set to $\alpha = \beta = 1.0$ and $\gamma = 0.001$. The values were taken from the original paper [36], except for the threshold $\delta_v$, which was decreased to improve the density-based clustering. For the clustering the HDBSCAN implementation from McInnes et al. [52] was used with the minimal cluster size set to 100.

Evaluation
In order to evaluate the methods described in Section 3, they were all run on the test dataset. To investigate the influence of different backbones, all experiments were performed with both backbones. A distinction was also made between day and night vision images to test the robustness of the methods.

Binary segmentation
In binary segmentation, the network predicts a probability that a particular pixel belongs to a pig or the background. This probability is converted into a final decision using a threshold value of 0.5. The binary pixel values can then be compared with the ground-truth images using the Jaccard index. This gives the accuracy of the predictions as listed in Table 2. The ellipses cover the body of the animals only approximately (see subsection 3.1). Therefore, the network sometimes receives ambiguous information, where pixels that can be clearly recognized as background still have the label pig. The network produces mainly elliptical predictions, but the segmented areas also follow the body of the animals (see Figure 8b). Since the label images only contain undistorted ellipses, an accuracy of 100% is never achievable for the network.
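The Jaccard index of two binary masks is the number of pixels marked as pig in both masks divided by the number of pixels marked as pig in at least one of them. A minimal plain-Python sketch follows; the function name and the convention of returning 1.0 for two empty masks are assumptions.

```python
def jaccard_index(pred_mask, gt_mask):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    intersection = union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return intersection / union if union else 1.0

pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[0, 1, 1],
      [0, 1, 0]]
score = jaccard_index(pred, gt)
```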

Categorical segmentation
For the categorical segmentation the class inner core of an animal was set to 50% of the size of the ellipses (see Figure 8c). The results are shown in the upper part of Table 3. Besides the accuracy of the categorical segmentation (again measured with the Jaccard index), the extracted ellipses (see subsection 4.3) were now also compared to the manually annotated ellipses using the panoptic quality metric. F1 score, precision and recall are listed in detail in Table 6.
Table 3. Detection results for the ellipses extracted with the categorical segmentation and the combined segmentation. Regardless of the selected backbone, detection rates of about 95% (F1 Score) are achieved. For detailed information about precision and recall see Table 6. It is noticeable that with the combined segmentation approach the accuracy of the binary segmentation remains unaffected, although the segmentation head and the pixel-embedding head jointly influence the weights in the backbone. The experiments were carried out on all test images, and separately on the daylight (D) and night vision (N) images only.

Instance segmentation
For this experiment a combined network was trained to predict the association of each pixel with the individual animals in an eight-dimensional space together with the binary segmentation. The results are shown in the lower part of Table 3. F1 score, precision and recall are listed in the lower part of Table 6. It is important to note that the combined processing of pixel embedding and binary segmentation in a shared backbone does not affect the accuracy of the binary segmentation. Therefore, a synergy effect of the two tasks can be assumed.

Orientation recognition
For the orientation recognition, the same combined network as before was used, but the binary segmentation was replaced by the body part segmentation (see Figure 8d). The orientation of the ellipses is then reconstructed as described in subsection 4.3. To evaluate the accuracy of the orientation recognition, the orientation of all correctly identified pigs (true positives) was assessed over the complete test set. The results are summarized in Table 4. Although a categorical segmentation is now applied instead of the binary segmentation, a comparison with the values in Table 3 shows that the accuracy of the ellipse detection is not affected.

Ablation studies
The experiments conducted with the two encoder architectures of different complexity already suggest that the influence of the backbone is marginal. Nevertheless, additional experiments were carried out to confirm this assumption. To speed up the tests, the resolution of the input images was further reduced to 320x256 pixels. The results are summarized in Table 5.

Classification backbone
As described in subsection 3.2, the chosen U-Net architecture can be set up with different classification backbones. In addition to the classification backbones already introduced, the experiments were also carried out with the EfficientNet [53] backbone.

Network architecture
Although the U-Net architecture delivers good results, all three backbones were additionally evaluated with the FPN architecture (see subsection 3.2).

Clustering hyperparameters
To optimize the density-based clustering, the thresholds δ_v and δ_d of the discriminative loss are available as hyperparameters (see subsection 3.5). They control how tightly the pixels of a cluster are pulled together and how much distance different clusters have to keep from each other. Both thresholds were evaluated in a grid search, with the result that the exact values have no noticeable influence on the accuracy of the clustering. Values between 0.05 and 0.3 for δ_v and between 1 and 3 for δ_d all yielded approximately the same result.
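The role of the two thresholds can be made concrete with a minimal NumPy sketch of the pull and push terms of a discriminative loss. The sketch follows the commonly used hinge formulation, in which pixels farther than the variance threshold from their cluster center are pulled in and cluster centers closer than twice the distance threshold are pushed apart. Whether the loss in subsection 3.5 uses exactly this form (in particular the factor 2 and the averaging) is an assumption, as are the function name `discriminative_terms` and the two-dimensional toy embeddings, which stand in for the eight-dimensional embedding space.

```python
import numpy as np


def discriminative_terms(embeddings, instance_ids, delta_v=0.1, delta_d=1.5):
    """Return (pull term, push term) of a hinge-based discriminative loss.

    embeddings:   array of shape (n_pixels, n_dims), one embedding per pixel.
    instance_ids: array of shape (n_pixels,), the animal each pixel belongs to.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    instance_ids = np.asarray(instance_ids)
    ids = np.unique(instance_ids)
    centers = np.stack([embeddings[instance_ids == i].mean(axis=0) for i in ids])

    # Pull term: penalize pixels farther than delta_v from their own center.
    pull_per_cluster = []
    for center, i in zip(centers, ids):
        dist = np.linalg.norm(embeddings[instance_ids == i] - center, axis=1)
        pull_per_cluster.append(np.mean(np.maximum(0.0, dist - delta_v) ** 2))
    pull = float(np.mean(pull_per_cluster))

    # Push term: penalize pairs of centers closer than 2 * delta_d (assumed form).
    if len(ids) < 2:
        return pull, 0.0
    pairwise = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    off_diagonal = ~np.eye(len(ids), dtype=bool)
    push = float(np.mean(np.maximum(0.0, 2.0 * delta_d - pairwise[off_diagonal]) ** 2))
    return pull, push


# Toy example: two animals whose embeddings have each collapsed to one point.
ids = np.array([0, 0, 1, 1])
far_apart = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [5.0, 0.0]])
too_close = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]])

print(discriminative_terms(far_apart, ids))  # centers 5 apart: no push penalty
print(discriminative_terms(too_close, ids))  # centers 1 apart: push penalty
```

Once the clusters are compact and separated by more than the margin, both terms vanish for any thresholds in a broad range. This is one plausible reading of why the grid search showed little sensitivity to the exact values.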
In addition, there is the minimal cluster size, i.e. the minimum number of pixels that must belong to one pig. This parameter depends primarily on the resolution of the input images and can therefore be tuned as a hyperparameter only to a limited extent.
Figure 9. Example of the fragility of the categorical segmentation in case of strong overlaps. If the center of the animals is not visible, the segmentation cannot provide meaningful information about the hidden animal (b). The instance segmentation, on the other hand, does not have this problem (c).

Discussion
As shown in Table 3, the quality of the ellipses extracted with the categorical segmentation and with the combined approach is comparable on average. For more complex overlaps in particular, the categorical segmentation theoretically reaches its limits when the core part of the pigs is hardly visible (see Figure 9b). In such situations the pixel embedding should have shown its strengths, but these situations hardly seem to occur in the actual data set. Therefore, the network was not able to learn these cases and produces correspondingly poor results (see Figure 10c).
It is also interesting that the choice of the backbone and of the overall architecture has no noticeable influence on the results (see the ablation study in Table 5). The small size of the data set probably plays a role here. With so little data (even with augmentation), deep architectures like Inception-ResNet-v2 and EfficientNet cannot show their advantages over ResNet34.
The choice of the PQ as evaluation metric makes sense for the methods presented, since the exact evaluation of the Intersection over Union provides information about how precisely the pixel-accurate segmentation works. Unfortunately, this novel metric does not allow a direct comparison to other works. To allow at least a rough comparison, classical metrics like precision and recall are listed in Table 6. In the only paper with a publicly accessible data set [20], the authors report 91% precision and 67% recall for their test set. With our methods on our data set we achieve values around 95% for both metrics. However, it should be noted that although the test data in our data set comes from a different camera, the images in the test set do not differ fundamentally from the images in the training set. In [20], the images in the test set seem to deviate more from the training data. A direct comparison on the data set from [20] was unfortunately not possible because ellipses cannot easily be reconstructed from the given keypoints. For a comparison, their data set would have had to be annotated completely by hand with ellipses. To our knowledge, no other public data sets of pigs exist. In general, a correct evaluation is difficult because there is no defined set of rules for annotation. In [20], for example, the pigs that are in the field of view but not in the observed bay were not annotated. A network that recognizes these pigs anyway would be penalized with false positives here. Furthermore, there are also borderline cases in our data set where pigs are hardly visible but were still marked by the human annotators. If such pigs are not found because of sanity checks like a minimum pixel number in the clustering, or because of a poor segmentation, they are counted as false negatives (see Figure 10b). A publicly accessible data set with fixed annotation rules would be useful here in the future.

Conclusions
The methods shown here have achieved very good results on the data used and offer a pixel-accurate segmentation of the animals instead of bounding boxes or keypoints. As already described, the advantage over existing methods is that more information can be extracted from the segmentation. For example, conclusions can be drawn about the volume and thus the weight of the animals. Weight gain and other health factors can thus be determined and evaluated.
The ablation study has shown that all variants provide approximately the same results. From this it can be concluded that the data used so far do not contain enough variance for the networks to learn the errors and inconsistencies that occur. Enlarging the data set or enriching its variance would be the next step to check the generalization of the presented methods.