
Deep Instance Segmentation of Laboratory Animals in Thermal Images

Magdalena Mazur-Milecka, Tomasz Kocejko * and Jacek Ruminski
Department of Biomedical Engineering, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, 80-233 Gdansk, Poland
Authors to whom correspondence should be addressed.
Appl. Sci. 2020, 10(17), 5979;
Submission received: 9 July 2020 / Revised: 6 August 2020 / Accepted: 25 August 2020 / Published: 28 August 2020
(This article belongs to the Special Issue Machine Learning for Biomedical Application)


In this paper we focus on deep instance segmentation of laboratory rodents in thermal images. Thermal imaging is well suited to observing the behaviour of laboratory animals, especially in low light conditions. It is a non-intrusive method that allows monitoring of animal activity and, potentially, the observation of physiological changes expressed in dynamic thermal patterns. The analysis of the recorded sequence of thermal images requires smart algorithms for automatic processing of millions of thermal frames. Instance segmentation makes it possible to extract each animal from a frame and track its activity and thermal patterns. In this work, we adopted two instance segmentation algorithms, i.e., Mask R-CNN and TensorMask. Both methods were applied in different configurations to a set of thermal sequences, and both achieved high accuracy. The best results were obtained for the TensorMask model, initially pre-trained on visible light images and then trained on thermal images of rodents. The achieved mean average precision was above 90 percent, which shows that model pre-training on visible images can improve the results of thermal image segmentation.

1. Introduction

Laboratory animals’ behavior analysis is often used in studies of stress, anxiety, depression or neurodegenerative diseases [1]. The automation of this analysis undoubtedly has many advantages, among which objectivity and standardization are the most desirable. While position and motion are simple and commonly used parameters [2], action or behavior recognition is quite difficult to automate. Existing systems for laboratory animals’ behavior analysis are restricted by, for example, the number or the color of the objects [2]. The spectrum of behavior analysis performed by the available systems is usually limited to exploration, rest or grooming. All these behaviors are detected based on simple object parameters representing the position, speed, direction of movement or the shape of the observed rodent [3]. Objective detection and analysis of more complex behaviors would open up new possibilities. An important indicator of an animal’s health and well-being is its social behavior. It ensures survival of the species and is a specific characteristic of each individual. Furthermore, deviations and abnormalities in social actions may be related to stress, fear or illness [1]. Examples of aggressive behavior are an attack, a bite or aggressive grooming. Social behaviors are divided into three categories: aggressive, defensive and neutral [4]. However, gentle grooming can also be a form of defense, and it can be confused with another behavior, climbing. All these subtle differences make the classification of complex behaviors an extremely difficult task, mainly performed by human observers.
The presence of a human, who is after all a rodent’s natural enemy, probably adversely affects the results of behavior research [5]. This can be solved by using camcorders and analyzing the behavior from the recordings. An additional advantage of this solution is the ability to slow down or stop the action, which is certainly helpful during fast animal movements, e.g., fights. Nevertheless, the good lighting that cameras require is a very unfavorable and stressful factor for nocturnal animals such as rodents [6,7]. This is where technology offering alternative imaging comes to the rescue. Thermal cameras record the surface temperature and can work in limited light and even in complete darkness. In addition, they provide data on the distribution of the object’s surface temperature, which has proved useful in early disease diagnosis [8,9]. Changes in a rodent’s surface temperature can be an indicator of its health condition, behavioral changes, anxiety or stress [10,11]. In [12], the authors use the mouse surface temperature distribution as an object identifier to recover from tracking identity swaps after close contact.
The automatic analysis of rodents from video recordings requires the identification and tracking of each animal in a scene. In visual animal tracking systems, the distinction between the object and the background is most often based on differences between frames, color thresholding or color matching [13]. A clear temperature difference in thermal imaging greatly simplifies foreground-background segmentation; usually, Otsu’s thresholding supported by morphology is enough [2,14]. Figure 1b demonstrates the result of Otsu’s thresholding on a thermal image. In a cluttered environment there is also a need to segment the object from other parts of the foreground or to distinguish individuals from each other. For small contacts between the animals’ bodies, methods such as the marker-controlled watershed segmentation technique are sufficient [15]. However, segmenting objects that overlap significantly is not easy, and most image analysis methods fail. Figure 1c shows the results of rodent segmentation using the watershed technique with markers built from the low-gradient regions. There are no clear differences between the images of the two objects, and no border between them is visible. For these reasons, a significant number of systems analyze only one individual at a time [2,13]. Correct segmentation between objects is therefore necessary for further analysis.
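As an illustration of how little machinery foreground extraction needs in thermal imaging, the sketch below implements Otsu's threshold from its between-class-variance definition using NumPy only. The synthetic "frame" and its temperature values are invented for the example, not taken from the dataset:

```python
import numpy as np

def otsu_threshold(img, nbins=256):
    """Return the threshold that maximizes between-class variance (Otsu)."""
    hist, bin_edges = np.histogram(img, bins=nbins)
    hist = hist.astype(float) / hist.sum()
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    w0 = np.cumsum(hist)                 # cumulative background class weight
    w1 = 1.0 - w0                        # foreground class weight
    mu0 = np.cumsum(hist * centers)      # cumulative (weight * mean) of background
    mu_total = mu0[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        # between-class variance: w0*w1*(m0 - m1)^2
        between = w0 * w1 * ((mu0 / w0) - ((mu_total - mu0) / w1)) ** 2
    between = np.nan_to_num(between)
    return centers[np.argmax(between)]

# Synthetic "thermal" frame: cool background around 60, a warm blob around 180.
rng = np.random.default_rng(0)
frame = rng.normal(60, 5, (120, 160))
frame[40:80, 60:110] = rng.normal(180, 5, (40, 50))
t = otsu_threshold(frame)
mask = frame > t   # foreground (animal) mask
```

In practice the binary mask would then be cleaned with morphological opening/closing, as the systems cited above do.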
Many segmentation techniques have been proposed in the literature. However, methods based on deep learning have recently been found to provide the best segmentation results in many computer vision tasks. Deep learning segmentation techniques can be divided into two approaches: (i) semantic segmentation, a pixel-classification-based approach, and (ii) instance segmentation, an object detection and classification-based approach.
Segmentation using (i) does not discriminate between instances. In [16] we used popular semantic segmentation methods, U-Net [17] and V-Net [18], to separate two objects. The algorithms correctly segmented animals in close contact, but they were not able to draw the boundary between the objects when their bodies overlapped. Figure 1d shows the result of the U-Net architecture trained on 200 images for five epochs of 2000 steps each. The animals are not separated and are recognized together as one object.
In (ii) each object is detected, represented as a separate segment and labeled with the class name. Instance segmentation is one of the most challenging computer vision tasks. It was introduced by Hariharan [19] and then popularized by COCO [20]. Instance segmentation algorithms classify and localize each object’s bounding box while also precisely segmenting each instance. Both of these subtasks can be performed as a single- or two-stage detection process.

1.1. Two-Stage Methods

Most two-stage methods [21,22,23,24] treat segmentation as an extension of object detection. They follow the dominant paradigm for instance segmentation: detect-then-segment. First, an object is detected with a box, and then each object is segmented using the box as a guide.
Mask R-CNN [23] is a state-of-the-art two-stage method of instance segmentation. It extends the Fast [25] and Faster R-CNN [26] methods by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition [23]. This method detects object bounding boxes, and then crops and segments them to find the objects.
Many approaches based on Mask R-CNN have also dominated the leaderboards of recent segmentation challenges.
In 2019, the Mask Scoring R-CNN algorithm was introduced [24]. This method focuses on scoring the predicted instance masks: the model learns an Intersection-over-Union (IoU) score for each mask and improves segmentation results by rewarding more accurate predictions during COCO AP evaluation.
The method named PANet presented in [27] improves Mask R-CNN in three steps by:
  • creating bottom-up path augmentation,
  • adaptive feature pooling,
  • augmenting mask prediction with fully connected layers.
The results outperform the Mask R-CNN architecture as re-implemented by the PANet authors.
Another extension of the Mask R-CNN model, this time with a cascade architecture, is the Hybrid Task Cascade (HTC) [28]. It improves segmentation accuracy by incorporating cascading and multi-tasking at each stage and by exploiting more information from the spatial context.
State-of-the-art two-stage methods achieve outstanding results. However, due to the cascade strategy, these methods are usually time- and memory-consuming. That is why performing instance detection and segmentation in a single-stage architecture [29,30,31,32] has recently become more and more popular, even though it does not yet perform as well as the two-stage methods.

1.2. Single-Stage Methods

Though single-stage methods simplify the procedure by removing the re-pooling step, they usually cannot produce masks as accurate as those of the two-stage methods. Some of these methods work in a per-pixel prediction manner, similar to semantic segmentation.
The Fully Convolutional One-Stage detection (FCOS) algorithm [33] has fully eliminated pre-defined anchor boxes, thus avoiding complicated computations such as calculating box overlaps during training.
EmbedMask, proposed in [32], unifies segmentation- and proposal-based methods and takes advantage of both. It is built on top of detection models and thus has strong detection capabilities, just like the proposal-based methods. It applies extra embedding modules to generate embeddings for pixels and proposals, which enables EmbedMask to generate high-resolution masks without losing details through re-pooling, like the segmentation-based methods.
TensorMask is an example of a dense sliding-window instance segmentation method [34]. The idea is to use structured 4D tensors to represent masks over a spatial domain. It performs multiclass classification in parallel to mask prediction. The authors show that this algorithm yields results similar to Mask R-CNN.
In this paper, we are mainly interested in the detection of each rodent in reference to the other and to the background objects. Therefore, we focus on instance segmentation, and more specifically on deep instance segmentation architectures represented by two different approaches: single- and two-stage methods. In particular, we address the problem of instance segmentation of animals in close physical contact in thermal images. The aim of this study was also to verify whether thermal data bit-depth reduction or re-scaling has an effect on the results.
The paper is organized as follows: Section 2 presents the general description of methods and materials. Results are introduced in Section 3 and discussed in Section 4. The last section concludes the work.

2. Methods

In this section we describe the instance segmentation methods we adopted for thermal images of laboratory animals. For this purpose we use the thermal database of rats introduced in [35]. The database consists of 300 min of recordings of social behavior tests. Every test was carried out on two healthy male rats of the Wistar strain, aged 12–16 weeks, kept in a plexiglass cage (35 cm long × 45 cm wide × 46 cm high) for about 17 min in a dimly illuminated room at a temperature of about 22 °C. The animals had not been accustomed to the cage earlier; it was an unfamiliar environment. There was no food, water or bedding material in the cage. The image sequences were recorded by a FLIR A320G camera situated 120 cm above the cage, with a spatial resolution of 320 × 240 pixels, 60 fps and 16-bit image representation. The results of semantic segmentation on the same dataset are presented in [16].
The principles for the care and use of laboratory animals in research, as outlined by the Local Ethical Committee, were strictly followed and all the protocols were reviewed and approved by the Committee.

2.1. Data Preprocessing

Thermal data were measured and stored with a resolution of 16 bits. Reducing the bit depth to 256-level standard images causes data loss; however, some data, e.g., the background temperature, are redundant. In order to minimize data loss and investigate the effect of temperature range selection, we processed the thermal frames to obtain five different image types. This procedure is described in detail in [16] and presented briefly in Figure 2. First, the entire range of recorded raw thermal data (Figure 2a) was scaled to a 256-level gray image (Figure 2b), called the orig image, as these were the original raw thermal data re-scaled to 8 bits. Then gray-scale images were created from three selected thermal ranges: the animal body range (Figure 2c) and two narrower thermal ranges (Figure 2d,e), marked as ch1, ch2 and ch3, respectively. The minimal temperature value of the object was assumed to be equal to the threshold temperature between the background and the object calculated by the Otsu method, marked as a red line in Figure 2a and denoted maxOtsuThres. The green and blue dashed lines, denoted T2 and T3 respectively, define one- and two-thirds of the temperature interval from maxOtsuThres to the maximal value of all data. The ch1 range was selected between maxOtsuThres and the maximal temperature value. The ch2 image was created for the temperature range from T2 to the maximal temperature value, and analogously, the lower limit of the ch3 image was the T3 value and the upper limit was the maximum temperature. Since re-scaling the 16-bit raw thermal data to an 8-bit image causes a loss of accuracy, the last type of image covered the entire range of recorded data saved as a 16-bit image (16-bit). In [16] we showed that proper range selection improved the results of semantic segmentation.
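The range selection above can be sketched in a few lines of NumPy. Here `otsu_thres` stands in for maxOtsuThres and is passed in precomputed; the helper names are ours, not taken from the paper's code:

```python
import numpy as np

def scale_range(raw, lo, hi):
    """Clip raw thermal values to [lo, hi] and rescale to an 8-bit image."""
    clipped = np.clip(raw.astype(float), lo, hi)
    return ((clipped - lo) / (hi - lo) * 255).astype(np.uint8)

def make_channels(raw, otsu_thres):
    """Build the five image variants described in the text.

    `otsu_thres` stands in for maxOtsuThres, the background/object
    threshold computed with the Otsu method."""
    t_max = float(raw.max())
    t2 = otsu_thres + (t_max - otsu_thres) / 3.0        # one third above threshold
    t3 = otsu_thres + 2.0 * (t_max - otsu_thres) / 3.0  # two thirds above threshold
    return {
        "orig": scale_range(raw, float(raw.min()), t_max),  # full range to 8 bits
        "ch1": scale_range(raw, otsu_thres, t_max),         # animal body range
        "ch2": scale_range(raw, t2, t_max),                 # narrower warm range
        "ch3": scale_range(raw, t3, t_max),                 # warmest parts only
        "16-bit": raw.astype(np.uint16),                    # untouched raw data
    }
```

For example, `make_channels(raw, 2000.0)` on raw counts spanning 1000–5000 maps everything below 4000 to black in the ch3 image, keeping only the warmest third of the object range.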

2.2. Training Models

In this study we chose one single-stage and one two-stage architecture to compare instance segmentation methods on thermal images of laboratory rodents: Mask R-CNN [23] and TensorMask [34].

2.3. Learning Configurations

We prepared two manually segmented training datasets and one testing dataset. The only difference between the training sets was their size: 200 and 500 images, respectively. The smaller set was used for transfer learning, where we used pre-trained networks: Mask R-CNN with a 50-layer ResNet-Feature Pyramid Network (FPN) backbone trained on the COCO dataset [36], and similarly TensorMask with a 50-layer ResNet-FPN backbone [37], both implementations from Detectron2 v0.1.2.
The 500 images were used for training the whole architectures from random initialization. In this training the number of iterations had to be increased so the models could converge. However, despite the very large number of iterations, the TensorMask architecture sometimes had convergence problems: the loss function increased over time. In such cases we used normalization techniques according to [38] and replaced Frozen Batch Normalization with Group Normalization [39]. Group Normalization’s accuracy is insensitive to batch size [39], and it allowed for successful TensorMask model training from scratch.
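As a rough illustration, not the authors' actual configuration, training a Mask R-CNN from scratch with Group Normalization can be expressed in Detectron2 along these lines; the config file name is the standard model-zoo one, and the exact keys should be checked against the installed Detectron2 version:

```python
# Sketch of a from-scratch training config with Group Normalization
# (assumes Detectron2 v0.1.x is installed; values are illustrative).
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = ""               # no pre-trained weights: random initialization
cfg.MODEL.BACKBONE.FREEZE_AT = 0     # train all backbone stages
# Replace Frozen Batch Normalization with Group Normalization [39]
cfg.MODEL.RESNETS.NORM = "GN"
cfg.MODEL.FPN.NORM = "GN"
cfg.MODEL.ROI_MASK_HEAD.NORM = "GN"
cfg.SOLVER.IMS_PER_BATCH = 2         # batch size used for from-scratch TensorMask
cfg.SOLVER.MAX_ITER = 100000
```

A `DefaultTrainer` built from such a config would then run the training loop; the batch size and iteration count mirror the values reported below for from-scratch training.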
The testing set consisted of 50 images not included in the training sets. Images in all datasets depicted two rodents during physical contact of varying degrees: from contact of a small part of the body (e.g., nose-to-nose) to bodies overlapping and covering each other.
For every training model we performed three independent training sessions. JSON files with ground-truth rat regions were manually created for every training and testing set using the VGG Image Annotator (VIA) [40]. The regions were marked on a zoomed image by a set of points forming a closed polygon.
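To turn such polygon annotations into training masks, a rasterization step is needed. A minimal even-odd-rule (ray casting) sketch, our own helper rather than part of VIA or Detectron2, could look like this:

```python
import numpy as np

def polygon_to_mask(all_points_x, all_points_y, height, width):
    """Rasterize a closed polygon (VIA-style point lists) into a boolean
    mask using the even-odd (ray casting) rule."""
    xs = np.asarray(all_points_x, dtype=float)
    ys = np.asarray(all_points_y, dtype=float)
    yy, xx = np.mgrid[0:height, 0:width]
    inside = np.zeros((height, width), dtype=bool)
    n = len(xs)
    for i in range(n):
        x0, y0 = xs[i], ys[i]
        x1, y1 = xs[(i + 1) % n], ys[(i + 1) % n]
        if y0 == y1:
            continue  # horizontal edges never cross a horizontal ray
        # Pixels whose rightward ray crosses this edge toggle in/out
        crosses = (y0 > yy) != (y1 > yy)
        x_at_y = x0 + (yy - y0) * (x1 - x0) / (y1 - y0)
        inside ^= crosses & (xx < x_at_y)
    return inside
```

For example, `polygon_to_mask([2, 8, 8, 2], [2, 2, 8, 8], 12, 12)` fills a 6 × 6 square; VIA stores exactly such `all_points_x`/`all_points_y` lists for each region.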
To compare with the trained and commonly available models, we tested inference using weights pre-trained on the MS COCO dataset [20], Cityscapes [41] and LVIS [42] and not trained on our database. We also used two different Python 3 implementations of Mask R-CNN (Mask R-CNN 2.1 [36], Detectron2 v0.1.2 [37]) and one of TensorMask (Detectron2 v0.1.2 [37]).
In total, we tested 24 different model configurations for pre-trained learning and 22 for training from scratch.
For both models and both training methods, 3-fold cross-validation was performed using the configurations with the best segmentation results. The first fold was made for all the model configurations. The second and third folds used the whole testing dataset of the first fold in their training sets, and replaced the testing set with 50 different images selected from the training set of the first fold.
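A simplified sketch of this fold construction, with made-up image identifiers (the actual selection in the paper was of specific annotated frames):

```python
import random

def make_folds(all_images, test_size=50, n_folds=3, seed=0):
    """Hold out a different group of `test_size` images in each fold;
    images tested in one fold return to the training set of the others."""
    rng = random.Random(seed)
    images = list(all_images)
    rng.shuffle(images)
    folds = []
    for k in range(n_folds):
        test = images[k * test_size:(k + 1) * test_size]
        test_set = set(test)
        train = [im for im in images if im not in test_set]
        folds.append((train, test))
    return folds
```

With 250 image identifiers this yields three folds of 200 training and 50 testing images, where each fold's test images appear in the training sets of the other folds.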
All experiments were performed using NVIDIA DGX-1 Station with Ubuntu 18.

2.4. Evaluation Metrics

To evaluate the results we use three different detection metrics at the bounding-box (bbox) and segmentation (segm) levels: mean Average Precision (mAP), and AP at Intersection-over-Union thresholds of 0.5 (AP50) and 0.75 (AP75).
The mean average precision (mAP) is the de facto standard for the quantitative evaluation of object detection and instance segmentation methods [43,44]. In general, the Average Precision is the area under the precision-recall curve. Precision and recall are defined as follows:
Precision = TP / (TP + FP);    Recall = TP / (TP + FN)
where: TP = True Positive, FP = False Positive and FN = False Negative.
Precision measures how accurate the predictions are, and recall measures how completely the actual positives are found.
In object detection and instance segmentation it is important to appropriately define a true positive and the other types of results. As the current standard, the IoU metric is used. It is defined as the intersection between the predicted segment (bbox) and the actual segment (bbox) divided by their union:
IoU = Area of Overlap / Area of Union
In this study the actual (ground truth) segment was manually drawn by the authors on each reference image.
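Computed on binary masks or boxes, this metric is a few lines of NumPy; this is a generic sketch, not the evaluation code used in the paper:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-Union of two boolean segmentation masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return np.logical_and(pred, gt).sum() / union

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```

For two unit-offset 2 × 2 squares, for instance, the overlap is 1 pixel and the union 7, giving an IoU of 1/7 for both the mask and box variants.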
A prediction is considered a True Positive if the IoU value is higher than the threshold (e.g., IoU > 0.5), and a False Positive otherwise. Different thresholds can be used, yielding a precision-recall curve, and the area under the curve can then be calculated as:
AP = ∫₀¹ p(r) dr,
where: p(r) is the precision-recall curve.
In practice, in the majority of current papers on instance segmentation, the interpolated curve is calculated over IoU threshold values from 0.5 to 0.95 with a step size of 0.05. This method was also used in this paper. The mAP metric is the mean of the AP across all of these IoU thresholds. We also used AP at fixed IoUs: IoU = 0.5 (AP50) and IoU = 0.75 (AP75), which means that a prediction is considered a True Positive if its IoU is above 0.5 or 0.75, respectively. Thus, the higher the threshold, the greater the requirement for the accuracy of matching the prediction to the ground truth. The standard metrics used in this study allow direct comparison of future methods with the approach used in this paper.
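A simplified sketch of this evaluation protocol follows; it assumes each detection has already been matched to its best ground-truth instance (IoU of 0 if none), which sidesteps the full COCO matching procedure:

```python
import numpy as np

def ap_at_threshold(scores, ious, n_gt, thr):
    """AP at one IoU threshold, given per-detection confidence scores and
    the IoU of each detection with its matched ground-truth instance."""
    order = np.argsort(scores)[::-1]              # rank detections by confidence
    tp = (np.asarray(ious, dtype=float)[order] >= thr).astype(float)
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / n_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # Area under an interpolated precision-recall curve
    ap = 0.0
    for r in np.linspace(0, 1, 101):
        p = precision[recall >= r]
        ap += (p.max() if p.size else 0.0) / 101
    return ap

def coco_map(scores, ious, n_gt):
    """Mean AP over IoU thresholds 0.50:0.05:0.95, as used in this paper."""
    thrs = np.arange(0.5, 1.0, 0.05)
    return float(np.mean([ap_at_threshold(scores, ious, n_gt, t) for t in thrs]))
```

A detector whose every prediction perfectly overlaps a distinct ground-truth instance scores an mAP of 1.0 under this sketch, while a single detection with IoU 0.3 scores 0 at every threshold.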
We have evaluated bounding box AP for object detection and segmentation AP for instance segmentation. The metrics are measured for all five types of images (orig, ch1, ch2, ch3, 16-bit) and training parameters shown in Table 1.

3. Results

The best results among all the training parameter combinations were achieved with the values presented in Table 1.
The TensorMask convergence problems during training from scratch forced the use of a large iteration number, much larger than for the Mask R-CNN architecture. In order to compare the effectiveness of both architectures, we performed an additional Mask R-CNN training with the optimal parameters for TensorMask (batch size of 2, 100,000 epochs). Values in bold indicate parameters that were selected for cross-validation.
In Table 2 and Table 3 we compare Mask R-CNN testing results for pre-trained and from scratch learning respectively. Analogous measurements for the TensorMask architecture are presented in Table 4 and Table 5.
Figure 3 presents the exemplary testing results for two images (Figure 3a,b) which constituted quite a serious challenge for segmentation. For a better illustration, the contact area was zoomed and manually segmented with a yellow line in Figure 3c,d. The examples given include animals that overlapped one another. Figure 3e–h present the results of pre-trained learning for Mask R-CNN and TensorMask respectively. The instance segmentation made by those architectures trained from scratch are shown in Figure 3i–l.
Table 6 demonstrates the results of inference on orig images using commonly available models pre-trained only on the MS COCO dataset, Cityscapes and LVIS. We also used two different implementations of Mask R-CNN pre-trained on the COCO dataset.
The results of 3-fold cross-validation are presented in Table 7, for a batch size of four with 2000 epochs for the pre-trained models and a batch size of two with 100,000 epochs for the models trained from random initialization. The results are presented as box and segmentation mAP. The folds of model training for three different training and testing datasets were repeated twice; the presented values are from the better repetition.

4. Discussion

The objective of the study was to compare two different instance segmentation approaches and two different learning methods for the segmentation of experimental animals in thermal images. Various methods of re-scaling thermal data to a standard image were also compared.
In [38] the authors showed that a model trained from random initialization can achieve results similar to pre-trained models, provided it has proper training parameters, but it needs more iterations to converge. In our experiments, training Mask R-CNN from scratch needed about 16 times as many epochs (16,000 or 24,000) (Table 4 and Table 5) as the pre-trained model (1000 or 2000) to achieve similar results (Table 2 and Table 3). Increasing the number of epochs to 100,000 improved the results only for the orig and ch1 images, and not significantly. The TensorMask architecture required a much larger number of epochs for the random initialization training (100,000), and still did not achieve results comparable to the pre-trained model (see Table 4 and Table 5).
Training Mask R-CNN from scratch is thus neither time- nor data-expensive. Only a slightly larger number of images and epochs allows it to achieve results comparable to all-layers training of the pre-trained models.
The values in Table 2 and Table 4 indicate that the best segmentation results for the pre-trained models were obtained for the 16-bit and ch1 images followed by orig for Mask R-CNN, and for the 16-bit and orig images followed by ch1 for TensorMask. The ch2 images achieved only slightly lower results than the top ones. The bbox and segmentation mAP values for both architectures were close to 90 percent, with a slight advantage for TensorMask. Detection for ch3 in both cases achieved a best mAP above 70 percent, while the best segmentation mAP was equal to 65 percent. Looking at the results for all combinations of training parameters, not only the best ones, it can be clearly seen that TensorMask achieves better results across the various training models.
The results of training from scratch show the opposite trend (see Table 3 and Table 5). This time it is Mask R-CNN that achieves better results for all images (mAP above 80 percent for almost all images) under various criteria, suggesting that the advantage of this model trained from scratch is not due to chance on a single metric. The differences were especially visible for the ch3 image, for which the best TensorMask segmentation mAP was equal to only 32.752 percent.
The pre-trained model shows generally similar results for instance segmentation and detection for all images except ch3. The ch3 image contains a very narrow range of only the highest surface temperature values of the animals (Figure 2a,e), so it is deprived of a significant part of the body area. Such an object is detectable, but difficult to segment correctly when there is no information about the object’s boundaries. That is why the mAP for segmentation is much smaller than for detection.
The data in Table 3 and Table 5 show that models trained from scratch are likely to perform segmentation more accurately than detection (except for ch3). Detection and segmentation accuracy for the pre-trained models are more similar.
In Figure 3 we present the results of detection and segmentation for challenging views, with the critical regions zoomed in white frames. The image in Figure 3c is difficult to segment because the snout of one rodent covers the snout of the other over the entire width of the body. However, there is a small part of the snout that belongs to the animal at the bottom and in this camera view is not connected to the rest of its body. One of the animals in Figure 3d has its snout hidden under the body of the other rat, which in fact is not so rare. Here, the cooler water mark left on the fur creates a line that can be mistaken for a continuation of the body boundary. This is exactly how it was segmented by the pre-trained TensorMask (Figure 3h), as a result of which a small area of the body on the border was assigned to the wrong object. It also marked both individuals together as one object, although with low probability (7%). The Mask R-CNN architecture set the boundaries more precisely (Figure 3f,j) but left a gap of a few pixels between the objects. TensorMask trained from scratch segmented the rodents’ body areas similarly to the pre-trained model (Figure 3l), and here an additional object, a combination of both body parts, was also detected.
The separated part in Figure 3c caused Mask R-CNN the most problems. The pre-trained model (Figure 3e) considered the separated part of the animal’s snout to be an element of the other rat’s body. In addition, it was the only model that inaccurately determined the bbox of the detected object. Mask R-CNN trained from scratch did not assign this small part of the snout to any object, but correctly detected the bbox (Figure 3i). The pre-trained TensorMask (Figure 3g) assigned the problematic snout partly to both individuals; this line of segmentation seems to be the closest to the correct one. The TensorMask architecture trained from scratch assigned both animals’ snouts to both objects at once (Figure 3k).
Although the segmentation results were not always satisfactory, it should be remembered that difficult cases were presented here. For the vast majority of images, the prediction was very similar to the ground truth and succeeded in segmenting the rodents, in contrast to the semantic segmentation algorithms presented in [16]. The results of U-Net segmentation (see Figure 1d) show that the boundary between objects disappears during close contact, although it was previously visible when the contact area was small.
The results of segmentation made by the commonly available models (see Table 6) are much worse than those of the models trained on the target images, and do not exceed 8 mAP. The models trained on the Cityscapes and LVIS datasets achieved extremely low values, below 0.75 mAP. During the evaluation, the correctness of the class assignment is also taken into account. Although the COCO dataset has a “mouse” class [20], the rat objects were not assigned to it, so in this case the assignment correctness was zero.
The 3-fold cross-validation results presented in Table 7 are very similar. The top segmentation and detection results on average (marked in bold in Table 7) were achieved by the pre-trained TensorMask. Both pre-trained models show greater accuracy than those trained from scratch, which is consistent with the previous results (see Table 2, Table 3, Table 4 and Table 5). This is probably related to the training set being much smaller than the COCO set [20]. As far as the randomly initialized models are concerned, the difference between the two models is significant: Mask R-CNN (marked in bold italic in Table 7) obtains an mAP up to 10 percent higher than TensorMask. It is possible that TensorMask needs more data and/or epochs to achieve Mask R-CNN-like results.

5. Conclusions

Deep instance segmentation algorithms are able to distinguish between two individuals in close contact where overlaps may appear. The segmentation mAP almost reaches 90 percent. The detection results are slightly higher and usually oscillate around 90 percent. The top results are achieved by the single-stage (TensorMask) pre-trained network; however, the two-stage method (Mask R-CNN) works better than the single-stage one when trained from scratch. Training the Mask R-CNN model from random initialization is neither time- nor data-consuming. It can be used for training on non-standard images, keeping in mind that networks trained from scratch focus more on segmentation than detection. In turn, the cost of training TensorMask is large, and it still does not achieve results similar to the pre-trained version. The pre-trained TensorMask model shows the best performance. The research indicates that thermal images can be successfully analyzed by architectures pre-trained on standard images. However, some layers of the network must be trained on thermal images, because the models pre-trained only on publicly available databases achieve very poor segmentation results and even worse detection results.
The conversion of thermal data from different thermal ranges does not improve the quality of segmentation. The results for the 16-bit (16-bit) and 8-bit (orig) image representations, as well as for the image deprived of the background (ch1), are comparable. The results for images with a narrower thermal range (ch2) do not differ much from the others, unlike the results for the ch3 image, in which only information about the warmest parts of the body is visible. However, the bounding box mAP values of these images for the pre-trained model suggest that detection on limited-data images is possible.
The main contributions of this paper are the following:
  • the adopted deep instance segmentation algorithms were experimentally verified for laboratory rodent detection in thermal images,
  • it was shown that laboratory rodents can be accurately detected (and separated from each other) in thermal images using the Mask R-CNN and TensorMask models,
  • the obtained results demonstrated that the adopted TensorMask model, pre-trained on visible light images and trained with thermal sequences, gave the best results, with a mean average precision (mAP) greater than 90 percent,
  • it was verified that converting the thermal data to a narrower raw thermal range does not improve the quality of segmentation,
  • the single-stage pre-trained network achieves better results than the two-stage pre-trained model; however, the two-stage method seems to work better than the single-stage one when trained from scratch,
  • network pre-training using visible light images improves the segmentation results for thermal images.
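For reference, the reported metrics are COCO-style: AP50 and AP75 are average precisions at IoU thresholds of 0.5 and 0.75, and mAP averages AP over thresholds from 0.50 to 0.95 in steps of 0.05. A simplified single-image sketch with hypothetical IoU values (real evaluators additionally enforce one-to-one matching between predictions and ground truths and interpolate the precision curve):

```python
def average_precision(ious, num_gt, thr):
    """AP for predictions already sorted by descending confidence; each entry
    of `ious` is that prediction's IoU with its best ground-truth match."""
    tp = fp = 0
    precisions = []
    for iou in ious:
        if iou >= thr:
            tp += 1
            precisions.append(tp / (tp + fp))  # precision sampled at each recall step
        else:
            fp += 1
    return sum(precisions) / num_gt if num_gt else 0.0

ious = [0.92, 0.81, 0.55]  # hypothetical matches for three detections
ap50 = average_precision(ious, 3, 0.50)
ap75 = average_precision(ious, 3, 0.75)
m_ap = sum(average_precision(ious, 3, t / 100) for t in range(50, 100, 5)) / 10
```

Here the third, loosely fitting detection counts as a hit at IoU 0.5 but as a miss at 0.75, which is why AP75 and mAP are the stricter (and lower) of the three numbers.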
The conclusion of this work is that deep instance segmentation algorithms can be used to segment experimental animals in thermal images. Depending on the needs, one can customize the architecture, learning method, or image type for the best performance. Instance segmentation methods work better than semantic segmentation methods. The presented approach will work well in social behaviour tests, but not only there: it can be used wherever identification and tracking of experimental animals is required, especially in large groups.
In the future, it is worth training both architectures with a more diverse thermal database. Increasing the amount of training data can also improve the results, especially for TensorMask.

Author Contributions

Conceptualization, M.M.-M. and J.R.; data curation, M.M.-M.; formal analysis, M.M.-M. and J.R.; funding acquisition, J.R.; investigation, M.M.-M.; methodology, M.M.-M. and J.R.; project administration, M.M.-M. and J.R.; resources, M.M.-M.; software, M.M.-M. and T.K.; supervision, J.R.; validation, M.M.-M., J.R. and T.K.; visualization, M.M.-M.; writing—original draft preparation, M.M.-M.; writing—review and editing, J.R. and T.K. All authors have read and agreed to the published version of the manuscript.


This work has been partially supported by the Statutory Funds of the Electronics, Telecommunications and Informatics Faculty, Gdansk University of Technology.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lezak, K.; Missig, G.; Carlezon, W.A., Jr. Behavioral methods to study anxiety in rodents. Dialogues Clin. Neurosci. 2017, 19, 181–191. [Google Scholar] [PubMed]
  2. Franco, N.H.; Gerós, A.; Oliveira, L.; Olsson, I.A.S.; Aguiar, P. ThermoLabAnimal—A high-throughput analysis software for non-invasive thermal assessment of laboratory mice. Physiol. Behav. 2019, 207, 113–121. [Google Scholar] [CrossRef] [PubMed]
  3. Junior, C.F.C.; Pederiva, C.N.; Bose, R.C.; Garcia, V.A.; Lino-de-Oliveira, C.; Marino-Neto, J. ETHOWATCHER: Validation of a tool for behavioral and video-tracking analysis in laboratory animals. Comput. Biol. Med. 2012, 42, 257–264. [Google Scholar] [CrossRef] [PubMed]
  4. Grant, E.; Mackintosh, J. A comparison of the social postures of some common laboratory rodents. Behaviour 1963, 21, 246–259. [Google Scholar]
  5. Kask, A.; Nguyen, H.P.; Pabst, R.; von Hoorsten, S. Factors influencing behavior of group-housed male rats in the social interaction test—Focus on cohort removal. Physiol. Behav. 2001, 74, 277–282. [Google Scholar] [CrossRef]
  6. Aslani, S.; Harb, M.; Costa, P.; Almeida, O.; Sousa, N.; Palha, J. Day and night: Diurnal phase influences the response to chronic mild stress. Front. Behav. Neurosci. 2014, 8, 82. [Google Scholar] [CrossRef] [Green Version]
  7. Roedel, A.; Storch, C.; Holsboer, F.; Ohl, F. Effects of light or dark phase testing on behavioural and cognitive performance in DBA mice. Lab. Anim. 2006, 40, 371–381. [Google Scholar] [CrossRef]
  8. Manzano-Szalai, K.; Pali-Schöll, I.; Krishnamurthy, D.; Stremnitzer, C.; Flaschberger, I.; Jensen-Jarolim, E. Anaphylaxis Imaging: Non-Invasive Measurement of Surface Body Temperature and Physical Activity in Small Animals. PLoS ONE 2016, 11, e0150819. [Google Scholar] [CrossRef] [Green Version]
  9. Etehadtavakol, M.; Emrani, Z.; Ng, E.Y.K. Rapid extraction of the hottest or coldest regions of medical thermographic images. Med. Biol. Eng. Comput. 2019, 57, 379–388. [Google Scholar] [CrossRef]
  10. Jang, E.; Park, B.; Park, M.; Kim, S.; Sohn, J. Analysis of physiological signals for recognition of boredom, pain, and surprise emotions. J. Phys. Anthropol. 2015, 34, 1–12. [Google Scholar] [CrossRef] [Green Version]
  11. Tan, C.; Knight, Z. Regulation of Body Temperature by the Nervous System. Neuron 2018, 98, 31–48. [Google Scholar] [CrossRef] [PubMed]
  12. Sona, D.; Zanotto, M.; Papaleo, F.; Murino, V. Automated Discovery of Behavioural Patterns in Rodents. In Proceedings of the 9th International Conference on Methods and Techniques in Behavioral Research, Wageningen, The Netherlands, 27–29 August 2014. [Google Scholar]
  13. Koniar, D.; Hargaš, L.; Loncová, Z.; Simonová, A.; Duchoň, F.; Beňo, P. Visual system-based object tracking using image segmentation for biomedical applications. Electr. Eng. 2017, 99, 1349–1366. [Google Scholar] [CrossRef]
  14. Fleuret, J.; Ouellet, V.; Moura-Rocha, L.; Charbonneau, É.; Saucier, L.; Faucitano, L.; Maldague, X. A Real Time Animal Detection And Segmentation Algorithm For IRT Images In Indoor Environments. Quant. InfraRed Thermogr. 2016, 265–274. [Google Scholar]
  15. Kim, W.; Cho, Y.B.; Lee, S. Thermal Sensor-Based Multiple Object Tracking for Intelligent Livestock Breeding. IEEE Access 2017, 5, 27453–27463. [Google Scholar] [CrossRef]
  16. Mazur-Milecka, M.; Ruminski, J. Deep learning based thermal image segmentation for laboratory animals tracking. Quant. InfraRed Thermogr. J. 2020, 1–18. [Google Scholar] [CrossRef]
  17. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  18. Milletari, F.; Navab, N.; Ahmadi, S. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the IEEE 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  19. Hariharan, B.; Arbelaez, P.; Girshick, R.B.; Malik, J. Simultaneous Detection and Segmentation. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 297–312. [Google Scholar]
  20. Lin, T.; Maire, M.; Belongie, S.J.; Bourdev, L.D.; Girshick, R.B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  21. Dai, J.; He, K.; Sun, J. Instance-aware Semantic Segmentation via Multi-task Network Cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3150–3158. [Google Scholar]
  22. Chen, L.; Hermans, A.; Papandreou, G.; Schroff, F.; Wang, P.; Adam, H. MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features. arXiv 2017, arXiv:1712.04837. [Google Scholar]
  23. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  24. Huang, Z.; Huang, L.; Gong, Y.; Huang, C.; Wang, X. Mask Scoring R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6409–6418. [Google Scholar]
  25. Girshick, R.B. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef] [Green Version]
  27. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  28. Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W.; et al. Hybrid Task Cascade for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4974–4983. [Google Scholar]
  29. Yao, J.; Yu, Z.; Yu, J.; Tao, D. Single Pixel Reconstruction for One-stage Instance Segmentation. arXiv 2019, arXiv:1904.07426. [Google Scholar]
  30. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time Instance Segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9157–9166. [Google Scholar]
  31. Xiang, C.; Tian, S.; Zou, W.; Xu, C. SAIS: Single-stage Anchor-free Instance Segmentation. arXiv 2019, arXiv:cs.CV/1912.01176. [Google Scholar]
  32. Ying, H.; Huang, Z.; Liu, S.; Shao, T.; Zhou, K. EmbedMask: Embedding Coupling for One-stage Instance Segmentation. arXiv 2019, arXiv:cs.CV/1912.01954. [Google Scholar]
  33. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
  34. Chen, X.; Girshick, R.B.; He, K.; Dollár, P. TensorMask: A Foundation for Dense Object Segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 2061–2069. [Google Scholar]
  35. Mazur-Milecka, M.; Ruminski, J. Automatic analysis of the aggressive behavior of laboratory animals using thermal video processing. In Proceedings of the IEEE Conference of the Engineering in Medicine and Biology Society, EMBC, Seogwipo, Korea, 11–15 July 2017; pp. 3827–3830. [Google Scholar]
  36. Abdulla, W. Mask R-CNN for Object Detection and Instance Segmentation on Keras and TensorFlow. 2017. Available online: (accessed on 1 February 2020).
  37. Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2. 2019. Available online: (accessed on 1 May 2020).
  38. He, K.; Girshick, R.; Dollar, P. Rethinking ImageNet Pre-Training. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019; pp. 4917–4926. [Google Scholar]
  39. Wu, Y.; He, K. Group Normalization. Int. J. Comput. Vis. 2019. [Google Scholar] [CrossRef]
  40. Dutta, A.; Zisserman, A. The VIA Annotation Software for Images, Audio and Video. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; ACM: New York, NY, USA, 2019. [Google Scholar] [CrossRef] [Green Version]
  41. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  42. Gupta, A.; Dollár, P.; Girshick, R. LVIS: A Dataset for Large Vocabulary Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5356–5364. [Google Scholar]
  43. Lu, Y.; Lu, C.; Tang, C. Online Video Object Detection Using Association LSTM. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2363–2371. [Google Scholar]
  44. Oksuz, K.; Cam, B.C.; Akbas, E.; Kalkan, S. Localization Recall Precision (LRP): A New Performance Metric for Object Detection. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 521–537. [Google Scholar]
Figure 1. Results of the most popular algorithms used for segmentation of (a) original image by: (b) Otsu’s thresholding, (c) watershed with gradient regions markers, (d) U-Net—semantic segmentation.
Figure 2. Range selection: (a) histogram of one frame raw thermal data with values of: minimal rodent’s surface thermal value named maxOtsuthres, marked with red solid line; the minimum values for images ch2 and ch3 marked as green (T2) and blue (T3) lines respectively, (b) orig image of the whole thermal range, (c) ch1 image of the first range: maxOtsuthres to Tmax, (d) ch2 image of the second range: T2 to Tmax, (e) ch3 image of the third range: T3 to Tmax.
Figure 3. Examples of segmentation and detection prediction made for (a,b) the original images; (c,d) proper segmentation zoomed and marked with yellow lines. Results for: (e,f) pre-trained Mask R-CNN; (g,h) pre-trained TensorMask; (i,j) random initialized Mask R-CNN; (k,l) random initialized TensorMask.
Table 1. Training parameters that achieved the best results in training.

| Transfer Learning | | Training from Scratch | |
|---|---|---|---|
| Mask R-CNN | TensorMask | Mask R-CNN | TensorMask |
Table 2. Testing results for Mask R-CNN pre-trained learning and 200 images.

| | mAP (bbox) | AP50 (bbox) | AP75 (bbox) | mAP (segm) | AP50 (segm) | AP75 (segm) |
|---|---|---|---|---|---|---|
| **orig** | | | | | | |
| 2 batch, 1000 epochs | 88.641 | 100 | 98.99 | 89.007 | 100 | 98.99 |
| 4 batch, 1000 epochs | 89.025 | 100 | 100 | 89.444 | 100 | 100 |
| 4 batch, 2000 epochs | 88.781 | 100 | 99 | 89.068 | 100 | 99 |
| **ch1** | | | | | | |
| 2 batch, 1000 epochs | 87.332 | 100 | 100 | 89.289 | 100 | 100 |
| 4 batch, 1000 epochs | 86.942 | 100 | 100 | 89.715 | 100 | 100 |
| 4 batch, 2000 epochs | 90.205 | 100 | 100 | 89.655 | 100 | 100 |
| **ch2** | | | | | | |
| 2 batch, 1000 epochs | 85.252 | 100 | 99 | 87.323 | 100 | 100 |
| 4 batch, 1000 epochs | 85.847 | 100 | 98.98 | 88.256 | 100 | 100 |
| 4 batch, 2000 epochs | 87.915 | 100 | 98.98 | 88.893 | 100 | 100 |
| **ch3** | | | | | | |
| 2 batch, 1000 epochs | 63.719 | 99.114 | 73.409 | 52.901 | 93.649 | 59.722 |
| 4 batch, 1000 epochs | 69.028 | 99.417 | 80.98 | 55.181 | 97.284 | 60.683 |
| 4 batch, 2000 epochs | 73.908 | 99.952 | 87.824 | 65.687 | 98.902 | 83.376 |
| **16-bit** | | | | | | |
| 2 batch, 1000 epochs | 88.128 | 100 | 99.01 | 89.604 | 100 | 100 |
| 4 batch, 1000 epochs | 89.965 | 100 | 100 | 89.194 | 100 | 100 |
| 4 batch, 2000 epochs | 90.549 | 100 | 100 | 89.417 | 100 | 100 |
Table 3. Training results for Mask R-CNN from scratch learning and 500 images.

| | mAP (bbox) | AP50 (bbox) | AP75 (bbox) | mAP (segm) | AP50 (segm) | AP75 (segm) |
|---|---|---|---|---|---|---|
| **orig** | | | | | | |
| 4 batch, 16,000 epochs | 81.811 | 100 | 98.307 | 85.586 | 100 | 100 |
| 4 batch, 24,000 epochs | 82.567 | 100 | 100 | 86.864 | 100 | 100 |
| 8 batch, 16,000 epochs | 81.658 | 100 | 100 | 84.728 | 100 | 98.584 |
| 8 batch, 24,000 epochs | 84.465 | 100 | 100 | 86.224 | 100 | 100 |
| 2 batch, 100,000 epochs | 86.695 | 100 | 100 | 89.569 | 100 | 100 |
| **ch1** | | | | | | |
| 4 batch, 16,000 epochs | 82.085 | 100 | 100 | 83.666 | 100 | 100 |
| 4 batch, 24,000 epochs | 84.368 | 100 | 100 | 85.821 | 100 | 100 |
| 8 batch, 16,000 epochs | 81.426 | 100 | 100 | 84.038 | 100 | 100 |
| 8 batch, 24,000 epochs | 85.153 | 100 | 98.772 | 86.389 | 100 | 100 |
| 2 batch, 100,000 epochs | 86.564 | 100 | 100 | 87.208 | 100 | 100 |
| **ch2** | | | | | | |
| 4 batch, 16,000 epochs | 80.629 | 100 | 98.95 | 80.896 | 100 | 100 |
| 4 batch, 24,000 epochs | 82.979 | 100 | 100 | 84.834 | 100 | 100 |
| 8 batch, 16,000 epochs | 80.983 | 100 | 100 | 79.931 | 100 | 100 |
| 8 batch, 24,000 epochs | 83.091 | 100 | 100 | 84.428 | 100 | 100 |
| 2 batch, 100,000 epochs | 82.55 | 100 | 97.55 | 82.397 | 100 | 100 |
| **ch3** | | | | | | |
| 4 batch, 16,000 epochs | 60.947 | 97.766 | 73.004 | 48.183 | 93.325 | 47.4 |
| 4 batch, 24,000 epochs | 67.739 | 98.113 | 79.816 | 58.782 | 96.882 | 69.92 |
| 8 batch, 16,000 epochs | 65.355 | 97.874 | 74.09 | 54.855 | 96.873 | 64.968 |
| 8 batch, 24,000 epochs | 66.014 | 98.4 | 78.156 | 57.452 | 95.887 | 63.699 |
| 2 batch, 100,000 epochs | 66.369 | 98.504 | 84.891 | 56.116 | 94.891 | 67.155 |
| **16-bit** | | | | | | |
| 4 batch, 16,000 epochs | 82.826 | 100 | 98.842 | 84.561 | 100 | 100 |
| 4 batch, 24,000 epochs | 84.29 | 100 | 100 | 86.111 | 100 | 100 |
| 8 batch, 16,000 epochs | 83.493 | 100 | 97.82 | 85.313 | 100 | 100 |
| 8 batch, 24,000 epochs | 85.596 | 100 | 100 | 86.607 | 100 | 100 |
| 2 batch, 100,000 epochs | 84.862 | 100 | 100 | 85.877 | 100 | 100 |
Table 4. Training results for TensorMask pre-trained learning and 200 images.

| | mAP (bbox) | AP50 (bbox) | AP75 (bbox) | mAP (segm) | AP50 (segm) | AP75 (segm) |
|---|---|---|---|---|---|---|
| **orig** | | | | | | |
| 2 batch, 2000 epochs | 90.646 | 100 | 98.01 | 89.82 | 100 | 100 |
| 4 batch, 1000 epochs | 91.264 | 100 | 99.01 | 89.877 | 100 | 99.01 |
| 4 batch, 2000 epochs | 90.916 | 100 | 99.01 | 90.158 | 100 | 100 |
| 8 batch, 2000 epochs | 90.672 | 100 | 98.96 | 89.824 | 100 | 100 |
| **ch1** | | | | | | |
| 2 batch, 2000 epochs | 89.091 | 100 | 100 | 89.808 | 100 | 100 |
| 4 batch, 1000 epochs | 89.624 | 100 | 100 | 89.547 | 100 | 99.01 |
| 4 batch, 2000 epochs | 90.299 | 100 | 100 | 90.171 | 100 | 99.01 |
| 8 batch, 2000 epochs | 90.052 | 100 | 100 | 90.087 | 100 | 99.01 |
| **ch2** | | | | | | |
| 2 batch, 2000 epochs | 88.319 | 100 | 100 | 89.496 | 100 | 100 |
| 4 batch, 1000 epochs | 88.17 | 100 | 100 | 87.949 | 100 | 100 |
| 4 batch, 2000 epochs | 87.912 | 100 | 98.931 | 89.335 | 100 | 100 |
| 8 batch, 2000 epochs | 88.828 | 100 | 100 | 89.264 | 100 | 100 |
| **ch3** | | | | | | |
| 2 batch, 2000 epochs | 74.37 | 99.933 | 88.433 | 65.062 | 99.933 | 80.743 |
| 4 batch, 1000 epochs | 71.325 | 99.834 | 84.895 | 60.251 | 99.602 | 69.757 |
| 4 batch, 2000 epochs | 75.031 | 99.961 | 91.558 | 64.849 | 99.423 | 79.397 |
| 8 batch, 2000 epochs | 76.159 | 100 | 91.406 | 65.307 | 99 | 83.105 |
| **16-bit** | | | | | | |
| 2 batch, 2000 epochs | 91.248 | 100 | 100 | 90.249 | 100 | 100 |
| 4 batch, 1000 epochs | 90.443 | 100 | 98.951 | 89.857 | 100 | 98.951 |
| 4 batch, 2000 epochs | 91.609 | 100 | 100 | 90.199 | 100 | 100 |
| 8 batch, 2000 epochs | 90.939 | 100 | 100 | 90.171 | 100 | 98.99 |
Table 5. Training results for TensorMask from scratch learning and 500 images.

| | mAP (bbox) | AP50 (bbox) | AP75 (bbox) | mAP (segm) | AP50 (segm) | AP75 (segm) |
|---|---|---|---|---|---|---|
| **orig** | | | | | | |
| 2 batch, 100,000 epochs | 78.235 | 100 | 92.41 | 79.615 | 100 | 95.535 |
| 4 batch, 100,000 epochs | 76.413 | 99.99 | 95.435 | 80.371 | 99.99 | 91.678 |
| **ch1** | | | | | | |
| 2 batch, 100,000 epochs | 78.862 | 100 | 98.307 | 77.724 | 100 | 90.687 |
| 4 batch, 100,000 epochs | 77.725 | 100 | 94.535 | 81.259 | 100 | 95.188 |
| **ch2** | | | | | | |
| 2 batch, 100,000 epochs | 64.42 | 99.18 | 69.516 | 62.307 | 96.206 | 71.867 |
| 4 batch, 100,000 epochs | 75.275 | 99.833 | 90.342 | 74.431 | 99.833 | 89.674 |
| **ch3** | | | | | | |
| 2 batch, 100,000 epochs | 44.557 | 84.367 | 40.402 | 13.097 | 60.133 | 0.012 |
| 4 batch, 100,000 epochs | 55.982 | 95.072 | 57.009 | 32.752 | 89.025 | 16.88 |
| **16-bit** | | | | | | |
| 2 batch, 100,000 epochs | 78.112 | 99.99 | 97.29 | 80.198 | 99.99 | 94.435 |
| 4 batch, 100,000 epochs | 76.296 | 100 | 90.662 | 81.267 | 100 | 94.887 |
Table 6. Segmentation results for commonly available models.

| | mAP (bbox) | AP50 (bbox) | AP75 (bbox) | mAP (segm) | AP50 (segm) | AP75 (segm) |
|---|---|---|---|---|---|---|
| Mask R-CNN, MS COCO—implementation [37] | 4.14 | 13.39 | 1.4 | 5.36 | 14.4 | 1.05 |
| Mask R-CNN, MS COCO—implementation [36] | 7.62 | 13.5 | 7.62 | — | — | — |
| TensorMask, MS COCO—implementation [37] | 1.69 | 3.41 | 1.29 | 3.06 | 6.31 | 2.6 |
Table 7. 3-fold cross-validation mean Average Precision (mAP) results for pre-trained (four batch, 2000 epochs) and trained from scratch (two batch, 100,000 epochs) models of Mask R-CNN and TensorMask.

| | Mask R-CNN pre-trained, bbox | Mask R-CNN pre-trained, segm | Mask R-CNN from scratch, bbox | Mask R-CNN from scratch, segm | TensorMask pre-trained, bbox | TensorMask pre-trained, segm | TensorMask from scratch, bbox | TensorMask from scratch, segm |
|---|---|---|---|---|---|---|---|---|
| **orig** | | | | | | | | |
| 1. fold | 88.8 | 89.1 | 83.6 | 85.9 | 90.9 | 90.2 | 78.2 | 79.6 |
| 2. fold | 92.0 | 89.5 | 84.7 | 87.6 | 91.9 | 90.5 | 73.9 | 70.6 |
| 3. fold | 89.7 | 89.2 | 87.7 | 88.0 | 90.5 | 90.2 | 70.6 | 76.6 |
| **ch1** | | | | | | | | |
| 1. fold | 90.2 | 89.7 | 84.6 | 87.2 | 90.3 | 90.2 | 78.9 | 77.7 |
| 2. fold | 88.4 | 89.1 | 88.3 | 88.9 | 91.3 | 90.4 | 78.7 | 77.8 |
| 3. fold | 89.6 | 89.4 | 88.3 | 89.0 | 90.2 | 89.8 | 75 | 74.4 |
| **ch2** | | | | | | | | |
| 1. fold | 87.9 | 88.9 | 82.6 | 82.4 | 87.9 | 89.3 | 64.4 | 62.3 |
| 2. fold | 84.7 | 86.4 | 86.6 | 88.2 | 89.1 | 89.0 | 66 | 57.2 |
| 3. fold | 89.5 | 88.1 | 85.0 | 87.7 | 88.3 | 88.9 | 65.7 | 54.3 |
| **ch3** | | | | | | | | |
| 1. fold | 73.9 | 65.7 | 66.4 | 56.1 | 75.0 | 64.8 | 44.6 | 13.1 |
| 2. fold | 63.0 | 51.9 | 65.6 | 57 | 70.9 | 61.9 | 40.2 | 11.2 |
| 3. fold | 72.6 | 62.9 | 66.6 | 62.4 | 71.3 | 62.5 | 36.6 | 11.2 |
| **16-bit** | | | | | | | | |
| 1. fold | 90.5 | 89.4 | 84.9 | 85.9 | 91.6 | 90.2 | 78.1 | 80.2 |
| 2. fold | 89.2 | 89.4 | 85.8 | 88.5 | 92.5 | 90.8 | 76.8 | 77.6 |
| 3. fold | 90.0 | 89.1 | 86.6 | 87.8 | 90.6 | 89.9 | 71.5 | 73.9 |

Mazur-Milecka, M.; Kocejko, T.; Ruminski, J. Deep Instance Segmentation of Laboratory Animals in Thermal Images. Appl. Sci. 2020, 10, 5979.