Dynamic Detection and Recognition of Objects Based on Sequential RGB Images

Abstract: Conveyors are commonly used in industrial production lines and automated sorting systems. Many applications require fast, reliable, and dynamic detection and recognition of the objects on conveyors. To this end, we design a framework that involves three subtasks: one-class instance segmentation (OCIS), multiobject tracking (MOT), and zero-shot fine-grained recognition of 3D objects (ZSFGR3D). A new level set map network (LSMNet) and a multiview redundancy-free feature network (MVRFFNet) are proposed for the first and third subtasks, respectively. The level set map (LSM) is used to annotate instances instead of the traditional multichannel binary mask, and each peak of the LSM represents one instance. Based on the LSM, LSMNet can adopt a pix2pix architecture to segment instances. MVRFFNet is a generalized zero-shot learning (GZSL) framework based on the Wasserstein generative adversarial network for 3D object recognition. Multiview features of an object are combined into a compact registered feature. By treating the registered features as the category attributes in the GZSL setting, MVRFFNet learns a mapping function that maps the original retrieval features into a new redundancy-free feature space. To validate the performance of the proposed methods, a segmentation dataset and a fine-grained classification dataset of objects on a conveyor are established. Experimental results on these datasets show that LSMNet achieves a recall accuracy close to that of the light instance segmentation framework You Only Look At CoefficienTs (YOLACT), while its computing speed on an NVIDIA GTX1660TI GPU is 80 fps, which is much faster than YOLACT's 25 fps. Redundancy-free features generated by MVRFFNet perform much better than the original features in the retrieval task.


Introduction
Vision-based localization and recognition of objects on a conveyor (LROC) is an important type of application in the industry. In a typical scenario, such as vision-based automatic check-out in unmanned supermarkets or automatic sorting in factories, all categories are first registered in a database (saving features). Then, the features of objects on the conveyor are extracted and used to retrieve the categories from the database. It is a fine-grained recognition task in which we have to distinguish not only whether the object is Coca-Cola or milk, but also its flavor, volume, and packaging. The challenge of LROC consists of two large gaps between the registered images and the on-conveyor images. First, different views of a 3D object look different and should therefore be registered together, whereas only a group of close views of the object on a conveyor can be captured for prediction. Second, the registered images can be captured in an environment with stable illumination and background (source domain), whereas the on-conveyor images are often captured in open environments (target domain). Multiview methods [1,2] can bridge the first gap, and domain-aligning methods can deal with the second gap. However, when objects of new (unseen) categories occur and we cannot obtain their on-conveyor images for training, we need to train the network to learn a map from the source domain to the target domain, whose heterogeneous views differ from those of the seen categories. It is a generalized zero-shot learning (GZSL) problem [3,4].
In a usual detection-based multiobject tracking (MOT) [5,6] task, which mainly focuses on the final tracking performance of a few categories of objects (human, car, or some special categories), the detection module [7][8][9][10] is treated as an object detection task with coarse classification branches, and the reidentification module is treated as a fine-grained classification task for apparent features. The latter is an additional matching constraint to relieve ID switching, which is caused by the detection difference between frames. As object detection of unseen objects is one of the most challenging visual tasks and has not been exploited comprehensively, decomposition of localization and recognition would be a better choice. One advantage of decomposition is that some detection-free methods can be used to improve the computing speed of the localization subtask. Another advantage is that zero-shot classification architectures can be used to deal with new unseen categories. Considering the simple texture and the fixed motion direction of conveyors, LROC is a special MOT task with an easier localization subtask but a harder zero-shot learning fine-grained recognition subtask.
In this study, a framework of LROC involving three subtasks (one-class instance segmentation (OCIS), multiobject tracking (MOT), and zero-shot fine-grained recognition of 3D objects (ZSFGR3D)) is developed. First, the OCIS module segments all objects on the conveyor. Then, the MOT module tracks the motion trajectory of all objects with the masks obtained by the OCIS module. Finally, the ZSFGR3D module retrieves the segmented objects from the gallery of registered features. Because there are many effective methods for MOT subtasks, we mainly focus on the first and third subtasks in this study.
In the OCIS subtask, the conveyor is the background, and the objects on it all belong to one category named "Goods". OCIS differs from semantic segmentation, in which the segmentation of instances is unnecessary. Because instance segmentation frameworks are designed for multiple categories and are extended from object detection frameworks with many anchors, they often have a deep architecture and suffer from low speed, which hinders them from running on some cost-effective devices [11]. In this study, a level set map neural network (LSMNet) is proposed for the OCIS subtask. LSMNet is inspired by the level set algorithm, which is a traditional active contour method. In the past three years, two other deep-learning-based active contour methods, level set loss [11] and deep snake [12], have been developed and applied in segmentation tasks. Unlike these methods, which add active contour constraints to the segmentation losses to improve the quality of masks indirectly, LSMNet utilizes the level set map (LSM) directly. The annotation of instance segmentation is often in the form of a multichannel binary image, in which each channel is the mask of one object. In the proposed method, we use a single LSM to annotate the image instead of the multichannel binary image, where each peak of the LSM represents one object. LSMNet adopts a semantic segmentation architecture, for instance, a UNet or a pix2pix GAN, to predict LSMs. Once the LSM of an image is accurately predicted, the objects can be localized according to the peaks and their areas. An OCIS dataset of on-conveyor objects is established to validate the performance of LSMNet. An experiment on a subset of COCO2017 [13] is also conducted. The results show that, on big and medium objects, LSMNet achieves an accuracy close to that of You Only Look At CoefficienTs (YOLACT) [14], which is one of the fastest instance segmentation methods, at a much higher speed. On the NVIDIA GTX1660TI GPU, the speed of LSMNet is 80 fps, and that of YOLACT is 25 fps.
For the ZSFGR3D subtask, the registered multiview features are treated as the semantic attributes of every category, the features of old on-conveyor objects as the seen categories, and the features of new on-conveyor objects as the unseen categories. A multiview redundancy-free feature network (MVRFFNet) is proposed to learn the map from multiview features to single-view on-conveyor features based on the GZSL framework proposed by Han et al. [3]. Multiview features of an object are extracted with a pretrained network and combined into a compact registered feature. The registered features are the attributes for training. A Wasserstein generative adversarial network (WGAN) [15] is trained to learn a mapping function that maps the original retrieval features to a new redundancy-free feature space [1].
The proposed method has two advantages. First, LSMNet realizes instance segmentation with a semantic segmentation network by introducing the innovative LSM annotation. It has a much higher speed than YOLACT and a close performance on objects of big and medium size. Second, the redundancy-free feature is introduced into the multiview framework for fine-grained 3D object recognition. The redundancy-free features generated by MVRFFNet perform much better than the original features in the matching task.
This paper is organized as follows. Section 2 reviews the related work. Section 3 presents the main algorithms. Experiments are presented in Section 4. Finally, some conclusions are drawn in Section 5.

Related Works
Instance segmentation is the most difficult task among the four classic visual tasks [16], which include classification, localization, detection, and segmentation. Its target is to obtain the pixel-level segmentation of individual objects, which combines the requirements of semantic segmentation and object detection. Over the past few years, deep learning has yielded a new generation of instance segmentation models with remarkable performance improvements and has resulted in a paradigm shift in the field. There are two types of deep learning models for instance segmentation: two-stage models and one-stage models.
The most important two-stage method is Mask R-CNN [17] proposed by He et al., which is extended from their earlier important work Faster R-CNN [18]. Although two-stage methods have better performance, their computing burden is too heavy to run on some embedded devices. Thus, inspired by the one-stage object detection frameworks [7][8][9][10], researchers have proposed several excellent one-stage methods to reduce the computing burden. Dai et al. [19] and Li et al. [20] designed special fully convolutional networks, together with position-sensitive score maps, to segment instances. Bolya et al. proposed YOLACT [14] and improved it to YOLACT++ [21], which achieved the best balance between speed and accuracy. Xie et al. [22] used a polar mask to annotate the segmentation of an instance, which is a more precise bounding box. CenterMask [23,24] was developed from CenterNet [10] by inheriting the anchor-free ideas. Zhang et al. [25] represented the mask as a two-dimensional vector, which can be combined with the box detection branch. BlendMask [26] combines the top-down and bottom-up methods based on an anchor-free framework.
Zero-shot learning (ZSL), which is one of the typical transfer learning methods, is perhaps the supreme goal of machine learning. For example, if machines could classify new classes accurately [27,28], we could collect labeled data as much as possible for free; if machines could reject samples of unknown classes [29,30], any recognition system would be shielded against outliers. Specifically, the goal of ZSL is to recognize objects of unseen classes, whose labels are not available, by learning high-level semantic information [26,27,30].
In the pre-deep-learning era, researchers focused on the conventional or standard ZSL, in which all test images come from the unseen classes only. Various semantic embedding methods have been developed based on traditional machine learning technologies [27,31,32]. A semantic embedding method learns to embed the original features into a new semantic descriptor space and then predict the classification of features via matching the most similar semantic descriptor.
In the past five years, the more challenging generalized zero-shot learning [3] (GZSL), in which the test set consists of data from both the seen and unseen classes and semantic embedding performs poorly, has attracted increasing attention. In a GZSL task, the training set only contains annotated objects of seen classes, but the test set contains objects from both seen and unseen classes. The extreme data imbalance of GZSL makes semantic embedding methods apt to be highly overfitted to seen classes and fail to classify the unseen classes [27]. Recently, some feature generation methods have been proposed to compensate for the lack of training images of unseen classes in GZSL. Bucher et al. [33] generated features of unseen classes with four different generative models, including generative moment matching network, auxiliary classifier GANs, denoising autoencoder, and adversarial autoencoder. The f-CLSWGAN has also been utilized to generate the unseen features conditioned on the class-level semantic descriptors [34]. Some methods [35,36] introduced reverse regressor networks into the generator network in the form of a cycle-consistent loss or to constrain the feature [37]. Verma et al. [38] designed a variational autoencoder to achieve the same function as f-CLSWGAN. Han et al. [3] proposed a redundancy-free feature generation framework that "limits the information dependence between the mapped features and the original features of the images to an upper bound". In the redundancy-free space, the overfitting problem can be restrained. Besides the generative models, some other innovative methods have also been proposed. Chen et al. [39] designed a semantic-preserving adversarial embedding network to avoid the loss of semantic information. Liu et al. [40] simultaneously calibrated the model confidence of seen classes and the model uncertainty of unseen classes with a special calibration network. 
Inspired by the information bottleneck method [40], an innovative counterfactual framework to balance seen and unseen classifications was proposed by Yue et al. [41].

Framework of the System
The LROC task consists of three subtasks: OCIS, MOT, and ZSFGR3D. As several mature methods can be used to accomplish the MOT subtask, we focus on the first and the third subtasks in this work. The hardware system and the overall architecture of the proposed method are introduced first in this section. Then, LSMNet and MVRFFNet are proposed for the first and third subtasks.
A simple visual recognition system for on-conveyor objects is illustrated in Figure 1. Sequential images are captured by an RGB industrial camera mounted above the conveyor. The frame rate is about 30 fps, and the speed of the conveyor is about 10 cm/s.
The overall architecture of the proposed method is depicted in Figure 2. It consists of three main modules: LSMNet, the MOT tracker, and MVRFFNet. LSMNet generates instance segmentation results for the MOT tracker, and the tracker provides segmented retrieval images for MVRFFNet. Several frames can be chosen for an object to improve its retrieval accuracy. In the inference process, a matching network is used to retrieve the redundancy-free features of the on-conveyor objects from the gallery of registered multiview features. Other metric learning methods, for instance, the cosine or Euclidean distance, can be used to replace the matching network. When new objects occur, we only need to register their multiview features without any additional work.
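The retrieval step described above, with the cosine distance standing in for the matching network, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`cosine_similarity`, `retrieve`, `register`), the dictionary-based gallery, and the averaging of scores over several frames are all assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def register(gallery, name, feature):
    """Add a new category to the gallery; no retraining is involved."""
    gallery[name] = list(feature)


def retrieve(frame_features, gallery):
    """Return the best-matching category for one tracked object.

    frame_features: feature vectors taken from several frames of the object.
    gallery: dict mapping category name -> registered feature vector.
    The score of a category is its mean similarity over the frames
    (a hypothetical choice for combining frames).
    """
    best_name, best_score = None, -2.0
    for name, registered in gallery.items():
        score = sum(cosine_similarity(f, registered) for f in frame_features)
        score /= len(frame_features)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```

Registering a new object only adds one entry to the gallery, which mirrors the statement that new objects require no additional work beyond registration.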

LSMNet
For an image $X$ and its annotation set, $K$ is the maximum number of steps of the eroding operation, $h$ is the largest contour interval, and $\max\{\cdot\}$ returns the maximum element of a 2D matrix.
Denote the final LSM of $X$ as $LSM$, with its element at $(i, j)$ computed from the eroding operation. A group of samples is depicted in Figure 3. Predicting $LSM$ from $X$ is a typical pixel-level image translation task, and any pix2pix architecture can be adopted. In this study, we chose a UNet as the basic architecture. The UNet contains 16 convolutional layers, where the number of convolutional kernels increases from 16 to 512 in the first 8 layers and decreases to 1 in the last 8 layers. The scale of downsampling and upsampling is 2, the input size is 256 × 256 × 3, and the output size of each branch is 256 × 256 × 1. To improve the quality of the LSM, we trained an LSM branch and a semantic segmentation branch simultaneously. The two decoders have the same architecture.
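To make the idea of an LSM annotation with one peak per instance concrete, the sketch below builds such a map from binary instance masks by repeated erosion. It is only one plausible construction and an assumption of this example, not the paper's exact formula: each instance accumulates its successively eroded masks for at most `max_steps` steps (playing the role of $K$), the result is divided by its maximum so that every instance peaks at 1, and the instances are merged with an element-wise maximum. The names `erode`, `instance_level_map`, and `build_lsm` are hypothetical.

```python
def erode(mask):
    """One binary erosion step with a 4-neighbour cross.

    Pixels outside the image are treated as background, so border
    pixels are always removed.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            if (mask[i][j] and mask[i - 1][j] and mask[i + 1][j]
                    and mask[i][j - 1] and mask[i][j + 1]):
                out[i][j] = 1
    return out


def instance_level_map(mask, max_steps):
    """Level map of one instance, normalised so that its peak equals 1."""
    h, w = len(mask), len(mask[0])
    level = [[0] * w for _ in range(h)]
    current = mask
    for _ in range(max_steps):
        if not any(any(row) for row in current):
            break  # the instance has been eroded away completely
        for i in range(h):
            for j in range(w):
                level[i][j] += 1 if current[i][j] else 0
        current = erode(current)
    peak = max(max(row) for row in level)
    if peak == 0:
        return [[0.0] * w for _ in range(h)]
    return [[v / peak for v in row] for row in level]


def build_lsm(instance_masks, max_steps=8):
    """Merge per-instance level maps into one single-channel LSM."""
    h, w = len(instance_masks[0]), len(instance_masks[0][0])
    lsm = [[0.0] * w for _ in range(h)]
    for mask in instance_masks:
        level = instance_level_map(mask, max_steps)
        for i in range(h):
            for j in range(w):
                lsm[i][j] = max(lsm[i][j], level[i][j])
    return lsm
```

With this construction, the value of a pixel grows from the contour of an instance towards its interior, which is what allows a single-channel map to separate instances by their peaks.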
The framework of LSMNet is depicted in Figure 4. In the training process, the LSM loss and the mask loss, together with a consistency constraint loss, are calculated and used to update the backbone. In the inference process, small noisy regions of the predicted LSM are filtered first, and the foreground regions are then segmented by keeping the values greater than a threshold of 0.5. Each region corresponds to an instance, and each instance region is dilated in a ratio proportional to its area.
The total training loss is the weighted sum $l = \omega_1 l_{LSM} + \omega_2 l_{mask} + \omega_3 l_c$, in which $\|\cdot\|$ is the L1 norm, $\otimes$ is the element-wise multiplication, and $\omega_1$, $\omega_2$, and $\omega_3$ are the weights of the sublosses. $l_{LSM}$ and $l_{mask}$ are the L1 losses on the LSM and the mask, respectively, and $l_c$ is an additional constraint that makes the predicted LSM and the predicted mask consistent.
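The inference procedure of LSMNet (thresholding the predicted LSM and treating each remaining region as one instance) can be sketched with a connected-component search. This is a simplified illustration under stated assumptions: 4-connectivity is used, noise is removed after thresholding by dropping regions smaller than a hypothetical `min_area`, and the final area-proportional dilation is omitted.

```python
from collections import deque


def extract_instances(lsm, threshold=0.5, min_area=2):
    """Return a list of instances, each a list of (row, col) pixels.

    lsm: 2D list of floats, the predicted level set map.
    threshold: pixels with a value above it are foreground.
    min_area: regions with fewer pixels are discarded as noise
              (hypothetical parameter, not taken from the paper).
    """
    h, w = len(lsm), len(lsm[0])
    seen = [[False] * w for _ in range(h)]
    instances = []
    for si in range(h):
        for sj in range(w):
            if seen[si][sj] or lsm[si][sj] <= threshold:
                continue
            # Breadth-first search over one 4-connected foreground region.
            region = []
            queue = deque([(si, sj)])
            seen[si][sj] = True
            while queue:
                i, j = queue.popleft()
                region.append((i, j))
                for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if (0 <= ni < h and 0 <= nj < w
                            and not seen[ni][nj] and lsm[ni][nj] > threshold):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            if len(region) >= min_area:
                instances.append(region)
    return instances
```

Because thresholding keeps only the upper part of each peak, the extracted regions are smaller than the true objects, which is consistent with the dilation step that the method applies afterwards.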

MVRFFNet
MVRFFNet consists of two modules: the registering module and the feature mapping module. The former extracts multiview features of an object and combines them into a compact registered feature. The latter uses a WGAN-based framework [3] to bridge the gap between registered features and on-conveyor features. $E$ is trained separately, and $F$ is trained together with the mapping module. The architecture is depicted in Figure 5. In the mapping module, the redundancy-free feature mapping framework [3] is adopted. The structure is depicted in Figure 6. The generator $G$, with the concatenation of a registered feature and a noise vector as its input, generates a synthetic (fake) on-conveyor feature. The noise represents the differences between the registered and on-conveyor features, which include the environmental light, background noise, camera motion, and the change from the multiview modality to the single-view modality. The mapping function $M$ maps the on-conveyor features into a latent space, which is the redundancy-free feature space. $D$ is the discriminator, which is realized in the form of the Wasserstein distance.
The Wasserstein distance is a symmetrical measurement of the difference between two random distributions. $E_2$ is a feature extractor that obtains features from on-conveyor images, and $C$ is a final classifier that predicts the categories of latent features.
Besides the usual fake loss and Wasserstein distance in WGAN, the mutual information (MI) loss based on the Kullback-Leibler divergence and the center loss [3] are also considered when training $G$ and $M$. The training configuration of MVRFFNet is similar to that of [3], except that the attribute vector is replaced with a registering module and $F$ is updated together with $G$.
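The symmetry of the Wasserstein distance mentioned above can be checked on a toy case. The sketch below is not the critic used in a WGAN, which estimates the distance with a learned network over high-dimensional features. It only uses the known closed form for two one-dimensional empirical samples of equal size, where the Wasserstein-1 distance is the mean absolute difference of the sorted values. The function name `wasserstein_1d` and the sample values are made up for the illustration.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D empirical samples.

    For equal-size samples, the optimal transport plan pairs the k-th
    smallest value of xs with the k-th smallest value of ys.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)
```

Swapping the two arguments leaves the result unchanged, unlike the Kullback-Leibler divergence that underlies the MI loss.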

Experiments and Analysis
In this section, we established an OCIS dataset and a fine-grained recognition dataset first, and experiments were conducted on them to validate the performance of LSMNet and MVRFFNet. LSMNet was also tested on the COCO-car dataset, which is a subset of the open dataset COCO2017 [13].

OCIS Dataset and Fine-Grained Recognition Dataset
The distribution of the OCIS dataset is listed in Table 1. The objects in the test set are different from those in the training set. Some groups of samples are depicted in Figures 7 and 8. In addition, we collected 15 video clips of different levels of difficulty. Level 1 means that there is a distance between any two objects, Level 2 means that objects flock without any occlusion, and Level 3 means that there exist overlaps among objects. Due to the smooth surface of the conveyor, the reflection of illumination causes conspicuous noise.
Table 2 presents the distribution of the fine-grained recognition dataset. The objects are the same as those in the training set of the OCIS dataset, as shown in Figure 5. Registered images were captured in a simpler environment, as shown in Figure 3. Each image only contains one object.

COCO-Car Dataset
The COCO dataset is a large-scale open dataset for object detection and instance segmentation tasks. It contains 80 classes in total. We collected all samples that contain car instances to establish the COCO-car dataset. The distribution of COCO-car is presented in Table 3. The average number of instances per image is 3.6.

Results of LSMNet
In this experiment, we used a fully convolutional UNet with skip connections as the basic architecture of LSMNet. The learning rate was set to 0.0002, the batch size was 6, and the weights were $\omega_1 = \omega_2 = \omega_3 = 100$. Cropping, rotation, channel fusion, and color jitter were used to augment the dataset. The curves of the training loss are illustrated in Figure 9. The $l_{LSM}$ curves lie above those of $l_{mask}$, as LSMs are more difficult for the pix2pix network to learn than binary masks. The fluctuations of $l_{LSM}$, $l_{mask}$, and $l_c$ are almost synchronous, which means that the consistency constraint is violated more seriously when the predicted LSM and mask are poor. The fluctuations would be eliminated if a bigger training set were accessible. Because the objects in the test set were new to the model, the loss on the test set was unsurprisingly much larger than that on the training set. In Figure 10, a few samples of the predicted LSMs on the training set are shown. The LSM output suffers from small noisy regions (Samples 1, 3, and 5), which are filtered in the consistency outputs. However, some tiny objects are missed. The performance of LSMNet on small objects is discussed further in the experiment on the COCO-car dataset.
We compared LSMNet with YOLACT [14] to illustrate its performance. It should be noted that we fine-tuned the YOLACT model, which is pretrained on COCO2017 [13]. The results on the videos are listed in Table 4. Because YOLACT is developed based on a one-stage object detection framework, the segmentation results are affected by the detection task. Moreover, because the OCIS dataset is much smaller than COCO2017, YOLACT tends to recall objects that it has already seen in COCO2017. As shown in Figure 11, some objects with large volumes are missed by YOLACT. LSMNet performs very well if there is no crowding. However, when objects adjoin or occlude each other, it is hard to determine the threshold for extracting the peaks of the LSM. Some poor results of LSMNet are presented in Figures 12 and 13.
To analyze the performance of LSMNet, we compared it with SCNet [41] and YOLACT on the COCO-car dataset. They were trained with two NVIDIA RTX2080TI GPUs based on the open framework mmdetection [42], which is developed by SenseTime. The results on COCO-car are listed in Table 5. AP is the mean AP@IoU = 0.50:0.95; APL, APM, and APS are the mean AP@IoU on cars of large, medium, and small size, respectively; and fps is measured on an NVIDIA GTX1660TI GPU. SCNet performed much better than the other two methods due to its introduction of the pregenerated stuffthingmaps. However, its cascade architecture made it much slower than the other two methods. LSMNet performed close to YOLACT on large and medium cars, but much worse on small objects. The reason is that LSMNet extracts pixel-level information first and aggregates it into global semantic information. Noisy regions in the predicted LSM disturbed the segmentation of small objects. As the process of finding contours cannot be realized in the form of a differentiable module, the label information of the bounding boxes cannot be backpropagated to the pix2pix architecture. The performance of LSMNet would be improved if a method could be found to utilize the box information in the training process.
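The metric AP@IoU = 0.50:0.95 reported above averages the AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, following the COCO convention. The sketch below shows only the two building blocks that are simple to state, the IoU of two binary masks and the list of thresholds. It does not compute a full AP, and the helper names `mask_iou` and `thresholds_passed` are assumptions of this example.

```python
# The ten COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [k / 100 for k in range(50, 100, 5)]


def mask_iou(pred, target):
    """Intersection over union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter / union if union else 0.0


def thresholds_passed(iou):
    """Thresholds at which a prediction with this IoU counts as a match."""
    return [t for t in IOU_THRESHOLDS if iou >= t]
```

A prediction that overlaps its target with an IoU of 0.75 counts as correct at six of the ten thresholds, which illustrates why slightly shrunken or noisy masks lower the averaged AP even when the object is found.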

Results of MVRFFNet
With the same setting as [3], we trained MVRFFNet with four different fusing models. The results are listed in Table 6. The performance of the RNN was close to that of the GNN and better than that of a simple max-pooling or average-pooling layer. The training process is depicted in Figures 14 and 15. As seen classes can provide direct information for classification, the accuracy on seen classes reaches 0.3 in the first 10 epochs. The seen and unseen classes reach a balance after 60 epochs and remain stable after 200 epochs. The MI loss descends rapidly in the first several epochs and then smoothly until the end of the training process, while the center loss descends rapidly in the first several epochs and remains stable in the following epochs. This is because the center loss constrains the inner-class distance of the redundancy-free features themselves, while the MI loss imposes an upper bound on the information conveyed between the original and redundancy-free features.
We also trained a binary matching network to predict whether a registered feature and an on-conveyor feature belong to the same object. The results are listed in Table 7. It can be seen that mapping the features into the redundancy-free space improves the matching accuracy significantly.
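The two simplest fusing models compared above, max pooling and average pooling, can be sketched as an element-wise reduction over the view features of one object. This is a minimal illustration with a hypothetical function name `fuse_views`. The RNN and GNN fusing models are learned modules and are not covered here.

```python
def fuse_views(view_features, mode="max"):
    """Fuse per-view feature vectors into one registered feature.

    view_features: list of equal-length feature vectors, one per view.
    mode: "max" for element-wise max pooling, "avg" for average pooling.
    """
    if not view_features:
        raise ValueError("at least one view is required")
    columns = list(zip(*view_features))  # one tuple per feature dimension
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "avg":
        return [sum(col) / len(col) for col in columns]
    raise ValueError("mode must be 'max' or 'avg'")
```

Both pooling variants are invariant to the order of the views, whereas a recurrent fusing model processes the views as a sequence.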

Conclusions
LSMNet and MVRFFNet were proposed in this study for the OCIS and ZSFGR3D subtasks that are involved in the complex LROC task. Experiments were conducted on an OCIS dataset and a fine-grained recognition dataset of objects on a conveyor, respectively. LSMNet achieved a recall accuracy close to that of YOLACT on large and medium objects, while its computing speed on an NVIDIA GTX1660TI GPU was 80 fps, which is much faster than YOLACT's 25 fps. MVRFFNet performed much better than traditional metric learning methods in the retrieval task by mapping the features into a redundancy-free feature space.
Future work will concern four aspects. First, we are going to collect more samples that match the actual situation, because each sample in the current dataset (Figures 9-11) contains only one object and does not cover the cases of crowdedness and occlusion. A more comprehensive analysis of the performance of MVRFFNet will then be carried out. Second, we will extend the pix2pix architecture to multiple output branches and establish an LSM for each category, so that LSMNet would be suitable for multiclass instance segmentation tasks. Third, we will look for a new way of finding contours that can transmit the supervised information of detection to improve the performance of LSMNet on small objects. Finally, we will combine multiview features and semantic attributes to improve the performance further. The motivation is that the multiview registered features serve as the attributes of the GZSL setting in MVRFFNet, and high-level semantic attributes have not been utilized yet.
Data Availability Statement: Not applicable; the study does not report any data.

Conflicts of Interest:
The authors declare no conflict of interest.