As the fields of Artificial Intelligence (AI) and Robotics mature and evolve, an increasing number of hardware and software solutions have become available, reducing the costs and technical barriers of developing novel robotic platforms. Following these advancements, the members of the European Public-Private Partnership in Robotics (SPARC) have defined a set of critical objectives for the 2014–2020 horizon [1
], which all relate to the integration of autonomous agents in the different urban sectors (e.g., Healthcare, Transport, Manufacturing, Commercial). This strategy also aligns with the vision of a Smart City, where robots carry out tasks in urban environments, while at the same time also acting as moving sensors, thus aiding the real-time collection of data about the city operations [2
]. Unlike static sensors, robots can intervene in their surroundings (e.g., by navigating to specific locations, by grasping and manipulating objects, or by dialoguing with users). This characteristic makes them suitable for assisting humans in tasks as diverse as health and safety monitoring [3
], pre-emptive elderly care [4
], and door-to-door garbage collection [5
], just to name a few. To accomplish these tasks, the new generation of robots is expected to make sense of environments which are always changing and evolving (e.g., cities or, at a more local scale, offices, warehouses, households and shops). Equipping robots with effective methods to recognise the different objects they encounter while navigating the environment is thus a crucial prerequisite to sense-making.
Let us consider the case of HanS, the Health and Safety autonomous inspector under development at the Knowledge Media Institute (KMi) [3
]. HanS was conceived to autonomously identify potentially hazardous situations, e.g., the fire hazard caused by a pile of papers sitting next to a portable heater. To identify the threat, HanS would first need to recognise the objects paper
and electric heater
autonomously. Furthermore, it would need to recognise that the two objects are next to each other. Based on that, HanS could then query external knowledge bases, such as ConceptNet [6
] or DBpedia [7
], which provide useful common-sense facts regarding the two objects, like paper is flammable and electric heater is a type of electrical device. With this knowledge, HanS would then be one step closer to verifying that the health and safety rule stating that flammable items must be kept away from ignition sources has been violated.
This example explains why effective object recognition mechanisms are required to interpret the current status of the environment, i.e., to form the robot’s worldview. However, the top-performing object recognition methods available (e.g., [8
]) (i) assume that all target object classes are known beforehand and (ii) require a vast number of annotated training examples to recognise each class. As a result, these methods do not align with the scenario of a learner who refines its worldview incrementally, based on the objects it already knows and with access to only a few instances of the newly-encountered objects (i.e., few-shot learning). In fact, HanS would need to learn how to spot papers and electric heaters from different viewpoints, under different conditions of lighting and clutter. It would also need to recognise new object types (e.g., people and chairs in addition to fire extinguishers and emergency exits) in order to manage different tasks (e.g., checking room occupancy as well as detecting fire hazards).
One way of tackling the few-shot recognition problem is learning how to match newly-incoming objects to their most similar support example [12
]. These approaches are based on Convolutional Neural Networks (CNN) and typically reuse the knowledge acquired on large-scale benchmark datasets, such as ImageNet [14
], to compensate for the lack of training examples. Inspired by the way cognitive imprinting works for humans, Qi et al. [15
] have shown that the very first image available of a new object, if opportunely scaled and L2-normalised, can be used as enhanced weights in a multinomial classification layer, i.e., one comprising three or more classes. However, the weight imprinting strategy has yet to be integrated into a multi-branch CNN. Therefore, we investigate whether state-of-the-art methods for few-shot recognition that use multi-branch CNNs [13
] can be further improved by introducing weight imprinting [15
]. In the case of architectures including binary classifiers [12
], we will instead study the impact of L2-normalising the image embeddings.
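To illustrate the matching principle described above, the following NumPy sketch (all names are hypothetical, and the toy 2-D vectors stand in for CNN-produced embeddings) labels a query by its most similar support example:

```python
import numpy as np

def l2_normalise(x, eps=1e-12):
    """Scale each row vector to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def match_to_support(query_emb, support_embs, support_labels):
    """Label a query by its most similar support example (cosine similarity)."""
    q = l2_normalise(query_emb[None, :])   # (1, d)
    s = l2_normalise(support_embs)         # (n, d)
    sims = (s @ q.T).ravel()               # cosine similarity to each support example
    return support_labels[int(np.argmax(sims))]

# Toy example: one 2-D "embedding" per class
support = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array(["heater", "paper"])
print(match_to_support(np.array([0.9, 0.1]), support, labels))  # → heater
```

In practice the support set would contain a few CNN embeddings per newly-encountered class, rather than hand-crafted vectors.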
Furthermore, the state-of-the-art few-shot image matching methods discussed in [13
] have only been evaluated on specific tasks (e.g., stowing, grasping) and product domains (e.g., Amazon products). However, the recent availability of large-scale repositories of object models, such as ShapeNet [16
], has opened up opportunities to evaluate how these methods scale to different object types, represented independently of any specific product domain or end task. The fact that models in ShapeNet are annotated with respect to the WordNet taxonomy [17
] also facilitates the integration of other knowledge bases in this process, which are linked to the same taxonomy, such as ConceptNet [6
], DBpedia [7
], or Visual Genome [18
]. Based on these premises, we focus on the following contributions.
We test state-of-the-art methods for Deep few-shot image matching on a novel task-agnostic data set, including 2D views of common object types drawn from a combination of ShapeNet [16
] and Google Images.
We compare these against other shallow methods explored as part of our prior work [19
], which are based on colour and shape similarity, to assess whether meaningful features can be transferred after learning on a different image domain, i.e., ImageNet, without any prior feature engineering.
We also evaluate the performance effects of (i) imprinting the weights [15
] of a two-branch CNN [13
], thus extending the framework in [13
] with a novel weight configuration; (ii) applying L2 normalisation to the embeddings of a Convolutional Siamese Network [12
], to include the case of binary classification in our assessment.
2. Background and Related Work
The development of service robots [3
] has become a very active area of research, concurrently with the recent evolution of urban spaces towards the vision of Smart Cities, which integrate different sensors, actuators and knowledge bases for the real-time management of city operations [1
]. This scenario calls for effective methods that can allow robots to make sense of fast-changing environments, presenting unpredictable combinations of known and novel objects.
The problem of real-time object recognition has reached satisfactory solutions [8
] only in experimental scenarios where very large amounts of human-annotated data are available and all object classes are assumed to be predetermined, a setting also known as the closed world assumption [20
]. Naturally, problems such as the paucity of training data or the adaptability to new learning environments are pervasive across all sub-fields of Artificial Intelligence (AI), and not specific to object recognition alone. Efforts devoted to tackling these problems are broadly ascribed to the Lifelong (Continual, or Never-ending) Machine Learning framework [20
]. Lifelong Machine Learning (LML) is the term used to define the AI discipline concerned with the autonomous systems’ capability to (i) improve at learning to perform a single task over time (Incremental Machine Learning), as well as (ii) reuse the acquired knowledge across different tasks [20
]. The challenge of learning incrementally also implies effectively managing the initial stages of learning, i.e., when only a minimal number of training examples are likely to be available. Another objective is thus “learning to learn” (Meta-learning [20
]), by focusing on how a new task can be learned effectively from a minimal number of training examples and, ideally, by producing task-agnostic models, i.e., models which can generalise to many different learning problems.
Most recently, the LML field has seen the rise of several methods based on Artificial Neural Networks (NN), given their promising performance on static tasks [21
]. However, NNs are particularly prone to the issue of catastrophically forgetting previously-learned concepts, as soon as new concepts are introduced [23
]. A typical, static setup usually entails: (i) shuffling the training examples to resemble i.i.d. conditions, (ii) randomly initialising the model parameters, and then (iii) gradually regularising them through successive full passes over the entire data set (epochs). This setup does not scale up to the case of learning in dynamic environments, where models need to be updated incrementally to reflect any changes in the environment, as well as the availability of new data points. These considerations fall under the so-called stability-plasticity dilemma, ascribed initially to biological neural systems [24
] but also affecting Artificial NN design. In essence, a desirable architecture should offer a trade-off between the ability to encode new knowledge and the ability to retain prior knowledge.
In the attempt to achieve this trade-off, many strategies have been proposed [21
], which can be grouped into replay-based methods, dynamic architectures, and prior-focused methods [26
]. Replay-based methods, as the name suggests, rely on jointly feeding the model with: (i) raw or compressed representations of previously-stored examples (i.e., memory replay), and (ii) newly-retrieved examples [25
]. In the second group of methods, the building blocks of the Neural architecture are automatically re-organised at each update [21
]. Lastly, prior-based methods use the model parameter distribution learned on certain tasks and data sets as a prior for new learning tasks [30
]. A common prior-based practice is to first initialise the model parameters with optimal values obtained on larger data collections and to then fine-tune the model [13
]. In the specific context of NN-based object recognition, these broader issues translate into two main challenges: (i) building object representations which can be reused across different tasks and environmental conditions, and (ii) learning such representations even from the few training examples initially available (i.e., few-shot object recognition).
With respect to the first challenge, the vectorised object representations produced by a Neural Network (i.e., embeddings) can be optimised either for their classification among a set of pre-defined classes or by learning a feature space where similar objects are mapped closer to one another than dissimilar objects (a task often referred to as metric learning or image matching). One advantage of the latter training routine is that even novel objects, i.e., objects unseen at training time, can be classified at inference time, based on their nearest neighbours.
Nearest neighbour-based matching can also be related to cognitive studies on prototype-based object categorisation [33
]. According to prototype theory, the cognitive process of categorisation (i.e., the human process of learning how to group objects into categories based on visual stimuli) is achieved by converging towards a prototypical representation of all instances belonging to the same class, through abstraction. This abstract representation appears to be correlated with measures of central tendency within a class, such as the arithmetic mean [34
] or mode [35
]. Once the prototypes are established, both the prototypical objects and the objects bearing the closest resemblance to the prototype (or lying in closest proximity to it, in the case of a feature space) can be recognised almost instantly by humans. This phenomenon is also known, in the visual cognition field, as the prototype effect.
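A minimal sketch of prototype-based categorisation in a feature space, assuming toy 2-D embeddings in place of learned ones (all names are illustrative): each class prototype is the arithmetic mean of its members, and a query is assigned to the nearest prototype.

```python
import numpy as np

def class_prototypes(embeddings, labels):
    """Compute one prototype per class as the arithmetic mean of its embeddings."""
    classes = sorted(set(labels))
    protos = np.stack([embeddings[np.array(labels) == c].mean(axis=0) for c in classes])
    return protos, classes

def nearest_prototype(query, protos, classes):
    """Assign the class whose prototype is closest in Euclidean distance."""
    dists = np.linalg.norm(protos - query, axis=1)
    return classes[int(np.argmin(dists))]

embs = np.array([[1.0, 0.0], [0.9, 0.1],   # class "a"
                 [0.0, 1.0], [0.1, 0.9]])  # class "b"
labels = ["a", "a", "b", "b"]
protos, classes = class_prototypes(embs, labels)
print(nearest_prototype(np.array([0.8, 0.2]), protos, classes))  # → a
```

The choice of the mean as the aggregation function mirrors the correlation with central tendency measures reported in the cognitive literature cited above.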
Deep image matching has been applied successfully to object recognition tasks [12
], including robotic applications [13
]. In [12
] pairs of images belonging to the same dataset, or visual domain, were binary-classified (i.e., as similar/dissimilar) based on the sigmoid activation of the L1 component-wise distance between the two image embeddings representing the pair. These embeddings are produced by two Convolutional Neural Network (CNN) branches, which share the same weights. This characteristic of the proposed architecture, also known as Siamese Network, ensures that: (i) the Network is symmetric with respect to the order chosen to feed the input image pairs, and (ii) predictions are consistent for significantly similar images, since the twin branches are each computing the same function. Furthermore, Zeng et al. [13
] have shown that configuring the two CNN branches to learn weights independently can lead to higher performance than that achieved through Siamese-like architectures when images are matched across two different domains, i.e., robot-collected, real-world images and Amazon products. In the following, we refer to the latter configuration as two-branch Network, to differentiate it from Siamese Networks. The study conducted in [13
] addressed the second challenge (i.e., the paucity of training examples), by using a pre-trained ResNet [37
] for all the Image Matching methods under comparison. In a prior-based fashion, weights were transferred from the less challenging ImageNet dataset [14
], which provides abundant training examples [32
]. The object representations achieved in [13
], however, were obtained in the context of the 2017 Amazon Challenge and were thus tailored to (i) a domain-specific product catalogue (i.e., Amazon’s products), and (ii) a specific end task (i.e., object stowing and grasping).
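The Siamese scoring scheme described above can be sketched as follows, where a single ReLU layer is a stand-in for a full CNN branch and all names and dimensions are hypothetical. Because the two branches share the same weights, the similarity score is symmetric in its inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, W):
    """A stand-in for one CNN branch: a single ReLU layer producing an embedding."""
    return np.maximum(x @ W, 0.0)

def siamese_similarity(x1, x2, W, alpha, b):
    """Sigmoid activation over a weighted L1 component-wise distance between the
    two embeddings. Sharing W across both branches makes the score symmetric."""
    d = np.abs(embed(x1, W) - embed(x2, W))   # component-wise L1 distance
    return 1.0 / (1.0 + np.exp(-(d @ alpha + b)))

W = rng.standard_normal((8, 4))
alpha = rng.standard_normal(4)
b = 0.0
x1, x2 = rng.standard_normal(8), rng.standard_normal(8)
# Symmetry check: swapping the inputs leaves the score unchanged
print(np.isclose(siamese_similarity(x1, x2, W, alpha, b),
                 siamese_similarity(x2, x1, W, alpha, b)))  # → True
```

A two-branch variant in the style of [13] would instead maintain two independent weight matrices, one per input domain, at the cost of losing this symmetry.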
The recent availability of pre-segmented and annotated large-scale image collections (e.g., [14
]) has provided new benchmarks to evaluate object categorisation methods. These data sets are task-agnostic in nature and comprise common, everyday objects. Compared to everyday object collections such as MS-COCO [39
] or NYU-Depth V2 [38
], the ShapeNet 3D model base [16
] was organised with respect to the WordNet taxonomy [17
], thus facilitating the further integration of other open, large-scale knowledge bases adhering to the same taxonomy, like ConceptNet [6
], DBpedia [7
], Visual Genome [18
] and others. Already used for learning object intrinsics [40
], and 3D scene understanding [41
], ShapeNet has never been used, to our knowledge, for training multi-branch Convolutional architectures on the task of few-shot object recognition. We made a first attempt at capitalising on the visual knowledge provided with ShapeNet as part of our prior work [19
], where we explored different colour-based and shape-based shallow representations to match ShapeNet models by similarity. In what follows, we further extend this investigation to assess whether the Deep representations learned by similarity matching on ShapeNet can: (i) outperform the previously-explored shallow representations, as well as (ii) generalise to new object classes.
All the reviewed image matching methods, on the other hand, require that a support set of examples is provided, at inference time, for each new query image. Motivated by this limitation, Qi et al. [15
] have proposed a method to use each newly-incoming query image directly as classification weights in a single-branch CNN. This weight initialisation strategy was named weight imprinting by analogy with how cognitive imprinting develops in humans, who can recognise new objects from the very first exposure, drawing connections from their prior knowledge. Similarly, the image embeddings produced by a CNN, if opportunely scaled and L2-normalised, can be directly added to the weight matrix of the classification layer, as soon as the first image of a novel object type is observed. Crucially, they found that the top performance could be achieved when weight imprinting was combined with a further fine-tuning of the CNN. Moreover, averaging the embeddings by class before applying them as weights proved beneficial in cases where more than one exemplar per class is available. In our view, this result is an interesting fit with the prototype theory of cognition [33
] and, more specifically, with the existence of a correlation between central tendency measures and prototype formation [34
]. Inspired by these results, we aim to investigate whether embedding normalisation, of which weight imprinting can be seen as a special case, can aid the formation of more effective “prototypes” in the context of Deep metric learning. Weight imprinting was initially designed as an alternative to multi-branch metric learning and, therefore, has never been integrated within two-branch Networks. We investigate whether these within-class averaged representations can also improve fine-tuning in the case of the two-branch architectures of [13
], combining the benefits of both approaches. In the context of binary (or binomial) classification, where weight imprinting cannot be applied by design, we rely on the more general L2 normalisation of the input embeddings.
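The imprinting procedure summarised above can be sketched in a few lines of NumPy (a simplified illustration with hypothetical names and dimensions; in practice the embeddings come from a CNN backbone and the model is subsequently fine-tuned): the few-shot embeddings of the new class are averaged, L2-normalised, optionally scaled, and appended as a new row of the classification weight matrix.

```python
import numpy as np

def imprint(weight_matrix, new_class_embeddings, scale=1.0):
    """Imprint a new class: average its few-shot embeddings, L2-normalise the
    result, scale it, and append it as a new row of the classification weights."""
    proto = np.mean(new_class_embeddings, axis=0)
    proto = proto / (np.linalg.norm(proto) + 1e-12)
    return np.vstack([weight_matrix, scale * proto])

# A classifier that initially knows 3 classes over 5-D embeddings
W = np.eye(3, 5)
few_shots = np.array([[0.0, 0.0, 0.0, 1.0, 0.2],   # two exemplars of a new class
                      [0.0, 0.0, 0.0, 0.8, 0.0]])
W = imprint(W, few_shots)
logits = W @ np.array([0.0, 0.0, 0.0, 1.0, 0.1])   # query resembling the new class
print(int(np.argmax(logits)))  # → 3 (the imprinted class wins)
```

Note how, for L2-normalised query embeddings, each logit reduces to a cosine similarity against the imprinted row, which is what makes the strategy a special case of nearest-neighbour matching in a normalised feature space.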
6. Results and Discussion
The results obtained in [19
], which are also summarised in Table 3
, although they improved on random label assignment in all configurations, were not sufficient to discriminate between different object classes. After transferring weights from a ResNet architecture, pre-trained on abundant natural images depicting different objects [14
], we obtained a significant improvement in the object recognition accuracy (as shown in Table 3
). Indeed, introducing Deep CNNs and transfer learning helped to separate the feature space. Given the promising results obtained even with the baseline solution, we included weights learned on ImageNet in all the other tested models as well.
The test results of all experimented solutions on both known and novel object classes are summarised in Table 4
. The baseline NN feature space was already 80% accurate in the mixed scenario, but performance degraded to sub-optimal levels when fine-tuning the alternative solutions derived from [12
] and [13
]. Thus, we can speculate that, with the few examples available, it was challenging to further separate the starting embedding space, likely because it included semantically-related class pairs (e.g., electric heaters and radiators, books and paper documents). It is also worth noting that the F1 score obtained when applying the baseline to the unseen classes is higher than in the case of known objects. This evidence may be related to how the object classes assigned to the novel objects were clustered within the initial feature space transferred from ImageNet. One precaution we followed during the data preparation phase was splitting the classes shared with ImageNet equally across the known and novel sets (see Section 4
). Nonetheless, the feature space transferred from ImageNet carried over a set of biases which, on average, made the novel classes inherently easier to separate than the known classes. For instance, books
and paper documents both belong to the known object group but represent highly-related concepts, both semantically and visually. The same behaviour can be observed when looking at the F1 scores of both Siamese-based methods. Even further, in the case of SiamResNet50, the results obtained on known and novel objects were equal to the baseline NN, suggesting that the learned feature space had not diverged much from the initial one. The fact that L2norm L1 SiamResNet50 also performed better on novel objects than on seen categories can be explained by considering that both ablations share the same backbone layers and loss function. The introduction of the embedding L2-normalisation is, in fact, their most differentiating trait.
There were no significant differences, in terms of training times, among the methods that involved re-training (i.e., all except the baseline NN). On the largest data set (SNG-20) and with the hardware configuration described in Section 5.2
, all the experimented methods converged in less than 1.5 h of training. In terms of computational complexity, the Imprinted K-net required an additional iteration over the data to compute the average class embeddings. However, we verified that this process could be run efficiently on the GPU, in 0.258 s for the largest data set.
Overall, the Siamese-like architecture derived from [12
] led to higher performance than the N-net and K-net alternatives. This result seems to verify our initial hypothesis, i.e., that weight sharing is more suitable for treating images belonging to the same visual domain, and further supports the findings in [13
]. The improvement may also have been caused by the different distance function imposed in the two cases. Introducing weight imprinting in the K-net architecture ensured that we could reach near-perfect precision on the known object classes, at the expense of performance on the novel objects. Ultimately, applying L2-normalisation to the embeddings of a Siamese ResNet50 derived from [12
], before the L1 component-wise distance layer, led to the most robust performance, on both known and novel objects.