Zero-Shot Recognition Enhancement by Distance-Weighted Contextual Inference

Abstract: Zero-shot recognition (ZSR) aims to perform visual classification by category in the absence of training samples. Most traditional ZSR models focus on using semantic knowledge about familiar categories to represent unfamiliar categories, relying only on the visual appearance of an unseen object. In this research, we consider not only visual information but also context to enhance the classifier's cognitive ability in a multi-object scene. We propose a novel method, contextual inference, that uses external resources such as knowledge graphs and semantic embedding spaces to obtain similarity measures between an unseen object and its surrounding objects. Using the intuition that close contexts involve more related associations than distant ones, distance weighting is applied to each piece of surrounding information with a newly defined distance calculation formula. We integrated contextual inference into traditional ZSR models to calibrate their visual predictions, and performed extensive experiments on two different datasets for comparative evaluations. The experimental results demonstrate the effectiveness of our method through significant enhancements in performance.


Introduction
Demand for expanding the scale of categories available for object recognition has grown with the rapid increase in the sizes and types of image data and the recent success of large-scale recognition systems [1]. However, manually constructing additional annotations and retraining existing image classifiers through supervised learning is impractical and costly, which limits the scalability of existing systems. To alleviate that limitation, a variant of transfer learning, zero-shot learning (ZSL), has drawn the attention of the computer vision community [2][3][4][5].
ZSL is inspired by the human capacity to recognize objects without any visual samples using background knowledge already read or heard. The key is to transfer semantic knowledge about familiar (seen) objects to imagine unfamiliar (unseen) objects. For instance, a human can easily recognize an unseen zebra based on visual experiences with a horse and a watermelon, if it is known that a zebra looks like a horse with stripes on its body. In the same way, the objective of ZSL methods is to increase the cognitive capability of a visual classifier by using annotated training sets of seen class labels and external knowledge about the semantic relations between seen and unseen categories to allow the classifier to infer the class labels of novel objects. In this context, external knowledge is generally represented non-visually using attributes [6,7], semantic embedding [8], and knowledge graphs [9,10]. To transfer knowledge, zero-shot recognition (ZSR) assumes that visually similar objects tend to also be semantically similar, which implies that the vector representations of their class labels are close. Most existing ZSL methods thus focus on learning to recognize inherent visual features (e.g., color, shape, and texture) and providing a map between the visual and semantic representations.
However, the correlation between visual and semantic information is not always assured. For example, the unseen object, monitor, may be confused with a frame or a window because it has a square, edged shape with inside contents. Thus, existing ZSR models have critical limitations because they rely on visual information about a single unseen object. In contrast, when people try to identify an unseen object, they naturally refer to the circumstances surrounding the object, such as other objects and their relative positions, as well as the visual characteristics of the target object. In addition, it is common to infer more from relatively close objects than from more distant objects (e.g., the toothpaste beside a toothbrush and not the mirror in the scene). When modeling a zero-shot classifier, the correct label monitor could be inferred as appropriate when the surroundings suggest an office environment through the appearance of seen objects such as a keyboard and a notebook. That is, the classifier should use surrounding information to determine the type of object appropriate in a given environment.
In this paper, we use those intuitions to propose a novel ZSR method that leverages context based on similarity measurements and distance weighting between a target unseen object and surrounding objects. We aim to enhance the performance of existing instance-level ZSR that relies only on individual visual information about the target. To identify the context, our method uses cognitive information about the surrounding objects and obtains their similarity information for the target object using three different measures on a knowledge graph, a semantic embedding space, and both together. Moreover, distance weighting is applied to each piece of similarity information to focus on nearby surrounding objects by defining a distance calculation formula. To evaluate the effectiveness of our method, we adopted several existing ZSL models as baselines and performed extensive experimental evaluations on two different datasets containing small and large numbers of target categories.
The main contributions of this work are as follows. (1) We propose an advanced ZSR method that refers to similarity-based contextual information in a multi-object scene to alleviate the dependence of traditional methods on visual information about an unseen object. (2) With the intuition that nearby objects have more reliable relationships with the target object than distant objects, we newly formulate a distance calculation between the objects' bounding boxes to enable distance-weighted contextual inference. (3) Our method maintains modularity and can be integrated with any instance-level zero-shot classifier because it does not require an additional training process for the contextual inference. (4) Extensive experiments on two datasets with different target category scales show that our system offers performance enhancements compared with existing instance-level and context-aware ZSL models, often by a large margin.
This paper is structured as follows. Section 2 outlines related works and sets baselines for the evaluation. We detail our method for contextual inference and distance weighting in Section 3. Section 4 presents our experiments and results, and Section 5 gives our conclusions.

Instance-Level ZSL
Existing ZSL models differ in their use of semantic embedding spaces or other external knowledge sources. Early models used manually constructed attribute spaces [11][12][13][14][15][16][17] to represent categories as binary vectors that implied the presence of attributes. In general, the use of attributes has seemed promising [18][19][20][21] in various research fields, including ZSL, but it has limited scalability due to domain dependency and the cost of manual construction. To relax those limitations, semantic word-vector spaces that are automatically trained on a textual dataset have been used in more recent ZSL works [6,22,23,24,25]. The word-vector spaces from text corpora, such as word2vec [26] and GloVe [27], motivated the use of large-scale ZSR with many unseen categories because they are unrestricted and less costly than manually annotated attributes. Some ZSL works have used knowledge graphs instead [9,28,29,30]. In particular, some recent works [9,30] based on a graph convolutional network (GCN) [31] have used the WordNet [32] taxonomy to propagate classifier weights from seen to unseen categories.
Most existing ZSL works make instance-level inferences about an individual unseen object by transferring external knowledge from seen categories based on visual similarities, and they have shown promising prediction results in test sets with a specific type of categories such as Caltech-UCSD Birds (CUB) [33], Stanford Dogs (Dogs) [34], and Animals With Attributes (AWA) [35]. However, despite many attempts of ZSR with large-scale datasets such as ImageNet [36], which contain various types of categories for generic objects, instance-level ZSL works have produced comparatively poor performance, which implies that different types of categories have irrelevant semantic characteristics despite their visual similarities. Therefore, we consider not only visual prediction about an unseen object but also its surrounding information to determine the most likely category.

Contextual Recognition
Many works [37][38][39][40][41][42][43][44] have emphasized the importance of context and tried to enhance recognition or detection using context as an additional resource. However, they are inappropriate for ZSL, in which supervised learning cannot be applied. Exploiting context in a ZSR task is still challenging and not well standardized. Only a few recent works have proposed that a ZSR task be aware of context. For instance, the authors of [45] leveraged visual context and the geometric relationships between multiple objects using a conditional random field. The authors of [46] presented a method based on conditional likelihoods that combined three independent models of contextual, visual, and prior information. Both of those works complement our research, but we consider visual candidates of surrounding objects as potential contexts based on similarity measurements and apply distance-weighting according to the positional differences of the objects. In other words, we propose a method that effectively exploits contextual information, and we validated our proposed model by comparing it with several instance-level ZSL models [6,9,22,23] and one context-aware model [45] through the experiments described in Section 4.

Problem Definition
Let $C$ be a set of class labels $c$ that is split into two disjoint subsets, $S = \{s_m\}$ and $U = \{u_n\}$, that, respectively, denote a set of $M$ seen class labels and a set of $N$ unseen class labels. Under the setting of the ZSR task, the two subsets of labels satisfy $S \cap U = \emptyset$ and $S \cup U = C$. Each class label has its representative semantic embedding vector $e$ in a semantic embedding space $E \subseteq \mathbb{R}^{d_e}$.
For training, a labeled dataset of images $I_s = \{(i_m, s_m)\}$ is given, in which each image is represented by a $d_i$-dimensional feature vector, $i_m \in \mathbb{R}^{d_i}$, and a class label, $s_m \in S$. A test dataset $I_u = \{(i_n, u_n)\}$ is provided for testing, in which $i_n \in \mathbb{R}^{d_i}$ and $u_n \in U$. In general, the goal of ZSL is to learn a classifier $f$ to produce the correct class label for an unseen individual image.

Model Overview
Traditional ZSR methods classify an unseen object using only its individual visual information. In contrast, our proposed model aims to classify an unseen object with the help of contextual information obtained from its surroundings in a multi-object scene. We define potentially related surrounding objects under the following assumptions: (1) they represent non-unseen classes and are not the targets of our recognition task; (2) they are detected by a pre-trained object detector or classifier before the ZSR of the target object; and (3) each one includes predicted candidates and corresponding prediction probabilities. Consequently, the surrounding information (SI) of an unseen image feature $i$ with multiple surrounding objects is specified by the following equations,
$$si = \{(c_k, p_k) \mid 1 \le k \le N_{cand}\}, \quad (1)$$
$$SI(i) = \{si_j \mid 1 \le j \le N_{sobj}\}, \quad (2)$$
where $si$ denotes the surrounding information for a single surrounding object and $SI$ consists of $N_{sobj}$ such $si$ entries. $c_k$ and $p_k$ are the class label of the $k$-th predicted candidate of a surrounding object and the prediction probability of the corresponding candidate $c_k$, respectively. In particular, $c_k \in S$, and $p_k$ is a softmax value of the prediction score among $N_{cand}$ candidates. Our proposed model takes an unseen image feature $i$ and its surrounding information $SI(i)$ as its inputs and predicts the most likely class label $c$ as its output. Specifically, let the aforementioned classifying function be $f : I \to E$ between a visual feature space $I \subseteq \mathbb{R}^{d_i}$ and a semantic embedding space $E \subseteq \mathbb{R}^{d_e}$ for class labels. The classifying function $f$ outputs the semantic embedding vector $e$ that maximizes the scoring function $F$ as follows,
$$f(i, SI(i)) = \arg\max_{e \in E} F(i, e, SI(i)), \quad (3)$$
and our model finally produces a prediction for the corresponding class label $c$ of the output semantic embedding vector.
For the process of ZSR, as shown in Figure 1, our model acts on two collaborating branches, the instance-level visual inference of an unseen object and the contextual inference with surrounding information. The visual inference basically follows the process of traditional ZSL models: the extraction of a feature vector from a target unseen image and the prediction of a visual score for each target class label using a zero-shot classifier. The contextual inference, a novel process, performs similarity measurements between each of the target class labels and the visual candidates of the surrounding objects to obtain a contextual score. Similarities are measured using a knowledge graph and/or a semantic embedding space. The prediction of the visual inference is calibrated and advanced by the result of the contextual inference. More detailed processes and formulations are described in Sections 3.3 and 3.4. Consequently, the scoring function $F$ for a specific class label can be specified by combining the instance-level visual inference function $G$ and the contextual inference function $H$ as follows, where $\alpha$ is a balancing factor for a usage ratio between the visual and contextual scores:
$$F(i, e, SI(i)) = \alpha \cdot G(i, e) + (1 - \alpha) \cdot H(e, SI(i)). \quad (4)$$
Figure 1. The architecture of our proposed model with an example in an office environment. The model performs Zero-Shot Recognition (ZSR) on two main branches: the instance-level visual inference and the contextual inference. The visual inference first infers the class label of the extracted feature vector $i_1$ of the target object, which is a process used by existing methods. A distance-weighted, similarity-based calibration is then performed between the target and its surrounding objects to refer to the contextual information, which is the novel process we propose.
As shown in the illustrative, hypothetical example in Figure 1, an office environment contains several objects including the target unseen object (orange bounding box), a monitor, and its surrounding objects (blue bounding boxes). The visual inference first predicts the orange-boxed image as a frame, which has the highest rank due to its square appearance and raised edge, with a monitor, which is the correct label, ranked third. Using only visual information about a target object can thus lead to failed prediction. Therefore, we exploit additional context information from the surrounding objects to calibrate the visual prediction result. Most of the surrounding objects are ranked as likely to be computer devices, such as a keyboard and a mouse, by the pre-trained image classifier for seen class labels. However, an unwanted object in the context, such as a hat, could disturb the contextual inference. Under the assumption that closer objects are more relevant than more distant objects, the contextual classifier gathers more information from the keyboard and mouse than from the hat. That proper use of surrounding information enables the classifier to correctly rank the target as a monitor that is "computer-related", square-shaped, and edged. We experimentally validate this overall intuition in Section 4.
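The calibration in Equation (4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the score values are hypothetical, chosen only to mirror the office example in which the visual branch ranks frame first and the context pushes monitor to the top.

```python
def combined_score(visual, contextual, alpha):
    """F = alpha * G + (1 - alpha) * H for a single class label."""
    return alpha * visual + (1.0 - alpha) * contextual


def predict(visual_scores, contextual_scores, alpha=0.5):
    """Return the class label whose combined score F is highest."""
    combined = {
        label: combined_score(visual_scores[label], contextual_scores[label], alpha)
        for label in visual_scores
    }
    return max(combined, key=combined.get)


# Hypothetical scores (assumed values, not taken from the paper).
visual_scores = {"frame": 0.50, "window": 0.30, "monitor": 0.20}
contextual_scores = {"frame": 0.10, "window": 0.15, "monitor": 0.75}

visual_only = predict(visual_scores, contextual_scores, alpha=1.0)
calibrated = predict(visual_scores, contextual_scores, alpha=0.5)
```

With `alpha=1.0` the contextual branch is ignored and the sketch falls back to the purely visual prediction, which is one way to read the role of the balancing factor.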

Visual Inference
The visual inference process aims to classify an unseen object using only its individual visual feature as input. As a result of that process, the classifier predicts a visual score for each unseen class label as output. In general, classifiers use a model trained with fully supervised learning for seen class labels or an indirectly trained model based on semantic relations between seen and unseen class labels. Because the main concept of this paper is the additional use of context, we adopt several existing ZSR models [6,9,22,23] for the visual inference function $G : I \to E$ and use them as baselines in the comparison of models with and without the contextual inference in Section 4. The $G$ functions used by the baseline methods to measure the visual scores are specified in Table 1.
Table 1. Visual inference functions of baseline zero-shot recognition (ZSR) models.
Structured Joint Embeddings (SJE) [6] trains an image-semantic embedding matrix $W \in \mathbb{R}^{d_i \times d_e}$ and infers a class label of an unseen object by measuring the distance between an embedded image feature of the object and unseen semantic embeddings. Latent Embedding (LatEm) [22] tries to relax the limitation of linearity in SJE by using multiple image-semantic embedding matrices. LatEm proposes a nonlinear compatibility function with $K$ indexes over the latent choices.
The convex combination of semantic embeddings (ConSE) [23] uses a pretrained image classifier trained on seen class labels with full supervision. For the inference, softmax-output values $p$ of the classifier for an input unseen image are used to create its representation vector by weighting the semantic vectors of seen class labels $e_c$. An unseen class label for the semantic vector nearest to the representation vector is then predicted as an appropriate class label.
A multi-layer GCN model [9] begins by training an image classifier in the same manner, but it uses classifier weights from the image classifier as ground-truth to learn predicted classifier weights with semantic embeddings for class labels and their adjacency matrix as input. At test time, the visual inference for the GCN conducts a dot-product estimation between an image feature vector $i$ and a predicted classifier weight $\hat{w}$ for unseen class labels.
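The ConSE inference step described above can be sketched as follows. This is a simplified illustration under stated assumptions: it uses every supplied seen-class probability (renormalized to sum to one) rather than a top-T subset, it uses cosine similarity as the nearness measure, and the two-dimensional vectors and label names are toy values that do not come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def conse_embed(seen_probs, seen_vectors):
    """Convex combination of seen-class vectors weighted by normalized probabilities."""
    total = sum(seen_probs.values())
    dim = len(next(iter(seen_vectors.values())))
    rep = [0.0] * dim
    for label, p in seen_probs.items():
        for j, x in enumerate(seen_vectors[label]):
            rep[j] += (p / total) * x
    return rep


def conse_predict(seen_probs, seen_vectors, unseen_vectors):
    """Pick the unseen label whose vector is closest to the combined representation."""
    rep = conse_embed(seen_probs, seen_vectors)
    return max(unseen_vectors, key=lambda u: cosine(rep, unseen_vectors[u]))


# Toy data (assumed): softmax outputs over two seen classes.
seen_vectors = {"horse": [1.0, 0.0], "tiger": [0.0, 1.0]}
seen_probs = {"horse": 0.6, "tiger": 0.4}
unseen_vectors = {"zebra": [1.0, 1.0], "donkey": [1.0, 0.0]}

rep = conse_embed(seen_probs, seen_vectors)
label = conse_predict(seen_probs, seen_vectors, unseen_vectors)
```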

Contextual Inference
When encountering an unseen object, humans unconsciously refer to not only its visual appearance but also the surrounding environment and the types and relationships among nearby objects to infer its identity. We propose a novel approach that derives a contextual score based on the associations between an unseen object and its surrounding objects, and calibrates the prediction results from the visual score of existing ZSR methods. In particular, the associations are obtained through similarity measures that use external knowledge sources. Furthermore, a distance calculation formula ensures that nearby surrounding objects are more important to the contextual score than distant objects.

Similarity-Based Association Measurements
To grasp the context of an unseen object, we determine associations through similarities between the class label of the target object and those of its surrounding objects. Note that we assume that the surrounding objects are recognized before beginning ZSR for the unseen object, as explained in Section 3.2. When measuring similarities for contextual inference, we consider not only the top-1 classified label of the surrounding objects, but also all lower-ranked labels of their candidates. This helps to alleviate the problem of treating a misclassified surrounding object as the representative context and allows our system to consider multiple potential contexts. External knowledge sources (a knowledge graph, a semantic embedding space, and both together) are used for three different similarity measurements: semantic similarity (SM), cosine similarity (CS), and the harmony of both (HM), respectively.

Semantic Similarity Measurement
The SM metric is defined over documents and is based on the likeness of conceptual meanings [47]. In this paper, we use a hierarchical knowledge graph (an ontology) representing the hierarchical concepts of objects for the SM measurement. Various measures [48] can be used with a knowledge graph, such as path and depth measures, information content-based measures, feature measures, and hybrid measures. We adopt three typical path and depth-based measures for our experiments, as presented in Section 4.2.2. An SM-based association, $S_{SM}$, between one unseen class label and a single surrounding object, is given by
$$S_{SM}(e, si) = \max_{1 \le k \le N_{cand}} p_k \cdot SM(c_e, c_k), \quad (5)$$
where the inputs are a semantic embedding $e$ of an unseen class label $c_e$ and the surrounding information $si$ for a single surrounding object. $c_k$ denotes a class label of the $k$-th predicted candidate for the surrounding object. Each SM value between $c_e$ and $c_k$ is weighted by $p_k$, a prediction probability for the candidate that implies the reliability of the measured similarity. An association's output is the maximum of the weighted similarities over the candidates because finding the best combination of an unseen class label and all candidates for all surrounding objects is equivalent to finding and combining the best candidate for each surrounding object.
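The probability-weighted maximum over candidates can be sketched as below. The similarity function is passed in as a parameter so that any knowledge-graph measure could be plugged in; the lookup table used here is a stand-in with made-up values, not an actual WordNet measurement.

```python
def sm_association(target_label, candidates, similarity):
    """Max over candidates (c_k, p_k) of p_k * SM(target_label, c_k)."""
    return max(
        (p * similarity(target_label, c) for c, p in candidates),
        default=0.0,  # no candidates: no contextual evidence
    )


# Hypothetical similarity table standing in for a knowledge-graph measure.
TOY_SM = {("monitor", "keyboard"): 0.5, ("monitor", "piano"): 0.2}


def toy_similarity(a, b):
    """Look up an assumed similarity value; unknown pairs score 0.0."""
    return TOY_SM.get((a, b), 0.0)


# One surrounding object with two predicted candidates and their probabilities.
candidates = [("keyboard", 0.6), ("piano", 0.4)]
score = sm_association("monitor", candidates, toy_similarity)
```

Here the lower-ranked candidate still contributes a weighted value, but only the best weighted candidate determines the association.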

Cosine Similarity Measurement
To measure associations, we next use CS, which is available in semantic embedding spaces. Various types of semantic embedding spaces are used in ZSR, such as manually annotated attributes [11,12], text descriptions of images [49], word embeddings [26,27], and RDF graph embeddings [50]. Among those, the attribute space represents fine-grained concepts of each class label with a binary value depicting the presence/absence of an attribute, based on human-annotated descriptions (e.g., for attributes horse-like, stripe, and green, $e_{zebra} = [1, 1, 0]$, $e_{tiger} = [0, 1, 0]$, and $e_{watermelon} = [0, 1, 1]$, respectively). The word embedding space contains vectors learned by a neural network with a certain feature dimension according to the mutual frequency of words in context in the text corpus (e.g., because the words "monkey" and "banana" often occur together, the magnitude and direction of $e_{monkey}$ and $e_{banana}$ can be similar). In terms of ZSR performance, embedding spaces based on manual construction such as attributes and text descriptions of images are generally more effective than a word embedding space [6,12,51,52]. However, a word embedding space constructed in an unsupervised manner still has higher versatility and utility than the other options because it costs much less and enables large-scale recognition with many class labels [53]. We thus use word embedding spaces as the semantic embedding space $E$ for the evaluation, although our proposed model works independently of the type of semantic space. A CS-based association, $S_{CS}$, is measured by calculating the cosine of two vectors, as follows,
$$S_{CS}(e, si) = \max_{1 \le k \le N_{cand}} p_k \cdot \cos(e, e_{c_k}). \quad (6)$$
The overall metric is similar to Equation (5) except for the use of external knowledge and the similarity measurement. $e_{c_k}$ denotes a semantic embedding vector for $c_k$ in the semantic embedding space $E$.
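A minimal sketch of the CS-based association follows. The embeddings are two-dimensional toy vectors and the helper names are hypothetical; a real setting would use the word vectors described in the experimental section.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def cs_association(e, candidates, embeddings):
    """Max over candidates (c_k, p_k) of p_k * cos(e, e_{c_k})."""
    return max(
        (p * cosine(e, embeddings[c]) for c, p in candidates),
        default=0.0,
    )


# Toy embeddings (assumed values).
embeddings = {"keyboard": [0.8, 0.6], "hat": [0.0, 1.0]}
e_monitor = [1.0, 0.0]
candidates = [("keyboard", 0.5), ("hat", 0.5)]

score = cs_association(e_monitor, candidates, embeddings)
```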

Harmonic Similarity Measurement
As a harmonic approach, we combine the association results of the SM and CS, which implies the use of both a knowledge graph and a semantic embedding space when referring to the surrounding information. As a harmonic association, $S_{HM}$ is specified simply with a balancing factor on the two measurements $S_{SM}$ and $S_{CS}$.
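One plausible reading of a balancing factor on the two measurements is a convex combination, sketched below. This form is an assumption: the symbol `beta`, its default value, and which term it multiplies are all hypothetical. Some knowledge-graph measures, such as lch, are not confined to the range 0 to 1, so a rescaling step may be needed before mixing them with cosine values.

```python
def hm_association(s_sm, s_cs, beta=0.5):
    """Assumed convex combination: beta * S_SM + (1 - beta) * S_CS."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * s_sm + (1.0 - beta) * s_cs


# Example with the assumed form: equal weight on both associations.
balanced = hm_association(0.3, 0.4, beta=0.5)
sm_only = hm_association(0.3, 0.4, beta=1.0)
```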

Distance-Weighted Calibration for Multiple Surrounding Objects
Associations for all surrounding objects are eventually synthesized to derive the contextual inference to calibrate the results of the visual inference. Using the intuition that objects near a target object offer better context than objects farther away, the contextual inference applies a distance-weighted average of associations rather than a simple average. We assume that the bounding boxes $b = (x, y, w, h)$ of the objects, which are needed to calculate distances, are given, where $(x, y)$ is the center point of a bounding box and $w$ and $h$ denote its width and height, respectively. However, we do not simply use the Euclidean distance between center points. In Figure 2, the Euclidean distances between the two objects in panels (a,b) are the same, but the actual relative distances in panel (b) are much smaller. By generalizing all the related cases, we define a distance calculation equation that relaxes the Euclidean distance with the size of two objects.
A distance is calculated when center points are not exactly the same, and it is fixed to 0 otherwise. When calculating the distance, the application of the Euclidean distance is adjusted to the average width and height of two objects, where the width and height are referenced using the ratio of the x-axis and y-axis intervals between the objects.
As previously explained, because the context of close objects should be exploited more, the reciprocal of the distance is considered as a weight. By weighted-averaging all obtained associations, the set of $S(e, si_j)$ and the corresponding weights are combined to derive the contextual score for an unseen class label as follows,
$$H(e, SI(i)) = \frac{\sum_{j=1}^{N_{sobj}} r_j \cdot S(e, si_j)}{\sum_{j=1}^{N_{sobj}} r_j}, \quad (9)$$
where $r_j = 1/(d_j + \epsilon)$, $d_j$ is the distance between the target object and the $j$-th surrounding object, and $\epsilon$ is a smoothing factor whose value is fixed at 0.001 in the experiments.
Figure 2. The Euclidean distances are the same in (a,b) for each case, but the actual distances seem remarkably different. The upper and lower cases of (b) are likely to show an "on" and "attached" relation, respectively, which represents the definite nearness of the objects.
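The reciprocal-distance weighting can be sketched as below. The distances are taken as precomputed inputs because the size-relaxed distance formula itself is not reproduced in this sketch; the association and distance values are hypothetical.

```python
def contextual_score(associations, distances, eps=0.001):
    """Weighted average of associations with weights r = 1 / (d + eps)."""
    if len(associations) != len(distances):
        raise ValueError("one distance is required per association")
    if not associations:
        return 0.0  # no surrounding objects: no contextual evidence
    weights = [1.0 / (d + eps) for d in distances]
    weighted_sum = sum(w * s for w, s in zip(weights, associations))
    return weighted_sum / sum(weights)


# Assumed example: a strongly associated nearby object and a weakly
# associated distant one (for instance, a keyboard versus a hat).
associations = [0.8, 0.1]
distances = [1.0, 9.0]

weighted = contextual_score(associations, distances)
plain_average = sum(associations) / len(associations)
```

When all distances are equal, the weights cancel and the result reduces to the plain average, which is the variant used as the comparison point in an ablation without distance weighting.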

Overall Experimental Scenario
The main task in the experiments is to predict an appropriate class label for an unseen object from among target class labels by using its visual information and the contextual information of surrounding objects in a multi-object scene. Recall that we assume that the surrounding objects are detected and recognized by an existing object detector, which is pretrained only on non-unseen class labels and never on unseen ones. YOLOv2 [54], which corresponds to the pretrained image classifier in Figure 1, is mainly used as the object detector and is trained for 80 categories on the COCO dataset [55]. Among those categories, the ones that share concepts with the unseen class labels are excluded from detection. The namespace of the rest is matched and anchored to that of a knowledge graph and semantic embedding space (presented in the following subsections). Moreover, confidence scores provided by YOLOv2 are used as the reliability values $p$ in the aforementioned equations, but only for detections whose scores are higher than the fixed threshold of 0.3.
We conduct two experiments with different dataset scales and types of class labels: (1) experiments on ImageNet categories with fewer unseen class labels, and (2) experiments on Visual Genome categories based on the split in [45,56], with relatively more unseen class labels and a larger-scale test set.
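The filtering of detector outputs described above can be sketched as follows. The detection tuples, label names, and the `filter_detections` helper are hypothetical and do not reflect the actual YOLOv2 output format; only the 0.3 threshold and the exclusion of overlapping categories come from the text.

```python
def filter_detections(detections, excluded_labels, threshold=0.3):
    """Keep (label, confidence) pairs above the threshold and not excluded."""
    return [
        (label, conf)
        for label, conf in detections
        if conf > threshold and label not in excluded_labels
    ]


# Assumed detector output for one scene.
detections = [("keyboard", 0.9), ("tv", 0.8), ("hat", 0.2), ("mouse", 0.3)]
# Assumed category that shares a concept with an unseen class label.
excluded = {"tv"}

kept = filter_detections(detections, excluded)
```

The comparison is strict, so a detection whose confidence is exactly 0.3 is dropped in this sketch.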

Dataset
In the first experiment, we use the ImageNet dataset [36] and Visual Genome (VG) dataset [57] for training and testing, respectively. Each image in ImageNet contains an annotation for only one category, whereas an image in VG represents a multi-object situation and has multiple annotations.
We use ImageNet (ILSVRC) 2012 1K, which is composed of 1K class labels and more than 1.2 million images for training, but we consider only 944 class labels with available semantic embeddings as seen classes, with 1,211,266 training images that are used to train some of the aforementioned baseline models for the visual inference. The weight matrix in SJE and 6 weight matrices (i.e., K = 6) in LatEm are trained for 150 epochs with a learning rate of 0.001. For the GCN, we use the same training settings as [9], which are 6 convolutional layers with an output channel D of 1024. The classifier weights and image features are L2-normalized for the GCN, and the ADAM [58] optimizer is used with 300 epochs, a learning rate of 0.001, and a weight decay of 0.0005. The adjacency matrix in the GCN is constructed based on the WordNet knowledge graph and considers sibling classes as adjacent classes as well. More details about the knowledge graph are provided in the next subsection.
To adopt 34 unseen class labels for testing, we define and observe four criteria: (1) they are selected among 360 categories in ImageNet 2010 1K that are disjoint from ImageNet 2012 1K, which is similar to the settings in [53], (2) they are categories for generic objects, (3) they have available semantic embeddings, and (4) they are also annotated in VG with more than 20 image instances. Additionally, we randomly sample a maximum of 200 image instances per class label, which indicates that some might have fewer than 200. Thus, we have 4720 test images in VG for 34 unseen class labels.

Visual/Semantic Embeddings and Knowledge Graphs
For the visual inference in this experiment, we use the entire model of Inception-V1 [57,58] on the ConSE baseline model and 1024-dimensional outputs from the top-layer pooling units of Inception-V1 on the other baselines to extract visual features of image instances, which is the same technique used in other recent ZSL research [49,59]. We consider Inception-V1 to be reasonably appropriate for our experimental setting because it is pre-trained on the ImageNet 2012 1K dataset, so that all of our seen class labels are involved but our unseen class labels are not. Z-score normalization is applied to each dimension of the extracted image features, and the mean and standard deviation values of the training visual features are reused to normalize the testing visual features.
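The per-dimension z-score normalization with training statistics can be sketched as below. The use of the population standard deviation and the small `eps` guard against division by zero are assumptions of this sketch, and the feature values are toy numbers.

```python
import math


def fit_zscore(train_features):
    """Per-dimension mean and (population) standard deviation of training features."""
    n = len(train_features)
    dim = len(train_features[0])
    means = [sum(row[j] for row in train_features) / n for j in range(dim)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in train_features) / n)
        for j in range(dim)
    ]
    return means, stds


def apply_zscore(features, means, stds, eps=1e-8):
    """Normalize features with statistics that were fitted on the training set."""
    return [
        [(x - m) / (s + eps) for x, m, s in zip(row, means, stds)]
        for row in features
    ]


# Toy features (assumed): statistics come from training data only.
train = [[1.0, 10.0], [3.0, 30.0]]
test = [[5.0, 20.0]]

means, stds = fit_zscore(train)
test_normalized = apply_zscore(test, means, stds)
```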
Our approach uses a semantic embedding space to measure the CS between the class labels of a surrounding object and a target object. Recall that word vectors are considered to be a semantic embedding space E in this evaluation. In this experiment, we use a skip-gram model [26] trained on Wikipedia English in February 2015 with a window size of 10. The dimension of the word vectors is set to 1000, which is a column dimension of embedding matrices for baselines. All the word vectors are L2 unit-normed, and all class labels are anchored to their own semantic embedding. Specifically, each class label that consists of multiple words has a representative semantic embedding that is the averaged word vector for all the words.
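The construction of a representative embedding for a multi-word class label can be sketched as follows. The word vectors are two-dimensional toy values, and re-normalizing the averaged vector to unit length is an assumption of this sketch rather than something stated in the text.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]


def label_embedding(label, word_vectors):
    """Average the vectors of all words in the label, then unit-normalize."""
    vectors = [word_vectors[w] for w in label.lower().split()]
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    return l2_normalize(mean)


# Toy unit-normed word vectors (assumed values).
word_vectors = {"traffic": [1.0, 0.0], "light": [0.0, 1.0]}

embedding = label_embedding("traffic light", word_vectors)
```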
For the SM measurement, a sub-graph of WordNet [32] is used in the proposed model. WordNet is a large-scale hierarchical database with more than 100K English words. The concepts in WordNet are represented as synset IDs, which is the same as in the ImageNet dataset and allows us to integrate the namespace of the class labels into the form of synset IDs. We adopt three SM metrics for WordNet provided by the NLTK library (path, lch, and wup) from the work in [48]. In particular, for path, NLTK converts the path distance into a similarity value that ranges from 0 to 1. We conduct a performance evaluation separately for each metric, as presented in the next subsection.
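To make the path and wup measures concrete, the sketch below computes them on a tiny hand-made taxonomy. The taxonomy is a toy tree invented for illustration (WordNet itself allows multiple parents), and the formulas follow the common definitions: path similarity as 1 / (1 + shortest path length) and wup as 2 * depth(lcs) / (depth(a) + depth(b)), with the root at depth 1. In practice, NLTK exposes these on WordNet synsets as `path_similarity`, `lch_similarity`, and `wup_similarity`.

```python
# Toy taxonomy (assumed): child -> parent, with the root mapped to None.
PARENT = {
    "entity": None,
    "device": "entity",
    "monitor": "device",
    "keyboard": "device",
    "clothing": "entity",
    "hat": "clothing",
}


def chain(node):
    """Return [node, parent, ..., root]."""
    out = []
    while node is not None:
        out.append(node)
        node = PARENT[node]
    return out


def lowest_common_ancestor(a, b):
    """First ancestor of b (walking upward) that is also an ancestor of a."""
    ancestors_a = chain(a)
    for node in chain(b):
        if node in ancestors_a:
            return node
    raise ValueError("nodes do not share a root")


def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    lca = lowest_common_ancestor(a, b)
    dist = chain(a).index(lca) + chain(b).index(lca)
    return 1.0 / (1.0 + dist)


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(chain(node))


def wup_similarity(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b))."""
    lca = lowest_common_ancestor(a, b)
    return 2.0 * depth(lca) / (depth(a) + depth(b))
```

On this toy tree, a monitor is closer to a keyboard (shared parent) than to a hat (shared root only) under both measures.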

Results
The experimental evaluation is performed under two ZSL settings: classic and generalized. In the classic setting, only unseen class labels are considered as target categories for prediction; in the generalized setting, both unseen and seen class labels are considered. Note that the ground-truths of the test image instances represent only unseen class labels. The performance of each model is evaluated in terms of the average per-class accuracy, "per-class", and the overall accuracy for all instances, "per-instance". Note that the accuracy values are expressed as percentages in all our experiments.
We apply the proposed contextual calibrations to the visual inferences of four baseline models for validation, giving seven types of evaluation per baseline model: three SMs (path, lch, and wup), CS, and three HMs, one for each of the three semantic similarities.
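The difference between the two accuracy measures can be sketched as below; the labels and predictions are hypothetical and chosen so that an imbalanced class makes the two numbers diverge.

```python
from collections import defaultdict


def per_instance_accuracy(y_true, y_pred):
    """Overall accuracy over all test instances, as a percentage."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies, as a percentage."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return 100.0 * sum(correct[c] / total[c] for c in total) / len(total)


# Assumed toy results: class "a" has three instances, class "b" has one.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "a", "c"]

instance_acc = per_instance_accuracy(y_true, y_pred)
class_acc = per_class_accuracy(y_true, y_pred)
```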

Comparative Evaluation to Baselines
Table 2 shows the results of our comparison with the baselines. All the prediction models using any type of the contextual inferences significantly outperform the baseline models in both settings. In most cases, the contextual inferences with harmonic measures are more effective than the others. 16.36 (40%), and 20.55 (54%) in the classic setting, respectively, and 3.31 (95%), 3.31 (324%), 3.31 (670%), and 4.54 (53%) in the generalized setting, respectively. The enhancements in the generalized setting are mostly higher than those in the classic setting, which indicates that the influence of contextual inference increases as the number of target categories increases. That in turn implies that the surrounding objects do have a common relationship, and thus, the context induces predictions of relevant types of categories among the various types of the entire categories.
CS shows better performance than SM in the classic setting but poor performance in the generalized setting. In several evaluations, it has no effect at all. We attribute that difference to the properties of the external knowledge sources. A semantic embedding space is trained from a text corpus in a sub-symbolic manner; thus, it contains less intuitive information, such as concept relations or category types, than a symbolic knowledge graph, which explains the relatively poor effectiveness of CS when the number of categories is large.
Moreover, for SJE, LatEm, and ConSE, the calibrated prediction performances are mostly the same in the generalized setting, which indicates that the visual inference results are not referred to in the predictions in those cases. In other words, our experiments have verified the importance of considering contextual information.

• Top-n Result
The top-n evaluation is conducted on the HM path method, which shows generally flat performance in the previous hit@1 evaluation (Table 3). Our proposed method enhances performance compared with the baselines even in the top-n evaluation. As n increases, the rate of performance improvement tends to decrease slightly. In particular, because the performance of the ConSE baseline at top-n is considerably higher than at top-1, its improvement rate becomes lower. In other words, context plays a particularly important role in giving the correct category the highest ranking. Moreover, the significant effectiveness of similarity-based contextual inference is validated in the top-n prediction results.
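As a minimal sketch of the top-n (hit@n) measure, the snippet below counts an instance as correct when its ground-truth label appears among the n highest-scoring categories. The function name and the scores are hypothetical.

```python
def top_n_accuracy(score_rows, truths, n):
    """Percentage of instances whose ground truth is among the n best-scored labels."""
    hits = 0
    for scores, truth in zip(score_rows, truths):
        ranked = sorted(scores, key=scores.get, reverse=True)
        hits += int(truth in ranked[:n])
    return 100.0 * hits / len(truths)

# Hypothetical visual scores for one target whose ground truth is "chair".
score_rows = [{"bathtub": 0.5, "toilet": 0.3, "chair": 0.2}]
truths = ["chair"]
print(top_n_accuracy(score_rows, truths, 1), top_n_accuracy(score_rows, truths, 3))
```

Because a correct label that is ranked second or third already counts as a hit for larger n, a calibration that only moves it to the first rank changes hit@1 but leaves hit@n unchanged, which is consistent with the lower improvement rates observed as n grows.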

• Ablation Study
We conduct an additional experiment to validate whether a nearby object is highly relevant to the target object. The ablation study is performed by excluding from the full model the module that applies the defined distance calculation. For the contextual inference, the HM path calibration is applied with a plain (unweighted) average instead of the distance-weighted average in Equation (9). As seen in Table 4, even with only the similarity measurement, our proposed approach outperforms the existing models by a large margin. In most cases, however, the full model performs better than the ablated one, which confirms that distance weighting boosts performance and supports the validity of our defined distance calculation.
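The difference between the two averaging schemes compared in the ablation can be sketched as follows. This is not the distance formula or Equation (9) of this paper: the weight 1 / (d + eps) on a normalized center distance, the helper names, the boxes, and the similarity values are all assumptions chosen only to show how down-weighting distant objects changes the aggregated context score.

```python
import math

def center(box):
    """Center of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def normalized_distance(target_box, other_box, image_size):
    """Center-to-center distance divided by the image diagonal (assumed measure)."""
    (tx, ty), (ox, oy) = center(target_box), center(other_box)
    width, height = image_size
    return math.hypot(tx - ox, ty - oy) / math.hypot(width, height)

def plain_average(sims):
    """Ablated variant: every surrounding object counts equally."""
    return sum(sims) / len(sims)

def distance_weighted_average(sims, dists, eps=0.1):
    """Full variant (sketch): closer objects get larger weights 1 / (d + eps)."""
    weights = [1.0 / (d + eps) for d in dists]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)

# Illustrative values: two related nearby objects and one unrelated distant one.
sims = [0.45, 0.42, 0.11]   # hypothetical similarities to the candidate label
dists = [0.10, 0.15, 0.70]  # hypothetical normalized distances to the target

print(plain_average(sims), distance_weighted_average(sims, dists))
```

With these numbers the weighted score is higher than the plain average, because the distant, weakly related object contributes less to the context.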

Dataset
VG contains more than 100K images, each of which has 35 object categories on average, and it is separated into two subsets: part-1, with about 60K images, and part-2, with about 40K images. In this experiment, we use categories and images in VG for both training and testing.
Our main goal in this experiment is to validate our method against both the baseline model and a recently proposed context-aware ZSR model [45] that has already been evaluated on the VG dataset using the same setting as in [56]. We thus adopt the same split of seen and unseen class labels used in [56], with 478 seen class labels and 130 unseen class labels, which gives a larger number of unseen class labels than in the first experiment. Similar to the work in [45], we use 55,038 images with 621,770 instances from part-1 of the VG dataset for training and 7818 images with 33,921 instances from part-2 for testing. The exact number of images differs slightly from that work, but the difference is considered to be within tolerance.

Visual Classifier and Semantic Embedding Space
Following the same experimental setting, we use ConSE [23] as the baseline model for the comparative evaluation, which requires prediction results from a visual classifier to obtain a visual score for the target objects. We fine-tune ResNet-50 pretrained on ImageNet 2012 1K (without freezing the convolutional layers), setting the dimension of the output layer to the number of seen class labels. The SGD optimizer is used for fine-tuning for approximately 240,000 iterations, and the learning rate, momentum, weight decay, and batch size are set to 0.001, 0.9, 0.0001, and 8, respectively.
For the semantic embedding space in this experiment, we apply 300-dimensional GloVe [27] pretrained on Wikipedia 2014 and Gigaword 5 [60]. Each seen and unseen class label has its own corresponding semantic embedding or a representative semantic embedding, just as in the first experiment. Note that we also use the same knowledge graph, WordNet, and the same SM measurements.
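As a minimal sketch of how a CS value between two class labels can be obtained from such an embedding space, the snippet below computes the cosine similarity of two vectors. The three-dimensional toy vectors stand in for 300-dimensional GloVe vectors, and averaging word vectors to form a representative embedding for a multi-word label is an assumption made here for illustration, not a procedure confirmed by this section.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise mean; an ASSUMED way to build a representative embedding."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical toy embeddings (real GloVe vectors have 300 dimensions).
embedding = {
    "chair": [1.0, 0.0, 0.0],
    "sofa": [1.0, 1.0, 0.0],
    "fan": [0.0, 0.0, 1.0],
    "traffic": [0.0, 1.0, 0.0],
    "light": [0.0, 1.0, 1.0],
}
traffic_light = mean_vector([embedding["traffic"], embedding["light"]])

print(cosine_similarity(embedding["chair"], embedding["sofa"]))
print(cosine_similarity(embedding["chair"], embedding["fan"]))
```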

• Evaluation Result
This experiment matches the experimental configuration of Context-Aware (CA) ZSR [45] as closely as possible, but the reproduced performance of the ConSE baseline differs slightly because of minor differences in the datasets, the fine-tuning of the visual classifier, and the word embedding space.
We thus evaluate performance in terms of the improvement rate, using the results given in Table 5. Recall that the numerical values represent accuracy in percentage. In the top-1 evaluation of the classic setting, CA reports 19.6 (improvement rate of −1.51%) per-class and 30.2 (9.02%) per-instance against baseline performances of 19.9 and 27.7, respectively, whereas our proposed method achieves 15.34 (2.33%) per-class and 32.62 (3.16%) per-instance against baseline performances of 14.99 and 31.62. Compared with CA, our method thus improves per-class performance, whereas CA degrades it, but its per-instance improvement is smaller. In the top-1 evaluation of the generalized setting, our method produces relatively low enhancement (0.05-0.27 per-class and 0.29-0.81 per-instance) compared with that of CA (0.1-5.8 and 0.6-20.7). In the top-5 evaluations, our absolute per-instance performance of 33.58 in the generalized setting outperforms CA's 29.4, although the performance of the ConSE baseline is lower than that of CA.
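The improvement rates quoted above can be checked with a short calculation. The sketch assumes that the rate is the relative change of the calibrated accuracy with respect to the baseline accuracy, in percent; this definition is an assumption, although it reproduces the reported values up to rounding in the last digit. The dictionary keys are labels introduced here for illustration.

```python
def improvement_rate(baseline, calibrated):
    """Relative change with respect to the baseline accuracy, in percent."""
    return 100.0 * (calibrated - baseline) / baseline

# (baseline, calibrated) top-1 accuracies in the classic setting, from the text.
results = {
    "CA per-class": (19.9, 19.6),
    "CA per-instance": (27.7, 30.2),
    "ours per-class": (14.99, 15.34),
    "ours per-instance": (31.62, 32.62),
}

for name, (baseline, calibrated) in results.items():
    print(f"{name}: {improvement_rate(baseline, calibrated):+.2f}%")
```

Using a relative rate rather than the absolute accuracy makes the two systems comparable even though their reproduced baselines differ.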
Consequently, the proposed method offers a fair performance improvement in classifying the appropriate category compared with the existing method, and it more often gives a high ranking to categories related to the category type of the target object, as shown by the results of the first experiment. This is further validated through a qualitative evaluation of actual examples, which is detailed in the qualitative analysis below.

• Detector Comparison
In the previous experiment, surrounding information is obtained by detecting and recognizing surrounding objects using YOLOv2 on 80 categories. However, of those 80 categories, we use the results from only the 65 that are disjoint from VG's unseen categories. That limited reference to surrounding objects, drawn from a fairly small number of categories, could limit the performance improvement. To check this, we conduct an additional experiment with two alternative sources of surrounding information: another pretrained detector that covers a large number of categories, YOLOv2-9K [54], and the ground-truths for seen class labels.
YOLOv2-9K produces detections above the confidence threshold of 0.1 for 8955 of its 9K categories, namely those that have an available semantic embedding and are disjoint from the unseen categories. For both YOLOv2-80 and YOLOv2-9K, we exclude detections whose intersection-over-union with the bounding box of the target object is 0.8 or higher.
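The filtering of detections can be sketched as follows. The box format (x1, y1, x2, y2), the dictionary layout of a detection, and the function names are assumptions for illustration; the thresholds of 0.1 for confidence and 0.8 for intersection-over-union are taken from the text, although the text states the confidence threshold only for YOLOv2-9K.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def filter_surrounding(target_box, detections, conf_threshold=0.1, iou_threshold=0.8):
    """Keep confident detections that do not coincide with the target box."""
    return [
        det for det in detections
        if det["conf"] > conf_threshold and iou(target_box, det["box"]) < iou_threshold
    ]

# Hypothetical target box and detector output.
target = (0.0, 0.0, 10.0, 10.0)
detections = [
    {"label": "sofa", "conf": 0.7, "box": (0.0, 0.0, 10.0, 9.0)},     # IoU 0.9: dropped
    {"label": "table", "conf": 0.6, "box": (20.0, 20.0, 30.0, 30.0)}, # kept
    {"label": "fan", "conf": 0.05, "box": (40.0, 0.0, 50.0, 10.0)},   # low confidence
]
kept = filter_surrounding(target, detections)
print([det["label"] for det in kept])
```

Dropping detections that overlap the target almost entirely prevents the detector's own guess for the target object from being counted as its context.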
Ground-truths for seen class labels, GT S, are applied as surrounding information with a fixed confidence by exploiting the annotated bounding boxes of the seen class labels in the test images. The value of the prediction probability p in Equation (2) is fixed to 1.0 for GT S. To alleviate the noise caused by high-frequency objects in the ground-truths, such as the category window, we set the system to reference only one randomly selected object per class label.
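A minimal sketch of this selection step is given below, assuming that annotations are dictionaries with a label and a box; the data layout and the function name are hypothetical. Each seen class label contributes exactly one randomly chosen annotated object, and its probability is fixed to 1.0.

```python
import random

def sample_gt_surroundings(annotations, seen_labels, rng=None):
    """Pick one random annotated object per seen class label, with p fixed to 1.0."""
    rng = rng or random.Random()
    by_label = {}
    for ann in annotations:
        if ann["label"] in seen_labels:  # unseen labels are never used as context
            by_label.setdefault(ann["label"], []).append(ann)
    return [
        {"label": label, "box": rng.choice(group)["box"], "p": 1.0}
        for label, group in by_label.items()
    ]

# Hypothetical annotations of one test image.
annotations = [
    {"label": "window", "box": (0, 0, 5, 5)},
    {"label": "window", "box": (10, 0, 15, 5)},
    {"label": "window", "box": (20, 0, 25, 5)},
    {"label": "roof", "box": (0, 10, 30, 20)},
    {"label": "sky", "box": (0, 20, 30, 40)},  # treated as unseen in this example
]
surroundings = sample_gt_surroundings(annotations, {"window", "roof"}, random.Random(0))
print(surroundings)
```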
The evaluation results for the different sources of surrounding objects are presented in Table 6. The performance of YOLOv2-9K is generally poor because of its tendency to return detection results for higher-level categories or abstract concepts such as whole, instrumentality, and creation. That tendency causes confusion because the wrong surrounding information is referenced. GT S outperforms the models to which the other detectors are applied in most cases, which means that the contextual inference is well harmonized with the visual inference in ZSR. However, there is no significant gap between the performance of YOLOv2-80 and that of GT S. In other words, our method is easy to apply in practice to existing detectors and does not require accurate surrounding information, because it can reference the potential classification candidates. Furthermore, it is likely to perform even better with a detector that reliably recognizes a wide variety of categories.
Table 6. A comparative evaluation of the different sources of surrounding information with the HM path method.

• Qualitative Analysis
We also use GT S-based surrounding information for the qualitative evaluations to clearly analyze the effects of contextual inference. The HM path method on ConSE is applied for the contextual inference, with the balancing parameters α and β set to their optimized values on GT S. Figure 3 depicts several qualitative experimental results. The upper three and lower two results show the positive and negative effects of our method, respectively. In the first example, the ground-truth class label of the target object (orange box) is chair. The visual inference ranks the chair-like categories bathtub and toilet first and second, respectively, with chair third. By referring to the nearby objects table and sofa, which are related to a living room, the contextual inference calibrates the prediction results to rank chair first. Another surrounding object, fan, which is far from the target, does not adversely affect the positive calibration because its degree of reference is reduced by distance weighting. This supports our assumption that related objects are likely to be located nearby. In fact, the CSs of sofa and table with chair in GloVe are both above 0.4, whereas that of fan is approximately 0.11. Similar positive effects are confirmed in the second and third examples. In the second example, our method lifts the correct class label, flower, which is not even in the top-5 of the visual-only prediction, into the top-3 by using the contexts close to the target, leaf and vase. In the third example, which has little useful surrounding information, the rank of collar is slightly increased by the closest object, dog, with our distance formula.
Although the overall performance is enhanced by the positive calibration of contextual inference, it still has some negative effects, particularly for general categories that are not tied to specific objects. For example, the ground-truths writing and sky, both ranked fourth by the visual inference, are excluded from the top-5 and re-ranked to ninth and sixth, respectively. The tableware objects surrounding writing negatively affect the prediction and produce high rankings for tableware-related categories such as coffee and chair. Sky, which appears in most outdoor images regardless of the particular environment, generally does not have a semantic relation with a specific object. In the last example, the ground-truth category is undervalued by the contextual inference, and the structure-related categories window and building are overvalued instead, because of the low similarities between sky and its surrounding objects, roof and train. However, our results generally validate that our method based on contextual inference positively affects ZSR, as shown by the previous evaluation results.
Figure 3. Qualitative examples evaluated on HM path in the classic setting using the ground-truths of seen categories instead of the pretrained detector. The orange and blue boxes indicate the target unseen objects and ground-truth surrounding objects, respectively.

Conclusions
We have proposed a novel approach to ZSR that enhances the performance of existing ZSL methods. Our method uses surrounding information as context by measuring the similarity of each surrounding object to the candidate categories and applying distance-weighted averaging with a defined distance calculation formula to calibrate the visually predicted results. We performed experimental evaluations with various combinations of similarity measures to validate the comparative performance of our proposed method on two different datasets with ImageNet and Visual Genome categories. Our experimental results demonstrate that our method enhances performance by a large margin compared with existing methods. The ablation study and the detector comparison verified the effectiveness of distance weighting and the potential practicality of our method, respectively. Future research could consider the topological relationships among objects in an image and semantic embeddings optimized for ZSR using the annotations of training images as sources.