Journal of Imaging
  • Article
  • Open Access

15 December 2022

A Framework for Enabling Unpaired Multi-Modal Learning for Deep Cross-Modal Hashing Retrieval

Department of Computer Science, School of Science, Loughborough University, Loughborough LE11 3TT, UK
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Advances and Challenges in Multimodal Machine Learning

Abstract

Cross-Modal Hashing (CMH) retrieval methods have garnered increasing attention within the information retrieval research community due to their capability to deal with large amounts of data, thanks to the computational efficiency of hash-based methods. To date, the focus of cross-modal hashing methods has been on training with paired data. Paired data refers to samples with one-to-one correspondence across modalities, e.g., image and text pairs where the text sample describes the image. However, real-world applications produce unpaired data that cannot be utilised by most current CMH methods during the training process. Models that can learn from unpaired data are crucial for real-world applications such as cross-modal neural information retrieval where paired data are limited or not available to train the model. This paper (1) provides an overview of CMH methods when applied to unpaired datasets, (2) proposes a framework that enables pairwise-constrained CMH methods to train with unpaired samples, and (3) evaluates the performance of state-of-the-art CMH methods across different pairing scenarios.

1. Introduction

Information retrieval refers to obtaining relevant information from a dataset when prompted by a query. When a dataset comprises samples from different modalities such as text, images, video, and audio, this field is known as Multi-Modal Information Retrieval (MMIR). Research towards MMIR methods is gaining considerable interest due to the expansion of data, resulting in a need for efficient methods capable of handling large-scale multi-modal data [,,,,]. Cross-Modal Retrieval (CMR) is a sub-field of MMIR which focuses on retrieving information from one modality using a query from another modality. An example of CMR is retrieving images when using text as a query and vice versa. These image-to-text and text-to-image tasks are the focus of CMR within this paper.
Where retrieval speed and storage space are considered top priorities, Cross-Modal Hashing (CMH) networks have recently been favoured over other traditional retrieval methods due to the computational efficiency and compactness that comes with the binary representations produced by hash-based methods [,]. CMH networks are typically constructed with parallel text and image modules that work in unison to learn the objective hash function. Through the use of the hash function, image and text samples can be mapped to a joint hash subspace []. Within this subspace, similarity comparisons can be made when prompted by a query to rank data points by relevance for the top results to be provided as the result of the retrieval task.
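To make the retrieval step concrete, the sketch below shows how ranking in such a joint hash subspace could be performed with NumPy once binary codes have been produced; the function name and toy data are illustrative assumptions and are not tied to any particular CMH method discussed here.

```python
import numpy as np

def hamming_rank(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Rank database hash codes by Hamming distance to a query code.

    query_code: (k,) array of {0, 1} bits (e.g., k = 64).
    db_codes:   (n, k) array of {0, 1} bits for the retrieval set.
    Returns database indices sorted from most to least similar.
    """
    # Hamming distance = number of differing bits.
    distances = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(distances)

# Toy usage: a 64-bit text query ranked against 1000 image codes.
rng = np.random.default_rng(0)
query = rng.integers(0, 2, size=64)
database = rng.integers(0, 2, size=(1000, 64))
top_10 = hamming_rank(query, database)[:10]  # indices of the 10 closest codes
```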
Recent state-of-the-art CMH methods include Deep Cross-Modal Hashing (DCMH) [], Adversary Guided Asymmetric Hashing (AGAH) [], Joint-modal Distribution based Similarity Hashing (JDSH) [] and Deep Adversarial Discrete Hashing (DADH) []. Jiang et al. [] proposed DCMH, an end-to-end deep learning framework for CMH that applies a deep learning approach to feature learning and hash generation. As opposed to previous CMH methods that employed a relaxation of the discrete learning problem into a continuous learning problem, hash code generation within DCMH was learnt directly without relaxation. DCMH laid the foundation for forthcoming methods such as AGAH, proposed by Gu et al. [], which, through an adversarial approach to learning and the introduction of three collaborative loss functions, strengthened the similarities of similar pairs while disassociating dissimilar pairs. Liu et al. [], with the proposed JDSH network, employed a joint-modal similarity matrix to preserve semantic correlations and exploit latent intrinsic modality characteristics. Further, the Distribution-based Similarity Decision and Weighting (DSDW) module was proposed as part of JDSH to generate more discriminative hash codes. Bai et al. [], with the adversarial-based method DADH, maintained semantic consistency between original and generated representations while making the generated hash codes discriminative.
The data most often used during training of information retrieval networks are paired, i.e., there is one-to-one or one-to-many correspondence between the text and image samples being used. However, such paired data are not always present in real-world settings and are often constructed specifically for training machine learning models. Unpaired data, where no relationship is given between text and image samples, are a common real-world scenario that is not currently accounted for by many proposed CMH methods. Once unpaired samples are introduced into the data being used, pairwise-constrained methods cannot process them in their baseline configuration and could therefore be unsuitable for real-world use cases where the available data are unpaired.
This paper proposes a framework that facilitates the use of unpaired data samples for the training of CMH methods. The proposed framework can be employed to enable CMH methods to include unpaired samples in the learning process. The contributions of this paper are as follows.
  • A comprehensive overview of CMH methods, specifically in the context of utilising unpaired data. The current state of CMH is surveyed, the different pairwise relationship forms in which data can be represented are identified, and the current use, or lack, of unpaired data is discussed [,,]. Existing surveys, however, do not provide an overview of CMH methods applied to unpaired data. The aspects which bind current CMH methods to paired data are also discussed.
  • A new framework for Unpaired Multi-Modal Learning (UMML) to enable training of otherwise pairwise-constrained CMH methods on unpaired data. Pairwise-constrained CMH methods cannot inherently include unpaired samples in their learning process. Using the proposed framework, the MIR-Flickr25K and NUS-WIDE datasets are adapted to enable training of pairwise-constrained CMH methods when datasets contain unpaired images, unpaired text, and both unpaired image and text within their training set.
  • Experiments were carried out to (1) evaluate state-of-the-art CMH methods using the proposed UMML framework when using paired and unpaired data samples for training, and (2) provide an insight as to whether unpaired data samples can be utilised during the training process to reflect real-world use cases where paired data may not be available but a network needs to be trained for a CMR task.
This paper is organised as follows: Section 2 surveys the current state of CMH methods and unpaired data usage. Section 3 provides an overview of the proposed UMML framework. Section 4 discusses the datasets, the experiment methodology and the evaluation metrics used. Section 5 describes experiments that have been conducted using state-of-the-art CMH methods employing the proposed UMML framework for training across various data pairing and unpairing scenarios.

3. UMML: Proposed Unpaired Multi-Modal Learning (UMML) Framework

To make CMH methods compatible with the processing of unpaired samples, a suitable approach must be employed to include unpaired samples in the datasets to be used. This is necessary because CMH methods are often not compatible with the processing of unpaired samples, as is the case with DADH, AGAH and JDSH, discussed in Section 2.5. The input to the methods tested is image-text pairs, and as such, the methods do not allow a single unpaired sample to be used for training. To tackle this issue, the Unpaired Multi-Modal Learning (UMML) framework is proposed, illustrated in Figure 4. The implementation of this framework is provided (https://github.com/MikelWL/UMML, accessed on 9 December 2022).
Figure 4. Unpaired Multi-Modal Learning (UMML) framework workflow. The diagram shows an example in which 50% of the images are unpaired, i.e., 50% of the text Bag of Words (BoW) binary vectors are emptied. Similarly, in the case of text being unpaired, the image feature matrices would be emptied (CNN: convolutional neural network).
The unpairing process. Let X be an n × i matrix, where n is the number of image samples and i is the number of image features, and let Y be an n × j matrix of n text samples with j text features. The dataset D to be unpaired is defined as D = (X, Y), where each row X_n ∈ X is paired with the corresponding row Y_n ∈ Y. The samples to be unpaired are selected according to the percentage of the training set being unpaired: if 20% of the data are unpaired, the first 20 out of every 100 samples of the training set are selected to be unpaired; if 40% of the data are unpaired, the first 40 out of every 100 samples are selected, and so on. Once the samples to be unpaired are selected, in the case of images X_n being unpaired, the text vectors are extended to include unpaired sample markers at the end of the vector, and the corresponding paired text vectors Y_n are replaced by an empty vector and marked as unpaired samples. In the case of texts Y_n being unpaired, the corresponding paired image features X_n are replaced by an empty vector and marked as unpaired. Finally, the dataset is constructed with the newly unpaired samples, and appropriate data loaders provided within UMML are used to feed the unpaired datasets to the methods being trained. These data loaders ensure the input requirements of the methods match the dimensions of the newly constructed unpaired dataset.
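A minimal illustration of this unpairing step is sketched below, assuming the image features X and text BoW vectors Y are held as NumPy arrays; the function name, the toy feature dimensions, and the omission of the unpaired-sample markers are simplifications for illustration and do not reproduce the released UMML code.

```python
import numpy as np

def unpair(X: np.ndarray, Y: np.ndarray, percent: float, unpair_images: bool):
    """Empty one modality for the first `percent` of every 100 training samples.

    X: (n, i) image feature matrix; Y: (n, j) text BoW matrix.
    If unpair_images is True, the images are kept and their paired text
    vectors are emptied (zeroed); otherwise the image features are emptied.
    Returns modified copies of X and Y plus a boolean mask of unpaired rows.
    The released framework additionally appends unpaired-sample markers,
    which this sketch omits.
    """
    X, Y = X.copy(), Y.copy()
    n = X.shape[0]
    mask = np.array([(idx % 100) < percent for idx in range(n)])
    if unpair_images:
        Y[mask] = 0   # images left without corresponding text
    else:
        X[mask] = 0   # texts left without corresponding images
    return X, Y, mask

# Example: 20% unpaired images on a toy training set
# (feature dimensions here are illustrative only).
X = np.random.rand(10000, 4096).astype(np.float32)               # e.g., CNN image features
Y = (np.random.rand(10000, 1386) > 0.99).astype(np.float32)      # BoW text vectors
X_u, Y_u, unpaired = unpair(X, Y, percent=20, unpair_images=True)
```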
UMML is needed because pairwise-constrained CMH methods do not support actual unpaired data, thus requiring the use of empty samples to enable unpaired scenario evaluations. Once these unpaired samples are fed to the method being trained, no semantic value will be present within the emptied samples, resulting in semantically unpaired training. The newly unpaired samples are finally bundled with the samples which have been kept paired, and the unpaired dataset variant is constructed. Note that only the training set of the dataset is made unpaired by the UMML framework, as the query/retrieval set is left unaltered to avoid using empty samples as queries/retrievals during testing.

4. Experiment Methodology

4.1. Datasets

MIRFlickr-25K [] contains 25k image-tag pairs, with each sample being assigned at least one label from 24 classes. ‘Tags’ refer to the tags associated with each image, which are used as the text counterpart to the images for retrieval. ‘Labels’ are used as ground truth (i.e., classes) to measure performance at the test stages. For the experiments conducted, only the image-tag pairs which contain at least 20 textual tags are used, which brings the overall image-tag pair count down to 20,015. The query set consists of the samples which are used as the queries for the test stage, and the retrieval set contains the retrieval candidates for the query. The training set is formed as a subset of the retrieval set.
NUS-WIDE [] consists of 269,648 image-tag pairs categorised into 81 manually annotated classes. For our experiments, 195,834 image-tag pairs are selected, which belong to the 21 most frequent concepts. Table 1 shows the number of samples utilised for training and testing of the MIR-Flickr25K and NUS-WIDE datasets. Table 2 shows examples of MIR-Flickr25K and NUS-WIDE images, along with their tags and labels.
Table 1. MIRFlickr-25K and NUS-WIDE dataset characteristics.
Table 2. Example of images, paired tags, and labels from the MIR-Flickr25K and NUS-WIDE datasets. Example images (1) and (2) reprinted under Creative Commons attribution, (1) Author: Martin P. Szymczak, Source, CC BY-NC-ND 2.0 (2) Title: Squirrel, Author: likeaduck, Source, CC BY 2.0.

4.2. Methods

The experiments compare state-of-the-art cross-modal hashing methods for information retrieval and evaluate their performance in paired and unpaired data scenarios using the proposed UMML framework. These methods are: Adversary Guided Asymmetric Hashing (AGAH) [], Joint-modal Distribution-based Similarity Hashing (JDSH) [] and Deep Adversarial Discrete Hashing (DADH) []. These methods were chosen due to their relevance within the CMH information retrieval field and the full source code for each of the methods being publicly available. Unsupervised methods such as JDSH could opt to drop samples, i.e., skip samples in the training process, to replicate unpaired sample behaviour. During experiments using the proposed UMML framework, samples were not dropped to ensure consistency when comparing the methods. For all the methods evaluated, the 64-Bit setting is used. Experiments were not conducted with other bit settings because the study focuses on the usage and effect of training with unpaired samples instead of comparing the efficiency of different bit lengths.

4.3. Evaluation Metrics

To evaluate the performance of the CMH models, the widely adopted [] retrieval procedure of Hamming ranking is used, which sorts search results based on their Hamming distance to the query samples. The mean average precision (mAP) metric is then used to evaluate Hamming retrieval performance. Precision is the fraction of retrieved samples that are relevant to a query. Recall is the fraction of the relevant samples that are successfully retrieved. Average Precision (AP) is the mean of the precision scores for a given query. Mean Average Precision (mAP) is the mean of the AP across a number of queries. Mean average precision is the primary performance measure employed in CMH retrieval, and it is obtained as the mean of the average precision (AP) values at 100% Recall.
Performance difference compares the results when training with unpaired samples to results when training with paired samples. Let mAP_p be the performance obtained during paired training and mAP_u be the performance obtained during unpaired training. The percentage performance difference is calculated as follows.
Perf. Diff. = (mAP_u / mAP_p - 1) × 100    (4)
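As an illustration, the following sketch computes mAP over a Hamming ranking and the performance difference of Formula (4); it assumes multi-hot label matrices where samples sharing at least one label count as relevant, which is an assumption of this example rather than a description of the exact evaluation code used.

```python
import numpy as np

def average_precision(relevance: np.ndarray) -> float:
    """AP for one query, given a 0/1 relevance vector in ranked order."""
    hits = np.cumsum(relevance)
    precisions = hits / np.arange(1, len(relevance) + 1)
    n_rel = relevance.sum()
    return float((precisions * relevance).sum() / n_rel) if n_rel else 0.0

def mean_average_precision(q_codes, db_codes, q_labels, db_labels) -> float:
    """mAP over all queries; relevance = sharing at least one class label."""
    aps = []
    for code, label in zip(q_codes, q_labels):
        # Hamming ranking of the retrieval set against this query.
        order = np.argsort(np.count_nonzero(db_codes != code, axis=1))
        relevant = (db_labels[order] @ label > 0).astype(float)
        aps.append(average_precision(relevant))
    return float(np.mean(aps))

def perf_diff(map_unpaired: float, map_paired: float) -> float:
    """Formula (4): percentage performance difference of unpaired vs. paired training."""
    return (map_unpaired / map_paired - 1.0) * 100.0

# e.g., mAP values of 0.831 (unpaired) vs. 0.824 (paired) give roughly +0.85%.
print(perf_diff(0.831, 0.824))
```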

5. Experiment Results

This section presents the results of experiments conducted using the proposed UMML framework applied to facilitate unpaired learning using the DADH, AGAH and JDSH methods and the MIR-Flickr25K and NUS-WIDE datasets. The training performance of these CMH methods is evaluated across different sampling scenarios: unpaired images, unpaired text, a combination of unpaired images and text, and random sample discarding. The abovementioned methods are also compared to other unpaired CMH methods that can learn from unpaired data and thus do not require the UMML framework. Finally, the performance of the most promising method, DADH, is investigated at a more granular level by analysing DADH’s performance across each class of the MIR-Flickr25K dataset.

5.1. Training with Unpaired Images

Image to text (i→t) and text to image (t→i) evaluation results using unpaired images within the training set are presented in Table 3 and illustrated in Figure 5 for the MIR-Flickr25K and NUS-WIDE datasets. On the left-hand side of Table 3, results when training with a fully paired training set are provided. Results when using different unpaired sample sizes are then shown in increments of 20%. DADH, being the method built as an improvement to AGAH, outperforms the other methods, followed by AGAH. JDSH sees more limited performance when compared to DADH and AGAH, mainly due to it being an unsupervised method.
Table 3. Results (mAP) on MIR-Flickr25K and NUS-WIDE with unpaired images, i.e., images with no corresponding text. Column ‘Paired’ shows results when training with a fully paired training set. Subsequent columns show results with increasing amounts of unpaired images in the training set.
Figure 5. Results (mAP) on MIR-Flickr25K and NUS-WIDE with unpaired images, i.e., images with no corresponding text. The ‘Paired’ points show results when training with a fully paired training set. Subsequent points show results with increasing amounts of unpaired images in the training set in increments of 20%.
For MIR-Flickr25K, as shown in Figure 5a, DADH and AGAH show similar behaviours when using unpaired images for training; (t→i) results see a marginal decrease in performance, while the (i→t) task sees a more gradual decrease in performance as more unpaired images are introduced to the training set. In the case of JDSH, however, both tasks are affected in a similar manner. For NUS-WIDE, the results obtained show a different pattern when compared to that observed in MIR-Flickr25K. For DADH, AGAH and JDSH, the performance decrease as more unpaired images are introduced is gradual for both the (i→t) and (t→i) tasks. Based on the results shown in Table 3 and Figure 5, the following main observations can be made when training with unpaired images.
(1) Dataset impacts the performance of models. Different datasets provide different behaviours when unpaired samples are introduced into the training set. With MIR-Flickr25K, DADH and AGAH see different patterns of performance decrease for the (i→t) and (t→i) tasks, while with NUS-WIDE, DADH and AGAH see similar patterns for the two tasks. JDSH, on the other hand, shows similar patterns for both tasks on both datasets.
(2) Percentage of Unpairing may impact performance. For MIR-Flickr25K, the performance of methods DADH and AGAH for the (i→t) task is negatively affected as the percentage of unpaired images increases. For the (t→i) task, however, with the exception of 100% image unpairing, performance was unaffected when the percentage of unpaired images increased. Once all images in the training set are fully unpaired (i.e., 100% unpaired), the performance of both tasks across all methods is measured at an average of 0.564 mAP for MIR-Flickr25K and 0.268 mAP for NUS-WIDE. These results will later be compared to random performance evaluations in Section 5.4 to determine the extent to which the methods are learning from training with 100% unpaired images.

5.2. Training with Unpaired Text

Evaluation results using unpaired text within the training set are presented in Table 4 and shown in Figure 6. For MIR-Flickr25K, in the case of DADH, the (i→t) task retains its performance as more unpaired samples are used for training, while the (t→i) task sees a gradual performance decrease. This behaviour is the opposite of what occurred when unpaired images were used for training, which indicates that, in the case of DADH, the task most affected by using unpaired samples for training is the task which uses the unpaired modality as the query of the retrieval. In the case of AGAH and JDSH, however, both tasks saw similar rates of performance decrease as more unpaired samples were used for training. This behaviour deviates from what would be expected based on the unpaired image scenario, where AGAH saw the two tasks being affected differently. For NUS-WIDE, on the other hand, for all three methods DADH, AGAH and JDSH, both the (i→t) and (t→i) tasks saw a performance decrease, with the (i→t) task seeing the largest decrease.
Table 4. Results (mAP) on MIR-Flickr25K and NUS-WIDE with unpaired text, i.e., text with no corresponding images. Column ‘Paired’ shows results when training with a fully paired training set. Subsequent columns show results with increasing amounts of unpaired text in the training set.
Figure 6. Results (mAP) on MIR-Flickr25K and NUS-WIDE with unpaired text, i.e., text with no corresponding images. The ‘Paired’ points show results when training with a fully paired training set. Subsequent points show results with increasing amounts of unpaired text in the training set in increments of 20%.
In addition to the observations made when unpairing images, unpairing text provides the following observation: whether (i→t) or (t→i) will be the most affected task when training with unpaired samples depends on the method used, the dataset being evaluated, and the modality being unpaired. In the case of DADH, when training with unpaired text on the MIR-Flickr25K dataset, it was the (t→i) task that was most negatively affected. For AGAH, however, when unpairing text on the NUS-WIDE dataset, it was the (i→t) task that was most negatively affected. It is essential to evaluate the method being studied to verify its behaviour when training with unpaired samples, as this behaviour will determine whether it is feasible to adapt the CMH method to the unpaired sampling scenario.

5.3. Training with Unpaired Images and Text

Evaluation results using unpaired images and text within the training set are presented in Table 5 and shown in Figure 7. A noteworthy observation concerns the 50%/50% training set, which comprises 50% unpaired images and 50% unpaired text. DADH and AGAH do not see the same drop in performance as was seen with the 100% image or 100% text unpaired evaluations (see Table 3), where performance dropped to an average of 0.546 mAP for MIR-Flickr25K and 0.267 mAP for NUS-WIDE. Instead, performance when using the 50%/50% training set was measured at an average of 0.761 mAP for MIR-Flickr25K and 0.680 mAP for NUS-WIDE with DADH, and 0.711 mAP for MIR-Flickr25K and 0.566 mAP for NUS-WIDE with AGAH. These results indicate that DADH and AGAH can learn more from the 50%/50% unpaired training set than from the 100% unpaired image and 100% unpaired text sets.
Table 5. Results (mAP) on MIR-Flickr25K and NUS-WIDE with unpaired images and text, i.e., images with no corresponding text and vice versa. Column ‘Paired’ shows results when training with a fully paired training set. Subsequent columns show results with increasing amounts of unpaired images and text in the training set, for example, ‘10% 10%’ refers to 10% of the training set being unpaired images (UI) and another 10% being unpaired text (UT) for a total of 20% of the dataset being unpaired samples.
Figure 7. Results (mAP) on MIR-Flickr25K and NUS-WIDE with unpaired images and text, i.e., images with no corresponding text and vice versa. The ‘Paired’ points show results when training with a fully paired training set. Subsequent points show results with increasing amounts of unpaired images and text in the training set, for example, ‘10%/10%’ refers to 10% of the training set being unpaired images and another 10% being unpaired text for a total of 20% of the dataset being unpaired samples.

5.4. Training with Sample Discarding

Previous experiments evaluated the retrieval performance of CMH methods when using unpaired samples during the training process. Overall, a gradual decrease in performance was observed when the percentage of unpaired samples increased within the training set. The objective of the experiment described in this section is to investigate whether including unpaired samples in the training set improves retrieval performance compared to discarding them.
Method for Sample Discarding. Sample discarding refers to removing a given percentage of paired samples from the training set, which results in a smaller but still fully paired training set. For example, the datasets utilised for the experiments comprise 10,000 paired samples each. When 20% (2000) of the samples are discarded, the remaining 8000 pairs are utilised for training. Training when discarding 20% of pairs can then be compared to training with a set in which 20% of samples are unpaired. This comparison is illustrated in Figure 8. The random performance baseline was created by performing cross-modal retrieval on the test set with each model before training.
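The difference between the two settings compared in Figure 8 can be sketched as follows; the selection rule (first k out of every 100 pairs) mirrors the unpairing convention of Section 3 and is an assumption of this illustration, as are the toy feature dimensions.

```python
import numpy as np

def discard_samples(X: np.ndarray, Y: np.ndarray, percent: float):
    """Drop a percentage of pairs entirely, keeping the remainder fully paired."""
    n = X.shape[0]
    keep = np.array([(idx % 100) >= percent for idx in range(n)])
    return X[keep], Y[keep]

# With 10,000 pairs and 20% discarding, training uses 8000 paired samples,
# whereas 20% unpairing keeps all 10,000 samples but empties one modality
# for 2000 of them (compare Figure 8a,b).
X = np.random.rand(10000, 4096)
Y = np.random.rand(10000, 1386)
X_d, Y_d = discard_samples(X, Y, percent=20)
assert X_d.shape[0] == 8000
```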
Figure 8. In (a), 20% of the training set was discarded. In (b), 20% of the training set was unpaired. In this example, for both (a,b), the model will be trained on 8000 paired samples. However, (b) will also train with its additional 2000 unpaired samples. This way, the effect of training with or without the additional unpaired samples can be investigated.
Results. Evaluation results when incrementally discarding samples from the paired training set, along with the random baseline performance results (labelled as Random), are presented in Table 6 and Figure 9. The random performance baseline (i.e., the ‘Random’ column of Table 6) averaged 0.546 mAP for MIR-Flickr25K and 0.259 mAP for NUS-WIDE, which is very close to the results obtained when 100% of a given modality was unpaired (as shown in Table 3). When 100% of images or text were unpaired, average mAP scores of 0.564 for MIR-Flickr25K and 0.268 for NUS-WIDE were obtained (as shown in Table 3). As such, training with 100% of a modality being unpaired results in an insignificant performance improvement (+0.014/+0.009 mAP for MIR-Flickr25K/NUS-WIDE) over the random baseline.
Table 6. Results (mAP) on MIR-Flickr25K and NUS-WIDE with sample discarding, i.e., training set being reduced. Column ‘Full’ shows results when training with full training set without any sample discarding. Subsequent columns show results with decreasing amounts of samples, where the given percentage denotes the percentage of samples in the training set which have been discarded. The ‘Random’ column holds the baseline random performance values.
Figure 9. Results (mAP) on MIR-Flickr25K and NUS-WIDE with sample discarding, i.e., training set being reduced. The ‘Full’ points show results when training with the full unaltered training set. Subsequent points show results with decreasing amounts of samples, where the given percentage denotes the percentage of samples in the training set which have been discarded. The ‘Random’ points hold the baseline random performance values.
In terms of sample discarding, DADH and AGAH benefit noticeably from additional training data, as shown in Figure 9: their performance decreases steadily as more samples are discarded from the training set. JDSH remains more consistent in its performance relative to the number of training samples being discarded. This suggests that DADH and AGAH would benefit from additional sources of training data, e.g., unpaired samples, while JDSH retains its performance even with reduced amounts of data. To investigate further, we next compare the sample discarding results to the previously discussed unpaired sample training results.
Table 7 compares the results previously obtained when training under four different scenarios: unpaired images (UI), unpaired text (UT), both unpaired images and text (UIT), and sample discarding (SD). The results for the (i→t) and (t→i) tasks are compared separately and jointly. Table 7 shows which of the four cases (i.e., UI, UT, UIT, SD) resulted in the best retrieval performance. The percentage shown in brackets is the performance difference by which a given unpaired sample case outperformed sample discarding. For example, in Table 7, when DADH is used for the (i→t) task, using 20% Unpaired Text (UT) improved retrieval performance by 0.86% compared to discarding 20% of samples (SD). The actual values can be seen in Table 4 and Table 6, where the mAP was 0.824 for SD and 0.831 for UT, respectively.
Table 7. The sampling cases that produced the best retrieval results are indicated by UI: Unpaired Image, UT: Unpaired Text, UIT: Unpaired Image and Text, and SD: Sample discarding. The percentage shown in the brackets is the performance difference by which a given unpaired sample case (shown in Table 3, Table 4 and Table 5) outperformed sample discarding (SD) (shown in Table 6). The first row shows the percentage of training samples being unpaired (UI, UT, UIT), or discarded (SD) depending on the cell value.
Although the DADH, AGAH and JDSH methods were not developed for unpaired training, using the UMML framework, training with unpaired samples resulted in improved results in 53 of the 90 cases evaluated. The performance improvement from including unpaired samples in the training set, compared to discarding them outright, was more substantial the more limited the amount of paired data available in the dataset. Therefore, strategies for adapting CMH methods to efficiently train with unpaired samples are needed, particularly when limited paired data are available and unpaired data need to be utilised to improve the learning of the model.

5.5. Comparison to Other Unpaired CMH Methods

The following is a comparison between the pairwise-constrained methods DADH, AGAH and JDSH, when using the proposed UMML framework to enable them to learn from fully unpaired samples, and the unpaired CMH methods Adaptive Marginalized Semantic Hashing (AMSH) [], Robust Unsupervised Cross-modal Hashing (RUCMH) [] and Flexible Cross-Modal Hashing (FlexCMH) [], which can learn from unpaired datasets. These experiments were conducted using the MIR-Flickr25K and NUS-WIDE datasets.
AMSH and FlexCMH are supervised methods that shuffle the training data to create unpaired sample behaviour. RUCMH is an unsupervised method that is independent of pairwise relationships. For DADH, AGAH and JDSH, the 50% image and 50% text UMML unpairing approach discussed in Section 5.3 is used, where half of the text samples are emptied, leaving 50% of the images unpaired, and the other half of the image samples are emptied, leaving 50% of the text unpaired.
For AMSH, RUCMH and FlexCMH, the results reported in their respective publications are used for this comparison because the source code for these methods is not publicly available. For a full specification of the training parameters of AMSH [], RUCMH [] and FlexCMH [], please refer to their respective publications.
As shown in Table 8, AMSH outperforms the other methods by a considerable margin on the (t→i) task, with a 10.23% performance increase over the second-best performing method, DADH + UMML. For the (i→t) task, however, DADH + UMML narrowly obtains the best performance. The results obtained by the methods using the UMML extension are important because these methods are not designed for the unpaired scenario; the UMML approach adapts the datasets so that the methods become compatible with unpaired data. This indicates there is a need for approaches that fully adapt pairwise-constrained methods to the unpaired scenario.
Table 8. Fully unpaired 64-Bit mAP evaluation results for unpaired CMH methods and traditional CMH methods using UMML.

5.6. Class-by-Class Performance Evaluations

The objective of this experiment is to evaluate the performance of each class in MIR-Flickr25K when training with unpaired samples using DADH. Evaluations were made on a class-by-class basis across the 24 classes in the dataset, where three training set pairing scenarios were evaluated: fully paired, 80% of the training text being unpaired, and 80% of the training images being unpaired.
Table 9 shows the results of the experiments. The first column, ‘MIR-Flickr25K Classes’, shows the class number and name. The brackets next to each class indicate the number of queries taken from the given class and the possible number of correct retrievals within the retrieval set (Queries/Relevant Files). The column ‘Performance Difference’ shows the performance difference of the unpaired scenarios compared to the paired scenario, with the values computed using Formula (4).
Table 9. MIR-Flickr25K class-by-class mAP@N evaluation using 64-Bit DADH. Fully paired, 80% unpaired images and 80% unpaired text.
Figure 10 shows the performance difference across the classes when training with unpaired samples compared to when training with paired samples, calculated using Formula (4). The performance difference across all classes was negative, meaning that the retrieval performance of the DADH model was always worse when training with unpaired samples. The ‘Performance Decrease’ values shown on the y-axes of the charts of Figure 10 indicate a noticeable variation in performance difference across the classes when training with unpaired samples compared to training with paired samples. For example, classes 8, 9, and 10 were among the five most negatively affected classes when training with both unpaired images and text, averaging a performance decrease of 16.98%. On the other hand, classes 4, 18 and 23 were among the five least negatively affected classes, averaging a performance decrease of 8.49%. This variation in the results across classes indicates that the type of data used impacts the degree to which performance is affected when moving from training with paired data to training with unpaired data. As such, the content used for a given unpaired information retrieval task should be considered when developing strategies to tackle the unpaired scenario.
Figure 10. Percentage of performance change of DADH, computed using Formula (4), when training with unpaired samples compared to paired training across the 24 classes of MIR-Flickr25K. Red bars show the five classes with the most performance change and green bars show the five classes with the least performance change. The remaining classes are marked as blue bars.

6. Conclusions

This paper explores the topic of Unpaired Cross-Modal Hashing (CMH) and the capabilities of state-of-the-art CMH methods with regard to learning from unpaired data in the context of information retrieval. The UMML framework has been proposed to enable pairwise-constrained CMH methods to learn from unpaired data. Through UMML, experiments have been conducted using DADH, AGAH and JDSH with paired and unpaired sample variations of the MIR-Flickr25K and NUS-WIDE datasets. Evaluations to determine how unpaired data affect performance were carried out. Below is a summary of the main observations:
  • Unpaired data can improve the training results of CMH methods. Furthermore, if data from both the image and text modalities are present in the training set, initially pairwise-constrained CMH methods can be trained on fully unpaired data.
  • The extent to which unpaired data are helpful to the training process is relative to the amount of paired samples. The scarcer the available paired samples, the more helpful it can be to use additional unpaired samples for training.
  • The performance of the models when using unpaired samples for training depends on the modality of the unpaired samples, the dataset being used, the class of the unpaired data, and the architecture of the CMH algorithms. These factors influence whether unpaired samples will be helpful to the training process.
  • The proposed UMML framework adapts the dataset to enable pairwise-constrained CMH methods to train on unpaired samples. When using UMML to enable DADH, AGAH and JDSH to train with unpaired samples, it was observed that the methods perform well when training with unpaired samples. This suggests that further improvements may be observed if the architectures of these methods are adapted to train on unpaired data.
Building on the findings of this study, future work includes extending the proposed UMML framework to adapt methods to training with unpaired samples at an architectural level, as well as applying the UMML framework to data from case studies provided by industry.

Author Contributions

Conceptualization, M.W.-L. and G.C.; methodology, M.W.-L. and G.C.; software, M.W.-L.; validation, M.W.-L.; analysis, M.W.-L.; investigation, M.W.-L.; resources, G.C. and I.P.; data curation, M.W.-L.; writing—original draft preparation, M.W.-L.; writing—review and editing, M.W.-L. and G.C.; visualization, M.W.-L.; supervision, G.C. and I.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The code for the experiments presented in this paper can be found in the project’s GitHub repository https://github.com/MikelWL/UMML (accessed on 9 December 2022). Publicly available datasets were analyzed in this study. This data can be found here: MIRFlickr25K: https://press.liacs.nl/mirflickr/ (accessed on 9 December 2022), NUS-WIDE: https://lms.comp.nus.edu.sg/wp-content/uploads/2019/research/nuswide/NUS-WIDE.html (accessed on 9 December 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Baeza-Yates, R.; Ribeiro-Neto, B. Modern Information Retrieval. In Modern Information Retrieval; Association for Computing Machinery Press: New York, NY, USA, 1999; Volume 463.
  2. Lu, X.; Zhu, L.; Cheng, Z.; Song, X.; Zhang, H. Efficient Discrete Latent Semantic Hashing for Scalable Cross-Modal Retrieval. Signal Process. 2019, 154, 217–231.
  3. Jin, L.; Li, K.; Li, Z.; Xiao, F.; Qi, G.J.; Tang, J. Deep Semantic-Preserving Ordinal Hashing for Cross-Modal Similarity Search. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 1429–1440.
  4. Kumar, S.; Udupa, R. Learning Hash Functions for Cross-view Similarity Search. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona Catalonia, Spain, 16–22 July 2011.
  5. Zhang, D.; Li, W.J. Large-Scale Supervised Multimodal Hashing With Semantic Correlation Maximization. In Proceedings of the AAAI Conference on Artificial Intelligence, Québec City, QC, Canada, 27–31 July 2014; Volume 28.
  6. Wang, J.; Zhang, T.; Sebe, N.; Shen, H.T. A Survey on Learning to Hash. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 769–790.
  7. Deng, C.; Yang, E.; Liu, T.; Tao, D. Two-Stream Deep Hashing with Class-Specific Centers for Supervised Image Search. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 2189–2201.
  8. Peng, Y.; Huang, X.; Zhao, Y. An Overview of Cross-Media Retrieval: Concepts, Methodologies, Benchmarks, and Challenges. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 2372–2385.
  9. Jiang, Q.Y.; Li, W.J. Deep Cross-Modal Hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3232–3240.
  10. Gu, W.; Gu, X.; Gu, J.; Li, B.; Xiong, Z.; Wang, W. Adversary Guided Asymmetric Hashing for Cross-Modal Retrieval. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, Ottawa, ON, Canada, 10–13 June 2019; pp. 159–167.
  11. Liu, S.; Qian, S.; Guan, Y.; Zhan, J.; Ying, L. Joint-Modal Distribution-Based Similarity Hashing for Large-Scale Unsupervised Deep Cross-Modal Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, China, 25–30 July 2020; pp. 1379–1388.
  12. Bai, C.; Zeng, C.; Ma, Q.; Zhang, J.; Chen, S. Deep Adversarial Discrete Hashing for Cross-Modal Retrieval. In Proceedings of the 2020 International Conference on Multimedia Retrieval, Dublin, Ireland, 8–11 June 2020; pp. 525–531.
  13. Zheng, L.; Yang, Y.; Tian, Q. SIFT Meets CNN: A Decade Survey of Instance Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1224–1244.
  14. Wang, J.; Liu, W.; Kumar, S.; Chang, S.F. Learning to Hash for Indexing Big Data—A Survey. Proc. IEEE 2015, 104, 34–57.
  15. Shen, H.T.; Liu, L.; Yang, Y.; Xu, X.; Huang, Z.; Shen, F.; Hong, R. Exploiting Subspace Relation in Semantic Labels for Cross-Modal Hashing. IEEE Trans. Knowl. Data Eng. 2020, 33, 3351–3365.
  16. Ding, K.; Huo, C.; Fan, B.; Xiang, S.; Pan, C. In Defense of Locality-Sensitive Hashing. IEEE Trans. Neural Netw. Learn. Syst. 2016, 29, 87–103.
  17. Hardoon, D.R.; Szedmak, S.; Shawe-Taylor, J. Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Comput. 2004, 16, 2639–2664.
  18. Liu, X.; Hu, Z.; Ling, H.; Cheung, Y.M. MTFH: A Matrix Tri-Factorization Hashing Framework for Efficient Cross-Modal Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 964–981.
  19. Cao, W.; Feng, W.; Lin, Q.; Cao, G.; He, Z. A Review of Hashing Methods for Multimodal Retrieval. IEEE Access 2020, 8, 15377–15391.
  20. Pereira, J.C.; Coviello, E.; Doyle, G.; Rasiwasia, N.; Lanckriet, G.R.; Levy, R.; Vasconcelos, N. On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 521–535.
  21. Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference Over Event Descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78.
  22. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
  23. Luo, X.; Wang, H.; Wu, D.; Chen, C.; Deng, M.; Huang, J.; Hua, X.S. A Survey on Deep Hashing Methods. ACM Trans. Knowl. Discov. Data 2022.
  24. Strecha, C.; Bronstein, A.; Bronstein, M.; Fua, P. LDAHash: Improved Matching with Smaller Descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 66–78.
  25. He, J.; Liu, W.; Chang, S.F. Scalable Similarity Search with Optimized Kernel Hashing. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 25–28 July 2010; pp. 1129–1138.
  26. Gui, J.; Liu, T.; Sun, Z.; Tao, D.; Tan, T. Supervised Discrete Hashing with Relaxation. IEEE Trans. Neural Netw. Learn. Syst. 2016, 29, 608–617.
  27. Gionis, A.; Indyk, P.; Motwani, R. Similarity Search in High Dimensions via Hashing. Very Large Data Bases 1999, 99, 518–529.
  28. Zhu, X.; Huang, Z.; Shen, H.T.; Zhao, X. Linear Cross-Modal Hashing for Efficient Multimedia Search. In Proceedings of the 21st ACM International Conference on Multimedia, Barcelona, Spain, 21–25 October 2013; pp. 143–152.
  29. Ding, G.; Guo, Y.; Zhou, J. Collective Matrix Factorization Hashing for Multimodal Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2075–2082.
  30. Lin, Z.; Ding, G.; Han, J.; Wang, J. Cross-View Retrieval via Probability-Based Semantics-Preserving Hashing. IEEE Trans. Cybern. 2016, 47, 4342–4355.
  31. Liu, Q.; Liu, G.; Li, L.; Yuan, X.T.; Wang, M.; Liu, W. Reversed Spectral Hashing. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 2441–2449.
  32. Liu, X.; Yu, G.; Domeniconi, C.; Wang, J.; Ren, Y.; Guo, M. Ranking-Based Deep Cross-Modal Hashing. Proc. AAAI Conf. Artif. Intell. 2019, 33, 4400–4407.
  33. Wang, J.; Liu, W.; Sun, A.X.; Jiang, Y.G. Learning Hash Codes with Listwise Supervision. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3032–3039.
  34. Jin, Z.; Hu, Y.; Lin, Y.; Zhang, D.; Lin, S.; Cai, D.; Li, X. Complementary Projection Hashing. In Proceedings of the IEEE International Conference on Computer Vision, Washington, DC, USA, 2–8 December 2013; pp. 257–264.
  35. Yang, E.; Deng, C.; Liu, W.; Liu, X.; Tao, D.; Gao, X. Pairwise Relationship Guided Deep Hashing for Cross-Modal Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
  36. Li, C.; Deng, C.; Li, N.; Liu, W.; Gao, X.; Tao, D. Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4242–4251.
  37. Mandal, D.; Chaudhury, K.N.; Biswas, S. Generalized Semantic Preserving Hashing for N-Label Cross-Modal Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4076–4084.
  38. Hu, Z.; Liu, X.; Wang, X.; Cheung, Y.M.; Wang, N.; Chen, Y. Triplet Fusion Network Hashing for Unpaired Cross-Modal Retrieval. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, Ottawa, ON, Canada, 10–13 June 2019; pp. 141–149.
  39. Wen, X.; Han, Z.; Yin, X.; Liu, Y.S. Adversarial Cross-Modal Retrieval via Learning and Transferring Single-Modal Similarities. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 478–483.
  40. Gao, J.; Zhang, W.; Zhong, F.; Chen, Z. UCMH: Unpaired Cross-Modal Hashing with Matrix Factorization. Neurocomputing 2020, 418, 178–190.
  41. Liu, W.; Wang, J.; Kumar, S.; Chang, S. Hashing with Graphs. In Proceedings of the International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011.
  42. Cheng, M.; Jing, L.; Ng, M. Robust Unsupervised Cross-modal Hashing for Multimedia Retrieval. ACM Trans. Inf. Syst. (TOIS) 2020, 38, 1–25.
  43. Luo, K.; Zhang, C.; Li, H.; Jia, X.; Chen, C. Adaptive Marginalized Semantic Hashing for Unpaired Cross-Modal Retrieval. arXiv 2022, arXiv:2207.11880.
  44. Yu, G.; Liu, X.; Wang, J.; Domeniconi, C.; Zhang, X. Flexible Cross-Modal Hashing. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 304–314.
  45. Huiskes, M.J.; Lew, M.S. The MIR Flickr Retrieval Evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, Vancouver, BC, Canada, 30–31 October 2008; pp. 39–43.
  46. Chua, T.S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; Zheng, Y. NUS-WIDE: A Real-World Web Image Database From National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, Santorini Island, Greece, 8–10 July 2009; pp. 1–9.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
