Inter- and Intra-Modal Contrastive Hybrid Learning Framework for Multimodal Abstractive Summarization

Abstractive summarization lets internet users grasp an article by reading a short summary instead of the full text. However, existing techniques for summarizing articles that contain both text and images suffer from the semantic gap between vision and language. They focus on aggregating features and neglect the heterogeneity of each modality. Moreover, because they consider neither the intrinsic data properties within each modality nor the semantic information carried by cross-modal correlations, the learned representations are of poor quality. We therefore propose a novel Inter- and InTra-modal Contrastive Hybrid (ITCH) learning framework, which learns to automatically align multimodal information and maintains the semantic consistency of the input/output flows. Moreover, ITCH can be used as a component, making the model suitable for both supervised and unsupervised learning approaches. Experiments on two public datasets, MMS and MSMO, show that ITCH outperforms the current baselines.


Introduction
The last two decades have witnessed a surge of information on the internet. Extensive digital resources in a variety of formats (text, image and video) have enriched our lives, facilitated by a proportional increase in online sharing platforms such as YouTube and Facebook. Meanwhile, a large number of articles, including texts, images and videos, are continuously generated and displayed on the internet every day. For example, BBC News provided 1.1 million multimedia articles in 2021, with 72 million daily visitors [1].
This large amount of information gives people the opportunity to find what they want on the internet. However, reading so many articles in their entirety is time-consuming. Consequently, it is necessary to analyze multimedia articles and summarize them automatically, so that internet users can read short summaries rather than whole articles.
Recently, research into multimodal abstractive summarization (MAS) has provided approaches for integrating image and text modalities into a short, concise and readable textual summary [2,3]. With the rapid development of deep learning technologies, more and more researchers have explored methods for solving this task with unsupervised [4,5] or supervised [3,6,7] approaches. In general, current deep-learning-based schemes follow the paradigm of feature extraction followed by downstream processing [8].
In the multimedia field, especially for MAS, there are usually three steps [8]: (1) feature extraction, (2) multimodal fusion and (3) textual generation. Figure 1 shows the details of the common multimodal abstractive summarization framework. First, the feature extraction step extracts region- or token-level features from the multimodal references using domain-specific extractors, such as ConvNet for visual data and SeqModel for textual data. Next, in the multimodal fusion step, the fused information is obtained using cross-modal mechanisms (e.g., alignment or projection). Finally, in the textual generation step, a target textual summary is generated by maximizing likelihood estimation or augmentation objectives.

Figure 1. Illustration of the standard multimodal abstractive summarization framework, which consists of a multimodal encoder and a textual decoder. The decoder generates a target summary after extracting the visual semantic features and merging them together.
Current research focuses more on the multimodal fusion and textual generation steps than on feature extraction, since feature extractors are already widely used in natural language processing (NLP) and computer vision (CV) and perform well. In multimodal fusion approaches, multiple inputs are fused by attention-based [9] or gate-based [3] mechanisms in order to learn a representation suitable for summary generation. Such solutions concentrate on aggregating features from several modalities. However, they ignore the heterogeneity of vision and language and do not account for the semantic gap between images and text. In research on textual generation, the two main approaches are designing a novel decoder and adding objectives. The classic scheme employs a recurrent neural network (RNN [10]) or CopyNet [11] as the backbone because of the sequential nature of language. Recently, transformer-based pre-trained generative language models, such as UniLM [12], BART [13] and ProphetNet [14], have shown remarkable performance on generation tasks, owing both to the self-attention module and to the large-scale training corpus. Adding extra training objectives, typically image-text [15] or text-text [16] matching, can further improve summary generation. Recent research also explores a contrastive method to eliminate the gap between training and verification [17]. However, these additional objectives focus on textual coherence rather than on the semantic consistency between the input image and sentences. To summarize, existing systems have two flaws: (1) a visible gap between vision and language, and (2) a lack of consideration of the intrinsic data properties within input-output sentences and of the semantic consistency of the input cross-modal correlation.
To address the aforementioned problems, this paper provides an Inter- and InTra-modal Contrastive Hybrid (ITCH) learning framework for the MAS task. It adjusts three points of the vanilla transformer: it (1) uses pre-trained language and vision models as encoders, (2) adds a cross-modal fusion module and (3) adds hybrid auxiliary contrastive objectives. The pre-trained vision transformer (ViT [18]) and BERT [19] are employed to encode image and text, respectively, to ensure that the two modalities are processed in a unified way. To tackle flaw 1, we propose a cross-modal fusion module that compensates for the feature-level gap after the visual and textual features have been obtained. Taking the textual data as the query, additional information is referenced from the visual features for fusion. To tackle flaw 2, the whole model incorporates two contrastive learning objectives in addition to the end-to-end textual reconstruction loss: an intra-modal objective for input and output utterances, and an inter-modal objective for the input image and sentences. In addition, ITCH can be used as a component, making the model suitable for both supervised and unsupervised learning environments. Experimental results on MSMO and MMS demonstrate that ITCH outperforms previous state-of-the-art methods on the multimodal abstractive summarization task in terms of ROUGE, relevance scores and human evaluation. The main contributions of this paper are:
(1) An ITCH framework is proposed for tackling multimodal abstractive summarization in a supervised approach. Moreover, when ITCH is integrated as a component into an existing system, it is also appropriate for unsupervised learning environments.
(2) A cross-modal fusion module is designed for obtaining a textually enhanced representation. It merges contextual vision and language information, and aligns visual features to the textual representation.
(3) The inter-modal and intra-modal objectives are integrated with a reconstruction objective in summary generation. The inter-modal objective measures the consistency of the input images and texts, while the intra-modal objective maintains the semantic similarity between the input sentences and the output summary.
The rest of this paper is organized as follows: Section 2 discusses related work. Section 3 presents the ITCH framework. ITCH-based components used for supervised and unsupervised learning environments are also introduced in this section. Section 4 evaluates the performance of the ITCH framework and discusses the results. A case study is shown in Section 5. Section 6 concludes the paper.

Visual and Semantic Feature Extractors
The feature extractors utilized in NLP and CV differ because text and images have different properties. The recurrent neural network (RNN [10]) was proposed to model sequential sentences and represent contextual features. As sentence length increases, however, gradient dispersion limits its further development. Long short-term memory (LSTM [20]) and the gated recurrent unit (GRU [21]), which use a gate mechanism, can help with this issue, but encoding the tokens of a sentence one at a time restricts inference efficiency. To address these problems, the transformer [22] with self-attention was proposed to contextualize an entire sentence or paragraph into features in parallel. This facilitated the development of pre-trained language models, which are trained on a large-scale corpus with specifically designed tasks. In a variety of downstream tasks, pre-trained language models such as ELMo [23], GPT [24], BERT [19] and RoBERTa [25] have achieved state-of-the-art performance. As a result, current schemes rely heavily on a pre-trained model as the linguistic feature extractor.
For vision, a convolutional neural network (CNN [26]) is the most extensively used deep learning model. It aggregates local spatial features using a kernel and accumulates them with feedforward networks. Moreover, some studies focus on the salient regions of objects or entities using Faster R-CNN [27] in conjunction with ResNet [28] to learn features with rich semantic meaning. To connect the domains of vision and language, Dosovitskiy et al. [18] try to employ a vanilla transformer with patch projection for vision problems.

Multimodal Fusion Methods
Multimodal fusion is intended to fuse heterogeneous information in order to better interpret multimodal inputs and apply them to downstream tasks. Early fusion (EF [29]) embeds features by projection or concatenation. Considering that EF does not accumulate intra-modal information, Zadeh et al. [30] utilize a memory fusion network to account for modal-specific and cross-modal interactions continuously. Hierarchical attention for fusion, as advised by Kronecker [31], has also been proposed for addressing multimodal interaction. Similarly, using an attention-based mechanism, Pruthi et al. [32] apply a masked strategy for "deceiving", which improves the reliability of the attention. Rather than focusing on information across modalities through attention, some studies have tried to fuse multimodal information using the correlation between input and output. Liu et al. [33] employ low-rank tensors of several representations, including the output, to perform multimodal fusion. Furthermore, Liu et al. [34] propose a novel TupleInfo to encourage the model to examine the correspondences between input and output in the same tuple, ensuring that weak modalities are not ignored. Recently, a channel-exchanging network (CEN [35]) was proposed to address the inadequate trade-off between inter-modal fusion and intra-modal processing.

Methods for Abstractive Summarization
Multimodal summarization is the task of generating a target summary based on multimedia references. The most significant difference between multimodal summarization and textual summarization is whether the input contains two or more modalities. Based on the methodology, multimodal summarization can be separated into multimodal abstractive summarization and multimodal extractive summarization. The former, which is the subject of our research, gathers information from multiple sources and constructs textual sequences using a generation model.
For the MAS task, Evangelopoulos et al. [36] detect the key frames in a movie based on the salience of individual elements for aural, visual and linguistic representations. Replacing frames with the tokens of a sentence, Li et al. [37] generate a summary from a set of asynchronous documents, images, audios and videos by maximizing the coverage. Sanabria et al. [38] use a multimedia topic model to identify the representative textual and visual samples individually, and then produce a comprehensive summary. Considering visual information as a complement to textual features for generation [7], Zhu et al. [39] propose a multimodal input and multimodal output dataset, as well as an attention model that generates a summary through a text-guided mechanism. The model Select [40] proposes a selective gate module for integrating reciprocal relationships, including a global image descriptor, activation grids and object proposals. Modeling the correlation among inputs is the core point of MAS. Zhu et al. [41] frame a unified model for unsupervised graph-based summarization that does not require manually annotated document-summary pairs. Another unsupervised method that is closely related to our paper is generation with the "long-short-long" paradigm [5] combined with multimodal fusion.

Contrastive Learning
Much research utilizes contrastive objectives for instance comparison (gathering similar samples while keeping the distance between dissimilar samples as large as possible) in order to facilitate representation learning in both NLP and CV. For example, noise-contrastive estimation (NCE [42]) was proposed to tackle the computational challenges imposed by a large number of instance classes. Information NCE (infoNCE [43]) maximizes a lower bound on the mutual information between images and caption words in cross-modal retrieval. For vision with contrastive learning, MoCo [44] further improves such a scheme by dynamically storing representations from a momentum encoder, and MoCov2 [45] borrows the multi-layer perceptron and shows significant improvements over MoCo. SimCLR [46] proposes a simple framework for large-batch applications that does not require memory representations. For language, ConSERT [47] observes that natively derived sentence representations collapse in semantic textual similarity tasks. Gao et al. [48] find that dropout, acting as minimal data augmentation, can achieve state-of-the-art performance with a contrastive learner. For cross-modal scenarios, vision-language pre-trained methods are representative approaches that embrace multimodal information for reasoning [49,50]. Recently, Yuan et al. [51] utilized the NCE [42] and MIL-NCE [52] losses to learn representations across image and text modalities.

The ITCH Framework
In this section, we introduce the details of our proposed ITCH for the multimodal abstractive summarization task. ITCH (illustrated in Figure 2) takes a bi-modal image and text as inputs and represents their respective features using a patch-oriented visual encoder and a token-aware textual encoder in the Feature Extractor. For the purpose of alignment, a Cross-Modal Fusion Module is used to enhance the semantic features. Thereafter, the target summary is generated by the token-aware decoder introduced in Textual Decoder. In addition, the Hybrid Contrastive Objectives introduce inter- and intra-modal contrastive objectives as auxiliary objectives for summarization referenced from multiple modalities: apart from the common reconstruction loss for summary generation, an inter-modal contrastive objective is designed to maintain the distance among bi-modal inputs, and an intra-modal contrastive objective is used to gather information between input sentences and output utterances. Finally, we also show how to use ITCH as a component for the unsupervised learning approach.

Visual and Textual Feature Extractor
Given a mini-batch of image-text pairs $(v_i, s_i)$ with $v_i \in \mathbb{R}^{C \times H \times W}$ and $s_i \in \mathbb{R}^{M}$, $(H, W)$ is the resolution of image $v_i$, $C = 3$ denotes the number of channels of $v_i$, and $M$ denotes the number of tokens in the sentences $s_i$. In order to represent the contextual features of images and text, respectively, different pre-trained transformer-based models were used as extractors.
Patch-Oriented Visual Encoder. To obtain visual features, we chose the vision transformer (ViT) as the extractor. ViT receives a 1D sequence of embeddings as input, whereas the original image is 3D. We therefore reshaped the image into a sequence of flattened 2D patches $v \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $P$ is the height and width of each patch and $N = HW/P^2$ is the resulting number of patches. After the linear projection FC and the addition of the 2D-aware position embeddings $E^{img}_{pos}$, the image embeddings are fed forward to the patch-oriented visual encoder. Let $D$ be the hidden dimension of ViT; the visual feature $V \in \mathbb{R}^{N \times D}$ is then obtained by

$$V = \mathrm{ViT}\left(\mathrm{FC}(v) + E^{img}_{pos}\right) \quad (1)$$

Token-Aware Textual Encoder. For the textual branch, the pre-trained BERT is used to extract context-enhanced features. A similar linear projection FCs is used for token-level embedding; its weights are not shared with the visual branch. In addition, the static positional embedding $E^{txt}_{pos}$ is also considered. After BERT, we utilized a fully connected layer to map the output to the same dimension $D$ as $V$. The textual feature $S \in \mathbb{R}^{M \times D}$ is calculated as follows:

$$S = \mathrm{BERT}\left(\mathrm{FCs}(s) + E^{txt}_{pos}\right) W_t + b_t \quad (2)$$

where $W_t$ and $b_t$ are trainable weights of the fully connected layer. Recall that, through this section, the original image $v \in \mathbb{R}^{C \times H \times W}$ and text $s \in \mathbb{R}^{M}$ are represented as features $V \in \mathbb{R}^{N \times D}$ and $S \in \mathbb{R}^{M \times D}$.
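The patch reshaping described for the visual encoder can be illustrated with a minimal pure-Python sketch. The function names `num_patches` and `patchify`, the nested-list image layout and the ordering of values inside a flattened patch are assumptions made for illustration only; the paper relies on the pre-trained ViT implementation rather than on code like this.

```python
def num_patches(height, width, patch):
    """N = H * W / P^2, the number of patches for an H x W image."""
    if height % patch or width % patch:
        raise ValueError("H and W must be divisible by the patch size P")
    return (height * width) // (patch * patch)


def patchify(image, patch):
    """Split a C x H x W nested-list image into N flat patches of length P*P*C."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    num_patches(height, width, patch)  # validates divisibility
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flattening order (row, column, channel) is an illustrative choice.
            flat = [image[c][top + dy][left + dx]
                    for dy in range(patch)
                    for dx in range(patch)
                    for c in range(channels)]
            patches.append(flat)
    return patches


# Toy image with C = 3, H = W = 8 and P = 4, so N = 4 and each patch has 48 values.
toy_image = [[[c * 100 + y * 8 + x for x in range(8)] for y in range(8)]
             for c in range(3)]
toy_patches = patchify(toy_image, 4)
print(len(toy_patches), len(toy_patches[0]))  # 4 48
```

For a 224 × 224 input with 16 × 16 patches, the resolution and patch size suggested by the name of the vit-base-patch16-224 checkpoint, `num_patches(224, 224, 16)` gives N = 196.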

Cross-Modal Fusion Module
Given two encoded and unaligned features, V and S, the goal of the cross-modal fusion module is to align semantic features in S to V via query/key/value attention and a modified filter (details in Figure 3). We first projected the bi-modal features to vectors, i.e., $Q = SW_Q$, $K = VW_K$ and $V = VW_V$, where $W_Q$, $W_K$ and $W_V$ are weights. We assumed that a good way to fuse vision-language information is to provide a latent adaptation from V to S, as in Formula (3). In addition, an adjustable factor γ together with the activation function ReLU(x) = max{x, 0} was used to keep only the high relevance scores; that is, the low-value scores associated with unaligned visual features are discarded by this process, which yields the temporary fusion feature. Because the final target is a textual summary, and in order to prevent gradient dispersion, we utilized layer normalization [53] and a residual connection [28] to enhance the textual information. The fusion feature $F \in \mathbb{R}^{M \times D}$, which highlights the semantic vectors shared by the vision and language features, is then calculated as in Formula (4).
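To make the data flow of the fusion module concrete, the following is a rough pure-Python sketch under stated assumptions. The text does not spell out the exact form of the relevance scores, so the sketch assumes scaled dot-product attention with a softmax, applies the filter as ReLU(score + γ), and uses a layer normalization without learned scale and shift; these choices, and all function names, are illustrative and may differ from the authors' formulas.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matmul(a, b):
    columns = list(zip(*b))
    return [[dot(row, col) for col in columns] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-5):
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]


def cross_modal_fusion(S, V, W_q, W_k, W_v, gamma=-0.15):
    """Text queries attend to visual keys/values; weak scores are filtered out."""
    Q, K, values = matmul(S, W_q), matmul(V, W_k), matmul(V, W_v)
    scale = math.sqrt(len(Q[0]))
    fused = []
    for q, s_row in zip(Q, S):
        # Assumed relevance scores: softmax over scaled dot products.
        scores = softmax([dot(q, k) / scale for k in K])
        # ReLU(score + gamma): with gamma < 0, scores below -gamma become 0.
        kept = [max(a + gamma, 0.0) for a in scores]
        temp = [sum(w * v[d] for w, v in zip(kept, values))
                for d in range(len(s_row))]
        # Residual connection to the textual feature, then layer normalization.
        fused.append(layer_norm([t + s for t, s in zip(temp, s_row)]))
    return fused


identity = [[1.0, 0.0], [0.0, 1.0]]
S_toy = [[1.0, 0.0], [0.0, 2.0]]               # M = 2 tokens, D = 2
V_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # N = 3 patches, D = 2
F_toy = cross_modal_fusion(S_toy, V_toy, identity, identity, identity)
print(len(F_toy), len(F_toy[0]))  # 2 2
```

In this sketch the output keeps the M × D shape of the textual feature, and when γ is so negative that every score is filtered out, the module falls back to the layer-normalized text feature alone.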

Textual Decoder
The goal of ITCH is to generate a target summary $\hat{Y} = \{\text{<sos>}, \ldots, \hat{y}_i, \ldots, \text{<eos>}\}$, which begins and ends with the special tokens <sos> and <eos>. The corresponding ground truth is denoted as $Y$. After obtaining the fusion feature $F \in \mathbb{R}^{M \times D}$ through the cross-modal fusion module, the textual sequence is generated by a token-aware transformer-based decoder. It takes the previously predicted tokens $\hat{y}_{0:i-1}$ and the fusion feature $F$ as inputs, and outputs the current token using a model with parameters θ. In detail, TransDec denotes the function of the decoder and $\hat{y}_{0:i-1}$ denotes the tokens before the $i$th token, where $\hat{y}_0$ = <sos>:

$$\hat{y}_i = \mathrm{TransDec}(\hat{y}_{0:i-1}, F; \theta) \quad (5)$$

For the generation objective, the reconstruction loss $L_{gene}$ is the natural choice. It minimizes the negative log-likelihood of the ground-truth tokens:

$$L_{gene} = -\sum_{i} \log p(y_i \mid \hat{y}_{0:i-1}, F; \theta) \quad (6)$$
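As a small worked example of the generation objective, the sketch below sums the negative log-probabilities that a decoder assigns to the ground-truth tokens. The function name `reconstruction_loss` and the toy probabilities are hypothetical; in practice the per-step distributions would come from the decoder's output layer.

```python
import math


def reconstruction_loss(step_probs, target_ids):
    """Negative log-likelihood: sum of -log p(y_i) over the decoding steps."""
    return -sum(math.log(probs[y]) for probs, y in zip(step_probs, target_ids))


# Two decoding steps over a toy vocabulary of two tokens.
toy_probs = [[0.5, 0.5], [0.25, 0.75]]
toy_targets = [0, 1]
toy_loss = reconstruction_loss(toy_probs, toy_targets)
print(round(toy_loss, 4))  # -(ln 0.5 + ln 0.75)
```

The loss shrinks toward zero as the decoder places more probability mass on each ground-truth token.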

Hybrid Contrastive Objectives
In this section, we introduce two contrastive objectives besides the common generation objective, which can be considered auxiliary tasks during the training process that reinforce the primary summarization task. In detail, text-image consistency loss and IO (Input/Output)-aware coherence loss are proposed to maximize the lower bound on mutual information.
Inter-modal objective for input text-image pair. Because images and sentences are already paired in the existing datasets, natural matches exist between them; although this is beneficial to the training process, it reduces the generalization of the models and inhibits further performance improvements. In the previous steps, we obtained the context-enhanced visual feature V and language feature S through the feature extractors. To facilitate the comparison of images and texts, a pooling strategy was used to abstract the features into vectors.
Here, batch normalization BN(·) and layer normalization LN(·) are applied when pooling the vision and language features into vectors, respectively. Generally speaking, L2 normalization is used to map the features to a unified space before the similarity calculation [54]. However, in our case we did not aim to complete the matching itself, but to maintain the consistency between images and sentences. Experimental results show that using different normalizations can fuse more information without destroying the distribution of the data. Following this motivation, we expected a corresponding image-text pair to have high consistency and irrelevant pairs to have low similarity, especially those with fine-grained interplay. To achieve this goal, we directly accumulated the contrastive losses advised by infoNCE.
Here, sim denotes the similarity function, $\mathrm{sim}(a, b) = a \cdot b^{T}$.
Intra-modal objective for input/output utterances. Access to coherence labels for IO utterances often requires extra expert annotations or additional algorithms, which are expensive or may introduce error propagation. Observing that the sentences in a reference are inherently related to the generated summary, we instead obtained the coherence by modeling the similarity of the IO textual data. The underlying assumption is that utterances within the same description are more similar to one another than those spanning different paragraphs. The loss for measuring the coherence among IO utterances takes the same form as $L_{inter}$, with $o_y$ being the sentence embedding obtained using the same method as the textual vector $o_s$. The difference between the two contrastive losses is visualized in Figure 4.
In conclusion, the total loss function of ITCH is defined as Formula (10), where $\|\cdot\|_2$ denotes the L2 norm of the parameters θ.
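The two contrastive objectives can be sketched with a generic infoNCE-style loss over pooled vectors. This is a minimal illustration under assumptions: it uses a one-directional loss with in-batch negatives averaged over the batch, the dot-product similarity sim(a, b) = a · b^T, and the temperature τ = 0.1 reported in the setup; the function name `info_nce` and the toy vectors are hypothetical, and the authors' exact formulation may differ.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, candidates, tau=0.1):
    """Mean of -log softmax(sim(a_i, b_i) / tau) over in-batch candidates b_j."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, c) / tau for c in candidates]
        peak = max(logits)  # log-sum-exp trick for numerical stability
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy pooled vectors for a batch of two samples.
image_vecs = [[1.0, 0.0], [0.0, 1.0]]
text_vecs = [[1.0, 0.0], [0.0, 1.0]]     # aligned with the images
summary_vecs = [[0.0, 1.0], [1.0, 0.0]]  # deliberately mismatched

inter_loss = info_nce(image_vecs, text_vecs)    # inter-modal: image vs. text
intra_loss = info_nce(text_vecs, summary_vecs)  # intra-modal: input vs. output
print(inter_loss < intra_loss)  # True
```

Matched pairs yield a loss close to zero, whereas mismatched pairs are penalized heavily; the same function serves both objectives, with only the pair of vector sets changing.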

Unsupervised Learning Combined with ITCH
The description above covers the processing flow that combines ITCH with the supervised learning approach. ITCH can also readily implement unsupervised multimodal abstractive summarization by serving as the compressor. In detail, for the unsupervised approach, as shown in Figure 5, we utilized the existing "long-short-long" structure of CTNR [5] (sentences → Encoder-Decoder → summary → Encoder-Decoder → sentences). It fuses multimodal information and generates a summary through a decoder, and the generated summary is then used to reconstruct the input sentences.
Figure 5. Structure for the unsupervised learning method, which uses the same structure as ITCH with the supervised approach as a component and adds a transformer model with encoder TransEnc and decoder TransDec to reconstruct the input text, as advised by the existing Compress-then-Reconstruct approach (CTNR).
Following Sections 3.1 and 3.2, the textual-enhanced feature F is obtained through the cross-modal fusion module. The generation process is the same as in Equation (5). Because the unsupervised setting has no corresponding label to train on, we encoded the generated summary Ŷ and reconstructed the textual input sequence s. The reconstructor is a transformer model with encoder TransEnc and decoder TransDec, and the predicted input text ŝ is obtained by encoding Ŷ with TransEnc and decoding with TransDec. The reconstruction loss of the unsupervised approach differs from that of the supervised one: the likelihood is computed between the predicted sentences ŝ and the input text s rather than between the generated summary Ŷ and the ground truth Y, while the function is the same as in Equation (6). The hybrid inter- and intra-modal contrastive losses are also included (details in Section 3.3), and together these processes constitute ITCH with the unsupervised approach.
The framework of ITCH with the supervised approach is highlighted with a red box in Figure 5 to denote the role of ITCH in the unsupervised approach. In conclusion, the unsupervised approach differs from the supervised ITCH in two respects.
(1) The input and output of the whole model change from {(v, s) → ŷ} to {(v, s) → s}. The supervised ITCH takes bi-modal inputs to generate a summary directly, while the unsupervised ITCH generates a summary in the middle of the whole model and takes these sequences to reconstruct the input text.
(2) Additional transformers, Encoder and Decoder, are added for reconstructing the input sentences, while the supervised ITCH does not consider Encoder and Decoder.

Setup
We evaluated ITCH on two public multimodal summarization datasets, MMS [55] and MSMO [39]. Each sample in MMS is a triplet (sentence, image, headline), where the headline is commonly considered the target summary. As Table 1 shows, MMS and MSMO were divided into three groups for the experiments. The maximum number of words in an input sentence for the MMS dataset was 439. For MSMO, the items are internet news articles with numerous picture captions. After removing special tokens and punctuation, the maximum number of tokens was reduced from 740 to 492, which fits within the maximum length of 512 for the transformer model. The word embedding size was set to 300 and the limited vocabulary size was set to 20,004, including four extra special tokens (<unk>, <pad>, <sos> and <eos>). The feature dimension D is 768, as determined by the chosen visual and language pre-trained encoders, which were obtained from Hugging Face (bert-base-uncased: https://huggingface.co/bert-base-uncased, accessed on 13 April 2022; vit-base-patch16-224: https://huggingface.co/google/vit-base-patch16-224, accessed on 13 April 2022). We also used dropout with a probability of 0.3 for the cross-modal fusion module. The batch size was at most 128, limited by the GPU (Nvidia 3090 with 24 GB VRAM), and all parameters were trained for 30 epochs with a learning rate of 2 × 10^−5 for the pre-trained extractors and 2 × 10^−4 for the others; both rates were halved every 10 epochs. We used mean pooling to transform features into vectors, which has been verified as more effective [56] than max pooling or [CLS]. For the other hyperparameters, the optimal settings are an adjustable factor γ = −0.15 in the cross-modal fusion module and a temperature parameter τ = 0.1 in infoNCE. Details are shown in Table 2.
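The step-wise learning-rate schedule described above can be written down in a few lines. The sketch assumes 0-indexed epochs, so that epochs 0 to 9 use the base rate; the function name `learning_rate` is hypothetical, and the original training code may implement the decay differently.

```python
def learning_rate(epoch, base_lr, halve_every=10):
    """Base rate halved once per completed block of `halve_every` epochs."""
    return base_lr * 0.5 ** (epoch // halve_every)


EXTRACTOR_LR = 2e-5  # pre-trained ViT and BERT extractors
OTHER_LR = 2e-4      # all remaining parameters

for epoch in (0, 10, 20, 29):
    print(epoch, learning_rate(epoch, EXTRACTOR_LR), learning_rate(epoch, OTHER_LR))
```

Over the 30 training epochs, each parameter group therefore passes through three rates: the base rate, one half of it and one quarter of it.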

Evaluation Metrics
The evaluation metrics are calculated between the generated summary and the ground truth, and cover three aspects: word overlap, embedding relevance and human evaluation.
• ROUGE [57]: the standard metric for scoring the generated summary against the target sentences using recall and precision overlaps (specifically, R-N and R-L). R-N refers to the N-gram recall between a candidate summary and a set of reference summaries, and is computed as follows:

$$\text{R-N} = \frac{\sum_{S \in \{\text{References}\}} \sum_{gram_N \in S} Count_{match}(gram_N)}{\sum_{S \in \{\text{References}\}} \sum_{gram_N \in S} Count(gram_N)}$$

where N is the length of the N-gram $gram_N$, and $Count_{match}(gram_N)$ is the maximum number of N-grams co-occurring in a candidate summary and a set of reference summaries. Here, we selected R-1 and R-2 as the evaluation metrics. R-L uses a longest-common-subsequence (LCS)-based F-measure to estimate the similarity between two summaries. The longer the LCS of two summaries is, the more similar they are.
• Relevance [58]: we used embedding-based metrics to evaluate the similarity of the generated summary and the target summary. In particular, Embedding Average and Embedding Extrema use the mean embedding and the max-pooled embedding, respectively, to compute the cosine similarity. Embedding Greedy does not pool word embeddings but greedily finds the best matching words. These metrics measure the semantic similarity of the generated summary and the ground truth.
• Human: we invited twelve native speakers, with a male-to-female ratio of 1:1, to evaluate the generated summaries according to fluency and relevance. The judges gave a score from 0 to 4, as detailed in Table 3. We randomly sampled 100 results for each dataset and divided them into four batches. The judges were split into four groups, and each batch of samples was annotated by two groups of judges. For each sample, we took the average of the two ratings for each aspect (fluency or relevance) as the final rating. Within a batch, if the ratings differed substantially between the two groups of judges, a third group of judges was invited to annotate the batch. The judges did not have access to the ground-truth response, and saw only the inputs and the predicted summary.
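The word-overlap metrics above can be illustrated with a compact pure-Python sketch of ROUGE-N recall and the LCS-based ROUGE-L F-measure. It is meant only to make the definitions concrete: tokenization is a plain whitespace split, the weighting β = 1 for ROUGE-L is an assumption, and the reported scores were presumably computed with a standard ROUGE toolkit rather than code like this.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, references, n):
    """N-gram recall of the candidate against a set of reference summaries."""
    cand_counts = ngrams(candidate, n)
    matched = total = 0
    for reference in references:
        ref_counts = ngrams(reference, n)
        # Clipped counts: an n-gram matches at most as often as it occurs in both.
        matched += sum(min(count, cand_counts[g]) for g, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


def lcs_length(a, b):
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure between one candidate and one reference."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(reference), lcs / len(candidate)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
r1 = rouge_n(candidate, [reference], 1)  # 5 of 6 reference unigrams matched
r2 = rouge_n(candidate, [reference], 2)  # 3 of 5 reference bigrams matched
rl = rouge_l(candidate, reference)       # LCS "the cat on the mat" has length 5
print(r1, r2, rl)
```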

Baselines
In this paper, we used ITCH as a component combined with the supervised and unsupervised learning approaches for the MAS task. Therefore, the baselines were chosen as follows: For unsupervised learning methods, LexRank [59] is a textual PageRank-like algorithm that selects the most salient sentences from a reference. Using embedding similarity for sorting, W2VLSTM [60] is an improvement based on LexRank. With the development of a deep neural generation network, Seq3 [4] is proposed to use the "long-short-long" pattern to automatically generate a summary. The above three methods only refer to the unimodal information to summarize utterances, while the task of an abstractive summary with reference to multimodal information is considered to be a more challenging task. Guiderank [39] is a classic method on the MSMO dataset, which is an unsupervised baseline without considering the ITCH framework. MMR [41] with SOTA performance on both MSMO and MMS uses a graph-based ranking mechanism for extraction.
For supervised learning methods, S2S [10] and PointerNet [61] are generation models based on the Encoder-Decoder architecture, where PointerNet can project special tokens to a target summary. With the rise of pre-trained models in the NLP field, UniLM [12] has been shown to perform strongly in the abstractive summarization task. Among supervised frameworks that reference multimodal information, Doubly-Attn [62] uses multiple attention modules for aggregation. MMAF and MMCF [55] are modality-based attention mechanisms that pay different kinds of attention to image patches and text units, which are filtered through selective visual information. Select [40], which uses a selective gate network for the reciprocal relationships between textual and multi-level visual features, is the current SOTA baseline.

Experimental Results and Analysis
We carried out experiments to compare the performance of ITCH with baselines on the MSMO and MMS datasets in metrics: ROUGE/Relevance/Human.
For the results and analysis on the MSMO dataset, there were two types of experimental results, unsupervised and supervised. In terms of resources, uni-modal (uni- in Table 4) only considers textual data, while bi-modal (bi- in Table 4) takes visual and textual data as inputs, which is the category our method belongs to. As Table 4 shows, ITCH outperformed competitive unsupervised and supervised baselines on different metrics (ROUGE and Relevance) and set a new state of the art. Compared with the mainstream unsupervised learning model (MMR), ITCH had an average improvement of 10.67% in word-overlap-based metrics and 4.71% in embedding-based metrics, computed as $\left(\sum_{x \in \mathrm{ITCH}} x - \sum_{x \in \mathrm{MMR}} x\right) / \sum_{x \in \mathrm{MMR}} x$. The former improved more than the latter, which indicates that the textual summary generated by unsupervised ITCH is more accurate and closer to the reference at the word-overlap level. This superiority benefits from our two contrastive objectives, which not only enhanced the relevance of the input text and the output summary but also improved the correlation of the input text-image pair.
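The averaged relative improvement used in this comparison is a ratio of summed metric scores, which the following sketch computes. The function name `relative_improvement` is hypothetical, and the scores in the example are made-up placeholders chosen for easy arithmetic; they are not taken from Table 4.

```python
def relative_improvement(ours, baseline):
    """(sum of our scores - sum of baseline scores) / sum of baseline scores."""
    return (sum(ours) - sum(baseline)) / sum(baseline)


# Hypothetical R-1, R-2 and R-L scores, NOT values from the paper.
ours_scores = [44.0, 22.0, 44.0]
baseline_scores = [40.0, 20.0, 40.0]
gain = relative_improvement(ours_scores, baseline_scores)
print(f"{gain:.2%}")  # 10.00%
```

Because the scores are summed before dividing, metrics with larger absolute values weigh more heavily in the resulting percentage than they would in a mean of per-metric gains.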
A similar situation occurred in the comparison with the mainstream supervised learning model (Select): ITCH performed 4.38% better in word-overlap-based metrics and 2.68% better in embedding-based metrics. This indicates that our cross-modal fusion module can model and understand unaligned multimodal inputs to reinforce the generation of a target summary. In addition, whether supervised or unsupervised, ITCH achieved almost the highest scores in the human evaluation metrics, which account for subjectivity. This demonstrates that the summaries generated by ITCH are more readable and topic-related than those of the baselines. As expected, the unsupervised ITCH performed worse than the supervised one because it lacks large amounts of manually labeled data.
On the MMS dataset, ITCH also outperformed both unsupervised and supervised baselines. As Table 5 shows, ITCH with an unsupervised learning model exceeded all the corresponding baselines in the ROUGE, Relevance and Human evaluation metrics. In particular, our method outperformed the current state-of-the-art MMR [41] by 10.29% in the ROUGE metric and 4.98% in the Relevance metric, which again indicates the advantage of our two contrastive objectives. Compared with the mainstream supervised methods, ITCH retains a clear advantage. With regard to ROUGE, ITCH surpassed MMAF [55] by 4.86% and MMCF [55] by 6.47%. For the Relevance metric, our approach was also superior to MMAF [55] and MMCF [55] by about 3.34% and 3.90%, respectively. We conclude that the cross-modal fusion module offers an overall comprehension of several modalities, which improves the relevance and similarity between the summary and the inputs under the supervised condition.

Ablation Analysis
In this section, we analyze the roles that different factors play in the ITCH framework. There were three aspects studied on the MSMO dataset for the ablation analysis: hyperparameters, the cross-modal fusion module and hybrid contrastive losses.
A prerequisite for a summary to help users accurately acquire information is that the image be related to the target summary. Therefore, following Zhu et al. [39], we use an image-text relevance metric to measure the quality of the generated summary and the effect of the contrastive losses. The metric $M_{sim}$ ∈ [−1, 1] uses visual-semantic embeddings to calculate the cosine similarity between normalized visual and textual features.
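The image-text relevance metric described above reduces to a cosine similarity once both features live in a shared embedding space. The following is a minimal Python sketch under that assumption; the names `l2_normalize` and `m_sim` and the toy vectors are hypothetical, and the visual-semantic embedding model that would produce the features is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def m_sim(visual_feat, textual_feat):
    """Cosine similarity in [-1, 1] between a visual and a textual feature.

    ASSUMPTION: both features were already projected into the same
    visual-semantic embedding space and have equal dimensionality.
    """
    if len(visual_feat) != len(textual_feat):
        raise ValueError("features must have the same dimensionality")
    v = l2_normalize(visual_feat)
    t = l2_normalize(textual_feat)
    return sum(a * b for a, b in zip(v, t))


# Toy features for illustration only.
print(m_sim([0.2, 0.9, 0.1], [0.3, 0.8, 0.0]))
```

Because both vectors are normalized first, the dot product depends only on the angle between them, which is what bounds the score to [−1, 1].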
The effect of the hyperparameters. We tested the impact of different values of two hyperparameters: γ in Formula (3) and τ in Formulas (8) and (9). γ acts as a balancer that controls the value of the activation function ReLU = max{x, 0} for filtering high-relevance scores. According to Table 6, we obtained the best performance under both unsupervised and supervised conditions when γ was set to −0.15. If the value was greater or less than −0.15, the performance was worse. In particular, when γ was set to a greater value than the default, V obtained a larger share of the fusion feature, leading to a greater drop in performance. With regard to τ, a larger value had a negative impact on the result, which may be because the contribution of the cosine similarity to the loss function was reduced. In addition, if τ was set to a smaller value than the default τ = 0.1, the ROUGE and Relevance metrics became worse, while $M_{sim}$ improved. We believe that a smaller τ, together with the loss function, facilitates the optimization of the cosine similarity between the textual feature and the visual feature. This analysis shows that proper hyperparameters play a crucial part in keeping ITCH functioning optimally.
The effect of the cross-modal fusion module. The ablation results are listed in Table 7. If ITCH discards the cross-modal fusion module, the performance decreases noticeably in both the unsupervised and supervised approaches, compared with the original ITCH and the corresponding current state of the art (MMR and Select). In particular, $M_{sim}$ was reduced by 25.41% and 16.85% in comparison to unsupervised ITCH and supervised ITCH, respectively. We conclude that the cross-modal fusion module is pivotal for improving the similarity between the visual feature and the textual feature. Without this module, the performance of ITCH is still close to MMR's and even exceeds Select's, which indicates the superiority of the additional inter- and intra-modal contrastive objectives.
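To make the roles of γ and τ in the hyperparameter analysis more concrete, the following Python sketch illustrates a thresholded ReLU gate and a generic temperature-scaled contrastive loss. Formulas (3), (8) and (9) are not reproduced in this section, so both functions are assumptions: `relu_gate` assumes the gate has the form ReLU(score + γ), and `info_nce` is the standard InfoNCE form rather than the exact ITCH objectives. All names and values are hypothetical.

```python
import math


def relu_gate(score, gamma=-0.15):
    """ASSUMED gate form: ReLU(score + gamma).

    With gamma = -0.15, relevance scores below 0.15 are zeroed out;
    a larger gamma lets more visual information pass the gate.
    """
    return max(score + gamma, 0.0)


def info_nce(sim_pos, sim_negs, tau=0.1):
    """Generic temperature-scaled contrastive loss for one anchor.

    sim_pos:  cosine similarity to the positive sample
    sim_negs: cosine similarities to the negative samples
    tau:      temperature; larger values flatten the logits
    """
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    peak = max(logits)  # subtract the max for numerical stability
    log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denom - logits[0]


# Toy similarities for illustration only.
for tau in (0.05, 0.1, 1.0, 10.0):
    print(tau, round(info_nce(0.8, [0.2, 0.1], tau=tau), 4))
```

In this generic form, a large τ pushes the loss toward log(N) for N candidates regardless of the similarities, which is consistent with the observation that a larger τ weakens the effect of the cosine similarity on the loss.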
Similarly, when removing the inter-loss, the intra-loss, or both, the performance of ITCH dropped across all metrics. Furthermore, the inter-modal loss had a greater influence on $M_{sim}$, whether using an unsupervised or a supervised method, while the intra-modal loss had a stronger influence on ROUGE and Relevance in the unsupervised setting. I/O coherence influenced the fusion feature, which reduced the relevance between the generated summary and the corresponding image. In turn, the consistency of the input text-image pair played an important role in word overlap and embedding similarity. Notably, ITCH with only the inter-loss or only the intra-loss still outperformed the unsupervised and supervised baselines, which indicates the vital function of the extra contrastive losses.

Case Study
To further analyze the ITCH framework and compare it with the baselines, we list the results for one case from the MMS dataset in Table 8. A news article with numerous sentences and one image is provided as input. The text mainly states that Singapore suffers from the Zika virus and the dengue virus, and that the government has introduced many measures to prevent the viruses from spreading. The corresponding image depicts a firefighter misting insecticide indoors. The output in Table 8 contains the target summary for the inputs and the summaries generated by the baselines and by ITCH in the unsupervised and supervised approaches.
For the unsupervised approach, the summary generated by ITCH has the highest coherence with the target summary. Compared with the uni-modal LexRank, ITCH covers all salient information from the input text and image. Both viruses, the obvious symptoms of the disease and the preventive measures appear in the summary generated by ITCH. This demonstrates that our cross-modal fusion module fully utilizes the textual and visual information in the input. Moreover, the structure of ITCH's result is the most consistent among the summaries generated by the unsupervised methods, which suggests that ITCH learns a capacity for narrative logic. It is worth noting that on three metrics, R-1/R-2/R-L, ITCH was superior to both unsupervised and supervised methods. Compared with the ground truth, however, it handled advanced vocabulary and grammar poorly. For example, it could not produce uncommon or complex words such as "Zika" or "mosquito-borne", which appear as <unk> in its output.
With the supervised approach, ITCH generates a more thorough and readable summary that is significantly closer to the ground-truth summary. The result contains more important information than the unsupervised result, such as the phrases "aggressive spraying", "indoor spraying" and "transmission". Unlike the supervised baseline Select, which ignored information from the first paragraph of the input text, our result took all portions of the text into account, reflecting the influence of the I/O contrastive loss. Although ITCH behaves as the state-of-the-art technique in both the unsupervised and supervised settings, there is still room for improvement, for example in handling unknown words.

Input
Text: Zika is primarily spread by mosquitoes but can also be transmitted through unprotected sex with an infected person.
Almost daily downpours, average temperature of 30 degrees Celsius (86 degrees Fahrenheit), large green areas in a populated urban setting makes Singapore a hospitable area for mosquitoes. So Singapore is the only Asian country with active transmission of the mosquito-borne Zika virus, the US, Australia, Taiwan and South Korea have all issued alerts advising pregnant women against traveling to Singapore. Singapore is known to suffer widely from dengue virus, a mosquito-borne tropical disease that triggers high fevers, headaches, vomiting and skin rashes in those infected to a considerable extent and therefore may be mistaken for another.
Singapore's government has a long history of using aggressive spraying, information campaigns and heavy fines for homeowners who leave water vessels in the open, in a bid to control mosquito-borne dengue. Indoor spray, misting and oiling were conducted, and daily misting of common areas is ongoing, hundreds of specialist workers conduct island-wide inspections for mosquito breeding grounds, spray insecticide and clear stagnant water.

Target Summary
Singapore has suffered from the Zika virus and dengue virus, both of them are mosquito-borne disease with high fevers. The government employ aggressive spraying and information campaign to prevent its spread.

ITCH
The Singapore take aggressive spraying, indoor spraying and information campaign to prevent <unk> virus and dengue virus spread. They are <unk> disease with high fevers and transmission.
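The word-overlap comparison between the ITCH output and the target summary in this case can be illustrated with a simplified ROUGE-N computation. The Python sketch below is an assumption-laden approximation: the tokenizer and the functions `tokenize`, `ngrams` and `rouge_n` are hypothetical helpers, and the official ROUGE toolkit applies additional preprocessing (such as stemming) that is omitted here, so the printed values are not comparable to the scores reported in the tables.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into tokens, keeping <unk> and hyphenated words whole."""
    return re.findall(r"<unk>|[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap as (precision, recall, F1)."""
    cand = Counter(ngrams(tokenize(candidate), n))
    ref = Counter(ngrams(tokenize(reference), n))
    overlap = sum((cand & ref).values())  # clipped counts of shared n-grams
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Strings copied from the case study above.
target = ("Singapore has suffered from the Zika virus and dengue virus, both of "
          "them are mosquito-borne disease with high fevers. The government employ "
          "aggressive spraying and information campaign to prevent its spread.")
itch_output = ("The Singapore take aggressive spraying, indoor spraying and "
               "information campaign to prevent <unk> virus and dengue virus spread. "
               "They are <unk> disease with high fevers and transmission.")

for n in (1, 2):
    p, r, f = rouge_n(itch_output, target, n=n)
    print(f"ROUGE-{n}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Each <unk> token in the candidate counts against precision without contributing to recall, which is one way unknown words such as "Zika" lower the word-overlap scores.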

Conclusions
In this paper, we propose the inter- and intra-modal contrastive hybrid (ITCH) learning framework, which learns to automatically align multimodal information and maintains the semantic consistency of input/output flows. We evaluated our framework with unsupervised and supervised approaches on two benchmarks (i.e., MSMO and MMS datasets) for three metrics: ROUGE, Relevance and Human Evaluation. The experimental results on all datasets show that our ITCH consistently outperforms comparable methods, whether with supervised baselines or unsupervised baselines. We further carried out comprehensive ablation studies to confirm that the proper hyperparameters, the cross-modal fusion module and hybrid contrastive losses are essential in ITCH. Furthermore, we showed a successful example from the MMS dataset to provide a more intuitive comparison. In the future, we will improve our model to better understand and summarize complicated vocabulary. Furthermore, we intend to study the multimodal abstractive summarization task on a Chinese dataset.

Data Availability Statement: The datasets (MSMO and MMS) investigated in this work are publicly available at http://www.nlpr.ia.ac.cn/cip/dataset.htm (accessed on 23 November 2021). MSMO corresponds to the "Dataset for Multimodal Summarization with Multimodal Output" proposed in an EMNLP 2018 conference paper [39]. MMS corresponds to the "Dataset for Multimodal Sentence Summarization" proposed in an IJCAI 2018 conference paper [55].