TIPS: A Framework for Text Summarising with Illustrative Pictures

We propose an algorithm to generate graphical summarising of longer text passages using a set of illustrative pictures (TIPS). TIPS is an algorithm using a voting process that uses results of individual “weak” algorithms. The proposed method includes a summarising algorithm that generates a digest of the input document. Each sentence of the text summary is used as the input for further processing by the sentence transformer separately. A sentence transformer performs text embedding and a group of CLIP similarity-based algorithms trained on different image embedding finds semantic distances between images in the illustration image database and the input text. A voting process extracts the most matching images to the text. The TIPS algorithm allows the integration of the best (highest scored) results of the different recommendation algorithms by diminishing the influence of images that are a disjointed part of the recommendations of the component algorithms. TIPS returns a set of illustrative images that describe each sentence of the text summary. Three human judges found that the use of TIPS resulted in an increase in matching highly relevant images to text, ranging from 5% to 8% and images relevant to text ranging from 3% to 7% compared to the approach based on single-embedding schema.


Introduction
The development of deep neural networks (DNN) has revolutionised issues related to the analysis of images and natural language processing [1]. The ability to generate feature vectors (embedding) from both images and texts has greatly facilitated the semantic analysis of these media. Especially interesting are issues of image-text matching to determine the semantic similarity between them. Image-text matching is an important multi-modal task with a wide range of applications [2]. Research in this area using deep neural networks is relatively new, and many of the relevant results have been published in work from within the last three years.

State-of-the-Art on Image-Text Matching
Modern methods of comparing and matching text to images are based almost exclusively on deep neural networks [3,4]. We can distinguish several basic issues that determine how to select machine learning methods to make this process efficient and effective. The first is the appropriate choice of architecture and scale of the neural network. Convolutional Neural Networks (CNN) [5] are commonly developed on a fixed resource budget, and then scaled up for better accuracy if more resources are available. Tan et al. [6] systematically studied model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. They propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. Due to the fact that pre-training is a dominant paradigm in computer vision, [7] image features are commonly retrieved using well-established CNN architectures. The next issue is the application of the image search algorithm. Image Search is a fundamental task playing a significant role in the success of a wide variety of frameworks and applications [8]. An important method to compare semantic similarity between text and images is CLIP Contrastive Language-Image Pre-Training). Conde et al. [9] proposed that a CLIP approach be used for training a neural network on a variety of art images and text pairs, being able to learn directly from raw descriptions about images, or if available, curated labels. Visual attention not only improves the performance of image captioners, but also serves as a visual interpretation to qualitatively measure the caption rationality and model transparency [10]. The use of the CLIP method enabled us to search for illustrative images in which the images do not have ready-made annotations (compare with Joshi et al. [11]). Huang et al. [12] developed an image-text matching approach using a bi-directional spatial-semantic attention network which leverages both the word to regions relation and visual object to words relation in a holistic deep framework for more effective matching. Image-text matching for short (one-sentence) texts has been realised in practice in the Sentence Bert (SBERT) [13] algorithm. It uses the language module BERT [14]. BERT is a language representation model while SBERT generates feature vectors from both image and text and compares them using CLIP. SBERT was developed to optimise previously used solutions such as InferSent [15], Universal Sentence Encoder [16] and SentEval [17]. Vision models trained on multi-modal data sets can benefit from the wide availability of large image-caption data sets. According to [18] CLIP proved to be a reliable method in multi-modal semantic task solving.
Additionally, the latest research findings in the topics of this article can be found in survey papers. Zhu et al. [19] and Rudinac et al. [20] discuss the latest achievements in story summarising. Baltrušaitis et al. [21] and Guo et al. [22] survey the recent advances in multimodal machine learning and present them in a common taxonomy. In reviews [23,24], Gao et al. and Ramachandram et al. present a survey on deep learning for multimodal data fusion to provide readers, regardless of their original community, with the fundamentals of a multimodal deep learning fusion method and to motivate new multimodal data fusion techniques of deep learning. With the rapid growth of social media, users post large volumes of data in various modalities such as text, image, audio, and video. In surveys [25,26], Huddar et al. and Soleymani et al. define sentiment, sentiment analysis, states problems and challenges in multimodal sentiment analysis and finally review some of the recent computational approaches used multimodal sentiment analysis. The Multimodal datadriven approach has emerged as an important driving force for smart healthcare systems. Cai et al. [27] provide a comprehensive survey of existing techniques which include not only state-of-the-art methods but also the most recent trends in the field.

Study Motivation
In this paper, we propose an algorithm to generate graphical summarising of longer text passages using a set of illustrative pictures, which we refer to as Text Summarising with Illustrative Pictures (TIPS). TIPS is an algorithm using a voting process that uses the results of individual "weak" algorithms to produce a single result. The proposed method includes a summarising algorithm that generates a digest of the input document. Then, a sentence transformer performs text embedding and a group of CLIP similarity-based algorithms trained at image embedding finds semantic distances between the images in the illustration image database and the input text snippets. This is followed by a voting process that extracts the most similar images to the text. Both the TIPS algorithm scheme, the voting process algorithm, and the methodology for evaluating and comparing our algorithm with other image-text matching methods are original achievements presented in this work. Our proposed method matches illustrative images to each sentence that was generated by the summarizer algorithm independently. Therefore, the ability of our algorithm to summarize text using a set of illustrative images relies on the power of the summary-generating algorithm only.

Text Summarising
Single-document summarising is the task of automatic generation of a shorter document version while retaining its most important information [28]. Currently, neural network-based algorithms trained on relevant language corpora are used to generate text summaries [29][30][31]. Text summarising typically use a variety of language representation models. Among the most popular and effective models of this type is the BERT (Bidirectional Encoder Representations from Transformers) algorithm [14]. BERT is also based on a neural network, which, among other things, is designed to model the strong connections between words of a language in order to find their representation, which will then be used to generate a summary. Unlike other solutions of this type, BERT is designed to pretrain deep bidirectional representations from unlabelled text.
Text summarizers currently use summarising-specific neural architectures to enhance document-level features. We decided to use the architecture proposed in [28] called BERTSUM. Suppose we have a text X = [x 1 , x 2 , . . . , x n ], where x i , i ∈ [1, n] are sentences of this text. The task of the summarizer is to select n out of the m sentences that this text consists of (n m). In this algorithm, a feature vector is generated for each sentence that is part of the text. Then the whole text is processed by Inter-sentence Transformer: where: B is an output vector of BERT, LN is layer normalisation [32], AO is the attention operation [33] and FFN is feedforward neural network with depth i.
The final layer is a sigmoid classifier. The output from BERTSUM for each of the sentences included in the text generates a value between [0, 1]. The larger the value of the output signal from BERTSUM, the more confident the sentence should be in the summary. The obtained summary S x = [x j 1 , ..., x j m ] is a subset of the original text X and there are m indices j ∈ {1, ..., n}.

Sentence Transformer
Sentence transformers are a group of methods for generating a vector of features that describe a sentence (sentence embedding) [34]. A feature of such descriptive vectors is that the algorithms that generate them minimise the distance between vectors of sentences that have similar semantic meaning. Sentence transformer algorithms use a language model to generate a feature vector from a given sentence, which is then the input argument to further processing. In practice, the sentence transformer uses a deep neural network that recalculates the input vectors so that a distance metric can be used to calculate the semantic distance between sentences.
The method proposed in [13] uses the BERT language model, which we described in Section 2.1.1. The output vectors from BERT are then processed with a deep architecture that uses the triplet loss function [35] for training. With triplet loss, the distance between embedding sentences that have similar semantic meaning is minimised during training and the distance between embedding sentences that have different semantic meaning is maximised.

Image Embedding
Most common image embedding solutions are trained as models to classify large data sets of diverse images, for example ImageNet [36]. Networks of this type can also be used to generate feature vectors of images. In this case, the initial convolutional layers of the network are used, which generate a single, typically several hundred dimensional vector describing the input image. Classifier layers of the network are then not used.

Image-Text Matching
Image-text matching is a group of methods that allows evaluating the semantic similarity between text and image content by measuring this similarity using a given metric. Modern algorithms of this type use sets of methods that we discussed in earlier sections and have much in common with image classification problems. Image classification, which boils down to assigning an image to one of a predefined class, is a very well-studied and widely applied problem, as we pointed out in Section 2.1.3. The problem of determining image class can also be considered more broadly by using methods that will automatically generate descriptions of the contents of images. This is currently done using deep models that combine image embedding with the ability to generate descriptions using recurrent Long short-term memory (LSTM) neural layers such as [41][42][43]. A large survey on this field can be found at [44,45]. Thus, one could use such automatically generated image descriptions with sentences that have been generated by the text summarizer (see Section 2.1.1) to determine the similarity between the text and the various images we have, in order to select the image that minimises the given distance metric [46]. However, this approach has a major drawback: the quality of image descriptions generated by automatic algorithms is not yet perfect, and semantic comparison of texts is a complex issue. This results in a buildup of errors generated by both approaches, which can significantly affect the quality of the entire text-to-image matching method. For this reason, dedicated and specially trained solutions are used for image-text matching, whose individual components originate from the areas of natural language processing, text embedding and image embedding.
A Contrastive Language-Image Pre-training (CLIP) method based on neural network architecture has been is proposed in the paper [47] to evaluate the similarity between text and image. In order to perform CLIP, a similarity assessment algorithm learns a multimodal embedding space by jointly training an image encoder and text encoder to maximise the cosine similarity of the image and text embedding. The utilised batch construction technique used in CLIP is the multi-class N-pair loss [48]. Image-text similarity is calculated as follows: where: L2 is euclidean norm, • is a dot product, I f is image features (embedding), T f is text features (embedding), W I is learned projection of image to embedding (to be trained during CLIP model fitting), W T is learned projection of text to embedding (to be trained during CLIP model fitting), p is a learning rate. The training procedure of CLIP is based on minimisation of cross entropy loss. CLIP utilises Transformer text encoder (embedder) [49] and image embedding algorithm (i.e., Vision Transformer image encoder (embedder) [50], ResNet etc.).

Data Sets
In this subsection we describe the data set that we have utilised in our research.

Text Data Set
The text dataset was obtained from the Brunel University London website (https:// brunel.figshare.com/articles/dataset/4000_stories_with_sentiment_analysis_dataset/7712 540, (accessed on 27 September 2021)). The study used 426 short stories that are equal to or less than 1000 characters in length. Due to this we will call this data set Stories426. The collection consists of humorous stories or short tales with a moral. The authors include Aesop, Ambrose Bierce, James Baldwin or Kate Chopin. All texts have been summarised using a BERTSUM (see Section 2.1.1 for exact information), then divided into sentences. This data set was chosen because of its use of natural language and the variety of stories.
An additional advantage, is that the sentences are written correctly in terms of style and language. Obtaining linguistically correct texts is more difficult with data from social media platforms because their users often use verbal abbreviations, e.g., 'LOL' which means laugh out loud, 'U' which is equivalent to 'you', do not use complete 'sentences or forget about linguistic correctness. For the stories used, the full story is described, which helps in creating a full illustration for the text. A person can also easily verify if the proposed images match the content they are paired with. Another advantage of this data set is that it is freely available to the public.

Image Data Set
The image data set was built from nearly 25,000 (24,996) publicly available naturethemed images, sourced from the Unsplash platform (https://unsplash.com/ (accessed on 27 September 2021)). Embedding for this set of images was generated using several deep neural networks. We have used a 50-layer residual network (ResNet 50, RN50), 101-layer residual network (ResNet 101, RN101), and a four times scaled RN50 according to the EfficientNet scaling rule [6]. We also used the Vision Transformer image encoder mentioned in Section 2.1.3 in order to generate embedding for two of its architectures: ViT-B32 (ViT32) and ViT-B16 (ViT16). The Vision Transformer image encoder, in comparson to state-of-the-art DNN models we mentioned in Section 2.1.3, requires substantially fewer computational resources to train. We have used the pre-trained web weights provided at https://github.com/openai/CLIP/blob/main/clip/clip.py (accessed on 27 September 2021). The use of Transformer text embedding [49] and ViT32 image embedding is the same as the SBERT solution architecture [13].

Proposed Method for Text Summarising with Illustrative Pictures (TIPS)
In Figure 1, we present the data processing pipeline of our proposed algorithm for generating a graphical text summary using illustrative images. It is an algorithm using a properly designed voting process that uses the results of individual "weak" algorithms to produce a single result.
The first step is to prepare a text summarizer using a summarising algorithm. For this purpose, we use the BERTSUM algorithm described in Section 2.1.1. Then, using the sentence transformer, we perform text embedding using the Transformer algorithm [47]. As a set of illustration images O = {o 1 , o 2 , ..., o r } (r is image count of the data set that contains all potential illustrative images) should contain diversified set of images with a wide range of topics if the texts to be analysed are to cover a wide range of topics. If we assume that texts are to cover a narrower range of specialised topics we use a set of images that are characteristic to those very topics i.e., architecture, sport events etc.
Let us first consider the performance of a single image-text matching algorithm. The algorithm uses the given image embedding method (a suitable deep neural network, see Section 2.1.3). The input text is processed by the sentence transformer algorithm (see Section 2.1.2). The image and text feature vectors are used to train CLIP (see Section 2.1.4). This training only needs to be done once. Assuming that we have a given image database to serve as a source of illustrative images, we can perform an equal embedding of this entire database using image embedding, which is part of the given image-text matching method. For the given database, this process also takes place once.
After the text summary is performed, each sentence extracted from the text is processed by a set of image-text matching algorithms (see Section 2.1.4). Image-text matching is performed for each sentence in the text summary separately. For each summary, a vector is generated which coordinates are numbers that are proportional to the semantic similarity between the text and the illustration: see Equation (2). Each sentence will be represented by an illustration image for which the value of (2) reaches a maximum. Evaluation of the quality of image to text assignment is performed using similarity sim clip . Let: be a set of images O ordered by sim clip between sentence x k and image o l ∈ O in descending order. In other words O(x k ) is a set of images where first element has the highest value of sim clip with x k and the last element has the smallest value of sim clip with x k . The ordering (4) depends on the image embedding E we have used. For this reason, we can write more generally: Let O E (x k ) t be an ordered subset of the set (5), consisting of t initial elements of O E (x k ). With r different embedding methods defined Xi = E 1 , ..., E r we can propose an algorithm that will use (5) to create a voting scheme. The goal of this scheme will be to order the sets of images proposed by each embedding in Ξ order according to the sum of the normalised sim E clip : In the case of O E (x k ) t , the similarity between x k and o l is computed as: That means that the similarities between x k and o l for a given E ∈ Ξ are divided by the sum of the similarities of the first p similarities between O(x k ) s and o l ordered according to (5). This operation is done to pseudo-normalise sim E clip so that the individual orderings of O E t have comparable ranges of sim E clip values. Thus, the ordered set O E (x k ) t consists of the sum of the elements included in the individual ones of O E (x k ) t . The criterion for ordering the images o l that are part of O E (x k ) t is the sum of the (7) that has been assigned to a given o l by the individual E ∈ Ξ. If any O E (x k ) t does not contain o l , then we assume that for this E sim E clip (x k , o l ) = 0. In the simplest case where t = 1 voting schema (6) works like majority voting schema. In case two or more images received the same number of votes as the most similar to the given text, one of these most similar items is returned randomly.
As the value of t increases, the voting algorithm (6) will propose O Ξ (x k ) t , which will contain an increasing number of proposed images. The order of these images might differs. Of course, it is always true that: This means that all images that are in the set O Ξ (x k ) t for a given p are also in the set The above Equation (7) defines our proposed text summarising with illustrative pictures (TIPS) method. In summary, for the text X: We can also write (9) as: Summarizer turns paragraph text into sentences and then passes the tokenized sentences to the BERT model for inference to output embedding. That embedding is then clustered with K-Means in order to select sentences that are closest to the centroid. Those closest sentences are candidates for summary [51]. Due to this fact the summary consists of sentences that are already present in the original text.
The principles of the CLIP method in practice limit its effectiveness to single sentences, so it would be ineffective to use it for the entire text [47]. It is possible that embedding of the certain sentences will not create "spherical" clusters with representative centroid. In that scenario the "centre" sentence might be not representative for the whole text, however already published papers proved for some extend that application of K-means clustering for finding most important sentences resulted in relatively high ROUGE scores in comparison to other approaches [52][53][54]. Due to this fact we did not evaluate the quality of the obtained summary.

Evaluation and Comparison of Image-Text Matching Algorithms
Evaluation of image-text matching in terms of semantic similarity of large texts is a very difficult issue because we do not have a data set for which there would be a ground truth. Additionally, if we want to compare results obtained by two or more different image-text matching algorithms that use the clip similarity measure, we need to remember that each of these algorithms is trained independently, and it would be incorrect to compare these similarity measures directly. For this reason, we have proposed a number of coefficients that are useful for evaluating and comparing image-text matching algorithms.
While we cannot directly compare sim clip values between methods, we can examine the semantic distances returned between successive recommended images: where Due to this m 1,2 (x k ) is a difference between sim clip value of the most similar image and the second most similar image. A high value of this index may indicate that the image data set is diverse as well as that the selected most similar image is significantly more similar to the text than the other images that are in the image set while using certain E. The average value of these indexes can also be counted for the entire test set TS: where #(∑ X∈TS n X ) is the cardinal number of the set, that are summaries of each text X contained in the set TS. Equation (11) are statistics for a single text. Equation (12) are statistics for the entire set ST. We can also examine the common part of the set of recommended illustration images for all algorithms included in Ξ: as well as the sum of such sets: Obtained values inform us how much variation there is in the t first recommendations of each of the algorithms included in Ξ.
Another statistic is the cardinality of the set composed of common part of set of images recommended by (6) for t and set of images recommended by (6) for t + 1 divided by the number of all summaries TIPS(X).
The above equation determines how much successive votes are consistent with each other, and can be thought of as a way of determining the stability of the voting process as well as the consistency of the recommendations returned by each component O E (x k ) t . It can be seen that V(Ξ, t) ∈ [0, 1]. The value of V(Ξ, t) equals 0 when: The value of V(Ξ, t) equals 1 when: Counterintuitively, the phenomenon where V(Ξ, t) = 1 is not necessarily an advantageous situation. This situation means that every E ∈ Ξ always returns an identical first image recommendation, which may imply little diversification in the embedding methods used.
A set of following statistics is also worth investigating because they report the effect of the t parameter on the performance of the TIPS algorithm (this will be discussed in Section 4): sim min (X) = min(TIPS(X)) sim max (X) = max(TIPS(X)) sim mean (X) = mean(TIPS(X)) sim med (X) = med(TIPS(X)) where: min is minimal, max is maximal, mean is mean and med is median value of sim clip among all image-text matching calculated by TIPS.
We have also added an evaluation of the obtained image recommendation results by three human judges. One of the evaluators was a co-author of this paper, (T.H.) and two others were persons not directly related to the research presented in this paper and without a background in computer science. To each judge, the computer program presented a sentence generated by the summary algorithm and six illustrative images selected by TIPS and using non-voting single-embedding schema (4) in which E was ViT16, ViT32, RN50, RN101, and RN50x4. Those six illustrative images were displayed in a random order so as to eliminate the situation where the judge assigned some meaning to the order of the images. Judges were informed of the purpose of the study and that the images would be presented in random order. Each judge independently assessed the relationship between the text and each of the six illustrative photographs using a three-point judging rating scale (JS) in the range [0-2] according to the subjective impressions. Each judge made evaluation separately without contacting two others. This approach is similar to [55]: • JS = 0: the image is not relevant to the text; • JS = 1: the image is relevant to the text; • JS = 2: the image is highly relevant to the text.
In the next section, we will present the evaluation results of our method on the data sets discussed in Section 2.2.1.

Results
In order to find illustrative images for each S X according to the TIPS method described in Section 2.3, we have prepared an implementation of the proposed solution.
Our proposed method for finding illustrative images for text was implemented in Python 3.5. In order to generate S X for each of the short stories included in the Stories426 collection described in Section 2.2.2, we implemented the method described in Section 2.1.1 based on the solution proposed by Miller [51]. For this purpose we used the libraries spaCy 3.1, Transformers 4.1, NeuralCoref 4.0, Summarizer 0.0.7, Sentencepiece 0.1.96, py-torch_pretrained_bert 0.6.2. Computations were performed on the Google Colab platform using Torch 1.6 computational libraries.
Embedding of text was done using the Transformer algorithm [49] using the implementation of https://sbert.net/docs/package_reference/SentenceTransformer.html (accessed on 27 September 2021). Embedding of images was done using the methods described in Section 2.1.3 and CLIP similarity computation was done using the algorithm described in Section 2.1.4. We used Pytorch 1.7 and Torchvision cudatoolkit 11. The source codes and data sets of the programs we prepared can be downloaded from https://github.com/JusMia/TIPS (accessed on 27 September 2021).
We have made recommendations using the illustrative images presented in Section 2.2.2 for each text that was in the Stories426 using the TIPS algorithm utilising the voting schema (6). We used the image embedding methods ViT16, ViT32, RN50, RN101, and RN50x4 in the voting algorithm O Ξ (x k ) t (6). We performed calculations for p values with an interval of [1,100]. For the purposes of evaluation we have also calculated illustrative image recommendations using non-voting single-embedding schema (4), in which E was ViT16, ViT32, RN50, RN101, and RN50x4. Figure 2 shows the values of V(Ξ, t) for different p. The separate results were connected by the segments marked in black. We smoothed the resulting plot using spline approximation. We have marked the spline in green. In red we marked four local maxima of the smoothed function V(Ξ, t). For the corresponding values of p we will perform a more detailed analysis later on. In Table 1, we presented values of coefficients M 1,2 and M 1,3 (12) for recommendations O E (x k ) where embedding E used algorithms ViT16, ViT32 (equivalent to the SBERT method), RN50, RN101, RN50x4, and our proposed TIPS algorithm with t-values equal to 17, 32, 53, and 86, which were the local maxima detected previously (see Figure 2). In plots in Figure 3 we have shown the values of the coefficients sim min (X), sim max (X), sim mean (X), sim med (X) for the TIPS algorithm with parameter values t in the range [1,100]. In plots in Figure 4 we have shown the values of the coefficients In(Ξ, t), Un(Ξ, t), M 1,2 and M 1,3 for the TIPS algorithm with parameter values t in the range [1,100].  In Table 2, we have presented the exact values of the coefficients of the TIPS algorithm shown in Figures 3 and 4 for the selected t values.  In(Ξ, t) Un(Ξ, t) sim min (X) sim max (X) sim med (X) sim mean (X) In Figure 5 we have shown example of the illustrative image recommendation results for three example text summaries x k . Images are calculated as O E (x k ) 1 for the embedding algorithms ViT16, ViT32, RN50, RN50x4, RN101 and O Ξ (x k ) 1 for TIPS, p = 32. Figure 5. Sample illustrative image recommendation results for three sample text summaries x k . The texts are: (a) "A Bull once escaped from a Lion by entering a cave which the Goatherds used to house their flocks in stormy weather and at night"; (b) "A fine hide makes an excellent meal for a hungry Dog, but the water was deep and the Dogs could not reach the hides from the bank"; (c) "The next day, dressed in the skin, the Wolf strolled into the pasture with the Sheep".
For the evaluation based on human judges we used half of all the texts from the Stories426 collection (exactly 1089 one -sentence summaries). We did not conduct the evaluation on the entire dataset because the work of each human judge took several hours and half of the dataset should be sufficient to obtain statistically representative results. In Table 3 and Figure 6, we presented the JS averaged over each illustrative image recommendation algorithm. In Table 4, we also counted the correlation between the JS scores obtained by each recommendation algorithm. In this way, we investigated whether there is a correlation between the scores of each algorithm and the semantic interpretation of their relevance by human judges. To count this correlation matrix, we used the JS scores of all judges simultaneously. In Table 5, we presented the results of the correlation analysis between the JS of the individual judges. We counted this correlation matrix to investigate whether there is a correlation between the scores of individual judges. In Figure 7, we presented the distribution of JS ratings given by each judge to each descriptive image recommendation algorithm.  Figure 6. JS averaged over each illustrative image recommendation algorithm (see Table 3).

Discussion
Unfortunately, we have no way to directly compare the results of the similarity function sim clip computed by each of the algorithms we test, because there are no ground truth values. For this reason, we used the values of sim M 1,2 and sim M 1,3 to compare the recommendations of each method.
As can be clearly seen in Equation (11) the TIPS approach with t > 17 has smaller values of M 1,2 and M 1,3 than all other considered algorithms. This fact means that, respectively, the first two or three recommendations given by the TIPS algorithm contain images that are more similar to each other according to the CLIP distance than is the case for algorithms that do not use the voting scheme. According to Figure 4c The shape of the variation of the value of V(Ξ, t) as a function of t is close to logarithmic, which confirms the great diversity when it comes to the recommendations made by the algorithm included in Ξ. We have chosen to highlight a few local maxima that become visible in the spline-smoothed graph and present detailed computations for TIPS at these points in Table 2. Note, however, that the occurrence of these maxima is not universal and may vary depending on the set of illustrative images and text data used. According to the results in Table 2 and Figure 3, the shape of variation of sim min (X) has a step-function character as the parameter t increases. This means that the recommendation of illustrative images that are among the first t highly scored values returned by the selected component recommendation algorithm might also be recommended by the other component algorithms. This illustrates the situation that, given a certain margin t, the component recommendation algorithms return a similar common portion of the best matching images. The value of sim max (X) is also spiking, but it reaches its maximum value much faster than sim min (X). The variability of the coefficients sim med (X) and sim mean (X) have an increasing character close to logarithmic and are very similar in shape. They represent some intermediate state between the extreme statistics sim min (X) and sim max (X). The entire set of graphs shown in Figure 3 shows that as the parameter p increases, the TIPS algorithm recommends illustrative images with increasing similarity values until it comes to maximising the values of the individual statistics based on sim clip . This fact and the shape of plots of sim med (X) and sim mean (X) prove that the t parameter is a predictable scaling factor of the confidence range of the recommendation result obtained by TIPS. Figure 5 visualises examples of the recommendations proposed by the individual component algorithms and the TIPS algorithm for t = 32. Full texts of those stories can be downloaded from https://github.com/JusMia/TIPS/tree/main/stories (accessed on 27 September 2021).
The value t = 32 was chosen because it was the first local maximum V(Ξ, t) for which the values M 1,2 and M 1,3 of TIPS performed better than the individual component algorithms (see Table 1). We can see that each of the illustrative images proposed for texts by the individual component algorithms has some features that make it similar to the given text. An interesting case is sentence (b), in which ViT16 and RN50 seem to be very loosely related to the text and do not capture the semantic meaning of it. It is noteworthy that the TIPS recommendation O Xi (x k ) 1 need not be among the O E (x k ) 1 of the individual component algorithms. The TIPS recommendation t = 32 for sentence (a) is the same as the recommendation for E = ViT16 and E = Vit32 and for sentence (c) for E = RN101. In the case of sentence (b), TIPS t = 32 proposed a different illustrative image, which is semantically consistent with the content of the sentence but not in the set of first recommendations of the individual component algorithms.
The evaluation performed by human judges confirmed that the average JS value is the highest for the TIPS algorithm: see Table 3 and Figure 6. The mean JS for TIPS is 0.96, 0.98 and 0.82 while for a non-voting single-embedding schema (4) with E ViTB32 it was 0.75, 0.81 and 0.66. Also, the results shown in Figure 7 confirm that individual judges are more likely to rate the recommendations made by TIPS as being relevant or highly relevant to the text. Human judges found 25%, 30% and 21% TIPS recommendations highly relevant and 46%, 39% and 41% relevant to text. In the case of non-voting single-embedding schema (4) with E ViTB32 18%, 22% and 16%, recommendations were found highly relevant and 38%, 36% and 34% recommendations were found relevant. According to the results in Table 4, the average value of JS is medium positively correlated between all considered algorithms. Correlation varies from 0.41 to 0.57. This is an important found because it indicates that from the judges perspective all considered algorithms work in similar manner proposing correlated images to subject of the text. As can be seen in Table 5, there is a moderate (0.66 between judge 2 and judge 3) and strong positive correlation (0.76 and 0.77) between judges JS. That means that all three human judges have evaluated algorithms results in a similar manner. The evaluation results showed some limitations of the proposed method due to the fact that it works on a fixed set of image data. In our case the Unsplash Lite dataset nature-themed images might be not suitable for some topics. The judges estimated that between 29% and 39% of the images selected by TIPS were not relevant to the text.
However, this is a much better result than when using the non-voting single-embedding schema where the number not relevant to the text was between 41% to even 56%.These results show that using a method based on voting schema allows us to increase the quality of matching evaluated with JS.

Conclusions
Designing algorithms to suggest illustrative images for text is a very advanced and relatively new research topic. Based on the results presented in the previous section of our paper, we can conclude that our proposed algorithm allows us to make suggestions of illustrative images that are to some extent semantically consistent with the content of the summaries' sentences of the text to which they relate. The recommendation quality of the TIPS algorithm is a function of the set of illustrative images that have been selected to illustrate the text and the efficiency of sub-recommendation algorithms used by TIPS in the voting schema. An equally important aspect of TIPS is the algorithm that performs text summarization. Since the selection of illustrative images is based on applying a voting schema to the individual sentences included in the summary, this represents a potential weakness in our algorithm if unrepresentative sentences are selected. Regardless, voting schema for individual sentences proved to be a very effective approach. In this paper, we presented a number of coefficients that can be used to test whether a recommendation algorithm O Xi (x k ) t meets its expectations. We have done this by presenting some benchmark solutions based on a Stories426 text database and a set of images described in Section 2.2.2.
Based on the evaluation of the results conducted by the three human judges, it can be concluded that the use of the TIPS voting scheme increased the accuracy of matching illustrative images to texts. The judges estimated that the use of TIPS resulted in an increase in matching highly relevant images to text ranging from 5% to 8% and relevant to text ranging from 3% to 7% compared to the approach based on single-embedding schema with E ViTB32, which was the best evaluated algorithm from the single-embedding schema group.
We have shown that the TIPS algorithm allows the integration of the best (highest scored) results of the different recommendation algorithms by diminishing the influence of images that are a disjointed part of the recommendations of the component algorithms. Our computations and experiments can be replicated as we publish the full source codes of both the TIPS algorithm and the entire evaluation process of our approach.
There are many more available image datasets, including common sense ones which could be useful for story understanding in further research including COCO [56] and the Open Images Dataset [57].
An issue that we believe is worthy of future investigation is the feasibility of applying text-based image generation methods at the end of the image processing pipeline. We anticipate that the use of a sufficiently large database of reference images in conjunction with generative adversarial networks (GANs) may allow even better matches of textillustrating images to be produced [58]. Generating images is an alternative approach to that presented in [59], where a computer method automatically selects already existing images from an album and places them in suitable contexts within a body of text. Using a GAN may result in the ability to generate realistic images instead of picture book-style drawings [60]. It may also be valuable to further validate the resulting illustrative images by re-generating text based on them and comparing it to the original sentences of the summary, for example using the methods described in [61]. Those topics are certainly worthy of further research.