3.1. Wikimedia Commons Dataset
Wikimedia Commons (https://commons.wikimedia.org/wiki/Main_Page, accessed from 1 January 2020 through 13 April 2020) is a repository of free-to-use images that is part of the Wikimedia Foundation. Wikimedia Commons files are used across all Wikimedia projects in all languages, including Wikipedia, Wiktionary, Wikibooks, Wikivoyage, Wikispecies, Wikisource, and Wikinews, or are downloaded for offsite use. It comprises approximately 65 million images that take up about 250 TB of space. The images are also accompanied by captions, descriptions, and timestamps.
To retrieve the images, one must send queries to the Wikimedia Commons website. To this end, we used two different sets of query words to construct the datasets. To retrieve the entire dataset, the dictionary of the BERT model [1] is used, while the subset primarily used in this work is retrieved with the words of the UWA MRC psycholinguistic dataset.
The UWA MRC Psycholinguistic Dataset [17] contains 98,538 words and their properties, such as type, meaningfulness, concreteness, part-of-speech, familiarity, and many more. The concreteness scores used in this research are derived from merging the two datasets provided by References [51,52].
In this dataset, 4293 out of 98,538 words have a concreteness rating assigned by human annotators. Annotators are asked to rate the concreteness of words on a scale from 1 to 7 (inclusive), where a higher score indicates a more concrete word. The mean of all annotators' scores is the final concreteness rating of the word, which is then scaled to a range between 100 and 700. Overall, the most abstract term in the dataset is "as", with a rating of 158, and the most concrete word is "milk", with a score of 670. The mean rating of all terms is 438, and the standard deviation is 120.
To successfully integrate this dataset into our task, some processing is required. Although the UWA MRC Psycholinguistic dataset successfully identifies the concreteness of words, it considers the words in isolation, unlike this work, where contextual embeddings and language models regard words in their context. Therefore, all stop-words (as defined by the NLTK library) are removed from the dataset, since they can appear in various contexts with different levels of concreteness and can therefore lead to misleading results. It is observed from the dataset that the lowest-rated words are usually stop-words, such as "as", "therefore", and "and". Thus, many of the lowest-rated abstract words are removed. The most abstract word in the dataset after the removal is "apt", with a rating of 183. The final version of the dataset contains 1674 abstract and 2434 concrete words.
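As a rough illustration of this preprocessing step, the following sketch filters NLTK stop-words and splits the remaining words by their concreteness rating. The cut-off value of 438 (the dataset mean) is an assumption made for illustration only; it is not necessarily the threshold used to obtain the 1674/2434 split.

```python
# Minimal sketch of the word filtering step described above.
# Requires: nltk.download("stopwords")
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))
CONCRETENESS_THRESHOLD = 438  # assumed cut-off between abstract and concrete

def partition_words(ratings):
    """ratings: dict mapping word -> concreteness score (100-700)."""
    abstract, concrete = [], []
    for word, score in ratings.items():
        if word.lower() in STOPWORDS:
            continue  # drop stop-words such as "as", "and", "therefore"
        (concrete if score >= CONCRETENESS_THRESHOLD else abstract).append(word)
    return abstract, concrete
```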
For each word, a query is sent to the Wikimedia Commons website with a maximum threshold of 1000 results. As a result, we obtain images together with their corresponding captions, descriptions, and concreteness labels.
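A minimal sketch of such a retrieval step is shown below, using the public MediaWiki search API of Wikimedia Commons. The endpoint and parameter names follow the standard API; the actual crawler used to build the dataset may differ.

```python
# Minimal sketch of retrieving file titles from Wikimedia Commons for a query
# word via the public MediaWiki API (standard parameters; illustrative only).
import requests

API_URL = "https://commons.wikimedia.org/w/api.php"

def search_commons(word, max_results=1000):
    titles, offset = [], 0
    while len(titles) < max_results:
        params = {
            "action": "query",
            "list": "search",
            "srsearch": word,
            "srnamespace": 6,   # the File: namespace
            "srlimit": 50,
            "sroffset": offset,
            "format": "json",
        }
        data = requests.get(API_URL, params=params).json()
        hits = data.get("query", {}).get("search", [])
        if not hits:
            break
        titles.extend(hit["title"] for hit in hits)
        if "continue" not in data:
            break
        offset = data["continue"]["sroffset"]
    return titles[:max_results]
```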
Figure 1 shows the number of images returned for each query word in the UWA MRC psycholinguistic dataset. As seen from the graph, most of the query words returned fewer than 100 results despite the large threshold. Only around a hundred words have more than 500 images associated with them. The number of samples collected is shown in Table 1. More than 43 million images are collected using the dictionary of BERT, while a considerably smaller set of images is collected using the words in the UWA MRC psycholinguistic dataset (the subset primarily used in this work). We can also observe that not all images have a description and/or caption associated with them. Some images contain only captions, some contain descriptions but no caption, and some contain no textual information at all. In total, 630,000 images contain captions, and approximately 2 million images contain descriptions. There is an overlap between the two sets, which means that some images contain both captions and descriptions.
The retrieved images come in many formats, such as .jpeg, .jpg, .jpe, .png, .apng, .gif, .tif, .tiff, .xcf, and .webp, and many image modes, such as RGB (3 × 8-bit pixels, true color), CMYK (4 × 8-bit pixels, color separation), I (32-bit signed integer pixels), and I;16 (16-bit unsigned integer pixels). Although many of these formats and modes are supported, we eliminated some of them. Images with the extensions .xcf and .webp are filtered out because mainstream image processing libraries do not support them. In addition, images with mode I (and other I variants, such as I;16, I;16L, I;16B, and so on) are eliminated because they are single-channel image modes, whereas the neural network models that process these images require multi-channel inputs. Nearly 26,000 images are eliminated by this filtering. In the final version of the dataset, there are approximately 603,000 images with captions, of which 177,000 belong to abstract concepts and 425,000 belong to concrete concepts.
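The following sketch illustrates this filtering step, assuming Pillow is used to inspect image modes; the exact rules applied when building the dataset may differ.

```python
# Minimal sketch of the extension/mode filtering step (assumes Pillow).
from pathlib import Path
from PIL import Image

REJECTED_EXTENSIONS = {".xcf", ".webp"}

def keep_image(path):
    if Path(path).suffix.lower() in REJECTED_EXTENSIONS:
        return False
    try:
        with Image.open(path) as img:
            # Drop single-channel integer modes such as "I", "I;16", "I;16B".
            return not img.mode.startswith("I")
    except OSError:
        return False  # unreadable or unsupported file
```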
Many images in Wikimedia Commons have a very high resolution and therefore require a large amount of storage space. In addition to the filters applied above, a resize operation is performed to cope with this storage problem. All images are converted to the fixed input resolution expected by the image models (GoogleNet [53], VGG [54], Resnet [55]).
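A minimal sketch of the resize step is given below. The target size of 224 × 224 is an assumption based on the standard input resolution of GoogleNet, VGG, and Resnet, not a value confirmed by the text.

```python
# Minimal sketch of the resize step; 224 x 224 is an assumed target size.
from PIL import Image

TARGET_SIZE = (224, 224)  # assumed model input resolution

def resize_image(src_path, dst_path):
    with Image.open(src_path) as img:
        img.convert("RGB").resize(TARGET_SIZE).save(dst_path)
```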
Figure 2 shows some example images and their corresponding captions and descriptions from the collected Wikimedia Commons dataset. The selected images have both captions and descriptions, except for the bottom-left image, which lacks a description.
One thing to observe from these images is that the images and the texts convey different information about the relationships between concepts. For example, there is no textual information in the top-left image, neither in the caption nor in the description, about the buildings that can be seen in the image. However, streets are primarily located near buildings (almost 70% of all images from Wikimedia Commons contain buildings when searching for the keyword "street"), which is captured by the image. Therefore, the system can learn relationships between concrete concepts, such as "street" and "building", from the pictures without relying on the text. Conversely, the image contains no clue about its location, but it is clear from both the caption and the description that it was taken in Mogadishu, Somalia. In the same vein, in the bottom-left image, there is no mention of a sea/lake in the text, but the lighthouse and the sea/lake can be seen together in the image (as they co-occur almost without exception in real life), which will help the model learn their relationship better. Thus, training a language model with both images and text can improve its performance.
Although the collected dataset contains both captions and descriptions, only the captions are used to train the multi-modal language model. The reason is two-fold. First, we observed that descriptions in Wikimedia Commons are unclean. They include much additional text, such as copyright notices, information about the photographer, or information about how the photograph was taken (such an example can be seen in the last sentence of the top-right image of Figure 2). Captions, on the other hand, are already clean and contain information only about the picture itself. Because descriptions would require tedious cleaning, we relied on captions.
The second, and most important, reason is image-text alignment. Captions are written to describe the images briefly, without giving any other information or making any further comment that would be classified as common-sense or real-world knowledge. Descriptions, by contrast, contain much information that cannot be seen in or inferred from the images. Although these additional pieces of knowledge can be essential and valuable in other tasks, they break the image-text alignment and lead to learning noisy contexts in language modeling. If we take the top-right image in Figure 2 as an example, we can see how this can affect the language models. The description of the top-right image provides many words that are semantically similar to the context of the image, which is sheep lounging in a field, such as "breeding", "slaughtered", and "vegetation". However, it also provides many different or unrelated words, such as "castle", "ruin", and "municipality", which have very little to do with the image itself. Consequently, this leads to learning from accidental relationships, for example, between the context of "sheep" and the context of "municipality". For this reason, captions are used in all language modeling tasks in this work to provide better image-text alignment in the training samples.
There have been several other multi-modal datasets proposed in the literature that consist of image-text pairs, such as Flickr [
56], MS COCO [
57], Wikipedia, British Library, and ESP Game [
58].
Table 2 shows the collected dataset in comparison with these multi-modal datasets. The Flickr and MS COCO datasets contain image-caption pairs, while the Wikipedia dataset provides the images in Wikipedia together with their corresponding articles. The British Library (BL) book dataset, on the other hand, contains historical books and the pictures depicted in them. Finally, the ESP Game dataset consists of five words per image, labeled by human annotators. Although both the Wikipedia and BL datasets provide much longer texts, they lack the image-text alignment of caption datasets. Therefore, caption datasets, such as MS COCO, Flickr, or the dataset proposed in this work, are better suited to the task of multi-modal language modeling. Compared with these image captioning datasets, the size of the collected dataset is much greater. As deep neural representations have massive data requirements, it is preferable to have such a large amount of data. Recently, the WIT [59] dataset was also proposed, with a large number of image-text pairs that can be used for multi-lingual, multi-modal pre-training. It contains 11.4 million unique images with captions and descriptive text from Wikipedia articles in various languages. Among them, 3.98 million images have textual information in English, of which 568,000 have captions. In addition to captions, the collection also includes contextual data, such as page titles, page descriptions, and section titles. However, the most significant benefit of the proposed dataset is the concreteness label provided for each image-text pair, which can be very useful for various tasks, especially multi-modal language modeling. The other datasets mentioned in this section, including WIT, do not contain that information.
3.2. Model
The overall architecture of the proposed model can be seen in
Figure 3. The model comprises three main parts: a text processing part, an image processing part, and a fusion mechanism where the outputs of the text and image models are combined. Each part is explained below in its respective subsection.
3.2.1. Text Model
In this work, BERT is primarily used for processing the text input, while DistilBERT is also utilized in some of the experiments.
BERT [
1] is a neural network model that uses a bidirectional transformer architecture [36] with a self-attention mechanism to learn contextual word embeddings. It has multiple layers of transformers (12 in BERT-base, 24 in BERT-large), where each layer has several attention heads (12 in BERT-base) that span the entire sentence both right-to-left and left-to-right, learning "where to look" by producing probabilistic weights for each word.
Unlike earlier language modeling approaches, BERT does not use next-word prediction as an objective. Instead, it uses two training objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). For the MLM objective, randomly selected words are occluded from the model and replaced with mask tokens, and the model tries to predict the masked words as the training objective; their original identities are hidden from the model, since revealing them would bias the prediction. Using MLM enables the model to learn contextual dependencies among words very successfully: the embedding of a word is computed depending on the surrounding terms instead of using the same vector in the embedding space for every instance of that word. For the NSP objective, the model tries to predict whether two sentences provided to it belong to the same context or not. This helps BERT consider multiple sentences as context and represent inter-sentence relations.
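As a simplified illustration of the MLM objective, the sketch below masks a random subset of tokens and keeps their original values as prediction targets. The 15% masking rate follows the original BERT paper, and the 80/10/10 replacement scheme used in practice is omitted for brevity.

```python
# Simplified sketch of MLM masking: masked positions become prediction targets.
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    masked, labels = [], []
    for token in tokens:
        if random.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)   # the model must predict this token
        else:
            masked.append(token)
            labels.append(None)    # not used in the loss
    return masked, labels
```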
In addition to the token (word) embeddings, BERT also uses segment (sentence) embeddings and position embeddings (the words' positions in segments) as input. While the segment embedding determines which sentence a word is in, the positional embedding encodes word order. A word's input representation is therefore fed to the model as the sum of its token embedding, segment embedding, and positional embedding. This input structure has many benefits: positional embeddings make the model aware of word order, while segment embeddings support the NSP objective. In addition, giving multiple sentences as input helps BERT integrate easily into downstream tasks that require inter-sentence connections, such as Question Answering and Natural Language Inference (NLI), without requiring any additional architecture.
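A minimal sketch of this input composition is shown below; the vocabulary size, maximum length, and hidden dimension correspond to BERT-base and are used here only for illustration.

```python
# Minimal sketch: a BERT input is the sum of token, segment, and position embeddings.
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768  # BERT-base values, illustrative
tok_emb = nn.Embedding(vocab_size, hidden)
seg_emb = nn.Embedding(2, hidden)              # sentence A / sentence B
pos_emb = nn.Embedding(max_len, hidden)

def embed(token_ids, segment_ids):
    # token_ids, segment_ids: LongTensors of shape (batch, seq_len)
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)
    return tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)
```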
To integrate BERT into downstream tasks, an additional fully connected layer is used on top of the transformer layers to predict the given text's class instead of the target (masked) word. Usually, the Wikipedia dataset is used to pre-train the model on the MLM and NSP objectives. The resulting parameters are then fine-tuned on the downstream task with the addition of the aforementioned fully connected layer.
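For illustration, such a classification head can be attached to a pre-trained BERT encoder as in the following sketch, which uses the Hugging Face transformers package; this is not necessarily the exact implementation used in this work.

```python
# Minimal sketch of adding a classification head on top of pre-trained BERT.
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(outputs.pooler_output)  # class logits
```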
In this study, we performed some tests using the DistilBERT language model. DistilBERT [
60] is based on the original BERT model. It is a more efficient version of BERT, at the expense of a minor loss in classification performance: it retains about 97% of BERT's performance while using about 40% fewer parameters. To accomplish this, the authors use knowledge distillation, where a small model is trained to reproduce the behavior of a larger model (DistilBERT and BERT, respectively, in this case). Knowledge distillation aims to make the student model (DistilBERT) predict the same values as the teacher model (BERT) using fewer parameters. This way, the knowledge learned by the teacher model can be transferred to more efficient student models. The parameter reduction from BERT to DistilBERT comes from the removal of some of the transformer layers in BERT. The authors of DistilBERT show that some of the parameters of BERT are barely used in prediction and therefore do not contribute to learning downstream tasks. Consequently, they suggest removing some layers and using the knowledge distillation technique to create a more efficient language model.
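A minimal sketch of a distillation loss is given below: the student is trained to match the teacher's softened output distribution. The temperature value is illustrative, and DistilBERT combines this loss with masked language modeling and cosine embedding losses.

```python
# Minimal sketch of a knowledge distillation loss (soft-target matching).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence between softened distributions, scaled by T^2 as is customary.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```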
3.2.2. Image Model
We used Resnet [
55] as the image model due to its success in many image processing tasks. It is a very deep neural network model that relies on a convolutional neural network architecture. At the time it was published, it was the state-of-the-art model in the ImageNet [
61] object classification challenge.
Resnet has several variations in network depth: the 34-layer Resnet34, the 50-layer Resnet50, the 101-layer Resnet101, and, finally, the largest model, the 152-layer Resnet152. Each layer consists of several 1 × 1 and 3 × 3 convolutions. Each model begins with an initial convolution and pooling operation before the first residual layers and ends with an average pooling operation after the last layer.
Naively stacking many layers in deep neural networks does not immediately lead to better results; instead, it causes a performance degradation problem: as the depth of a model increases, training error rises and accuracy saturates. To deal with this issue and build substantially deeper networks, the authors needed a workaround and introduced shortcut connections called residual connections. These shortcut connections are placed after every two layers in the architecture, propagating the inputs to the outputs of those two layers. They are parameter-free, meaning that they do not perform any operation on the inputs, such as pooling, convolution, or multiplication, and therefore contain no learnable parameters. It is shown that these shortcut connections can overcome the performance degradation problem in very deep neural network architectures, making models such as Resnet very successful at stacking many layers and capturing more features than prior models.
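The following sketch shows the core idea of a residual block with a parameter-free shortcut; it is a simplified illustration rather than the exact Resnet block, which also uses batch normalization and, in the deeper variants, 1 × 1 bottleneck convolutions.

```python
# Simplified residual block: the input is added, unchanged, to the output of
# two convolutional layers (the parameter-free shortcut connection).
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # shortcut: propagate the input to the output
```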
In this work, Resnet152 is used because it outperforms the smaller Resnet models, and the Wikimedia Commons dataset was large enough to tune such a large model.
3.2.3. Text-Image Combination Method
Combining multiple modalities can be problematic and risks breaking the semantic relationships of words learned by the individual models. Thus, many studies in this field focus on the fusion of modalities.
We used attentive pooling networks [
62] to combine the text and vision parts of the model. It is a two-way attention mechanism that is aware of both modalities and jointly learns to attend over them through matrix multiplications and pooling operations.
Attentive pooling takes the hidden states of each word in BERT as textual input and the last layer of Resnet, in the form of a matrix, as visual input. These inputs are multiplied with a matrix U of learnable parameters and passed through an activation function. The result is a single matrix with visual features on the rows and textual features on the columns. This representation scheme allows features from different modalities to be jointly represented in a single matrix, over whose rows and columns a max-pooling operation is performed to find the most important features conditioned on the other modality. The outcome of the attentive pooling mechanism is two vectors, one summarizing the visual features and one summarizing the textual features. For fine-tuning this model on downstream tasks, these two outputs are concatenated and passed through an additional fully connected layer to reduce the dimension to the number of classes.
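A sketch of this two-way attentive pooling step is given below. The dimensions and the softmax-normalized pooling scores are illustrative; the exact projection sizes and normalization in the proposed model may differ.

```python
# Sketch of two-way attentive pooling over text and image features.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, text_dim, image_dim):
        super().__init__()
        # Learnable interaction matrix U between the two modalities.
        self.U = nn.Parameter(torch.randn(text_dim, image_dim) * 0.01)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, M, text_dim)  -- BERT hidden states
        # image_feats: (batch, N, image_dim) -- flattened Resnet feature map
        G = torch.tanh(text_feats @ self.U @ image_feats.transpose(1, 2))
        text_scores = torch.softmax(G.max(dim=2).values, dim=-1)   # (batch, M)
        image_scores = torch.softmax(G.max(dim=1).values, dim=-1)  # (batch, N)
        r_text = (text_scores.unsqueeze(-1) * text_feats).sum(dim=1)
        r_image = (image_scores.unsqueeze(-1) * image_feats).sum(dim=1)
        # Concatenated representation fed to the task-specific classifier.
        return torch.cat([r_text, r_image], dim=-1)
```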
3.3. Multi-Modal Language Model Training
The idea of pre-training neural language models is borrowed from the advances in image processing models [
32]. It is shown in both vision and text models that pre-training a model on a preliminary image/text understanding task improves the performance vastly.
For image processing, the pre-training task is usually the object classification task on the ImageNet dataset [61]. The ImageNet dataset has 1.2 million images hand-labeled into 1000 categories. The respective models are trained to predict the objects in each image by adding a fully connected layer on top to reduce the feature vectors' size to 1000. The aim here is to teach the model basic image understanding: identifying objects and entities in images. Many vision models are even able to differentiate images of 120 different dog breeds in the ImageNet dataset, such as "Australian terrier" and "Airedale terrier". They manage to do this by using the shapes and colors of entities in the pictures.
The process is similar for language models, the only difference being the pre-training objective. Earlier models (before BERT) used next-word prediction on huge unlabeled corpora, such as Wikipedia and Common Crawl text. The aim was to predict the next word given the previous words. Starting from BERT onward, the pre-training objective changed from next-word prediction to masked language modeling. This method allowed text models to successfully acquire language understanding by training them on massive datasets containing billions of words. They learned the meanings and semantic/syntactic relations of words (due to the distributional hypothesis), which are fundamental to any downstream task.
Once the pre-training objective is completed and the image/text model has gained basic image/language understanding, the last fully connected layer is removed from the model and replaced with an appropriate classification layer for the task at hand. The model is then fine-tuned on the downstream task. For image models, downstream tasks can be object detection, semantic segmentation, etc., while, for textual models, they include sentiment analysis, sentence classification, natural language inference, and so on.
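For the image side, this replacement of the final layer can be sketched as follows; the 10-class head is a hypothetical downstream task used only for illustration.

```python
# Minimal sketch of re-purposing a pre-trained image model: the ImageNet
# classification layer is replaced by a task-specific head before fine-tuning.
import torch.nn as nn
from torchvision import models

model = models.resnet152(pretrained=True)       # pre-trained on ImageNet
model.fc = nn.Linear(model.fc.in_features, 10)  # hypothetical 10-class task head
```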
In this work, we adopt a novel multi-modal pre-training objective. The idea is inspired by advances in cognitive psychology. It is shown that language acquisition in children starts with experiential information and continues with textual information [
11,
12]. As Kiela et al., in 2015 [63], stated, perceptual information is more relevant for, e.g., elephants than it is for happiness. In other words, we first learn language through images, acquiring concrete concepts, and then we start learning abstract concepts from textual sources.
Advancements in computational linguistics also reinforce this idea by showing that concrete examples in language are easier to learn, while abstract ones are more challenging. Hessel et al., in 2018 [
64], showed that the more concrete the downstream task gets, the easier it becomes for language models. Bruni et al., in 2014 [
38], showed that the semantic/syntactic similarities of concrete examples on the MEN dataset are easier to learn, while the abstract words can get ambiguous. They prove this by showing that the concrete examples have a
Spearman correlation rank, while the abstract examples have
(contributing to an overall
).
To adopt this learning scheme in this project, the Wikimedia Commons Dataset (see Section 3.1) is divided into two categories: abstract samples and concrete samples. We determined concrete/abstract examples based on the concreteness levels of words from the UWA MRC Psycholinguistic Database. First, we fed the image model the concrete examples. Then, we trained the textual model with all of the samples, concrete and abstract combined, in a curriculum learning fashion [
15,
16]. Therefore, the learning model mimics humans through this pre-training process.
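A high-level sketch of this two-stage pre-training order is given below; `train` is a hypothetical helper standing in for the actual (unspecified) training procedure.

```python
# Hypothetical sketch of the curriculum-style pre-training order described above.
def curriculum_pretrain(image_model, text_model, concrete_samples, abstract_samples):
    # Stage 1: the image model is fed the concrete (perceptually grounded) samples.
    train(image_model, concrete_samples)
    # Stage 2: the textual model is trained on all samples, concrete and abstract.
    train(text_model, concrete_samples + abstract_samples)
```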