Data Descriptor
Peer-Review Record

#PraCegoVer: A Large Dataset for Image Captioning in Portuguese

by Gabriel Oliveira dos Santos, Esther Luna Colombini and Sandra Avila *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 1 December 2021 / Revised: 29 December 2021 / Accepted: 15 January 2022 / Published: 21 January 2022
(This article belongs to the Section Information Systems and Data Management)

Round 1

Reviewer 1 Report

This is an interesting paper and project, and it has good potential for being useful not only as an image captioning dataset in Portuguese, but also as a methodological description. The paper is logically and methodologically mostly sound. There are, however, a few points I would like to draw the authors’ attention to:

 

1) The point is made that PraCegoVer has a larger reference length and standard deviation than other typical similar datasets, but it is not clear why this is so, or whether it was intentional. Since datasets like this in Portuguese did not really exist before, would it not make sense to make the first one more similar to (and not more challenging than) similar datasets in other languages? Or is it perhaps the case that these characteristics of the dataset are not by design but simply caused by the data collection method, perhaps also illustrating consistency problems in relying on the general public for data provision? In any case, I would recommend the authors briefly describe and critically evaluate the implications of these aspects in the manuscript.

 

2) The authors describe several steps taken to reduce memory consumption, but it is not immediately evident from the manuscript why this was done – how much memory would have been consumed without the reduction, and how effective precisely were the memory-reduction steps? What kind of computers were available, and could the availability of more computing resources have reduced the need for memory-saving workflow changes, as such changes are at least theoretically always a potential error source?

 

3) The idea of using public Instagram posts is a good one, although it of course results in many inconsistencies requiring a lot of data cleaning, as the authors also demonstrate. The authors give some reasons for not using Facebook, but what about Twitter? It is mentioned as an option but not described further.

 

4) It does not seem to be possible to open "dataset.json" from Zenodo. Is this because it is only available on request? It would be useful to be able to see the structure of this file. The authors mention that the dataset is going to be available for download from Zenodo, but why with restricted access? Wouldn’t open access (with an appropriate license) be more productive towards the larger goals of the work and its utilization within the scientific community? What about the processing code used in creating the dataset - will that be made available as well? This would probably also be useful to many.

 

5) The important pre-processing phase could perhaps be described in some more detail. If not in the main text, then as supplementary information?

 

6) Minor detail: the authors write the hashtag as #PraCegoVer throughout the text, whereas in Figure 2 it is #pracegover. Perhaps the authors could comment on case sensitivity (or lack thereof) of Instagram hashtags?

Author Response

  1. The point is made that PraCegoVer has larger reference length and standard deviation than other typical similar datasets, but it is not clear why this is so, and was this the aim on purpose. Since datasets like this in Portuguese didn’t really exist before, would it not make sense to make the first one more similar (and not more challenging) than similar datasets in other languages? Or is it perhaps so that these characteristics of the dataset are not by design but simply caused by the data collection method, perhaps also illustrating consistency problems in using the general public for data provision? In any case, I would recommend the authors shortly describe and critically evaluate the implications of these aspects in the manuscript.

Response

The well-known image captioning datasets in the literature (e.g., MS COCO, Flickr30K, VizWiz) are created using crowdsourcing platforms, on which human annotators write the captions according to a protocol. Although this annotation process yields higher-quality captions, it is expensive. Hence, we decided to collect data freely annotated by the followers of the PraCegoVer movement, which encourages people to post images tagged with #PraCegoVer and add a short description of their content. This is a cheap way to create a dataset; however, it raises problems inherent to the data source, Instagram. Unlike typical datasets in this literature, PraCegoVer is a dataset in the wild, i.e., people do not follow any protocol when writing captions. Thus, PraCegoVer has a larger reference length and standard deviation than other similar datasets. Nevertheless, we introduced a method to tackle this problem in our paper “CIDEr-R: Robust Consensus-based Image Description Evaluation”, published in the Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT@EMNLP 2021).

  2. The authors describe several steps that were taken to reduce memory consumption, but it is not immediately evident from the manuscript why this was done – how much memory would have been consumed without the reduction, and how effective precisely were the memory-reduction steps? What kind of computers were available, could the availability of more computing resources have reduced the need to make memory-saving workflow changes, as such changes are at least theoretically always a potential error source?

Response

Our algorithm for removing duplicates (Algorithm 1) requires distance matrices for images and texts. Constructing such matrices has quadratic complexity in time and space, meaning that both the time and the memory required to execute the algorithm are proportional to the square of the total number of posts. Thus, it is not feasible to run the algorithm on the entire dataset, as it would consume a considerable amount of memory and take too long.

We hypothesized that duplicated examples fall within the same cluster because they are very similar, and a qualitative analysis confirmed this hypothesis. The solution to the memory consumption problem relies on this fact: instead of running the duplicate-removal algorithm on the entire dataset, we execute it only over the posts within each cluster. This way, we no longer need memory proportional to the square of the total number of posts, only to the square of the cluster size. Since we created small clusters, as shown in Figure 9, with at most 60 thousand posts, memory consumption is reduced by 1 - (60K^2)/(533K^2) ≈ 98.7%.
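The per-cluster strategy can be sketched as follows. This is a minimal illustration only, not the paper's Algorithm 1: the function names and the distance threshold are hypothetical, and real feature matrices would come from TF-IDF and MobileNetV2.

```python
import numpy as np

def pairwise_cosine_distance(X):
    # X: (n, d) feature matrix; returns the (n, n) cosine-distance matrix.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - X @ X.T

def dedup_within_clusters(features, labels, threshold=0.1):
    # Run duplicate detection independently inside each cluster, so peak
    # memory scales with the largest cluster size squared rather than
    # the full dataset size squared.
    duplicates = set()
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        D = pairwise_cosine_distance(features[idx])
        _, jj = np.where(np.triu(D < threshold, k=1))  # upper triangle only
        duplicates.update(int(idx[j]) for j in jj)     # keep the first of each pair
    return duplicates

# Back-of-the-envelope memory saving: largest cluster ~60K posts
# versus ~533K posts in the whole dataset.
reduction = 1 - (60_000 ** 2) / (533_000 ** 2)  # ≈ 0.987
```

The quadratic distance matrix is never materialized for the whole dataset, only per cluster, which is where the ~98.7% figure above comes from.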

Moreover, we used a method to reduce the dimensionality of the image feature vectors. This has two main advantages: the immediate gain is that mathematical operations over lower-dimensional vectors are faster and require less memory; the second is that projecting the features onto a lower-dimensional space may shrink the noise, yielding a better feature space.

Indeed, the information was not clear in the manuscript. Hence, we have addressed this point by adding more details in Sections 4.3.2. Image Processing and 4.3.3. Image Clustering.

  3. The idea of using public Instagram posts is a good one, although it of course results in a lot of inconsistencies requiring a lot of data cleaning, as the authors also demonstrate. The authors give some reasons for not using Facebook, but what about Twitter? It is mentioned as an option, but not described more than that?

Response: We did not use Twitter because few tweets tagged with #PraCegoVer contain images. In most cases, such tweets merely copy content already posted on Instagram. Also, because of the 280-character limit, users split the text across many tweets, so leveraging these data would require considerably more preprocessing. Thus, we believe the effort to collect and preprocess data from Twitter would not be worthwhile.

To make our decision to collect data only from Instagram clear, we explain this point further in Section 4.1. Data Collection (at line 154).

  4. It does not seem to be possible to open "dataset.json" from Zenodo. Is this because it is only available on request? It would be useful to be able to see the structure of this file. The authors mention that the dataset is going to be available for download from Zenodo, but why with restricted access? Wouldn’t open access (with an appropriate license) be more productive towards the larger goals of the work and its utilization within the scientific community? What about the processing code used in creating the dataset - will that be made available as well? This would probably also be useful to many.

 

  • It does not seem to be possible to open "dataset.json" from Zenodo. Is this because it is only available on request? It would be useful to be able to see the structure of this file.

Response: "dataset.json" cannot be opened inside the Zenodo platform because of its size, 678 MB. As an alternative, users of PraCegoVer can download just a small sample instead of the entire dataset, via the file “sample.tar.gz” (https://zenodo.org/record/5710562/files/sample.tar.gz?download=1). This way, they can see the overall structure of the dataset before downloading it.

  • The authors mention that the dataset is going to be available for download from Zenodo, but why with restricted access? Wouldn’t open access (with an appropriate license) be more productive towards the larger goals of the work and its utilization within the scientific community?

Response: The Brazilian Law No. 13,709 (Portuguese version: http://www.planalto.gov.br/ccivil_03/_ato2015-2018/2018/lei/l13709.htm, English version: https://iapp.org/media/pdf/resource_center/Brazilian_General_Data_Protection_Law.pdf), also known as General Data Protection Law, establishes rules for personal data collecting, storing, handling, and sharing. According to Article 11, Item II(c), the processing of sensitive personal data can occur without consent from the data subject, when it is indispensable for studies carried out by a research entity, whenever possible, ensuring the anonymization of sensitive personal data.

Our dataset contains images of people, and it consists of data collected from public profiles on Instagram. Thus, the images and raw captions might contain sensitive data revealing racial or ethnic origin, sexual orientation, or religious beliefs. Hence, to avoid unintended use of our dataset, we decided to restrict its access, ensuring that it will be used for research purposes only. Still, we will make it available upon request, provided the requester explains the objectives of their research.

To clarify this point, we expanded the explanation in Section 6 Usage Notes, under the questions “6.3.8. Did the individuals in question consent to the collection and use of their data?” and “6.6.2. How will the dataset be distributed?”.

  • What about the processing code used in creating the dataset - will that be made available as well? This would probably also be useful to many.

Response: We have a GitHub repository (https://github.com/larocs/PraCegoVer), and we will make our code available there as soon as our manuscript is published. We believe it can also help others who want to collect data from Instagram for any hashtag following the same protocol. We addressed this point in Section 6 Usage Notes, Question 6.4.3. “Is the software used to preprocess the instances available?”.

  5. The important pre-processing phase could perhaps be described in some more detail. If not in the main text, then as supplementary information?

Response: We have addressed this point by adding more details in Sections 4.3.2. Image Processing and 4.3.3. Image Clustering.

  6. Minor detail: the authors write the hashtag as #PraCegoVer throughout the text, whereas in Figure 2 it is #pracegover. Perhaps the authors could comment on case sensitivity (or lack thereof) of Instagram hashtags?

Response: Instagram treats hashtags as case-insensitive, so it considers #pracegover and #PraCegoVer the same hashtag. In any case, to make our text clearer and avoid doubts, we changed #pracegover to #PraCegoVer in Figure 2.

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper presents an interesting subject. In order to increase the novelty of the paper, the following aspects must be addressed:

  • there is no Related Work section
  • the novelty of the paper is not clearly presented; the proposed approach must be compared against existing results
  • Algorithm 2 is a standard method; there is no need to introduce it
  • results for the image classification are not presented
  • why is the evaluation based only on which words are or are not present in the sentence? nothing is said about the meaning of the resulting sentence
  • add more recent references (e.g., 2020, 2021)

Author Response

1. There is no section of related work. 

Response:  We have addressed this point by including a section of Related Work (Section 2).

2. The novelty of the paper is not clearly presented; the proposed approach must be compared against existing results.

Response: Thank you for pointing this out. The novelty of this work is three-fold: 

  • We introduce the first dataset for the problem of image captioning with captions in Portuguese; 
  • We develop a framework — with the complete pipeline — for collection, preprocessing and analysis of data from a hashtag on Instagram, which is helpful for social media studies (Section 4 Method);
  • We propose an algorithm for identifying and removing duplicate posts (Section 4.2.2. Duplication Clustering, Algorithm 1). 

The entire pipeline will be available in our Github repository (https://github.com/larocs/PraCegoVer) as soon as this paper is published.

To address this point, we highlight our key contributions in Section 1 Summary. Regarding the existing results, we have addressed this point by including a section of Related Work (Section 2).

3. Algorithm 2 is a standard method; there is no need to introduce it.

Response: We agree with the reviewer. Thus, we removed Algorithm 2.

4. Results for the image classification are not presented.

Response: We clustered the images to find out the most relevant classes of images in our dataset. However, clustering is an unsupervised method, thus the images are not human-annotated with the classes. We show the most relevant classes of images from our dataset in Figures 10, 11, 12, 13, 14, and 15. In our paper, we do not work with image classification tasks.

5. Why is the evaluation based only on which words are or are not present in the sentence? Nothing is said about the meaning of the resulting sentence.

Response: To evaluate the quality of the generated captions, we used the well-known metrics in this literature: BLEU [15], ROUGE [16], METEOR [17], and CIDEr-D [12]. Typically, these metrics compare a candidate sentence against a set of reference sentences (ground truth). BLEU, ROUGE, and METEOR are rule-based metrics that rely on n-grams to assign a score to the candidate sentence. In contrast, CIDEr-D is a semantically sensitive metric that computes term frequency-inverse document frequency (TF-IDF) vectors for the candidate and reference sentences, then compares the two TF-IDF vectors using cosine similarity. This way, CIDEr-D can capture differences in meaning between the ground truth and the generated sentence.
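The TF-IDF/cosine comparison at the core of CIDEr-D can be sketched as follows. This is a simplification for illustration only: it uses unigrams and a smoothed IDF, whereas the actual metric works over n-grams up to length 4 with additional weighting and penalties.

```python
import math
from collections import Counter

def tfidf_cosine(candidate, references):
    # Average TF-IDF cosine similarity between a candidate caption
    # and each reference caption, over whitespace-split unigrams.
    docs = [candidate] + references
    tokens = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(w for t in tokens for w in set(t))
    # Smoothed IDF so words shared by all docs keep a nonzero weight.
    idf = {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}

    def vec(t):
        tf = Counter(t)
        return {w: (c / len(t)) * idf[w] for w, c in tf.items()}

    def cos(a, b):
        dot = sum(v * b.get(w, 0.0) for w, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    cand = vec(tokens[0])
    return sum(cos(cand, vec(t)) for t in tokens[1:]) / len(references)
```

An exact match scores 1.0 and a caption sharing no words with the references scores 0.0, with IDF down-weighting common words in between.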

6. Add more recent references (e.g., 2020, 2021).

Response: Thank you for the suggestion; we have addressed this point by including a Related Work section. We added the following recent references: 6, 18, 19, 30, 31, 32, 33, 36, and 37.

Author Response File: Author Response.pdf

Reviewer 3 Report

This manuscript introduces a multi-modal dataset with Portuguese captions based on posts from Instagram. The authors have described the data collection method in detail, and the data quality appears to be technically sound. Overall, the paper is well written.

  • Comment: The authors claim that “… this is the first dataset proposed for the Image Captioning problem with captions in Portuguese.”. Can the authors compare the pros and cons of their method for constructing the dataset against simply translating an existing dataset with English captions?
  • (Line 61) “… and remove post duplication based on visual and textual information to remove instances with similar content.”
    • Comment: rephrase for conciseness.
  • (Line 97) “… incrementally, storing: images …”
    • Comment: check punctuation.
  • (Definitions 1 and 2) Comment: It is not clear how to calculate the distance between two images, or between two captions. Though the authors later mention (at line 163) that cosine distance is used, it remains unclear how to calculate cosine distance for images of different sizes and for captions of different lengths.
  • (Fig. 5 caption) “Similarity graphs considering only image features and considering both visual and textual features. It can be …”
    • Comment: Try just “Similarity graphs. It can be ….” to avoid redundancy.

 

Author Response

1. Comment: The authors claim that “… this is the first dataset proposed for the Image Captioning problem with captions in Portuguese.”. Can the authors compare the pros and cons of their method for constructing the dataset against simply translating an existing dataset with English captions?

Response:

Although simply translating datasets from English to Portuguese is a cheap way to train models to generate captions in Portuguese, the Natural Language Processing literature has already shown that it introduces noise that can harm model performance. In particular, Xue et al., “mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer” (https://aclanthology.org/2021.naacl-main.41/), and Rosa et al., “A cost-benefit analysis of cross-lingual transfer methods” (https://arxiv.org/pdf/2105.06813.pdf), have shown that model performance is hampered when translated datasets are used instead of datasets originally annotated in the target language.

Moreover, even if one simply translated typical datasets from English to Portuguese, the resulting datasets would not be directly comparable to ours. PraCegoVer considers a scenario of data in the wild. Consequently, it has misspellings, an extensive vocabulary with many informal words, since the data are collected from social media, and annotations that do not follow any protocol.

To clarify this point, we added a brief explanation in Section 1 Summary.

2. (Line 61) “… and remove post duplication based on visual and textual information to remove instances with similar content.”. Comment: rephrase for conciseness.

Response: Thank you for the suggestion; we have addressed this point by enumerating our key contributions in Section 1 Summary.

3. (Line 97) “… incrementally, storing: images …”.  Comment: check punctuation.

Response: Thank you for the suggestion; we have changed the sentence (line 167) accordingly.

4. (Definitions 1 and 2) Comment: It is not clear how to calculate the distance between two images, or between two captions. Though the authors later mention (at line 163) that cosine distance is used, it remains unclear how to calculate cosine distance for images of different sizes and for captions of different lengths.

Response

To compute the distance between two captions or between two images, we first extract feature vectors and then compute the cosine distance between such vectors. For captions, we extract the feature vectors using the TF-IDF method. For images, we extract the features using the MobileNetV2 neural network architecture. Indeed, this information was not clear in the text. Hence, we updated Section 4.2.2. Duplication Clustering (line 222) to provide more information on this process.
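The key point is that both modalities are first mapped to fixed-size feature vectors (TF-IDF for captions, MobileNetV2 embeddings for images), after which cosine distance is well defined regardless of the original image size or caption length. A minimal sketch of the distance itself:

```python
import numpy as np

def cosine_distance(u, v):
    # d(u, v) = 1 - (u . v) / (||u|| ||v||); defined for any two vectors
    # of the same dimensionality, which the feature-extraction step
    # guarantees even when the raw images or captions differ in size.
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Parallel vectors are at distance 0 and orthogonal ones at distance 1, independent of vector magnitude.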

5. (Fig. 5 caption) “Similarity graphs considering only image features and considering both visual and textual features. It can be …”. Comment: Try just “Similarity graphs. It can be ….” to avoid redundancy.

Response: Thank you for the suggestion, we have changed the sentence (Fig. 5 caption) accordingly.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Since all my comments were addressed, I recommend publishing the paper.
