Learning to Describe: A New Approach to Computer Vision Based Ancient Coin Analysis (Version 2, Approved)
Reviewer 1: Sinem Aslan, Ca' Foscari University of Venice. Reviewer 2: Mihai Ciuc, University Politehnica of Bucharest, Image Processing and Analysis Laboratory, Bucharest.
Approved with revisions
Cooper, J.; Arandjelović, O. Learning to Describe: A New Approach to Computer Vision Based Ancient Coin Analysis. Sci 2020, 2, 27.
Reviewer 1 (Ca' Foscari University of Venice)
A common approach in the literature on ancient coin classification has been to recognise the authorised issuers of coins using various image representations and classification algorithms. In contrast to this common approach, this paper aims to recognise the semantic elements in the motifs on images of the reverse sides of ancient coins. More specifically, semantic class names are derived from textual descriptions of coins, and the corresponding visual representations are sought in the coin images. While the proposed approach is quite interesting, the presentation of the experimental design was not sufficiently clear to me. Moreover, although the same problem is tackled in a previous publication by the same authors, I could not identify any methodological or experimental extension of the present work over the previous one.
My notes are as follows:
- It is mentioned in the first paragraph of the Introduction that the present work extends a previous work by the same authors. However, I could not identify any extension, either experimental or methodological; the two works seem extremely similar to each other. The only difference I could detect is the visualisations of the learned filters in Figs. 12 and 13 of the current paper. If the work has been extended in aspects that I could not recognise, could the authors list them in the corresponding paragraph of the Introduction?
- It seems that the same neural network architecture has already been proposed and used for coin classification in a previous paper by the same authors [Schlag I, Arandjelovic O. Ancient Roman coin recognition in the wild using deep learning based recognition of artistically depicted face profiles. In ICCVW 2017 (pp. 2898-2906).] It would be good to cite that work where the framework is explained (Section 3, Proposed Framework), with a mention of any differences in the current approach.
- The semantic labels are chosen based on the most frequent terms in the textual descriptions of the coins. I could not determine from the manuscript why they are limited to five classes. A histogram depicting the frequency of all terms in the textual descriptions would help clarify this point. What was the initial size of the overall dataset, and, after selecting the images related to the chosen semantic classes, were the remaining images discarded or were they used as a source of negative examples?
- The specifications of the overall dataset used in the experiments are not clear to me. I saw that the Horse, Cornucopia, Patera, Eagle and Shield classes have around 18K, 14K, 5K, 14K and 18K images, respectively. However, these visual elements can appear together on the same image (e.g. a patera and a cornucopia both appear in the image in Fig. 6, row 2, col. 2). Then: (1) what is the overall size of the dataset (around 69K?), or do the sets of 18K, 14K, 5K, 14K and 18K images intersect because of co-occurring elements (i.e. is the total less than 69K)? (2) From which set are the negative examples chosen?
- From Table 10, it is understood that training is done separately for each of the five image sets. I could not determine whether training uses 2-class or 5-class labels (in some sense it appears to be a 2-class classification problem, since it is mentioned several times that positive and negative examples are used in the experiments; but then, how are the negative classes decided?). Could this be stated more clearly in the manuscript?
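To make this question concrete, my reading of the setup is a one-vs-rest binary labelling, which can be sketched as follows (a minimal illustration only; the five class names come from the paper, but the helper function and example descriptions are invented for this sketch and do not reflect the authors' actual pipeline):

```python
# One-vs-rest labelling sketch: for each semantic element, a coin whose
# description mentions the term is a positive example and every other coin
# is a negative. A coin can thus be positive in several binary problems,
# which is one way the per-class image sets could intersect.

SEMANTIC_CLASSES = ["horse", "cornucopia", "patera", "eagle", "shield"]

def binary_labels(descriptions, target_class):
    """Return 1 for coins whose description mentions target_class, else 0."""
    return [1 if target_class in desc.lower().split() else 0
            for desc in descriptions]

descriptions = [
    "emperor on horse rearing right",
    "fortuna standing left holding patera and cornucopia",
    "eagle standing on thunderbolt",
]

# One independent 2-class problem per semantic element.
labels = {c: binary_labels(descriptions, c) for c in SEMANTIC_CLASSES}
# e.g. labels["patera"] == [0, 1, 0] and labels["cornucopia"] == [0, 1, 0]:
# the second coin is a positive example in both binary problems.
```

Under this reading, the open question in my comment above is whether the negatives for each binary problem are drawn from the other four class sets, from the discarded remainder of the dataset, or from both.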
- In the caption of Figure 11, it is written that the identified salient regions correspond to a cornucopia, a patera, and a shield, respectively. The last should probably be an eagle, not a shield. Could the authors also give an example visualisation for the shield class?
- Figs. 12 and 13 do not seem useful to me, because it is not possible to discern any difference between them.
Response to Reviewer 1 (sent on 11 Jul 2020 by Jessica Cooper, Ognjen Arandjelovic)
Reviewer 2 (University Politehnica of Bucharest, Image Processing and Analysis Laboratory, Bucharest)
The paper presents a method to detect the presence of common elements on ancient coins (horse, shield, etc.) using a convolutional neural network. The problem is rendered extremely complicated by the fact that the annotations used in training, produced by professional coin dealers, are unstructured.
The paper is very well written, and the results obtained are remarkable.