Open Access | Article | Post Publication Peer Review | Version 2, Approved

Learning to Describe: A New Approach to Computer Vision Based Ancient Coin Analysis (Version 2, Approved)

School of Computer Science, University of St Andrews, Jack Cole Building, North Haugh, St Andrews KY16 9SX, UK
Authors to whom correspondence should be addressed.
Received: 22 February 2020 / Accepted: 24 February 2020 / Published: 17 April 2020
(This article belongs to the Special Issue Machine Learning and Vision for Cultural Heritage)
Peer review status: 2nd round review

Reviewer 1: Sinem Aslan, Ca' Foscari University of Venice
Reviewer 2: Mihai Ciuc, University Politehnica of Bucharest, Image Processing and Analysis Laboratory, Bucuresti
Version 2, Approved
Published: 17 April 2020
DOI: 10.3390/sci2020027

Version 1, Original
Published: 2 March 2020
DOI: 10.3390/sci2010008
In recent years, a range of problems under the broad umbrella of computer vision based analysis of ancient coins has been attracting an increasing amount of attention. Notwithstanding this research effort, the results achieved by the state of the art in the published literature remain poor and fall far short of what any practical purpose requires. In this paper we present a series of contributions which we believe will benefit the interested community. We explain that the approach of visual matching of coins, universally adopted in existing published papers on the topic, is not of practical interest because the number of ancient coin types exceeds by far the number of those types which have been imaged, be it in digital form (e.g., online) or otherwise (traditional film, in print, etc.). Rather, we argue that the focus should be on understanding the semantic content of coins. Hence, we describe a novel approach: first, to extract semantic concepts from real-world multimodal input and associate them with their corresponding coin images, and then to train a convolutional neural network to learn the appearance of these concepts. On a real-world data set, we demonstrate highly promising results, correctly identifying a range of visual elements on unseen coins with up to 84% accuracy.
Keywords: numismatics; Roman; Rome; deep learning; computer vision
MDPI and ACS Style

Cooper, J.; Arandjelović, O. Learning to Describe: A New Approach to Computer Vision Based Ancient Coin Analysis. Sci 2020, 2, 27.



Reviewer 1

Sent on 22 Mar 2020 by Sinem Aslan | Approved with revisions
Ca' Foscari University of Venice

A common approach in the literature on ancient coin classification has been to recognise the authorised issuers of coins using various image representations and classification algorithms. In contrast, this paper aims to recognise the semantic elements of the motifs in images of the reverse sides of ancient coins. More specifically, semantic class names are derived from the textual descriptions of coins, and the corresponding visual representations are explored in the coin images. While the proposed approach is quite interesting, the presentation of the experimental design was not sufficiently clear to me. Moreover, the same problem was tackled in a previous publication [1] by the same authors, and I could not identify any methodological or experimental extension in the present work over the previous one.

My specific comments are as follows:

  • The first paragraph of the Introduction states that the present work extends the previous work [1] of the same authors. However, I could not identify any extension, either experimental or methodological; the two works seem extremely similar. The only difference I could detect is the visualisation of the learned filters in Figs. 12 and 13 of the current paper. If there are extensions that I did not recognise, could the authors list them in the corresponding paragraph of the Introduction?
  • It seems that the same neural network architecture has already been proposed and used for coin classification in a previous paper by the same authors [Schlag I, Arandjelovic O. Ancient Roman coin recognition in the wild using deep learning based recognition of artistically depicted face profiles. In ICCVW 2017 (pp. 2898-2906).]. It would be good to see that work cited where the framework is explained (Section 3, Proposed Framework), with a mention of any differences in the current approach.
  • Semantic labels are chosen based on the most frequent terms in the textual descriptions of coins. I could not tell from the manuscript why the selection is limited to five classes. A histogram of the frequency of all terms in the textual descriptions would help to clarify this point. What was the initial size of the overall dataset? After the images for the chosen semantic classes were selected, were the remaining images discarded, or were they used as the pool from which negative examples were chosen?
  • The specification of the overall dataset used in the experiments is not clear to me. I saw that the Horse, Cornucopia, Patera, Eagle and Shield classes have around 18K, 14K, 5K, 14K and 18K images respectively. However, these visual elements can appear together on the same image (e.g., a patera and a cornucopia both appear on the image in Fig. 6 (row 2, col. 2)). So: (1) what is the overall size of the dataset? Is it around 69K, or do the sets of 18K, 14K, 5K, 14K and 18K images intersect because of co-occurring elements, so that the total is less than 69K? (2) From which set are the negative examples chosen?
  • From Table 10, I understand that training is done separately for each of the five image sets. It is not clear to me whether training uses 2-class or 5-class labels. It appears to be a 2-class classification problem, because positive and negative examples are mentioned several times in the experiments, but then how are the negative examples chosen? Could this be stated more explicitly in the manuscript?
  • The caption of Figure 11 states that the identified salient regions correspond to a cornucopia, a patera, and a shield, respectively. The last one should probably be an eagle, not a shield. Could the authors also give an example visualisation for a shield?
  • Figs. 12 and 13 do not seem useful to me, because it is not possible to discern any difference between them.

Response to Reviewer 1

Sent on 11 Jul 2020 by Jessica Cooper, Ognjen Arandjelovic

-- Thank you for your comments. We will update the manuscript to clarify these points. Please see below for detailed responses and directions to relevant sections of the manuscript.

-- Our submission is an extension of our previous conference paper and contains more theoretical content underpinning the algorithm, further experiments including more detailed exploration of the dataset and a more in-depth analysis and discussion of findings and their relevance.

-- Our network architecture is different from the one you mention. That model used five convolution blocks, each consisting of “two sets of convolutional layers, batch normalization, and rectified linear unit activation… The final architecture is made up of five consecutive convolutional blocks and max-pooling pairs. The number of filters is doubled after every pooling layer with the exception of the last layer.” We instead use an architecture closer to AlexNet, as noted in our submission. We do not use blocks or apply batch normalization, nor do we double the number of filters after each pooling layer. Indeed, we do not use pairs of convolutional blocks and max pooling either. The kernel sizes for each operation are not explicitly given in the paper you mention, but they are likely to be smaller than ours, since the authors reference Simonyan and Zisserman, “who demonstrated that a carefully crafted network built using few small (3×3), stacked kernels is superior to one comprising bigger kernels in terms of describability and computational cost”. In contrast, we use larger kernels because we found that they gave better performance.

-- A histogram depicting the frequency of all terms is infeasible - there are many thousands of possible terms as evident in the sample attributions in Figure 3. We limit our work to the most frequent five terms for reasons of time cost and feasibility.
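The term-frequency selection discussed above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the tokenisation, the `STOP_WORDS` list, the function name `top_terms`, and the sample descriptions are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the source does not say which words were excluded.
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "in", "with", "to",
              "left", "right", "holding", "standing"}


def top_terms(descriptions, k=5):
    """Return the k terms that occur in the largest number of descriptions."""
    counts = Counter()
    for text in descriptions:
        tokens = re.findall(r"[a-z]+", text.lower())
        # Count each term at most once per description (document frequency).
        counts.update(set(tokens) - STOP_WORDS)
    return counts.most_common(k)


# Invented sample descriptions, for illustration only.
descriptions = [
    "Eagle standing on shield, holding cornucopia",
    "Horse galloping right",
    "Eagle with patera and cornucopia",
    "Horse and shield",
    "Eagle on shield",
]

for term, freq in top_terms(descriptions, k=2):
    print(term, freq)
```

Keeping only the top `k` entries of the counter avoids plotting the long tail of thousands of rare terms while still showing how the selected elements compare in frequency.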

-- Section 4: “our data comprised 100,000 images and their associated textual descriptions.”

-- Binary labelling for each element is correct. Please see section 2.2.3: “We shuffle the samples before building training, validation and test sets for each of the selected elements. To address under-representation of positive examples, we use stratified sampling to ensure equal class representation [18], matching the number of positive samples for each class with randomly selected negative samples (and thereby doubling the size of the dataset for each element). This provides us with datasets for each element of the following sizes: ‘horse’: 17,978; ‘cornucopia’: 13,956; ‘patera’: 5,330; ‘eagle’: 14,028; ‘shield’: 17,546 each of which we split with a ratio of 70% training set, 15% validation set and 15% test set.”
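The balancing and splitting procedure quoted above might look roughly like the following. It is a sketch under stated assumptions rather than the authors' code: the function name `build_balanced_splits`, the fixed seed, and the toy labels are hypothetical, and the splits here are cut from one shuffled list rather than stratified individually.

```python
import random


def build_balanced_splits(labels, seed=0, ratios=(0.70, 0.15, 0.15)):
    """Build a balanced binary dataset for one element and split it.

    labels maps an image id to True if the element appears on that coin.
    Returns (train, val, test), each a list of (image_id, label) pairs.
    """
    rng = random.Random(seed)
    positives = [i for i, present in labels.items() if present]
    negatives = [i for i, present in labels.items() if not present]
    if len(negatives) < len(positives):
        raise ValueError("not enough negative examples to match the positives")

    # Match the positives with an equal number of randomly chosen negatives,
    # which doubles the size of the per-element dataset.
    sampled_negatives = rng.sample(negatives, len(positives))
    samples = [(i, 1) for i in positives] + [(i, 0) for i in sampled_negatives]
    rng.shuffle(samples)

    n = len(samples)
    n_train = round(ratios[0] * n)
    n_val = round(ratios[1] * n)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy data: 1,000 images, of which the first 100 show the element.
labels = {image_id: image_id < 100 for image_id in range(1000)}
train, val, test = build_balanced_splits(labels)
print(len(train), len(val), len(test))
```

Under this reading, each per-element dataset contains exactly as many negatives as positives, so the sizes quoted in the response would each be twice the number of positive images for that element.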

-- Thank you, we will change this.

-- The difference is subtle, but do you not see a leaning towards recognising diagonal edges, as one would expect given the shape of the cornucopia in the first, and more small curved shapes in the second? We will investigate visualisation of the filters of deeper layers.

Reviewer 2

Sent on 26 Mar 2020 by Mihai Ciuc | Approved
University Politehnica of Bucharest, Image Processing and Analysis Laboratory, Bucuresti

The paper presents a method to detect the presence of common elements on ancient coins (horse, shield etc.) using a convolutional neural network. The problem is rendered extremely complicated by the fact that annotations that are used in training (which have been made by professional coin dealers) are unstructured.


The paper is very well written, and the results obtained are remarkable.

Response to Reviewer 2

Sent on 11 Jul 2020 by Jessica Cooper, Ognjen Arandjelovic

Thank you!
