A Multi-Annotator Survey of Sub-km Craters on Mars

We present a dataset of nearly 5000 small craters across roughly 1700 km² of the Martian surface, in the MC-11 East quadrangle. The dataset covers twelve 2000-by-2000 pixel Context Camera images, each of which is comprehensively labelled by six annotators, whose results are combined using agglomerative clustering. Crater size-frequency distributions are centrally important to the estimation of planetary surface ages, in lieu of in-situ sampling. Older surfaces are exposed to meteoritic impactors for longer and are thus more densely cratered. However, whilst populations of larger craters are well understood, the processes governing the production and erosion of small (sub-km) craters are more poorly constrained. We argue that, by surveying larger numbers of small craters, the planetary science community can reduce some of the current uncertainties regarding their production and erosion rates. To this end, many have sought to use state-of-the-art object detection techniques based on Deep Learning, which, although powerful, require very large amounts of labelled training data to perform optimally. This survey gives researchers a large dataset for analysing small crater statistics over MC-11 East, and allows them to better train and validate their crater detection algorithms. The collection of these data also demonstrates a multi-annotator method for labelling many small objects, which produces an estimated confidence score for each annotation and annotator.


Introduction
Craters are formed by the impact of meteorites on a planet's surface. The size of a crater is correlated with the energy of the impact event and, thus, the mass of the associated impactor [1]. By measuring the density of craters on a planetary surface (taking into account the flux of impactors and the rate of erosion) we can estimate its age. In practice, measuring crater frequency as a function of size (known as a Crater Size-Frequency Distribution, or CSFD) allows for more precise age estimates than the total count. In CSFDs, we observe a clear inverse power law for larger craters [2], making relative age estimates between regions straightforward. However, absolute age-dating of surfaces demands that several parameters related to both the production and erosion of craters be well constrained. The production rate is primarily affected by impactor flux, and the atmospheric and material characteristics of the planet. The erosion rate is the number of craters of a given size that are destroyed per unit time. This includes processes on Mars such as fluvial and aeolian activity, volcanism, and obliteration via later impacts [3].
It has long been observed that all of these factors are easier to constrain for larger craters: fluxes of larger impactors are easier to measure, and many of the erosion effects (e.g., dust deposition) become negligible, because the craters are so large as to be impervious to them. The primary mechanism for erosion of large craters is in fact their obliteration due to even larger impacts imprinted on top of them. However, CSFDs at smaller size ranges are less applicable to age-dating, due to a less well bounded impactor flux, and erosion effects that are not dominated by a single cause, but are rather a mix of several non-negligible factors [3].
CSFDs are unreliable at smaller diameter ranges, thus it is tempting for scientists performing age-dating studies to simply ignore craters below, say, 1 km diameter. Unfortunately, larger craters are significantly rarer than smaller ones, meaning that, if we wish to measure surface age at higher spatial resolution, then we must rely on the statistics of smaller craters more heavily, as large ones are too sparse to draw reliable conclusions from. This also holds for young surfaces, which have seen relatively few impacts. Therefore, there will always be an interest in extending the range of crater diameters we can reliably use to age-date planetary surfaces. Additionally, small crater statistics could be used to help quantify those processes that are currently poorly constrained. For example, secondary cratering has been shown to be responsible for nearly a quarter of all craters with diameter ≥ 1 km globally on Mars [4], and dramatic examples, such as Zunil crater show rays with huge numbers of secondaries [5]. Clearly, any attempt to disentangle the complex statistics of primary and secondary crater populations must begin with a much larger survey of craters (e.g., [6]).
Manual surveys of Martian craters have successfully mapped all large craters, with a high degree of accuracy [7]. However, given that hundreds of millions of craters exist on the Martian surface, it will be impossible to conduct a manual survey globally, or indeed of any substantial portion of the surface. Eventually, automated detection is likely to lead to the most fruitful results. To this end, Machine Learning models, in particular Convolutional Neural Networks (CNNs), have been the state-of-the-art method for many Computer Vision tasks since 2012 [8]. Specifically, CNNs have been successfully employed for a huge variety of Object Detection applications (e.g., [9,10]). The power of these methods is revealed most spectacularly when the datasets used to train them are very large, because of their ability to generalise and interpolate in a hugely complex input space. These new techniques present a huge opportunity for the automation of regional or global crater surveys. In recent years, Crater Detection Algorithms (CDAs) for larger craters have been widely implemented: for example, [11] used a segmentation network to detect craters between 2 and 32 km in diameter in THEMIS imagery, whilst [12] used Digital Terrain Models as input. For High Resolution Stereo Camera (HRSC) imagery, which at 12.5 m resolution allows for sub-km crater detection, several CDAs have been proposed, either using more traditional Machine Learning algorithms, such as decision trees [13], or CNNs [14].
For Martian cratering, few datasets exist, and those that do (e.g., [7]) often do not include small craters. To date, the biggest openly available dataset of small Martian craters was released in 2012 [15]. This contains 3050 craters from the HRSC [16] over Nanedi Valles between 200 m and 5 km diameter. Our dataset extends and complements this prior effort, as it uses multiple annotators, a different camera system, is located in a different region of Mars, and it expands the diameter range down to tens of metres. Multi-annotator datasets have some precedent in the crater detection field: for example, the Moon Zoo project used several thousand citizen scientists to annotate Lunar craters over the Apollo 17 landing site [17]. The novel metric that we use to evaluate the consistency of our results (Section 3.2) provides a framework for future crowdsourcing efforts in a variety of object detection problems, not restricted to crater counting.

Data Description
The dataset comprises 12 annotated scenes, each 2000-by-2000 pixels in size, from the Context Camera (CTX) [18] aboard the Mars Reconnaissance Orbiter. More specifically, the tiles have been selected in a pseudo-random fashion from the co-registered and orthorectified mosaic over the MC-11 East quadrangle, the creation of which is detailed in [19,20]. These data are co-registered to a baseline Digital Elevation Model that uses HRSC data [21]. The resolution of the data is 6 m, and each image measures 12 km by 12 km. CTX is in a sun-synchronous orbit, meaning that all images are taken at 3 pm local time [22], leading to consistent illumination angles across the dataset. The 12 images cover a diverse set of locations, terrain types, and altitudes, as can be seen in Figure 1.
Each image comes in both png format and as a geo-coded tif image in data/images. Associated with each image is a folder in data/annotations/raw containing 6 PASCAL VOC annotation format xml files, each one from a different annotator. Finally, there is a clustered annotation file, also in PASCAL VOC format [23]. Table 1 breaks down the number of individual annotations, and inter-annotator agreement (as defined in Section 3.2), for each tile. There is a large variation in the number of craters found within each of the 12 images, with the least cratered having 223 individual annotations, and the most having 3954. We found there was a high variance in the number of labels produced by each annotator, with some annotators consistently marking fewer or more craters in all of the images they annotated. Generally, there was a ratio of 2-4 between the most and least numerous surveys on a given tile; this provides a strong argument for a multi-annotator strategy, which reduces the impact of this variance on the accuracy of results.

Table 1. Scene-wise breakdown of annotations per annotator, including the number of labels, and agreement score (as defined in Section 3.2). The variation in agreement score is large, highlighting the difficulty of some of the scenes in comparison to others. For each tile, the set of six annotators (denoted 'i' to 'vi') is different, and the ordering of the columns is arbitrary.

In Table 2, we see information regarding the clustering of results, with average, minimum, and maximum diameters for all images after clustering. During clustering, we first remove any annotations that have a diameter of less than three pixels. Often, these very small craters are resolvable; however, accurate labelling is difficult and may have produced unreliable final results, so they were omitted. Therefore, many of the images have a minimum diameter of 18 m (three pixels), whilst the upper limit varies much more, given the sporadic distribution of larger craters in our images. The median diameters are consistently below the mean diameters, suggesting a distribution skewed towards lower size ranges, which is to be expected given the negative power-law distribution of CSFDs.
Each PASCAL VOC format xml file contains information about the image it is related to, including path information, and its dimensions. It then contains a list of object fields, which denote each crater label. The difficult field is used to denote when an annotator chose to express that they were unsure about a given annotation, whilst the bndbox field gives the boundaries of the crater. No geographical information is provided in the xml, with all dimensions in pixel coordinates. Many of the fields included in the xml file are somewhat redundant (e.g., pose, truncated), but are included to increase compatibility with software which expects PASCAL VOC format data as input. An example of these xml files is given below.
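As an illustration of this format, the sketch below parses a minimal PASCAL VOC annotation; the filename and box coordinates are hypothetical, and the square bounding box width is converted to a crater diameter at CTX's 6 m/pixel.

```python
import xml.etree.ElementTree as ET

# Hypothetical PASCAL VOC annotation with the fields described in the text:
# image metadata, then one <object> per crater label, including the
# 'difficult' flag and the 'bndbox' boundaries in pixel coordinates.
VOC_EXAMPLE = """<annotation>
    <folder>images</folder>
    <filename>tile_example.png</filename>
    <size><width>2000</width><height>2000</height><depth>1</depth></size>
    <object>
        <name>crater</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>341</xmin><ymin>562</ymin>
            <xmax>359</xmax><ymax>580</ymax>
        </bndbox>
    </object>
</annotation>"""

def read_craters(xml_text, metres_per_pixel=6.0):
    """Return (diameter_m, difficult_flag) for each crater label."""
    root = ET.fromstring(xml_text)
    craters = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        xmin = int(box.find("xmin").text)
        xmax = int(box.find("xmax").text)
        difficult = obj.find("difficult").text == "1"
        # Annotations are square, so the box width equals the crater diameter.
        craters.append(((xmax - xmin) * metres_per_pixel, difficult))
    return craters

print(read_craters(VOC_EXAMPLE))  # [(108.0, False)]: one 18-pixel crater
```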
The same format is used for the clustered annotations, held in data/annotations/clustered, except that each object instance also contains an extra 'confidence' value, between 1 and 6, denoting the number of annotators who found that crater. This clustered annotation file is created using the src/cluster_surveys.py script, with parameters set as default.

Collection
Our annotations were conducted by a group of 16 young ORBYTS (Original Research By Young Twinkle Students) [24] participants, at The College of Richard Collyer, Horsham, UK. In order to prepare, the annotators were given roughly 15 hours of seminars on Martian science, Machine Learning, and statistics. In addition to this, we worked collectively to closely define how annotations would be made, what was counted as a crater and what was not, and any possible edge-cases. This was all done in order to ensure high accuracy and as low a variance as possible between the annotators. The tool used for annotation was a customised version of the open-source image annotation tool LabelImg. Our version restricts annotations to equal width and height, as craters are usually most easily defined by a circle rather than an ellipse. We also added a toggle for 'difficult' annotations, which could be triggered when annotators were unsure of their own marking.
After a training session using the software, annotations were intermittently conducted over a period of around eight weeks. Six annotators were selected using a random number generator for each tile and were then separated so as to not be influenced by one another whilst labelling. Once a tile was completed, an annotator would then be assigned another tile randomly from those still to be done. There is some variation in the quantity of annotations made by each annotator due to absences, differences in speed, and variation in the difficulty of the images.
Craters of all sizes that were visible were annotated. The lower diameter limit was determined by the resolution and quality of the CTX images. At the 6 m/pixel of CTX, we found that, in practice, few craters under around 30 m (5 pixels) in diameter were visible. However, given that some craters smaller than this could be resolved, we did not enforce a minimum diameter limit on individual annotations (the clustering process does filter results smaller than 3 pixels in diameter). If craters extended partially beyond the sides of the image, they were ignored, as including them could lead to ambiguity in size and position. Annotators began by labelling larger craters and gradually zoomed in further, panning across the image systematically in order to comprehensively survey all sizes and locations.
Combining individual annotations is done using agglomerative clustering, a technique used since the 1950s for a variety of clustering problems [25]. This has the benefit of only requiring two parameters: one is the function with which a distance is calculated between two annotations, and the other is the cut-off distance at which no more clustering is performed. First, the distances between all pairs of annotations are calculated; then, the closest two are paired, and the new point is calculated as the mean of those which were clustered. This process is repeated until no distances remain below the cut-off. For our application, we fix the distance between any two markings by the same annotator at 1, meaning clusters can never contain more than one entry from a given annotator. For all other pairs, the distance d(L_i, L_j) is defined as 1 - IoU(L_i, L_j), where IoU(L_i, L_j) is the Intersection over Union of the two circles, also known as the Jaccard index (Figure 2).
The cut-off distance threshold was largely decided by visual inspection. Too low a value, and repeat entries for the same craters might be included in the final survey; too large, and multiple craters might be combined erroneously. We found that a value of 0.9 (clustering any pair with IoU > 0.1) was close to optimal, and we therefore used this for all images. Table 2 gives details regarding the results of the agglomerative clustering, and a visual example of the results can be seen in Figure 3.

Table 2. Scene-wise analysis of the clustering process. On average, each crater was found by over three annotators, and has a mean diameter of less than 100 m. Annotations were filtered to disallow labels with D < 18 m (3 pixels), hence the small discrepancy in individual counts between Table 1 and this table. Tiles with fewer craters tend to have high average diameters, suggesting either that small craters are too difficult to see, or that they are preferentially eroded or filled by dust more quickly in these areas, relative to those with many small craters.
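The clustering step can be sketched as follows. This is an illustrative re-implementation under the parameters described (circle-IoU distance, same-annotator distance fixed at 1, cut-off of 0.9), not the released src/cluster_surveys.py script itself.

```python
from itertools import combinations
from math import acos, hypot, pi, sqrt

def circle_iou(a, b):
    """Intersection over Union (Jaccard index) of two circles (x, y, r)."""
    (x1, y1, r1), (x2, y2, r2) = a, b
    d = hypot(x2 - x1, y2 - y1)
    if d >= r1 + r2:                      # disjoint circles
        inter = 0.0
    elif d <= abs(r1 - r2):               # one circle inside the other
        inter = pi * min(r1, r2) ** 2
    else:                                 # lens-shaped partial overlap
        p1 = r1 * r1 * acos((d * d + r1 * r1 - r2 * r2) / (2 * d * r1))
        p2 = r2 * r2 * acos((d * d + r2 * r2 - r1 * r1) / (2 * d * r2))
        p3 = 0.5 * sqrt((-d + r1 + r2) * (d + r1 - r2)
                        * (d - r1 + r2) * (d + r1 + r2))
        inter = p1 + p2 - p3
    union = pi * r1 * r1 + pi * r2 * r2 - inter
    return inter / union

def cluster(labels, cutoff=0.9):
    """Agglomerative clustering of (annotator, x, y, r) labels.

    Clusters sharing an annotator are kept at distance 1, so a cluster can
    never contain two markings by the same person. Merged clusters are
    represented by the mean circle of their members.
    """
    clusters = [[lab] for lab in labels]

    def centre(c):
        n = len(c)
        return (sum(l[1] for l in c) / n, sum(l[2] for l in c) / n,
                sum(l[3] for l in c) / n)

    def dist(a, b):
        if {l[0] for l in a} & {l[0] for l in b}:
            return 1.0
        return 1.0 - circle_iou(centre(a), centre(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters; stop once none is below cut-off.
        (i, j), d = min(
            (((i, j), dist(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1])
        if d >= cutoff:
            break
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters
```

For example, two near-coincident markings by annotators A and B merge into one cluster, whilst a duplicate marking by A stays separate, as does a distant crater found only by C.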

Validation
We use statistical measures that quantify the amount of agreement between our annotators, and also between our annotators and an experienced crater counter, in order to validate the accuracy of our dataset. It should be noted that these measures of agreement do not account for systematic errors (bias), but they do tell us about variance to some extent. Ultimately, crater counting is an ambiguous pursuit, especially at smaller diameter ranges, because of the limited resolution of our instruments. To calculate each annotator's agreement score on a given image, we compare their labels against the combined labels of all the other annotators: letting the ith label from annotator n be denoted L_i^n, the agreement score for annotator n is the percentage of their labels whose cluster contains a label from at least one other annotator. This score rests on the assumption that, for any given annotator, the combined labels of all other annotators will be a more complete and reliable survey. However, this assumption becomes less reliable at extremely low numbers of annotators. The score also cannot account for craters that were not found by any other annotator; however, it does describe the consistency of detection within the group of annotators, and it tells us about the positional accuracy of markings. On top of this quantitative validation, all images were scanned visually to look for any obviously spurious annotations, which were then removed from the dataset. These likely came about from accidental inputs and were easy to spot, although they were not common. Table 3 outlines some spatial statistics for the annotations, which detail the level of positional agreement between individual annotators. Overall, the positional agreement of clustered annotations is high, with a standard deviation in measured diameter of around 15%, and central locations varying with a standard deviation of less than a pixel.

Table 3. Positional accuracy of annotations, by number of annotations per crater. Larger craters tend to be annotated by more people, leading to a higher mean diameter for craters found by more annotators.
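A minimal sketch of this agreement score, under the assumption that a label counts as corroborated when its cluster contains a label from at least one other annotator; the cluster structure here (lists of labels whose first element is the annotator name) is a simplified illustration, not the dataset's actual data layout.

```python
def agreement_scores(clusters):
    """Per-annotator agreement score (%) from clustered labels.

    `clusters` is a list of clusters, each a list of labels whose first
    element names the annotator. A label is corroborated when its cluster
    also contains a label from at least one other annotator; an annotator's
    score is the percentage of their labels that are corroborated.
    """
    total, matched = {}, {}
    for cluster in clusters:
        annotators = {label[0] for label in cluster}
        for label in cluster:
            name = label[0]
            total[name] = total.get(name, 0) + 1
            if len(annotators) > 1:  # someone else found this crater too
                matched[name] = matched.get(name, 0) + 1
    return {name: 100.0 * matched.get(name, 0) / total[name]
            for name in total}

# Annotator A found both craters but only one was corroborated by B,
# so A scores 50% and B scores 100%.
print(agreement_scores([[("A",), ("B",)], [("A",)]]))
```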
The standard deviation of these annotations, in comparison to the final clustered value, is around 14-18% overall. The accuracy of the craters' centres is high, with a standard deviation consistently below one pixel (6 m). We conducted a simple comparison exercise with an experienced crater counter to further measure the reliability of the multi-annotator labels. The 'expert' annotator labelled craters in 4 randomly selected tiles from the dataset (D, F, K, and L). Subsequently, the agreement score of these expert labels was calculated against the set of other annotators. As we can see from Table 4, the expert achieved scores that exceeded the average inter-annotator score, and in the case of tiles D and F exceeded those of all the other annotators. Given this, and the fact that the number of labels from the expert in each tile was always within the non-experts' range, we see that the expert's annotations are somewhat more reliable than an individual annotator's, but also that the agreement between the expert and the entire body of non-experts is high. This suggests that using six non-experts can provide results that match those of an expert considerably more closely than those of an individual non-expert.

Table 4. Results for the experiment using expert annotations on a subset of the dataset. In general, the expert annotator scores highly in comparison to the other annotators, and has a total count similar to the mean of the non-experts for each tile. This shows that, when considered as a whole, the non-experts' annotations have a high agreement with an expert's, suggesting that multi-annotator methods can boost the performance of non-experts for Object Detection labelling tasks.

The agreement score and positional accuracy measures give us a way to validate the consistency of individual annotations. However, they do not provide insight into whether the emergent statistics of our crater surveys are reliable.
For this, it is helpful to consider the cumulative CSFD of our annotations (Figure 4). Cumulative CSFDs are generally defined as the number of craters above a given diameter per km², and are commonly used in age-dating work [26]. Our data hold to the expected inverse power law found in cumulative CSFDs, showing a broadly constant gradient above 50 m diameters. Although no ground truth exists by which to verify our results, we believe that the agreement score, positional accuracy metrics, comparison with an expert, and the cumulative CSFDs all suggest that the annotations comprising our dataset are reliable both individually and in aggregate.

Figure 4. Cumulative CSFDs of the whole dataset. As expected, there are more unclustered annotations than clustered. At small diameter ranges (<50 m), we see that the gradient tails off, which is an expected symptom of the limited resolution of our data. At the other end of the size range, the statistics become more unreliable, as the number of craters becomes very low. In relative terms, there are more clusters with single annotations (craters with low confidence) at smaller diameter ranges, whilst most of the large craters were found by five or six annotators, which is to be expected, as they are much more visible.
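The cumulative CSFD definition above can be computed in a few lines. The diameters below are hypothetical values over a single 12 km by 12 km CTX tile (144 km²).

```python
def cumulative_csfd(diameters_m, area_km2, bin_edges_m):
    """Cumulative crater size-frequency distribution.

    For each bin edge D, returns the number of craters with diameter >= D,
    normalised per square kilometre of surveyed area.
    """
    return [sum(1 for d in diameters_m if d >= edge) / area_km2
            for edge in bin_edges_m]

# Hypothetical crater diameters (metres) for one 144 km^2 tile.
diameters = [20, 35, 50, 50, 120, 400]
density = cumulative_csfd(diameters, area_km2=144.0,
                          bin_edges_m=[18, 50, 100])
print(density)  # [6/144, 4/144, 2/144]: non-increasing, as a CSFD must be
```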

Discussion
This small crater survey offers a valuable training and validation set for researchers aiming to develop Crater Detection Algorithms. In the specific case of Deep Learning, we believe that a combination of this with other datasets (e.g., [7,15]) may provide enough data at many scales, locations and with different instruments to create a highly generalised CDA, which is robust to changes in surface features, noise, crater morphology, and crater size.
Many previous crater surveys have focused on annotating larger diameter (>1 km) craters, which brings with it different challenges. The 2014 global survey by Robbins et al. [7] used both Digital Elevation Models and infrared images, as opposed to high-resolution images. Resolving these larger craters was likely less difficult; however, the extreme erosion of some large craters, and their tendency to overlap with one another more frequently, posed difficulties that are less pronounced in our size range.
The multi-annotator approach we have used is scalable, and it can be adapted to many other Object Detection tasks, both within and beyond Remote Sensing. We believe a more nuanced approach to training and validating models, which accounts for confidence, is particularly useful for problems where ground-truth data are unreliable because of issues such as noise and limited resolution. Framing an ambiguous problem, such as small crater detection, as a binary one leads to model performance being measured in a way that does not reflect real-world capabilities.
We suggest, based on our experience in this project, that the extended training and seminar series conducted prior to labelling resulted in annotations with higher accuracy. In addition, the sense of genuine scientific collaboration (as opposed to labouring at a task one does not hold a stake in) between the ORBYTS students and academic team was a strong motivator that helped annotators to stay focused and productive during the labelling sessions. Future efforts that use this hybrid crowdsourcing approach should continue to optimise the balance between speed (fewer repeated annotations and less training) and accuracy (more repeated annotations and more training) on a case-by-case basis. For small craters, we believe that the community would benefit from future work that samples areas from across different parts of the planet, and provides richer annotations of physical crater characteristics where possible (e.g., estimates of crater erosion, ellipticity, depth, etc.).

User Notes
For access to the dataset, and associated python code to cluster, display and analyse the data, please visit: https://doi.org/10.5281/zenodo.3946647.

Acknowledgments: The authors would like to thank those involved in the planning and administration of the ORBYTS scheme: William Dunn, Jonathan Holdship, Lucinda Offer, Marcell Tessenyi and Maria Niculescu-Duvaz, among others. We would also like to acknowledge the resources, space and support provided by The College of Richard Collyer, in particular by Matthew Horncastle and James Waller, whose enthusiasm and organisation helped the project tremendously.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: