A Granularity-Based Intelligent Tutoring System for Zooarchaeology

This paper presents a tutoring system that uses three different granularities to help students classify animals from bone fragments in zooarchaeology. The 3406 bone remains, each described by 64 attributes, were obtained from the excavation of the Middle Palaeolithic site of El Salt (Alicante, Spain). The coarse granularity performs a five-class prediction, the medium a twelve-class prediction, and the fine a fifteen-class prediction. In the coarse granularity, the results show that the 10 most relevant attributes for classification are width, bone, thickness, length, bone fragment, anatomical group, long bone circumference, X, Y, and Z. Based on those results, a user-friendly interface for the tutor was built to train archaeology students to classify new remains using the coarse granularity. A pilot was carried out during the 2019 excavation season at Abric del Pastor (Alicante, Spain), where students used the automatic tutoring system to classify 51 new remains. The pilot demonstrated the usefulness of the tutoring system both for students facing their first classification activities and for more senior archaeologists, since it gives them valuable clues for difficult classification problems.


Introduction
The use of technology opens new frontiers in learning and improves data mining from different sources in order to improve students' learning processes [1]. One of the biggest challenges of including communication technologies in learning is the way in which interaction between teachers and students is simulated by automatic methods [2,3]. In this context, tutoring and the possibility of turning an automatic system into an effective instrument for counseling students are stirring a great amount of

Materials and Methods
This section is divided into materials, which describes the dataset used in the study, and methods, which describes the methodology followed to build and evaluate the tutoring system.

Materials
The dataset comprised 3406 instances of archaeological remains and 64 attributes (the last one being the predicted class) extracted from the faunal assemblage of Stratigraphic Unit Xa of El Salt [25][26][27]. The excavation at El Salt can be seen in Figure 1, and examples of the archaeological remains from El Salt can be seen in Figures 2 and 3.
El Salt is one of the fundamental sites of the Middle Palaeolithic in the western Mediterranean region, owing to the significance of its archaeological sequence, its extraordinary state of conservation, which extends even to organic matter, and the integrated and multidisciplinary nature of the research carried out there, which brings together different Spanish, European, and North American universities and research centers. The site is located at the head of the Serpis River, in the town of Alcoy (Alicante, Spain). It is an open-air site of about 300 m² at the foot of a large wall that rises up to 38 m. The inhabited space at the foot of the wall was protected by a large roof which, at its maximum development, sheltered almost the entire surface. Its location is strategic: it lies amid various biotopes (plain, mountains, river valley, lake-palm environment, etc.) within a territory of the Alicante mountains that is very rich in diversified resources. Research at El Salt seeks to deepen the knowledge of its Palaeolithic record from an integrative and multidisciplinary perspective. This record keeps growing thanks to the application of increasingly sophisticated excavation techniques and high-resolution analytical procedures. The traditional material record of this type of deposit, consisting of lithic remains, fauna, anthracological material, etc., is now being supplemented by a microscopic and even molecular record that is decisively enriching the information and improving its quality. These facts make it possible to obtain a complete and accurate zooarchaeological dataset, which is a key step in training classification algorithms such as the ones needed to develop the ITS presented here. Figure 2 shows examples of taphonomic damage and of the bone remains recovered. Figure 3 shows examples of biostratinomic damage.
Table 1 shows a descriptive analysis of the sample. The table has two columns. The first lists the most relevant parameters (according to the classification algorithm with the best performance in the coarse, medium, and fine granularities, which will be explained in the following section) and the predicted class 'Family' (the taxonomic rank between Order and Genus) in the coarse, medium, and fine granularities. Following the name of each parameter, we have included a description of the possible values that it can take. For instance, "Bone" can take the listed 93 unique values, while the anatomical group can take only 11 values. It is worth noting that the predicted class can take five values in the coarse granularity, 12 in the medium granularity, and 15 in the fine granularity. The second column of the table briefly describes the attribute. If the attribute is numeric, count (number of non-null values), mean, standard deviation (std), minimum (min), first quartile (Q1), second quartile (Q2), third quartile (Q3), and maximum (max) are given. If the attribute is categorical, count (number of non-null values), unique (number of categories), top (most frequent category), and frequency (freq, the frequency of the top category) are given. For the family classes in each granularity, the number of instances in each category is also given because it is relevant for the classifier that will be explained in Section 2.2.
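The two kinds of summary described for Table 1 can be illustrated with a few lines of standard-library Python. This is only a sketch: the attribute values are invented, the helper names `describe_numeric` and `describe_categorical` are hypothetical, and the quartiles assume linear interpolation between data points, which may differ from the convention used for Table 1.

```python
from collections import Counter
from statistics import mean, quantiles, stdev


def describe_numeric(values):
    """Summarize a numeric attribute, ignoring null (None) entries."""
    data = sorted(v for v in values if v is not None)
    # The 'inclusive' method interpolates linearly between observed values.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": mean(data),
        "std": stdev(data),  # sample standard deviation
        "min": data[0],
        "Q1": q1,
        "Q2": q2,
        "Q3": q3,
        "max": data[-1],
    }


def describe_categorical(values):
    """Summarize a categorical attribute, ignoring null (None) entries."""
    data = [v for v in values if v is not None]
    counts = Counter(data)
    top, freq = counts.most_common(1)[0]
    return {"count": len(data), "unique": len(counts), "top": top, "freq": freq}


# Invented example values; they are NOT taken from the El Salt dataset.
width_mm = [10.0, 12.0, None, 14.0, 16.0, 18.0]
anatomical_group = ["limb", "axial", "limb", None, "cranial", "limb"]

numeric_summary = describe_numeric(width_mm)
categorical_summary = describe_categorical(anatomical_group)
print(numeric_summary)
print(categorical_summary)
```

The field names mirror those listed for Table 1; the same fields are produced by the `describe` method of pandas, which would be a natural choice for a dataset of this size.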

Methods
We used 33% of the remains for testing and 67% for training. A pipeline combining a standard scaler with an over-sampling method, either SMOTE (synthetic minority over-sampling technique) or ADASYN (adaptive synthetic sampling), was used together with 10-fold cross-validation. The metric that we want to maximize is f1-score_macro. We chose it because the F1-score is the harmonic mean of precision and recall, and the macro option calculates the metric for each label and takes the unweighted mean, so that label imbalance is not taken into account and every class weighs the same regardless of its size. The algorithms used, with their parameters, were:
• k-nearest neighbors (KNN) with parameters:
- classifier__n_neighbors (number of neighbors to use): [3,5,7,9]
More information about the algorithms can be found in the literature [28]. These algorithms were implemented using the Python library scikit-learn, and the ITS was implemented using Flask.
Regarding the criteria used to distinguish the different taphonomic alterations, zooarchaeological and taphonomic analyses were performed using established standard methods [29,30]. Faunal remains were taxonomically and anatomically identified whenever possible, while non-identified specimens with insufficient information were classified into three bone categories (long, flat, or articular) and associated with a weight-size category based on bone density, circumference, and thickness of the cortical surface: large-sized (>300 kg), medium-sized (100-300 kg), small-sized (5-100 kg), and very small-sized (<5 kg) [31][32][33]. In addition, all remains were analysed taphonomically using macroscopic and microscopic techniques to identify biostratinomic and diagenetic modifications. In the fracture analysis, we classified all of the fragments by fracture type (recent, old fresh, old dry, or indeterminate) following the criteria established by [34] and the morphotypes created by Real [35,36]. All dimensions were measured for each bone fragment. Bone surface modifications were also observed and quantified to identify damage caused by anthropogenic activity (percussion, butchering marks, or thermal alteration) or predator action (tooth marks or digestion), as well as the diverse diagenetic processes that alter bone surfaces, such as erosion, sediment concretion, root marks, weathering, pigmentation, and trampling [37][38][39][40][41][42][43][44]. Damage caused after the prey's death and before its deposition and sedimentation (the biostratinomic process) was classified and analysed individually in a specific database according to the type of damage, origin, agent, location, morphology, distribution, direction, intensity, quantity, and dimensions. Diagenetic processes were recorded based on their presence and the degree of alteration. In the specific case of butchering marks, different activities were established based on the ethnoarchaeological literature [37][45][46][47].
The questionnaire used to assess the students' experience with the tutor in the pilot of the El Salt 2019 season partially followed the methodology explained in [48]. The following areas were tested:

1. Constructive/active learning: The tutor stimulated us to understand underlying mechanisms/theories.
2. Self-directed learning: The tutor stimulated us to search for various resources by ourselves.
3. Contextual learning: The tutor stimulated us to apply knowledge to the discussed problem.
4. Global score: Overall performance of the tutor.
5. Open answer: Give some tips for improvement.

A rating between 1 (strongly disagree) and 5 (strongly agree) was given to questions 1-4.

Results
Results are divided into the experiments for the prediction of animals in the coarse, medium, and fine granularities, and the development and use of the ITS. The performance of the different methods in the coarse granularity can be seen in Table 2, and the confusion matrices of the coarse, medium, and fine granularities using the random forest method can be seen in Figures 4-6. In the fine granularity (Figure 6), classes with fewer than three instances do not appear because a stratified split between training and test sets was applied.
In the three granularities, the 10 most relevant attributes for the classification were (in this order):
The best configuration parameters for the random forest were:
Figure 7 shows the interface of the ITS implemented for use during the excavation season. As can be seen, the first region of the interface is where the user identifies themselves. Right below this first section, the user fills in the different characteristics and parameters obtained from the bone fragment. The ITS also gives information about the available options for a parameter when applicable. This is very important because the ITS could be used by archaeologists from different sites, and we must be sure that they agree on the format of the data.
Figure 8 shows the output screen of the ITS right after a user has finished characterizing a fragment and asks for a prediction. The prediction is shown at the top of the interface. As soon as the prediction is made, the ITS is ready for another one, which is why the fields below the prediction are empty. It is worth noting that all the information entered by the user is stored on the ITS server and is available to users upon request. Figure 8 is an example of a table used by students at the archaeological site to compare their own predictions with the ITS answers. We have censored information about the fragment identification and the identity of the students.
This first version of the ITS included the coarse granularity and was used in August 2019 by three students to classify 51 remains that were characterized following the format of the same database but belonged to a different archaeological site, Abric del Pastor. Abric del Pastor also dates to the Middle Palaeolithic and is near El Salt. Two of the three students also answered the questionnaire. The tutor's predictions frequently agreed with those of the students, even though the remains used for training came from a different site (El Salt). Students, owing to their limited knowledge and the difficulty of analyzing very fragmented bone remains, usually described the remains based only on size, which usually agrees with the families predicted by the tutor. For example, many bone remains described by the students as "small size" are classified by the tutor as Bovidae, a family included in the category of small animals because most bovids belong to the wild goat (Capra pyrenaica). This information can help students to give more accurate answers or to verify their initial guess. On the other hand, the tutor sometimes fails, and not all sizes and families are correctly classified. For example, some bones that are correctly described as "small size" by students are characterized as Leporidae by the tutor, which is a mistake. Other mistakes are related to the wrong characterization of birds as Leporidae.

Discussion
Random forest with SMOTE was the best-performing machine learning method in this domain. Table 3 displays the F1-score for the different animals. We can see that the coarse and fine granularities have the best F1-scores (0.86 and 0.85, respectively). However, the fine granularity is more suitable because it distinguishes more classes and classifies small animals better.
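For reference, per-class F1-scores of the kind reported in Table 3, and their macro average, can be derived directly from a confusion matrix. The following is a minimal sketch with invented counts and a hypothetical three-class label set; it assumes that rows hold the true classes and columns the predicted classes, and the helper name `per_class_f1` is illustrative only.

```python
def per_class_f1(confusion, labels):
    """Per-class F1 from a square confusion matrix.

    Assumed convention: confusion[i][j] counts the instances whose true
    class is labels[i] and whose predicted class is labels[j].
    """
    scores = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(row[k] for row in confusion) - tp  # predicted k, true class differs
        fn = sum(confusion[k]) - tp                 # true k, predicted class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        # F1 is the harmonic mean of precision and recall.
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Invented counts for three hypothetical classes (not the paper's results).
labels = ["Bovidae", "Leporidae", "Other"]
confusion = [
    [8, 2, 0],
    [1, 6, 1],
    [0, 0, 2],
]

f1_by_class = per_class_f1(confusion, labels)
# Macro average: unweighted mean, so the rare class counts as much as the others.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)
print(f1_by_class, macro_f1)
```

In this toy matrix the smallest class contributes one third of the macro average even though it holds only 2 of the 20 instances, which is why the macro F1-score is sensitive to how well rare animals are classified.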
Let us now analyze the feedback obtained from the archaeology students and teachers. From the pedagogical point of view, their answers indicate that the ITS moderately stimulates the understanding of underlying mechanisms but strongly stimulates the search for resources by the users. Opinions differ somewhat about the ability of the ITS to stimulate the application of knowledge to the discussed problem, but the overall performance of the ITS was rated quite high. Finally, the tips for improvement were to improve the interface and to have more data in order to obtain more accurate predictions.
Although the predictions given by the ITS are usually accurate and useful, the presence of some mistakes must be analyzed. There are two ways of improving the tutor. First, the excessive number of characterizations as Leporidae can be related to the granularity used in this version. Increasing the number of classes would increase the accuracy of the prediction by increasing the amount of information in the system. The second way of improving the accuracy of the tutor is to take into account the spatial distribution of the fragments in the sites. Since the X, Y, and Z variables are important in the characterization, using information from different archaeological sites in the training would help the tutor to characterize remains correctly by avoiding an excessive importance of the bone distribution in a specific site.

Conclusions
In this article we have presented a method to create intelligent tutoring systems in archaeology to help students in specialized tasks that require the analysis of huge amounts of data. The proposed method relies on classification algorithms that must be trained with a complete dataset in order to give accurate results. We have tested our method by developing an intelligent tutoring system in the field of zooarchaeology. We have found that random forest is the method with the best performance in this domain, and that the fine granularity is the most useful approach for archaeologists because the main doubts arise with small animals, which are not common. In the coarse granularity, random forest obtained the best results among the tested methods, with an accuracy and an F1-score of 0.86. As questionnaires to classify the remains are quite long (63 attributes), the 10 most relevant attributes for the classification have been shown, which are width, bone, thickness, length, bone fragment, anatomical group, long bone circumference, X, Y, and Z. In addition, classification using the medium and fine granularities has also been performed, obtaining accuracies of 0.84 and 0.86 and F1-scores of 0.83 and 0.85, respectively.
Regarding the pilot, the tutor correctly classified the families in most cases. However, some mistakes were also found. These can be partially corrected by changing the granularity, since some of them are related to the small number of classes in the granularity implemented in the pilot. The satisfaction questionnaire showed that the tutor was helpful. Students suggested improving the interface and the accuracy of the predictions.
In future work, more data from different archaeological sites can be included in the training phase in order to increase accuracy and avoid effects related to specific characteristics of a single excavation. Also, natural language processing (NLP) and gamification techniques could be added to the tutor in order to enhance the students' learning process.
This work was also part of the project Explora "Application of Soft Computing techniques and 3D modeling in massive data processing in Archeology" that was supported by the Spanish Ministry of Science and Innovation (Grant TIN2015-72682-EXP). This study has been partially funded by ACCIÓ, Spain (Pla d'Actuació de Centres Tecnològics 2019) under the project Augmented Workplace.