1. Introduction
The definition of “mother tongue” varies across different sources and continues to evolve to encompass the nuances of language use by individuals. One commonly accepted definition is that it refers to the language a person learns through their interactions with family and society during the early years of their life [1]. According to UNESCO, there are over 7000 known mother tongues in the world, with approximately 3000 of them facing the risk of extinction in the near future [2]. Hundreds of millions of people speak some mother tongues, such as Chinese, Hindi, Spanish, English, Arabic, Japanese, and Russian. Others like Turkish, Korean, French, German, Bengali, and Italian also have many speakers globally.
As the Internet continues to expand its reach across the globe, becoming accessible even to less economically developed populations, the diversity of languages used in the digital world is also increasing. English used to be the dominant language on the Internet; however, this is changing as more non-English speakers access online resources. The ability of people from different countries and cultures to communicate and share information in their mother tongue has resulted in a proliferation of diverse languages being used on the Internet.
The exponential increase in the global Internet user base has expanded the market reach for companies; however, the diverse linguistic landscape online presents a formidable challenge in effectively engaging with people from different language backgrounds. Understanding and utilising the mother tongue of individuals who speak different languages is crucial to communicating and marketing successfully with them. The mother tongue of Internet users serves as a defining characteristic, and knowledge of this aspect can be leveraged in various ways to enhance business strategies and user experiences.
Understanding a user’s mother tongue can have practical applications in various domains. For instance, Internet service providers (ISPs) can customise their services to align with users’ language preferences, thereby enhancing user experience. Similarly, online businesses can improve their targeted advertising strategies by considering customers’ mother tongues, as different language preferences may entail distinct consumer needs. Additionally, in digital forensics, knowledge of a suspect’s mother tongue can serve as valuable evidence in criminal investigations, allowing investigators to narrow the pool of potential suspects. Investigators often need to sift through substantial amounts of data and digital evidence to identify perpetrators when dealing with cybercrimes. Information about the suspect’s mother tongue can help focus investigative efforts on a smaller subset of suspects. Another practical application is automatically modifying the interface of a website or application based on the user’s mother tongue, making it more accessible and user-friendly, thus enhancing user satisfaction and engagement. Overall, leveraging the knowledge of a user’s mother tongue can have diverse applications in fields such as ISP services, targeted advertising, digital forensics, and website/application design to enhance user experiences and streamline processes.
This paper aims to identify the mother tongue of an unknown user by leveraging keystroke dynamics (KD), which involves analysing how individuals type on a keyboard, including factors such as the timing, speed, and pressure of key presses. KD has been utilised in various applications such as authentication, biometrics, and user behaviour analysis [3], and offers a promising avenue for research in diverse fields, ranging from cybersecurity to psychology. The ability to identify individuals based on their unique typing patterns has significant implications for authentication, user behaviour analysis, and other related areas [4].
The remainder of this paper is organised as follows: The next section provides a comprehensive review of the relevant literature that contextualises the topic of this study. Subsequently, the methodology employed in this research is described and analysed in detail. Next, the findings about identifying the mother tongue of unknown users, utilising five different machine learning models, are presented. Finally, the paper concludes with a discussion of the potential applications of this research and suggestions for future extensions of the study.
2. Background
The term “mother tongue” refers to the language an individual learns from birth or acquires from their family and community during their formative years. It serves as their primary mode of communication and thought, and they are typically most proficient in using this language. However, the concept of mother tongue has evolved, leading to varying interpretations.
Some experts contend that the language spoken by an individual’s biological mother is the true mother tongue, while others argue that it encompasses the language of the immediate environment. A study [5] asserts that children who receive education in their mother tongue are more likely to excel academically and achieve better long-term educational outcomes. This study defines the mother tongue as “the language a child hears at home and in the community from birth”, emphasising the importance of preserving and nurturing the mother tongue in bilingual education.
Similarly, another study [6] provides a comprehensive overview of bilingual education and bilingualism, defining the mother tongue as the first language a child learns, typically spoken at home. It underscores the significance of maintaining and fostering the mother tongue in bilingual education, as it can facilitate academic success and social integration.
In summary, the concept of mother tongue has different interpretations, ranging from the language spoken by one’s biological mother to the language of the immediate environment. However, scholars such as Cummins and Baker emphasise the importance of preserving and developing the mother tongue in bilingual education, as it can positively impact academic achievement and social integration.
The concept of a mother tongue, also known as a first language (L1), has been defined in various ways by scholars from different disciplines. In linguistics, it is often defined as the language that a person learns naturally from birth or early childhood and has a high level of proficiency in [7]. In education, the mother tongue can also refer to the language used as a medium of instruction in schools and the language of instruction in multilingual contexts [8]. However, the definition and concept of the mother tongue have evolved, and there are different views and perspectives on its content. For example, some scholars have criticised the narrow focus on language proficiency in the traditional definition of the mother tongue and have emphasised the sociocultural and affective aspects of language learning and use [9]. Kamusella [10] argued that language is a political construct and that identities can be constructed and contested through language use. He also emphasised the importance of linguistic diversity and multilingualism in creating more inclusive societies. In another study, Gorter [11] posited that language use is complex and dynamic and that individuals can have multiple and changing linguistic identities based on their social context and experiences. He also highlighted the importance of acknowledging and valuing linguistic diversity in education and society.
One part of the research focused on converting one language into another, with the conversion concerning written or spoken speech. For example, a study by Fei et al. [12] dealt with the problem of incomplete semantic role labelling in low-resource languages. They converted the labels from the source language to the target language in their method. Yi et al. [13] dealt with synthesising spoken speech of various languages from text data (text-to-speech) and tried to deal with the problem of incorrect pronunciation. They proposed a triplet training scheme composed of an anchor, a positive, and a negative sample to cover unseen cases. A similar problem was dealt with by Zhou et al. [14], who tried to improve pronunciation when converting speech into another language, using cross-lingual voice conversion techniques. Finally, Vaswani et al. [15] proposed a new simple network architecture, the “Transformer”, for translating one language to another and achieved outstanding results.
Two important terms related to the mother tongue are “language loss” and “language policy”. Language loss refers to the gradual or rapid decline in the proficiency or use of an individual’s mother tongue, often due to language change or death [16]. A study of how computer and Internet use affects language loss would be noteworthy. Language policy refers to decisions and practices related to language use in various sectors, such as education, government, media, and commerce, which can significantly impact the status, use, and development of the mother tongue and other languages [17].
Regarding the detection of the mother tongue, Mechti et al. [18], utilising a gated recurrent unit (GRU) network, introduced a deep learning model that can accurately identify the mother tongue of Arabic language learners, an essential aspect of language education. The primary objective of this study was to tackle the challenge of recognising the mother tongue of Arabic language learners in order to customise language learning strategies for each learner. Writing samples produced by the learners are presented as input to the proposed model. A pre-trained word embedding layer transforms the input text into a sequence of vectors, which is then passed to the GRU network to capture the long-term dependencies in the input, given the network’s ability to model sequential data. The model is trained on a dataset of writing samples from Arabic language learners with different mother tongues and is evaluated on a separate test set. The results show that the proposed model outperforms several baseline models and achieves high accuracy in identifying the mother tongue of Arabic language learners.
In addition, Siddhant et al. [19] investigated the use of pronunciation information for speaker and language recognition. They tested their models on conversational speech datasets in multiple languages and found that pronunciation information improves the accuracy of mother tongue recognition.
The papers presented in this discussion demonstrate the diverse range of approaches and methods that researchers have used to identify mother tongues. In addition to the tools used in the works mentioned above, many others can also be used to find a user’s mother tongue. Some of them are attention models [20,21], transformer models [22,23], and graph models [24,25]. By developing more accurate and efficient methods for mother tongue recognition, researchers can potentially improve language-related applications such as speech recognition, language teaching, and natural language processing.
One of the earliest studies on keystroke dynamics (KD) was conducted by Gaines et al. in 1980 [26], where they investigated the variation in typing patterns between individuals and found that individuals had unique typing patterns that could be used for identification purposes. Since then, several studies have focused on using KD in authentication systems. For example, Monrose et al. [27] proposed a keyboard-based authentication system that used neural networks to identify users based on their typing patterns, achieving high accuracy rates. Bergadano et al. [28] also conducted a study on using KD for biometric authentication, developing a model based on KD and evaluating its effectiveness through experiments.
Other studies have explored using KD to detect impostors and anomalies in typing behaviours. Killourhy and Maxion [29] used a dataset from users typing a fixed text at regular intervals over several weeks to train and test anomaly detection algorithms, such as Principal Component Analysis, Mahalanobis distance, and Support Vector Machines. They found that the performance of the algorithms varied depending on the specific keystroke features being analysed. Gunetti and Picardi [30] analysed the KD of free text to investigate its feasibility as a biometric authentication mechanism for text entry, obtaining promising results with low false alarm rates and impostor pass rates.
KD has also been explored in user classification and in recognising the user’s physical or mental state. Tsimperidis and Arampatzis [31] attempted to identify characteristics of users, such as gender, age, and handedness, using KD features and a rotation forest classifier, achieving high accuracy rates in user profiling. Tsimperidis et al. [32] used keystroke durations and digram latencies extracted from a dataset to develop a system that could accurately distinguish the age group of an unknown user. Roy et al. [33] proposed a KD-based indicator for Parkinson’s disease screening at home, using ensemble learning and addressing key hypotheses related to the screening process to enhance the accuracy and effectiveness of the method.
As is evident from the literature, on the one hand, the identification of a user’s mother tongue has been attempted using various approaches, such as natural language processing methods and the exploitation of pronunciation information. On the other hand, KD has been mainly used for user authentication, for recognising some inherent or acquired characteristics of users, and for recognising users’ mental and physical state. However, to the best of the authors’ knowledge, KD has not been used so far to identify users’ mother tongue.
3. Methodology
Publicly available KD datasets for research purposes are limited, and most of them were recorded using a fixed-text approach, wherein users enter a given text. This approach has limitations as it may not accurately reflect the natural typing behaviour of users. On the other hand, publishing free-text logging data risks exposing personal data; therefore, such datasets are rarely found in studies or surveys conducted online. As a result, in the initial phase of this research, free-text data were collected from volunteers who willingly participated in the study. In the second phase, relevant features related to KD were extracted from the collected data. Finally, in the third and last phase, the parameters of five machine learning algorithms, namely Support Vector Machine (SVM), Naïve Bayes (NB), Radial Basis Function Network (RBFN), Random Forest (RF), and Multi-Layer Perceptron (MLP), were appropriately set, and the algorithms were used to identify the users’ mother tongue.
3.1. Dataset Preparation
The main objective of this study was to develop a research methodology to classify users based on their mother tongue using KD. A data collection procedure was designed and implemented to achieve this objective, utilising a keylogger to record users’ typing behaviour during their routine computer usage for approximately three and a half months (from 2 April 2022 to 19 July 2022). For this purpose, hundreds of users were approached to participate in the project. Those who agreed to be recorded were instructed to ensure that nobody else used their computer while the keylogger was active, so that the log file contained only their own data. No other instructions or restrictions were given to the volunteers, so they were free to type whatever they wanted, at whatever time of day they wanted, and in whatever application, with the aim of recording their actual typing as closely as possible. In addition, the volunteers were given guarantees that their data would not be shared with third parties and would be used exclusively for the needs of this research. Finally, volunteers were allowed to review their data and then decide whether or not to hand over the log file to the researchers.
The study was conducted with users representing five languages: Albanian, Bulgarian, English, Greek, and Turkish. It is worth noting that each of these languages belongs to a distinct language family or branch, namely Germanic languages (English), Slavic languages (Bulgarian), Turkic languages (Turkish), Hellenic languages (Greek), and Albanian languages (Albanian).
After the data recording process, a total of 194 log files were collected. The number of log files collected for each language is presented in Table 1.
The dataset utilised in this study comprised a different number of log files for each language. Still, the imbalance was not deemed extreme, as the class with the fewest samples contained approximately one third as many samples as the class with the most. The study recruited volunteers of both genders in nearly equal numbers, with ages ranging from 18 to 65 years, and the distribution of ages was almost uniform.
The logfiles of the dataset consist of records, each containing information about an action on the keyboard, such as the key used, the exact time it was used, and the type of action (press or release). Also, each log file contains information about the volunteer, including their mother tongue.
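To illustrate how press and release records of this kind can yield keystroke durations, the following is a minimal sketch only. The tuple layout `(key, timestamp_ms, action)`, the function names, and the sample events are assumptions made for illustration and do not reproduce the actual log file format or the feature extraction used in this study.

```python
from collections import defaultdict
from statistics import mean


def keystroke_durations(events):
    """Pair each key press with its next release and collect hold times per key.

    `events` is an iterable of (key, timestamp_ms, action) tuples, where action
    is "press" or "release". This record layout is an assumption.
    """
    pending = {}                   # key -> timestamp of the unmatched press
    durations = defaultdict(list)  # key -> list of hold times in milliseconds
    for key, timestamp, action in sorted(events, key=lambda e: e[1]):
        if action == "press":
            # Keep the first press; ignore auto-repeat presses of a held key.
            pending.setdefault(key, timestamp)
        elif action == "release" and key in pending:
            durations[key].append(timestamp - pending.pop(key))
    return dict(durations)


def mean_durations(durations):
    """Reduce the per-key hold times to one mean value per key."""
    return {key: mean(values) for key, values in durations.items()}


# Hypothetical events: "a" is typed twice, "b" once, with overlapping holds.
sample_log = [
    ("a", 0, "press"), ("a", 95, "release"),
    ("b", 120, "press"), ("a", 150, "press"),
    ("b", 200, "release"), ("a", 260, "release"),
]
features = mean_durations(keystroke_durations(sample_log))
```

In this sketch, each log file would be reduced to one mean duration per key, giving one feature vector per volunteer, labelled with the mother tongue stored in the file.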
4. Results and Discussion
The primary objective of this study was to classify users based on their mother tongue using keystroke durations as features and machine learning models. To evaluate the performance of these models, metrics such as accuracy, time to build model, F1 score, and area under the ROC curve (AUC) were utilised.
The AUC was chosen as a relevant metric for evaluating the model’s performance, particularly when dealing with imbalanced classes. The AUC quantifies the area under the ROC curve, which illustrates the trade-off between the true and false positive rates at various classification thresholds.
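As a rough illustration of what the AUC measures, the sketch below computes it for a two-class problem as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which is equivalent to the area under the ROC curve. Extending it to five classes through a class-weighted one-vs-rest average is an assumption made here for illustration; the function names are hypothetical and the source does not state how the multi-class AUC was aggregated.

```python
def binary_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def weighted_ovr_auc(score_rows, labels):
    """Class-weighted one-vs-rest AUC (an assumed aggregation scheme).

    `score_rows[i]` maps each class name to the score of sample i for that class.
    """
    total = len(labels)
    result = 0.0
    for cls in sorted(set(labels)):
        scores = [row[cls] for row in score_rows]
        binary = [1 if y == cls else 0 for y in labels]
        result += binary_auc(scores, binary) * binary.count(1) / total
    return result


# Toy example: three of the four (positive, negative) pairs are ordered correctly.
example_auc = binary_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

Because the pairwise formulation depends only on how samples are ranked, the value does not change with the proportion of positives and negatives, which is why the metric is often preferred for imbalanced classes.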
Each experiment was performed with 10-fold cross-validation to obtain a more reliable estimate: the data are split into ten parts, training and testing are repeated ten times with a different part held out as the testing set each time, and the resulting values are averaged. This reduces the risk of reporting an outlying value as the accuracy, F1 score, or AUC.
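The splitting and averaging procedure can be sketched as follows. This is a simplified, non-stratified illustration; the function names, the fixed random seed, and the `evaluate` callback are assumptions, and the experiments in this study may have used a different (for example, stratified) implementation.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the metric returned by `evaluate(train, test)` over the k folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# 194 is the number of log files in the dataset; each is one sample.
splits = list(k_fold_indices(194, k=10))
```

With 194 samples, each testing set holds 19 or 20 log files and the corresponding training set holds the remaining 174 or 175.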
Through a series of extensive experiments, the best results obtained from each classifier were documented in Table 2.
The best results of Table 2 were obtained for the SVM model having a C parameter value equal to 10, for the RBFN model having 10 clusters and a minimum standard deviation for clusters equal to 1.5, for the RF model having 1100 trees and 150 features to be considered in random feature selection, and for the MLP model having a learning rate of 0.6 and a momentum of 0.8.
Figure 1 shows the best results of the five machine learning models in identifying the mother tongue of unknown users.
Based on the findings, the RBFN model demonstrated the highest accuracy of 82.5% and an F1 score of 0.823, followed by the MLP model with 81.4% accuracy and 0.814 F1 score. Conversely, MLP had the highest value of 0.946 in terms of AUC, followed by RBFN with 0.939. Notably, the NB and SVM models were observed to have shorter training times compared to other models.
Two conclusions can be drawn from these results. Firstly, the classification of users based on their mother tongue using KD data is feasible, with an accuracy of over 82% achieved in this dataset, surpassing the baseline accuracy of around 28%. Secondly, RBFN exhibited the best overall performance in accuracy and F1 score, but SVM may be more practical for real-world applications due to its shorter training time.
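The source does not state how the baseline accuracy was obtained. One common choice, assumed here purely for illustration, is the accuracy of a classifier that always predicts the most frequent class, which equals the share of the largest class in the dataset. The class names and counts below are toy values and are not the per-language counts of Table 1.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Toy labels for five hypothetical classes (NOT the counts of Table 1).
toy_labels = ["A"] * 7 + ["B"] * 6 + ["C"] * 5 + ["D"] * 4 + ["E"] * 3
label, baseline = majority_baseline(toy_labels)
```

Under this assumption, a baseline above the 20% expected from uniform guessing among five classes simply reflects that the classes are not equally sized.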
Finally, a closer examination of the results is presented in the confusion matrix shown in Table 3, which corresponds to the best-performing run of the experiments.
Although the languages under investigation belong to distinct language families, as elucidated earlier, Table 3 presents evidence indicating a higher likelihood of misclassification of log files as belonging to languages spoken in geographically adjacent regions. For instance, log files attributed to the Albanian language were occasionally misclassified as Greek or Bulgarian, which are official languages of neighbouring countries. While this observation is a preliminary finding, it merits further investigation in an extended research study to better comprehend the underlying factors contributing to this phenomenon.
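For readers who wish to relate a confusion matrix to the reported metrics, the sketch below derives the overall accuracy and the per-class precision, recall, and F1 score from such a matrix. The row and column convention (rows are actual classes, columns are predicted classes), the function name, and the three-class toy matrix are assumptions for illustration; the values are not those of Table 3.

```python
def per_class_metrics(matrix, labels):
    """Compute accuracy and per-class precision, recall, and F1.

    `matrix[i][j]` counts samples of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        actual = sum(matrix[i])                    # row sum: samples of class i
        predicted = sum(row[i] for row in matrix)  # column sum: predictions of i
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return correct / total, report


# Toy matrix for three hypothetical classes (NOT the values of Table 3).
toy_labels = ["A", "B", "C"]
toy_matrix = [
    [8, 2, 0],   # actual A: two samples misclassified as B
    [1, 9, 0],   # actual B: one sample misclassified as A
    [0, 0, 10],  # actual C: all samples classified correctly
]
accuracy, report = per_class_metrics(toy_matrix, toy_labels)
```

Off-diagonal cells that are concentrated between particular pairs of classes, as between A and B in the toy matrix, are the kind of pattern discussed above for languages of neighbouring countries.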
Since there is no other research in the literature that attempts to identify the mother tongue of users using KD, the results of this research cannot be directly compared with those of any other.