Determining the Age of the Author of the Text Based on Deep Neural Network Models

This paper is devoted to solving the problem of determining the age of the author of the text based on models of deep neural networks. The article presents an analysis of methods for determining the age of the author of a text and approaches to determining the age of a user by a photo. This could be a solution to the problem of inaccurate data for training by filtering out incorrect user-specified age data. A detailed description of the author’s technique based on deep neural network models and the interpretation of the results is also presented. The study found that the proposed technique achieved 82% accuracy in determining the age of the author from Russian-language text, which makes it competitive in comparison with approaches for other languages.


Introduction
Text messages are the most popular form of communication and an important part of the daily life of a modern person. The unique style of such messages is directly dependent on gender, age, education and other factors, which makes it possible to clearly identify their author. The features that distinguish the author of a particular message are expressed through the structure of sentences, vocabulary, the use of certain linguistic structures and speech patterns. These described features make it possible to differentiate people into groups.
Research on this topic is relevant because it plays an auxiliary role in solving text mining problems [1][2][3][4][5][6][7], particularly authorship attribution [8,9]. The sentiment of the text, as well as the gender and age of the author, are the most informative features used in attribution. Their use can significantly improve the generalization ability of a model for authorship attribution. Solutions to these problems find application in such important areas of life as information security and forensics, commerce (for example, as a way to optimize targeted advertising), and science (as a tool for linguistic research). They are especially important in forensic expertise, where it is necessary to solve such problems as the following:
• Differentiating users of online platforms by age to combat pedophilia and prevent children from accessing adult content;
• Authorship attribution of an anonymous note with threats;
• Authorship attribution of a suicide note.
that enabling semantic functions could improve performance when working with short texts such as tweets. The following results were obtained: 81.7% for author gender (PAN 2017), 68.2% for personality type (MBTI), 51.2% for age (PAN 2016), 99.6% for news feed topics (BBC), 52.3% for drug side effects, and 50% for their effectiveness. The authors of [16] proposed a joint learning approach for age prediction. In particular, an auxiliary long short-term memory (LSTM) layer was used to learn the text representation and simultaneously feed the auxiliary representation into the main LSTM layer to adjust the age regression. During training, the auxiliary classification LSTM model and the main LSTM regression model were used together. The texts were collected from weibo.com, a well-known microblogging platform in China. The data included 2000 samples of texts written by people between 19 and 28 years old. These texts were divided into 10 age categories. The results showed that this approach significantly improved the effectiveness of age prediction compared with using either an individual classification model or a regression model. The accuracy on the dataset used was 57.3%.
The work in [17] presents the idea of using a multi-task convolutional NN (MTCNN) for determining the age and gender of authors from weibo.com. Users with fewer than 40 subscribers were removed. The resulting dataset consisted of 263,460 entries from 136,072 users. In addition, information on gender and age was considered. Experiments showed that the proposed method gave better results than the SVM and CNN methods on a small dataset and that, as of 2016, the problem was more difficult for Chinese than in similar work on English texts. The experimental results showed accuracies of 68.7% and 71.4% for age and gender, respectively.
The research conducted in [18] presented an automated tool with a unique set of features for analyzing a given text. These features were unigrams, parts of speech, and production rules. The proposed method consisted of several steps. The first step was data entry. The second step was tokenization and the extraction of feature sets. The third step was text cleaning. The fourth step was to apply feature selection to the data. The fifth step involved applying the classifier using different algorithms (SVM, NB algorithm, decision tree, logistic regression, random forest (RF), and multiclass classifier). The last step was to create an output class and evaluate the model performance. The following classes were used in the experiment: gender (male, female) and age (younger, that is, no more than 35 years old, and older, that is, 35 years old and older). The best results were an accuracy of 82.81% for gender determination and 83.2% for age determination, both using the SVM classifier.
In [19], one of the tasks was to classify Russian-language texts by gender and age. The authors used the SVM and RF approaches. The prediction was performed on a corpus of 15,000 LiveJournal posts. Authors aged 20 to 50 years were included in the study, and texts containing fewer than 1000 characters were not considered. The best results were achieved using the RF classifier, resulting in a 49.5% accuracy for age without considering gender and 49.3% with it.
It should be noted that most of the methods considered for determining the age of the author of the text have low accuracy, but this is not the only drawback. All research was focused either on short (posts from social networks and blogs, comments, and so on) or long (articles, posts, reviews, and the like) texts; that is, they were strictly dependent on the length. This approach is incorrect since it does not allow us to assess the universality of the model when solving real problems. In addition, author profiling on social networks is especially difficult, since the slang used by users is often unstructured and contains noise. The problem is also the illiteracy of users, which can be either intentional or unintentional.
The main part of the described drawbacks can be eliminated at the stage of text preprocessing. However, converting the data into an appropriate form does not guarantee high accuracy. There are many factors that affect the accuracy of a model. The most significant of these is the raw data. Training the model will not be effective enough if the data are unreliable, noisy, or unbalanced. For example, real social network users may specify the wrong age in their profile. This requires filtering of the data. The solution to this problem is the use of CV models intended for the related problem of determining the age of a user from a photo.
The work in [20] was devoted to the deep expectation (DEX) method based on the VGG-16 architecture. The essence of this method is that a face is detected on the image, and then age prediction is performed using an ensemble of 20 different architectures. Photos of celebrities from the IMDB and Wikipedia sites were used as experimental data. The DEX method showed the best result at the ChaLearn Looking at People competition in 2015, with an ε error of only 0.264975.
The authors of [21] used the dropout SVM approach. The approach was inspired by the success of deep NNs [22,23], where the problem of overfitting can be addressed by adding dropout regularization, which turns off some part of the neurons during training. This approach improved the adaptation of neurons to input data. The accuracy on face images collected by the authors reached 70%.
In [24], a new CNN model called the soft stagewise regression network (SSR-Net) was presented. Age determination using a multiclass classifier is carried out in several stages. Each stage is responsible for refining the results of the previous one, which leads to a more accurate assessment. To solve the classification problem, SSR-Net assigns a dynamic range to each age class, allowing it to shift and scale according to the input image. The main advantage of the resulting model is compactness; it is only 0.32 MB in size. Despite its compactness, the performance of SSR-Net is close to modern methods, whose model sizes are often 1500 times larger. The MAE of the model was 2.52.
The study in [25] was based on the hyperplane ranking algorithm. The proposed approach uses age tags to predict rank. The age rank was obtained by aggregating a series of binary classification results. The FG-NET dataset was used for the experiments. The results showed that the learning strategy chosen by the authors exceeded the traditional approaches of classification, regression, and ranking. The system shows the best results for images with a neutral facial expression, as facial emotions negatively affect the results of the age assessment. The MAE value for this approach was 3.82.
The research presented in [26] demonstrated the VGG-Face model, developed on the basis of the well-known VGG-Very-Deep-16 architecture. The performance of the model was evaluated on Labeled Faces in the Wild [27] and YouTube Faces [28]. The resulting accuracy was 98%. In this work, it was proposed to refer to the experience of foreign researchers and use the advantages of natural language processing (NLP) and CV models to solve the problem of determining the age of the author of a Russian-language text. Table 1 shows the results of the study of methods using feature extraction from photos on similar datasets.

Methodology
The method for determining the age of the author of the text presented in Figure 1 includes several stages.
Step 1: The pre-processing stage involves clearing texts of noisy data, which negatively affects the classification procedure. One of the problems of social networks is the large amount of spam messages left by users. Therefore, the set of texts is cleared of duplicate messages and comments containing such spam words as asset, subscribe, like, hack, and mutual subscription, among others. Another characteristic of texts from online platforms is the frequent use of emoticons, particularly by audiences up to 18 years old. Short comments consisting mainly of emoticons and including fewer than five Russian words are also deleted, as they are not informative enough. All remaining emoticons are replaced with the @emoji tag.
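The cleaning rules of Step 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the spam list contains only the English glosses quoted above (the real list would hold Russian forms), the emoji character ranges are a rough assumption, and the names `preprocess`, `SPAM_WORDS`, and `MIN_RUSSIAN_WORDS` are hypothetical.

```python
import re

# Illustrative spam markers: English glosses of the examples given in the text.
SPAM_WORDS = {"asset", "subscribe", "like", "hack", "mutual subscription"}
SPAM_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in sorted(SPAM_WORDS)) + r")\b",
    re.IGNORECASE,
)

# Rough emoji ranges; an assumption, not a complete Unicode emoji definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
RUSSIAN_WORD_RE = re.compile(r"[А-Яа-яЁё]+")
MIN_RUSSIAN_WORDS = 5  # comments with fewer Russian words are dropped


def preprocess(comments):
    """Drop duplicates, spam, and uninformative comments; tag emoticons."""
    seen = set()
    cleaned = []
    for text in comments:
        key = text.strip().lower()
        if key in seen:  # duplicate message
            continue
        seen.add(key)
        if SPAM_RE.search(text):  # contains a spam word
            continue
        if len(RUSSIAN_WORD_RE.findall(text)) < MIN_RUSSIAN_WORDS:
            continue  # too few Russian words to be informative
        cleaned.append(EMOJI_RE.sub("@emoji", text))
    return cleaned
```

In this sketch every comment with fewer than five Russian words is removed, which is a simplification of the rule for comments that consist mainly of emoticons.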
Step 2: The data filtering stage is intended for additional validation of the age data. Social media users often specify the wrong age in their profiles. This may be due to various reasons; a frequent one is the aim of accessing adult (18+ years old) content or simply of being able to register on an online platform. To solve this problem, it was decided to use photos from users' pages and a CV model as a tool for effectively filtering inaccurate data.
According to the results obtained by other researchers, the VGG-Face model is the most suitable model for determining age from a user's photo. Therefore, it was decided to use it as the basis for filtering inaccurate data. This method includes several additional steps:
• The age of the user is determined from each photo, and an interval of ±2 years around this estimate is formed.
• If the age specified by the user falls within the interval, a counter is increased. Otherwise, it remains unchanged.
• The age of the user is considered correct if the counter is equal to or greater than half the number of the author's photos.
Such actions allow the selection of only reliable data, based on the coincidence of the age specified in the profile and determined by the photo. Messages from users who publish mostly old photos are not included in the dataset.
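The counting rule of the filtering stage can be written as a short Python sketch. It assumes that the per-photo age estimates have already been produced by an age model such as VGG-Face; the function name `age_is_reliable` and its parameters are hypothetical, and the treatment of users without photos is an assumption.

```python
def age_is_reliable(profile_age, photo_ages, tolerance=2):
    """Return True if at least half of the photo estimates agree with the profile age.

    photo_ages holds one estimated age per photo (hypothetical model output).
    """
    if not photo_ages:
        return False  # assumption: users without photos cannot be verified
    counter = 0
    for estimated in photo_ages:
        # The profile age must fall within estimate - 2 .. estimate + 2.
        if estimated - tolerance <= profile_age <= estimated + tolerance:
            counter += 1
    return counter >= len(photo_ages) / 2
```

For example, with five photos the profile age has to fall inside the interval for at least three of them.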
Step 3: Data conversion into tensor form is carried out using hashing. A dictionary of the most common words is used, and each word from the text is mapped to its index in the dictionary when encoding. Hashing is especially effective with significantly sparse data.
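A minimal Python sketch of the dictionary-based encoding in Step 3 is given below. The whitespace tokenization, the reserved index 0 for padding and unknown words, and the fixed output length are assumptions made for illustration; the names `build_vocab` and `encode` are hypothetical.

```python
from collections import Counter


def build_vocab(texts, max_words):
    """Map the max_words most common words to indices starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is reserved for padding and out-of-dictionary words (an assumption).
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode(text, vocab, length):
    """Replace each word by its dictionary index and pad or truncate to length."""
    ids = [vocab.get(word, 0) for word in text.lower().split()][:length]
    return ids + [0] * (length - len(ids))
```

The resulting fixed-length integer lists can be stacked into a tensor and passed to an embedding layer.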
Step 4 assumes the use of the filtered data from the previous stage for training a deep NN. The task of dividing texts into age groups is nontrivial. Therefore, research [50] was devoted to identifying the most effective architecture of NN. It was found that the most effective architectures for determining the age of the author of the text were FastText and CRNN [51], a hybrid of a CNN and a recurrent NN (RNN).
The FastText architecture is distinguished by some features of input data preprocessing. Skip-gram with negative sampling is used for the vector representation of words. Negative sampling supplies negative examples, that is, pairs of words that do not occur together in context. Anywhere from 3 to 20 negative words are selected for each word. Skip-gram ignores word structure, so a model that breaks words into n-grams was added. Usually, the value of n can be from 3 to 6. The whole word is also added to the chain of n-grams. This approach allows working with words that the model has not previously encountered. Hashing is used since the features obtained by splitting into n-grams have an excessively large dimension. We used the following parameters for FastText training:
The CRNN architecture is an efficient combination for analyzing text sequences. The convolutions of various dimensions allow selection of the most significant features for classification, regardless of their size. Recurrent layers respond to temporal changes and capture context dependencies. The recurrent part of the architecture is represented by LSTM. The choice of LSTM was due to its ability to store long-term dependencies in its memory cells. This is particularly valuable in solving problems of text analysis. The convolutional part is represented by convolutions with filter dimensions 1, 3, and 5. Thus, the CRNN architecture makes it possible to respond to both local and global informative features, as well as short-term and long-term dependencies.
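The subword step of FastText described above (character n-grams with n from 3 to 6, plus the whole word, followed by hashing) can be sketched in Python as follows. The boundary markers `<` and `>` follow the published fastText description; `zlib.crc32` is used here only as a stand-in hash, and the bucket count is an illustrative assumption rather than a parameter reported in this study.

```python
import zlib


def char_ngrams(word, n_min=3, n_max=6):
    """Split a word into character n-grams and add the whole word itself."""
    marked = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]
    if marked not in grams:  # the whole word is also part of the chain
        grams.append(marked)
    return grams


def bucket_ids(word, num_buckets=2_000_000):
    """Hash every n-gram into a fixed number of buckets to bound the dimension."""
    return [zlib.crc32(g.encode("utf-8")) % num_buckets for g in char_ngrams(word)]
```

Because a previously unseen word still shares n-grams with known words, its vector can be composed from the buckets of those n-grams.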
As with the NN architecture, the training parameters also have a serious impact on the final result. Their choice was carried out experimentally, based on the experience of researchers in the NLP field [14][15][16][17][18].
In addition to the described architectures, variations of BERT [52] were considered. This is a well-known and effective architecture for text mining. BERT is a bidirectional model based on the classic transformer architecture which combines the advantages of convolutional and recurrent architectures. Classic multilingual BERT (bert-base-multilingual-cased), XLM (xlm-mlm-xnli15-1024) [53], and RoBERTa (xlm-roberta-base) [54] were applied in this study, as these models are well-adapted to the Russian language. We used the default parameters for all applied BERT modifications.
Step 5 involves validating the selected models. The procedure for validation was 10-fold cross-validation. This was used as a way to get reliable assessments and improve the learning process. The most accurate model based on the validation results was chosen as a decision-making tool.
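The 10-fold cross-validation used in Step 5 can be illustrated with a short Python sketch built on the standard library only. The callback `fit_and_score` is a hypothetical placeholder for training a model on the training indices and returning its accuracy on the validation indices; the shuffling seed is an assumption.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(n_samples, fit_and_score, k=10):
    """Average the score returned by fit_and_score over all k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

Each sample is used for validation exactly once, so the averaged score does not depend on a single lucky or unlucky split.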
Step 6: At the last step, the model obtained in Step 5 is used to predict the age of the author of the text.

Results
Effective text classification requires a large corpus of representative data. Often, there are marked-up texts in various languages in the public domain, but there are no such corpora for solving the problem of determining the age of the author of a Russian-language text. This situation creates an additional task: collecting and marking up a dataset.
For experimental purposes, it was decided to use real data from users of the vk.com online platform, since combating pedophilia in social networks is one of the key areas of application of solutions to this task. The choice of the resource was due to the rapidly growing number of public messages left by users. On average, for the most popular social network in Russia and the CIS, vk.com, this number reached 550,000 messages left by more than 30,000 users.
The data were collected from the platform's community pages using the API. Thus, more than 50,000 links and 70,000 photos from the social network communities were received. This process was automated; a special script retrieved the last 100 entries left on the community pages and their related comments. The total amount of data collected was over 2 million records. Each of these entries included a short comment, an age tag, and five photos from the user's account.
The data were preprocessed according to the author's method. They were cleared of spam, uninformative messages, duplicates, and so on. Then, the preprocessed data were filtered using the VGG-Face model. As a result, 5500 examples were selected from 75,000 texts. For these texts, the user's age turned out to be reliable.
In this experiment, the age distributions of the authors (users of vk.com) before and after filtering were obtained (Figure 2). Based on the resulting distributions, we can conclude that some users deliberately understated their ages in their profiles. The graph shows a drop in the number of users under 18 years old and an increase for people over 18.
The obtained distribution was uneven. This was due to the fact that social networks are not very popular among people over 40 in Russia. This fact was additionally confirmed by the official statistics of vk.com. Additionally, filtering reduced the number of messages from users under 18. Therefore, about 5000 additional messages by pupils from schools in Tomsk affiliated with Tomsk State University of Control Systems and Radio-electronics (TUSUR) were collected from social networks. The ages of these pupils were all known.
The training was conducted using two datasets. The first set included texts, age labels, and images that were not filtered by the VGG-Face method. The second set included both the filtered and additional messages of the Tomsk pupils.
It was decided to implement both binary and multiclass classification. For this purpose, the data contained in the dataset were divided into categories. In the first case, the sets included two categories: under 18 and over 21 years old. In the second, under 18, from 21 to 27, and over 30 were the categories.
The choice of categories was intended to distinguish between underage and adult users. It should be noted that texts written by authors between the ages of 18 and 21 were not considered when training the models, since the generalization ability of the models at this interval was unsatisfactory and would negatively affect the final result. The situation was similar for the age interval of 27-30 years. The main distinctive features of the selected groups were the level of education (school, university), writing style (official business, conversational, containing slang), as well as vocabulary.
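The mapping from an author's age to the binary and multiclass categories can be sketched in Python as follows. The text does not state how the boundary ages are handled, so the choices here (18 to 20 and 28 to 30 excluded, 21 counted as adult) are assumptions, and the function names are hypothetical.

```python
def binary_label(age):
    """Return the binary category, or None for ages excluded from training."""
    if age < 18:
        return "under 18"
    if age >= 21:  # assumption: 21 itself counts as adult
        return "over 21"
    return None  # 18 to 20: excluded interval


def multiclass_label(age):
    """Return the three-class category, or None for excluded ages."""
    if age < 18:
        return "under 18"
    if 21 <= age <= 27:
        return "21 to 27"
    if age > 30:
        return "over 30"
    return None  # 18 to 20 and 28 to 30: excluded intervals
```

Texts whose label is None are simply left out of the corresponding training set.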
The results of the experiments are presented in Table 2. It should be noted that the implemented approaches are difficult to compare with analogues because there are no existing solutions for determining the age of the author of a short Russian text using an NN. In addition, the accuracy of solutions based on traditional ML methods [19] is much lower than that of NN approaches (15% less on average). Thus, their consideration in this case would not be applicable. The Russian language has a distinctive feature of inflection, so comparing this to methods used for English is also incorrect. Inflection means more complex word formation and a high degree of morphological and syntactic homonymy. These aspects make it difficult to use the original methods. Their accuracy for the Russian language drops to 50%. The use of a verified dataset improved the initial accuracy of the models by an average of 17.5%. The obtained values of accuracy allow us to conclude that it is advisable to use CV models in order to improve the training set and clean it from inaccurate data.
In addition, the dataset was tested using the only publicly available tool [55] designed to determine gender and age based on quantitative parameters of texts, such as 3-8 n-grams. The result was 65% when determining the age.

Discussion and Conclusions
As part of the study, a technique for determining the age of the author of a Russian-language text was developed, based on the FastText model and a method for filtering user photos using the VGG-Face model.
Great attention was paid to the experimental data. There are no Russian-language corpora to solve this problem, so our own dataset was collected from social networks. Experiments have shown that using raw social network data is inaccurate, since the real age does not coincide with the ages indicated in the profiles. Therefore, the results of methods evaluated on raw data from social networks cannot be trusted.
We proposed a CV-based algorithm to filter users' messages. Determining the age from photos with VGG-Face and comparing it to the age specified in the profile confirms whether the user has specified the age correctly. If so, the user's messages can be used for model training. Experiments have shown that the application of the filtering procedure has a significant impact on the data distribution by age and on the results obtained. Another conclusion drawn from experiments on data verified by photos is that users deliberately understate their age on social networks. Such actions may be carried out for illegal purposes, particularly for unimpeded communication with users under 18. The analysis confirms the importance of identifying the real age of social network users, mainly for the detection of pedophilia.
The better results on the verified corpus are explained by the fact that a wrong age can easily be specified in social media. Therefore, the classification of the original raw data showed worse results. There is a possibility that users will upload photos that are not their own, but our results show that the method remains suitable in this case.
The best result for Russian-language text was obtained using the FastText model on verified data. The cross-validation accuracy was 82.1% for two categories. The obtained accuracy of the method is comparable to that of approaches for English, Chinese, and other languages, even considering the complexity of the Russian language.