The Measurement of Demographic Temperature Using the Sentiment Analysis of Data from the Social Network VKontakte

Kalabikhina, Irina Evgenievna; Banin, Evgeniy Petrovich; Abduselimova, Imiliya Abduselimovna; Klimenko, German Andreevich; Kolotusha, Anton Vasilyevich

doi:10.3390/math9090987


by Irina Evgenievna Kalabikhina ¹, Evgeniy Petrovich Banin ²,*, Imiliya Abduselimovna Abduselimova ¹, German Andreevich Klimenko ¹ and Anton Vasilyevich Kolotusha ¹

¹ Population Department, Faculty of Economics, Lomonosov Moscow State University, Leninskije Gory, GSP-1, 119991 Moscow, Russia

² Department of Applied Mechanics, Faculty of Robotics and Complex Automation, Bauman Moscow State Technical University, Baumanskaya 2-ya st., 5/1, 105005 Moscow, Russia

* Author to whom correspondence should be addressed.

Mathematics 2021, 9(9), 987; https://doi.org/10.3390/math9090987

Submission received: 11 January 2021 / Revised: 24 March 2021 / Accepted: 19 April 2021 / Published: 28 April 2021

(This article belongs to the Special Issue Mathematical and Instrumental Methods in the Digital Economy)


Abstract

Social networks have a huge potential for reflecting public opinion, values, and attitudes. The approach presented in this study makes it possible to continuously measure how cold “the demographic temperature” is, based on data taken from the Russian social network VKontakte. This is the first attempt to analyze the sentiment of Russian-language comments on social networks to determine the demographic temperature (ratio of positive and negative comments) in certain socio-demographic groups of social network users. The authors use data from comments on posts from 314 pro-natalist groups (with child-born reproductive attitudes) and eight anti-natalist groups (with child-free reproductive attitudes) on the demographic topic, which have 9 million users from all over Russia. The algorithm of the sentiment analysis for demographic tasks is presented in the article. In particular, it was found that comments under posts are more suitable for analyzing the sentiment of statements than the texts of posts. Using the available data in two types of groups since 2014, we find an asynchronous structural shift in the comments of the corpuses of pro-natalist and anti-natalist thematic groups. Interpretations of the evidence are offered in the discussion part of the article. An additional result of our work is two open Russian-language datasets of comments on social networks.

Keywords:

reproductive attitudes; demographic temperature; sentiment analysis; big data; Latent Dirichlet Allocation; VKontakte

JEL Classification:

C02; C18; C31; C61; J11; J13

1. Introduction

Over the past 10 years, social networks have evolved into a tool of influence—social engineering [1]. The 2016 election race in the United States can be considered a landmark event that marked the transition to a new information reality. From that moment, it became clear that social networks have a huge potential for influencing the masses [2,3] and for reflecting the public opinion [4]. Working with data from social networks is largely related to the analysis of large texts. According to the international database of scientific publications Scopus, one can observe a doubling of the number of publications on the subject of “sentiment analysis” every three years (Figure 1a) from 2010 to the present. The leading positions in publication activity are held by the USA, China, and India (Figure 1b). During the period from 2010 to 2020, authors from Russia published 267 indexed documents in this area.

The majority of publications on the sentiment analysis topic are concentrated in four fields of knowledge: computer science (39.2%), engineering (12.3%), mathematics (10.2%), and social science (10.0%). The subject area of economics, econometrics, and finance (the classification from the Scopus database is used) accounts for 2.4% of all publications on the subject of sentiment analysis, and business, management, and accounting accounts for 4.5%. According to the statistics presented, we can conclude that sentiment analysis tools have been tested thoroughly enough to be usable at the level of practical applications.

These algorithms could be used in various subject areas, which would entail an increase in the number of publications in the humanities and social disciplines. It should be noted that this growth also depends on the possibility of effective interaction between specialists from the humanities and specialists in data analysis. Interdisciplinarity, in this case, is a key aspect of the possibility of obtaining new knowledge about economics. This article is based on the application of a multidisciplinary approach to measuring the sentiment of utterances on social media for use in demographic analysis.

This is the first attempt to analyze the sentiment of Russian-language comments on social networks to determine the “demographic temperature” in certain socio-demographic groups of social network users.

By demographic temperature, we mean the emotional background or the predominance of the positive or negative stance of statements on topics related to family values, childbirth, and other topics in the field of reproductive behavior. The demographic temperature is measured as the difference between the number of positive and negative statements over a certain period of time [5].
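The definition above can be illustrated with a minimal sketch. It assumes that each comment has already been labeled as positive (1) or negative (0) and tagged with a month; the `(month, label)` input format and the toy data are hypothetical and are not taken from the study.

```python
from collections import defaultdict

def monthly_temperature(comments):
    """Return {month: positives - negatives} for labeled comments.

    `comments` is an iterable of (month, label) pairs, where label is
    1 for a positive comment and 0 for a negative one (assumed format).
    """
    balance = defaultdict(int)
    for month, label in comments:
        balance[month] += 1 if label == 1 else -1
    return dict(balance)

# Toy data for illustration only.
sample = [("2020-01", 1), ("2020-01", 0), ("2020-01", 0), ("2020-02", 1)]
temperature = monthly_temperature(sample)
print(temperature)  # {'2020-01': -1, '2020-02': 1}
```

A negative value for a month means that negative statements prevailed in that month, i.e., the temperature was "cold".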

Sentiment analysis is one of the subtasks of computational linguistics, the purpose of which is to assess the “mood” of a text. In the simplest implementation, the mood of the text is divided into three classes: positive, negative, and neutral. More complex implementations require the classification of the elements of the text into five or more classes in order to assess the gradations of its emotional coloring. The approaches to sentiment analysis can be divided into two groups. The first group includes rule-based methods. As the name suggests, these methods rely on predefined classification rules and developed vocabularies, i.e., it is a supervised learning option. According to the developed rules, based on emotional keywords and their combinations, it is possible to assess the class of the text; for example, this approach is implemented in the studies [6,7,8]. This group of methods has several drawbacks, such as poor generalizability of the results and the laboriousness of forming the rules. Moreover, the complexity of the rules grows with increasing depth of analysis, since reference sentiment values (a sentiment dictionary) are needed. In the Russian-speaking segment, there are two large projects with developed sentiment dictionaries: RuSentiLex [9] and LINIS Crowd [10]. Both projects provide dictionaries with an assessment of sentiment (from positive to negative) for each word or combination of words but without characteristics of emotional coloring, which limits their applicability. More complex sentiment models are proposed in the SentiWordNet [11] and SenticNet [12,13] projects, which are based on the sentiment analysis of the English language.

As an alternative to the first approach, the second approach, based on machine learning, is gaining popularity nowadays. This approach involves the automatic extraction of features from the text. The Naïve Bayes classifier, logistic regression, decision trees, support vector machines, and neural networks are the most popular algorithms for classifying the emotional coloring of texts. Currently, deep learning methods for neural networks are the most actively developing methods of sentiment analysis. They demonstrate the best results of sentiment assessment in comparison with other existing approaches [14,15]; the best-performing class of methods is based on convolutional and recurrent neural networks, as well as transfer learning models. The simplest implementation of the representation of text in vector space is the “bag of words” model. It is used in a large number of studies for the primary processing of text data. Deeper approaches to analysis include distributional semantics models such as Word2Vec (Project link https://code.google.com/archive/p/word2vec/ (accessed on 8 October 2020)), Doc2Vec [16], FastText [17], and GloVe [18]. Sentiment analysis methods based on machine learning are characterized by the need for pre-training on large sets of marked-up texts. There are known attempts to combine the two presented approaches (based on rules and on machine learning), for example, the works [19,20].
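To make the “bag of words” representation concrete, the following minimal sketch maps a tokenized comment to a vector of word counts over a fixed vocabulary. The five-word vocabulary and the example tokens are hypothetical; in practice the vocabulary would be built from the corpus.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Map a token list to a count vector ordered like `vocabulary`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and one tokenized comment.
vocabulary = ["ребенок", "мама", "хорошо", "плохо", "врач"]
vector = bag_of_words(["мама", "ребенок", "хорошо", "мама"], vocabulary)
print(vector)  # [1, 2, 1, 0, 0]
```

Word order is discarded in this representation, and the length of the vector equals the size of the vocabulary.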

There are currently five sources of textual data: user-generated content, product and service reviews, news content, books, and mixed data. In a review paper on applied sentiment analysis [21], it was noted that for the Russian language, this field is little explored (the author notes the 27 most relevant studies on the analysis of sentiments in Russian). Much of the research has focused on analyzing the sentiments of tweets (short messages) on the social network Twitter. According to Brand Analytics (Report data at the link https://br-analytics.ru/blog/social-media-russia-2019/ (accessed on 8 October 2020)), Twitter is not widely used in Russia. For example, Twitter had only about 650 thousand active users from Russia in 2019 (in November 2019, about 32 million messages were written by Russian users, i.e., there were 49 posts per author on average). Nevertheless, tweets require a concise presentation of the position from the user (on the social network, the message is limited to 280 characters), which allows a researcher to cover a large number of dissimilar opinions. It makes tweets a convenient unit of analysis.

In this study, we tested the machine learning toolkit on text data obtained from the most popular social network, VKontakte.

The social network VKontakte has the largest coverage of the Russian-speaking population at the time of our research. According to a report by the consulting company Deloitte (Link to source https://www2.deloitte.com/content/dam/Deloitte/ru/Documents/research-center/media-consumption-in-russia-2018-en.pdf (accessed on 8 October 2020)), VKontakte covers up to 70% of the Russian population. A paucity of research on this topic and high coverage of the population predetermined the choice of the social network VKontakte as a source of textual data for analysis.

The level of analysis is country-specific since the selected groups collect users with different geolocations. The data are collected to measure the demographic temperature in certain socio-demographic communities over a fairly long period of time from 2014 to the present (late 2020 to early 2021).

We collected unstructured text data from two types of groups, preprocessed the data (cleaning, lemmatization, stemming, and punctuation removal), and formed a structured array (corpus) of texts. Thematic clusters were identified based on the Latent Dirichlet Allocation (LDA). After carrying out a thematic analysis for each cluster, an assessment of the sentiment of the texts was carried out. The dynamics of the change in sentiment over time were constructed for the comments of participants in groups with different reproductive attitudes for a more accurate measurement of the demographic temperature (demographic climate) in Russian society.

2. Data and Processing

The source of text data is VKontakte thematic groups (Website Vk.com (accessed on 8 October 2020)).

We chose simple characteristics of groups in a demographic context. The first dataset consists of pro-natalist (pro-family) groups (e.g., “Good Parents”, “Pregnancy”, “I Am Mom”, and others). It can be assumed that the members of these groups have positive reproductive attitudes, have given birth to a child in the recent past, or have the goal of having a child soon [5]. The second dataset consists of anti-natalist groups (e.g., “Childfree” and others). It can be assumed that the reproductive attitudes of the participants in these groups are not associated with the birth of a child in the recent past or near future [22]. It is important to take into account the reproductive attitudes of the population for demographic analysis, since the analysis of various factors of fertility is most accurately implemented with such an approach [23]. The influence of demographic policy or economic factors should be assessed in different population groups by the criterion of reproductive attitudes. We considered separately the comments of women and men in all time periods (2014–2020), which can also be important for analyzing the demographic temperature in society, since almost all measures of socio-demographic policy and economic dynamics are not gender-neutral [24,25].

Let us take a closer look at creating the database of pro-natalist groups (for our first open dataset see [5]; for our second open dataset see [22]). At the first stage of processing, the unique addresses of thematic groups, of the form vk.com/<unique group identifier>, were collected through the built-in API (application programming interface) by keywords (“mother”, “mothers”, “children”, “child”, “baby”, “health”, “birth”, “pregnancy”, and “parents”). About 1000 unique group addresses were collected at this stage, together with data on the number of participants.

At the second stage, groups associated with advertising were excluded from the sample, as well as groups with low participant activity (the overall dynamics of changes in the number of posts, likes, and reposts was estimated) and groups with fewer than 10,000 subscribers. Thus, the final sample for collecting text information consisted of 341 groups (with a maximum of 1,482,303 subscribers, a minimum of 10,058, and an average of 75,297). Without considering self-intersections, the sample coverage was about 9 million users. The table shows a list of the most popular groups on the generalized topic “motherhood and childhood” ranked by the number of subscribers.

After the formation of the final list of groups, text information was collected from the groups. In this work, we used only texts from posts and comments. Based on the information collected, the corpus of the language was formed: all words were reduced to lower case; stop words, i.e., the most common words such as prepositions and conjunctions (the stop-word list can be viewed at the link https://countwordsfree.com/stopwords/russian (accessed on 8 October 2020)), were removed using functions from the nltk (project website nltk.org) or gensim (link to resource https://pypi.org/project/gensim/ (accessed on 8 October 2020)) library; punctuation was removed; and numerical data were excluded. To reduce the amount of text data, we additionally carried out stemming (removal of word endings) and lemmatization (reduction of a word to its initial form) using the MyStem lemmatizer (link to the Yandex project https://yandex.ru/dev/mystem/ (accessed on 8 October 2020)). An example of text processing is shown in Figure 2.
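The cleaning steps described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than the pipeline used in the study: the stop-word set below is a tiny hypothetical sample (the study used a full list with nltk or gensim), Latin characters are dropped together with punctuation and digits, and stemming and lemmatization (MyStem) are omitted.

```python
import re

# Tiny illustrative stop-word set; a real list is much longer.
STOP_WORDS = {"и", "в", "на", "не", "что", "я", "с", "у"}

def preprocess(text):
    """Lowercase, keep Cyrillic letters only, and drop stop words."""
    text = text.lower()
    # Replace every run of non-Cyrillic characters with a space; this
    # removes punctuation, digits, and Latin characters in one pass.
    text = re.sub(r"[^а-яё]+", " ", text)
    return [token for token in text.split() if token not in STOP_WORDS]

tokens = preprocess("Я не знаю, что делать: у ребёнка 2 зуба!")
print(tokens)  # ['знаю', 'делать', 'ребёнка', 'зуба']
```

The resulting token lists are what a stemmer or lemmatizer would receive at the next step.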

The resulting database of pro-natalist groups (the pro-natalist corpus) is available in the public domain [5]. The second database, of anti-natalist groups, is being edited for public access (it will be publicly available at the same address). It currently contains the groups presented in Table 1. Data collection was carried out in the same way as for the pro-natalist groups. The final sample for collecting text information consisted of 8 anti-natalist groups (with a maximum of 61,071 subscribers, a minimum of 619, and an average of 8950). There are fewer anti-natalist groups in Russia than pro-natalist ones (there were only 8 anti-natalist groups, excluding groups with fewer than 600 subscribers). Nevertheless, the activity of anti-natalist groups (number of posts, comments, likes, etc.) is much higher.

Thus, the analysis uses data from comments on posts from 314 Russian pro-natalist groups (with reproductive attitudes for childbirth) and 8 anti-natalist groups (with reproductive attitudes against childbirth (“childfree”)) on the demographic topic. We assume that the participants of the so-called pro-natalist groups have reproductive attitudes to have children, and that the participants of anti-natalist groups have reproductive attitudes to have no children in the period of investigation or during their reproductive life.

3. Method: Thematic Modeling and Sentiment Analysis

The classical topic modeling approach based on the tf-idf metric [26] is the simplest to implement but has some limitations: it provides only a small reduction in description length and reveals little in the way of inter- or intradocument statistical structure. To address these shortcomings, Latent Semantic Analysis (LSA) [27] was developed, which evolved into probabilistic Latent Semantic Analysis (pLSA) using maximum likelihood or Bayesian methods [28]. A little later, a method based on Latent Dirichlet Allocation (LDA) was proposed in [29] on a similar probabilistic model. The most important advantage of LDA over pLSA is that LDA more adequately estimates the probabilities of new and rare words. It can also be noted that LDA builds a sparser topic model and is less prone to overfitting.

A thematic model is a model of a collection of text documents that determines which topics a given document applies to. In addition to highlighting the structure of a text collection, thematic modeling allows a semantic search for information. This method differs from search by keywords, where the meaning is not explicitly presented. The thematic model identifies hidden topics in the document by the observed distributions of the words $p (w | d)$ in the documents (i.e., the frequency estimate). Thus, according to a collection of documents, it is possible to select information:

(1) The topics that make up the collection of texts ($p (t)$ is the probability of topic $t$ in the collection);

(2) The topics that each document consists of ($p (t | d)$ is the probability of topic $t$ in document $d$);

(3) The words that each topic consists of ($p (w | t)$ is the probability of the word $w$ in topic $t$).

If we assume that

$W$ is a finite set of words;
$D$ is a finite set of documents in the collection;
$T$ is a finite set of topics;
The word order in the document and the order of the documents in the collection are not important;
Every word $w$ in the document is related to some topic $t$

then the task is reduced to finding the distribution ${(d_{i}, w_{i}, t_{i})}_{i = 1}^{n} \sim p (d, w, t)$ in a discrete probability space of dimension $D \cdot W \cdot T$ (Figure 3).

This study solves the inverse problem of topic modeling. Accordingly, using the known and observed $d_{i}, w_{i}$, it is necessary to obtain $t_{i}$ (the number of topics is a hyperparameter and is set by the user). Thus, the thematic model can be formally represented as a formula for the total probability:

$p (w | d) = \sum_{t \in T} p (w | d, t) \cdot p (t | d)$ (1)

To simplify this relation, the hypothesis of conditional independence $p (w | d, t) = p (w | t)$ is applied, so the relation becomes:

$p (w | d) = \sum_{t \in T} p (w | t) \cdot p (t | d)$ (2)

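Relation (2) allows solving both direct and inverse problems of thematic modeling. This study is devoted to the inverse problem of topic modeling. Consequently, the word counts $n_{d w}$ (the number of occurrences of word $w$ in document $d$) are known from the documents retrieved from the VKontakte social network communities, and, with $n_{d}$ denoting the total number of words in document $d$, the frequency estimate is:

$p (w | d) = \frac{n_{d w}}{n_{d}}$ (3)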
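As an illustration of the frequency estimate in Equation (3), the sketch below computes $p (w | d)$ for a single tokenized document. The function name and the toy document are hypothetical.

```python
from collections import Counter

def word_frequencies(tokens):
    """Estimate p(w | d) = n_dw / n_d for one tokenized document."""
    counts = Counter(tokens)   # n_dw for every word w that occurs in d
    total = len(tokens)        # n_d, the length of the document
    return {word: n / total for word, n in counts.items()}

doc = ["мама", "ребенок", "мама", "врач"]
p_wd = word_frequencies(doc)
print(p_wd)  # {'мама': 0.5, 'ребенок': 0.25, 'врач': 0.25}
```

By construction, the estimated probabilities of all words in a document sum to one.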

By the known distribution of word frequencies, the problem is reduced to the equation

$p (w | d) = \sum_{t \in T} φ_{w t} θ_{t d},$ (4)

where $φ_{w t} = p (w | t)$ is the probability of term $w$ in topic $t$, and $θ_{t d} = p (t | d)$ is the probability of topic $t$ in document $d$.
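Equation (4) is a product of a word-by-topic matrix and a topic-by-document matrix. The following sketch evaluates it for hypothetical values with three words, two topics, and two documents; the numbers are chosen for illustration only and are not estimates from the corpus.

```python
# phi[w][t] = p(w | t): each column (topic) sums to 1 over the words.
phi = [
    [0.7, 0.1],
    [0.2, 0.3],
    [0.1, 0.6],
]
# theta[t][d] = p(t | d): each column (document) sums to 1 over the topics.
theta = [
    [0.9, 0.2],
    [0.1, 0.8],
]

def mixture(phi, theta):
    """Return p[w][d] = sum over t of phi[w][t] * theta[t][d]."""
    n_words, n_topics, n_docs = len(phi), len(theta), len(theta[0])
    return [
        [sum(phi[w][t] * theta[t][d] for t in range(n_topics))
         for d in range(n_docs)]
        for w in range(n_words)
    ]

p = mixture(phi, theta)
# The word distribution of every document again sums to 1 (up to rounding).
column_sums = [sum(p[w][d] for w in range(3)) for d in range(2)]
print(column_sums)
```

The inverse problem runs in the opposite direction: the left-hand side is observed through Equation (3), and the two matrices on the right are the unknowns.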

Before the implementation of LDA, it is necessary to make an assumption about the prior Dirichlet distributions, where:

$θ_{d} = {(θ_{t d})}_{t \in T} \in ℝ^{| T |}$ are random vectors drawn from the Dirichlet distribution with parameter $α \in ℝ^{| T |}$:

$\mathrm{Dir} (θ_{d} | α) = \frac{Γ (α_{0})}{\prod_{t} Γ (α_{t})} \prod_{t} θ_{t d}^{α_{t} - 1}, α_{0} = \sum_{t} α_{t}, \sum_{t} θ_{t d} = 1$

$φ_{t} = {(φ_{w t})}_{w \in W} \in ℝ^{| W |}$ are random vectors drawn from the Dirichlet distribution with parameter $β \in ℝ^{| W |}$:

$\mathrm{Dir} (φ_{t} | β) = \frac{Γ (β_{0})}{\prod_{w} Γ (β_{w})} \prod_{w} φ_{w t}^{β_{w} - 1}, β_{0} = \sum_{w} β_{w}, \sum_{w} φ_{w t} = 1$
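The Dirichlet density above can be evaluated directly with the gamma function. The sketch below is a plain transcription of the formula for a single vector; the parameter values are hypothetical and serve only to show how the choice of $α$ shapes the prior.

```python
import math

def dirichlet_pdf(theta, alpha):
    """Density Dir(theta | alpha) for theta on the probability simplex."""
    if abs(sum(theta) - 1.0) > 1e-9 or min(theta) <= 0:
        raise ValueError("theta must be positive and sum to 1")
    alpha0 = sum(alpha)
    norm = math.gamma(alpha0) / math.prod(math.gamma(a) for a in alpha)
    return norm * math.prod(t ** (a - 1) for t, a in zip(theta, alpha))

# With alpha_t = 1 for all t the prior is uniform on the simplex:
# for three topics the density equals Gamma(3) = 2 everywhere.
uniform = dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])

# With alpha_t < 1 the prior favours sparse vectors, i.e., documents
# dominated by a few topics, over evenly mixed ones.
sparse = dirichlet_pdf([0.9, 0.05, 0.05], [0.1, 0.1, 0.1])
even = dirichlet_pdf([1 / 3, 1 / 3, 1 / 3], [0.1, 0.1, 0.1])
print(uniform, sparse > even)
```

Small values of the prior parameters are one way in which LDA arrives at sparse topic models.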

There are two types of topic modeling problems: direct and inverse. The direct problem generates texts from a known distribution of words over topics. The inverse problem forms a list of topics from the observed frequency of terms in the collection. For the inverse problem, the number of topics is a hyperparameter and is set by the researcher. Based on the resulting set of topics in the form of a vector of the most frequent words (including bigrams and trigrams), the researcher then labels each topic with an appropriate semantic context for further analysis.

The TensorFlow (Official website of the project tensorflow.org) and tflearn (Official website of the project tflearn.org) libraries were used for sentiment analysis. The neural network for sentiment analysis has a three-level architecture. The first level corresponds to the dimension of the commentary corpus dictionary. The second level consists of 125 neurons (fully connected layer, ReLU (Function description by reference https://en.wikipedia.org/wiki/Rectifier_(neural_networks) (accessed on 8 October 2020)) activation function). The third level consists of 25 neurons (fully connected layer, ReLU activation function). The output level is binary (“0”: negative, “1”: positive; activation function: softmax (Function description by reference https://ru.wikipedia.org/wiki/Softmax (accessed on 8 October 2020))). Learning algorithm specification (Description by reference http://tflearn.org/layers/estimator/ (accessed on 8 October 2020)):

# net: the output layer of the network whose architecture is listed below
Rg = tflearn.regression(net, optimizer='sgd', loss='categorical_crossentropy')

Neural network architecture (Figure 4):

import tflearn

net = tflearn.input_data([None, VOCAB_SIZE])  # one bag-of-words vector per comment
net = tflearn.fully_connected(net, 125, activation='relu')
net = tflearn.fully_connected(net, 25, activation='relu')
net = tflearn.fully_connected(net, 2, activation='softmax')  # [negative, positive]

Figure 4. Model diagram.

The sum of the level signals at the output is equal to one, i.e., the output is two numbers that characterize the probability that this comment will be negative or positive. Neural network training is based on the principle of error backpropagation. The neural network was trained on a marked-up database of short messages from Twitter [30]. Some examples are given in Table 2.
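The behaviour of the output level can be sketched without any deep learning library. The example below applies softmax to two hypothetical output signals and assigns a class using the 0.5 probability threshold reported in the text; the function names and input values are illustrative assumptions.

```python
import math

def softmax(logits):
    """Convert raw output signals into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, threshold=0.5):
    """Return 1 (positive) if p(positive) exceeds the threshold, else 0."""
    p_negative, p_positive = softmax(logits)
    return 1 if p_positive > threshold else 0

probs = softmax([0.3, 1.2])    # hypothetical output-level signals
label = classify([0.3, 1.2])
print(probs, label)
```

With two classes, comparing the positive probability with 0.5 is equivalent to choosing the class with the larger output.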

The neural network was trained in the Google Colab environment using a graphics processing unit (GPU). To train the neural network, about 24 GB of RAM was used with a training dictionary of 5000 words. Before training, the data underwent stemming (reduction to the basic form of the word). All non-Cyrillic symbols were removed from the sample. The test sample size is 30% of the entire sample. The number of epochs for training is 30. The resulting accuracy on the training sample is 93.4%; on the test sample, it is 69% (for both groups). The threshold value of the probability for classifying a comment as positive or negative is 0.5.
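To give a sense of the size of the model, the number of trainable parameters can be counted from the layer sizes. This sketch assumes that the input dimension equals the 5000-word training dictionary and that every fully connected layer has a bias term; both are assumptions about the configuration rather than details confirmed in the text.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

VOCAB_SIZE = 5000  # dictionary size reported in the text (assumed input size)
layer_sizes = [VOCAB_SIZE, 125, 25, 2]

total = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
print(total)  # 628327
```

Almost all of the parameters sit in the first fully connected layer, which may help put the gap between training and test accuracy in context.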

4. Results of Modeling

Latent Dirichlet Allocation (LDA) is an unsupervised learning method, so distributing texts across clusters does not require preliminary markup of the training corpus. One only needs to select the number of topics, which we do using the coherence (consistency) metric. A topic can be considered coherent if the terms that are most frequent in a given topic (cluster) are often found close together in the collection’s documents. In our study, coherence was evaluated on the same collection that the model was built on. To assess coherence, we used pointwise mutual information (PMI). The method for evaluating pointwise mutual information is implemented in the gensim library (class CoherenceModel). Figure 5 shows the change in the coherence metric depending on the number of topics for LDA clustering. The best clustering by the coherence metric is achieved for 10 topics. The pyLDAvis (Official website of the project pypi.org/project/pyLDAvis/ (accessed on 8 October 2020)) library is used to visualize thematic modeling.
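The idea behind a PMI-based coherence score can be sketched in a few lines. This is a simplified illustration and not the gensim implementation: co-occurrence is counted at the level of whole documents rather than sliding windows, the smoothing constant `eps` is an assumption, and the toy documents are hypothetical.

```python
import math
from itertools import combinations

def pmi_coherence(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words over a collection."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Share of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        joint = prob(w1, w2)
        scores.append(math.log((joint + eps) / (prob(w1) * prob(w2) + eps)))
    return sum(scores) / len(scores)

docs = [
    ["мама", "ребенок", "врач"],
    ["мама", "ребенок"],
    ["работа", "деньги"],
    ["работа", "деньги", "ребенок"],
]
coherent = pmi_coherence(["мама", "ребенок"], docs)
incoherent = pmi_coherence(["мама", "деньги"], docs)
print(coherent > incoherent)  # True
```

Selecting the number of topics then amounts to fitting a model for each candidate value, averaging such scores over its topics, and keeping the value with the highest average.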

It is important to note that although the coherence metric was maximal for 10 topics, some clusters (in Figure 6, topics 1, 2, 3, 6, 7, 8) intersect, which indicates that these topics share many of the same words. When divided into 10 topics, the size of 7 clusters out of 10 is comparable (the area of the circle corresponds to the percentage of words in the topic relative to the total number of words in the dictionary). In order to divide the sample of posts into more polar topics, one needs to reduce the number of topic clusters to 4. In this case, the topics will turn out to be more “polar”, i.e., located in different parts of the two-dimensional vector space, which is obtained by convolving the multidimensional space of topic vectors.

The result of thematic modeling based on the corpus of posts for four topics is shown in Figure 7. Figure 7 demonstrates that if there are fewer topics, the clusters are separated, which makes it possible to interpret the topics more accurately. However, in this case, repeating words in the topic vector is also possible. The results of thematic modeling are summarized in Table 3 below.

When divided into four topics, the results are more separable than for the case with a large number of topics. The presented analysis was also performed for a sample of comments collected in groups from Table 1. It is important to note that comments are usually short messages inspired by some of the topics covered in the post. In other words, the topics that are presented in the post content should be reflected in the comments to the posts. Cluster analysis for a sample with comments showed that the comments corpus can best be divided into 3 and 7 topics by the coherence metric (Figure 8).

The results of dividing the sample of comments into three and seven topics are summarized in Table 4 below. It is important to note that just as with the division of the post sample, the comments sample also shows an intersection of clusters when moving from three to seven topics. The word frequency and principal component space (PC1 and PC2) for three and seven topics are shown in Figure 9 and Figure 10.

Frequently, one or more topics can be classified as non-specific topics, i.e., general conversational topics that contain the most common grammar and are not informative for researchers in terms of demographic connotations.

Data from Table 3 allow us to note a feature of thematic modeling based on a sample of comments. Comments are more uniform when divided into topics, and the specificity of information found in posts (words that allow a person to explicitly separate one topic from another) is lost during discussion. This is typical for the entire sample of comments. In addition, despite the increased coherence metric, the resulting vectors become more homogeneous as the value of the hyperparameter (the number of topics) increases. Thus, we can conclude that comments under posts are suitable for sentiment analysis, but the volume of the comment text is not suitable for meaningful thematic modeling. Moreover, it is rarely possible to find a meaningful position in the comments, but it is always possible to find an emotional one. Therefore, the authors suggest evaluating the sentiment of statements based on comments (where the emotional position prevails) and implementing the thematic analysis based on posts (where the meaningful position prevails).

5. Empirical Examples of Modeling

As mentioned above, the body of comments is better suited to evaluating the sentiment of statements. We built two corpuses of data: the pro-natalist corpus and the anti-natalist corpus [5]. The pro-natalist corpus consists of more than 100,000 comments from 314 groups of people with reproductive attitudes to have a child, and the anti-natalist corpus consists of approximately 670,000 comments from 8 groups with child-free reproductive attitudes. The period of investigation is from February 2014 to November 2020.

Anti-natalist groups have about 10 times fewer users than pro-natalist ones, while their activity in writing posts, commenting, and other interaction is several times higher than that of pro-natalist groups. Even a small number of anti-natalist communities (8 in the current study) create a larger informational background than the far more numerous pro-natalist communities.

Figure 11 shows the distribution of comments in the pro-natalist corpus by the author’s gender and by month since the beginning of 2014. There are 48,885 positive comments and 54,690 negative ones in the pro-natalist corpus. It should be noted that women are the most active in the represented pro-natalist groups. There has been a sharp increase in activity since June 2017. The peak value of activity is observed in the first months of 2020. We also note that since the beginning of 2017, there have been periods of decline in activity (at the ends of 2018, 2019, and 2020) and growth of activity (in autumn of 2017 and 2018, spring 2019, and at the beginning and the middle of 2020).

The distribution of the number of comments in the anti-natalist corpus by gender and by month is given in Figure 12. There are 267,002 positive comments and 403,938 negative ones in the anti-natalist corpus. Women continue to be the most active group. However, there has been a sharp decrease in activity since January 2017. The peak value of activity is observed in 2016; the increase in activity occurred in 2015 and 2020.

The demographic temperature (the difference between positive and negative comments) of pro-natalists by month is shown in Figure 13. After 2017, the number of negative comments increased significantly and exceeded the number of positive comments in almost every monthly period. The most significant predominance of negative comments is observed in 2017, 2019, and 2020.

The demographic temperature of anti-natalist groups by month is shown in Figure 14. After 2017, the number of negative comments decreased significantly. The temperature was negative in almost every monthly period. The most significant predominance of negative comments is observed in 2016. The “colder” periods during 2014–2020 were in 2015, 2017, 2019, and 2020.

The ratios of positive to negative comments for the pro-natalist and anti-natalist corpuses by month are given in Figure 15 and Figure 16.

A value of “1” indicates a neutral demographic temperature (relative indicators can also measure demographic temperature). A value below “1” is a negative temperature; a value above “1” is a positive one. The general conclusions are as follows. The pro-natalists are more positive in general over the whole period (their average demographic temperature for the period is slightly negative, at 0.951). The anti-natalists have a lower average demographic temperature over the period, at 0.691. The demographic temperature of both pro-natalists and anti-natalists has been declining from 2014 to the present. For anti-natalists the decline began in 2014; for pro-natalists it began later, in 2017.
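The relative indicator can be checked against the comment totals reported for each corpus. The sketch below computes the pooled ratio over the whole period; the function name is hypothetical. The pooled values are not identical to the period averages quoted above, presumably because those are averages of monthly ratios (an assumption), but both indicators lie below 1 and rank the two corpuses in the same order.

```python
def pooled_ratio(positive, negative):
    """Ratio of positive to negative comments over the whole period."""
    return positive / negative

# Totals reported in the text for each corpus.
pro = pooled_ratio(48_885, 54_690)
anti = pooled_ratio(267_002, 403_938)
print(round(pro, 3), round(anti, 3))  # 0.894 0.661
```

A pooled ratio gives more weight to months with many comments, whereas an average of monthly ratios weights every month equally.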

6. Conclusions and Discussion

The paper presents a methodology for analyzing the sentiment of statements based on data from the Russian-language social network VKontakte, applied to one demographic problem: determining the demographic temperature of Russian society and of its various socio-demographic groups (anti-natalists and pro-natalists, women and men from these groups). We present the monthly dynamics of demographic temperature (net sentiment) over 2014–2020 in two types of thematic groups, distinguished by an important criterion for demographic analysis.

Data from social networks appear to be important alternative data in demographic analysis in addition to data from population censuses, current population records, sample surveys, and even registers, since big data for measuring demographic temperature and other tasks can be collected continuously to quickly respond to changes in the demographic climate in society or, for example, short-term forecasting of birth rate dynamics before the publication of official data [30].

We have a methodological outcome in the field of structuring big data and building algorithms for processing data from social networks for demographic tasks using libraries in Russian. A meaningful scientific result in the field of demography.

We have developed a methodology for measuring the demographic temperature of different population groups according to social media data to assess the periods of influence of demographic policy (and other factors) on reproductive mood and demographic temperature. Our algorithm helps to see lasting changes in the sentiments of groups, assembled according to a critically important criterion for demographic analysis. In demography, it is a great success to have differentiated data for groups with pro- and anti-natalist reproductive attitudes since this provides more accurate interpretations of fertility’s determinants. The policy itself is designed differently for these two groups. By superimposing these changes on the demographic policy calendar, we can hypothesize about the factors influencing the policy on reproductive behavior and reproductive attitudes. One of the principles of our methodology is the search for thematic groups. Not individual statements by randomly found respondents, but thematic groups. Working with the sentiment of brightly colored groups according to the criterion of reproductive attitudes gives us the mood of the target groups. This removes the need to search for demographic statements and to prove that we are analyzing predominantly statements on our demographic topics. Groups were created specifically for conversations on these topics. The selection of these two groups methodically solves the problem of selecting statements on the desired topic. The allocation of polar groups gives us a unique opportunity for the reaction of loyal and negative-minded people to the birth of children and parenting.

It was found that comments under posts are more suitable for analyzing the sentiment of statements than the texts of posts. We suggest evaluating the sentiment of statements based on comments (where the emotional position prevails) and implementing the thematic analysis based on posts (where the meaningful position prevails).

The Latent Dirichlet Allocation (LDA) method is used to separate posts by topic. The classifier of short comments to posts is trained on a sample of tweets from [31]. On the presented sample, where the attainable accuracy is 93.4%, the classifier reaches an accuracy of up to 69% in test operation. In further development of this work, the training sample for the sentiment neural network will be expanded with comments from VKontakte labeled by the authors, and the number of sentiment classes will be increased (up to five, including the neutral one). The LDA-based algorithm for identifying pro-natalist and anti-natalist texts in VKontakte will be refined using expert assessments, and comments will be filtered by their degree of correlation with the post (at the moment, not all comments reflect an attitude to the post text itself; some respond to other users). The scope of text search in the social network will be expanded to other communities. The authors are considering the development of methods for identifying texts on relevant social topics and linking them to public opinion on the topic through comments, likes, and reposts, thereby developing a tool for assessing the demographic temperature associated with measures of state social policy.

We presented a tool for measuring the dynamics of demographic temperature in selected communities of people with polar reproductive attitudes. The first group included participants actively interested in the issues of childbirth and parenting. The second group included participants who argued for a positive attitude toward childlessness.

Thus, one of the successful decisions in our methodology is to search for thematically homogeneous groups, preferably with opposing views, in order to assess the dynamics of sentiment in a comparable way against the background of events that can affect these dynamics.

The study revealed two types of user behavior. Users of pro-natalist groups are multifaceted but inactive in terms of creating content and interacting with it. Users of anti-natalist groups are 10 times fewer, while their activity in writing posts, commenting, and other interaction is several times higher than that of pro-natalist groups. Even a small number of anti-natalist communities (8 in the current study) create a larger informational background than the pro-natalist communities, which far outnumber them.
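The planned filtering of comments by their degree of correlation with the post is not specified in detail in the text. As one possible illustration only, the sketch below scores each comment by the cosine similarity between its bag-of-words vector and that of the post, and keeps comments above a threshold. The similarity measure, the threshold value, and all function names are assumptions made for this example, not the authors' procedure.

```python
import math
import re
from collections import Counter

def tokens(text):
    # Lowercased word tokens; in Python 3, \w also matches Cyrillic letters.
    return re.findall(r"\w+", text.lower())

def cosine(a, b):
    """Cosine similarity between the word-count vectors of two texts."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def related_comments(post, comments, threshold=0.1):
    # Keep only comments whose similarity to the post reaches the
    # (hypothetical) threshold; the rest are treated as off-topic replies.
    return [c for c in comments if cosine(post, c) >= threshold]

# Toy texts invented for illustration.
post = "pregnancy and children"
kept = related_comments(post, ["children are great", "hello there"])
```

A raw word-count comparison is crude for short Russian comments; in practice it would likely be applied after the lemmatization and stop-word removal described in the glossary.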

Anti-natalist groups were more active, gaining a significantly larger number of positive and negative comments per group throughout the entire period under review. This may be because people with young children have less time to post on social media. In addition, conscious childlessness was not encouraged in Russia during the period under study, so the members of anti-natalist groups positioned themselves accordingly, looked for like-minded people, and justified their position. Under official propaganda that presents wanting and having a family and children as normal behavior, pro-natalists are considered successful and happy people (“happiness likes silence”).

In general, anti-natalist groups criticize the classical views on the institution of the family; various personal stories of users with a negative bias are considered, i.e., such groups actively generate content that can only be perceived negatively.

Distinguishing male and female utterances did not, in this case, provide a meaningful result. The differences were in the volume of comments, but not in the dynamics of the demographic temperature. Had there been differences, we would have obtained a signal to develop a gendered demographic policy. However, we do not find such a signal: the dynamics of demographic temperature are the same for women and men in the two groups. Our next step will be devoted to the analysis of comments from representatives of different cohorts.

The prevalence of negative stances on social media is a recognized phenomenon [32,33,34]. Our algorithm made it possible to identify not only the final positive or negative stance of statements but also the degree of intensity of negative or positive statements in these groups, that is, the “cold” demographic temperature of social groups.

During the period under consideration, the intensification of the emotional background occurred asynchronously in these groups. Anti-natalists were more active until 2017. The pro-natalists became more active after 2017. There are no coincidences with the exception of 2020, which may be related to lockdown and payments to families with children during the pandemic, on the one hand, and increased interest in discussing childbirth and parenting during the epidemic, on the other hand.

We find an asynchronous structural shift in the comments of the pro-natalist and anti-natalist thematic group corpuses. Perhaps the mirrored reactions are caused by the same events in public polemics and demographic policy. For example, the strengthening of traditionalism and the rejection of the child-free lifestyle in the rhetoric of demographic and family policy since 2014 [35] could have provoked an active reaction from anti-natalist communities. The increased activity of pro-natalists needed time to mature under the pressure of official propaganda of traditional family values. That is why the reaction of pro-natalist groups came later, in 2017.

It is not at all obvious that the demographic temperature should worsen in loyal groups during a period of active demographic policy. This is an unexpected result, although it is consistent with the drop in total fertility rates. Intuitively, we did not expect such a result at the beginning of the study.

Demographic policies could have sparked an increased focus on parenting in this period. President Putin announced the extension of the maternity capital program twice: in December 2015 (until 2018) and at the end of 2017 (until 2021). The maternity capital program is a large lump-sum payment for the birth of a second or subsequent child, a prominent and popular measure. Then, in January 2020, maternity capital was indexed for the first time since 2016. President Putin also promised to extend the program until 2026.

We do not draw an unambiguous conclusion about the impact of demographic policy, because several factors affect fertility, including, for example, economic cycles. However, we can definitely say that our analysis helped us to see shifts in the sentiment of different groups. In this case, we see a warming in the anti-natalist group and a cooling in the pro-natalist group with a slight lag after the turn in policy. We therefore need to look for what occurred during this period. As noted earlier, this period coincides with the strengthening of traditionalism in the concept of demographic policy. People with different reproductive attitudes do not respond in the same way to questions of childbirth and parenting in the face of increasing traditionalism. This issue seems important to us. Our result made it possible to understand that the rollback from modernization upsets precisely those people who are the target group and the support base of demographic policy. For pro-natalists, the demographic sentiment is deteriorating, and the demographic temperature is steadily falling. This can be a signal to revise the design and priority measures of population policy.

During the pandemic in 2020, the demographic temperature in both groups remained lower, which correlates with decreasing fertility. Moreover, at first sight the dynamics of the total fertility rate are linked with the demographic temperature: the total fertility rate increased from 2006 to 2015 and then decreased until 2020.

In addition, the pro-natalist groups may be more sensitive to economic changes and demographic policies than anti-natalist groups, which is associated with less stable activity in pro-natalist groups. It is also possible that pro-natalist groups discuss the real difficulties of raising and caring for children more often than the fact of the birth of a child and hypothetical difficulties, which leads to differing demographic temperature measurements between the groups. This may be why the anti-natalists react more strongly to changes in social norms (the strengthening of traditional values from 2014), whereas the pro-natalists react more strongly to decreasing real income (a cumulative effect from 2015). Families with children are the ones for whom the decline in economic opportunities and other negative manifestations of the pandemic lead to a deterioration of the demographic climate. The economic dynamics since 2014 can be characterized as stagnation; there were no major changes in 2016–2017, and then the lockdown followed. However, these interpretations should be verified in future research.

Thus, we make both a methodological and a demographic contribution. We present a reproducible algorithm for assessing the demographic temperature in thematic groups of social network users selected on policy-relevant grounds. We find a steady deterioration in the demographic temperature of the main target population in the process of strengthening traditionalism in politics. It is consistent with the drop in total fertility rates.
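The text does not state how the structural shift was located in each series. Purely as an illustration of the idea, the sketch below finds a single break point in a monthly temperature series by least squares: it tries every admissible split and keeps the one for which two constant segments fit the data best. The function name `best_split`, the `min_len` parameter, and the toy series are hypothetical; the numbers are invented and do not reproduce Figures 15 or 16.

```python
def best_split(series, min_len=2):
    """Return the index k that minimises the total squared error when the
    series is modelled as two constant segments, series[:k] and series[k:].
    Each segment must contain at least min_len points."""
    def sse(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs)

    candidates = range(min_len, len(series) - min_len + 1)
    # min() raises ValueError if the series is too short for any split.
    return min(candidates, key=lambda k: sse(series[:k]) + sse(series[k:]))

# Toy ratio series invented for illustration: a drop after the 4th point.
toy_ratio = [1.1, 1.0, 1.2, 1.1, 0.7, 0.8, 0.6, 0.7]
shift_at = best_split(toy_ratio)
```

Applying such a procedure separately to the two corpuses and comparing the resulting break dates is one simple way to make the word "asynchronous" operational; formal structural-break tests would be needed for statistical claims.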

An additional result of our work is two open Russian-language data sets to test our results and continue the research.

Author Contributions

Conceptualization, I.E.K.; methodology, I.E.K. and E.P.B.; software, E.P.B.; validation, E.P.B.; formal analysis, E.P.B.; investigation, E.P.B. and I.E.K.; resources, E.P.B. and I.E.K.; data curation, E.P.B.; writing—original draft preparation, E.P.B., I.E.K., I.A.A., G.A.K., and A.V.K.; writing—review and editing, I.E.K. and E.P.B.; visualization, E.P.B.; supervision, I.E.K.; project administration, I.E.K.; funding acquisition, I.E.K. All authors have read and agreed to the published version of the manuscript.

Funding

The manuscript was prepared with the financial support of the Economic Faculty at Lomonosov Moscow State University on research on the topic “Reproduction of the population in a digital society”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The database is contributed as a part of the paper. Data supporting reported results (open data paper and open data set) can be found here: https://doi.org/10.3897/popecon.4.e60915 (accessed on 1 April 2021) (Kalabikhina and Banin 2020) and here http://doi.org/10.5281/zenodo.4612131 (accessed on 1 April 2021) (Kalabikhina and Banin 2021).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Glossary

N-gram	A sequence of n elements. From a semantic point of view, it can be a sequence of sounds, syllables, words, letters, or stable collocation phrases
Collocation	A phrase with the attributes of a syntactically and semantically integral unit, in which the choice of one component is made according to meaning and the choice of the second depends on the choice of the first (for example, in “to put conditions” the choice of the verb “to put” is determined by tradition and depends on the noun “condition”; with the word “sentence” a different verb, “to make”, is used). A collocation is a regular N-gram
Lemmatization	The procedure of reducing a word to its canonical form (infinitive for verbs, nominative singular for nouns and adjectives)
Stemmization (stemming)	The procedure of stripping affixes from a word, i.e., separating suffixes, prefixes, and endings from the root of the word
Digital trace of the users	Information about the users’ activity and data that they leave when using the Internet
Social engineering	Manipulating people to perform certain actions
Hyperparameter	A parameter, the value of which is set by the user
Stop-word	The word that does not carry a semantic load in the text
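Two of the glossary terms, stop-word and N-gram, can be illustrated with a short sketch. The stop-word list below is a tiny hypothetical one chosen for the example, and the sentence is the English rendering of one of the positive comments in Table 2; the function names are likewise assumptions.

```python
def remove_stop_words(tokens, stop_words):
    # Drop words that carry no semantic load in the text.
    return [t for t in tokens if t not in stop_words]

def ngrams(tokens, n):
    # All contiguous sequences of n tokens, in order of appearance.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

STOP_WORDS = {"the", "is", "a"}  # hypothetical, deliberately tiny list

words = "the most beautiful woman is a pregnant woman".split()
content = remove_stop_words(words, STOP_WORDS)
bigrams = ngrams(content, 2)
```

Bigrams that recur regularly in a corpus are candidates for collocations, which is how joined tokens of the kind seen in Table 4 can arise.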

References

1. Sabatovych, I. Do social media create revolutions? Using Twitter sentiment analysis for predicting the Maidan Revolution in Ukraine. Glob. Media Commun. 2019, 15, 275–283.
2. Enli, G. Twitter as arena for the authentic outsider: Exploring the social media campaigns of Trump and Clinton in the 2016 US presidential election. Eur. J. Commun. 2017, 32, 50–61.
3. Groshek, J.; Koc-Michalska, K. Helping populism win? Social media use, filter bubbles, and support for populist presidential candidates in the 2016 US election campaign. Inf. Commun. Soc. 2017, 20, 1389–1407.
4. Koltsova, O. Methodological challenges for detecting interethnic hostility on social media. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Heidelberg, Germany, 2019; pp. 7–18.
5. Kalabikhina, I.E.; Banin, E. Database “Pro-family (pronatalist) communities in the social network VKontakte”. Popul. Econ. 2020, 4, 98–130.
6. Thelwall, M.; Buckley, K.; Paltoglou, G.; Cai, D.; Kappas, A. Sentiment strength detection in short informal text. J. Am. Soc. Inf. Sci. Technol. 2010, 61, 2544–2558.
7. Thelwall, M.; Buckley, K.; Paltoglou, G. Sentiment strength detection for the social web. J. Am. Soc. Inf. Sci. Technol. 2011, 63, 163–173.
8. Hutto, C.J.; Gilbert, E. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the 8th International Conference on Weblogs and Social Media, Ann Arbor, MI, USA, 1–4 June 2014; pp. 216–225.
9. Loukachevitch, N.; Levchik, A. Creating a general Russian sentiment lexicon. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, 23–28 May 2016; pp. 1171–1176.
10. Koltsova, O.Y.; Alexeeva, S.V.; Kolcov, S.N. An opinion word lexicon and a training dataset for Russian sentiment analysis of social media. In Komp’juternaja Lingvistika i Intellektual’nye Tehnologii; Rossiiskii Gosudarstvennyi Gumanitarnyi Universitet (Russian State University of Humanities): Moscow, Russia, 2016; pp. 277–287.
11. Baccianella, S.; Esuli, A.; Sebastiani, F. SENTIWORDNET 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC-10), Valletta, Malta, 17–23 May 2010; pp. 2200–2204.
12. Cambria, E.; Poria, S.; Bajpai, R.; Schuller, B. SenticNet 4: A semantic resource for sentiment analysis based on conceptual primitives. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 2666–2677.
13. Cambria, E.; Poria, S.; Hazarika, D.; Kwok, K. SenticNet 5: Discovering conceptual primitives for sentiment analysis by means of context embeddings. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2 February 2018; pp. 1795–1802.
14. Tang, D.; Qin, B.; Liu, T. Deep learning for sentiment analysis: Successful approaches and future challenges. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2015, 5, 292–303.
15. Tang, D.; Zhang, M. Deep learning in sentiment analysis. In Deep Learning in Natural Language Processing; Springer International Publishing: Cham, Switzerland, 2018; pp. 219–253.
16. Chen, Q.; Sokolova, M. Word2Vec and Doc2Vec in Unsupervised Sentiment Analysis of Clinical Discharge Summaries. arXiv 2018, arXiv:1805.00352.
17. Nedjah, N.; Santos, I.; Mourelle, L.D.M. Sentiment analysis using convolutional neural network via word embeddings. Evol. Intell. 2019, 3, 1–25.
18. Sharma, Y.; Agrawal, G.; Jain, P.; Kumar, T. Vector representation of words for sentiment analysis using GloVe. In Proceedings of the ICCT 2017—International Conference on Intelligent Communication and Computational Techniques, Manipal University Jaipur, Jaipur, India, 22 December 2017; pp. 279–284.
19. Kumar, A.; Srinivasan, K.; Cheng, W.-H.; Zomaya, A.Y. Hybrid context enriched deep learning model for fine-grained sentiment analysis in textual and visual semiotic modality social data. Inf. Process. Manag. 2020, 57, 102141.
20. Meškele, D.; Frasincar, F. ALDONA: A hybrid solution for sentence-level aspect-based sentiment analysis using a lexicalised domain ontology and a neural attention model. In Proceedings of the 34th ACM Symposium on Applied Computing, Association for Computing Machinery, Limassol, Cyprus, 8–12 April 2019; pp. 2489–2496.
21. Smetanin, S. The applications of sentiment analysis for Russian language texts: Current challenges and future perspectives. IEEE Access 2020, 8, 110693–110719.
22. Kalabikhina, I.E.; Banin, E.P. Database “Childfree (antinatalist) communities in the social network VKontakte”. Zenodo 2021.
23. Antonov, A. Opyt issledovaniya ustanovok na zdorov’ye i prodolzhitel’nost’ zhizni. Sotsial’nyye Problemy Zdorov’ya i Prodolzhitel’nosti Zhizni 1989, 44.
24. Edgell, S.; Duke, V. Gender and Social Policy: The impact of the public expenditure cuts and reactions to them. J. Soc. Policy 1983, 12, 357–378.
25. Jain, D.; Elson, D. Harvesting Feminist Knowledge for Public Policy Rebuilding Progress; SAGE International Development Research Centre: Ottawa, ON, Canada, 2011; ISBN 9788132107415.
26. Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 1988, 24, 513–523.
27. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407.
28. Hofmann, T. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, USA, 15–19 August 1999; pp. 50–57.
29. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
30. Kalabikhina, I.E.; Abduselimova, I.A.; Arkhangelsky, V.N. Kratkosrochnoe prognozirovanie demograficheskih tendencij na osnove dannyh google trends. Appl. Inform. 2020, 15, 91–118.
31. Loukachevitch, N.N.; Rubtsova, Y. Entity-Oriented sentiment analysis of tweets: Results and problems. In Text, Speech, and Dialogue; Springer: Berlin, Germany, 2015; pp. 551–559.
32. Chmiel, A.; Sobkowicz, P.; Sienkiewicz, J.; Paltoglou, G.; Buckley, K.; Thelwall, M.; Hołyst, J.A. Negative emotions boost user activity at BBC forum. Phys. A Stat. Mech. Appl. 2011, 390, 2936–2944.
33. Jalonen, H. Negative emotions in social media as a managerial challenge. In Proceedings of the 10th European Conference on Management, Leadership and Governance, ECMLG, VERN’ University of Applied Sciences, Zagreb, Croatia, 13–14 November 2014; pp. 128–135.
34. Jalonen, H. Social Media—An Arena for Venting Negative Emotions. In Proceedings of the 3rd International Conference of Communication, Media, Technology and Design, Anadolu University—Institute of Communication Sciences in Turkey, Istanbul, Turkey, 24–26 April 2014; pp. 224–230.
35. Kalabikhina, I.E. Modern Socio-Demographic Policy in Russia: Is There Any Continuity in Conceptual Approaches in the Documents of 2007–2017? Available online: https://womaninrussiansociety.ru/wp-content/uploads/2019/12/%D0%9A%D0%B0%D0%BB%D0%B0%D0%B1%D0%B8%D1%85%D0%B8%D0%BD%D0%B0_14_28-1.pdf (accessed on 8 January 2021).

Figure 1. Dynamics of publications on the topic “sentiment analysis” by years (a) and the total contribution of leaders in the number of publications (b).

Figure 2. An example of preprocessing text from the corpus.

Figure 3. Structural representation of a collection of documents.

Figure 5. Dependence of the coherence metric on the number of topic clusters for a sample of posts.

Figure 6. Dividing the body of posts into ten thematic clusters (a) and demonstrating the frequency of the most frequently used words in the ninth cluster (b) using the LDA model.

Figure 7. Division of the corpus of posts into four thematic clusters. The frequency of words for the second topic is presented.

Figure 8. Dependence of the coherence metric on the number of topic clusters for a sample of posts.

Figure 9. Dividing the comments corpus into three thematic clusters. The frequency of words for the first topic is presented.

Figure 10. Division of the comment corpus into seven thematic clusters. The frequency of words for the third topic is presented.

Figure 11. Distribution of the number of comments in the pro-natalist’s corpus by month. Red—negative comments, green—positive ones.

Figure 12. Distribution of the number of comments in the anti-natalist’s corpus by month. Red—negative comments, green—positive ones.

Figure 13. The difference between the numbers of positive and negative comments (the demographic temperature) for the pro-natalist corpus by month. Negative values indicate a predominance of negative comments.

Figure 14. The difference between the numbers of positive and negative comments (the demographic temperature) for the anti-natalist corpus by month. Negative values indicate a predominance of negative comments.

Figure 15. The ratio of positive to negative comments for pro-natalist corpus by month.

Figure 16. The ratio of positive to negative comments for anti-natalist corpus by month.

Table 1. The most popular pro-natalist groups (groups on the topic “motherhood, parenting, pregnancy, children”) and anti-natalist groups (child-free groups).

URL	Group Name	Number of Subscribers
Pro-natalist groups (the participants have child-born reproductive attitudes)
https://vk.com/club52388302 (accessed on 8 October 2020)	XOPOШИE POДИTEЛИ (“GOOD PARENTS”)	1,482,303
https://vk.com/club34677924 (accessed on 8 October 2020)	Бepeмeннocть (“Pregnancy”)	1,339,737
https://vk.com/club170234932 (accessed on 8 October 2020)	Жeнcкoe Здopoвьe (“Women Health”)	1,053,617
https://vk.com/club20199180 (accessed on 8 October 2020)	PAЗBИBAЙKA POДИTEЛИ И ДETИ B ИHTEPHETE (“DEVELOPMENT PARENTS AND CHILDREN ON THE INTERNET”)	794,730
https://vk.com/club14395935 (accessed on 8 October 2020)	Pampers: Maмoчки BKoнтaктe (“Pampers: Mommies in VK”)	428,464
https://vk.com/club69716165 (accessed on 8 October 2020)	MAMA: Paзвитиe, Ceмья, Дeти (“MOM: Development, Family, Children”)	213,562
https://vk.com/club29746763 (accessed on 8 October 2020)	Ceмья Poдитeли Дeти CПб (“Family Parents Children Saint-Petersburgh”)	208,624
https://vk.com/club78865067 (accessed on 8 October 2020)	MAMA Дeти Ceмья (“MOM Children Family”)	202,603
https://vk.com/club61700163 (accessed on 8 October 2020)	Пoлeзнaя cтpaничкa! Здopoвьe \| Kpacoтa \| Cпopт (“Useful page! Health \| Beauty \| Sport”)	178,790
https://vk.com/club20622108 (accessed on 8 October 2020)	Я-MAMA: бepeмeннocть, дeти, ceмья, мaтepинcтвo (“I am MOM: pregnancy, children, family, motherhood”)	147,412
Anti-natalist groups (the participants have child-free reproductive attitudes)
https://vk.com/club69265846 (accessed on 8 October 2020)	Пoдcлyшaнo Чaйлдφpи (“Overhear Childfree”)	61,071
https://vk.com/club43946 (accessed on 8 October 2020)	Childfree	2406
https://vk.com/club48085 (accessed on 8 October 2020)	AДEKBATHЫE ЧAЙЛДΦPИ (“Adequate Childfree”)	627
https://vk.com/club4687918 (accessed on 8 October 2020)	CHILDFREE	3256
https://vk.com/club38197124 (accessed on 8 October 2020)	For ChildFree. Для чaйлдφpи (“For ChildFree”)	1855
https://vk.com/club58565280 (accessed on 8 October 2020)	ПPABДA пpo Childfree (Чaйлдφpи) (“TRUTH about Childfree”)	619
https://vk.com/club59638638 (accessed on 8 October 2020)	He xoчy poжaть (childfree) (“I don’t want to give birth”)	1237
https://vk.com/club148257242 (accessed on 8 October 2020)	Пoдcлyшaнo Я He Xoчy Дeтeй (Childfree) (“Overhear I Don’t Want Children”)	527

Table 2. Examples of negative and positive comments.

Negative (Russian)
1. Чтo твopитcя c миpoм? Hoвocти cплoшь o пeдoφилax, пeчaльнo и вoзмyтитeльнo 2. Mыcль oб этoм и жeлaниe yбить ceбя пo этoмy пoвoдy мeня нe пoкидaют 3. Этo Pитa, и oнa бepeмeннa. Ee бpocил мyж, и тeпepь oнa гoлoдaeт 4. 9 мecяцeв бepeмeннa,cyтки poжaeшь,мyчaeшьcя...A cын видитe-ль нa ПAПУ пoxoж
Negative (Nearest English equivalent)
1. What’s going on with the world? The news is all about pedophiles, sad and outrageous 2. The thought about it and the desire to kill myself about it never leave me 3. This is Rita and she is pregnant. Her husband left her and now she’s starving 4. 9 months pregnant, giving birth for a day, suffering... But you see, your son looks like father
Positive (Russian)
1. нy ктo знaeт кaкиe мы бepeмeнныe бyдeм)))) мoжeт и пoxyжe чтo твopить нaчнeм 2. У бepeмeнныx тaкoй клaccный живoт! Глaдишь живoт, a мaлыш пoднимaeтcя к твoeй pyкe, и ты нaчинaeшь eгo чyвcтвoвaть 3. Moжeтe мeня пoздpaвить, мoя жeнa бepeмeннa! Пoxoжe, cкopo cтaнy пaпoй 4. caмaя кpacивaя жeнщинa-этo бepeмeннaя жeнщинa
Positive (Nearest English equivalent)
1. Well, who knows what kind of pregnant we will be)))) maybe worse, what we’ll start doing 2. Pregnant women have such a cool belly! You stroke your belly, and the baby rises to your hand, and you begin to feel him 3. Can you congratulate me, my wife is pregnant! Looks like I’ll be a dad soon 4. The most beautiful woman is a pregnant woman

Table 3. Results of thematic modeling on 4 topics.

Topic	Topic Vector
1	Russian: (‘0.008 ∙“бecплoд” + 0.003 ∙“лeчeн” + 0.003 ∙“мaтк” + 0.003 ∙“зaбoлeвaн” + 0.003 ∙“жeнcк” + 0.002 ∙“гинeкoлoг” + 0.002 ∙“яичник” + 0.002 ∙“пoлoв” + 0.002 ∙“пpoблeм”’) Nearest English equivalent: (‘0.008 ∙“infertility” + 0.003 ∙“treatment” + 0.003 ∙“uterus” + 0.003 ∙“disease” + 0.003 ∙“woman” + 0.002 ∙“gynecologist” + 0.002 ∙“ovary” + 0.002 ∙“sexual” + 0.002 ∙“complication”’)
2	Russian: (‘0.003 ∙“гинeкoлoг” + 0.003 ∙“зaбoлeвaн” + 0.003 ∙“лeчeн” + 0.003 ∙“бecплoд” + 0.003 ∙“пoлoв” + 0.002 ∙“мoгyт” + 0.002 ∙“жeнcк” + 0.002 ∙“цикл” + 0.002 ∙“opгaнизм” + 0.002 ∙“мaтк”‘) Nearest English equivalent: (‘0.003 ∙“gynecologist” + 0.003 ∙“disease” + 0.003 ∙“treatment” + 0.003 ∙“infertility” + 0.003 ∙“sexual” + 0.002 ∙“can” + 0.002 ∙“woman” + 0.002 ∙“cycle” + 0.002 ∙“organism” + 0.002 ∙“uterus”’)
3	Russian: (‘0.003 ∙“бecплoд” + 0.002 ∙“лeчeн” + 0.002 ∙“жeнcк” + + 0.002 ∙“плaчeт” + 0.002 ∙“мaтк” + 0.002 ∙“зaбoлeвaн” + 0.002 ∙“гинeкoлoг” + 0.002 ∙“opгaнизм”’) Nearest English equivalent: (‘0.003 ∙“infertility” + 0.002 ∙“treatment” + 0.002 ∙“woman” + 0.002 ∙“cry” + 0.002 ∙“uterus” + 0.002 ∙“disease” + 0.002 ∙“gynecologist” + 0.002 ∙“organism”’)
4	Russian: (0.004 ∙“бecплoд” + 0.003 ∙“мaтк” + 0.003 ∙“жeнcк” + 0.003 ∙“гинeкoлoг” + 0.002 ∙“aбopт” + 0.002 ∙“мecяц” + 0.002 ∙“тeкcт” + 0.002 ∙“лeчeн” + 0.002 ∙“пpoблeм”’) Nearest English equivalent: (0.004 ∙“infertility” + 0.003 ∙“uterus” + 0.003 ∙“woman” + 0.003 ∙“gynecologist” + 0.002 ∙“abortion” + 0.002 ∙“month” + 0.002 ∙“text” + 0.002 ∙“treatment” + 0.002 ∙“complication”’)

Table 4. The result of thematic modeling based on a sample of comments from a social network “VKontakte”.

Topic	Topic Vector
Number of topics: 3
1	Russian: (‘0.011 ∙“peбeнк” + 0.011 ∙“мyж” + 0.010 ∙“мaм” + 0.010 ∙“poдитeл” + 0.008 ∙“cын” + 0.008 ∙“жeнщин” + 0.008 ∙“мyжчин” + 0.007 ∙“жизн” + 0.007 ∙“бoг” + 0.007 ∙“дa_бoг”’) Nearest English equivalent: (‘0.011 ∙“children” + 0.011 ∙“husband” + 0.010 ∙“mother” + 0.010 ∙“parents” + 0.008 ∙“son” + 0.008 ∙“woman” + 0.008 ∙“man” + 0.007 ∙“life” + 0.007 ∙“God” + 0.007 ∙“дa_бoг”’)
2	Russian: (‘0.012 ∙“мaм” + 0.008 ∙“жeнщин” + 0.008 ∙“кoтop” + 0.007 ∙“гoвop” + 0.007 ∙“poдитeл” + 0.007 ∙“люд” + 0.006 ∙“пpocт” + 0.006 ∙“люб” + 0.006 ∙“вce” + 0.006 ∙“oчeн”’) Nearest English equivalent: (‘0.012 ∙“mother” + 0.008 ∙“woman” + 0.008 ∙“which” + 0.007 ∙“spell” + 0.007 ∙“parents” + 0.007 ∙“people” + 0.006 ∙“simply” + 0.006 ∙“love” + 0.006 ∙“all” + 0.006 ∙“very”’)
3	Russian: (‘0.011 ∙“жизн” + 0.011 ∙“пpocт” + 0.009 ∙“мyж” + 0.008 ∙“ceм” + 0.008 ∙“oчeн” + 0.007 ∙“дpyг” + 0.007 ∙“мaм” + 0.007 ∙“peбeнк” + 0.006 ∙“дpyг_дpyг” + 0.006 ∙“дoм”’) Nearest English equivalent: (‘0.011 ∙“life” + 0.011 ∙“simply” + 0.009 ∙“man” + 0.008 ∙“family” + 0.008 ∙“very” + 0.007 ∙“friend” + 0.007 ∙“mother” + 0.007 ∙“child” + 0.006 ∙“дpyг_дpyг” + 0.006 ∙“house”’)
Number of topics: 7
1	Russian: (‘0.010 ∙“мaм” + 0.010 ∙“oчeн” + 0.009 ∙“oдн” + 0.008 ∙“poдитeл” + 0.008 ∙“мyжчин” + 0.008 ∙“люд” + 0.007 ∙“жизн” + 0.007 ∙“peбeнк” + 0.007 ∙“дeлa” + 0.006 ∙“жeнщин”’) Nearest English equivalent: (‘0.010 ∙“mother” + 0.010 ∙“very” + 0.009 ∙“single” + 0.008 ∙“parents” + 0.008 ∙“man” + 0.008 ∙“people” + 0.007 ∙“life” + 0.007 ∙“child” + 0.007 ∙“business” + 0.006 ∙“woman”’)
2	Russian: (‘0.010 ∙“жeнщин” + 0.009 ∙“мyж” + 0.009 ∙“peбeнк” + 0.009 ∙“id_cвeтлa” + 0.008 ∙“мyжчин” + 0.008 ∙“люб” + 0.007 ∙“oчeн” + 0.007 ∙“мaм” + 0.007 ∙“дoм” + 0.007 ∙“ceм”’) Nearest English equivalent: (‘0.010 ∙“woman” + 0.009 ∙“husband” + 0.009 ∙“child” + 0.009 ∙“id_cвeтлa” + 0.008 ∙“man” + 0.008 ∙“love” + 0.007 ∙“very” + 0.007 ∙“mother” + 0.007 ∙“house” + 0.007 ∙“family”’)
3	Russian: (‘0.014 ∙“жизн” + 0.013 ∙“мyж” + 0.010 ∙“мo” + 0.010 ∙“мaм” + 0.008 ∙“пpocт” + 0.007 ∙“poд” + 0.007 ∙“люб” + 0.006 ∙“peбeнк” + 0.006 ∙“вce” + 0.006 ∙“ceм”’) Nearest English equivalent: (‘0.014 ∙“life” + 0.013 ∙“husband” + 0.010 ∙“my” + 0.010 ∙“mother” + 0.008 ∙“simply” + 0.007 ∙“gen” + 0.007 ∙“love” + 0.006 ∙“child” + 0.006 ∙“all” + 0.006 ∙“family”’)
4	Russian: (‘0.013 ∙“мам” + 0.011 ∙“наш” + 0.010 ∙“прост” + 0.010 ∙“жизн” + 0.009 ∙“муж” + 0.009 ∙“родител” + 0.009 ∙“пап” + 0.008 ∙“ребенк” + 0.008 ∙“id_ел” + 0.007 ∙“сын”’) Nearest English equivalent: (‘0.013 ∙“mother” + 0.011 ∙“our” + 0.010 ∙“simply” + 0.010 ∙“life” + 0.009 ∙“husband” + 0.009 ∙“parents” + 0.009 ∙“father” + 0.008 ∙“child” + 0.008 ∙“id_el” + 0.007 ∙“son”’)
5	Russian: (‘0.013 ∙“ребенк” + 0.013 ∙“женщин” + 0.011 ∙“мужчин” + 0.008 ∙“мам” + 0.008 ∙“очен” + 0.008 ∙“друг” + 0.007 ∙“друг_друг” + 0.006 ∙“человек” + 0.006 ∙“всем” + 0.006 ∙“нужн”’) Nearest English equivalent: (‘0.013 ∙“child” + 0.013 ∙“woman” + 0.011 ∙“man” + 0.008 ∙“mother” + 0.008 ∙“very” + 0.008 ∙“friend” + 0.007 ∙“each_other” + 0.006 ∙“person” + 0.006 ∙“all” + 0.006 ∙“need”’)
6	Russian: (‘0.020 ∙“да_бог” + 0.012 ∙“мам” + 0.012 ∙“говор” + 0.011 ∙“бог” + 0.008 ∙“одн” + 0.008 ∙“дом” + 0.008 ∙“здоров” + 0.008 ∙“родител” + 0.007 ∙“да” + 0.007 ∙“род”’) Nearest English equivalent: (‘0.020 ∙“God_grant” + 0.012 ∙“mother” + 0.012 ∙“speak” + 0.011 ∙“God” + 0.008 ∙“single” + 0.008 ∙“home” + 0.008 ∙“health” + 0.008 ∙“parents” + 0.007 ∙“yes” + 0.007 ∙“gen”’)
7	Russian: (‘0.015 ∙“котор” + 0.010 ∙“родител” + 0.009 ∙“женщин” + 0.008 ∙“жизн” + 0.008 ∙“муж” + 0.008 ∙“род” + 0.008 ∙“люб” + 0.007 ∙“мам” + 0.007 ∙“ребенк” + 0.007 ∙“прост”’) Nearest English equivalent: (‘0.015 ∙“which” + 0.010 ∙“parents” + 0.009 ∙“woman” + 0.008 ∙“life” + 0.008 ∙“husband” + 0.008 ∙“gen” + 0.008 ∙“love” + 0.007 ∙“mother” + 0.007 ∙“child” + 0.007 ∙“simply”’)
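
Each topic vector in Table 4 is written as a weighted sum of stemmed terms. As a minimal sketch (not the authors' code), the Python snippet below parses such a string into term weights and lists the stems two topics share, which makes the heavy overlap between topics easy to inspect. The function names `parse_topic_vector` and `shared_terms`, and the accepted separator and quote characters, are assumptions chosen for illustration.

```python
import re

# One weighted term, e.g. 0.011 ∙“муж”. The separator and quote styles
# accepted here are assumptions based on how Table 4 is typeset.
TERM_RE = re.compile(r'(\d+\.\d+)\s*[∙·*]\s*[“"]([^”"]+)[”"]')


def parse_topic_vector(vector):
    """Return a {term: weight} dict for one topic vector string."""
    return {term: float(weight) for weight, term in TERM_RE.findall(vector)}


def shared_terms(topic_a, topic_b):
    """Return the sorted terms that occur in both parsed topics."""
    return sorted(set(topic_a) & set(topic_b))


# First three terms of topics 1 and 3 from the three-topic model (Russian stems).
topic_1 = parse_topic_vector('0.011 ∙“ребенк” + 0.011 ∙“муж” + 0.010 ∙“мам”')
topic_3 = parse_topic_vector('0.011 ∙“жизн” + 0.011 ∙“прост” + 0.009 ∙“муж”')

print(topic_1)
print(shared_terms(topic_1, topic_3))  # the stem for "husband" is in both
```

Applied to the full ten-term vectors, the same comparison shows which stems recur across topics.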

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kalabikhina, I.E.; Banin, E.P.; Abduselimova, I.A.; Klimenko, G.A.; Kolotusha, A.V. The Measurement of Demographic Temperature Using the Sentiment Analysis of Data from the Social Network VKontakte. Mathematics 2021, 9, 987. https://doi.org/10.3390/math9090987

AMA Style

Kalabikhina IE, Banin EP, Abduselimova IA, Klimenko GA, Kolotusha AV. The Measurement of Demographic Temperature Using the Sentiment Analysis of Data from the Social Network VKontakte. Mathematics. 2021; 9(9):987. https://doi.org/10.3390/math9090987

Chicago/Turabian Style

Kalabikhina, Irina Evgenievna, Evgeniy Petrovich Banin, Imiliya Abduselimovna Abduselimova, German Andreevich Klimenko, and Anton Vasilyevich Kolotusha. 2021. "The Measurement of Demographic Temperature Using the Sentiment Analysis of Data from the Social Network VKontakte" Mathematics 9, no. 9: 987. https://doi.org/10.3390/math9090987

APA Style

Kalabikhina, I. E., Banin, E. P., Abduselimova, I. A., Klimenko, G. A., & Kolotusha, A. V. (2021). The Measurement of Demographic Temperature Using the Sentiment Analysis of Data from the Social Network VKontakte. Mathematics, 9(9), 987. https://doi.org/10.3390/math9090987

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.
