Article

GreenRu: A Russian Dataset for Detecting Mentions of Green Practices in Social Media Posts

Green Solutions Lab, University of Tyumen, Tyumen 625000, Russia
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(11), 4466; https://doi.org/10.3390/app14114466
Submission received: 15 April 2024 / Revised: 21 May 2024 / Accepted: 22 May 2024 / Published: 23 May 2024
(This article belongs to the Section Ecology Science and Engineering)

Abstract

Green practices are social practices that aim to harmonize the relations between people and the natural environment. They may involve minimizing the use of resources and the generation of waste and emissions. Detecting green practices in social media posts helps to understand which green practices are currently common and to develop recommendations on the scaling of green practices to reduce environmental problems. This paper describes GreenRu, a novel Russian social media dataset for detecting the mentions of green practices related to waste management. It has a sentence-level markup and consists of 1326 posts collected in Russian online communities. The total number of mentions of green waste practices is 3765. The paper assessed the effectiveness of the multi-label and one-versus-rest BERT-based models for detecting the mentions of green practices in social media posts and compared several data augmentation methods in terms of both classification metrics and human evaluation. To augment the dataset, a backtranslation method and generative language models, such as RuGPT, RuT5, and ChatGPT, were used in this study. The results enable researchers to monitor the green waste practices on social networks and develop environmental policies. Additionally, GreenRu can support machine learning models to analyze social media content, assess the prevalence and effectiveness of green waste practices, and identify ways to expand them.

1. Introduction

The modern world faces environmental problems and anthropogenic climate change, forcing global institutions and local governments to develop climate change adaptation and mitigation policies [1,2,3]. In the near future, these current policies may transform the actions of people in a local territory to meet their needs [4,5], and a set of these actions can be defined as social practices [6]. However, it is still unclear whether social practices are becoming more environmentally friendly, and how they could be changed to overcome the environmental crisis.
Social practices that aim to harmonize the relationship between people and the natural environment by reducing resource use, waste generation, pollution, and emissions are defined as green practices [7]. Embracing the idea of the distinction of the repertoires of environmental changes regarding consumption [8], green practices can be divided into adaptive and transformative in terms of their impact on consumption. Adaptive practices are a societal response to a deteriorating environmental state, but they do not imply a reduction in consumption. Transformative practices are designed to reduce the production of goods and services and society’s consumption of materials and energy. Focusing on green waste practices, adaptive practices are associated with the treatment of generated waste in contrast to transformative practices that aim to prevent waste. Figure 1 shows the division of green waste practices into transformative and adaptive practices. Transformative practices include exchanging, refusing purchases, sharing, repairing, and participating in actions to promote responsible consumption. Adaptive practices include waste sorting, studying the product labeling, waste recycling, and signing petitions. Researchers study green practices to offer citizens, eco-activists, and government agencies new ways to scale these practices [9,10].
At present, there is insufficient knowledge regarding the existing green waste practices in Russian society. Only a few studies focus on individual green practices, such as waste sorting, and describe them using data elicited from questionnaires and interviews [11,12]. Meanwhile, the content of online environmental community posts on social media is a valuable source of information about the characteristics and activities of online communities [13,14]. Social networks have a significant impact on changing social practices [15,16]. Internet communities perform the functions of recruitment, attribution, aggregation of interests, and mobilization of resources [17,18,19]. In addition, virtual communities promote participation in real activities, which can be both constructive and protesting in nature [20,21,22]. For further research, it is necessary to collect and systematize a large amount of social information about the green waste practices prevalent in Russian society. An important source of such information is social networks, which can be analyzed using different methods [14,23,24,25]. Since the manual identification of information in posts published in online communities is time-consuming, the automated analysis of such posts is required. However, to date, there are very few studies that examine environmental practices in the context of their risks and opportunities using big data analytics, for example, deep learning or content analysis methods [13,26].
This paper presents the GreenRu dataset for detecting the mentions of green practices in social media posts. The dataset consists of Russian social media posts from the VKontakte network, the most popular social network in Russia [27]. The dataset contains the mentions of nine green practices (Figure 1), which were manually annotated. GreenRu is marked up at the sentence level; in other words, each sentence is labeled with the practices it mentions. Several machine learning models are applied to the dataset in the form of both multi-label and binary classification of sentences. Since some green practices are currently more common than others [28], the classes, that is, the mentions of different types of green practices, are unevenly distributed in the dataset. This imbalance is expected to lower the quality of the machine learning models for rarely represented classes. This paper therefore compares several methods of data augmentation for the minority classes. It was demonstrated that augmentation can significantly improve the quality of detecting the mentions of rarely represented green practices. The main contributions of this work can be summarized as follows:
  • GreenRu, the first dataset for detecting the mentions of green practices in Russian social media posts, is described. The paper presents our annotation scheme for green practice mentions that can be easily adapted for application in other languages and domains.
  • The multi-label and one-versus-rest current state-of-the-art models for text classification are evaluated while performing the task of detecting the mentions of green practices.
  • The performance of several data augmentation methods is estimated to handle class imbalance since the distribution of the mentions of green practices is unequal. The results are provided both in terms of classification metrics and human evaluation. The results can be used in other tasks related to imbalanced multi-class text classification.
The paper is organized as follows. Section 2 describes the processes of data collection and annotation. The section includes the subsections presenting the types of green practices, the details of the collecting posts, and the annotation guidelines. Section 3 describes the experimental setup. It contains the dataset statistics, presents the models and the techniques for handling class imbalance that were used, as well as the metrics utilized for this study. Section 4 provides the results of our experiments for the dataset and discusses the results and limitations of the study. Section 5 concludes this paper.

2. Data Collection and Annotation

This section describes the types of green waste practices presented in the dataset and the processes of text collection and annotation.

2.1. Types of Green Waste Practices and Text Collection

The nine types of green waste practices were used to annotate the dataset (Figure 1). For a better presentation, a correspondence between the green waste practices and their IDs is contained in Table 1.
The research centers on the green practices of Tyumen, a large city in the Russian Federation with a population of 850,000. Despite residents’ high income and standard of living, recent surveys reveal growing dissatisfaction with the environmental situation, attributed to increased pollution, waste accumulation, and migration linked to regional economic growth. The city hosts numerous environmental communities promoting eco-friendly practices, organizing events, and engaging residents [28,29]. While Tyumen ranks 14th in the quality-of-life rating (Full rating of 250 Russian cities by quality-of-life 2023), it falls to 29th in the ecological well-being index (Rating of ecological well-being in 150 cities of Russia 2023). The collaboration potential between the city authorities and residents remains untapped due to the low public awareness of green practices [30].
The GreenRu dataset consists of social media posts from the VKontakte social network. A social graph was employed to identify the VKontakte communities discussing waste sorting. This graph was built to group the VKontakte green communities of Tyumen covering significant user topics such as animals, eco-food, eco-markets, and separate waste collection [7]. Another essential selection factor was the availability of posts created from January 2021 to June 2021 during the data collection. Thus, six VKontakte communities for text collection were selected. To collect the posts, the VK API tool (https://dev.vk.com/en/reference (accessed on 20 May 2024)) was utilized. GreenRu includes only the posts that contain textual information and does not include duplicate posts.
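The selection rules above (text-only posts, the January to June 2021 window, no duplicates) can be illustrated with a small filter. This is a minimal sketch, not the authors' collection script: it assumes each post is a dictionary shaped like an item returned by the VK API `wall.get` method, with a Unix-timestamp `date` field and a `text` field. The UTC window boundaries and the exact-match duplicate criterion are assumptions, since the paper does not specify them.

```python
from datetime import datetime, timezone


def _ts(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


# Assumed window: 1 January 2021 (inclusive) to 1 July 2021 (exclusive), UTC.
START = _ts(2021, 1, 1)
END = _ts(2021, 7, 1)


def select_posts(items):
    """Keep posts that have text, fall inside the window, and are not exact duplicates."""
    seen_texts = set()
    kept = []
    for item in items:
        text = (item.get("text") or "").strip()
        if not text:
            continue  # post without textual information
        if not (START <= item.get("date", 0) < END):
            continue  # outside the assumed collection window
        if text in seen_texts:
            continue  # exact duplicate of an earlier post
        seen_texts.add(text)
        kept.append(item)
    return kept


# Made-up posts for illustration only (not taken from GreenRu).
sample = [
    {"id": 1, "date": _ts(2021, 3, 5), "text": "Сбор пластика в субботу"},
    {"id": 2, "date": _ts(2021, 3, 6), "text": "Сбор пластика в субботу"},
    {"id": 3, "date": _ts(2021, 4, 1), "text": ""},
    {"id": 4, "date": _ts(2020, 12, 31), "text": "Старый пост"},
]
selected_ids = [post["id"] for post in select_posts(sample)]
```

Only the first sample post survives: the second is a duplicate, the third has no text, and the fourth falls outside the window.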

2.2. Annotation Guidelines

The annotation was performed at the sentence level by two experts on green practices from the University of Tyumen, Russia. At the first stage of annotation, the experts labeled each sentence with the types of green practices to which it referred. Each sentence could contain mentions of multiple green practices. Two experts labeled the posts independently. The second stage included checking and adjusting the labeled posts. In the case of a discrepancy between the labels, a consensus was achieved through a discussion by both experts.
As a result, a dataset with labeled mentions of green practices was collected. Table 2 demonstrates examples of the sentences from the dataset. GreenRu is freely available at https://github.com/green-solutions-lab/GreenRu (accessed on 20 May 2024). The dataset is presented in the form of two csv files containing training and test subsets. Each entry of the files contains the practice ID (column “id_practice”, the practice ID from Table 1), the name of the green practice (“name_practice”), the sentence containing the mention of the practice (“span”), the full text of the post (“text”), and the post ID (“id_post”). Thus, if a sentence contains references to several green practices, then each practice corresponds to a separate entry in the dataset. If a sentence does not contain references to practices, then it is present in the full text of the post but does not have a separate entry in the dataset.
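Because each (sentence, practice) pair is stored as a separate entry, a sentence-level label set has to be rebuilt by grouping entries. The following sketch shows one way to do this with the column names listed above; the sample rows are invented placeholders in the described layout, not real GreenRu entries, and the function name `sentence_labels` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Invented rows in the described column layout (not real GreenRu data).
SAMPLE = """id_practice,name_practice,span,text,id_post
1,Waste sorting,Sentence A.,Sentence A. Sentence B.,10
3,Waste recycling,Sentence A.,Sentence A. Sentence B.,10
1,Waste sorting,Sentence C.,Sentence C.,11
"""


def sentence_labels(csv_file):
    """Map (id_post, span) to the set of practice IDs mentioned in that sentence."""
    labels = defaultdict(set)
    for row in csv.DictReader(csv_file):
        labels[(row["id_post"], row["span"])].add(int(row["id_practice"]))
    return dict(labels)


labels = sentence_labels(io.StringIO(SAMPLE))
```

In this sample, "Sentence A." of post 10 receives both practices, while "Sentence B." appears only inside the `text` column and therefore has no label entry, which mirrors the storage convention described above.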

3. Experiments and Evaluation

This section provides the details of splitting the dataset into training and test subsets and describes the baseline models for detecting the mentions of green practices. It also provides the experimental setup for data augmentation of the minority classes.

3.1. Dataset Statistics

For our experiments, GreenRu was divided into training and test subsets in such a way that the training and test subsets did not contain fragments of the same posts. Table 3 shows the dataset statistics. Figure 2 shows the distribution of the mentions of different green practices in the dataset. Waste sorting (1) is the most common practice in the dataset (48.7% of all green practice mentions). Waste recycling (3) and participating in actions to promote responsible consumption (8) are also frequently mentioned (10.4% and 19.1%, respectively). The rare practices include repairing (0.3%), signing petitions (1.4%), and studying the product labeling (1.9%).
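A split in which no post contributes sentences to both subsets can be obtained by assigning whole posts, rather than individual sentences, to one side. The sketch below illustrates the idea; the 20% test fraction, the random seed, and the function name `split_by_post` are illustrative assumptions and are not taken from the paper.

```python
import random


def split_by_post(rows, test_fraction=0.2, seed=0):
    """Split rows so that all rows sharing an id_post land in the same subset."""
    post_ids = sorted({row["id_post"] for row in rows})
    random.Random(seed).shuffle(post_ids)
    n_test = max(1, round(len(post_ids) * test_fraction))
    test_ids = set(post_ids[:n_test])
    train = [row for row in rows if row["id_post"] not in test_ids]
    test = [row for row in rows if row["id_post"] in test_ids]
    return train, test


# Placeholder rows: 10 posts with 2 labeled sentences each.
rows = [
    {"id_post": post, "span": f"sentence {post}-{k}"}
    for post in range(10)
    for k in range(2)
]
train_rows, test_rows = split_by_post(rows)
train_posts = {row["id_post"] for row in train_rows}
test_posts = {row["id_post"] for row in test_rows}
```

The sets `train_posts` and `test_posts` are disjoint by construction, so a classifier is never evaluated on a sentence from a post it saw during training.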

3.2. Models

The problem of detecting green practice mentions is formulated as a text classification task in two ways: multi-label classification and binary classification using a one-versus-rest approach. For the one-versus-rest approach, nine binary classifiers were trained to distinguish each practice from all other practices combined. In our experiments, the multi-label text classification model comprised a transformer model followed by a classification layer. This classification layer contained nine output neurons, each corresponding to one green practice. Two transformer-based models were compared in both multi-label and one-versus-rest settings:
  • Bidirectional Encoder Representations from Transformers (BERT) [32], a transformer-based model pre-trained using a masked language modeling objective. In this paper, Conversational RuBERT (https://huggingface.co/DeepPavlov/rubert-base-cased-conversational (accessed on 20 May 2024)) [33], which was trained on OpenSubtitles [34], Dirty, Pikabu, and a Social Media segment of the Taiga corpus [35], was used.
  • Robustly optimized BERT approach (RoBERTa) [36], a model that has the same architecture as BERT but uses a byte-level Byte Pair Encoding as a tokenizer and a different pre-training scheme. RuRoBERTa (https://huggingface.co/ai-forever/ruRoberta-large (accessed on 20 May 2024)) [37], which was trained on the texts of Wikipedia, news texts, literary texts, Colossal Clean Crawled Corpus [38], and Open Subtitles, was used.
Each model was fine-tuned for three epochs with a maximum sequence length of 128 tokens using standard cross-entropy loss and the AdamW optimizer [39]. The learning rate was $4 \times 10^{-5}$ for BERT and $5 \times 10^{-6}$ for RoBERTa.
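The two formulations differ only in how the training targets are laid out. As a minimal sketch, the multi-label setting uses one nine-dimensional multi-hot vector per sentence, whereas the one-versus-rest setting uses nine separate binary target lists. The helper names are hypothetical, and whether sentences without any practice were included as negatives is not stated in the paper, so their inclusion here is an assumption.

```python
NUM_PRACTICES = 9  # practice IDs 1..9


def multi_hot(practice_ids):
    """Nine-dimensional target for the multi-label model (one slot per practice)."""
    return [1 if pid in practice_ids else 0 for pid in range(1, NUM_PRACTICES + 1)]


def one_vs_rest_targets(label_sets):
    """For each practice, a binary target list over all sentences."""
    return {
        pid: [1 if pid in labels else 0 for labels in label_sets]
        for pid in range(1, NUM_PRACTICES + 1)
    }


# Three placeholder sentences: two practices, one practice, no practice.
label_sets = [{1, 3}, {8}, set()]
multi_label_targets = [multi_hot(labels) for labels in label_sets]
binary_targets = one_vs_rest_targets(label_sets)
```

Each list in `binary_targets` would feed one of the nine binary classifiers, while `multi_label_targets` would feed the single model with nine output neurons.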

3.3. Handling Class Imbalance

The goal of a classifier is to categorize objects in a dataset into one or more classes based on their features. In practical applications, datasets often exhibit imbalance, where some classes (minority classes) have significantly fewer instances than others (majority classes). While classification algorithms typically perform well on majority classes, their accuracy drops significantly for minority classes. This imbalance negatively impacts the overall performance of traditional classification algorithms [40]. The methods of handling imbalanced data can be divided into algorithmic-level and data-level methods [41]. Algorithmic-level methods focus on designing new classification algorithms or enhancing existing ones (for example, [42,43,44]), while data-level methods attempt to balance the data by reducing the majority class or expanding the minority class.

3.3.1. Data Augmentation

Since green practices have varying degrees of prevalence, the impact of different data augmentation methods to reduce class imbalance was explored. The impact of data augmentation was investigated using one-versus-rest models. For all augmentation methods, one synthetic example was generated for every positive example in the training subset. Thus, the number of entries of the target practice in the training subset was two times larger than in the original subset.
The following data augmentation methods were compared.
  • Simple duplication, a method that consists of duplicating the texts from the original dataset.
  • Backtranslation, a method using backtranslating phrases between any two languages. The BackTranslation library (https://pypi.org/project/BackTranslation (accessed on 20 May 2024)) based on googletrans and English as a target language was utilized.
  • Text generation (RuGPT3). RuGPT3 (https://huggingface.co/ai-forever/rugpt3medium_based_on_gpt2 (accessed on 20 May 2024)) [37] was fine-tuned for predicting the next word using the following parameters: 3 epochs, a maximum sequence length of 256 tokens, and a learning rate of $4 \times 10^{-5}$. Next, each text mentioning minority green waste practices was transformed in the following way. If the text was more than five tokens long, it was truncated to five tokens. If the text length was shorter, the entire text was used. For each transformed text, a continuation was generated using the fine-tuned RuGPT3.
  • Decoding masked sentences (RuT5). Word-level masking was applied, inspired by [45], by replacing a continuous chunk of k words $w_i, w_{i+1}, \ldots, w_{i+k}$ with a single mask token <mask>. Masking was applied to 50% of the random words in each text. With different random seeds, 10,000 training examples were produced and RuT5-base (https://huggingface.co/ai-forever/ruT5-base (accessed on 20 May 2024)) [37] was fine-tuned to decode the original sequence given a masked sequence. To better distinguish between texts containing mentions of different green waste practices, the types of the corresponding practices were added at the beginning of each masked text as control codes (for example, waste sorting: text). No control codes were added to target texts. Thus, the augmentation scheme was as follows.
    • Creating data for fine-tuning RuT5. Forming a dataset of 10,000 examples for the fine-tuning of RuT5-base: adding control codes and masking to the original texts, using original texts as a target.
    • Fine-tuning on masked texts. RuT5-base was fine-tuned over three epochs with a maximum sequence length of 256 tokens, the learning rate of 4e-5, and the AdamW optimizer.
    • Masking and generating. For each text from the minority classes, 50% of tokens were randomly masked according to the procedure described before, the corresponding control code was added, and a synthetic example was generated.
  • ChatGPT, a cloud-based artificial intelligence chatbot that utilizes OpenAI’s GPT-3.5-turbo model. As a large language model, ChatGPT can create novel and contextually relevant responses to a given prompt, making it an ideal tool for data augmentation [46,47,48]. To obtain generated texts, ChatGPT was provided with the original text from the dataset using a prompt. The model’s response was used as a synthetic text, which was subsequently added to the original data during the training of the classifier. The following prompt translated into Russian was used, inspired by [46]: Please rephrase the following sentence: {text} (“Пoжалуйста, перефразируй следующее предлoжение: {text}”).
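The input preparation for the RuGPT3 and RuT5 methods above can be sketched without any model. The code below is an illustration under stated assumptions, not the authors' implementation: tokens are approximated by whitespace-separated words, exactly half of the word positions are chosen at random, and adjacent masked words are collapsed into a single `<mask>` token. The function names `gpt_prompt` and `mask_text` are hypothetical.

```python
import random

MASK = "<mask>"


def gpt_prompt(text, max_tokens=5):
    """Truncate a text to its first five tokens (approximated here by words)."""
    return " ".join(text.split()[:max_tokens])


def mask_text(text, practice, rng, ratio=0.5):
    """Mask a share of the words and prepend the practice name as a control code."""
    words = text.split()
    n_mask = round(len(words) * ratio)
    masked_positions = set(rng.sample(range(len(words)), n_mask))
    output = []
    for index, word in enumerate(words):
        if index in masked_positions:
            # A run of consecutive masked words becomes one mask token.
            if not output or output[-1] != MASK:
                output.append(MASK)
        else:
            output.append(word)
    return f"{practice}: " + " ".join(output)


# Placeholder English sentence used only for illustration.
target = "bring your sorted plastic to the collection point on Saturday"
source = mask_text(target, "waste sorting", random.Random(0))
body = source[len("waste sorting: "):].split()
```

The pair (`source`, `target`) corresponds to one fine-tuning example for the masked-decoding setup: the control code appears only in the masked input, and the unmasked original is the target.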

3.3.2. Class Weighting

For each one-versus-rest model, a class-weighting technique was applied. The class weight is calculated as follows:
$$\mathrm{weight}_y = \frac{\mathrm{samples}}{\mathrm{classes} \cdot \mathrm{bincount}_y},$$
where $\mathrm{weight}_y$ is the weight of class y, $\mathrm{samples}$ is the total number of samples, $\mathrm{classes}$ is the number of classes, and $\mathrm{bincount}_y$ is the number of occurrences of class y.
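As a worked example of this weighting, the sketch below computes the weights for a binary one-versus-rest problem with 90 negative and 10 positive sentences. The counts are made up for illustration. The same formula is used by the "balanced" class-weight heuristic in scikit-learn, although the paper does not state which implementation was used.

```python
from collections import Counter


def class_weights(labels):
    """weight_y = samples / (classes * bincount_y) for every class y in labels."""
    counts = Counter(labels)
    samples = len(labels)
    classes = len(counts)
    return {y: samples / (classes * count) for y, count in counts.items()}


# Made-up imbalance: 90 negatives (0) and 10 positives (1).
labels = [0] * 90 + [1] * 10
weights = class_weights(labels)
# weights[0] = 100 / (2 * 90) ≈ 0.556, weights[1] = 100 / (2 * 10) = 5.0
```

The minority class receives a weight nine times larger than the majority class, so errors on the rare practice contribute correspondingly more to the loss.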

3.4. Metrics

Different metrics reflect various aspects of model performance in machine learning and computational linguistics. This study focused on two types of assessment for the performance of classifiers and data augmentation methods: F1-score and human evaluation.

3.4.1. F1-Score

To assess the performance of the models, the macro-averaged F1-score separately calculated for each class (F1) was used. The average F1-score was also calculated across all classes ($F1_{avg}$) and all minority classes (Practices 2–9, $F1_{avg}^{2-9}$). Since Practice 9 (repairing) is rarely found in the dataset, the model performance for this practice can vary greatly depending on the model. To reduce the impact of results for this practice, the F1-score across all minority classes except Practice 9 was also calculated ($F1_{avg}^{2-8}$).
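The metric computation can be sketched as follows. This sketch assumes that, for each practice, the macro-averaged F1 is the mean of the F1-scores of the positive and negative classes of the corresponding binary problem; the paper does not spell this out, so it is one interpretation. The per-practice scores in `per_practice` are arbitrary placeholders, not results from the paper.

```python
def f1(tp, fp, fn):
    """F1-score from true positives, false positives, and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1_binary(y_true, y_pred):
    """Mean of the F1-scores of class 0 and class 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


def average(per_practice, practice_ids):
    """Average the per-practice scores over the selected practice IDs."""
    ids = list(practice_ids)
    return sum(per_practice[pid] for pid in ids) / len(ids)


# Toy predictions: class 1 has F1 = 0.5, class 0 has F1 = 0.75.
macro = macro_f1_binary([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])

# Placeholder per-practice scores (NOT results from the paper).
per_practice = {pid: pid / 10 for pid in range(1, 10)}
avg_all = average(per_practice, range(1, 10))  # all practices
avg_2_9 = average(per_practice, range(2, 10))  # minority practices 2-9
avg_2_8 = average(per_practice, range(2, 9))   # minority practices without 9
```

Excluding Practice 9 from the last average removes the single most volatile score, which is the motivation given above for reporting it separately.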

3.4.2. Human Evaluation

In addition to calculating metrics, human evaluation was used to assess the quality of the augmented texts. Thus, a set of 400 synthetic texts was created for random types of practices (100 for each of the four types of augmentation: backtranslation, RuGPT3, RuT5, and ChatGPT). Then, one of the experts involved in the annotation of GreenRu marked whether the generated text contained the mention of the practice for the augmentation of which it was generated. Thus, the value obtained as a result of the human evaluation was in the range from 0 to 100. Human evaluation allowed us to show the ability of augmentation methods to preserve class labels.

4. Results and Discussion

Table 4 and Table 5 demonstrate the results of the evaluation of BERT and RoBERTa. The suffix +w indicates class weighting. The best result in each column is highlighted. If the average F1-score obtained for the augmented dataset outperforms the same metric for the original dataset, the value is marked with ↑.
As can be seen from Table 4 and Table 5, the one-versus-rest models fine-tuned on the original dataset outperform the multi-label models in terms of all the averaged metrics. For RoBERTa, the result of the model with class weighting slightly surpassed the model without class weighting (81.65% vs. 81.16%). For BERT, the opposite situation is observed (78.89% vs. 79.62%). In most cases, the scores obtained for the augmented dataset are higher than the scores for the original data. However, in our experiments, this effect is more pronounced for BERT. All the augmentation methods improve the results of BERT in terms of $F1_{avg}^{2-9}$ and $F1_{avg}^{2-8}$. The highest scores were achieved using backtranslation (84.74%) and ChatGPT (84.9%) for $F1_{avg}^{2-9}$ and $F1_{avg}^{2-8}$, respectively. In the case of RoBERTa, the highest average scores for the original data were demonstrated by the one-versus-rest model with class weighting (81.00% and 85.55% for $F1_{avg}^{2-9}$ and $F1_{avg}^{2-8}$, respectively). In terms of $F1_{avg}^{2-9}$, the value obtained for the original dataset was improved by all the methods except RuGPT3. The highest result was shown by ChatGPT (86.52%). In terms of $F1_{avg}^{2-8}$, the best result on the original dataset was outperformed using simple duplication (86.00%), RuT5 (86.07%), and ChatGPT (86.02% and 85.98% without and with class weighting, respectively). The highest result was achieved with RuT5 using a decoding masked sentences procedure (86.07%). In our experiments, none of the data augmentation methods were the absolute best across all the metrics. Such a case is common for machine learning studies, for example, in works [49,50,51].
Table 6 contains the human evaluation results. The ratings demonstrate the ability of the augmentation methods to preserve the class labels. They show how many of the texts generated by the model for green practices really contain the mention of such practices. For the evaluation, 100 generated texts containing various practices were randomly chosen for each model.
The highest result was achieved by ChatGPT (94 out of 100), followed by RuT5 (89), backtranslation (82), and RuGPT3 (59). The result of RuGPT3 was substantially lower than the scores of the other methods. Probably, this is because the texts were truncated to the first five tokens for augmentation using RuGPT3. The goal of the fine-tuned model was to continue the truncated sequence of tokens. However, the truncation of the texts did not allow the model to correctly detect the types of green practices mentioned in them. In our experiments, RuGPT3 was the worst at generating texts corresponding to a certain green practice among all the considered data augmentation methods.
Three types of model errors were identified during the human evaluation.
  • Different green waste practices. The model generates a text that contains the mention of another practice instead of the required green practice. For example,
    Or bring yours to exchange (“Или принoсите свoи на oбмен”) → Or bring yours for recycling (“Или принoсите свoи на перерабoтку”).
  • Absence of green waste practices. The model generates a text with no mentions of green practices. For example,
    You can post even the most insignificant, only at first glance, actions that will immediately affect the climate footprint: you went to the store with a rag bag, and not with a plastic bag; poured coffee in a thermos cup, and not in a plastic cup; cleared up your mess and collected books for disposal (“Мoжете выкладывать даже самые незначительные, лишь на первый взгляд, действия, кoтoрые сразу же пoвлияют на климатический след: пoшел в магазин с тряпичнoй сумкoй, а не с пакетoм; налил кoфе в термoкружку, а не в пластикoвый стаканчик; расхламился и сoбрал книги на утилизацию”) → You can post even the most insignificant, in your opinion, changes in the promotion program - please do it as soon as possible (“Мoжете выкладывать даже самые незначительные, на ваш взгляд, изменения в прoграмме акции – пoжалуйста, делайте этo как мoжнo раньше”).
  • Negation of green waste practices. The model generates a text that contains a negation of the required green practice. For example,
    The volunteer association supported by [club46977103|Paketa net] organizes an event for parents, where it is possible to lend or give away something and/or to get a new toy without buying. (“Дoбрoвoльческoе oбъединение Кругoвoрoт при пoддержке [club46977103|Пакета нет] oрганизует мерoприятие для рoдителей, где вoзмoжнo oтдать вещь на время или навсегда и/или пoлучить, не пoкупая, нoвую игрушку”) → The volunteer association does not organize an event for parents, where you can give something and/or get for a child without buying a new (“Ассoциация дoбрoвoльцев не oрганизует сoбытие для рoдителей, где мoжнo дать чтo-тo и/или пoлучить для ребенка, не пoкупая нoвую”).
Most of the RuGPT3 and RuT5 errors were associated with the generation of texts related to other green practices. Most of the backtranslation and ChatGPT errors regarded the generation of texts that do not contain mentions of green practices. An error related to the negation of green practices was generated only once using backtranslation.
Even though the initial experiments show promise, there is still room for improvement. For the rarest practice (repairing, 9), in many cases, the results are very poor. Thus, models without data augmentation cannot cope with the detection of this practice due to the small number of texts. However, the use of augmented data in most cases enables better classification of this practice. Further experiments on data augmentation with different amounts of generated texts may improve the current results for minority practices. In addition, the models trained on the current version of GreenRu might be used to search for posts containing mentions of rare green practices. These posts can be checked by experts and subsequently utilized to train classification models. Another possible direction for further improvement is the use of the context of the sentence as additional information.
Some potential issues could limit the applicability of GreenRu. The dataset only contains the texts posted in the online green communities of Tyumen, Russia. The recent works on deep learning show that the generalization ability of the fine-tuned transformer-based models can be influenced by domain bias [52,53,54]. Thus, the performance of the models for detecting the mentions of green practices may deteriorate for the posts of green communities of other regions and the posts of other domains. This issue is typical for most domain-specific text corpora; in particular, this limitation is discussed by [55,56,57]. The currently used domain adaptation methods can be broadly classified into those employing deep architectures and rule-based techniques, such as instance-based and feature-based approaches to align the domain distributions [52]. Applying these techniques for the detection of green practices would be a fruitful area for further work. Another potential limitation of this research is related to the set of practices reflected in this study. GreenRu contains mentions of green practices only in waste management as one of the key aspects of harmonizing the relationship between people and the natural environment. The further extension of the dataset may include other green practices, such as caring for stray animals, cleaning and landscaping land, planting greenery, and some others.

5. Conclusions

The texts posted in green communities on social media contain multiple mentions of green practices. The automated detection of green practices on social media facilitates the comprehensive analysis of their prevalence, efficacy, and scalability, thereby informing potential strategies for expansion. The paper describes the GreenRu dataset of social media posts annotated with mentions of green waste practices at the sentence level. The dataset contains mentions of nine green waste practices covering both the adaptive and transformative types. Our baseline experiments conducted for GreenRu demonstrate that fine-tuned transformer-based models can be applied for the detection of the mentions of green practices. Considering the task of detecting the mentions of green practices as a text classification task, the multi-label and one-versus-rest approaches were compared. Moreover, several ways to handle class imbalance using data augmentation methods were assessed, both in terms of classification metrics and human evaluation. The study shows that the use of data augmentation can significantly improve the performance of detecting the mentions of rare green practices.
The theoretical significance of the study lies in the fact that it is the first to develop an approach to the automated identification of nine green practices on Russian social media. Thus, researchers will be able to monitor the prevalence of these practices on Russian social media, and the results of this monitoring will be used for developing environmental policies. Furthermore, using this approach, other researchers may identify additional green practices, thereby contributing to a more comprehensive understanding of the current processes of Russian society’s greening.
The algorithms developed to identify green practices may also be employed to enhance the functionality of search engines. GreenRu has the potential to support the creation of machine learning models aimed at extracting the mentions of green practices from textual data. This makes it possible to analyze vast amounts of social media content, assess the current prevalence and effectiveness of different types of green practices, and identify potential pathways for scaling up these practices.

Author Contributions

Conceptualization, methodology, project administration, O.Z.; writing—original draft preparation, methodology, visualization, A.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The ethical review and approval of this study were waived because no personal data were used.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are freely available at https://github.com/green-solutions-lab/GreenRu (accessed on 20 May 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Report of the Conference of the Parties to the United Nations Framework Convention on Climate Change (21st Session, 2015: Paris). Paris Agreement. 2015. Available online: https://unfccc.int/resource/docs/2015/cop21/eng/10.pdf (accessed on 20 May 2024).
  2. European Commission and Directorate-General for Communication. European Green Deal—Delivering on Our Targets; European Commission: Brussels, Belgium, 2021.
  3. The Government of Russian Federation. Strategies for the Socio-Economic Development of the Russian Federation with Low Greenhouse Gas Emissions until 2050; The Government of Russian Federation: Moscow, Russia, 2021.
  4. Steffen, W.; Rockström, J.; Richardson, K.; Lenton, T.M.; Folke, C.; Liverman, D.; Summerhayes, C.P.; Barnosky, A.D.; Cornell, S.E.; Crucifix, M.; et al. Trajectories of the Earth System in the Anthropocene. Proc. Natl. Acad. Sci. USA 2018, 115, 8252–8259.
  5. Becker, C.U. Ethical underpinnings for the economy of the Anthropocene: Sustainability ethics as key to a sustainable economy. Ecol. Econ. 2023, 211, 107868.
  6. Giddens, A. The Constitution of Society: Outline of the Theory of Structuration; University of California Press: Berkeley, CA, USA, 1984.
  7. Zakharova, O.V.; Payusova, T.I.; Akhmedova, I.D.; Suvorova, L.G. Green Practices: Ways to Investigation. Sotsiologicheskie Issled. 2021, 4, 25–36.
  8. Balsiger, P.; Lorenzini, J.; Sahakian, M. How do ordinary Swiss people represent and engage with environmental issues? Grappling with cultural repertoires. Sociol. Perspect. 2019, 62, 794–814.
  9. Lamphere, J.A.; Shefner, J. How to green: Institutional influence in three US cities. Crit. Sociol. 2018, 44, 303–322.
  10. van Lunenburg, M.; Geuijen, K.; Meijer, A. How and why do social and sustainable initiatives scale? A systematic review of the literature on social entrepreneurship and grassroots innovation. VOLUNTAS Int. J. Volunt. Nonprofit Organ. 2020, 31, 1013–1024.
  11. Shabanova, M.A. Separate Waste Collection as Russians’ Voluntary Practice: The Dynamics, Factors and Potential. Sotsiologicheskie Issled. 2021, 9, 217–230.
  12. Ermolaeva, Y.V.; Rybakova, M.V. Civil social practices of waste recycling in Russia (Moscow and Kazan). IIOAB J. 2019, 10, 153–156.
  13. Batanina, I.A.; Brodovskaya, E.V.; Dombrovskaya, A.Y.; Parma, R.V. Environmental agenda in the Russian segment of social media: Results of the big data analysis. Izv. Tula State Univ. 2021, 2, 409–428.
  14. Kaminskaya, T.; Pomiguev, I.; Nazarova, N. Digital environmental activism as an instrument of influence on government decisions. Monit. Public Opin. Econ. Soc. Chang. 2019, 5, 382–407.
  15. Shen, J.; Liang, H.; Zafar, A.U.; Shahzad, M.; Akram, U.; Ashfaq, M. Influence by osmosis: Social media green communities and pro-environmental behavior. Comput. Hum. Behav. 2023, 143, 107706.
  16. Kyoi, S.; Mori, K. Development of policy measures for diffusing human pro-environmental behavior in social networks—Computer simulation of a dynamic model of mutual learning. World Dev. Sustain. 2024, 4, 100118.
  17. Parma, R. Public activism of Russian citizens in offline and online spaces. Monit. Public Opin. Econ. Soc. Chang. 2021, 6, 145–170.
  18. Agojo, K.; Bravo, M.; Reyes, J.; Rodriguez, J.; Santillan, A. Activism beyond the streets: Examining social media usage and youth activism in the Philippines. Asian J. Soc. Sci. 2023, 51, 180–187.
  19. Mindel, V.; Overstreet, R.E.; Sternberg, H.; Mathiassen, L.; Phillips, N. Digital activism to achieve meaningful institutional change: A bricolage of crowdsourcing, social media, and data analytics. Res. Policy 2024, 53, 104951.
  20. Greijdanus, H.; de Matos Fernandes, C.A.; Turner-Zwinkels, F.; Honari, A.; Roos, C.A.; Rosenbusch, H.; Postmes, T. The psychology of online activism and social movements: Relations between online and offline collective action. Curr. Opin. Psychol. 2020, 35, 49–54.
  21. Tsepilova, O.; Golbraih, V. Environmental activism: Resource mobilisation for “garbage” protests in Russia in 2018–2020. Zhurnal Sotsiologii Sotsialnoy Antropol. 2020, 23, 136–162.
  22. Kopacheva, E.; Fatemi, M.; Kucher, K. Using social-media-network ties for predicting intended protest participation in Russia. Online Soc. Netw. Media 2023, 37, 100273.
  23. Klimova, A.; Kulikov, S.; Chmel, K. The Role of Social Media in Shaping Regional Ecological Protest in Russia. Monit. Public Opin. Econ. Soc. Chang. 2021, 6, 28–52.
  24. Piselli, C.; Colladon, A.F.; Segneri, L.; Pisello, A. Evaluating and improving social awareness of energy communities through semantic network analysis of online news. Renew. Sustain. Energy Rev. 2022, 167, 112792.
  25. Wu, M.; Long, R. How does green communication promote the green consumption intention of social media users? Environ. Impact Assess. Rev. 2024, 106, 107481.
  26. Zakharova, O.; Glazkova, A.; Suvorova, L. Online Equipment Repair Community in Russia: Searching for Environmental Discourse. Sustainability 2023, 15, 12990.
  27. Kozitsin, I.V. Opinion dynamics of online social network users: A micro-level analysis. J. Math. Sociol. 2023, 47, 1–41.
  28. Zakharova, O.V.; Glazkova, A.V.; Pupysheva, I.N.; Kuznetsova, N.V. The Importance of Green Practices to Reduce Consumption. Chang. Soc. Personal. 2022, 6, 884–905.
  29. Zakharova, O.V.; Karagulian, E.A.; Payusova, T.I. Green practices of citizens: Sources, stabilization and dissemination (case of Tyumen). Vestn. St. Petersburg Univ. Sociol. 2023, 16, 44–64.
  30. Zakharova, O.V.; Karagulian, E. The Green Practices of Tyumen Residents. Traditions, Values and Meanings. Lagoonscapes 2023, 3, 151–170.
  31. Bird, S. NLTK: The natural language toolkit. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, Stroudsburg, PA, USA, 17–18 July 2006; pp. 69–72.
  32. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
  33. Kuratov, Y.; Arkhipov, M. Adaptation of deep bidirectional multilingual transformers for Russian language. Komp’Juternaja Lingvistika Intellektual’Nye Tehnol. 2019, 18, 333–339.
  34. Lison, P.; Tiedemann, J. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; pp. 923–929.
  35. Shavrina, T.; Shapovalova, O. To the methodology of corpus construction for machine learning: “Taiga” syntax tree corpus and parser. In Proceedings of the “CORPORA-2017” International Conference, Saint-Petersburg, Russia, 27–30 June 2017; pp. 78–84.
  36. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
  37. Zmitrovich, D.; Abramov, A.; Kalmykov, A.; Tikhonova, M.; Taktasheva, E.; Astafurov, D.; Baushenko, M.; Snegirev, A.; Shavrina, T.; Markov, S.; et al. A family of pretrained transformer language models for Russian. arXiv 2023, arXiv:2309.10931.
  38. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551.
  39. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
  40. Gosain, A.; Sardana, S. Handling class imbalance problem using oversampling techniques: A review. In Proceedings of the 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Udupi, India, 13–16 September 2017; IEEE: New York, NY, USA, 2017; pp. 79–85.
  41. Spelmen, V.S.; Porkodi, R. A review on handling imbalanced data. In Proceedings of the 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT), Coimbatore, India, 1–3 March 2018; IEEE: New York, NY, USA, 2018; pp. 1–11.
  42. Jang, J.; Kim, Y.; Choi, K.; Suh, S. Sequential targeting: A continual learning approach for data imbalance in text classification. Expert Syst. Appl. 2021, 179, 115067.
  43. Hasib, K.M.; Azam, S.; Karim, A.; Al Marouf, A.; Shamrat, F.J.M.; Montaha, S.; Yeo, K.C.; Jonkman, M.; Alhajj, R.; Rokne, J.G. MCNN-LSTM: Combining CNN and LSTM to classify multi-class text in imbalanced news data. IEEE Access 2023, 11, 93048–93063.
  44. Shao, H.; Zhou, X.; Lin, J.; Liu, B. Few-Shot Cross-Domain Fault Diagnosis of Bearing Driven by Task-Supervised ANIL. IEEE Internet Things J. 2024, 11, 1–1.
  45. Kumar, V.; Choudhary, A.; Cho, E. Data Augmentation using Pre-trained Transformer Models. In Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, Suzhou, China, 7 December 2020; pp. 18–26.
  46. Dai, H.; Liu, Z.; Liao, W.; Huang, X.; Cao, Y.; Wu, Z.; Zhao, L.; Xu, S.; Liu, W.; Liu, N.; et al. AugGPT: Leveraging ChatGPT for Text Data Augmentation. arXiv 2023, arXiv:2302.13007.
  47. ValizadehAslani, T.; Shi, Y.; Wang, J.; Ren, P.; Zhang, Y.; Hu, M.; Zhao, L.; Liang, H. Two-stage fine-tuning with ChatGPT data augmentation for learning class-imbalanced data. Neurocomputing 2024, 592, 127801.
  48. Latif, A.; Kim, J. Evaluation and Analysis of Large Language Models for Clinical Text Augmentation and Generation. IEEE Access 2024.
  49. Şahin, G.G. To augment or not to augment? A comparative study on text augmentation techniques for low-resource NLP. Comput. Linguist. 2022, 48, 5–42.
  50. Feng, S.Y.; Gangal, V.; Kang, D.; Mitamura, T.; Hovy, E. GenAug: Data Augmentation for Finetuning Text Generators. In Proceedings of the Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, Punta Cana, Dominican Republic, 11–12 November 2020; pp. 29–42.
  51. Queiroz Abonizio, H.; Barbon Junior, S. Pre-trained data augmentation for text classification. In Proceedings of the Brazilian Conference on Intelligent Systems, Rio Grande, Brazil, 20–23 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 551–565.
  52. Farahani, A.; Voghoei, S.; Rasheed, K.; Arabnia, H.R. A brief review of domain adaptation. Adv. Data Sci. Inf. Eng. 2021, 877–894.
  53. Fang, Y.; Yap, P.T.; Lin, W.; Zhu, H.; Liu, M. Source-free unsupervised domain adaptation: A survey. Neural Netw. 2024, 174, 106230.
  54. Li, J.; Yu, Z.; Du, Z.; Zhu, L.; Shen, H.T. A comprehensive survey on source-free domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1–22.
  55. Loukachevitch, N.; Manandhar, S.; Baral, E.; Rozhkov, I.; Braslavski, P.; Ivanov, V.; Batura, T.; Tutubalina, E. NEREL-BIO: A dataset of biomedical abstracts annotated with nested named entities. Bioinformatics 2023, 39, btad161.
  56. Labat, S.; Demeester, T.; Hoste, V. EmoTwiCS: A corpus for modelling emotion trajectories in Dutch customer service dialogues on Twitter. Lang. Resour. Eval. 2023, 57, 1–42.
  57. Maladry, A.; Lefever, E.; Van Hee, C.; Hoste, V. The limitations of irony detection in Dutch social media. Lang. Resour. Eval. 2023, 57, 1–32.
Figure 1. Adaptive vs. transformative green practices.
Figure 2. The distribution of the mentions of green waste practices in the dataset. The practice IDs are listed in Table 1.
Table 1. Green waste practices.

| Type of Green Waste Practice | Description | Practice ID |
|---|---|---|
| Adaptive practices | | |
| Waste sorting | Separating waste by its type | 1 |
| Studying the product labeling | Identifying product packaging as a type of waste | 2 |
| Waste recycling | Converting waste materials into reusable materials for further use in the production of something | 3 |
| Signing petitions | Signing documents to influence the authorities | 4 |
| Transformative practices | | |
| Refusing purchases | Consciously choosing not to buy certain products or services that have a negative environmental impact, thereby reducing consumption and environmental footprint | 5 |
| Exchanging | Giving an unnecessary item or service to receive the desired item or service | 6 |
| Sharing | Using one thing by different people for a fee or free of charge | 7 |
| Participating in actions to promote responsible consumption | Participating in any events (workshops, festivals, or lessons) aimed at popularizing the idea of reducing consumption | 8 |
| Repairing | Restoring consumer properties of things as an alternative to throwing them away | 9 |

Table 2. Examples of the sentences containing the mentions of green waste practices. The texts in Russian retain the authors’ spelling and punctuation. The practice IDs are listed in Table 1.

| Sentence | Practices Mentioned in the Sentence |
|---|---|
| Вместе мы собрали около 300 кг. различных отходов. [Together we collected about 300 kg. of various waste.] | 1 |
| “Распакуем тюменское”: начинаем аудит упаковки. [“Let’s unpack Tyumen”: we begin a packaging audit.] | 2 |
| По словам организаторов всё собранное будет переработано без вреда для окружающей среды на заводах-партнера. [According to the organizers, everything collected will be processed without harm to the environment at partner factories.] | 3 |
| Прокручиваете петицию “За отказ от мусоросжигания и за предотвращение образования отходов” до конца, голосуете “ЗА”. [Scroll through the petition “No incineration and waste prevention to the minimum” to the end, vote for it.] | 4 |
| На Михайловском рынке покупаю корейские салаты в свой контейнер! [In the Mikhailovsky market I buy Korean salads into my container!] | 5 |
| Завтра, 21 июля, с 11.00 до 13.00 пройдет июльский обменник игрушками, детскими вещами и книгами. [Tomorrow, July 21, from 11.00 to 13.00 there will be a July exchange of toys, children’s things and books.] | 6 |
| Благотворительный магазин “Лаффка” предоставит нам костюмы для экологического карнавала. [The charity store “Luffka” will provide us with costumes for an ecological carnival.] | 7 |
| Задача этих интервью - показать, что волонтером может быть каждый, вне зависимости от семейного положения, дохода, уровня занятости. [The purpose of these interviews is to show that anyone can be a volunteer, regardless of marital status, income, or level of employment.] | 8 |
| Но мой отец в свободное время занимается ремонтом бытовой техники - даёт вторую жизнь вещам. [But my father, in his free time, repairs household appliances - giving a second life to things.] | 9 |
| Желающие могли поучаствовать в мастер-классе по изготовлению ковриков и сумок из ветоши, обменяться детскими игрушками, книжками и одеждой и сдать сырье в переработку. [Those interested could take part in a workshop on making rugs and bags from rags, exchange children’s toys, books, and clothes, and recycle raw materials.] | 1, 3, 6 |
| А мы продолжаем записывать для вас уютные домашние видео о том, как подготовить вторсырье к сдаче. [And we continue to record cozy home videos for you on how to prepare recyclables for collection.] | 1, 8 |

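
The practice IDs listed for each sentence in Table 2 can be read as a multi-label target. The following minimal Python sketch shows one way to encode them, either as a single multi-hot vector or as nine separate binary targets for a one-versus-rest setup. The helper names `to_multi_hot` and `to_one_vs_rest` are illustrative assumptions and are not taken from the GreenRu release, whose file format may differ.

```python
NUM_PRACTICES = 9  # practice IDs 1..9 from Table 1


def to_multi_hot(practice_ids, num_practices=NUM_PRACTICES):
    """Encode a set of practice IDs as a 0/1 vector (index 0 = practice 1)."""
    vector = [0] * num_practices
    for pid in practice_ids:
        if not 1 <= pid <= num_practices:
            raise ValueError(f"unknown practice ID: {pid}")
        vector[pid - 1] = 1
    return vector


def to_one_vs_rest(practice_ids, num_practices=NUM_PRACTICES):
    """Split the same labels into one binary target per practice."""
    return {pid: int(pid in practice_ids) for pid in range(1, num_practices + 1)}


# Label sets taken from the last two rows of Table 2.
workshop_sentence = {1, 3, 6}
video_sentence = {1, 8}

print(to_multi_hot(workshop_sentence))    # [1, 0, 1, 0, 0, 1, 0, 0, 0]
print(to_multi_hot(video_sentence))       # [1, 0, 0, 0, 0, 0, 0, 1, 0]
print(to_one_vs_rest(video_sentence)[8])  # 1
```

A multi-label model would be trained on the full vector at once, whereas a one-versus-rest setup would train one binary classifier per dictionary key.
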
Table 3. Statistics of GreenRu. The number of tokens is calculated using the NLTK tokenizer [31]. The sign ± indicates standard deviation values. The practice IDs are listed in Table 1.

| Characteristic | Training Subset | Test Subset |
|---|---|---|
| Number of posts | 913 | 413 |
| Average length of posts | | |
| Symbols | 880.05 ± 751.46 | 908.53 ± 761.06 |
| Tokens | 154.91 ± 135.39 | 162.33 ± 139.19 |
| Average length of sentences containing the mentions of green practices | | |
| Symbols | 111.35 ± 101.23 | 114.99 ± 101.73 |
| Tokens | 18.85 ± 19.01 | 19.86 ± 20.80 |
| Number of mentions per practice | | |
| 1 | 1275 | 560 |
| 2 | 55 | 17 |
| 3 | 272 | 121 |
| 4 | 22 | 31 |
| 5 | 236 | 75 |
| 6 | 146 | 52 |
| 7 | 109 | 62 |
| 8 | 510 | 209 |
| 9 | 10 | 3 |

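
As a quick consistency check on Table 3, the per-practice counts can be summed and compared with the 3765 mentions reported for the dataset. The short Python sketch below does this and also prints each practice's share of all mentions, which makes the class imbalance explicit. The dictionary layout and variable names are illustrative assumptions.

```python
# Mentions per practice as (training, test), copied from Table 3.
MENTIONS = {
    1: (1275, 560),
    2: (55, 17),
    3: (272, 121),
    4: (22, 31),
    5: (236, 75),
    6: (146, 52),
    7: (109, 62),
    8: (510, 209),
    9: (10, 3),
}

train_total = sum(train for train, _ in MENTIONS.values())
test_total = sum(test for _, test in MENTIONS.values())
total = train_total + test_total

print(train_total, test_total, total)  # 2635 1130 3765

# Share of all mentions per practice, most frequent first.
shares = {pid: (train + test) / total for pid, (train, test) in MENTIONS.items()}
for pid, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"practice {pid}: {share:.1%}")
```

Practice 1 (waste sorting) accounts for nearly half of all mentions, while practice 9 (repairing) has only 13 mentions across both subsets.
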
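Table 4. Results for BERT. Columns 1 to 9 report the F1 score per practice ID (Table 1).

| Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | $F1_{avg}$ | $F1_{avg}^{2-9}$ | $F1_{avg}^{2-8}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Original dataset | | | | | | | | | | | | |
| Multi-label | 85.48 | 49.60 | 76.48 | 49.26 | 79.86 | 90.90 | 72.44 | 85.44 | 49.93 | 71.04 | 69.24 | 72.00 |
| One-versus-rest | 87.25 | 75.47 | 77.28 | 88.62 | 76.41 | 92.22 | 84.87 | 84.50 | 49.93 | 79.62 | 78.66 | 82.77 |
| One-versus-rest + w | 86.03 | 77.29 | 69.24 | 83.70 | 83.19 | 92.51 | 82.43 | 85.70 | 49.91 | 78.89 | 78.00 | 82.01 |
| Augmented dataset | | | | | | | | | | | | |
| Simple duplication | - | 85.28 | 72.29 | 84.95 | 82.81 | 93.38 | 81.51 | 86.60 | 49.93 | - | 79.59↑ | 83.83↑ |
| Simple duplication + w | - | 78.74 | 74.26 | 87.89 | 80.69 | 92.51 | 79.13 | 86.21 | 49.93 | - | 78.67↑ | 82.78↑ |
| Backtranslation | - | 81.14 | 76.65 | 88.97 | 84.84 | 89.51 | 84.70 | 84.65 | 87.46 | - | 84.74 | 84.35↑ |
| Backtranslation + w | - | 78.24 | 75.40 | 89.37 | 81.84 | 90.33 | 83.89 | 85.89 | 87.46 | - | 84.05↑ | 83.57↑ |
| RuGPT3 | - | 79.39 | 74.02 | 85.10 | 82.62 | 90.51 | 83.89 | 84.07 | 49.93 | - | 78.69↑ | 82.80↑ |
| RuGPT3 + w | - | 79.39 | 72.81 | 85.10 | 81.18 | 90.92 | 79.19 | 84.21 | 69.93 | - | 80.34↑ | 81.83 |
| RuT5 | - | 85.28 | 73.80 | 87.73 | 83.45 | 93.07 | 81.81 | 84.12 | 49.93 | - | 79.90↑ | 84.18↑ |
| RuT5 + w | - | 85.89 | 74.55 | 82.97 | 80.42 | 92.64 | 82.78 | 86.17 | 74.96 | - | 82.55↑ | 83.63↑ |
| ChatGPT | - | 84.04 | 76.56 | 88.97 | 83.54 | 90.44 | 85.91 | 84.83 | 83.29 | - | 84.70↑ | 84.90 |
| ChatGPT + w | - | 77.29 | 75.22 | 87.44 | 81.64 | 90.76 | 82.10 | 87.08 | 69.93 | - | 81.43↑ | 83.08↑ |
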
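
The three averaged columns in Table 4 are not defined in this part of the article. The Python sketch below assumes they are unweighted means of the per-practice F1 scores over practices 1 to 9, 2 to 9, and 2 to 8, respectively. Under that assumption it reproduces the values reported for the multi-label BERT model on the original dataset. The function name `average_f1` is illustrative.

```python
from statistics import mean

# Per-practice F1 for the multi-label BERT model, original dataset (Table 4).
f1_by_practice = {
    1: 85.48, 2: 49.60, 3: 76.48, 4: 49.26, 5: 79.86,
    6: 90.90, 7: 72.44, 8: 85.44, 9: 49.93,
}


def average_f1(scores, first, last):
    """Unweighted mean of F1 over practice IDs first..last (inclusive)."""
    return round(mean(scores[pid] for pid in range(first, last + 1)), 2)


print(average_f1(f1_by_practice, 1, 9))  # 71.04
print(average_f1(f1_by_practice, 2, 9))  # 69.24
print(average_f1(f1_by_practice, 2, 8))  # 72.0
```

Restricting the average to practices 2 to 9 or 2 to 8 leaves out the most frequent practice and, in the second case, also the rarest one, so the restricted averages are less dominated by the two extremes of the class distribution.
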
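Table 5. Results for RoBERTa. Columns 1 to 9 report the F1 score per practice ID (Table 1).

| Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | $F1_{avg}$ | $F1_{avg}^{2-9}$ | $F1_{avg}^{2-8}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Original dataset | | | | | | | | | | | | |
| Multi-label | 87.28 | 81.24 | 77.87 | 83.68 | 85.01 | 91.48 | 80.14 | 86.73 | 49.93 | 80.37 | 79.51 | 83.74 |
| One-versus-rest | 86.79 | 80.24 | 74.01 | 91.31 | 84.51 | 92.51 | 86.36 | 84.81 | 49.93 | 81.16 | 80.46 | 84.82 |
| One-versus-rest + w | 86.10 | 81.26 | 76.73 | 93.91 | 82.81 | 92.08 | 87.33 | 84.74 | 49.93 | 81.65 | 81.00 | 85.55 |
| Augmented dataset | | | | | | | | | | | | |
| Simple duplication | - | 82.14 | 77.24 | 90.16 | 83.56 | 92.35 | 86.24 | 85.52 | 74.96 | - | 84.02↑ | 85.32 |
| Simple duplication + w | - | 82.14 | 78.23 | 95.50 | 84.27 | 90.44 | 86.92 | 84.48 | 74.96 | - | 84.62↑ | 86.00↑ |
| Backtranslation | - | 81.14 | 75.66 | 91.61 | 84.58 | 92.22 | 85.30 | 85.08 | 78.50 | - | 84.26↑ | 85.08 |
| Backtranslation + w | - | 83.06 | 76.83 | 91.31 | 83.21 | 92.51 | 86.24 | 85.56 | 83.29 | - | 85.25↑ | 85.53 |
| RuGPT3 | - | 75.21 | 76.68 | 95.50 | 79.24 | 88.35 | 84.62 | 83.08 | 49.93 | - | 79.08 | 83.24 |
| RuGPT3 + w | - | 77.48 | 76.58 | 92.17 | 77.64 | 88.91 | 86.33 | 81.63 | 49.93 | - | 78.83 | 82.96 |
| RuT5 | - | 79.39 | 75.71 | 92.41 | 86.75 | 91.51 | 85.11 | 85.19 | 75.96 | - | 84.00↑ | 85.15 |
| RuT5 + w | - | 80.24 | 76.48 | 96.46 | 85.79 | 92.94 | 84.21 | 86.36 | 75.96 | - | 84.81↑ | 86.07 |
| ChatGPT | - | 82.14 | 77.74 | 94.51 | 84.99 | 90.99 | 86.36 | 85.44 | 89.98 | - | 86.52 | 86.02↑ |
| ChatGPT + w | - | 80.43 | 78.96 | 93.70 | 84.14 | 91.80 | 87.82 | 85.01 | 69.93 | - | 83.97↑ | 85.98↑ |
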
Table 6. Human evaluation results. DA: data augmentation. The last three columns are error types.

| DA Method | Rating | Errors: Different Green Waste Practice | Errors: Absence of Green Waste Practices | Errors: Negation of Green Waste Practices |
|---|---|---|---|---|
| Backtranslation | 82 | 6 | 11 | 1 |
| RuGPT3 | 59 | 23 | 18 | - |
| RuT5 | 89 | 7 | 4 | - |
| ChatGPT | 94 | - | 6 | - |

Share and Cite

MDPI and ACS Style

Zakharova, O.; Glazkova, A. GreenRu: A Russian Dataset for Detecting Mentions of Green Practices in Social Media Posts. Appl. Sci. 2024, 14, 4466. https://doi.org/10.3390/app14114466
