Crowdsourcing Analysis of Twitter Data on Climate Change: Paid Workers vs. Volunteers

Web-based crowdsourcing has become an important method of environmental data processing. Two alternatives are widely used today by researchers in various fields: paid data processing mediated by for-profit businesses such as Amazon’s Mechanical Turk, and volunteer data processing conducted by amateur citizen-scientists. While the first option delivers results much faster, it is not clear how it compares with volunteer processing in terms of quality. This study compares volunteer and paid processing of social media data originating from climate change discussions on Twitter. The same sample of Twitter messages discussing climate change was offered for processing to volunteer workers through the Climate Tweet project, and to paid workers through the Amazon MTurk platform. We found that paid crowdsourcing required a high-redundancy data processing design to obtain quality comparable with volunteer processing. Among the methods applied to improve data processing accuracy, limiting the geographical locations of the paid workers appeared the most productive. Conversely, we did not find significant geographical differences in the accuracy of data processed by volunteer workers. We suggest that the main driver of the observed pattern is the difference in the paid workers' familiarity with the research topic.


Introduction
Development and application of climate change policies require their acceptance and support by the public. The traditional method of measuring public perceptions of climate change has relied on surveys, such as Climate Change in the American Mind [1]. The recent development of social media, however, has offered new, unobtrusive opportunities for measuring public perceptions of climate change worldwide. In this new line of research, Twitter, the fourth most popular social networking site, is the medium used most frequently. The EBSCO Academic Search Premier database contains 23 journal articles with the words "Twitter" and "Climate change" in the abstract, as compared to Facebook (the most popular social media site, 14 journal articles), YouTube (second most popular social media site, 4 journal articles), Instagram (third most popular social media site, 1 journal article), and Reddit (fifth most popular social media site, no journal articles).
Very few of these papers, however, utilized the "Big Data" advantage of social media by exploring the content of large corpora of Twitter messages. Kirilenko and Stepchenkova [2] used a 1-million sample of tweets to research geographical variations in climate change discourse worldwide. Cody et al. [3] analyzed 1.5 million tweets containing the word "climate" to explore temporal changes in sentiment (described in the paper as "a tool to measure happiness") expressed by people in relation to climate change. Yang et al. [4] researched the effect of climate and seasonality on depressed mood using automated content analysis of 600 million tweets. Holmberg and Hellsten [5] studied 250 thousand tweets to identify gender differences in climate change communication. Leas et al. [6] analyzed the impact of a celebrity speaking on climate change on social media discussion. Kirilenko et al. [7] and Sisco et al. [8] analyzed the impact of extreme weather events on attention to climate change in social media.
The scarcity of "Big Data" research on climate change perceptions expressed in social media is related to the challenges of content analysis of large volumes of text. Classification of social media messages requires considerable monetary and time investments, which easily become prohibitive when large datasets are processed. Even when machine learning methods are used, supervised classification still requires a manually classified sample that serves both for algorithm training and for groundtruthing. One popular solution to this research bottleneck is to break the work into small, manageable, easily understandable tasks and then to use the Internet to outsource processing of each task to amateur scientists (referred to as "workers") acting as volunteers or contractors. This method was popularized by Howe [9] as "crowdsourcing". The most famous crowdsourcing effort is probably the Galaxy Zoo project targeted at the classification of imagery of over one million galaxies collected by the Sloan Digital Sky Survey [10], which so far has produced over 50 million classifications [11,12].
The challenge of outsourcing data processing to untrained workers (either volunteer or paid) is associated with quality control. While a significant body of literature has studied the quality of paid crowdsourcing (mostly Amazon's Mechanical Turk; hereafter "MTurk"), and a few papers have dealt with the quality of volunteer crowdsourcing, we are aware of only two studies that attempted to compare the performance of paid and volunteer workers processing the same data in crowdsourced projects. A study by Mao et al. [13] investigated the performance of volunteer and paid crowd workers in exoplanet detection through analysis of planet transit light curves. A set of light curves was offered to volunteer citizen-scientists through the crowdsourcing platform Planet Hunters (www.planethunters.org). A visually similar interface was built as a set of Amazon MTurk tasks and offered to the paid workers. Overall, the performance of the paid workers was the same as or only slightly below that of the volunteer workers, which might be partially related to the high hourly earnings of the paid workers of $4.8-5.6/h compared to the mean earnings of below $2/h in paid crowdsourcing projects, as reported by Ross et al. [14]. The authors also noticed that the unpaid citizen scientists spent almost twice as much time on each task as the paid workers.
In a similar study design, Redi and Povoa [15] compared the performance of volunteer participants recruited via Facebook and paid crowdsourcing workers in the estimation of the aesthetic appeal of photographs processed with various filters. The authors found that volunteer work returned a higher correlation between the mean image ratings obtained through crowdsourcing and those obtained in a lab experiment, which demonstrated better reliability of volunteer crowdsourcing. They also reported a smaller number of unreliable volunteer workers; however, the volunteers tended to leave the work unfinished more frequently.
Scientific research in high-visibility fields is likely to appeal to citizen scientists, making volunteer crowdsourcing a viable alternative to paid workers and ensuring the return of supposedly better quality data. Both of the abovementioned studies that compare the volunteer and paid platforms, however, dealt with the same type of data (images) and returned somewhat inconsistent outcomes. We are not aware of a similar comparison done for textual data from social media. The exploding popularity of social networks has led to an ever-increasing number of publications using social media data to study public discourse in relation to various natural and/or socio-economic phenomena. Among the available social networking sites, Twitter is one of the most frequently researched, with over 4000 publications on Twitter listed in the Thomson-Reuters Web of Science.
The goal of our study was to compare the quality of volunteer and paid workers' classification of Twitter messages (tweets) on climate change and to provide recommendations on quality control. The study was part of a larger project studying the geographical patterns of public perceptions of climate change worldwide [2]. This article is organized as follows. In the second section, we provide a brief review of research dealing with crowdsourcing, with an emphasis on quality control. Then, we present our data and methodology. The results are presented in Section four. The fifth section contains a thorough discussion of our results. Finally, Section six provides a brief conclusion, study limitations, and recommendations for further research.

Crowdsourcing in Scientific Research
Multiple authors have used crowdsourcing in their research on climate change impacts. Muller et al. [16] reviewed 29 crowdsourcing projects related to climate change that involved volunteer citizen-scientists engaged in data collection and/or processing. The crowdsourced applications included measurements of snow, rainfall, and other weather data, reporting severe weather outbreaks, recording air quality data, estimating the length of plane contrails (important contributors to the warming of the troposphere), classification of satellite imagery of tropical cyclones, digitizing weather records found in 19th century ship logs, and many others. Other climate researchers used paid contract workers. For example, Olteanu et al. [17] used crowdsourcing to process data on climate change coverage in mainstream news and online. Samsel et al. [18] used massive crowd processing of color schemes for digital mapping of ocean salinity change related to climate change. When the data volume is too large for manual processing even with crowdsourcing (e.g., due to costs), crowdsourcing may be used to process a sample of the data, which can then be used for training and validation of machine learning algorithms. Thus, Yzaguirre et al. [19] used crowdsourcing to validate their text mining application for extraction of past environmental disaster events from news archives. Paid crowdsourcing platforms are also frequently employed for collecting public opinion data. Ranney and Clark [20] used volunteer and paid online participants to collect data on knowledge about climate change. Attari [21] researched people's perceptions of their water use and found that consumed water was underestimated by more than a factor of two.
Multiple factors promote the growing popularity of crowdsourcing. Among those, the most important are probably its speed and cost. When data processing is easily parallelized, large volumes of data can be processed quickly by breaking data analysis into small, easily comprehended micro-tasks that do not require special training and are then solved by hundreds of citizen-scientists. On the one hand, multiple studies reported hourly wages of crowd workers well below the minimum hourly wage (e.g., [14]). The crowd workers are regarded as independent contractors, which generally frees their employers from tax and legal obligations, reducing costs even further. On the other hand, projects that are deemed socially important appeal to volunteer labor and can be completed at no monetary cost at all; for example, the abovementioned Sloan Digital Sky Survey attracted over 150,000 volunteer workers [11].
Many crowdsourcing platforms are available on the Web (for a review, see [22]). Among the platforms specializing in outsourcing micro-tasks, Amazon's Mechanical Turk is probably the most popular one, having over 0.5 million registered workers ("Turkers") from 190 countries [23]. The demographics of the Turkers and an introductory guide to conducting crowdsourcing research on the MTurk platform were published by Mason and Suri [24].

Quality Issues in Crowdsourcing
It has been repeatedly demonstrated that complex problems normally requiring advanced technical training can be solved by crowdsourcing; examples include civil engineering [25], bioinformatics [26], astronomy [12] and many others. Furthermore, under certain conditions, even generating novel ideas and innovations can be crowdsourced with results comparable to those obtained from experts [27]. Crowdsourced results may, however, be unreliable due to the following factors:

• Instrumental errors arising from complex data pre- and post-processing, which involves multiple third-party platforms used to prepare data for processing, send tasks to workers, collect processing results, and finally, join the processed data.
• Involuntary errors by human raters, e.g., due to insufficiently clear instructions and workers' cognitive limitations.
• Deliberately poor performance of the human raters. A worker may vandalize the survey and provide wrong data, may try to maximize the number of tasks processed per time unit for monetary or other benefits, may provide incorrect information regarding their geographical location, or may lack motivation [28].
The last item in this list of potential error sources has attracted a lot of attention from practitioners, due to its high potential to render project results unusable. A widely cited experiment in which Amazon Mechanical Turk workers rated Wikipedia articles demonstrated only a marginally significant correlation between the crowdsourced and expert ratings [29]. However, the same study showed that simple changes in task design aimed at discouraging cheating increased the median time spent by a worker to complete one task from 1.5 to 4.1 min, decreased the percentage of unusable classifications from 49% to 6%, and noticeably improved the correlation with expert classification.
On the other hand, it seems only logical to suppose that deliberately poor performance should not occur with voluntary participants, as they do not seek to maximize monetary gain from their work. Indeed, a study of motivations of volunteer workers in a crowdsourced scientific project on galaxy image classification [30] found that the primary motivation was seeking to contribute to original scientific research (39.8% of the respondents), followed by an interest in the scientific discipline (12.4%) and discovery (10.4%). Other motivations supplied by participants may, however, contribute towards a "cutting corners" behavior: they include a desire to complete more tasks than other participants, seeking fame for discoveries, and completing a homework assignment [30].
The data quality problem is typically resolved by heavily redundant designs where a single task is assigned to multiple workers; the "true" classification value is then defined as the majority vote (the mode). The required redundancy, however, increases costs while reducing the benefits of using crowdsourcing. Since the highest threat to the reliability of paid crowdsourcing results comes from a small but highly active group of workers trying to game the system [29], there is an incentive to identify the poorly performing workers and exclude their results from further consideration.
Multiple methods have been suggested to reduce the impact of this group on study results; for an overview of quality control methods in crowdsourced solutions, see [31]. Rouse et al. [32] demonstrated that an improvement in accuracy can be obtained simply by asking the workers if they were attentive in completing the task, and giving them an option to remove their data from consideration. A commonly used solution is to employ a worker reputation system, assigning tasks only to workers with approval ratings above a certain pre-set level [33]. Another set of methods for identifying and excluding unethical workers is based on a set of indices measuring (1) agreement with the expert "gold standard" data; (2) agreement with the other workers; (3) agreement with the attention check questions; and (4) the amount of effort, estimated from the task completion time [34]. The "gold standard" is a subset of data that is processed by experts in the field; an important condition is that a lay person should be able to process this data easily and unambiguously. The agreement-based indices target identification of outlier workers or weigh the contributions by a worker's deviation from the mean [35]. The attention check and language comprehension questions are verifiable questions [29] that do not require factual knowledge [36]; the results obtained from workers failing to answer the attention questions correctly should be discarded. Finally, the average time to complete a single task is used to identify low-quality workers, who presumably spend less time per task [34].
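The four families of indices listed above can be combined into a simple screening rule. The following is only a minimal sketch: the WorkerStats record, the is_suspect function, and all threshold values are hypothetical names and numbers chosen for illustration, not settings reported in the cited studies.

```python
from dataclasses import dataclass


@dataclass
class WorkerStats:
    """Hypothetical per-worker summary of the four quality indices."""
    worker_id: str
    gold_agreement: float    # (1) fraction of gold-standard items matching the expert
    peer_agreement: float    # (2) fraction of items matching the other workers' majority
    attention_passed: bool   # (3) all attention-check questions answered correctly
    seconds_per_item: float  # (4) mean completion time per item, a proxy for effort


def is_suspect(w, min_gold=0.5, min_peer=0.5, min_seconds=5.0):
    """Flag a worker who fails any one of the four checks.

    The thresholds are illustrative assumptions and would need tuning per project.
    """
    return (
        w.gold_agreement < min_gold
        or w.peer_agreement < min_peer
        or not w.attention_passed
        or w.seconds_per_item < min_seconds
    )


workers = [
    WorkerStats("w1", 0.9, 0.8, True, 35.0),
    WorkerStats("w2", 0.9, 0.8, True, 3.0),   # implausibly fast
]
kept = [w.worker_id for w in workers if not is_suspect(w)]
print(kept)
```

Flagging on any single failed check is the strictest possible design; a weighted score across the indices would be a softer alternative.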

Data and Methodology
Twitter data were originally collected for a project on online discussions of climate change, and early results were covered in [2]. Software was developed to systematically poll the Twitter social networking site for the terms "climate change" and "global warming", which resulted in over 2 million tweets collected; after filtering as described in [7], this dataset was reduced to 1.3 million georeferenced tweets. Out of this database, 600 tweets in English published within the 2012-2014 period were randomly selected for further processing.
The research design was similar to [13]. The same data were offered for processing to the volunteer workers through the Climate Tweet project based on the Citizen Scientist platform hosted at the University of North Dakota [37] and to the paid workers through the Amazon MTurk platform. To follow the best crowdsourcing practices, we used only the best paid workers, defined as those with at least a 95% Human Intelligence Task (HIT) approval rating (see [33] for a detailed explanation). Note that the HIT approval rating is a measure of a worker's work quality, calculated as the fraction of his/her completed tasks that were approved by requesters.
As a motivation, the volunteer workers were provided with an explanation of the scientific importance of the project; additionally, the screen names of the best workers were published on the project's login page. The paid workers were provided with monetary compensation of $0.40 for classification of a single bundle (HIT) of 20 tweets. Taking into account the mean processing time discussed in the next section, the mean hourly earnings of a paid worker were $2.03, which is slightly above the average Amazon MTurk earnings of just below $2/h (cf. mean earnings of $1.58/h for an Indian vs. $2.30/h for a U.S. worker [14]).
The quality comparisons of data produced by volunteer and paid workers were conducted along two dimensions: the attitude to the phenomenon of climate change expressed in a processed tweet, and the topics raised. First, workers were asked to evaluate the attitude towards climate change expressed in a tweet using a 5-point scale [−2, 2]: −2: extremely negative attitude, denial, skepticism ("Man made GLOBAL WARMING HOAX EXPOSED"); −1: denying climate change ("UN admits there has been NO global warming for the last 16 years!"), or denying that climate change is a problem, or that it is man-made ("Sunning on my porch in December. Global warming ain't so bad"); 0: neutral, unknown ("A new article on climate change is published in a newspaper"); 1: accepting that climate change exists, and/or is man-made, and/or can be a problem ("How's planet Earth doing? Take a look at the signs of climate change here"); 2: extremely supportive of the idea of climate change ("Global warming? It's like earth having a Sauna!").
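The hourly figure quoted above follows directly from the payment per HIT and the mean completion time of 11.8 min per 20-tweet task reported in the Descriptive Statistics section. A small sketch of the arithmetic, with hourly_earnings as a hypothetical helper name:

```python
def hourly_earnings(pay_per_hit_usd, minutes_per_hit):
    """Convert a per-HIT payment and a mean completion time into dollars per hour."""
    return pay_per_hit_usd * 60.0 / minutes_per_hit


# $0.40 per 20-tweet HIT, 11.8 min per HIT on average (both values from the text)
rate = hourly_earnings(0.40, 11.8)
print(f"${rate:.2f}/h")
```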
Second, the workers were asked to classify the same tweet into up to three of 10 topics, unified into broader themes. For exact questions, refer to Appendix A.
While the task formulation offered to the workers was the same on both platforms, with a very similar visual survey layout, the work flow was different due to the specifics of paid and unpaid work organization and differences in the platforms. The paid workers were offered classification tasks in 20-tweet packets; for redundancy, each task was offered multiple times to different workers, so that each tweet was processed by multiple MTurk workers (min = 20, mean = 26, max = 48). Tweets were offered to volunteer workers individually, and each tweet was processed by fewer workers (min = 6, mean = 14, max = 21). For further analysis, we selected only those tweets that were processed by at least nine workers on both platforms, which reduced the number of tweets from 600 to 579. The final classification was produced by the "majority consensus" method, i.e., for each tweet, its "true" classification was decided based on which topical category, or attitude, received the largest number of "votes" from the workers [31].
For groundtruthing purposes, the "Gold Standard" tweets were selected based on [34]. For this, the 579 tweets were screened by the first author, who was a climate scientist, and the tweets that could most easily and transparently be classified into one of the classification categories were selected (103 in total). For example, the tweet "What happened to global warming? It's cold as **** outside" clearly falls into the category "denial and skepticism" with a negative attitude towards climate change. The selected 103 tweets classified by the experts will be further referred to as the "Expert processed" (E) dataset. The same tweets processed by the paid and volunteer workers will be referred to as the P and V datasets, respectively.
The study, therefore, followed the best practices of research employing MTurk workers [38]: (1) utilizing workers' qualifications in task assignment; (2) creating a "Gold standard" expert-processed dataset; (3) using redundancy; and (4) using a majority consensus to adjudicate results. The abovementioned best practices (2)-(4) were also employed with respect to the data produced by volunteers; however, the authors were unable to apply best practice (1), controlling the qualifications of the volunteer workers.
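The tweet selection and majority consensus steps described above can be sketched as follows. This is an illustration under assumptions, not the study's code: the function names and the dict-of-lists data layout are hypothetical, and ties are resolved in favor of the label seen first because the text does not say how ties were handled.

```python
from collections import Counter


def majority_consensus(votes):
    """Return the label with the most votes.

    Ties go to the label encountered first, which is an assumption of this sketch.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    return Counter(votes).most_common(1)[0][0]


def consensus_on_shared_tweets(paid, volunteer, min_raters=9):
    """Majority labels for tweets rated by at least min_raters workers on both platforms.

    paid and volunteer map a tweet id to the list of labels given by individual workers.
    """
    kept = [
        t for t in paid
        if t in volunteer
        and len(paid[t]) >= min_raters
        and len(volunteer[t]) >= min_raters
    ]
    return (
        {t: majority_consensus(paid[t]) for t in kept},
        {t: majority_consensus(volunteer[t]) for t in kept},
    )


# Toy attitude ratings on the [-2, 2] scale
paid = {"t1": [1] * 6 + [0] * 4, "t2": [2] * 3}
volunteer = {"t1": [1] * 5 + [-1] * 4, "t2": [2] * 10}
p_consensus, v_consensus = consensus_on_shared_tweets(paid, volunteer)
print(p_consensus, v_consensus)
```

In the toy data, tweet t2 is dropped because it has only three paid ratings, mirroring the rule that reduced the sample from 600 to 579 tweets.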

Descriptive Statistics
The processing of the whole pool of 579 tweets was done by 127 volunteers and 574 paid workers; on average, each volunteer processed 65 tweets, while each paid worker processed only 26. For paid workers, the mean processing time of a 20-tweet task was 11.8 min (35 s/tweet). A few raters spent a very short time per tweet (min = 5 s/tweet), indicating potential cheating behavior. We do not have processing times for the volunteer workers due to software platform limitations.
Classification results differed for volunteer and paid workers, with the former tending to classify tweets into fewer topics. The matched-pair two-tailed t-test found significant differences between the number of categories in the V (mean = 1.64) and P (mean = 1.83) datasets (p = 3.3 × 10^−30). We also found a better interrater agreement between the volunteer workers for all topical classifications: cf. 75% percentage agreement for V vs. 81% for P. Note that this percentage agreement is inflated by agreement by chance; Fleiss' generalized kappa adapted by Uebersax [39] for an unequal number of raters per subject (see also [40]) showed that in fact the interrater agreement was poor. Nevertheless, it also showed a better agreement for V raters (mean kappa = 0.24) vs. P raters (mean kappa = 0.14).
For attitude classification, the paid workers demonstrated a tendency to use the extreme values of −2 and +2 more frequently than the volunteers; 25.8% of all tweets were rated as extremely positive or extremely negative, vs. only 11.1% for volunteers. Similarly to topic classification, the interrater agreement was higher for V as compared to P raters as measured by percentage agreement (86% for V vs. 78% for P) and generalized kappa (0.22 for V vs. 0.13 for P). While manually examining the P and V datasets, we observed lower quality of the paid workers' classification of more difficult content. For example, the tweet "GW is fact but Sandy is hardly proof. Poor logic . . . Sandy confirms the obvious impact of global warming ...." was (correctly) classified as having a positive attitude towards the existence of global warming by 83% of the volunteer workers vs. 56% of paid workers. Further, only one out of 17 (6%) of the volunteer raters classified the attitude as negative vs. 20% of the paid workers.
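To illustrate why raw percentage agreement overstates reliability, the sketch below computes a pairwise agreement rate and a Fleiss-style chance-corrected kappa that tolerates a different number of raters per tweet. It is an assumption-laden illustration for single-label ratings such as attitude: it is not claimed to reproduce the exact Uebersax formulation cited in the text, and the function names are hypothetical.

```python
from collections import Counter


def pairwise_agreement(labels):
    """Fraction of rater pairs that gave the same label to one item."""
    n = len(labels)
    if n < 2:
        raise ValueError("each item needs at least two ratings")
    agreeing = sum(c * (c - 1) for c in Counter(labels).values())
    return agreeing / (n * (n - 1))


def kappa_unequal_raters(items):
    """Chance-corrected agreement for items with varying numbers of raters.

    items is a list of label lists, one per tweet. Observed agreement is the mean
    pairwise agreement; expected agreement is the sum of squared overall label shares.
    """
    p_obs = sum(pairwise_agreement(x) for x in items) / len(items)
    totals = Counter(label for x in items for label in x)
    n_all = sum(totals.values())
    p_exp = sum((c / n_all) ** 2 for c in totals.values())
    return (p_obs - p_exp) / (1 - p_exp)


# Toy example: three tweets with 4, 5, and 3 ratings
toy = [[1, 1, 1, 0], [0, 0, 0, 0, 1], [1, 1, 0]]
print(round(kappa_unequal_raters(toy), 3))
```

A kappa near zero means the raters agree about as often as chance alone would produce, even when the raw percentage agreement looks high.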

Crowdsourced vs. Expert Classification Quality
Tweet classification was validated by comparison with expert classification (dataset E of 103 tweets). We found consistently better performance from the volunteer workers, as exhibited by a higher correlation between the V and E datasets (mean r = 0.40), as compared to the P and E datasets (mean r = 0.29); see Table 1. The difference was statistically significant (p < 0.05). Similarly, the mean Sørensen-Dice distance between the topic classification vectors was lower for the V vs. E datasets (0.47), as compared to the P vs. E datasets (0.36). The majority consensus method to extract the "true" classification from redundant ratings provided equally high quality results for both paid and volunteer workers, with an accuracy (fraction of matches with the E dataset) of ~0.8 for topic, and ~0.7 for sentiment classification (Table 2). The acceptable "realistic" agreement between human coders, as measured by an accuracy coefficient, may vary between 0.70 and 0.79 [41], as evidenced, e.g., by an analysis of Amazon MTurk data [42]. Overall, we conclude that the lower work quality of an average individual paid worker is mitigated by quality control based on massive redundancy, so that using volunteer workers has no data quality benefits over paid workers.
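The two validation measures used above can be sketched as follows. The Sørensen-Dice coefficient is computed on sets of topic labels, and the distance is taken as one minus the coefficient; that convention is an assumption, since the text does not give the formula. Accuracy is the fraction of tweets whose consensus label matches the expert label. All function names and the toy data are hypothetical.

```python
def dice_coefficient(a, b):
    """Sørensen-Dice coefficient between two sets of topic labels."""
    if not a and not b:
        return 1.0  # two empty label sets are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def dice_distance(a, b):
    """Distance form, assumed here to be one minus the coefficient."""
    return 1.0 - dice_coefficient(a, b)


def accuracy(consensus, expert):
    """Fraction of expert-labeled tweets whose consensus label matches the expert."""
    shared = consensus.keys() & expert.keys()
    if not shared:
        raise ValueError("no tweets in common")
    return sum(consensus[t] == expert[t] for t in shared) / len(shared)


worker_topics = {"impacts", "science"}
expert_topics = {"science", "policy"}
print(dice_coefficient(worker_topics, expert_topics))

crowd = {"t1": 1, "t2": 0, "t3": -1, "t4": 2}
gold = {"t1": 1, "t2": -1, "t3": -1, "t4": 2}
print(accuracy(crowd, gold))
```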
Table 2. Matching expert (E) and majority consensus of volunteer (V) and paid (P) worker classifications (classification accuracy) for the full dataset and the subsample used for groundtruthing. Note the higher redundancy rate for P workers as compared to V workers (every tweet was independently processed by 26 and 14 workers on average, respectively). Refer to Figure 1 to compare classification accuracy for the same redundancy rate.
To estimate the effect of crowdsourcing redundancy on classification quality, we repeatedly reduced the redundancy level in the V and P datasets by limiting the maximum number of classifications of a single tweet. The maximum redundancy level for tweets was reduced from 19 for the V dataset and 30 for the P dataset, down to zero. In effect, this emulated designs in which each tweet was analyzed by a progressively smaller number of workers. To estimate the uncertainty arising from variability in the workers' quality, we performed 10 permutations, each time removing a respective number of randomly selected classifications. The results (Figure 1) showed the quality of majority consensus classification falling faster for the paid as opposed to volunteer workers: e.g., a 70% match between a crowdsourced and expert classification was on average achieved by 12 paid workers vs. just four volunteers.
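The redundancy-reduction experiment can be emulated with a short simulation: cap the number of ratings kept per tweet at k, take the majority vote, score it against the expert label, and average over several random draws. This is a sketch under assumptions rather than the study's procedure: the function names, the fixed seed, and the tie-breaking rule are hypothetical, and k starts at one because a majority vote needs at least one rating.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label; ties go to the label drawn first (an assumption)."""
    return Counter(labels).most_common(1)[0][0]


def accuracy_at_redundancy(ratings, expert, k, repeats=10, seed=0):
    """Mean accuracy of the majority vote when each tweet keeps at most k random ratings.

    ratings maps a tweet id to its list of worker labels; expert maps a tweet id
    to the expert label. repeats mirrors the 10 random draws described in the text.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        hits = 0
        for tweet, truth in expert.items():
            labels = ratings[tweet]
            sample = rng.sample(labels, min(k, len(labels)))
            hits += majority(sample) == truth
        scores.append(hits / len(expert))
    return sum(scores) / len(scores)


# Toy data: a noisy crowd where most, but not all, ratings match the expert
ratings = {"t1": [1] * 7 + [0] * 3, "t2": [0] * 6 + [1] * 4, "t3": [-1] * 9 + [0]}
expert = {"t1": 1, "t2": 0, "t3": -1}
for k in (9, 5, 3, 1):
    print(k, round(accuracy_at_redundancy(ratings, expert, k), 2))
```

Plotting the returned accuracy against k for each platform would give curves of the kind the text attributes to Figure 1.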

Geographical Variability
We used each worker's computer internet protocol (IP) address to determine the worker's country of residence. For paid and volunteer crowdsourcing alike, the majority of workers and the majority of completed tasks originated from the U.S., but the overall geographical distributions were very dissimilar (Table 3). Almost 95% of all paid workers and over 95% of their completed tasks came from just two countries, the U.S. and India, with the next country, the U.K., contributing less than 0.5% of completed tasks. In contrast, 95% of volunteer workers came from 16 countries, and 95% of tasks were completed in eight countries. The highest percentage of completed tasks came from the U.S. (64%), followed by the U.K. (13%), with India representing only 1% of completed tasks (Figure 2).
Table 3. Geographical distribution of volunteer (Nv = 127) and paid (Np = 574) workers and their completed tasks (Nv = 8198 and Np = 14860) as a percentage of the total. The top 10 countries included in the table represent 78% of volunteer workers and 97% of their completed tasks. For paid workers, the table represents 97% of workers and 97% of completed tasks.
In terms of data quality, we found significant geographical differences in the P dataset (Table 4), e.g., the crowdsourced topic classification matched the expert one in 80% of the U.S. subsample, but only in 22% of the India subsample. Interestingly, we did not find a similar effect for the V dataset (Table 4). Manual examination of data originating from the IPs located in India showed multiple misclassifications. For example, the tweet "Global warming is a lie!!! Proof: Step outside!!! Brrrrr!" was mapped as a climate change impact on weather, on environment and on society. Similarly, the tweet "The End of an Illusion or no global warming . . ." was misclassified as global warming drivers, science and impacts on weather. On average, one tweet was classified into 2.2 categories in the India subset vs. 1.78 categories in the U.S. subset.
Exclusion of the ratings from IPs originating in India improved classification quality and, consequently, allows the redundancy level to be greatly reduced (Figure 3). For example, when the India subset was excluded from the data, reducing the dataset by 21%, a 70% match between a crowdsourced and expert classification was achieved on average by six paid workers vs. the 12 paid workers required for the entire dataset.
Table 4. Fraction of majority consensus volunteer (V) and paid (P) worker classifications matching the expert classification, for the two top paid crowdsourcing countries (U.S. and India) and for all other countries. The India V sample for volunteer classification was too small (<1%) to allow comparisons. The individual samples for countries other than India and the U.S. are too small to allow a comparison.

Discussion
The Amazon MTurk best practice guide [38] recommends redundant data processing as a tool to improve the accuracy of the obtained results. However, employing a massively redundant research design is costly; therefore, some researchers have used the majority consensus method with as few as three redundant ratings. Snow et al. [43] found that expert-quality evaluation is already achieved with N = 4 classifications per item. However, we found that for harder-to-process questions related to science, much higher redundancy (N 10) is required for paid workers. Although only the best quality Amazon MTurk workers were selected (HIT approval rating ≥ 95%), the performance of the paid workers was still inferior to the performance of volunteers. Consequently, we found that for a particular task, the same accuracy level can be achieved with 12 paid workers as with only four volunteers. The associated cost increase may be prohibitive for many scientific projects, which makes volunteer crowdsourcing an attractive alternative. The downside of volunteer crowdsourcing is that it requires a much longer time to complete the project. In our case, Amazon MTurk processing was completed in five days, with most of the time taken up with validation of the already processed data. The Citizen Scientist platform processing took one year; on average, ~600 tweets per month were processed. We also found that interaction between the scientists and volunteers was required to keep the public interested in donating their time to the project.
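As an illustration of how accuracy can be traced as a function of redundancy N, the following minimal Python sketch computes the share of items whose N-worker majority consensus matches the expert label. The data layout, the function names (majority_label, match_fraction), the sample labels, and the treatment of ties as mismatches are assumptions made for this example; it is not the procedure used in the study.

```python
from collections import Counter
from itertools import combinations


def majority_label(labels):
    """Return the most common label, or None when the top count is tied."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no majority consensus
    return counts[0][0]


def match_fraction(ratings, expert, n):
    """Mean share of items whose n-worker consensus equals the expert label.

    ratings: dict item_id -> list of worker labels (hypothetical layout)
    expert:  dict item_id -> expert label
    Every size-n subset of each item's ratings is evaluated, so the result
    is deterministic; ties count as mismatches. Assumes n does not exceed
    the number of ratings available for any item.
    """
    per_item = []
    for item, labels in ratings.items():
        subsets = list(combinations(labels, n))
        hits = sum(majority_label(s) == expert[item] for s in subsets)
        per_item.append(hits / len(subsets))
    return sum(per_item) / len(per_item)


# Hypothetical attitude codes for two tweets, three ratings each.
ratings = {"t1": [1, 1, -1], "t2": [2, 1, 2]}
expert = {"t1": 1, "t2": 2}
for n in (1, 2, 3):
    print(n, round(match_fraction(ratings, expert, n), 3))
```

With large rating pools, enumerating every subset becomes impractical, and random subsampling of the ratings would be a natural substitute.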
To extract the "true" classification from the large redundant pool returned by the crowdsourced workers, we used the most popular and simple method of majority vote. Multiple algorithms have been proposed to reduce the "noise" originating from workers' inaccuracy (e.g., [44,45]). Applying data cleaning methods that assign a dataset-specific quality rating to each of the paid workers helps to reduce the required redundancy. For example, Dawid and Skene [35] suggest that workers should be weighted based on the deviation of their scores from the mean; the contribution from low quality workers should then be discarded or used with a lesser weight. Ipeirotis et al. [46] demonstrate that separating workers' error rates into true errors and systematic bias leads to a significant improvement in classification, and suggest that as long as each worker processes a large number of assignments (at least 20), the redundancy can be kept at five iterations per task without significant quality deterioration. In practice, we found that the majority (90%) of the paid workers accepted just one or two task bundles (20-40 tweets), which made these quality control methods only marginally applicable. This difference in the number of tweets classified by the paid workers and volunteers may also partially explain the difference in work quality. Indeed, assuming that a higher number of samples processed by a worker leads to better training and hence better quality on the subsequent tasks, volunteer workers would outperform paid ones.
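The difference between a plain majority vote and a vote weighted by worker quality can be sketched as follows. This is a deliberately simple agreement-based weighting, assumed here for illustration only; it is not the Dawid and Skene estimator nor the method of Ipeirotis et al., and the tuple layout, function names, and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Plain majority vote; ties resolve to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


def worker_weights(ratings):
    """Weight each worker by the share of items on which the worker
    agrees with the unweighted majority consensus.

    ratings: list of (worker_id, item_id, label) tuples.
    """
    by_item = defaultdict(list)
    for worker, item, label in ratings:
        by_item[item].append(label)
    consensus = {item: majority_vote(labels) for item, labels in by_item.items()}

    agree, total = Counter(), Counter()
    for worker, item, label in ratings:
        total[worker] += 1
        agree[worker] += label == consensus[item]
    return {worker: agree[worker] / total[worker] for worker in total}


def weighted_vote(ratings, weights):
    """Sum worker weights per label and keep the heaviest label per item."""
    scores = defaultdict(Counter)
    for worker, item, label in ratings:
        scores[item][label] += weights[worker]
    return {item: c.most_common(1)[0][0] for item, c in scores.items()}


# Hypothetical ratings: worker "c" disagrees with the consensus throughout.
ratings = [
    ("a", "t1", 1), ("b", "t1", 1), ("c", "t1", -1),
    ("a", "t2", 2), ("b", "t2", 2), ("c", "t2", 1),
    ("a", "t3", -1), ("b", "t3", -1), ("c", "t3", 1),
]
weights = worker_weights(ratings)
print(weights)
print(weighted_vote(ratings, weights))
```

When each worker rates only 20-40 tweets, weights of this kind rest on very few observations, which is consistent with the limited applicability noted above.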
The task completion time presumably measures each worker's thoughtfulness and hence may be another measure of work quality; indeed, Snow et al. [43] found that per-hour pay encouraged the workers to spend twice as much time processing each task and returned more accurate results than per-task pay. However, we did not find a significant correlation between the task completion time (min 101 s, median 571 s, mean 708 s, max 6093 s) and accuracy. We also noticed that the performance of the fastest and the slowest workers tended to be poor.
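One possible reading of these two observations is that a linear correlation coefficient cannot detect a relationship in which both extremes perform poorly. The sketch below assumes a Pearson coefficient (as used in Table 1) and entirely hypothetical completion times and accuracies; the function name pearson_r is likewise an assumption of this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical workers: the fastest and the slowest are the least accurate.
times_s = [100, 500, 900]
accuracy = [0.5, 0.8, 0.5]
print(pearson_r(times_s, accuracy))
```

In this constructed case the coefficient is zero even though accuracy clearly depends on completion time, so the absence of a linear correlation does not by itself rule out a non-monotonic relationship.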
Another quality management strategy is to utilize a worker reputation system to employ only the workers with approval ratings above a certain pre-set level [33]; commonly, a 90-95% rating is used. We, however, speculate that workers' reputation might not be a very reliable indicator of their performance. The proliferation of online rating systems means that the workers have become highly motivated to protect their online reputation. In a handful of cases, we had to reject incomplete tasks; subsequently, we received complaints and threats to blacklist us as bad requesters. Given the time and effort required to follow up on requests from unsatisfied workers and the low cost of individual tasks, there is a strong incentive to avoid a dispute and comply with workers' requests, which artificially boosts the approval ratings of workers.
The workers participating in our study were on average earning ~$2/h, which is similar to average MTurk earnings. It is possible that a higher pay rate would return better quality results; however, Gillick and Liu [47] hypothesized that lower compensation might attract workers who are less interested in monetary rewards and hence spend more time per task. Having read the online discussions of the MTurk workers, we also noticed that they associate an unusually high pay rate with possible fraud and recommend abstaining from taking such HITs.
In our study, similarly to other research [48], the overwhelming majority (95%) of paid workers came from the U.S. and India. This is not surprising, since the Amazon MTurk workers from other countries are unable to transfer their earnings to a bank account [49]. We found that discarding results from workers outside the U.S. significantly improved data quality and hence reduced the required redundancy of the design; we did not find a similar effect for the volunteer workers. Note that the geographical distribution of the volunteer and paid workers was very different; the volunteer workers came predominantly from the countries with an active public discussion of climate change on Twitter and a high level of Twitter penetration. For example, the daily number of English language tweets originating from the U.S. is ~30 times higher than that for India, but only three times higher than that for the U.K. [2]. We therefore speculate that the main reason for the low quality of the India data was insufficient familiarity of the workers with climate change discourse in general. Consequently, geographical worker selection may be an important factor to consider in order to improve the quality of results.
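Geographical selection of this kind amounts to filtering the ratings by worker location before the consensus is computed. A minimal sketch follows, assuming each rating carries a country code inferred from the worker's IP address; the tuple layout, the function name consensus_excluding, and the sample ratings are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_excluding(ratings, excluded_countries):
    """Majority consensus per item after dropping ratings from given countries.

    ratings: list of (item_id, country_code, label) tuples.
    Items left with no ratings after filtering are omitted from the result;
    ties resolve to the label encountered first.
    """
    by_item = defaultdict(list)
    for item, country, label in ratings:
        if country not in excluded_countries:
            by_item[item].append(label)
    return {
        item: Counter(labels).most_common(1)[0][0]
        for item, labels in by_item.items()
    }


# Hypothetical ratings of one tweet from three workers.
ratings = [("t1", "US", 1), ("t1", "IN", 2), ("t1", "IN", 2)]
print(consensus_excluding(ratings, set()))
print(consensus_excluding(ratings, {"IN"}))
```

Filtering reduces the number of ratings per item, so the gain in agreement has to be weighed against the smaller pool from which the consensus is drawn.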

Conclusions
The purpose of this research was to compare the quality of volunteer and paid workers' classification of Twitter messages on climate change. We found lower accuracy of data returned by paid crowdsourced workers as compared with volunteer workers, while the latter required significantly longer time to complete. Consequently, a similar accuracy of processed data was achieved with paid workers only with a higher design redundancy, which raised costs. While conventional methods of accuracy improvement were largely unsuccessful due to the long-tail distribution of processed tasks per worker, limiting the workers' pool to those located in the U.S. significantly improved paid workers' data quality, making it only slightly lower than the volunteers' performance. Therefore, geographical location is an important factor for worker selection. We suggest that study designs consider limiting the workers' pool to those countries where the research topic is actively discussed by the public.
The study has several limitations that might have an impact on its generalizability. While climate change is discussed worldwide, the framing of its various topical aspects could differ depending on the country, thus potentially affecting classifications by the raters. At the same time, we speculate that topical aspects with little difference in framing could yield smaller geographical differences in processing quality. Another limitation concerns the use of the simplest, but also the most common, "majority filter" for error correction; more advanced methods of error correction might return more precise results. Finally, despite our efforts to make the online interface for paid and volunteer workers as similar as possible, the differences in technical configuration between crowdsourcing platforms prevented us from designing a completely identical interface for the two web sites. These limitations should be addressed in further research.

1: accepting that CC exists and/or is man-made and/or can be a problem (How's planet Earth doing? Take a look at the signs of climate change here)
2: extremely supportive of the idea of CC (Global warming? It's like earth having a Sauna!!). Think of code 2 as though it is code 1 plus a strong emotional component and/or a call for action
−1: denying CC (UN admits there has been NO global warming for the last 16 years!) or denying that CC is a problem or that it is man-made (Sunning on my porch in December. Global warming ain't so bad.)
−2: extremely negative attitude, denial, skepticism ("Climate change" LOL) (Man made GLOBAL WARMING HOAX EXPOSED). Think of code −2 as though it is code −1 plus a strong emotional component.

Figure 1. Fraction of matching classifications of tweets' topics (A) and attitude (B) as a function of crowdsourcing redundancy. The expert (E), majority consensus volunteer (V), and paid (P) worker datasets are compared. Area boundaries show the best and the worst estimates and solid lines show the mean estimates (see the text for explanation).

Figure 2. Percentage of classified tweets for the volunteer (A) and paid (B) workers.

Figure 3. Fraction of matching classifications of tweets' topics (A) and attitude (B) as a function of crowdsourcing redundancy. The entire expert (E), majority consensus volunteer (V), and paid (P) worker datasets are compared to subsets of the data that exclude India (see the text for explanation).

Table 1. Pearson's correlation between the crowdsourced volunteer (V) and paid (P) worker classifications and the expert (E) classification of the tweets. The columns represent classification of the topics 1-10 found in Section 3 and of the attitude (A).
Classify each tweet using the ten categories below. If you think that a tweet belongs to multiple categories, you may use up to three categories. If you cannot find any suitable category, leave the cells empty. The categories are in bold.
• The scientists found that climate is in fact cooling
• IPCC said that the temperature will be up by 4 degrees C