Many aspects of everyday life are moving into the digital sphere and becoming more reliant on the digitalization of society (Marres 2017
). The digital traces and footprints human beings leave behind allow for the recording of their activities in the constantly growing realm of big data (BD) (Lazer and Radford 2017
). New analytic tools for large-scale data analysis, such as artificial intelligence and machine learning, allow us to extract information from data where previously nothing of value could be found. In combination with these new statistical modeling techniques, BD may enable advances in many practically important areas, such as the detection of cancer in patients from biometric data (Barrett et al. 2013
). Large amounts of data are especially useful for making predictions, be it forecasting the weather, epidemics, or road traffic (Askitas and Zimmermann 2015
; Lazer et al. 2009).
BD may be described along the three “classical” characteristics of volume, variety, and velocity (Salganik 2018
; Dumbill 2012
), but also has value as a fourth characteristic (Gantz and Reinsel 2011
). Regarding volume, there has been a vast increase in the amount of data that has become available electronically over the past two decades due to the rise of personal use of information technology and, especially, the internet. In social media, for example, Facebook and Instagram each have more than two billion active users, and Twitter, a micro-blogging service, has over 300 million. All users produce content by posting text, pictures, and videos. By interacting with other users’ content, connections and relationships are established, which are again stored as data. Search histories and browsing behavior leave digital traces in the online world, which in turn become data of their own kind, potentially providing valuable insights into the attitudes and behaviors of those online. The rise of smartphones has accelerated this development even further. This vast volume of data, which reflects our lives online but also offline, is matched by the speed at which these data are stored, managed, and used for other purposes. Accordingly, the high-velocity
aspect of BD is a result of the online nature of the aforementioned services, providing instant access to people all around the world at any time. To be able to provide a smooth user experience, the processing of large amounts of data in real-time is necessary. In addition, for commercial agents, speed in transmitting information and analyzing patterns often is a key advantage over competitors (e.g., in stock market trading), making BD the basis of a new wave of business models (Hartmann et al. 2016
). Managing such volumes of data while people access them around the clock requires hardware and software resources on a grand scale. Accordingly, “‘Big Data’ originally meant the volume of data that could not be processed (efficiently) by traditional database methods and tools” (Kaisler et al. 2013
). This makes BD a relational concept whose understanding depends on the state of software and hardware development.
BD is not a singular entity. Rather, it comes in many different forms and flavors and encompasses data from a wide variety
of sources such as internet connections, search engines, website usage, chat forums, email and messaging services, wearables such as fitness tracking devices, but also video portals, digital libraries, and the simple everyday usage of computers, smartphones, tablets, and internet browsers. As Lazer and Radford (2017) observe: “There are many discrete literatures around different BD sources, and even a complete list of those literatures would soon be obsolete.” In short, the variety of BD is huge and may even be growing.
BD has become somewhat of a buzzword in many areas of society as well as in the natural sciences. Often it is called the “oil of the 21st century” (Rotella 2012
) to describe its importance as a raw material and a basic element of the current economy. In contrast, social scientists and sociologists have been comparatively slow in using BD for their research: “The large majority of sociologically relevant analysis of big data is done by computer scientists, and there is relatively little reflection of the big data revolution in top sociology journals. For example, only 6 of 182 articles published between 2012 and 2016 in the American Journal of Sociology (AJS) and 9 of 240 in American Sociological Review (ASR) involve the use of big data” (Lazer and Radford 2017
). This has changed somewhat in recent years, however, as there are increasing activities to support research in this area, such as special issues in journals, workshops, tutorials, summer schools, and conferences. The first textbooks on the matter have appeared, as well as specialized journals and even whole research institutes dedicated to BD in the social sciences (Salganik 2018
; Veltri 2020
; Marres 2017
; Foster et al. 2017
). All this activity will likely result in a steep upturn in research in this area and increase the usage of web-sourced data for social research.
While data from online sources share many aspects with traditional data sources such as survey data or social book-keeping data (Graeff and Baur 2020
), they also exhibit features that are new, if not in principle, then at least in the scope of their occurrence. The variety of data forms, such as text, pictures, and videos associated with tags, geo-codes, paradata, and metadata, is part of the reason why such data appear so appealing to social researchers (Evans and Foster 2019
). Such data offer new opportunities to address research questions that researchers have not been able to answer previously (Ruths and Pfeffer 2014
). For example, BD often is the documentation of actual (online) behavior rather than just the (self-) representation of such behavior as collected through interviews or similar methods. Quite often, this is what social scientists strive to study and to explain: social interactions and behavior. For example, wearables such as fitness trackers allow the tracking of one’s physical condition much more accurately and, most importantly, are much easier and more cost-effective than questionnaires or interviews. However, these new data forms pose new questions in the handling and analysis of these kinds of data (Shah et al. 2015
). One of these questions concerns the ethical treatment of such data: whether we need new rules and guidelines for conducting research because of these differences from traditional forms of data (Weinhardt 2020
). This article discusses some ethical concerns in the use of big data for social research informed by considerations related to sociology and neighboring disciplines.
1.1. BD for Social Research
BD has now been employed in the social sciences in various ways. Twitter is often used to investigate political discourse, for example to show how anti-immigration laws shape public sentiment (Flores 2017
). Public discourse may also be studied by combining internet search data, social media postings, and digitalized newspaper articles (Vasi et al. 2015
). Online platforms also serve as a major source of data, such as in the investigation of the effect of employment histories on the chances of getting a job using data from a US freelancer website (Leung 2014
). Online dating sites may be used to research the structural effects of race, gender, and education on personal interactions and mating behavior. Electronic communication may also be a key source of data. Goldberg et al.
) used a large corpus of emails among employees in a high-technology firm to look into the effects of cultural and structural embeddedness in personal networks. Digital administrative data may be used to investigate racial profiling (Legewie 2016
). These examples show the variety of research questions that may be addressed using a wide range of different data sources of BD in the social sciences.
Important advantages for social research practices follow directly from the sheer amount of data that become available for analysis. Databases used for analyses in the social sciences are usually rather small compared to the amount of data that the tech giants around the globe have to manage on a daily basis (ranging from tens of thousands to maybe several billion entries). Still, the above-mentioned definition of “big” also holds for the social sciences, as datasets with millions of entries and thousands of variables typically could not be handled by standard software packages in earlier days (a problem that has mostly been resolved as new versions of software have become available). Growth in the volume
of available data is so enormous that it may well be regarded as a change in quality. For example, almost by definition, sample size is not an issue when using BD (on the contrary, the meaning of p-values changes completely if datasets contain millions of entries, as even negligible differences become statistically significant). The “census-like” quality of BD allows us to draw connections where it was not possible before. Together with the still increasing capacities of modern computers and ever more sophisticated methods of analysis, this allows for the detection of patterns among data that we were previously unable to identify (Foster et al. 2017
). It holds huge benefits for social network analyses, for example, as it becomes possible to analyze whole network structures, something that was virtually impossible to do before (Stopczynski et al. 2014
). Twitter and Wikipedia are examples often used in the analysis of many different kinds of networks (Miller 2011).
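The remark above on p-values in datasets with millions of entries can be illustrated with a short calculation. The following is a minimal sketch, not an analysis from the literature cited here: it assumes a two-sample z-test with equal group sizes and a known common standard deviation, and the effect size of 0.01 standard deviations is a hypothetical value chosen only for illustration.

```python
import math


def two_sided_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value of a two-sample z-test.

    Assumes equal group sizes and a known, common standard deviation.
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # P(|Z| > |z|) for a standard normal Z, expressed via the
    # complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical, substantively negligible difference of 0.01 standard deviations.
MEAN_DIFF = 0.01
SD = 1.0

for n in (1_000, 100_000, 10_000_000):
    p = two_sided_p_value(MEAN_DIFF, SD, n)
    print(f"n per group = {n:>10,}  p = {p:.3g}")
```

The effect size is held constant while only the sample size grows, so the shrinking p-value reflects the amount of data rather than any substantive change. With BD, effect sizes and their practical relevance arguably deserve more attention than significance tests alone.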
The variety of BD as discussed above is perhaps its greatest benefit for social research. There are rich bodies of observational data, digital trace data, textual data, and of course pictures, audio recordings, and videos that enable us to study social phenomena from different perspectives and potentially allow deep insights into many aspects of everyday life. From the observation of actual (online) behavior to the deduction of personality traits, political leanings, and emotional states from Facebook postings, it seems that a wide range of social science questions can be tackled using BD. This is all the more true as different data sources may be combined to capture a comprehensive view of human interactions in online settings and beyond. By combining many data points from many different sources and different types of data, it becomes possible to gain insights that were previously out of reach. Furthermore, BD is not only about present-day information from the internet age. As more and more archives (such as newspapers and libraries) digitize their collections, and as projects such as Google Books and Project Gutenberg digitize millions of books, BD begins to reach into (recent) history as well. Hence, using BD, it is possible not only to study current developments but also to look back in time and trace things that have evolved over a long period.
Due to its variety, BD may be used to study wide-ranging social phenomena, from individual behavior and attitudes to groups and organizations. It may be particularly useful for the study of human interactions, as many of the associated data forms, such as call records and chat protocols, likes and posts on social media, ratings on websites, and comments below newspaper articles, are exactly that, depictions of interactions in a digital format. Such interactions may be viewed as links and nodes within social networks. Thus, BD holds huge potential for the study of networks as well. In a similar vein, it may benefit international, trans-national, and cross-cultural research. Many online services are available internationally and generate data from different countries and cultures that may be used to study similarities and differences.
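How interaction records translate into the links and nodes of a network can be sketched in a few lines. The example below is an illustration under simplifying assumptions: each interaction is reduced to a hypothetical (sender, receiver) pair, ties are treated as undirected, and the user labels are invented placeholders rather than real data.

```python
from collections import defaultdict


def build_network(interactions):
    """Aggregate (sender, receiver) records into undirected, weighted edges.

    Returns a dict mapping a sorted node pair to the number of interactions.
    """
    edges = defaultdict(int)
    for sender, receiver in interactions:
        if sender == receiver:
            continue  # ignore self-interactions such as replies to oneself
        edge = tuple(sorted((sender, receiver)))
        edges[edge] += 1
    return dict(edges)


def degree(edges):
    """Number of distinct interaction partners per node."""
    partners = defaultdict(int)
    for node_a, node_b in edges:
        partners[node_a] += 1
        partners[node_b] += 1
    return dict(partners)


# Hypothetical interaction log, e.g. likes or replies between users.
log = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "c")]
network = build_network(log)
print(network)
print(degree(network))
```

Treating ties as undirected and counting repeated contacts as edge weights is only one of several possible design choices; directed or time-stamped representations may suit other research questions better.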
Velocity, it may be argued, is the BD feature with the least impact on social science research, as real-time processing is usually not needed to answer scientific questions per se. Still, it may be useful for dissemination activities, policy evaluation and recommendations, and for raising public awareness and the societal impact of social research. While not the primary purpose of scientific inquiry, these aspects are valuable in their own right and also feed back into the sciences, as they help to legitimize their cost and importance.
1.2. Critical Perspectives on BD
Despite these potential benefits, it is not yet clear whether the new “data scape” available to social researchers really advances the possibilities of social research to a significant extent (Ekbia et al. 2015
; Crawford 2013
; Lazer et al. 2014
). Social scientists have long studied the quality of the data they use, and this experience has taught them that data are often not merely an objective representation of reality but constructed artifacts that may be framed and biased in certain ways (Baur et al. 2020
). These assertions also hold for BD, which is never encountered as “raw” material, but rather as data shaped and formed by, among other things, the technological constraints and economic preferences of the providers of such data. By now, several frameworks for assessing the quality of BD based on such insights have been proposed. An obvious case in point is the fact that, almost by definition, BD is a byproduct of some data-generating activity that is not specifically designed to address scientific questions (such as online trace data, cf. Lazer et al. 2009
). Hence, they share the problem with other, more traditional types of large-scale data, as Merton already observed 50 years ago: “a circumstance which regularly confronts sociologists seeking to devise measures of theoretical concepts by drawing upon the array of social data which happen to be recorded in the statistical series established by agencies of the society—namely, the circumstance that these data of social bookkeeping which happen to be on hand are not necessarily the data which best measure the concept” (Merton 1968, p. 219
). This problem certainly also applies to BD. In addition to these issues of data quality, there is a huge danger of selectivity in BD, because people who are online and actively use a specific service are very different from those who are offline and do not use it. Those who are online may also be very selective about the information they share. While some may be very active in posting social and political commentary on Facebook, others solely like items of pop culture or impressions of everyday life. Hence, what is available about whom is heavily skewed, and the truthfulness of the information provided in such online settings is yet another matter of debate. It is, therefore, safe to say that the early hopes about the end of theory, where “with enough data, the numbers speak for themselves” (Anderson 2008, p. 8
) have not materialized yet and probably never will.
As BD is not only quantitative but also qualitative in nature, this in theory allows for the study of subjective meanings as in qualitative research but on a much greater scale (Fuhse et al. 2020
). However, the extraction of meaning from large bodies of text via algorithms still seems to be a challenging prospect (cf. Weichbold et al. 2020
). It is still very difficult to use automated sentiment analyses to infer even simple emotional inclinations from online text snippets. In addition, artificial intelligence still struggles with, for example, the moderation of online content and its screening for breaches of netiquette and hate speech in online forums, as human irony and similar patterns of ambiguity are very difficult to detect correctly via algorithms. Still, despite these difficulties, given the value BD may add as a source of information about social life, BD should find its place in the toolbox of social researchers alongside more traditional forms of data. When they use BD for their research, however, researchers should also draw on the knowledge that social scientists have accumulated over decades about the particularities of the social world (Mützel 2015).
2. Ethical Concerns in BD Research
The research ethics of BD are somewhat different from general ethical concerns about the use of BD in society. While there is a lively debate about the ethics of BD in general, the literature specifically on the research ethics of BD is still scarce (Moreno et al. 2013
). Historically, ethical guidelines for research have been especially important and prominent in epidemiology and the health sciences. In this context, it was established that so-called human subject research needs to adhere to certain ethical principles (Metcalf and Crawford 2016
). Two main principles are that research subjects must not be harmed and that participation in scientific studies must be voluntary (Hoyle et al. 2002
). From this, it follows that research participants must give their consent to participate in a study and scientists must provide sufficient information about the study so that potential participants can make a reasonable decision on their participation (Keller and Lee 2003
). Consequently, “informed consent” is a major topic in research ethics, especially surrounding social research online (Froomkin 2019
), as are the privacy concerns of individuals. Many other topics deserve attention but are not covered here, such as the question of proper risk assessment, proper procedures to protect privacy (e.g., privacy-preserving record linkage, Vatsalan et al. 2013
), data sharing and archiving (e.g., Zimmer 2010
; Borgman 2012
; Bishop and Gray 2017
), and the issue of data ownership (Politou et al. 2018
; Ruppert 2015
). In the following, I briefly discuss four issues that feature less often in discussions of research ethics for the use of BD. I start by outlining the link between the value of BD and the risk involved for individual citizens.
2.1. Value and Risk
According to Gantz and Reinsel (2011), there is a fourth defining characteristic of BD: its value. The actors for whom different BD applications are valuable vary widely, from private consumers to state actors and criminals. From an ethical point of view, it is important to recognize that value and risk are inherently linked. What makes BD applications so valuable to private companies, state actors, and criminals directly poses threats to the privacy and self-determination of citizens. The following section discusses some real-world examples to highlight the risk to citizens by showing the value involved for other actors.
Many of the services that are now available online offer a great deal of value for their consumers, if only for entertainment and enjoyment. Otherwise, there would not be this massive new amount of data that may be used for analyses. People use social media platforms, online retailers, or streaming services because they reduce the burdens of everyday life, ease communication with friends and family, and even bring joy and excitement. With the advent of electronic personal assistants in every smartphone or smart speaker, driven by artificial intelligence and constantly improved by machine learning algorithms (such as Apple’s Siri or Amazon’s Alexa), early claims about the power of machines to know us better than we know ourselves (Negroponte 1996
) may finally become true. This way, BD helps to “enhance” the online experience of consumers, based on our stated preference as expressed online through likes or previous consumption histories. Netflix, for example, an online TV streaming service, is known for its data-driven approach to bind consumers to their never-ending stream of content. They use data mining into the viewing habits of their customers to reveal preferences that the consumers themselves may not even be aware of. This way, they can suggest additional programs and series which target specific groups, and quite critically, to produce content and whole series specifically tailored towards the taste of their customers, thereby tying them to their network with the obvious aim to stabilize and increase the stream of revenue. While this practice reduces customers’ burden in dealing with the vast amount of content online, it also represents a not-so subtle influence on the cultural self-determination of individuals.
Maybe the most prominent example of extracting monetary value from BD and personal information is tailored advertising, also known as micro-targeting (Barbu 2014
). The sale of online ad space targeted at specific consumer groups, based on what is known about their social demographics, online habits, and consumer preferences from social media profiles and search histories, generates billions of dollars in revenue for Facebook, Google, and similar companies. For this purpose, personal information is sold and traded, i.e., exploited commercially, often without the conscious awareness of those affected.
BD has other uses for commercial companies also. Employers, for example, not only use social media to find recruits via advertising and tailored targeting of potential candidates but also for screening the social media appearances of their candidates for potentially incriminating information. The misuse of such data in the employment process led the State of Maryland even “to prohibit by law employers asking for Facebook and other social media passwords during employment interviews and afterwards” (Kaisler et al. 2013
). Insurance companies are another example of actors for whom BD may prove extremely valuable, as it can help to predict the risk of certain events, from illness to car accidents, based on lifestyle preferences and online habits or even the actual way we drive our cars as these become connected to the internet and collect vast amounts of data. Credit companies may try to harvest personal online data to compute credit ratings and predict credit defaults. These examples show that the (mis)use of data may interfere heavily in people’s everyday lives, increasing loan and insurance costs or even impeding their chances of gaining employment.
BD also opens up completely new possibilities for political interference, state control, and surveillance. In the political realm, the Facebook/Cambridge Analytica scandal showed how the harvesting of personal information on social networks may be used to develop data-driven political campaigns, even if they are designed to misinform or spread falsehoods to influence election results (Confessore 2018
). The knowledge gained on the connection between social and political attitudes and preferences proved to be useful to influence people’s political preferences. This helps political campaigns in micro-targeting their political ads, but also was exploited by Russian intelligence agencies in Western elections by using bots and false accounts to spread false and misleading information to undermine unwanted and promote more desired candidates and parties, respectively. This poses grave risks not only at the individual level, but also to whole political systems.
State actors have for some time now recognized the vast potential that comes with the digitalization of life and communication. Historically, official statistics have been developed by state actors to govern their populations (Diaz-Bone 2019
) and administrations use BD for this purpose too (Thévenot 2020
). For example, database matching has always been a desire of police forces and security agencies, as they claim it helps them identify and catch criminals and terrorists. With the advent of widespread security cameras in private as well as public spaces and the developments in face recognition software, computing power, and bandwidth, the police increasingly gain the power to track almost anyone anywhere for (hopefully) legal purposes. With such means, predictive policing goes way beyond crime prevention (which may also be aided by data analyses) and becomes a real possibility as algorithms identify potential crime hotspots before crimes even occur (Williams et al. 2017a
; Egbert and Krasmann 2019
). Still, this is nothing compared to the breadth and depth of data penetration that the NSA, the US National Security Agency, routinely undertakes in its efforts to prevent terrorist attacks and other security threats (Landau 2013
). Meanwhile, Chinese authorities seem to proceed even further into the realm of full state surveillance of citizens’ private lives by introducing a social credit system that scores a wide variety of online and offline behavior, from tax fraud and parking tickets to social media posts critical of the regime (Creemers 2018
). Such an ability to link and search personal information on such a grand scale at the state level yields immense power and holds immense risks for individual freedom and liberty.
Maybe unsurprisingly, BD is very valuable for criminal activities also. There are many different ways to illegally extract money from people, through spam emails and phishing attempts but also things like ransomware attacks used for blackmailing whole institutions, where criminal hackers maliciously encrypt computers to collect a ransom. For example, in a data breach on a US adultery platform, private information of its customers was stolen and used for blackmailing campaigns (Zetter 2015
). Identity theft is another danger where people steal complete personal profiles from social media platforms to pretend they are this person, for example to apply for loans or other benefits. This way, criminals may not only extract money from those who are directly involved in a data breach of their personal information, but also from third parties where false identities are used to fraudulently extract money.
From these examples it becomes clear that there are severe risks from the (mis)use of BD by private companies, state actors, and criminals. The very features that render BD valuable for many parties at the same time pose a threat to the privacy and everyday lives of citizens around the globe (e.g., Jackson and Orebaugh 2018
). Hence, assessing such risks and preserving privacy for individuals are major issues in the ethics of BD, although not the only issues.
2.2. Anonymization and Re-Identification
To protect data privacy and the confidentiality of personal information, and to preclude the risks of re-identification of research subjects, research data are typically anonymized at various stages of the data-handling process. Where data are fully anonymized, they lose their character as personal data, and therefore informed consent is no longer necessary to handle the data. There are, however, differences in the definitions of privacy and personal information, depending on the national context of data sources and their usage. While the definition of what constitutes confidential data is somewhat open to discussion and may not be determined objectively, most data privacy law in Europe is now regulated by the new EU general data protection regulation (EU-GDPR 2016
). European data protection law provides an explicit list of certain information that the legislature considers to be sensitive and which therefore must be considered as a binding minimum for researchers working in Europe. However, full anonymization is typically not possible, at least not without huge losses of information important to the research question. This is already a relevant issue when archiving and sharing quantitative survey data. With BD, the risk of re-identification vastly increases, as the amount of data that is available on the internet becomes a big challenge for the practice of data anonymization (Bender et al. 2017
). This will become even more relevant as the possibility of linking different data sources increases (some even argue that full anonymization is becoming increasingly impossible). That de-anonymization
of research data is possible in this way with data that is already available online has been proven numerous times (Cecaj et al. 2014
; Lubarsky 2017
; Archie et al. 2018
). An early project using Facebook data released their dataset anonymized for scientific purposes only to find that the information it contained quite easily allowed for the identification of research subjects in the dataset on Facebook itself. The combination of data together with the knowledge that all subjects were students in one particular university actually made this task relatively easy (Zimmer 2010
). While this example mainly proves that the researchers had not thought carefully enough about the risks of de-anonymization, other examples show that anonymization is an increasingly difficult task. Netflix, an online service for streaming television, had publicly released a dataset of its users’ viewing habits as part of a public challenge to improve its content and service provision. A team of researchers was able to identify users by matching viewers’ preferences to ratings people provide on IMDb, an online platform for film and television reviews, which in many cases included people’s names (Narayanan and Shmatikov 2008
). Other researchers claimed to have identified Banksy, a street artist notorious for shielding his personal identity, by comparing publicly available records on local housing and voting with the known locations of the street artist's work (Hauge et al. 2016
). These examples make it clear that anonymization in the field of BD poses huge challenges. The problem is somewhat mitigated by the fact that researchers must not use research data for anything other than scientific purposes. For researchers who want to provide BD for other researchers to use, this can be ensured by the appropriate formulation of user agreements. Hence, any use of archival data for commercial purposes or for intruding on someone's privacy is per se improper, if not outright illegal. This will not deter people with criminal intent, of course, and proper assurance of anonymity in the data is still an ethical demand. However, the output of scientific research and accompanying datasets is likely not of interest to companies using data sourcing as their business model. As the data are potentially already out there, the actual risks involved in providing BD in archives specifically for researchers might be comparatively small.
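The logic behind such re-identification by linkage can be made concrete with a small sketch. It does not reproduce any of the studies cited above; it only assumes that a released dataset and a public source share some quasi-identifiers. All records, field names (`pseudonym`, `birth_year`, `postcode`, `rating`), and person labels are hypothetical.

```python
from collections import Counter


def signature(record, keys):
    """Combination of quasi-identifier values for one record."""
    return tuple(record[key] for key in keys)


def k_anonymity(records, keys):
    """Size of the smallest group of records sharing the same quasi-identifiers."""
    groups = Counter(signature(record, keys) for record in records)
    return min(groups.values())


def link_unique_records(released, public, keys):
    """Map pseudonyms to names where the match is unique in both sources."""
    released_counts = Counter(signature(record, keys) for record in released)
    public_by_signature = {}
    for person in public:
        public_by_signature.setdefault(signature(person, keys), []).append(person)

    matches = {}
    for record in released:
        sig = signature(record, keys)
        candidates = public_by_signature.get(sig, [])
        if released_counts[sig] == 1 and len(candidates) == 1:
            matches[record["pseudonym"]] = candidates[0]["name"]
    return matches


# Hypothetical "anonymized" research release: names removed, attributes kept.
released = [
    {"pseudonym": "u1", "birth_year": 1990, "postcode": "10115", "rating": 5},
    {"pseudonym": "u2", "birth_year": 1990, "postcode": "10117", "rating": 2},
    {"pseudonym": "u3", "birth_year": 1985, "postcode": "10115", "rating": 4},
    {"pseudonym": "u4", "birth_year": 1985, "postcode": "10115", "rating": 1},
]

# Hypothetical public source that carries names and the same attributes.
public = [
    {"name": "Person A", "birth_year": 1990, "postcode": "10115"},
    {"name": "Person B", "birth_year": 1985, "postcode": "10115"},
    {"name": "Person C", "birth_year": 1990, "postcode": "10117"},
]

QUASI_IDENTIFIERS = ("birth_year", "postcode")
print(k_anonymity(released, QUASI_IDENTIFIERS))
print(link_unique_records(released, public, QUASI_IDENTIFIERS))
```

Records that share their quasi-identifiers with at least one other record in the release cannot be linked unambiguously in this sketch, which is the intuition behind coarsening or suppressing such attributes before data are shared. With the rich attribute sets typical of BD, however, unique combinations are likely to be far more common than in this toy example.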
2.3. Documentation and Dissemination
Documentation of research practices as well as the dissemination of source materials and results are key principles of what has been called Open Science (Fecher and Friesike 2013
; Vicente-Saez and Martinez-Fuentes 2018
). This umbrella term describes the general idea of making all stages of scientific inquiry accessible to a broader audience, including both professional scientists and non-scientists. The general goal is to facilitate the publication and communication of scientific knowledge and scientific practices. Important aspects are the open accessibility of publications and research data, as they allow for the replication of findings by other researchers. The preregistration of research questions and hypotheses is another principle that has grown in acceptance recently and which is intended to strengthen the reproducibility of results. To make Open Science workable, general principles of data management should apply during all phases of the research process with the two imperatives of keeping stored data not only safe and secure but also retrievable and accessible. This includes planning for archiving and secondary usage of the data, ensuring that all efforts meet the FAIR principles of research data management: findable, accessible, interoperable, and re-usable (Wilkinson et al. 2016).
One might argue that the question of open data and research reproducibility is a more salient issue in BD than in other types of social research because the workflows and research results typically involved in such projects easily lend themselves to this purpose. Proper data management practices in data-driven research, which involves the heavy use of software and code, already demand the documentation of procedures and transformations to allow for collaboration among project members. From this internal process of project documentation, if implemented correctly, it is a comparatively small step to share code, data, and documentation on online platforms (such as github.com, gitlab.com, or others) for the benefit of other researchers and the wider public.
While the sharing of code and other research materials seems less problematic from an ethical point of view, sharing the actual data with a wider audience may be problematic. A sometimes neglected distinction in this regard concerns the collection and usage of data by primary researchers vs. the secondary usage of research data by other researchers (van Deth 2003). Secondary research, in contrast to primary research, is based on the dissemination of research data to a wider (scientific) audience in order to replicate earlier analyses or to answer completely new research questions. While there are examples of primary research that meet the criteria of BD, such as the personality experiment worldwide (Stillwell and Kosinski 2012), in the context of BD, we typically think of using data that has already been collected by others, i.e., instances of secondary research. If personal data is to be provided for the benefit of other researchers, efforts should be undertaken to anonymize the data, as is standard practice in current quantitative social research. Currently, the archiving and dissemination of BD used for research to other scientists is an issue that deserves more attention, as there are no standardized solutions or archival practices established yet, although a discussion around this issue has started to emerge (Williams et al. 2017b). As documentation and dissemination become a cornerstone of scientific endeavor, the question of anonymization becomes an even more pressing issue.
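To make the anonymization step more concrete, the following is a minimal sketch of one common pre-release check, k-anonymity, written in Python. It is an illustration under stated assumptions rather than a procedure proposed in this article: the field names (`name`, `age_group`, `region`), the choice of quasi-identifiers, and the threshold `k` are all hypothetical, and passing such a check does not by itself guarantee that a BD data set is safe to share.

```python
from collections import Counter
from typing import Any, Iterable, Mapping, Sequence


def drop_direct_identifiers(
    records: Iterable[Mapping[str, Any]], direct_identifiers: set
) -> list:
    """Return copies of the records without directly identifying fields."""
    return [
        {key: value for key, value in record.items() if key not in direct_identifiers}
        for record in records
    ]


def group_sizes(
    records: Iterable[Mapping[str, Any]], quasi_identifiers: Sequence[str]
) -> Counter:
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )


def is_k_anonymous(
    records: Iterable[Mapping[str, Any]], quasi_identifiers: Sequence[str], k: int
) -> bool:
    """True if every combination of quasi-identifier values occurs at least k times."""
    return all(size >= k for size in group_sizes(records, quasi_identifiers).values())


if __name__ == "__main__":
    # Hypothetical example records; not taken from any real data set.
    raw = [
        {"name": "A", "age_group": "20-29", "region": "North", "posts": 12},
        {"name": "B", "age_group": "20-29", "region": "North", "posts": 3},
        {"name": "C", "age_group": "30-39", "region": "South", "posts": 7},
    ]
    released = drop_direct_identifiers(raw, {"name"})
    print(is_k_anonymous(released, ["age_group", "region"], k=2))
```

In this sketch, a data set that fails the check would need further coarsening or suppression of quasi-identifiers before release; which attributes count as quasi-identifiers is a substantive judgment that the code cannot make.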
2.4. Stakeholders in Social Research Practices
Another important ethical aspect involves the question of which groups of persons might be affected through BD research and hence, who holds a stake in the scientific use of BD. Here, one needs to clarify how the research might affect them and how it can be assured that their interests are respected and their rights protected. The list of potential stakeholders obviously includes research subjects (i.e., the persons the data is about) who should be protected from harm. Often, research ethics considers certain groups as particularly vulnerable, such as children but also seniors. The particular risk a group faces must be considered in the decision whether they should be considered vulnerable. In the case of BD, the question of vulnerability is, among other things, closely linked to digital literacy and tech-savviness, because less experienced users may be unaware of the existing perils and possibilities of BD research.
While it seems clear that we should care about the subjects of our research, there are other stakeholders to consider when we think about the ethical impact of our studies. The data providers from which we retrieve BD, e.g., through online databases and websites, may also be considered stakeholders. While many of them are likely to be powerful commercial companies or state agencies, this is not always the case. Web-scraping data from websites and online databases, for example, may interfere with their performance and limit their functionality for other users if the scraping is done carelessly. It is therefore necessary to include data providers in our ethical considerations as well.
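What scraping "carefully" might mean in practice can be sketched as follows. The example is a minimal illustration in Python, not a complete scraper: it only decides whether a URL may be fetched according to a site's robots.txt rules and how long to wait between requests, and it leaves the actual HTTP request out. The class name `PoliteFetchPolicy`, the user agent `research-bot`, the one-second default delay, and the example robots.txt lines are assumptions made for illustration; only the standard-library `urllib.robotparser` calls are real.

```python
import time
from typing import Callable, Iterable, Optional
from urllib import robotparser


class PoliteFetchPolicy:
    """Decide whether and when a URL may be requested (hypothetical helper)."""

    def __init__(
        self,
        robots_lines: Iterable[str],
        user_agent: str,
        min_delay_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._parser = robotparser.RobotFileParser()
        self._parser.parse(robots_lines)  # rules supplied as lines of robots.txt
        self._user_agent = user_agent
        # Respect the site's Crawl-delay if it is longer than our own minimum.
        site_delay = self._parser.crawl_delay(user_agent)
        self._delay = max(
            min_delay_seconds, float(site_delay) if site_delay is not None else 0.0
        )
        self._clock = clock
        self._last_request: Optional[float] = None

    def allowed(self, url: str) -> bool:
        """True if robots.txt permits this user agent to fetch the URL."""
        return self._parser.can_fetch(self._user_agent, url)

    def seconds_to_wait(self) -> float:
        """Time that should still pass before the next request is sent."""
        if self._last_request is None:
            return 0.0
        elapsed = self._clock() - self._last_request
        return max(0.0, self._delay - elapsed)

    def record_request(self) -> None:
        """Call right after sending a request so the delay can be enforced."""
        self._last_request = self._clock()
```

A scraping loop built on this sketch would skip any URL for which `allowed` returns `False`, sleep for `seconds_to_wait()` before each request, and call `record_request()` afterwards. Honoring robots.txt and spacing out requests reduces the load on the provider, but it does not replace checking a site's terms of use.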
Other researchers and scientists also have a stake in our proper scientific conduct. Research is seldom done alone but rather in cooperation with other scientists and research workers. Project members and associates are an often-overlooked party in research ethics, yet they deserve attention. Researchers as a community share a stake in the ethical discussion of research issues as well. As colleagues and fellow researchers, we have an interest in everyone acting properly so that data sources remain accessible for future research and are not rendered inaccessible due to mishaps and mistakes in research practices. If other stakeholders perceive some scientists’ conduct as improper, fellow scientists may be prevented from using the same or similar data in the future. It is, therefore, in the self-interest of scientists to act ethically. While professional conduct between scientists is often covered in the general rules of good scientific practice by the various national science foundations, there are likely to be specific ethical issues in relation to fellow researchers.
Last but not least, science itself is an important societal endeavor and scientists should have certain privileges in doing their research (something that is also recognized in the EU-GDPR). The importance of scientific research and the possibility to conduct it freely without undue interference by other actors is something that should and must be weighed against the claims of other stakeholders, and also against the rightful claims of research subjects. This is where the scientific usage of BD should be privileged over other private and commercial uses.
2.5. Ethical Regulation and Institutional Review Boards (IRBs)
A discussion on research ethics should acknowledge that there are also critical voices regarding the role of ethics and the “human subject research model” (Bassett and O’Riordan 2002), especially in qualitative research (Haggerty 2004; Hammersley 2009), but also regarding the practical implementation of research ethics through review boards (von Unger et al. 2016). It is argued that ethical reviews before the research is conducted in the field require a pre-fixation of the research design that is incompatible with the principle of openness in qualitative inquiry, where research questions, topics of the inquiry, and sample selection are constantly adjusted in the light of experiences in the field. The analysis of some kinds of BD, for example the analysis of textual data, resembles qualitative approaches to some extent (e.g., the inductive exploration of data to find patterns not previously envisaged). Likewise, online data may be observational data similar to those of ethnographic research, where acquiring informed consent may be difficult. Thus, the question arises whether the same concerns against ethical reviews can also be raised in the realm of BD.
However, overall, the implementation and deeper integration of regulatory bodies and ethics committees in the conduct of social science studies seems desirable, not least because of the advent of BD and the challenges it poses for privacy and anonymity. As this is still a relatively new area for research in the social sciences, practices and protocols still have to develop. It would be naive to trust solely in the self-guidance and morality of individual researchers. First, knowledge and acceptance of ethical standards and principles may vary widely between countries and disciplines, and a unitary approach is therefore desirable for all stakeholders involved. Second, given the transformation of international research into something akin to a capitalistic competition for scientific reputation, with rankings and impact factors deciding on the standing and status of researchers (Münch 2014), the pressure on the individual researcher in this system is too high to expect everyone to simply adhere to their conscience everywhere and all the time during every phase of research. Third, one of the advantages of BD is the possibility to conduct international research and country comparisons more easily; this complicates the question of research ethics, as researchers have to keep their research in compliance with regulations in different countries and regions. Hence, some kind of institutionalized standard-bearer of ethical research practices, which provides oversight but also guidance and assistance, seems desirable. While the ideal form of such an institution still needs to be developed, it may take, for example, the form of an ombudsman that all stakeholders in the research process may turn to if they have questions or concerns.
3. Discussion and Conclusions
The usage of BD is becoming more and more common in social research. While data from online sources share many aspects with traditional data sources such as survey data or archival data, they also exhibit some features that are new, if not in principle, then at least in the magnitude of their occurrence. This is why they offer new opportunities for social researchers to address research questions they have not been able to answer previously. Yet, BD poses new questions for the handling and ethical treatment of such data, simply because of the differences from traditional forms of data.
Ethical concerns around BD often involve issues of data sharing, privacy, and security. Data privacy scandals such as the one involving Facebook and Cambridge Analytica and the attempt to influence political elections (Zuiderveen Borgesius et al. 2018) show the risk involved in the use of BD for research purposes. However, these and other scandals have also led to a global public that is more sensitive to such large-scale breaches of privacy and the consequences they may have on a grand scale. The tech and social media giants have come under heightened pressure to be more open about their data-sharing practices with third parties as well as to be more restrictive in the way they provide access to their data. Still, the misuse of online data by various actors is likely to continue, as the potential gains are huge. It may therefore be argued that, compared to the risks users of digital communication face in their daily routines, the risks from scientific inquiries are relatively small (Kämper 2016). However, researchers should not be left alone in addressing these issues. Rather, they should receive support from institutionalized agencies trained and tasked in the handling of ethical issues. These do not need to take the form of established IRBs and should provide guidance as much as restrictions.
This paper also touched upon the topic of stakeholders who may be affected by social research using BD and who hence should be part of the ethical considerations leading up to BD research projects. However, this was only a preliminary exercise, and a thorough mapping of all different kinds of stakeholders, as is often done in other contexts (Aaltonen and Kujala 2016; Brugha and Varvasovszky 2000), is still lacking. Conducting such a stakeholder mapping may in itself be seen as an ethical requirement. When considering the stakeholders of BD research, one must consider the special role that science as a social enterprise should hold in these considerations. As members of the institution tasked with providing grounded knowledge for society, scientists should hold a privileged position when conducting research and, therefore, when using BD for their research.
While the article made these and other points and observations, they should not be seen as definitive conclusions, but rather as invitations for further discussion and investigation. Indeed, there should be a wide and open debate on the potential ethical pitfalls of BD in the social sciences, and social scientists must take part more actively and visibly in these discussions. The peculiarities of the social world, such as its pre-structuration as a symbolic realm to be interpreted by self-aware actors, need to be kept in mind when discussing the potential benefits for social research. So far, the field of BD research is too often dominated by natural and computer scientists, whose take on the social world often lacks knowledge about the specificity of the social objects under study (Lazer et al. 2009). Likewise, they are often unaware of the social and ethical dilemmas such research may actually involve. As we need new rules and guidelines for conducting research in the digital realm, it becomes an ethical requirement in itself for social scientists to share their knowledge on studying the social world, on substantive as well as methodological and ethical topics, with scientists from other fields and disciplines. However, as this article is mostly informed by considerations related to sociology and neighboring disciplines, it may well be that in other disciplines the potential for research as well as the ethical questions differ and, therefore, need to be addressed separately.