A Big Data Reference Architecture for Emergency Management

Nowadays, we are witnessing a shift in the way emergencies are managed. On the one hand, the availability of big data and the evolution of geographical information systems make it possible to manage and process large quantities of information that can hugely improve the decision-making process. On the other hand, digital humanitarianism has proven very beneficial for providing support during emergencies. Despite this, the full potential of combining automatic big data processing and digital humanitarianism approaches has not been realized, though there is an initial body of research. This paper aims to provide a reference architecture for emergency management that instantiates the NIST Big Data Reference Architecture to provide a common language and enable the comparison of solutions for solving similar problems.


Introduction
Having access to reliable information during emergencies is essential for effective emergency management. New technologies have profoundly changed the nature and quantity of information available from different actors, such as public authorities, media, citizens, and volunteer organizations.
The growth of social media, satellite remote sensing, sensor networks, and connected devices has contributed to a data deluge beyond what can be captured, processed, and interpreted with traditional tools, which is usually known as a big data problem. According to NIST's big data definition [1], "Big Data consists of extensive datasets-primarily in the characteristics of volume, variety, velocity, and/or variability-that require a scalable architecture for efficient storage, manipulation, and analysis". Thus, big data technologies have been widely used to process data and improve disaster management decision-making processes [2,3].
Besides, social media and crowdsourcing have significantly impacted how information is processed and decisions are made. Consequently, emergency management has evolved from centralized top-down models managed by public authorities to collaborative approaches where citizen participation is encouraged. These two models represent a continuum of existing emergency management models [4]. At one end of the continuum lies the command and control approach [5] (also called strategic [6]), which follows an authoritarian model and divides competencies by level of command into strategic, tactical, and operational [7]. At the other end of the continuum lies the emergent human resource model [5] (also called people-centered [7] or tactical [6]), which tends to divide competencies by theme [7], such as communication, logistics, and shelter.
A common view is that traditional top-down crisis management approaches are necessary but not sufficient [8], and they should be complemented with the promotion of societal resilience.While top-down approaches can improve preparedness and planning of emergencies, an effective response during the immediate aftermath of a crisis is critically improved by citizens' resilience.
The availability and adoption of Information and Communication Technologies (ICTs) have been among the reasons that have enabled this shift in emergency management [9]. Society has become accustomed to immediacy and to gathering and delivering information in real time. Even when landline phone networks are unavailable or only intermittently available, fiber-optic connectivity and mobile phone networks exhibit more resilient performance, especially for establishing SMS and other text-based short-message communication. As stated by Eric Gujer [10]: "The Internet plays an increasingly important role in catastrophes and conflicts. Television fundamentally changed our perception of conflicts and disasters through live broadcasts from war zones in the nineties. The Internet, cell phones, and satellites are the next stage in the media revolution". The effective use of social media has made possible phenomena such as the Arab Spring. In the words of the protester Fawaz Rashed: "We use Facebook to schedule the protests, Twitter to coordinate, and YouTube to tell the world". In the emergency response domain, the effective usage of social media has also impacted emergency management. Disasters such as Haiti's earthquake in 2010 "have represented a paradigm shift in the use of social media for disaster response, as multiple web-based platforms emerged to collect, refine, and disseminate crisis-related social media" [11].
Despite these advances, many challenges remain in leveraging the crowd's wisdom and automatic information processing. The main identified shortfalls of crowdsourcing applications are scalability, quality control, coordination, safety, and forecasting capabilities. Several authors [12][13][14] report that crowdsourcing applications such as Ushahidi have severe scalability issues, since their inflow rate of information can reach thousands of messages per minute, surpassing crowd processing capacity and resulting in an ever-growing backlog of unprocessed requests. Another frequently discussed downside is the need for better quality control and assurance [12,13]. Quality assurance is required to improve classification and geo-location accuracy, reduce redundancy, and ensure critical control points. Concerning coordination, crowdsourcing applications have proven useful for gathering information during the disaster, but they do not support response coordination. Gao et al. [12] proposed to integrate groupsourcing, so that the system allows the separation of requests from crowds (end users suffering the catastrophe) and requests from groups (coordination messages of responding organizations). Besides, these platforms cannot forecast the evolution of incoming messages or emergencies in areas with limited or no communication capability [12].
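The scalability shortfall can be made concrete with a minimal queue model: when messages arrive faster than the crowd can process them, the backlog grows without bound. The following sketch is illustrative only (the rates are invented, not taken from the cited works):

```python
# Illustrative sketch (not from the cited works): a minimal queue model
# showing why crowd-only processing accumulates a backlog when the inflow
# rate exceeds the crowd's processing capacity.

def backlog_after(minutes, inflow_per_min, crowd_capacity_per_min):
    """Unprocessed messages after `minutes`, assuming constant rates."""
    backlog = 0
    for _ in range(minutes):
        backlog += inflow_per_min
        backlog -= min(backlog, crowd_capacity_per_min)
    return backlog

# E.g., 2000 messages/min arriving against 300 messages/min of crowd
# throughput leaves 1700 unprocessed messages accruing every minute:
print(backlog_after(60, 2000, 300))  # backlog after one hour: 102000
```

Under these invented rates, an hour of operation already leaves over one hundred thousand unprocessed requests, which motivates complementing the crowd with automatic processing.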
In view of the above-identified shortfalls, it would be advantageous to provide methods and systems that effectively combine big-data-enabled automatic processing with the power of human-centered approaches in emergency management. In this paper, we aim at providing a reference architecture that enables the combination of both approaches.
The remainder of this paper is organized as follows. Section 2 reviews existing works. Section 3 introduces the proposed Big Data Framework for Emergency Management, which provides a panoramic overview of the different actors, data, tasks, and coordination means for emergency management. Section 4 presents how the reference architecture is mapped onto a case study. Section 5 analyses the results. Finally, the conclusions of the research are presented in Section 6.

National Planning Framework for Emergency Management
Since disasters tend to be repetitive, disaster management usually defines cycles that include all the activities and measures to be taken before, during, and after the disaster to reduce its impact.
The emergency management cycle usually considers four phases [15]: mitigation, preparedness, response, and recovery, although there are other proposals in the literature [16,17]. These phases are not sequential; they overlap, interrelate, and complement each other [17]. They can be classified [18] into three stages: pre-disaster (mitigation and preparedness), during the disaster (response), and post-disaster (mitigation and recovery).
Mitigation comprises all activities aiming at reducing the impact of the disaster (public education, building codes and zones, buying flood and fire insurance, etc.).Preparedness defines plans about responding (e.g., emergency training, warning systems, evacuation plans).Response deals with all the activities to minimize the hazards created by the disaster (e.g., search and rescue, emergency relief, seeking shelter).Finally, recovery is the process of repairing damage and returning the community to a normal situation (e.g., temporary housing, restoring services, financial assistance).
Business processes involved in emergency management should be identified to analyze the applicability of big data techniques. To this end, we have adopted the United States National Planning Frameworks [19]. The National Planning Frameworks provide governmental guidance for achieving a unified national preparedness for disasters and emergencies. There is one framework for each of the five preparedness mission areas: the National Prevention Framework [20], the National Protection Framework [21], the National Mitigation Framework [22], the National Response Framework (NRF) [23], and the National Disaster Recovery Framework [24].
Each framework defines the partners involved in emergency response and their roles and responsibilities. Additionally, it provides a shared vocabulary for defining the core capabilities and activities that must be accomplished in incident management. There are 32 core capabilities identified. Table 1 collects the 24 core capabilities related to emergency management. The core capabilities related to terrorism have been excluded, since they are out of this work's scope. In particular, the following core capabilities have been excluded: from both prevention and protection, screening, search and detection, and interdiction and disruption; from prevention, forensics and attribution; and from protection, access control and identity verification, cybersecurity, physical protective measures, risk management for protection programs and activities, and supply chain integrity and security. Besides, the protection and prevention phases have been merged into the preparedness phase, since this phase is more established in the literature and, after removing the core capabilities related to terrorism, these two phases share the same core capabilities.
The core capability of intelligence and information sharing is particularly relevant to the role of community participation. The National Prevention Framework does not formalize the tasks associated with this core capability but provides a list of critical tasks: planning and direction; collection; exploitation and processing; analysis and production; dissemination; feedback and evaluation; and assessment.

NIST Big Data Reference Architecture
Reference architectures [25] aim at providing abstract software architectures that collect architectural patterns and software elements to support the development of systems in specific domains. With regard to big data systems, several authors have proposed general reference architectures [26][27][28], as well as reference architectures for big data systems in specific domains, such as security [29], industry [30], e-learning [31], cloud-based video analytics [32], and smart cities [33], to name a few.
In this section, we briefly review the NIST Big Data Reference Architecture (NBDRA), shown in Figure 1. It is the proposal that has achieved the most support from academia and industry, having been developed by a working group launched in 2013 with over six hundred participants from industry, academia, and government. It provides a vendor-neutral, technology- and infrastructure-agnostic conceptual model of a big data architecture. NBDRA defines an open reference architecture representing big data systems and is intended to support data engineers, data scientists, data architects, software developers, and decision-makers in developing interoperable big data solutions. The reference architecture is organized around five major roles and two fabric roles. The five NBDRA roles are: data provider, data consumer, big data application provider, big data framework provider, and system orchestrator. The two fabric roles are management, and security and privacy. These two fabrics provide services to the five main roles. These actors and fabrics interact as follows. Data provider actors make data available to others. Data are then processed by a big data application provider that executes the data life cycle to meet the application requirements defined by a system orchestrator. The big data application provider uses the big data framework provider's resources, which make the required infrastructure available for processing and storing the data. The output of the system is received by the data consumer, which exploits the insights of big data processing. Additionally, a management fabric takes charge of maintaining data quality while addressing management tasks such as system, data, security, and privacy considerations. Specific attention is paid to security and privacy, and special measures are taken by the security and privacy fabric so that these requirement policies are met, including auditing.
Five big data processing activities for big data application providers are defined: collection, preparation, analytics, visualization, and access. The collection activity handles the interface with the data provider. The preparation and curation activity deals with data validation, cleansing, standardization, reformatting, and frequently persisting the data. The analytics activity implements techniques to extract knowledge from the data. The visualization activity deals with presenting the data to the data consumer to communicate their insights optimally. Finally, the access activity provides a service for handling the requests of the data consumer. These phases can be processed in different ways according to the requirements and big data platform capabilities. There are three main approaches [34,35]: batch processing, stream processing, and interactive processing.
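The five activities above can be sketched as a simple batch pipeline. The function names, data shapes, and toy "analytics" below are our own illustrative assumptions, not NIST-defined APIs:

```python
# Sketch of the five NBDRA application-provider activities chained as a
# batch pipeline. All names and data shapes are illustrative assumptions.

def collection(raw_feed):
    """Interface with the data provider: ingest raw records, drop empties."""
    return [r for r in raw_feed if r]

def preparation(records):
    """Validate, cleanse, and standardize the ingested records."""
    return [r.strip().lower() for r in records]

def analytics(records):
    """Extract knowledge: here, a trivial keyword count stands in."""
    return {"flood_mentions": sum("flood" in r for r in records)}

def visualization(insights):
    """Render the insights for the data consumer."""
    return f"flood mentions: {insights['flood_mentions']}"

def access(view):
    """Serve the rendered result on request."""
    return view

feed = ["Flood on Main St ", "", "Road clear", "flood warning issued"]
print(access(visualization(analytics(preparation(collection(feed))))))
# -> flood mentions: 2
```

In a stream-processing deployment, the same activities would run continuously over message windows instead of a finite batch.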
After the conceptual definition of NBDRA, the NIST Big Data Public Working Group has addressed the definitions of the interfaces between NBDRA components [36], and the elaboration of guidelines for its adoption [37].
The phases defined by the NIST framework can be easily mapped to other frameworks [26,38,39]. Some works [40] extended big data architectures to integrate metadata and quality management components, so that the data ingested in the architecture are annotated to enable provenance and quality assessment.

Big Data Technologies in Emergency Management
Big data is "a term used to describe the large amount of data in the networked, digitized, sensor-laden, information-driven world" [34].Big data problems have been traditionally characterized by four dimensions: volume, velocity, variety, and veracity, known as The Four Vs [41,42].
Disaster management can be characterized as a big data problem according to these dimensions. First of all, a great variety of data sources are integrated for disaster management. According to Qadir et al. [43], disaster data can be classified into data exhaust, online activity, sensors, small data, public data, and crowdsourced data. Data exhaust refers to information passively collected (i.e., mobile call detail records, banking records, credit card history, access logs). Online activity data are all the data derived from users' social activity (i.e., SMS, emails, posts, comments, search engine activity). Sensing technologies refer to information collected by sensors, such as remote sensing (i.e., satellites and aircraft), networked sensing (i.e., networked sensor systems), and participatory sensing technologies (i.e., sensors from everyday devices such as phones or buses). Small data are data derived from individual personal traces [44] that can complement big data for providing personalized solutions. Public data are all the data provided by official channels, such as governmental and municipal offices. Finally, crowdsourced data are the data generated actively by the population (its suppliers are frequently known as digital humanitarians [45]), who participate in a network of volunteers to support disaster management. Depending on their characteristics (spatial and non-spatial), these data sources require different batch and stream big data processing frameworks, as detailed by Cumbane and Gidófalvi [46].
Volume and velocity dimensions are associated with the previously enumerated data sources. To name a few examples, Kwak [47] reports that a big data system for flood disaster risk assessment uses the Himawari-8 satellite as one of its data sources, which generates a file size of 329 GB per day and 930 MB per 10 min. During the 2010 Haiti earthquake, the US Geological Survey (USGS) reported that over 600,000 files representing 54 terabytes of imagery data were distributed within the first six weeks after the primary event [48]. Kryvasheyeu et al. [49] report that more than 50 million tweets were posted during Hurricane Sandy.
Data veracity should be verified, since a lack of control could lead to misleading decisions.In particular, crowdsourced data usage brings several potential issues [50], such as the inaccuracy of information and rumors and the malicious use of social media.Several social media verification approaches have been followed, based on automatic intelligent processing or crowdsourcing [50].
Several authors have summarized existing research on the application of big data systems to emergency management [2,43,46,[51][52][53][54][55][56][57]. The availability of resilient communication networks is one of the challenges for the application of big data technologies during emergencies, since large-scale disasters can result in massive blackouts. Song et al. [57] reviewed the main approaches for achieving network resiliency, which can be enhanced using big data [2], such as the use of ad hoc networking, delay/disruption-tolerant networking, and smartphone-based Emergency Communication Networks (ECNs). Several surveys [51,53] analyzed the application of data management techniques in disaster situations and reviewed the main challenges and solutions that data science can address in the areas of data integration and ingestion, information extraction, information retrieval, information filtering, data mining, and decision support. Goswami et al. [58] conducted a review of the application of data mining techniques for disaster management. They concluded that the main usages are prediction (e.g., forecasting the magnitude of an earthquake [59]), detection of natural disasters (e.g., early earthquake detection based on social sensors [60]), and disaster management strategies (e.g., understanding people's needs and sentiment based on blogs and social networks [61]). Li et al. [53] added other application types, in particular disaster simulations (e.g., 3D storm surge visualizations to improve situational awareness and evacuation decisions [62]), disaster visualization (e.g., visual analytics facilities to improve information sharing across stakeholders [63]), and insurance risk modeling (e.g., evaluating the economic effects of earthquakes on construction [64]). Several authors [3,55,58] reviewed the application of big data technologies depending on the disaster type and emergency phase.
The recent systematic review by Freeman et al. [56] reveals that ICT and big data technologies have been used mostly in real natural disasters in the response phase (75% of reviewed works). The ICT tools that have been used most frequently together with big data technologies are Geographical Information Systems (GIS), social media tools, patient health databases, and general disaster management software. With regard to big data consumers, Freeman et al. identified clinical first responders, community members, national governments (military or non-military), and local Non-Governmental Organizations (NGOs). They report that most articles (64.47%) are targeted at clinical first responders or community members. In contrast, the systematic review by Akter and Wamba [65] reports that most works address the mitigation phase (36.8%), followed by the response phase (28.9%). This discrepancy could stem from the different selection criteria of both works, since Freeman et al. considered only academic works dealing with real emergencies and simulations, while Akter and Wamba focused their review on works that provide theoretical insights. Finally, Yu et al. [54] conducted another systematic review that provides a detailed classification of articles according to the disaster management phase, data source, and disaster type.

Digital Humanitarianism in Disasters
The effective use of social media and crowdsourcing in the 2010 Haiti earthquake marked a turning point for leveraging public participation in disaster response [66]. It has led to the development of a new digital humanitarianism [67] that uses crowdsourcing, remote volunteer collaboration, data production and processing, social media, and crisis mapping.
Thus, in this section, we aim to review and characterize the tasks carried out by humanitarian organizations in emergency management. To clarify the terminology used in our study, we define below three terms that are sometimes used interchangeably: social media, social networks, and crowdsourcing.
Social media are [68] "a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of User Generated Content". Depending on the level of self-disclosure and ordered by media richness, social media can be classified [68] into high self-disclosure (blogs, Social Network Sites (SNS), and virtual social networks) and low self-disclosure (e.g., wikis, content communities such as Flickr and YouTube, and virtual game worlds). SNSs are [69] "Web-based services that allow individuals to (1) construct a public or semi-public profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections and those made by others within the system". Finally, crowdsourcing is [70] "the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined network of people in the form of an open call". These three systems have different uses in emergency management. While social media are used for general information and knowledge communication, SNSs are used for coordination and personal information communication, and crowdsourcing provides the ability to outsource the response to people outside the damaged area. In particular, it is important to point out that visualization (in particular, crisis maps) has a communicative purpose in big data platforms, but it is also an effective means of coordination among digital humanitarians.
According to Alexander [4], social media play three critical roles during emergencies: a listening function to understand people's opinions and concerns; a monitoring function for improving disaster management based on people's experiences; and a dissemination function during emergency planning and crisis management. Other works also review the role of social media during specific disaster phases, such as preparedness [74] and situation awareness [71]. Yin et al. [71] propose an architecture for emergency situation awareness whose main data processing components are burst detection, text classification for impact assessment, online clustering for topic discovery, and geotagging. Anson et al. [74] classify potential uses of social media for preparedness as (i) improving the effectiveness of preparedness communication by tailoring prepared messages to particular target audiences, increasing the reach of these messages by scheduling them properly and identifying influential users, and evaluating the effectiveness of the campaign; (ii) discovering community networks that can be mobilized before a disaster occurs; and (iii) providing preparedness information.
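A burst-detection component of the kind described above can be sketched with a simple z-score test over per-interval message counts. This is an illustrative assumption on our part, not the algorithm of Yin et al.:

```python
from statistics import mean, stdev

# Sketch of burst detection over per-interval message counts, in the spirit
# of the burst-detection component of emergency situation awareness systems.
# The z-score threshold approach is an illustrative assumption.

def detect_bursts(counts, window=5, threshold=3.0):
    """Return indices of intervals whose count deviates more than
    `threshold` standard deviations above the mean of the preceding
    `window` intervals."""
    bursts = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (counts[i] - mu) / sigma > threshold:
            bursts.append(i)
    return bursts

# Steady traffic of ~10 messages/interval, then a spike at interval 7:
print(detect_bursts([10, 11, 9, 10, 12, 10, 11, 90]))  # -> [7]
```

Production systems typically replace the fixed window with decayed statistics so that the detector adapts to slowly changing baseline traffic.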
Several authors have reviewed the role of crowdsourcing in emergencies. Poblet et al. [76,78] classified social media approaches into data-oriented, communication-oriented, and crowdsourcing. Data-oriented approaches analyze social media to extract relevant information that complements existing procedures, while communication-oriented approaches aim at enhancing communication between citizens and disaster managers. Crowdsourcing lies between these two approaches, since it leverages people's workforce in the disaster management cycle. Poblet et al. identified as leading roles data generation (passive, active, or structured as reports) and microtasking (e.g., tagging or geolocating), and reviewed the functionality provided by crowdsourcing tools. Liu [77] defined a framework for characterizing the role of crowdsourcing in emergency management along several dimensions: why (types of tasks), who (types of crowds), what (types of flows), where (spatial aspects), and when (temporal aspects). This review is mainly interested in the why dimension, which identifies the following tasks: crowd sensing, crowd tagging, crowd mapping, and crowd curating. Finally, Kankanamge et al. [79] carried out a systematic review of crowdsourcing's impact on disaster risk reduction.
Finally, another interesting perspective is the usage of computational techniques for processing social media and crowdsourced data. Imran et al. [80,81,83] reviewed the computational techniques for processing social messages. During the mitigation phase, the main activity is event detection from social data streams using topic detection, new event detection, and tracking techniques. In the response and recovery phases, several techniques are used for managing the information overload, which can complement crowdtasking processing. Information classification is usually carried out using supervised machine learning based on previously labeled examples. The purposes of classification depend on the data available as well as on the information needs. They distinguish between classification by information provided (e.g., affected people, damaged infrastructure), by information source (e.g., citizens, media, government), by information credibility factors (e.g., fake news, rumors), by temporal aspects (e.g., emergency phase), by geographical location (e.g., a specific geographical area), or by factual, subjective, or emotional content (e.g., citizens' feelings). Another technique is information clustering, an unsupervised machine learning technique whose primary usage is grouping similar messages or detecting anomalies. Other techniques frequently used are text summarization and semantic enrichment, mainly based on entity recognition and linking. Zhang et al. [30] carried out an interdisciplinary review of social media use in disasters. They point out the need to analyze temporal and spatial patterns to understand the information diffusion process using social network analysis techniques, providing valuable insights for understanding the rumor propagation process.
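The supervised classification approach described above can be illustrated with a toy multinomial naive Bayes trained on labeled crisis messages. The training data, labels, and model are invented for illustration and do not correspond to any system reviewed by Imran et al.:

```python
from collections import Counter, defaultdict
import math

# Illustrative sketch: a minimal multinomial naive Bayes classifying crisis
# messages by information type. Training examples and labels are invented.

def train(examples):
    """Count word and label frequencies from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, model):
    """Return the most probable label under Laplace-smoothed naive Bayes."""
    word_counts, label_counts = model
    vocab = {w for counter in word_counts.values() for w in counter}
    total = sum(label_counts.values())
    best = None
    for label, n in label_counts.items():
        score = math.log(n / total)                       # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        best = max(best, (score, label)) if best else (score, label)
    return best[1]

examples = [
    ("people trapped need rescue", "affected_people"),
    ("family missing after flood", "affected_people"),
    ("bridge collapsed road blocked", "infrastructure_damage"),
    ("power lines down in the city", "infrastructure_damage"),
]
model = train(examples)
print(classify("two people missing", model))  # -> affected_people
```

Real deployments use far richer features and models, but the supervised pattern is the same: labeled examples in, a per-category decision out.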
The last two perspectives can also be combined so that automatic and human elements work together in data processing pipelines, so-called Crowdsourced Stream Processing (CSP) [84]. A data pipeline is a linear sequence of data processing steps where each step processes the previous step's output. The data pipeline design pattern is a classical approach to data processing that has gained attention with the rise of big data, since this pattern helps manage the big data volume, thanks mainly to the use of distributed batch-processing big data platforms [85]. The adoption of big data technologies in humanitarian operations is an ongoing effort [86]. It provides many benefits, such as real-time information access and improving the decision-making process. Nevertheless, there are still some challenges [86], which include the availability of big data infrastructures and staff in marginalized regions, and the need to define suitable data policies to preserve data protection and privacy. In addition, crowdsourced big data could reinforce digital inequalities [87]. As Burns [67] discusses, big data is not only a new source of data for digital humanitarianism; its adoption requires transforming humanitarian organizations, which should adopt a new set of practices.
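A CSP pipeline of this kind can be sketched as an automatic step whose low-confidence outputs are escalated to a crowd step. The confidence heuristic and the crowd stub below are illustrative assumptions, not an implementation of the cited CSP systems:

```python
# Sketch of Crowdsourced Stream Processing: an automatic step handles
# confident cases and routes the rest to a (simulated) crowd step.

def automatic_step(message):
    """Return (label, confidence); a trivial keyword heuristic stands in
    for a trained classifier."""
    if "flood" in message:
        return "flood_report", 0.9
    return "unknown", 0.2

def crowd_step(message):
    """Stand-in for a crowdworking platform; here, humans always resolve."""
    return "resolved_by_crowd"

def csp_pipeline(stream, min_confidence=0.8):
    """Label each message automatically, escalating uncertain items."""
    results = []
    for msg in stream:
        label, conf = automatic_step(msg)
        if conf < min_confidence:
            label = crowd_step(msg)
        results.append((msg, label))
    return results

for msg, label in csp_pipeline(["flood at dock", "smoke near school"]):
    print(msg, "->", label)
```

The escalation threshold is the key design knob: it trades crowd workload against automatic-classification error, which directly addresses the backlog problem discussed in the Introduction.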

Big Data Reference Architecture for Emergency Management
This section presents the design of a big data reference architecture for emergencies based on NBDRA. The reference architecture shown in Figure 2 has been constructed inductively based on the analysis of the literature presented above. Analytical tasks have been classified according to the CommonKADS task hierarchy [88], as explained below. The proposed reference architecture aims at developing a shared understanding of the applications of big data for emergency management. This reference architecture can be used for knowledge management, by collecting and organizing best practices, and for practical implementation.
Data providers introduce information feeds into the system. The proposed reference architecture extends previous taxonomies [89,90] and includes ICT systems that provide information to the big data system [56]. Data providers have been classified as:

• Digital sensors: data collected passively through the use of digital services (e.g., mobile phones, web searches).
• Social media and news media: information published on the Internet (e.g., blogs, Twitter) that can be traced as social sensors of people's opinions and intents. Geolocated social media are especially relevant [72].
• Crowdsourcing: information produced actively by users in order to report information about a disaster (e.g., mobile phone reporting tools, emergency maps).
• Health Information Management Systems: health information for managing the disaster, mainly related to patients and hospital management systems.
• GIS: geographical information provided by GIS systems.
The five processing activities within the big data application provider have been further detailed for emergency management.
The collection activity uses standard big data collection techniques for accessing data providers and persisting data in the big data framework provider. Depending on the disaster phase, the system orchestrator should configure access to data providers and the security and privacy fabric components to follow the established requirements and data policies. The main specificity for disaster management is the integration with crowdworking software.
The preparation activity comprises data cleansing, standardization, validation, and enrichment.The proposed framework includes a list of microtasks derived from the literature review: filtering [93], tagging [94], translation [94], geocoding [95], geotagging [96], validation to check the veracity or data correctness [97], correction [98], summarization [99], and comparison [100].Many of these tasks can be done using crowdsourcing or automatic methods.For example, Imran et al. [97] use automatic techniques for filtering and classifying images, and the classification is validated using crowdsourcing.
The analytics activity aims at extracting knowledge from the ingested data. Analytic tasks have been organized based on the CommonKADS task library [88], since it provides a general framework for classifying the potential uses of big data analytics. This framework distinguishes two general task types: analytical and synthetic tasks. Analytical tasks produce a characterization of the system and are subdivided into prediction, classification, diagnosis, assessment, and monitoring. Synthetic tasks construct a description of the system and are subdivided into assignment, scheduling, planning, modeling, and design. This categorization has been used for classifying uses of big data according to NRF core capabilities in the different phases of disaster management: mitigation (Table 2), preparedness (Table 3), response (Table 4), and recovery (Table 5).
During the pre-disaster stage, big data analytics can contribute to building resilient infrastructures and communities, both in mitigation and preparedness activities. As shown in Table 2, during the mitigation phase, big data technologies can help reduce the impact of disasters by providing a long-term hazards data collection system. Big data analytics can be used for risk assessment, in order to understand vulnerabilities to threats and hazards and develop plans and strategies to manage them. In addition, monitoring and prediction analytic tasks are also relevant, since they can help decision makers prioritize risks and make informed decisions. Regarding preparedness activities, big data technologies can improve decision making in planning, coordination, and information activities, as shown in Table 3.
During the disaster stage, big data technologies can provide real-time decision support for disaster management, since they can manage the variety, volume, and velocity of the available data sources. As shown in Table 4, the main purpose of analytic tasks is providing real-time assessment. In fact, the integration of big data has transformed the decision-making process, which was previously based on historical data [86]. Instead, organizations can now make more informed decisions and adapt their strategy when the situation changes. As illustrated in Table 4, big data analytics can provide assessments for improving decision making in a wide range of activities, such as the analysis of social media for emergency planning [101], rescue team coordination [102], and triage [103]. In addition, analytic tasks can provide new insights, since they can detect hidden patterns that enable decision makers to gain a deeper understanding of the situation [86]. Monitoring activities can benefit from the integration of heterogeneous sources [104] and help in detecting trends and patterns to foresee potential issues [86,105,106]. Moreover, big data technologies can not only improve situational awareness; prediction analytic tasks can also enable moving from hindsight to foresight and anticipating the consequences of the current situation.
Finally, during the aftermath of the disaster, big data technologies can contribute to monitoring the recovery status and can provide assessments to evaluate the socio-economic consequences and recovery efforts, as can be seen in Table 5.
The visualization activity presents processed data to data consumers. The proposed reference architecture includes crisis maps, since they are among the most popular visualization mechanisms for crowd data. They provide an overview of the emergency situation and include layers for organizing the information (e.g., incidents, safety, and security) [107].
The access activity manages communication and interaction with data consumers. For disaster management, specific attention should be paid to communication with crowdsourcing tools and with visual analytics tools such as crisis mapping tools.
Finally, data consumers use the output of the big data system for managing the disaster. Data consumers of the Big Data System for Emergency Management are:

• Government: governmental partners responsible for disaster management.
• Media: mass media organizations that contribute to information distribution and sharing during the emergency cycle.
• NGOs: non-governmental organizations participating in the emergency as first responders.
• Citizens: citizens affected or not affected by the emergency.
• Crowdsourcing: digital humanitarian organizations participating proactively in emergency management.
• Health information management systems: health systems that can use big data insights in their decision-making processes.
• GIS: geographical information systems that can aggregate information from the big data system.
• Social media management: social media management tools that can use big data insights to improve the impact of information sharing.
The proposed reference architecture enables the integration of automatic (big data-based) and crowdsourcing resources as follows.
Regarding big data processing, data pipelines correspond to the processing tasks carried out by big data application providers in NBDRA, according to the requirements specified in the system orchestrator. The execution of data pipelines usually requires the system orchestrator's interaction with other systems that play the roles of big data application provider, management fabric, and security and privacy fabric. As Imran et al. [84] point out, crowdsourcing systems are more suitable for data entry, binary classification, and n-ary classification microtasks. The use of automatic or human processing for these tasks depends on disaster requirements and resource availability.
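The routing decision just described can be sketched as a tiny dispatcher; this is our illustration (the task-type names and availability check are assumptions), not part of NBDRA or any cited system:

```python
# Illustrative dispatcher: route a microtask to crowdsourcing when the task
# type suits human workers (per Imran et al.'s observation) and volunteers
# are available; otherwise fall back to an automatic component.

CROWD_FRIENDLY = {"data_entry", "binary_classification", "nary_classification"}

def dispatch(task_type, crowd_workers_available):
    if task_type in CROWD_FRIENDLY and crowd_workers_available > 0:
        return "crowdsourcing"
    return "automatic"

print(dispatch("binary_classification", crowd_workers_available=12))
print(dispatch("geocoding", crowd_workers_available=12))
print(dispatch("binary_classification", crowd_workers_available=0))
```

In practice, the thresholds would also reflect disaster requirements such as urgency and accuracy targets.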
With reference to the integration of crowdsourcing resources, digital sensors and social media have been identified as data providers, which corresponds to the crowdsourcing roles "crowd as a reporter", "crowd as a sensor", and "crowd as a social computer" in the crowdsourcing role taxonomy defined by Poblet et al. [76]. Besides, the activities defined in the big data application provider can be executed automatically by the big data system or orchestrated as microtasks, which corresponds to the crowdsourcing role "crowd as a microtasker" of the same taxonomy. The access activity also considers the integration of interfaces with crowdsourcing tools [78], including popular crisis mapping systems [108]. Finally, crowdsourcing also plays the role of data consumer: digital humanitarians can benefit from big data systems to optimize their performance.
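The correspondence described above between crowdsourcing roles and NBDRA roles can be recorded explicitly when documenting a deployment; this compact encoding is our own summary of the mapping in the text:

```python
# Mapping of Poblet et al.'s crowdsourcing roles to NBDRA roles,
# as summarized from the text above.
ROLE_MAPPING = {
    "crowd as a reporter": "data provider",
    "crowd as a sensor": "data provider",
    "crowd as a social computer": "data provider",
    "crowd as a microtasker": "big data application provider",
    "digital humanitarian": "data consumer",
}
print(ROLE_MAPPING["crowd as a microtasker"])
```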

NRF Capability | Task | Example
Planning | Assessment | Simulation modelling of eruptive processes for identifying eruption scenarios for emergency planning at Vesuvius, Italy [109]
Public information and warning | Communication, Prediction | Big data analytics for predicting extreme flood risks and creating awareness in the community to mitigate their effects [110]
Operational Coordination | Schedule | Development of scheduling plans for power supply based on disaster trends and reserves of emergency supplies [111]
Community resilience | Assessment | Use of big data technologies to integrate physical, social, economic, and environmental dimensions to assess neighbourhood resilience [112]
Long-term vulnerability reduction | Assessment | Harvesting big data from residential buildings for the assessment of climate change policies [113]
Risk and disaster resilience assessment | Assessment | Geospatial zonation of seismic site effects in Seoul [114]
Threats and hazards identification | Monitoring | Monitoring social media and crowdsourcing data for early identification of urban flooding [115]

Table 3. Big data granular tasks for preparedness phase.

NRF Capability | Task | Example
Planning | Prediction | Ambulance demand forecast based on weather conditions and datasets from hospitals [101]
Public information and warning | Communicate | Use of social media to communicate that the vaccine against H1N1 influenza was available [116]
Operational Coordination | Assessment | Recommendation of using operational analytics to coordinate emergency response across Federal, State, and local agencies [117]
Intelligence and information sharing | Collection | Use of big data and open data integration mechanisms for improving information sharing from central to local governments and NGOs during preparedness in Taiwan [118]

Table 4. Big data granular tasks for response phase.

NRF Capability | Task | Example
Planning | Assessment | Analysis of geosocial media posts for emergency planning [105]
Public information and warning | Assessment | Assessment for managing affected populations based on a spatio-temporal analysis of public emotion information [106]
Operational Coordination | Assessment | Improved coordination between rescue teams integrating geographical, satellite, census, and mobile phone call reports in the Kerala floods [102]
Infrastructure systems | Assessment | Spatial assessment of risk and resilience of critical infrastructures for flood disaster [119]
Critical transportation | Prediction | Description and prediction of passenger flows, detection of unusual flows and their explanation based on Twitter content during several disasters in Japan [120]
Environmental response; health and safety | Monitoring | Big data system for monitoring water pollution after a flood disaster [104]
Fatality management services | Assessment | Fatality estimation and tsunami hazard assessment based on big data earthquake source models [121]
Fire management and suppression | Prediction | Real-time prediction of fire department response times in San Francisco [122]
Logistics and supply chain management | Assessment | Decision support system for optimal facility location, its state of operation, and production-distribution across countries [123]
Mass care services | Assignment | Decision support system for the allocation of temporary housing after the disaster [124]
Mass search and rescue operations | Assessment | Decision support system for prioritising victims to be rescued [125]
On-scene security, protection, and law enforcement | Classification | Identification of eyewitness messages [126]
Operational communications | Prediction | Prediction of mobile service disruption during Tokyo earthquakes [127]
Public health, healthcare, and emergency medical services | Assessment | Triage based on big data [103]
Situational assessment | Classification | Detecting informative tweets [128]

Table 5. Big data granular tasks for recovery phase.

NRF Capability | Task | Example
Planning | Assessment | Assessment of resilience to emergencies and disasters at the neighbourhood level for improving planning based on big data fusion [112]
Public information and warning | Monitoring | Monitoring social media (e.g., Twitter) to classify messages per disaster phase and mine relevant information [129]
Operational Coordination | Assessment | Satellite-based assessment of electricity restoration efforts during Hurricane Maria in Puerto Rico [130]
Infrastructure systems | Assessment | Evaluation of the resilience and recovery of public transit systems based on big data [131]
Economic recovery | Assessment | Economic loss assessment for rainfall and flooding disasters based on big data fusion [132]
Health and social services | — | Decision support system for evaluating hospital resources during post-disaster management [133]
Housing | Assessment | Socio-economic analysis of disaster recovery based on housing market data [134]
Natural and cultural resources | Assessment | Recovery assessment of monuments based on sentiment analysis of tweets during memorial days [135]

Case Study
This section describes a case study to show how the defined reference architecture can be mapped onto published disaster management architectures.
Kabir et al. [136] proposed STIMULATE, a system for coordinating rescue operations based on the information published by affected people on the social network Twitter. The system is deployed in a cloud environment using Hadoop and comprises three components: the tweet fetcher, tweet processing, and rescue scheduling.
The tweet fetcher component collects tweets using the Twitter streaming API. A web interface allows filtering tweets by multiple keywords and locations; the location area can be selected on a map. Tweets are then preprocessed, replacing emojis, jargon, slang, and contractions with more common wordings. The result is stored in a MongoDB database [137].
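A hypothetical re-creation of the normalization step just described might look as follows; the replacement table is invented for illustration, since STIMULATE's actual dictionaries are not reproduced in the paper:

```python
# Illustrative normalization: replace emojis, slang, and contractions with
# more common wordings. The replacement table is an invented example.
REPLACEMENTS = {
    "pls": "please",
    "u": "you",
    "can't": "cannot",
    "🌊": "flood",
}

def normalize(text):
    tokens = text.split()
    tokens = [REPLACEMENTS.get(t.lower(), t) for t in tokens]
    return " ".join(tokens)

print(normalize("pls help u 🌊 can't move"))
```

A production system would use much larger dictionaries and handle punctuation attached to tokens.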
The tweet processing component aims at detecting stranded individuals and determining their rescue needs and priority. For this purpose, the system extracts locations. Multi-label, multi-class classification is then performed based on a taxonomy provided by the Federal Emergency Management Agency (FEMA) for rescuing stranded people. The categories are: rescue needed, DECW (diseased, elderly, children, and pregnant women), water needed, injured, sick, and flood. Rescue priority is then calculated by aggregating different factors, such as weather conditions obtained using the Open Weather API (available at https://openweathermap.org/api). The tweet classifier is a deep neural network built with the Keras [138] and TensorFlow [139] libraries, trained with the Harvey and Irma datasets and evaluated on 15 public disaster datasets.
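The priority aggregation can be sketched as a weighted score; the weights and the weather adjustment below are our assumptions, since the paper only states that priority aggregates factors such as the assigned categories and weather conditions:

```python
# Illustrative priority scoring; weights and the weather multiplier are
# invented assumptions, not STIMULATE's actual formula.
CATEGORY_WEIGHTS = {
    "rescue_needed": 5, "DECW": 4, "injured": 4,
    "sick": 3, "water_needed": 2, "flood": 1,
}

def rescue_priority(categories, bad_weather):
    """Higher scores mean more urgent rescue."""
    score = sum(CATEGORY_WEIGHTS.get(c, 0) for c in categories)
    if bad_weather:  # e.g., derived from a weather API response
        score *= 1.5
    return score

print(rescue_priority(["rescue_needed", "DECW"], bad_weather=True))
```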
The rescue scheduling component provides tools for managing the rescue operation. It offers a web interface so that rescue teams can manage their tasks and an administrator can monitor task progress. A scheduling algorithm assigns tasks to rescue teams based on the priority computed by the tweet processing component and on the teams' capacity.
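A minimal greedy scheduler in the spirit of this component could look as follows; this is our sketch under simple assumptions (unit-capacity slots, no travel times), not the actual STIMULATE algorithm:

```python
# Illustrative greedy scheduler: assign highest-priority tasks first to any
# team with spare capacity. Capacities and priorities are example values.

def schedule(tasks, teams):
    """tasks: list of (task_id, priority); teams: dict team -> capacity.
    Returns a mapping task_id -> team; unassigned tasks are omitted."""
    assignment = {}
    remaining = dict(teams)
    for task_id, _prio in sorted(tasks, key=lambda t: -t[1]):
        for team, cap in remaining.items():
            if cap > 0:
                assignment[task_id] = team
                remaining[team] = cap - 1
                break
    return assignment

tasks = [("t1", 3.0), ("t2", 9.0), ("t3", 5.0)]
teams = {"alpha": 1, "bravo": 1}
print(schedule(tasks, teams))
```

With one slot per team, the two highest-priority tasks are assigned and the third waits for capacity to free up.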
According to the NRF core capabilities taxonomy, this system is used during the response phase in mass search and rescue operations. Figure 3 describes the mapping of the use case to the reference architecture. The system uses two data providers, Twitter and the Open Weather API, that expose a collection of interfaces. Data consumers are government institutions and NGOs, since the system aims at coordinating institutional rescue efforts and volunteers. The big data framework provider supplies data facilities (MongoDB) and task distribution (Hadoop). The case study uses neither an orchestrator component nor a management fabric. The core of the STIMULATE system is mapped onto the big data application provider. The collection component consists of a web server that processes data requests and interacts with the data provider.
Data consumers carry out these requests through the collection interface within the access component, which is implemented as a web application. The collection component stores the information in the data facilities of the big data framework provider, in this case the MongoDB database. The preparation component pre-processes incoming tweets, chaining geocoding and transformation tasks (i.e., management of slang, emojis, and contractions). Then the analytics component performs the tweet classification activity to determine the rescue priority that feeds the scheduling task. Results from the analytics component are shown in the visualization component, which provides two interfaces, one for administrators and one for rescue teams. The interface for rescue teams shows a route map for visiting each task location in order. Access to visualization is controlled by an authorization and authentication policy defined in the security and privacy fabric. The access component enables communication with data consumers; in this case, data consumers can configure and interact with the collection and visualization components.
From this simple case study, some advantages of using a reference architecture can be pointed out. The proposed reference architecture can help us evaluate the architecture, propose enhancements, and improve reusability. First, the system could benefit from using the management fabric to automate configuration, resource management, and monitoring. Second, the security and privacy fabric is only used for controlling access in the visualization component, which can be an issue, since the system should preserve confidentiality, privacy, and security. Since the collection component's functionality is not specific to this problem, the system could reuse available collection components designed with security and privacy in mind. Similarly, the preparation component is generic, and the system could benefit from a library of multi-lingual pre-processing components. Finally, the developed analytics component could be reused for other purposes; well-defined interfaces would enable its reuse and improvement.

Discussion
This article proposes a reference architecture for big data processing in disaster management. The reference architecture has been designed inductively, based on an extensive review of the literature and of the published implementation architectures in the domain. As a framework for its definition, we have chosen NBDRA, since it provides a general framework defined by a public working group with participants from industry, academia, and government. As a result, NBDRA provides a vendor-neutral, technology-agnostic, and infrastructure-independent ecosystem.
The proposed reference architecture has identified the key components that are relevant for disaster management and has categorized them based on the NRF core capabilities [19] and the CommonKADS task hierarchy [88]. The combination of both taxonomies provides an explicit schema for knowledge reusability and shows the applications of big data technologies for every core capability in managing disasters. Given that many stakeholders participate in emergency management, the definition of standardized interfaces is essential for effective coordination of efforts, for providing access to data sources while taking privacy and security concerns into account, and for customizing data consumer and data provider access. NBDRA defines functional components, and an actor can play several roles (e.g., data consumer and data provider). Since NBDRA supports the representation of stacking and chaining of big data systems [34], the cooperation of the big data systems participating in disaster management can also be represented in the proposed reference architecture. The need for cooperation is widely recognized in emergency management [140], since responses require a great diversity of skills and resources. Big data integration and Extract, Transform, and Load (ETL) technologies can be crucial for breaking down and bridging data silos [141]. Moreover, the proposed reference architecture can help in organizing and classifying existing experiences and sharing best practices.
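To illustrate how ETL can bridge data silos in this setting, here is a toy sketch (ours) that extracts records from two heterogeneous sources, transforms them to a shared schema, and loads them into one store; all field names are invented for illustration:

```python
# Toy ETL: unify social media posts and sensor readings under one schema.
# Source field names ("msg", "geo", "level_m") are invented examples.

def transform_social(post):
    return {"source": "social", "text": post["msg"],
            "lat": post["geo"][0], "lon": post["geo"][1]}

def transform_sensor(reading):
    return {"source": "sensor",
            "text": f"water level {reading['level_m']} m",
            "lat": reading["lat"], "lon": reading["lon"]}

def etl(social_posts, sensor_readings):
    store = []
    store += [transform_social(p) for p in social_posts]
    store += [transform_sensor(r) for r in sensor_readings]
    return store

unified = etl([{"msg": "street flooded", "geo": (40.4, -3.7)}],
              [{"level_m": 2.3, "lat": 40.5, "lon": -3.6}])
print(len(unified))
```

Once records share a schema, downstream analytic tasks can operate on them uniformly regardless of origin.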
We have detected that the "security and privacy fabric" component should receive more attention in this domain, since most works do not mention how they address these concerns. As discussed in some reviews of big data technologies for disaster management [3,54,65], security and privacy issues remain a big challenge. Nevertheless, this problem is not specific to disaster management, since big data introduces many privacy preservation challenges [142]. Thus, adopting a reference architecture can provide a good starting point for fostering the sharing and adoption of best practices.
A limitation of this work is that the reference architecture has been based on published research and should be complemented by consultation with domain stakeholders. Besides, the NIST Big Data Working Group has defined interfaces between the NBDRA components [36]; nevertheless, no reference implementation of NBDRA is available, which could otherwise foster its adoption. Another limitation is that we have focused on big data architectural aspects, but other aspects should also be addressed. In particular, the potential of big data can only be achieved if legal, organizational, semantic, and technical interoperability is reached [143]. Some researchers report [144] that while technical interoperability has reached a high level of maturity, semantic and legal interoperability remains a significant barrier for the sector. Future work should address semantic interoperability, taking into account existing standards, such as the OASIS Emergency Data Exchange Language (EDXL) emergency standards [145], and ontology-based semantic interoperability [146,147], in order to exploit the potential of disaster knowledge graphs [148].

Conclusions
This paper has focused on the definition of a Big Data Reference Architecture for Emergencies based on NBDRA. The aim of this work is to provide a common vocabulary that enables the discussion of Big Data architecture designs and implementations. Besides, reference models foster knowledge reusability. In the emergency domain, reuse can take place at different levels: datasets, data pipelines, and data processing and visualization software components. This research aimed at identifying the essential components of a Big Data system for emergency management. To this end, an extensive literature review has been carried out and, as a result, a reference architecture has been proposed inductively, based on emergency management experiences. We have adopted NBDRA as a generic framework for describing Big Data systems adapted to a specific domain. Another aspect we have addressed is the integration of crowdsourcing elements that enable the design and execution of hybrid data pipelines for emergency management.
We believe that Big Data analytics platforms will increasingly be integrated with crowdsourcing systems in the near future. Thus, it is essential to learn best practices and to define open models for sharing practices and components. This reference architecture is a first step towards providing a common framework for describing Big Data systems in the disaster domain. Large-scale disasters are characterized by the need for inter-organizational cooperation. When it comes to sharing data and data analytics, reference architectures can improve organizational cooperation, since standard interfaces enable the selection of swappable components and their combination.
Our future work will focus on two aspects. On the one hand, we will evaluate the reference architecture with disaster stakeholders. On the other hand, we are interested in the specification of components for exploiting disaster knowledge graphs and in the extension of NBDRA interfaces for interacting with these components.

Figure 2. High Level Big Data Framework for Emergency Management.

Figure 3. Mapping between the STIMULATE use case and the reference architecture for emergency management.

Table 1. Emergency core capabilities per emergency phase, adapted from [19], where a check mark denotes that a core capability is required in the emergency phase.
Figure 1. Overview of the NIST Big Data Reference Architecture.

Table 2. Big data granular tasks for mitigation phase.