Big Data Warehouse for Healthcare-Sensitive Data Applications
2. Healthcare Data Privacy
2.1. Data Anonymisation and Sharing
2.2. Healthcare Data Requirements
- Privacy: Privacy is essential, particularly in the healthcare sector. Patient records and their data attributes are highly exposed to attacks, so it is imperative to put in place protection mechanisms that preserve the privacy of patients and individuals when healthcare datasets are shared.
- Data Quality: High-quality data is essential for data mining and analysis, as poor-quality data leads to poor-quality results. Therefore, shared data should retain attribute values that are detailed enough to serve the purpose of the mining and analysis. One must also carefully take into account the curse of high dimensionality and maintain truthfulness at the individual record level.
- Flexibility: Privacy protection should be flexible enough for various analysis tasks and mining techniques. The ideal approach is to implement privacy-preservation solutions independently from the mining algorithms and research purposes.
- Compatibility: Privacy-preserving models should comply with and support the system reference architecture.
- Utility: Provide a level of support to allow researchers to re-visit patient data following appropriate access control and ethics mechanisms.
2.3. Existing Data Privacy-Enhancing Techniques
2.4. Comparative Analysis of the Existing State-of-the-Art Healthcare Systems
- A back-end layer with database management system for data collection, de-identification, and anonymisation of the original datasets.
- The role-based permissions and secured views are implemented in the access control layer.
- The controller layer regulates the data access protocols for any data access and data analysis.
3. BigO System—Overview
- Children and adolescents within an age band (9–18 years old) who act as data providers in one of the following capacities:
- School students, via organised school efforts on projects around physical activity, eating, and sleep.
- Patients attending obesity clinics.
- Individual volunteers.
- Teachers running the organised school efforts with students.
- Clinicians treating patients in clinics.
- Public Health officers (researchers or policymakers) evaluating the behavioural indicators of children and adolescents in a geographical region in combination with the Local Extrinsic Conditions (LECs) relevant to obesity.
- Administrators for school, clinic, and the whole BigO platform.
4. BigO Data Collection
- Personal data sources (Behavioural): These raw data are collected at the individual level from the citizen-scientists and concern the behavioural patterns that are relevant to the BigO study (e.g., how one moves, eats, sleeps). The raw data in this category are collected from personal portable and/or wearable devices. These sources are further categorised according to the mobile sensory data acquisition device, i.e., (a) Smartphone, (b) Smartwatch, and (c) Mandometer. We combine the devices in three settings based on the data collection requirements of the BigO system and the availability of the peripheral sensors (Table 2).
- Population data sources (Statistics): The raw data sources contain information characterising the population residing in a given area (countryside, cities, etc.). The data providers include national statistical authorities of the countries involved in BigO. Depending on the type of population statistics, the raw data sources are further classified as (a) Demographic data sources and (b) Socioeconomic raw data sources concerning the population of a city of interest and their administrative regions.
- Regional data sources (Geospatial): These incorporate geospatial data that are linked with the BigO areas of interest (countries, cities, or administrative city regions).
- Mapping data sources (Layered Maps): The web mapping data are provided by third-party APIs. Depending on the data type, these sources are further classified as (a) Maps (i.e., interactive terrain maps) and (b) Points-of-Interest (PoIs).
5. Big Data Warehouse Architecture
- Original data: In the BigO system, the raw data is stored following two main schemas, implemented using two modern databases: MongoDB and Cassandra. We created two schemas for two reasons. The first is security and privacy: separation reduces the risk of data access violation and intentional manipulation. The second is that the two schemas are not used in the same way during the analysis or by the various roles. For instance, Cassandra is used for storing time-series data, while MongoDB is used for the rest. Moreover, external sources of data were also used during the analysis, including authority databases and national statistical databases.
- De-identified statistics data: All information about users’ identities is removed. This consists of individual aggregated data and population statistics, i.e., the statistical data derived and computed from the original and reference data.
- Anonymised data: In addition to de-identification, we further anonymise the data so that the original data cannot be recovered from the knowledge extracted during the data analysis, using data mining algorithms.
5.1. Access Control Layer
5.2. Controller Layer
5.3. Data Storage and Integration
5.3.1. MongoDB Data
- Portal controllers: There are five portals in the BigO system: the admin portal, school portal, community portal, public health authorities portal, and clinical portal. The data is collected from all these portals and integrated before being stored in the MongoDB database.
- Mobile controller: This controller deals with the data collected from either mobile phones or smartwatches. The data is pre-processed, integrated, and then transferred to MongoDB. This data is of particular importance, as it is sensitive and should be handled with care.
- Analysis service: The data analysis component processes the whole dataset, which is stored across the two DBs. The analysis component uses a Spark computing environment. The results of the analysis are stored back in the MongoDB database.
- Back-end analysis service: This service accesses the data stored in both DBs to extract the behavioural indicators. These new data attributes are then stored in MongoDB. The back-end analysis service is executed in the Spark environment.
5.3.2. Cassandra Data
- External sources: The external data sources include individual behavioural data and extrinsic population data. These datasets are transmitted directly to the database through the external data integration modules (Figure 6); an example is data about individuals’ devices.
- Mobile app controller: Data collected by the mobile applications through a smartphone or smartwatch is stored, using a mobile app content provider, in the DB of the mobile app. The stored data is then synchronised with the Cassandra DB (Figure 6).
6. Data Security and Privacy
- Secure storage: The BigO system architecture is implemented using mainstream platforms, such as the Cassandra, MongoDB, and SQLite database management systems, the HDFS file system of Hadoop, Android, and iOS. These standard systems employ built-in encryption technologies to secure both the system components and the data that they contain. Section 6.1 provides more details about each type of storage.
- Secure communications: The data must be secured when it is transferred between various modules of the system. Communications between BigO modules use secure protocols, such as SSL, TLS, or HTTPS (See Section 6.1 for more details).
- Data access control: Access control is a complex issue in large systems, such as BigO. Therefore, we implemented a full solution for access control consisting of consistent policies, clear and powerful mechanisms for registration, authentication, and authorization. They are summarised in the following:
- Mobile app storage: In BigO, the mobile app data storage can be accessed only by the back-end of that mobile phone and contains only personal data of its owner. Therefore, no fine-grained access control is required in this context. It uses a simple access control based on username and password.
- Auxiliary file storage: This is a temporary storage for processing raw data and used by the mobile back-end via its controller. Like mobile app storage, it needs only a basic access control with username and password.
- Database servers: These are the main storage and contain all the BigO data for analysis. The data is available to a variety of end-users depending on their roles. A Role-Based Access Control (RBAC) mechanism is employed with specific policies for granting roles and permissions. Each database is accessed via a RESTful API (see Section 6.2 for details).
- Inertial sensor, movement, and Mandometer data: These data types are collected and stored on personal devices (smartwatches, mobile phones, Mandometers). Only statistical and generalised data is extracted and submitted to the BigO server. These kinds of data do not raise privacy risks.
- Photographs: All the photos are reviewed by BigO admins. Any photo that is deemed to be irrelevant, indecent, or reveals the user’s identity is deleted from the system.
- Identifiable data: All attributes, such as username, deviceID, etc., are removed before storing the data in the data warehouse. The operation of removing such attributes is called de-identification (Section 6.3).
- Quasi-identifiable data: All data attributes, such as country, region, school, clinic, height, weight, gender, birth year, self-assessment answers, locations of photos, etc., are dealt with by anonymisation. However, because anonymisation degrades data quality, a privacy-aware protocol is applied to handle this type of data (Section 6.3).
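The role-based access control mentioned above for the database servers can be illustrated with a minimal sketch. The role names follow the user roles described in Section 6.2, but the permission names, the `ROLE_PERMISSIONS` table, and the `is_allowed` helper are assumptions made for illustration and are not the BigO implementation.

```python
from enum import Enum, auto


class Permission(Enum):
    """Hypothetical permissions; the real BigO policy set is not given in the text."""
    REGISTER_ADMINS = auto()
    REVIEW_PHOTOS = auto()
    EDIT_SCHOOL = auto()
    REGISTER_TEACHERS = auto()
    EDIT_STUDENTS = auto()
    CREATE_REGISTRATION_CODES = auto()


# Each role is granted a fixed set of permissions (assumed mapping).
ROLE_PERMISSIONS = {
    "bigo_admin": {Permission.REGISTER_ADMINS, Permission.REVIEW_PHOTOS},
    "school_admin": {Permission.EDIT_SCHOOL, Permission.REGISTER_TEACHERS},
    "teacher": {Permission.EDIT_STUDENTS, Permission.CREATE_REGISTRATION_CODES},
    "student": set(),
}


def is_allowed(role: str, permission: Permission) -> bool:
    """Return True only if the role is known and has been granted the permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```

Unknown roles fall back to an empty permission set, so access is denied by default rather than granted by omission.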
6.1. Data Protection
- Authorization: User-defined roles can be defined in MongoDB to configure granular permissions for a user or an application based on the privileges they need. Moreover, one can define views that expose only a subset of data from a given collection.
- Auditing: For regulatory compliance, the MongoDB security model keeps a native audit log to track access and operations performed against the database.
- Encryption: The MongoDB security system offers encryption of data on the network, on disk, and in backups. By encrypting database files on disk, one eliminates both the management and performance overhead of external encryption mechanisms.
- Monitoring and Backup: MongoDB ships with a variety of tools, including mongostat, mongotop, and MongoDB Management Service (MMS), to monitor the database. Sudden peaks in the CPU and memory loads of the host system and high operation counters in the database can indicate a denial-of-service attack.
- Encryption: Encryption maintains data confidentiality. Usually, DB data encryption falls into two categories: Encryption At-Rest and Encryption In-Flight. The first refers to the protection of data that is stored on persistent storage. The second refers to the encryption of data as it moves over a network between nodes, or between clients and nodes, within a DSE cluster.
- DSE Transparent Data Encryption (TDE): is the feature responsible for the encryption of at-rest data in a DSE system. DSE TDE protects sensitive at-rest data using a local encryption key file or a remotely stored and managed Key Management Interoperability Protocol (KMIP) encryption key.
- Authentication: refers to the process of establishing the identity of the person or system performing an operation against the database. DSE Unified Authentication facilitates connectivity to four primary mechanisms for authentication, as described below. It extends the same authentication schemes to the database, DSE Search, and DSE Analytics.
- Authorisation: In DSE, the authorisations determine which resources (i.e., tables, keyspaces, etc.) can be read, written, or modified by a connected entity, as well as their connection mechanisms. DSE uses the GRANT/REVOKE paradigm to prevent any improper access to the data and supports three mechanisms for user authorisation: Role-Based Access Control (RBAC), Row-Level Access Control (RLAC), and Proxy Auth.
- Auditing: Data auditing makes it possible to track and log all the user activities performed on the database in order to prevent unauthorised access to information and meet compliance requirements. With DSE, all or a subset of the activity that takes place on a DataStax cluster is recorded along with the identity of the user and the time at which the activity was performed. Efficient auditing in DSE is implemented via the log4j mechanism that is built into the platform.
- Drivers: DataStax provides drivers for C/C++, C#, Java, Node.js, ODBC, Python, PHP, and Ruby that work with any cluster size, whether deployed on-premise or in cloud data centres. These drivers can be configured with features, such as SSL, to ensure that users interact with the DSE clusters safely and securely.
6.1.1. Auxiliary File Storage
6.1.2. Data Transmission
6.2. Data Access Control
- BigO Administrator: This account is created by the BigO developers and it is fixed. A BigO administrator can register the school and clinic administrators. The BigO administrator can also review submitted pictures to remove inappropriate pictures or pictures that compromise the privacy of individuals.
- School Administrator can add/edit school details and register the teachers.
- Clinic Administrator can add/edit clinic details and register clinicians.
- Teachers can create groups, edit student groups and individual student details, such as BMI, school exercise schedule, etc., and can create registration codes for students.
- Clinicians can create registration codes for patients and edit individual patient details, such as BMI.
- Students can register through the BigO mobile app using a registration code provided by their teacher. When a teacher creates a student account, a registration code is generated and stored in the database. The student enters the registration code the first time s/he uses the app. Once the registration code is “redeemed”, the student is registered in the system and the registration code is no longer valid.
- Patients register with a registration code provided by their clinician; same as the students.
- Each user has a username and a password (for students and patients, these are auto-generated and stored on the mobile phone, without the involvement of the user). The password is salted and hashed, and the encoded password is stored in the database.
- When the mobile phone needs to access a restricted REST endpoint, it first asks the authentication server for a JSON Web Token (JWT) by presenting the user credentials. The presented password is salted (with the same salt) and hashed, and the authentication server compares the result with the stored encoded password. If they match, it provides the user with a valid JWT.
- Using the JWT, the mobile app and the web portals/applications can access the restricted REST endpoints until the token expires. After expiration, the mobile app asks the authentication server for a new JWT and the process is repeated.
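The authentication steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the BigO authentication server: the key-derivation function (PBKDF2), the HS256 signature, the `SECRET_KEY` value, the one-hour lifetime, and all function names are choices made for the example, since the text does not specify them.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b"demo-secret"  # hypothetical signing key; a real server keeps this private


def hash_password(password: str, salt: bytes) -> bytes:
    """Salt and hash a password (PBKDF2 is an assumption; the text names no algorithm)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> str:
    return _b64url(hmac.new(SECRET_KEY, signing_input.encode(), hashlib.sha256).digest())


def issue_jwt(username: str, ttl_seconds: int = 3600, now=None) -> str:
    """Build a compact HS256-signed token carrying the subject and an expiry time."""
    now = int(time.time()) if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"sub": username, "exp": now + ttl_seconds}).encode())
    return f"{header}.{payload}.{_sign(header + '.' + payload)}"


def verify_jwt(token: str, now=None):
    """Return the claims if the signature is valid and the token has not expired, else None."""
    now = int(time.time()) if now is None else now
    parts = token.split(".")
    if len(parts) != 3:
        return None
    header, payload, signature = parts
    expected = _sign(header + "." + payload)
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    claims = json.loads(_b64url_decode(payload))
    return claims if claims["exp"] > now else None


def login(user_store: dict, username: str, password: str):
    """Hash the presented password with the stored salt and issue a JWT on a match."""
    record = user_store.get(username)
    if record is None:
        return None
    candidate = hash_password(password, record["salt"])
    if not hmac.compare_digest(candidate, record["hash"]):
        return None
    return issue_jwt(username)
```

A client would call `login` once, attach the returned token to each request for a restricted endpoint, and call `login` again when `verify_jwt` starts returning `None` because the token has expired.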
6.3. Data Privacy Protection
6.3.1. Deidentification and Pseudonymisation
6.3.3. Privacy-Aware Data Analysis Protocol
- Deidentification: The identifiable attributes and non-important attributes for analysis are removed in this step.
- Anonymisation Preparation: Some attributes require minor treatment to support the generation of secured views. This is often the case for date and numerical attributes, but not for categorical ones. For example, it is usually not essential to keep detailed height values, so we round them off to ranges. This transformation is not an anonymisation operation: it changes the data content only slightly, and the information needed for analysis is nearly preserved. Hence, we consider this step as anonymisation preparation. Another important job of this task is to create the anonymisation preliminaries (including taxonomy trees for quasi-identifiable attributes), which are used to generate secured views.
- Secured Views Generation: Secured views are created to help data scientists inspect and understand the dataset from a variety of perspectives but not reveal the linkages between the patients and their sensitive information. There are three types of secured views:
- Statistical View: This provides automatically calculated measures for the attributes, such as standard deviations, domain ranges, and value statistics.
- Anonymised View: This provides the whole view of shared datasets. For privacy protection, we applied the Privacy and Anonymity in Information Security (PAIS) algorithm to achieve the LKC-privacy model. LKC-privacy prevents record and attribute linkage attacks for high-dimensional datasets. PAIS uses a top-down search strategy on taxonomy trees to find a sub-optimal generalisation for records. For general analysis tasks, discernibility cost is used as the measure to choose the best specialisation.
- Anatomised View: Since k-anonymity is a condition of LKC-privacy, the results of PAIS suffer from the problem of high-dimensionality. As a consequence, anonymised views may provide too general views on quasi-identifiers. Detailed or anatomised views are also provided using the anatomy technique.
- Feature Selection: After examining the datasets with different views, the data scientists can choose appropriate transformation, feature selection, and extraction methods to generate proper input data for their application-specific analysis tasks. The processing is done on the de-identified and non-anonymised dataset.
- Data Mining and Result Anonymisation: Data scientists can choose various analysis methods. The returned results can be too detailed in some cases. For instance, a decision tree (the output of a classification algorithm) may have detailed leaf nodes that each correspond to only a few specific individuals. Therefore, the mined results must be checked and filtered before being released to researchers in order to guarantee children’s privacy.
- Presentation and Evaluation: The resulting models are evaluated, and the data analysts can restart their analysis from the inspection step if necessary.
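The anonymisation preparation and anonymised view steps of the protocol above can be illustrated with a small sketch: heights are rounded to ranges, a quasi-identifier is generalised along a taxonomy tree, and the resulting equivalence classes are scored with the discernibility cost (the sum of squared class sizes). The taxonomy, the field names, and the uniform bottom-up generalisation are assumptions for illustration only. PAIS itself searches top-down through specialisations and enforces the full LKC-privacy model, neither of which is reproduced here; the check below covers plain k-anonymity only.

```python
from collections import Counter

# Hypothetical taxonomy tree for a quasi-identifier: child value -> parent value.
REGION_TAXONOMY = {
    "Dublin": "Ireland", "Cork": "Ireland",
    "Athens": "Greece", "Thessaloniki": "Greece",
    "Ireland": "Europe", "Greece": "Europe",
}


def height_range(height_cm: float, width: int = 10) -> str:
    """Anonymisation preparation: round a height to a fixed-width range such as '150-159'."""
    low = int(height_cm // width) * width
    return f"{low}-{low + width - 1}"


def generalise(value: str, taxonomy: dict, levels: int = 1) -> str:
    """Climb the taxonomy tree by the given number of levels (stops at the root)."""
    for _ in range(levels):
        value = taxonomy.get(value, value)
    return value


def equivalence_classes(records: list, levels: int) -> Counter:
    """Group records by their generalised quasi-identifier values and count each group."""
    return Counter(
        (generalise(r["city"], REGION_TAXONOMY, levels), height_range(r["height_cm"]))
        for r in records
    )


def is_k_anonymous(classes: Counter, k: int) -> bool:
    """Every equivalence class must contain at least k records."""
    return all(size >= k for size in classes.values())


def discernibility_cost(classes: Counter) -> int:
    """Sum of squared class sizes; lower means records remain more distinguishable."""
    return sum(size * size for size in classes.values())
```

Among the generalisation levels that satisfy the privacy condition, the one with the lowest discernibility cost loses the least detail, which is why the cost is a natural selection criterion for general analysis tasks.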
7. Implementation for Privacy-Aware BigO System Architecture
7.1. Description of Architectural Changes
- Separation of the MongoDB database: Unlike the BigO component diagram (Figure 6) with one MongoDB database, the collections of this database are separated into three databases:
- First MongoDB database (Original data): This database contains administrative data and collected/measured data, including collections USERS, CHILDREN, MEALS, TIMELINES, FOOD_ADVERTISEMENTS, DAILY_ANSWERS, and PHOTOS.
- Second MongoDB database (Reference data): This database stores data unrelated to individuals and used for reference. The list consists of collections SCHOOLS, CLINICS, GROUPS, REGIONS, and PUBLIC_POIS.
- Third MongoDB database (including Individual Aggregated Data and Population Statistics): Individual Aggregated Data includes collections summarising periodically behavioural data of individual children such as DAILY, WEEKLY, and STATISTICS. Population Statistics Data comprises collections COUNTERS, PUBLIC_POIS_VOTES, GEOHASH_VOTES, GEOHASH_ATTRIBUTES, and HISTOGRAMS.
- Separation of APIs: The updated BigO architecture supports four different APIs to access the Cassandra database and the three MongoDB databases.
- De-identification module: This module removes identifiable fields as well as the fields unnecessary for analysis and does not require pseudonymisation.
- Periodic aggregation module: This module periodically aggregates the behavioural data of children.
- Statistics measurement module: This module pre-computes statistics of some populations that are used for visualisation features and generating the statistical view.
- Anonymisation preparation module: This module conducts the tasks described in the Anonymisation Preparation step of the aforementioned privacy-aware protocol. The outputs of this module are Anonymisation Preliminaries and de-identified data for analysis.
- Anonymisation preliminaries: These are saved in the format of JSON or XML.
- De-identified data for analysis: This data storage should not store discrete collections like in the database of Individual Aggregated Data and Population Statistics. The data for analysis should be the combined datasets in the formats convenient for generating secured views and mining algorithms. A good choice is CSV files stored in the file system storage of Hadoop.
- Secured views: The data for analysis is accessed through secured views. There are modules taking responsibility for generating secured views and running feature selections/mining algorithms on the de-identified data for analysis.
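The de-identification and periodic aggregation modules listed above can be sketched as two small functions. The field names (`username`, `device_id`, `display_id`, `steps`) and the choice of ISO weeks as the aggregation period are assumptions for illustration; the text states only that identifiable fields are removed and that behavioural data is aggregated periodically.

```python
from collections import defaultdict
from datetime import date

# Hypothetical names of identifiable fields to strip before analysis.
IDENTIFIABLE_FIELDS = {"username", "device_id"}


def deidentify(record: dict) -> dict:
    """Return a copy of the record without the identifiable fields."""
    return {k: v for k, v in record.items() if k not in IDENTIFIABLE_FIELDS}


def aggregate_weekly(daily_records: list) -> dict:
    """Summarise daily step counts per child and ISO week.

    Keys are (display_id, iso_year, iso_week); values hold the weekly total
    and the mean over the days that actually have data.
    """
    totals = defaultdict(lambda: {"steps": 0, "days": 0})
    for rec in daily_records:
        iso = rec["day"].isocalendar()
        key = (rec["display_id"], iso[0], iso[1])
        totals[key]["steps"] += rec["steps"]
        totals[key]["days"] += 1
    return {
        key: {
            "total_steps": v["steps"],
            "mean_daily_steps": v["steps"] / v["days"],
        }
        for key, v in totals.items()
    }
```

Running `deidentify` before `aggregate_weekly` keeps the aggregation code from ever seeing the identifiable fields, which matches the separation between the original and the aggregated databases.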
7.2. Process of Updating Data Changes
- Changes of administration data: The administration data (e.g., email, name, address) is entered manually, so it sometimes contains mistakes that require updates. Since this data type is not extracted and stored in other databases, synchronisation is not a problem.
- Insertion of new measures: The behavioural measures are uploaded frequently from the mobile app to the original databases. After certain periods of time, to reflect the changes in the original data, the new summarised data is added to the individual aggregated data storage and the existing statistics are updated in the Population Statistics storage. The anonymisation preliminaries and the de-identified data for analysis are also re-computed.
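Updating existing statistics when new measures arrive does not require recomputing them from scratch. The sketch below shows one possible incremental scheme for a histogram and a running mean. The function names, the bin width, and the choice of these two statistics are assumptions for illustration and are not taken from the BigO code base.

```python
from collections import Counter


def update_histogram(histogram: Counter, new_values, bin_width: int) -> Counter:
    """Add new measurements to an existing histogram keyed by the lower bin edge."""
    for value in new_values:
        histogram[int(value // bin_width) * bin_width] += 1
    return histogram


def merge_mean(count: int, mean: float, new_values) -> tuple:
    """Combine a stored (count, mean) pair with new measurements."""
    new_values = list(new_values)
    total = count + len(new_values)
    if total == 0:
        return 0, 0.0
    return total, (mean * count + sum(new_values)) / total
```

Only the stored summary and the newly uploaded values are needed, so the periodic update never has to re-read the full original data.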
8. Current Picture, Recommendations, and Future Directions
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
|BigO|Big Data for Obesity|
|GPS|Global Positioning System|
|HIPAA|Health Insurance Portability and Accountability Act|
|EHR|Electronic Health Records|
|PHA|Public Health Authorities|
|LEC|Local Extrinsic Conditions|
|GIS|Geographic Information System|
|DBMS|Database Management System|
|API|Application Programming Interface|
|JWT|JSON Web Tokens|
Appendix A. Schema Design
Appendix A.1. MongoDB Schema
- Regions: This collection stores interesting features of administration regions to represent all types of geographical or administrative areas. Therefore, a region can be a town, a city, a province or even a country. Each region has a series of coordinates that form its boundaries. The boundary information will be provided by public authorities. Depending on region type, fields of region characteristics can be added later.
- Geohash: Owing to its hierarchical structure and the ease of storing it in databases, geohash is used in the BigO system to represent geographical information. Data on children’s activities in some areas are recorded; however, only anonymous and statistical data are stored in collections Geohash_attributes and Geohash_votes.
- Public POIs: Two collections, Public_pois and Public_pois_votes, store aggregated activity information of children at public points of interest. Hence, the structures of collections Geohash_votes and Public_pois_votes are nearly the same.
- Schools and Clinics: These collections store the information of the organized schools and clinics.
- Groups: Students who take part in BigO system via the organized schools are divided into groups. Each group is managed by a teacher.
- Users and Children: BigO system has many types of users but only children data is collected for research. Therefore, Users collection is used to manage administrative fields while Children collection includes fields recorded for research. The available fields of a user depend on his/her role. Some special fields are shared since they are necessary for popular queries over the two collections. This duplication requires a small extra cost when inserting a new child but can improve the performance for many operations.
- Photos: The “photo” fields of collections Meals and Food_Advertisements in the previous version have been moved into a separate collection, Photos. Grouping all photos in a dedicated collection simplifies the management and verification of photos.
- Timelines and Mobility: Collection Mobility in the old version is replaced and extended to collection Timelines. The new collection contains not only the visited location but also the traveling activities of children.
- Daily_answers: This new collection contains the answers to the daily questions.
- Counters: Display IDs must be generated for children (in collection Users). These IDs are generated from an auto-incrementing sequence by calling function getNextChildDisplayId(“children”). Each time it is called, this simple function adds 1 to field “child_seq” of the document having “_id” = “children” in collection Counters and returns the new value as the new display ID. The same technique and the Counters collection can be used to create other auto-incrementing sequences.
- Sleeps, Mobility, Meals, and Food advertisements: These collections store data on the daily activities of children. In particular, meals and food advertisements include photos taken by children.
- Eating habits: This collection contains the information of eating habits that can be extracted and aggregated from Meals.
- Statistics: The collection is used to store statistical information, such as the number of photos.
- Daily and Weekly: The data of activities can be aggregated daily and weekly and then stored in these collections.
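The auto-incrementing sequence described for the Counters collection can be sketched in memory as follows. The `Counters` class and its lock are stand-ins chosen for illustration. In MongoDB itself, the increment and the read would normally be a single atomic update using the `$inc` operator (for example through a find-and-modify operation) rather than a Python lock.

```python
import threading


class Counters:
    """In-memory stand-in for the Counters collection (illustrative only)."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def get_next_child_display_id(self, counter_id: str = "children") -> int:
        """Add 1 to 'child_seq' of the counter document and return the new value."""
        with self._lock:  # emulates the atomicity of a single database update
            doc = self._docs.setdefault(
                counter_id, {"_id": counter_id, "child_seq": 0}
            )
            doc["child_seq"] += 1
            return doc["child_seq"]
```

Because each call increments and reads under the same lock, two concurrent registrations cannot receive the same display ID, and a second counter document with a different `_id` yields an independent sequence.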
Appendix A.2. Cassandra Schema
- Physical_activity_by_user: stores the data of activities (i.e., walking, standing, sitting, running, etc.) for each user.
- Physical_activity_by_date: stores the data of activities (i.e., walking, standing, sitting, running, etc.) of users for each specific date and time.
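The two Cassandra tables above hold the same activity data organised for two different queries, which is a common query-driven design in Cassandra. The sketch below mimics this with two in-memory indexes. The `ActivityStore` class, the field names, and the choice of user ID and calendar date as the grouping keys are assumptions for illustration; the actual partition keys are not given in the text.

```python
from collections import defaultdict
from datetime import datetime


class ActivityStore:
    """Writes each activity row to two lookup structures, one per query pattern."""

    def __init__(self):
        self.by_user = defaultdict(list)   # stand-in for Physical_activity_by_user
        self.by_date = defaultdict(list)   # stand-in for Physical_activity_by_date

    def insert(self, user_id: str, timestamp: datetime, activity: str) -> None:
        """Store the same row under both the user key and the date key."""
        row = {"user_id": user_id, "timestamp": timestamp, "activity": activity}
        self.by_user[user_id].append(row)
        self.by_date[timestamp.date()].append(row)
```

Duplicating each write costs extra storage, but both "all activities of one user" and "all activities on one date" can then be answered by a single key lookup.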
- Sweeney, L. Datafly: A system for providing anonymity in medical data. In Database Security XI; Springer: Berlin/Heidelberg, Germany, 1998; pp. 356–381. [Google Scholar]
- Chiang, Y.C.; Hsu, T.s.; Kuo, S.; Liau, C.J.; Wang, D.W. Preserving confidentiality when sharing medical database with the Cellsecu system. Int. J. Med. Inform. 2003, 71, 17–23. [Google Scholar] [CrossRef]
- Agrawal, R.; Johnson, C. Securing electronic health records without impeding the flow of information. Int. J. Med. Inform. 2007, 76, 471–479. [Google Scholar] [CrossRef] [PubMed]
- Prasser, F.; Kohlmayer, F.; Lautenschläger, R.; Kuhn, K.A. ARX—A comprehensive tool for anonymizing biomedical data. In Proceedings of the AMIA Annual Symposium Proceedings. American Medical Informatics Association, Washington, DC, USA, 19–21 May 2014; Volume 2014, p. 984. [Google Scholar]
- Nguyen, T.A.; Le-Khac, N.A.; Kechadi, M.T. Privacy-aware data analysis middleware for data-driven ehr systems. In Proceedings of the International Conference on Future Data and Security Engineering, Ho Chi Minh City, Vietnam, 29 November–1 December 2017; pp. 335–350. [Google Scholar]
- Tran, N.H.; Nguyen-Ngoc, T.A.; Le-Khac, N.A.; Kechadi, M. A Security-Aware Access Model for Data-Driven EHR System. arXiv 2019, arXiv:1908.10229. [Google Scholar]
- Zeilenga, K. Lightweight Directory Access Protocol (LDAP): Technical Specification Road Map; Technical Report, RFC 4510, June; OpenLDAP Foundation: Minden, NV, USA, 2006. [Google Scholar]
- Sun, J.; Gao, Z. Improved mobile application security mechanism based on Kerberos. In Proceedings of the 2019 4th International Workshop on Materials Engineering and Computer Sciences, Bangkok, Thailand, 17–19 May 2019; pp. 108–112. [Google Scholar]
- Tewari, H.; Hughes, A.; Weber, S.; Barry, T. X509Cloud—Framework for a ubiquitous PKI. In Proceedings of the MILCOM 2017—2017 IEEE Military Communications Conference (MILCOM), Baltimore, MD, USA, 23–25 October 2017; pp. 225–230. [Google Scholar]
- US, I.C. Secure and Protect Cassandra Databases with IBM Security Guardium. Available online: https://www.ibm.com/developerworks/library/se-secure-protect-cassandra-databases-ibm-security-guardium-trs/index.html (accessed on 5 October 2020).
- Xiong, L.; Truta, T.M.; Fotouhi, F. Report on international workshop on privacy and anonymity in the information society (PAIS 2008). ACM SIGMOD Rec. 2009, 37, 108–111. [Google Scholar] [CrossRef]
- Rafiei, M.; Wagner, M.; van der Aalst, W.M. TLKC-privacy model for process mining. In Proceedings of the International Conference on Research Challenges in Information Science, Limassol, Cyprus, 23–25 September 2020; pp. 398–416. [Google Scholar]
| Data category | Description | Example attributes | Type |
|---|---|---|---|
| Identifiers | Person identification | name, email, address, phone number | I |
| Demographics | Person classification to a specific group of the population | race, age, gender, area, postal code, education, occupation, marital status | Q-I |
| Personal biometrics | Medical information related to physical health | X-ray, MRI, ultrasound, blood pressure, cholesterol, heart rate, allergies, ICU incidents, test reports | S |
| Clinical information | Medical history | diagnoses, dosages, treatment services, medication, encounters, problems, therapies | Q-I, S |
| Mental information | Related to psychological, psychiatric, and psychosocial issues | sleep problems, psychology, excessive dieting, psychological sexual disorders | S |
| Life-style and activity information | Relevant to physical activities and life-style | physical activities, exercise regime, nutrition, energy consumption through exercise | S |
| Insurance and financial matters | Related to billing, reimbursements, insurance | DRG, financial class, primary and specialist providers | Q-I, S |

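As an illustration of how the attribute types in the table above might drive de-identification, the following is a minimal sketch, not the authors' implementation. It assumes that "I" stands for identifier, "Q-I" for quasi-identifier, and "S" for sensitive attribute, which is the usual reading in the privacy-preserving data publishing literature. The names `ATTRIBUTE_TYPES`, `generalise`, and `deidentify`, as well as the specific generalisation rules (ten-year age bands, truncated postal codes), are hypothetical choices made only for this example.

```python
from typing import Any, Dict

# Attribute types as abbreviated in the table. The expansion is an assumption:
# "I" = identifier, "Q-I" = quasi-identifier, "S" = sensitive attribute.
ATTRIBUTE_TYPES: Dict[str, str] = {
    "name": "I",
    "email": "I",
    "age": "Q-I",
    "postal_code": "Q-I",
    "gender": "Q-I",
    "blood_pressure": "S",
    "sleep_problems": "S",
}


def generalise(attribute: str, value: Any) -> Any:
    """Coarsen a quasi-identifier (hypothetical rules for illustration)."""
    if attribute == "age":
        low = (int(value) // 10) * 10
        return f"{low}-{low + 9}"  # ten-year age band
    if attribute == "postal_code":
        text = str(value)
        return text[:2] + "*" * (len(text) - 2)  # keep only the first two characters
    return value  # no rule defined: release unchanged


def deidentify(record: Dict[str, Any]) -> Dict[str, Any]:
    """Drop identifiers, generalise quasi-identifiers, keep sensitive values."""
    released: Dict[str, Any] = {}
    for attribute, value in record.items():
        kind = ATTRIBUTE_TYPES.get(attribute)
        if kind is None or kind == "I":
            continue  # suppress identifiers and anything unclassified
        if kind == "Q-I":
            released[attribute] = generalise(attribute, value)
        else:
            released[attribute] = value  # sensitive values kept for analysis
    return released


if __name__ == "__main__":
    example = {"name": "A. Patient", "age": 14, "postal_code": "D04X",
               "blood_pressure": "110/70"}
    print(deidentify(example))
```

Dropping unclassified attributes by default is a conservative choice in this sketch. Suppression and generalisation alone do not guarantee a formal privacy model such as k-anonymity; that would require checking the released records as a group.
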

| Collected Data Type | Light | Standard | Enhanced |
|---|---|---|---|
| GPS | SP | WB (with GPS) | WB |
| Food Barcode scanning | SP | SP | SP |
| Meal eating behaviour | - | Limited MM use | Extended MM use |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Shahid, A.; Nguyen, T.-A.N.; Kechadi, M.-T. Big Data Warehouse for Healthcare-Sensitive Data Applications. Sensors 2021, 21, 2353. https://doi.org/10.3390/s21072353