Availability of Real-World Data in Italy: A Tool to Navigate Regional Healthcare Utilization Databases

The purpose of the study was to map and describe the healthcare utilization databases (HUDs) available in Italy’s 19 regions and two autonomous provinces and develop a tool to navigate through them. A census of the HUDs covering the population of a single region/province and recording local-level data was conducted between January 2014 and October 2016. The characteristics of each HUD regarding the start year, data type and completeness, data management system (DMS), data protection procedures, and data quality control adopted were collected through interviews with the database managers using a standard questionnaire or directly from the website of the regional body managing them. Overall, 352 HUDs met the study criteria. The DMSs, anonymization procedures of personal identification data, and frequency of data quality control were fairly homogeneous within regions, whereas the number of HUDs, data availability, type of identification code, and anonymization procedures were considerably heterogeneous across regions. The study provides an updated inventory of the available regional HUDs in Italy and highlights the need for greater homogeneity across regions to improve comparability of health data from secondary sources. It could represent a reference model for other countries to provide information on the available HUDs and their features, enhancing epidemiological studies across countries.


Introduction
In the past few decades, large amounts of information underwent digitalization in healthcare, where a number of administrative data related to the utilization of healthcare services and financial and clinical information are routinely and continuously collected in large databases (healthcare utilization databases, HUDs). In Italy, the National Health Service (NHS) provides healthcare to all residents (about 60 million), irrespective of income, gender, or other factors. Healthcare is publicly financed, and services are either free at the point of delivery or involve co-payment of a small flat rate [1]. The Italian NHS is decentralized and organized at three levels: national, regional (19 regions and two autonomous provinces), and local. The general NHS objectives and principles are set at the national level, which also allocates the financial resources. Regional authorities are responsible for organizing and managing the healthcare services, which are delivered through local structures, as well as public and private accredited providers (local facilities providing healthcare services on behalf of the NHS) [2].
Data on the services provided to residents are collected by hospitals and local healthcare structures, entered into structured data files (HUDs) by dedicated regional offices, and periodically sent to the Ministry of Health. Data are registered and stored according to the Italian and European General Data Protection Regulation [3,4]. HUDs are then used at the regional and national level for purposes such as reimbursements, health expenditure monitoring and control, and healthcare service performance assessments. In particular, hospitals and local healthcare structures (i) provide healthcare services and bear the related costs, (ii) register information on the services and costs, and (iii) periodically send data to dedicated offices of their own region in order to be reimbursed for the services provided. Reimbursements require high data completeness and quality. Although a few mandatory HUDs were set up in the early 1990s [5], the vast majority were established in 2003 [6] or later.
As a result of NHS decentralization, the mandatory HUDs were set up at different times. The regions/provinces also set up other electronic databases, such as disease and mortality registries, not for administrative use but to monitor the health status of the population.
Although a number of studies drew data from HUDs to produce scientific evidence, very few did so by combining HUDs from different Italian regions. The reasons probably include difficulties in HUD accessibility, heterogeneous information systems, and inconsistent data completeness and quality across databases. The ability to combine regional HUDs would not only provide national-level data, but also enable comparing disease burden and healthcare service performance and expenditure among regions; it would also allow exploring rare diseases and their determinants, the adverse events of treatments, and the effects of innovative therapies. A tool enabling easy retrieval of the information stored in HUDs spanning a discrete geographical area and population would enhance the use of real-world data in epidemiological and clinical research.
The aims of this study were to make an inventory of the Italian regional HUDs, to describe them in terms of start year, data type and completeness, data management system (DMS), quality control strategy, and data protection procedures in place, and to develop a tool to navigate through them.

Materials and Methods
From January 2014 to October 2016, a survey aimed at identifying the HUDs active in Italy's 19 regions and two autonomous provinces (Trento and Bolzano) was performed by the Italian Society of Medical Statistics and Clinical Epidemiology's Working Group on Observational Studies. The survey was funded by the Italian Ministry of Health (RF-2010-2315604) in the framework of the research project ARCHES, "Electronic health databases as a source of reliable information for effective health policy" [20]. HUDs were included in the survey if they covered the whole population of a single region and recorded local-level data in which the observation unit is the healthcare service.
The regional bodies managing HUDs were identified through the website of the institution and invited to participate. Those answering the invitation were sent a standard questionnaire (Table S1, Supplementary Materials) together with a user guide; the questionnaire was either self-administered or administered by one of the authors. In case of no answer, the information on HUDs available on the website of the regional body was checked. HUD information for two regions, Emilia-Romagna and Toscana, was extracted directly from the website of the regional body managing their HUDs. The remaining regional bodies were repeatedly contacted until all agreed to participate in the survey. A summary of the survey procedure is shown in Figure 1. The following HUD characteristics were recorded: (i) name of the database, name of the database manager(s), start year and period covered, reference legislation; (ii) observation unit, type of information recorded, population covered, and size thereof; (iii) DMS, disease classification, type of personal identification code, and anonymization procedure used to protect patient identity (if any); (iv) missing values in specific fields and procedures and periodicity of data quality control; (v) data sources; (vi) frequency of data transmission from the sources to the regional/provincial administration.
The information obtained for each HUD was uploaded on the ARCHES project webpage [20], where it can be accessed from the region/province list or HUD list, by start year, by a combination of these criteria, or through a geographical map reporting the number of HUDs identified in each region/province.
A descriptive statistical analysis was performed to evaluate the main characteristics of the HUDs. Firstly, HUDs were grouped into three categories, 1-healthcare services, 2-conditions, diseases, other events, 3-other, based on the observational unit (e.g., healthcare service, beneficiary with a disease, any other events related to the health status or healthcare). In particular, category 1 included hospital discharge, outpatient care, residential care and hospice, home healthcare, mental healthcare, spa treatments, substance addiction treatment, blood transfusion services, cross-border healthcare, and emergency care; category 2 included disease registries, cancer registries, mortality registries, rare disease registries, infectious disease registries, accident registries, occupational health and safety registries, medical birth databases, spontaneous abortions databases, legal voluntary termination of pregnancy (VTP) databases, and screening registries; category 3 included NHS beneficiaries databases, co-pay exemptions databases, general practitioner registers, clinical laboratory services databases, drug dispensing databases (by healthcare facilities and hospitals and through contracted pharmacies), pathological anatomy databases, prosthesis registries, and vaccination registries.
Secondly, the absolute and percentage distributions of HUDs within each category were calculated according to region, start year, data management system, personal identification code, anonymization, coding system, data quality control, and data transmission.
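The absolute and percentage distributions within each category can be illustrated with a minimal sketch. The records below are hypothetical placeholders rather than survey data, and the helper `distribution` is an illustrative name, not part of the ARCHES tool; the software actually used for the analysis is not stated in the text.

```python
from collections import Counter, defaultdict

# Hypothetical inventory records: (region, category, data management system).
# These rows are illustrative only and are NOT taken from the survey data.
huds = [
    ("Region A", "healthcare services", "Oracle"),
    ("Region A", "healthcare services", "Oracle"),
    ("Region A", "other", "SQL"),
    ("Region B", "healthcare services", "SQL"),
    ("Region B", "conditions, diseases, other events", "not reported"),
]


def distribution(records, field_index, category_index=1):
    """Return {category: {value: (count, percent)}} for one characteristic."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[category_index]][record[field_index]] += 1
    result = {}
    for category, counter in counts.items():
        total = sum(counter.values())  # number of HUDs in this category
        result[category] = {
            value: (n, round(100 * n / total, 1)) for value, n in counter.items()
        }
    return result


# Distribution of the data management system within each HUD category.
dms_by_category = distribution(huds, field_index=2)
for category, values in dms_by_category.items():
    for value, (n, pct) in values.items():
        print(f"{category}: {value} = {n} ({pct}%)")
```

Passing `field_index=0` instead would give the distribution by region within each category; the same pattern extends to any other recorded characteristic.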

Results
The survey identified 352 HUDs meeting the study criteria. The geographical distribution of the HUDs in relation to the three categories is reported in Table 1. HUDs ranged from 39 in the province of Trento (northern Italy) to six in Sardegna (southern Italy/Islands). Healthcare services databases were the most frequent and were found in nearly all regions (Table 2). As expected, mandatory HUDs were present in nearly all regions, although they were set up in different years (Table S2, Supplementary Materials). Some of the other HUDs were found in three or fewer regions; in particular, a vaccination database was identified in three regions in northern Italy, a pathological anatomy database was identified in two regions in northern and central Italy, and a clinical laboratory services registry, an occupational health and safety registry, and a blood transfusion services and accident registry were identified in two regions in northern Italy.
Data management system (DMS). Different DMSs were in use in the different regions and HUD categories (Table 3). Even though information on the software used was not available in almost 25% of HUDs, Oracle was the most common DMS in the healthcare services and conditions, diseases, other events categories, whereas Structured Query Language (SQL) was the most common in the other category. Other software (e.g., Java, Ippocrate, Netezza) was used in 14% of HUDs, and more than one DMS was employed in 11%. However, the same DMS was generally used for managing multiple HUDs within a region (Table S3, Supplementary Materials).
Personal identification code. Two personal identification codes were used in the HUDs: the unique identification code (ID) generated by regional authorities and the fiscal code (FC) (see Table 4 for the definitions). More than one type of identification code was used within each HUD category, as shown in Table 4; however, information was lacking for about one-fifth of HUDs. The unique ID was used by 46% of HUDs and the FC was used by around 26%; in 6% of HUDs, the personal identification code was not reported. To protect personally identifiable data, some regions did not use any ID in those HUDs that record highly sensitive information, such as data on spontaneous abortions, legal voluntary termination of pregnancy, and substance addiction treatment (3%; Table S4, Supplementary Materials).
Anonymization. The techniques used to anonymize personal data included separation (the procedure whereby any element that can lead to direct identification from personal data is removed and stored separately, with only a reference number left to allow re-identification by authorized parties), pseudonymization (an encrypted pseudonym derived from the personal data), an unspecified internal procedure, or encryption. Different anonymization techniques were identified within the HUD categories across regions. Separation was the most common (Table 5), followed by an unspecified internal procedure. Pseudonymization and encryption were employed less frequently. The anonymization method was not reported in about 27% of HUDs. However, a single technique was used across each HUD category in each region (Table S5, Supplementary Materials).
Coding system. The coding systems used in the HUDs are reported in Table 6. Disease classification was most commonly performed according to the International Classification of Diseases, 9th revision, Clinical Modification (ICD9 CM) [21], which is the method used at the national level.
Some HUDs and regions used the ICD 10th revision (ICD10 CM) [22]; in particular, as requested by the World Health Organization (WHO), it was the system employed in the mortality registries (Trento, Lombardia, Marche, Puglia and Calabria), the mental healthcare registries (Trento, Veneto, Lombardia, Friuli-Venezia-Giulia, Emilia-Romagna), and the cancer (Lombardia, Umbria), substance addiction treatment (Emilia-Romagna), and disease (Piemonte) registries. In all drug dispensing databases, drugs were identified either by the Anatomical Therapeutic Chemical (ATC) classification [21] or by the Italian authorisation (AIC) number [23] (see Table S6, Supplementary Materials, for the definitions).
The morphological, pathological, and clinical classification of malignant tumors followed international codes (ICD-O 3; Systematized Nomenclature of Medicine, SNOMED; and the TNM classification system of malignant tumors) in all regions.
Demographic data (e.g., province, region, and country of residence) were entered using the coding systems provided by the Italian National Institute of Statistics.
Data quality control. In about 84% of HUDs, the relevant regional/provincial administration stated that data completeness and quality were checked periodically (Table 7). In 40% of HUDs, data quality control was performed automatically at the time of recording by checking agreement between the data value and the pre-established data format and by controlling for any missing values in required fields. This method was more frequent in healthcare service HUDs. Data quality control was performed within a month or within 3-12 months from data acquisition in one-quarter and one-fifth of HUDs, respectively, whereas it was not performed in about 2% of HUDs, all belonging to the conditions, diseases, other events category; this information was not reported for 14% of HUDs. In 55% of HUDs, the presence and frequency of missing data was not mentioned; in 25%, a statement reported that there were no missing data; in 12%, missing data ranged from 1% to 50%, whereas, in 8%, missing data were mentioned but not quantified (data not shown).
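The automatic control at the time of recording described above (agreement with a pre-established format plus a check for missing required fields) can be sketched as follows. This is only an illustration under stated assumptions: the field names, the patterns in `SCHEMA`, and the function `check_record` are hypothetical and do not reproduce any regional specification.

```python
import re

# Hypothetical record layout: field name -> (required?, expected format).
# Field names and patterns are illustrative assumptions only.
SCHEMA = {
    "patient_id": (True, re.compile(r"[A-Z0-9]{16}")),
    "discharge_date": (True, re.compile(r"\d{4}-\d{2}-\d{2}")),
    "diagnosis_code": (False, re.compile(r"\d{3}(\.\d{1,2})?")),
}


def check_record(record):
    """Return a list of quality problems found in one record (empty if none)."""
    errors = []
    for field, (required, pattern) in SCHEMA.items():
        value = record.get(field)
        if value is None or value == "":
            # Missing values are flagged only for required fields.
            if required:
                errors.append(f"{field}: missing required value")
            continue
        # The whole value must agree with the pre-established format.
        if not pattern.fullmatch(value):
            errors.append(f"{field}: value {value!r} does not match expected format")
    return errors


good = {"patient_id": "ABCDEF0123456789", "discharge_date": "2015-03-01",
        "diagnosis_code": "250.00"}
bad = {"patient_id": "ABCDEF0123456789", "discharge_date": "",
       "diagnosis_code": "25X"}

print(check_record(good))
print(check_record(bad))
```

A check of this kind only verifies form and presence; it cannot detect a well-formed but wrong value, which is why periodic controls after acquisition remain relevant.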
Data transmission. Data were transmitted from the healthcare providers to the administration at the time of recording in 11% of HUDs (Table 8), within 3-12 months in about 25%, and within one month of being recorded in most cases. The frequency of data transmission was not defined in 3% of HUDs and was not available in about 11%.

Discussion
In this survey, we aimed to create an inventory of the Italian regional HUDs, to describe them, and to develop a tool to navigate through them. The survey identified 352 electronic healthcare databases meeting the study criteria and described them in terms of start year, data type and completeness, data management system (DMS), quality control strategy, and data protection procedures in place. The inventory of the regional HUDs found in Italy's 19 regions and two autonomous provinces is now available on a dedicated page of the ARCHES project website [20].
We found that the number of HUDs and their start years differed widely among regions/provinces, which reflects marked differences in data availability.
The considerable homogeneity found within each region/province in important HUD features, such as the unique personal identification code, the anonymization technique, and the DMS adopted, enables record linkage across HUDs.
Among regions, we found that the classification systems for diseases and drugs adopted were fairly homogeneous; the fact that some regions employed a more recent ICD revision highlights their greater promptness in implementing WHO recommendations. Since most administrative HUDs are regulated by national law, the same revision should be adopted everywhere.
Our survey highlighted that the regions employ different anonymization procedures, which can penalize clinical and epidemiological studies; for instance, they may hamper follow-up studies of patients with long-term, severe, or rare diseases, who are often treated at out-of-region specialist centers. In Italy, the healthcare services delivered to each resident, including those supplied by out-of-region centers, are paid for by the region where the patient resides. Although information about the services supplied by the other region is sent to the region of residence, it may not all be transmitted at the same time, resulting in temporary inconsistencies between databases that may cause the same query to yield different responses depending on when it is submitted. To overcome these limitations, thus also enhancing the ability of HUDs to be used for nationwide healthcare monitoring and assessment, the National Unique Personal Code was established in 2016 [24] and adopted in 2017 by all regional HUDs. In doing so, Italy followed the example of other European countries, such as the Scandinavian countries, where a unique personal identification number assigned to permanent residents enables linkage of their records across multiple registries and databases [25][26][27].
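Why heterogeneous anonymization hampers linkage, and why a shared personal code helps, can be shown with a small sketch. It assumes pseudonyms are derived with a keyed hash (HMAC-SHA256); the survey does not report which algorithms the regions actually use, so this choice, the keys, the personal code, and the helper names `pseudonymize` and `link` are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize(personal_code: str, secret_key: bytes) -> str:
    """Derive a pseudonym from a personal code with a keyed hash (assumed method)."""
    return hmac.new(secret_key, personal_code.encode("utf-8"), hashlib.sha256).hexdigest()


def link(records_a, records_b):
    """Join two lists of (pseudonym, payload) pairs on identical pseudonyms."""
    index = {pseudonym: payload for pseudonym, payload in records_b}
    return [(p, payload, index[p]) for p, payload in records_a if p in index]


# Hypothetical keys and personal code, for illustration only.
shared_key = b"hypothetical-shared-key"
other_key = b"hypothetical-other-key"
code = "PERSON-0001"

hospital = [(pseudonymize(code, shared_key), "hospital discharge")]
drugs_same_procedure = [(pseudonymize(code, shared_key), "drug dispensing")]
drugs_other_procedure = [(pseudonymize(code, other_key), "drug dispensing")]

# Same procedure: the two records of the same person can be linked.
linked_same = link(hospital, drugs_same_procedure)
# Different procedure: the pseudonyms differ, so no link is found.
linked_other = link(hospital, drugs_other_procedure)

print(len(linked_same), len(linked_other))
```

In this sketch the same person yields matching pseudonyms only when both databases apply the same procedure and key, which mirrors the within-region homogeneity and across-region heterogeneity reported by the survey.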
The above results show the importance of exploring the characteristics of different HUDs in a specific population and of the same types of HUDs in different populations, and of providing a panel of metadata. The availability of these metadata facilitates the planning of epidemiological studies and of regional and national studies evaluating healthcare. The difficulty of sourcing information on HUD characteristics could limit studies based on large populations. In our survey, we found that data quality control procedures were in place in most HUDs but were characterized by widely different timing and methods both within and among regions. Data quality and completeness play a major role in supporting the validity of studies based on secondary sources and the generalizability of their results [28,29], especially because such studies are particularly prone to misclassification and the influence of confounding factors [11]. Addressing these problems requires application of appropriate study protocols and quality standards [30][31][32], as well as state-of-the-art methods of data analysis [7,8][33][34][35][36][37]. Awareness of the data quality control approach applied in a database can help users choose their data sources, design their study, and plan data analysis.
Some limitations should be considered in this survey. A selection bias may have occurred. We contacted the regional body managing HUDs, but different HUDs are often managed by different regional officials, who may not have all been contacted and involved in the survey by the manager of the regional HUDs. As a result, some HUDs may have been missed, explaining the small number of HUDs identified in some regions.
In some regions, the technical aspects of HUDs are managed by more than one person or by contractors, which may explain some missing responses on certain HUD features, like the anonymization procedure, software, and coding system used. To handle such information bias, after the survey results were uploaded on the ARCHES website [20], the managers of the regional HUDs were invited to check the data for inaccuracies or missing information.
The inclusion of the 38 HUDs whose information was taken from the regional website may have biased our results. For these HUDs, the percentages of missing data regarding data management system, personal identification code, anonymization, and data quality control were significantly higher than for the 314 HUDs surveyed through the administered or self-administered questionnaires (data not shown). No significant difference was observed between the latter two. However, these HUDs were only 38 of the 352 identified, suggesting that the resulting error is negligible.
Our survey has some noteworthy strengths. To the best of our knowledge, it is the first study to give insights into the active HUDs and their types in each of the Italian regions and to describe their main characteristics. It provides a tool for navigating through Italy's regional HUDs, which allows researchers and those involved in healthcare evaluation to easily retrieve the information they require for epidemiological, clinical, and translational research and for healthcare system performance assessment. Our survey indicates the actions needed to optimize the use of HUD data. It employs an accurate methodology to identify HUDs, to describe them, to identify their critical aspects, and to provide metadata to researchers, giving them the opportunity to know the information assets of a country in which the responsibility for managing health information is entrusted to the regions.
In practice, the results of the survey may have important implications, helping to fill the knowledge gap on the active HUDs in the Italian regions and their characteristics; therefore, they may enhance the use of real-world data in epidemiological and clinical research. In addition, the health authorities concerned, both local and national, could use the findings to improve homogeneity within and across regions, to provide updates, and to ensure that users can easily retrieve the information they require, fostering the use of real-world data. In an international context, the survey and the tool for navigating HUDs could be a model for other countries to provide information on the HUDs available in their own country, the population they cover, and their features. The availability of metadata on the organization and the characteristics of HUDs could be important information for planning supranational projects using secondary sources of data. The possibility of retrieving these metadata in a standardized manner in different countries allows knowing and comparing information assets between countries. In addition, these metadata could help to overcome barriers related to the protection of individuals with regard to the processing and circulation of personal data when HUDs are linked and used for epidemiological purposes, as already discussed by the Working Group on Observational Studies of the Italian Society of Medical Statistics and Clinical Epidemiology [38,39]. For these reasons, the proposed model could also be useful in the European context, which recently oriented health programs toward actions for a sustainable solid infrastructure on European health information through improving the availability of comparable, robust, and policy-relevant population health data and health system performance information [40]. Future updates of this dynamic tool are needed to provide researchers and institutions with current information on the available HUDs.

Conclusions
In this survey, an inventory of the real-world regional databases of healthcare and related services managed by the regional/provincial administrations in Italy, as well as an examination of their characteristics, was provided. The ARCHES website represents a dynamic tool that allows researchers and institutions to get insights into the available Italian regional HUDs.
The results of our survey pointed out some critical issues that hamper the use of these secondary data sources in epidemiological studies, comparative effectiveness research, and assessments of healthcare system performance and health technology. They therefore highlight the need to improve homogeneity across regions, which would improve the comparability of health data from secondary sources.
The survey and the tool for navigating HUDs proposed in this study can be considered as a useful model for other countries to provide information to researchers and institutions about the available HUDs in their countries. This will promote and enhance supranational large-scale epidemiological studies based on the use of health secondary data sources and will stimulate the sharing of standardized procedures to retrieve and compare the information asset between countries. Our study is consistent with the European strategy and activities on health information aiming to ensure better availability and use of health data for policy and research.
Supplementary Materials: The following are available online at http://www.mdpi.com/1660-4601/17/1/8/s1: Table S1: Questionnaire used to survey the HUDs covering the population of a single region/province and recording local-level data in Italy (self-administered or administered by one of the authors); Table S2: HUD start year according to region and category; Table S3: Type of data management system used in HUDs according to region; Table S4: Type of personal identification data used in HUDs according to the region; Table S5: The type of anonymization of the personal identification data used in HUDs according to the region; Table S6: Coding systems.