Luxembourg Fund Data Repository

In this paper, we introduce the Luxembourg Fund Data Repository, a novel database of investment funds available for academic research that was created at the Department of Finance of the University of Luxembourg. The database contains the population of Undertakings for Collective Investment in Transferable Securities funds domiciled in Luxembourg from the starting month of their existence (March 1988) to October 2016. The fund characteristics are organized in a comprehensive database architecture encompassing static and dynamic data over the entire life of the funds. The characteristics include fund identifiers, official name, status information, management company and other service providers, daily and monthly performance time-series, portfolio holdings, classification of investment objective, fees, dividends, and cash flows. The database was constructed after collecting and assembling complementary historical information from three data providers. Importantly, funds no longer in existence due to liquidation or mergers are included in the database, preventing survivorship bias. The database has been constructed to serve as a research dataset of high accuracy through the maximization of population coverage, the maximization of historical coverage, and validation against information acquired from the supervisory authority of the financial sector of Luxembourg. The license is currently available to researchers of the Department of Finance of the University of Luxembourg, and there are plans to extend accessibility to the global academic community.


Summary
Investors across the globe have demonstrated strong demand for regulated open-end funds in the past decade, and total net assets of worldwide regulated open-end funds amounted to $46.7 trillion at the end of 2018 [1]. The total net assets under the management of European investment funds reached €15.2 trillion in 2018 according to the European Fund and Asset Management Association [2]. Though Luxembourg is the leading investment fund domicile in Europe and the second largest worldwide behind the United States (US), with €4404 billion assets under management as of March 2019 [3], academic researchers in the investment fund literature have so far been working primarily with US mutual fund data distributed by the Center for Research in Security Prices (CRSP) [4,5]. The investment fund industry of Luxembourg is a worldwide leader in cross-border fund distribution. Luxembourg-domiciled investment funds are distributed in more than 75 countries around the globe, with a particular focus on Europe, Asia, Latin America, and the Middle East [6]. The largest segment of investment funds in Luxembourg consists of the Undertakings for Collective Investment in Transferable Securities (UCITS). UCITS are investment vehicles that invest in liquid assets and can be publicly marketed and sold to retail investors throughout the European Union (EU). Today, UCITS funds are the most widely accepted retail investment funds worldwide and constitute a well-regulated investment product with significant levels of investor protection [3]. According to the
Footnote 1. For the rest of the paper, the terms "the LFDR" and "the database" are used interchangeably.
Footnote 2. Fundsquare is a subsidiary of the Luxembourg Stock Exchange.
All these factors can together be seen as a proxy for the study of investment funds as a product of investment activity.
These factors are similar to those offered in the database of its counterpart in the United States, the CRSP Mutual Fund database, but they represent an important innovation with respect to UCITS databases. Additionally, the LFDR contains a more globally representative set of funds as a proxy for international investment flows than the CRSP Mutual Fund database. The latter contains a strong US-centric bias, as the vast majority of both underlying portfolio investments and subscribing investors are located in the US. In contrast, UCITS represent a set of investment funds that are globally diversified in terms of both their portfolio investments and their underlying unit-holder base (recognizing, however, that there are usually no US-based investors in UCITS). This greater representativeness of global investment activity constitutes an important new addition to the study of investment funds and international investment, and it is a key aspect of the innovation that the LFDR brings.

Data Description
The database contains data for Luxembourg-domiciled UCITS covering the period from March 1988 to October 2016. The database contains both active and obsolete (liquidated or merged) UCITS, including appropriate referencing for merged funds. The database architecture reflects the structure of UCITS funds. Details about the structure and the main characteristics of UCITS with respect to fund, sub-fund(s), and share class(es) can be found in Appendix A. Section 2.1 describes the number of entities in the database in terms of funds, sub-funds, and share classes. Section 2.2 presents the data tables and the data fields that were created to organize the data of funds, sub-funds, and share classes. Section 2.3 focuses on the historical coverage of the major time-series fields by presenting figures of the distribution of historical observations per year. Finally, Section 2.4 provides details on how the core data tables are linked and on how the parent-child relationships of UCITS parts are represented in the database.
Table 1 presents the database population in terms of total number of funds, sub-funds, and share classes. The database contains 4591 unique funds, 18,982 unique sub-funds, and 84,556 unique share classes. The status (active or obsolete) of these entities at the end of October 2016 is also presented in the table. In addition, there are 14,464 unique portfolios associated with the sub-funds. Figure 1 illustrates the breakdown of the total fund population by year of each fund's constitution date since 1988. It is worth mentioning that in the first year of UCITS existence (1988), 65 new UCITS were created and 180 fund entities, which had been constituted before, were transformed into UCITS in that year.

Figure 1. Number of funds by year of constitution date, 1988 to 2016 (x-axis: Year; y-axis: Number of funds).

Data Tables and Data Fields
The content of the database is organized in data tables. Each data table groups together data fields that are conceptually related. In total, there are 24 data tables that include data fields related to four information groups: four data tables related to the fund level, seven data tables related to the sub-fund level, 10 data tables related to the share class level, and three data tables related to portfolio data. Before presenting the underlying data fields of the data tables, the identifiers of the core entities are reported here: each fund, sub-fund, share class, and portfolio entity is associated with a unique proprietary identifier (FundID, SubfundID, ShareclassID, and PortfolioID, respectively). The data tables related to the fund level include fields such as fund identifiers (including FundID), official fund name, status (active or obsolete), constitution date, legal form, management company, auditor company, custodian, and contact information of the fund management company. The data tables related to the sub-fund level include fields such as sub-fund identifiers (including SubfundID), official sub-fund name, status (active or obsolete), launch date, daily time-series of total net assets (TNA), classification of investment objective based on prospectus, classification of investment objective based on portfolio assets, and monthly time-series of cash flows (namely the amount of net subscriptions and net redemptions). The data tables related to the share class level include fields such as share class identifiers (including ShareclassID), official share class name, status (active or obsolete), launch date, subscription fee, redemption fee, daily time-series of TNA, daily time-series of net asset value (NAV), monthly time-series of return, time-series of dividends, and classification of investment objective in case of currency-hedged share classes.
Finally, the data tables related to portfolio information include fields such as portfolio identifiers (including PortfolioID), list of portfolio report dates, complete list of portfolio holdings and asset allocation in terms of predefined equity sub-categories, predefined bond sub-categories, and cash.
Associating specific values of fields with effective date(s) is required in the database. Some fields change their values on a regular basis, such as NAV, which is updated daily, while other fields change on a non-regular basis, such as the fund name. Furthermore, some fields may change their value at most once during the fund life, such as the end date of a fund. In order to organize the time information needed for data fields in a comprehensive and storage-efficient way, the data tables are classified in the following three categories:

1. Data tables of life-cycle fields: These data tables contain fields that are life-cycle constant or may change a maximum of once during the entire life of the fund. Thus, one date maximum is associated with the values of these fields.
2. Data tables of period-based fields: These data tables contain fields that change on a non-regular basis. Thus, each entry of these data tables contains the start date and the end date of the effective period.
3. Data tables of time-series fields: These data tables contain fields that change on a periodic basis. The fields constitute time-series with a constant frequency. Thus, each entry of these data tables contains one effective date.

Table 2 provides an overview of the database content, including the above-mentioned classification of data tables. The first column (from left to right) of Table 2 presents the four main information groups related to UCITS: fund, sub-fund, portfolio, and share class. The second column presents the names of the 24 data tables. The third column presents the complete list of data fields per data table, while the fourth column presents the category of each data table as described in the paragraph above. The last column reports the number of entries of each data table; furthermore, in case of period-based and time-series fields, the history range of data availability is denoted in parentheses. For example, let us consider the data table with the name "Subfund.DailySeries." Each entry of this table contains the following fields: SubfundID (unique identifier of sub-fund entity), TNA (amount of total net assets for the sub-fund), date (the effective date of TNA), and currency (the currency in which TNA is expressed). The complete definitions of the fields are documented in the user manual that accompanies the database. This table contains daily time-series data and has 7,981,759 entries in total, covering the historical period from 1988-10-18 to 2016-10-21.
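As an illustration of the three table categories, the following minimal Python sketch models one entry type per category and shows how a period-based field can be queried for the value in effect on a given date. It is not the database's actual schema: the class names, the generic `entity_id`/`field`/`value` layout, the convention that a missing end date means "still in effect," and the sample fund names are all assumptions made for this example. Only the field list of the time-series entry (SubfundID, TNA, date, currency) follows the description of "Subfund.DailySeries" above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class LifeCycleEntry:
    """Life-cycle field: constant, or changing at most once (at most one date)."""
    entity_id: str
    field: str
    value: str
    effective_date: Optional[date] = None


@dataclass(frozen=True)
class PeriodEntry:
    """Period-based field: a value with the start and end of its effective period."""
    entity_id: str
    field: str
    value: str
    start_date: date
    end_date: Optional[date]  # assumption: None means "still in effect"


@dataclass(frozen=True)
class SubfundDailySeriesEntry:
    """Time-series entry with the fields listed for Subfund.DailySeries."""
    subfund_id: str
    tna: float
    observation_date: date
    currency: str


def value_on(entries, entity_id, field, day):
    """Return the period-based value in effect on `day`, or None if there is none."""
    for e in entries:
        if e.entity_id != entity_id or e.field != field:
            continue
        if e.start_date <= day and (e.end_date is None or day <= e.end_date):
            return e.value
    return None


# Hypothetical name history of one fund (made-up identifiers and names).
names = [
    PeriodEntry("F1", "name", "Alpha Fund", date(1995, 1, 1), date(2004, 6, 30)),
    PeriodEntry("F1", "name", "Alpha Global Fund", date(2004, 7, 1), None),
]
print(value_on(names, "F1", "name", date(2000, 3, 15)))  # Alpha Fund
```

Storing only a start and end date per value, rather than one row per day, is what makes the period-based layout compact for fields that change rarely.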

Historical Coverage
This section focuses on the historical coverage of the time-series fields. The distribution of the total number of entries (historical observations) across years is presented for the major time-series fields. More specifically, Table 3

Parent-Child Relationships
The structure of UCITS includes parent-child relationships among the information groups of fund, sub-fund, and share class because a fund may contain one or more sub-funds and a sub-fund may contain one or more share classes. The first type of parent-child relationship is represented in the database by linking the identifier of each parent fund with the identifiers of the underlying sub-funds. The second type of parent-child relationship is denoted by linking the identifier of each parent sub-fund with the identifiers of the underlying share classes. Figure 2 shows how the three core data tables (fund, sub-fund, and share class) are linked to each other, that is, how the parent-child relationships of UCITS parts are represented in the database. In order to simplify the representation of the data tables, only the data fields related to identification information are presented in Figure 2.
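The linkage described above can be sketched as a small relational schema. The example below uses Python's built-in `sqlite3` module with an in-memory database; the table names, the `Name` columns, and the sample rows are assumptions for illustration, and only the identifier names FundID, SubfundID, and ShareclassID come from the text. It shows one plausible way to traverse from a fund to all of its share classes through the two parent-child links.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Fund (
    FundID TEXT PRIMARY KEY,
    Name   TEXT
);
CREATE TABLE Subfund (
    SubfundID TEXT PRIMARY KEY,
    FundID    TEXT NOT NULL REFERENCES Fund(FundID),        -- parent fund
    Name      TEXT
);
CREATE TABLE Shareclass (
    ShareclassID TEXT PRIMARY KEY,
    SubfundID    TEXT NOT NULL REFERENCES Subfund(SubfundID), -- parent sub-fund
    Name         TEXT
);
-- Made-up sample rows: one umbrella fund, two sub-funds, three share classes.
INSERT INTO Fund VALUES ('F1', 'Example Umbrella Fund');
INSERT INTO Subfund VALUES ('S1', 'F1', 'Equity Europe'), ('S2', 'F1', 'Bond Global');
INSERT INTO Shareclass VALUES
    ('C1', 'S1', 'A EUR'), ('C2', 'S1', 'I USD hedged'), ('C3', 'S2', 'A EUR');
""")


def share_classes_of_fund(connection, fund_id):
    """Follow fund -> sub-fund -> share class and return the share class IDs."""
    rows = connection.execute(
        """
        SELECT sc.ShareclassID
        FROM Shareclass AS sc
        JOIN Subfund AS sf ON sf.SubfundID = sc.SubfundID
        WHERE sf.FundID = ?
        ORDER BY sc.ShareclassID
        """,
        (fund_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(share_classes_of_fund(conn, "F1"))  # ['C1', 'C2', 'C3']
```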

Methods
This section describes the main steps of the database creation process with regard to the selection of data providers and the fusion of data collected from different sources. The contribution and the complementarity of the data providers are also presented. Finally, the process of entity matching and merging non-overlapping historical periods is described.

Selection of Data Providers
A set of data providers was specified in order to satisfy the primary objectives for the database content: the maximization of data field coverage, the maximization of fund population, and the maximization of historical coverage. Initially, a set of fields was specified after a thorough review of the existing investment fund literature and the current state-of-the-art research databases of investment funds. The European legislation that regulates this type of fund (UCITS European directives) and guidelines of the European Securities and Markets Authority (ESMA) were analyzed in order to determine UCITS-specific characteristics. Moreover, technical publications of the Association of the Luxembourg Fund Industry (ALFI) were studied as complementary information sources in order to clarify the structural particularities of UCITS that are domiciled in Luxembourg. The outcome of this step was a set of specific fields of UCITS on which the structure of the database was based, along with accurate definitions of the fields. Following that, an initial nine commercial providers of financial datasets that are well established in the investment fund research literature were investigated as potential data providers. The set of specified fields was the starting point for discussions with the potential data providers. Comparative statistics of the population of UCITS entities (namely the total number of funds, sub-funds, and share classes) were compiled across providers. Such comparative statistics are not disclosed in this paper due to restrictions of legal agreements. The criteria for the final selection of data providers were aligned with the three primary objectives of the database content: (i) the availability of the specified data fields, (ii) population availability among data providers, and (iii) the availability of historical observations for the specified data fields.
Based on the maximization of these criteria, two commercial providers were selected, namely Morningstar and Fundsquare (the latter is a subsidiary of the Luxembourg Stock Exchange). In addition to these two commercial data providers, the Commission de Surveillance du Secteur Financier (CSSF), the supervisory authority of the financial sector of Luxembourg, kindly contributed to the construction of the new database by providing publicly available information from its historical archives. We used this contribution to validate various aspects of the data against the data obtained from the data providers, in particular to validate the completeness of investment fund coverage in the datasets provided by the data providers. Finally, the available data of UCITS funds covering the period from 1988 to October 2016 were collected from Morningstar, Fundsquare, and the CSSF.
Data 2020, 5, 62

Complementarity of Provided Datasets
The datasets provided by Morningstar, Fundsquare, and the CSSF were analyzed in terms of population, data fields, and historical coverage. Table 4 presents an overview of the datasets provided and sheds light on the complementarity of the three data providers given the differences of population coverage and historical coverage between them. The first column of Table 4 (from left to right) organizes the data fields of the datasets provided in information groups (fund, sub-fund, portfolio, and share class), and the data fields are further classified into the categories of life-cycle fields and time-series data fields, as defined in Section 2.2. The following three columns present the population (total number of entities) and the availability of historical data for each provider. For some fields, only the latest values were available in a provider's dataset because the provider overwrites earlier values. A characteristic example is the life-cycle fields of the fund level (such as the fund name or fund management company) in the datasets of Morningstar and Fundsquare, as both providers maintain only the latest value of these data fields. In contrast, the CSSF maintains and provided the complete history of such data fields from 1988 to 2016. As can be seen in Table 4, some of the fields were not available from providers. More specifically, Fundsquare could not provide any portfolio data because they do not maintain such data, and the CSSF could not deliver any portfolio data and time-series fields due to disclosure constraints. The final population of the LFDR after the process of data merging (described in the sub-section below) is presented in the last column of Table 4. Considering the information from the three data sources, the final database contains complementary data and achieves maximization with regard to population coverage and historical coverage.
A matching procedure had to be devised for each of the three levels (fund, sub-fund, and share class) for each UCITS provided by the data providers. It is worth mentioning that there was no need for the matching of portfolios because portfolio entities were acquired from only one provider (Morningstar). Thus, three types of entity matching were performed: fund, sub-fund, and share class matching. The decision criteria to accept or further investigate matches are described below. The matching of fund entities was based on string matching of the official fund names. Initially, an exact string matching algorithm was applied for the automatic matching of the fund names, and then a manual investigation took place for the names that were not matched automatically. The matching of sub-fund entities was based on two criteria: the matching of the name of the parent fund and the matching of the ISIN identifiers of the sub-fund's underlying share classes. The matching of share class entities was based on the ISINs of share classes. At all three levels (funds, sub-funds, and share classes), some entities remained unmatched. Failed matches were attributed to names being spelled differently across data providers and/or to differing ISINs. In such cases of ambiguity, the corresponding entities were not included in the final database. This decision was in accordance with the priority of ensuring the accuracy of the database.
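The automatic part of the matching procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the function names, the dictionary-based inputs, the rule that a sub-fund match requires at least one shared share class ISIN, the rule that only unambiguous (one-to-one) candidates are accepted, and all sample keys, names, and ISIN-like strings are hypothetical.

```python
def match_funds_by_name(names_a, names_b):
    """Exact string matching of official fund names between two providers.

    names_a, names_b: dicts mapping a provider's fund key to the official name.
    Returns (matches, unmatched), where matches maps key_a -> key_b and
    unmatched lists the keys of provider A left for manual investigation.
    """
    index_b = {}
    for key_b, name in names_b.items():
        index_b.setdefault(name, []).append(key_b)
    matches, unmatched = {}, []
    for key_a, name in names_a.items():
        candidates = index_b.get(name, [])
        if len(candidates) == 1:
            matches[key_a] = candidates[0]
        else:
            unmatched.append(key_a)
    return matches, unmatched


def match_subfunds(subfunds_a, subfunds_b, fund_matches):
    """Match sub-funds whose parent funds are matched and which share an ISIN.

    subfunds_x: dict mapping sub-fund key -> (parent fund key, set of ISINs of
    the underlying share classes). Ambiguous candidates are dropped.
    """
    matches = {}
    for key_a, (fund_a, isins_a) in subfunds_a.items():
        candidates = [
            key_b
            for key_b, (fund_b, isins_b) in subfunds_b.items()
            if fund_matches.get(fund_a) == fund_b and isins_a & isins_b
        ]
        if len(candidates) == 1:
            matches[key_a] = candidates[0]
    return matches


def match_share_classes_by_isin(isins_a, isins_b):
    """Match share classes on ISIN when the ISIN occurs exactly once at provider B."""
    by_isin_b = {}
    for key_b, isin in isins_b.items():
        by_isin_b.setdefault(isin, []).append(key_b)
    return {
        key_a: by_isin_b[isin][0]
        for key_a, isin in isins_a.items()
        if len(by_isin_b.get(isin, [])) == 1
    }


# Made-up sample data for two providers.
fund_matches, needs_review = match_funds_by_name(
    {"M1": "Example Umbrella Fund", "M2": "Other Fund SICAV"},
    {"Q1": "Example Umbrella Fund", "Q2": "Other Fund S.I.C.A.V."},
)
subfund_matches = match_subfunds(
    {"MS1": ("M1", {"LU0000000001", "LU0000000002"})},
    {"QS1": ("Q1", {"LU0000000001"}), "QS2": ("Q1", {"LU0000000003"})},
    fund_matches,
)
class_matches = match_share_classes_by_isin(
    {"MC1": "LU0000000001", "MC2": "LU0000000009"},
    {"QC1": "LU0000000001", "QC3": "LU0000000003"},
)
print(fund_matches, needs_review, subfund_matches, class_matches)
```

In this sample, the second fund stays unmatched because the two providers spell its name differently, which mirrors the kind of case that the text routes to manual investigation.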
In order to maximize the historical coverage for the time-series fields, the merging of historical observations was performed for the matched entities. As fund entities are not associated with time-series fields, the rest of this paragraph refers to time-series of sub-funds and time-series of share classes. For matched entities (either sub-funds or share classes), two cases were distinguished depending on the availability of time-series from one provider only or the availability of time-series from more than one provider. In the first case, the available time-series were introduced into the database for matched sub-funds and for matched share classes. In the second case, the non-overlapping historical observations of the two time-series were merged in order to minimize the historical periods with missing data. A successful consistency check was required before merging non-overlapping historical observations for matched entities. The consistency check focused on a set of five common dates between the two time-series to be merged, and the check was successful when the values of the time-series coincided. Such consistency checks can be considered as an additional validation check for matched entities (matched sub-funds and matched share classes). After the merger of historical time-series, the resulting database contained time-series of higher historical coverage for sub-funds and share classes compared to the corresponding datasets received from the original providers.
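The merging step for the second case can be sketched as below. The text does not state how the five common dates were chosen or how closely values had to agree, so this example assumes the first five common dates and a small relative tolerance via `math.isclose`; the function names, the dictionary representation of a time-series (date to value), and the sample values are likewise assumptions.

```python
import math
from datetime import date, timedelta


def consistent(series_a, series_b, n_dates=5, rel_tol=1e-9):
    """Check that two series agree on n_dates common dates.

    Assumption: the first n_dates common dates are used, and values are
    compared with a relative tolerance rather than exact equality.
    """
    common = sorted(set(series_a) & set(series_b))
    if len(common) < n_dates:
        return False
    return all(
        math.isclose(series_a[d], series_b[d], rel_tol=rel_tol)
        for d in common[:n_dates]
    )


def merge_series(series_a, series_b, n_dates=5):
    """Merge two series of a matched entity after a successful consistency check."""
    if not consistent(series_a, series_b, n_dates):
        raise ValueError("consistency check failed; series are not merged")
    merged = dict(series_b)
    merged.update(series_a)  # on common dates, the value from series_a is kept
    return dict(sorted(merged.items()))


# Made-up daily values: provider A covers the first 10 days, provider B the last 10.
start = date(2010, 1, 4)
nav = {start + timedelta(days=i): 100.0 + i for i in range(15)}
a = {d: v for d, v in nav.items() if d <= date(2010, 1, 13)}
b = {d: v for d, v in nav.items() if d >= date(2010, 1, 9)}

merged = merge_series(a, b)
print(len(a), len(b), len(merged))  # 10 10 15
```

The merged series covers the union of both periods, which is the sense in which merging increases historical coverage relative to either provider alone.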

User Notes
This section discusses potential academic applications of the database, the expected impact of the database, and its future development plans.
The database brings innovation to academic research on investment funds by establishing a dataset that is unique in terms of both its structure and content. Historical data available in the Luxembourg fund industry heretofore were fragmented, incomplete, and unsuitable for academic research. The dataset provides a functioning and qualified platform for the expansion of academic research beyond the population of US mutual funds into an alternative population of investment funds with a much greater degree of global reach in terms of investors and investment portfolios. Such a tool was not available to researchers before the construction of the LFDR. The database is a unique and complete fund dataset from the largest UCITS fund domicile, Luxembourg. This population of funds can be said to be much more representative of global investment activity than US mutual funds. The single fund domicile neutralizes the factor of potential differences in regulation that would stem from a population comprising funds from different domiciles. Moreover, it allowed for a database design tailored to the operational and data specificities of funds in this domicile, and it allows for a mechanism to ensure that the dataset can be verified for completeness. Through the inclusion of funds that have been closed or merged into other funds, it ensures the absence of survivorship bias, which is an important feature for datasets of academic research [11].
Footnote 3. In the absence of an assigned sub-fund ISIN, market practice typically requires the ISIN of the main share class to identify a sub-fund portfolio for the purposes of processing its trade settlement instructions. For the purposes of the settlement of subscription and redemption payments from and to the fund shareholder, the relevant share class ISIN is used. Thus, only share classes have ISINs in use in market operations.
In all the above respects, the innovative characteristics of the database make it a tool for a population of globally representative investment funds that is equivalent and comparable to the CRSP Mutual Fund database for its population of US mutual funds [12]. However, the LFDR provides an important additional innovation: the inclusion in the dataset of investment flows into and out of each fund portfolio and share class. Monthly data on investment flows at the sub-fund level will provide substantial scope for research into the subject population that the CRSP does not currently provide for its population: investment flows in their own right, as well as their (relative) correlations with specific fund product characteristics, environmental variables (such as economic conditions, market news, and other one-time events), and other external phenomena.
Publicly sold investment funds such as UCITS and US mutual funds are highly regulated, providing a compact and readily observable set of data on many dimensions. Thus, the immediate impact of the database will be to allow for the pursuit of lines of research for the Luxembourg UCITS fund set that have been established for many decades for US mutual funds and that have developed into a rich body of literature. The extant literature on US mutual funds can be said to fall into three broad categories: underlying portfolio preferences of investors or investor segments in terms of asset allocation, portfolio strategies, and investment trends; the science of portfolio management in terms of explanatory and predictive models with respect to portfolio risk and return optimization; and measures of funds as investment products in their own right, such as comparative performance parameters across funds, individual funds' performance persistence over time, and the effects of their taxation and regulatory regimes. In addition, the investment fund flow data of the LFDR will allow for research on investment funds along entirely new lines that have not been possible for US mutual funds in the absence of flow data in the CRSP Mutual Fund database.
The future development of the database is currently being discussed by the project team at the University of Luxembourg. As the current dataset contains the population of UCITS until 2016, the future plans focus primarily on the integration of the data from 2016 onward, as well as the provision of regular updates in the future. In addition, making the database available to the global academic community under license restrictions is being investigated and would require appropriate agreements with the data providers. Other future work includes a search for still missing historical data and the resolution of the remaining identification issues resulting from unmatched fund and sub-fund entities. Regarding the still missing historical portfolio data for some funds, the option of soliciting additional data providers is being considered. Finally, error reporting and correction mechanisms are being developed to identify, validate, and process any errors reported by the users of the database.
these levels and the characteristic information per level are illustrated in Figure A1. A simplified example of UCITS is also presented in Figure A1 to illustrate the structure with respect to funds, sub-funds, and share classes. This UCITS example is a simplified part of an actual fund and is used here only for illustration purposes. The information of this example is publicly available [13].
UCITS constitute a legal entity formed either as a single fund or as an umbrella fund consisting of multiple sub-funds (also known as compartments) [14]. Though an umbrella fund consists of a number of sub-funds, it forms a single legal entity. Considering that the vast majority of UCITS are umbrella funds and that a single-compartment fund can be seen as a simplified version of an umbrella fund, the rest of this section assumes an umbrella fund model. UCITS may have appointed a management company or may be part of a self-managed UCITS investment company. In Luxembourg, the UCITS management companies are authorized according to Article 101 of Chapter 15 of the 2010 Luxembourg Law related to undertakings for collective investment [15]. Apart from the management company, other service providers related to the fund level of UCITS include services of transfer agency, fund accounting, and custodian/depositary services. UCITS that are domiciled in Luxembourg must be authorized by the CSSF before beginning their activity. Afterwards, they are supervised by the CSSF on an ongoing basis by means of regular reporting. The disclosure requirements of UCITS include fund prospectuses, extensive annual reports, and the publication of diverse monthly and semi-annual time-series.
According to the work of [14] and in accordance with Article 29(1) of the EU Regulation No 1095/2010 of the European Parliament, sub-funds are separate parts of a common fund vehicle and have their own investment objectives. Assets of one compartment, also known as asset portfolio or pool of assets, are distinct from assets of other compartments. Compartments are usually legally segregated from other compartments, meaning that a liability arising in one compartment cannot be offset by the assets in other compartments of the fund. In other words, each sub-fund corresponds to a distinct asset portfolio and distinct liabilities. In addition, each sub-fund differs in its investment strategy from the other compartments and may appoint its own investment manager and/or investment advisor apart from the investment manager of the entire fund [9].
While the European directives on UCITS cover funds and compartments, they do not clearly present the definition and scope of share classes, although they recognize their existence [15]. In fact, UCITS or one of their compartments can be sub-divided by share classes. Share classes are categories of shares that belong to the same UCITS and allow for subsets of investors in UCITS to achieve some level of customization. Such customization accommodates the specific needs of the investors or the terms and conditions for the subscription into the fund (e.g., a distinct fee structure, the distribution or capitalization of revenues, a particular tax treatment under national law, or a distinct minimum investment amount) [14]. The share classes in a sub-fund together have the same claim to a single pool of assets (namely the asset portfolio of the sub-fund) and there is no segregation of these assets between share classes (see footnote 4). The asset value of each share class is determined by an apportionment of the change in value of the pool of assets on the basis of a distribution coefficient. Though there is no legal segregation of assets between share classes, expenses defined as pertaining to a specific share class are attributed to that share class only and are integrated into the calculation of its share value. Any investment outcome relating to specific arrangements for a given share class, such as a foreign exchange hedge transaction, is attributed to that share class only. In terms of information disclosure, the CSSF requirements include the preparation of a key investor information document for each share class.
Footnote 4. A possible exception is an asset, such as a currency receivable, representing claims related to foreign exchange hedging transactions undertaken exclusively on behalf of a specific share class. In such a case, the asset (and its fluctuating value) inures solely to the share class and is incorporated only into that share class's share value.
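The apportionment described above can be illustrated with a small numerical sketch. The text does not define the distribution coefficient, so this example assumes, purely for illustration, that each share class's coefficient equals its share of the sub-fund's total net assets before the change; the function name, the treatment of class-specific expenses as a simple deduction, and all figures are hypothetical.

```python
def apportion(class_tna, pool_change, class_expenses=None):
    """Allocate a change in the pool's value to share classes.

    class_tna: dict mapping share class name -> net assets before the change.
    pool_change: change in value of the sub-fund's single pool of assets.
    class_expenses: optional dict of expenses pertaining to one class only.
    Assumption: coefficient = class net assets / total net assets.
    """
    class_expenses = class_expenses or {}
    total = sum(class_tna.values())
    if total <= 0:
        raise ValueError("total net assets must be positive")
    result = {}
    for name, tna in class_tna.items():
        coefficient = tna / total
        # Pool change is shared by coefficient; class-specific expenses are not shared.
        result[name] = tna + coefficient * pool_change - class_expenses.get(name, 0.0)
    return result


# Made-up figures: two share classes, a pool gain of 40, and an expense borne by class A only.
after = apportion({"A": 750.0, "I": 250.0}, pool_change=40.0, class_expenses={"A": 1.0})
print(after)  # {'A': 779.0, 'I': 260.0}
```

Class A receives three quarters of the pool gain and bears its own expense alone, while class I receives one quarter of the gain and is unaffected by that expense.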