6.1.2. Synonym Conflicts
Synonym conflicts result from naming semantically similar components of data cubes using different URIs. These components include:
Measure properties. For example, for a dataset that measures unemployment, the measure property can be modelled either by re-using the sdmx-measure:obsValue property or by defining a proprietary measure property (e.g., eg:unemployment).
Attribute properties. For example, the unit of measure can be represented by the sdmx-attribute:unitMeasure property or, alternatively, by a proprietary property (e.g., eg:unitMeasure).
Dimension properties. For example, SDMX dimension properties (e.g., sdmx:refArea) are commonly re-used for common dimensions (e.g., temporal, geospatial, gender, and age). An alternative is to define a new dimension property (e.g., eg:geo) instead.
Code lists. For example, for the unit of the measure property, alternative practices include re-using the QUDT vocabulary or re-using resources of the DBpedia vocabulary.
Hierarchical relation properties and levels of hierarchies. For example, hierarchical relations can be expressed using the dcterms:isPartOf and dcterms:hasPart properties or, alternatively, using new URIs.
The following paragraphs illustrate LOSD synonym conflicts by elaborating the definition through the different practices adopted by data portals.
C1.1: Naming the Measure Property
The measures of data cubes are commonly modelled as RDF properties (i.e., using URIs). All data portals investigated in this paper define and use a proprietary measure property. As a result, synonym conflicts arise that hamper the interoperability of datasets. A practice to address this conflict is to define each proprietary property as a sub-property of sdmx-measure:obsValue. This practice is also suggested by the QB vocabulary specification because it facilitates readability and processing of the RDF datasets. However, it is also considered redundant because it does not provide additional semantic value to the measure [12].
Table 3 presents details about the practices used by data portals regarding the name of the measure property.
Table 4 summarizes the practices used for the names of the measure properties that result in synonym conflicts.
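The sub-property practice described above can be illustrated with a minimal Python sketch. It assumes that triples are held as plain (subject, predicate, object) tuples of prefixed names rather than in an RDF library, and the properties eg:youthUnemployment and eg:population are hypothetical names introduced only for illustration. The function follows rdfs:subPropertyOf links transitively to check whether a proprietary measure can be treated as sdmx-measure:obsValue.

```python
RDFS_SUBPROPERTY = "rdfs:subPropertyOf"
OBS_VALUE = "sdmx-measure:obsValue"


def super_properties(prop, triples):
    """Return every property reachable from `prop` via rdfs:subPropertyOf."""
    parents = {}
    for s, p, o in triples:
        if p == RDFS_SUBPROPERTY:
            parents.setdefault(s, set()).add(o)
    seen, stack = set(), [prop]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # guards against cyclic declarations
                seen.add(parent)
                stack.append(parent)
    return seen


def is_generic_measure(prop, triples):
    """True if `prop` is, or specialises, sdmx-measure:obsValue."""
    return prop == OBS_VALUE or OBS_VALUE in super_properties(prop, triples)


# Hypothetical schema: eg:youthUnemployment is an assumed example property.
schema = [
    ("eg:unemployment", RDFS_SUBPROPERTY, OBS_VALUE),
    ("eg:youthUnemployment", RDFS_SUBPROPERTY, "eg:unemployment"),
]

print(is_generic_measure("eg:youthUnemployment", schema))  # True
print(is_generic_measure("eg:population", schema))         # False
```

A consumer that applies such a check can process cubes with differently named measures uniformly, provided the publishers declare the sub-property link.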
C1.2: Naming the Unit of Measure Property
The unit of measure defines the quantity or increment used to count or describe the measure of a data cube. The unit of measure property is commonly represented using URIs. DCLG, VLO, and the Scottish data portal re-use the sdmx-attribute:unitMeasure property for the unit of measure, a practice also suggested by the QB vocabulary. In addition, e-Stat defines and uses a proprietary attribute property (i.e., cd-attribute:unitMeasure), which is, however, linked to sdmx-attribute:unitMeasure using the dcterms:relation property. The rest of the data portals do not use a unit of measure property.
Table 5 presents the practices used by data portals.
Finally, Table 6 presents the three distinct practices regarding the names used for the unit of measure property. Using different URIs to express semantically similar unit properties results in synonym conflicts, although e-Stat’s practice could partially address this interoperability conflict. In addition, the fact that some data portals do not use units of measure may result in schema isomorphism conflicts, as semantically similar data cubes will have a different number of dimensions.
C1.3: Naming the Common Dimension Properties
The geospatial, temporal, gender, and age dimensions are the most common dimensions used to describe statistical data. Dimension properties in data cubes are commonly named with URIs and are defined in the structure of the data cube (i.e., qb:DataStructureDefinition). A challenge, hence, is to decide on the URIs that will be used for the common dimensions.
Regarding the geospatial dimension, most of the data portals re-use the sdmx-dimension:refArea property. Only VLO and ISTAT define and use proprietary properties (e.g., milieu:referentiegebied). VLO’s property is declared rdfs:subPropertyOf sdmx-dimension:refArea.
Regarding the temporal dimension, the Scottish data portal and DCLG re-use sdmx-dimension:refPeriod, while VLO, e-Stat, and ISTAT define proprietary properties (i.e., milieu:tijdsperiode, cd-dimension:timePeriod, and cen:haAnno respectively). VLO’s property is declared rdfs:subPropertyOf sdmx-dimension:timePeriod, while e-Stat’s property is related to sdmx-dimension:refPeriod as well as to cen:haAnno using the dcterms:relation property. Finally, Irish CSO does not use a temporal dimension because all observations refer to the year 2011.
Regarding the gender dimension, most data portals (i.e., the Scottish data portal, ISTAT, and Irish CSO) define and use proprietary properties. e-Stat also defines and uses a proprietary property (i.e., cd-dimension:sex), which is related to sdmx-dimension:sex and cen:haSesso properties using the dcterms:relation property. DCLG re-uses the sdmx-dimension:sex property. VLO does not use a gender dimension.
As with the gender dimension, most data portals (i.e., the Scottish data portal, ISTAT, and Irish CSO) define and use proprietary properties for the age dimension. In particular, ISTAT defines various age properties for different age groups. For example, the cen:haClasseEta15Anni property represents an age classification consisting of two categories (below or above 15 years), while cen:haClassiEta16Categorie represents 16 age-group categories. This practice allows using code lists that include only values that are used in the data cube. e-Stat also defines and uses a proprietary age property (i.e., cd-dimension:age), which is, however, related to sdmx-dimension:age using the dcterms:relation property. DCLG re-uses the sdmx-dimension:age property. VLO does not use an age dimension.
All the approaches used by data portals for the geospatial, temporal, gender, and age dimensions are presented in Table 7.
Using different URIs for semantically similar dimensions results in synonym schema conflicts.
Table 8 summarizes the four distinct practices used by data portals to represent the geospatial, temporal, gender, and age dimension properties.
Finally, the fact that some of the data portals do not use all the common dimensions also causes schema isomorphism conflicts.
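To make the effect of these links concrete, the following Python sketch groups the dimension properties named above into clusters of declared synonyms using a simple union-find structure. It rests on two assumptions that are not made in the text: that rdfs:subPropertyOf and dcterms:relation can both be read as "may be aligned" (dcterms:relation in particular carries much weaker semantics), and that triples are represented as plain tuples of prefixed names.

```python
# Links between dimension properties as described in the text.
LINKS = [
    ("milieu:referentiegebied", "rdfs:subPropertyOf", "sdmx-dimension:refArea"),
    ("milieu:tijdsperiode", "rdfs:subPropertyOf", "sdmx-dimension:timePeriod"),
    ("cd-dimension:timePeriod", "dcterms:relation", "sdmx-dimension:refPeriod"),
    ("cd-dimension:timePeriod", "dcterms:relation", "cen:haAnno"),
    ("cd-dimension:sex", "dcterms:relation", "sdmx-dimension:sex"),
    ("cd-dimension:sex", "dcterms:relation", "cen:haSesso"),
    ("cd-dimension:age", "dcterms:relation", "sdmx-dimension:age"),
]


def group_synonyms(links):
    """Cluster properties that are connected by any of the given links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, _predicate, o in links:
        root_s, root_o = find(s), find(o)
        if root_s != root_o:
            parent[root_s] = root_o

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


for group in group_synonyms(LINKS):
    print(sorted(group))
```

Under these assumptions, the temporal properties fall into two separate clusters, one around sdmx-dimension:timePeriod and one around sdmx-dimension:refPeriod, so the declared links alone do not resolve the conflict between them.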
C1.4: Naming Hierarchical Structures
Statistical data often include hierarchical structures (e.g., geographical divisions). Hierarchical structures include generalization/specialization relations (e.g., Greece is part of Europe) and hierarchical levels (e.g., country, region, city). The QB vocabulary suggests using the skos:narrower property (or defining a sub-property of it) to define relationships in hierarchical code lists. However, it also suggests using qb:parentChildProperty in some cases (e.g., when publishers wish to be able to re-use existing code lists).
Nevertheless, some data portals (i.e., DCLG and the Scottish data portal) define and use proprietary properties (e.g., spatial:within, spatial:parent, and spatial:contains) to indicate relations in hierarchical structures. For example, DCLG defines that West Midlands (i.e., an English region) is spatial:within England. Because this practice does not use a standard vocabulary, however, it makes the interpretation of the data difficult. Moreover, e-Stat uses the dcterms:isPartOf and dcterms:hasPart properties. For example, Toyota-Shi (a Japanese city) dcterms:isPartOf Aichi-ken (a Japanese prefecture). ISTAT, VLO, and Irish CSO do not define hierarchical relations.
To define the hierarchical levels, some data portals (e.g., ISTAT, DCLG, and the Scottish data portal) use the rdf:type property. For example, DCLG defines that West Midlands is rdf:type http://opendatacommunities.org/def/ontology/admingeo/Region while a specific country (e.g., England) is rdf:type http://opendatacommunities.org/def/ontology/admingeo/Country. e-Stat defines a new property (i.e., sacs:administrativeClass) to define hierarchical levels. For example, Toyota-Shi sacs:administrativeClass sacs:City, while Aichi-ken sacs:administrativeClass sacs:Prefecture. Irish CSO and VLO do not define hierarchical levels.
Table 9 presents the practices of data portals regarding hierarchical relations and structures.
Using different URIs for semantically similar hierarchical relations and hierarchical levels results in synonym conflicts. In addition, the lack of URIs in some data portals for the hierarchical relations and hierarchical levels results in schema isomorphism conflicts.
Table 10 presents the distinct practices for defining hierarchical relations.
The distinct practices for defining hierarchical levels are presented in Table 11.
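One way a data consumer might reconcile the different hierarchical relation properties is to rewrite them into uniform (child, parent) pairs, as in the minimal Python sketch below. The direction assigned to each property is an assumption: spatial:within and dcterms:isPartOf follow the examples given above, spatial:contains is assumed to be the inverse of spatial:within, and spatial:parent is left out because its direction is not stated. The eg: resource identifiers are hypothetical stand-ins for the portals' actual URIs.

```python
# Properties whose subject is the child and whose object is the parent.
CHILD_TO_PARENT = {"dcterms:isPartOf", "spatial:within"}
# Properties whose subject is the parent and whose object is the child.
# ASSUMPTION: spatial:contains is treated as the inverse of spatial:within.
PARENT_TO_CHILD = {"dcterms:hasPart", "spatial:contains", "skos:narrower"}


def normalize_hierarchy(triples):
    """Return a set of (child, parent) pairs; unknown predicates are ignored."""
    edges = set()
    for s, p, o in triples:
        if p in CHILD_TO_PARENT:
            edges.add((s, o))
        elif p in PARENT_TO_CHILD:
            edges.add((o, s))
    return edges


# Hypothetical identifiers for the resources mentioned in the text.
triples = [
    ("eg:WestMidlands", "spatial:within", "eg:England"),
    ("eg:Toyota-Shi", "dcterms:isPartOf", "eg:Aichi-ken"),
    ("eg:Aichi-ken", "dcterms:hasPart", "eg:Toyota-Shi"),
    ("eg:WestMidlands", "rdf:type", "eg:Region"),  # not a hierarchy relation
]

for child, parent in sorted(normalize_hierarchy(triples)):
    print(child, "->", parent)
```

Because dcterms:isPartOf and dcterms:hasPart describe the same relation from both ends, the two e-Stat style triples collapse into a single pair.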
C1.5: Populating the Unit of Measure
The values of the units of measure are usually URIs extracted either from code lists (e.g., skos:ConceptScheme) or from vocabularies (e.g., the QUDT Units Vocabulary (http://qudt.org/)). The QB vocabulary specification recommends re-using common code lists and vocabularies for the values of the unit of measure. For example, DCLG (i) uses QUDT (e.g., qudt:Percent), (ii) uses DBpedia resources for currency units (e.g., http://dbpedia.org/resource/Pound_sterling), and (iii) defines a new code list (i.e., http://opendatacommunities.org/def/concept-scheme/measure-units) with additional units such as http://opendatacommunities.org/def/concept/measure-units/pounds and http://opendatacommunities.org/def/concept/measure-units/pounds-per-hour. At the same time, the Scottish data portal, e-Stat, and VLO define their own code lists (http://statistics.gov.scot/def/concept-scheme/measure-units, cd-attribute:UnitMeasureConceptScheme, and https://id.milieuinfo.be/vocab/imjv/conceptscheme/eenheden#id respectively) with measurement units such as http://statistics.gov.scot/def/concept/measure-units/percentage, cd-attribute:code/unitMeasure-year, and https://id.milieuinfo.be/vocab/imjv/concept/eenheid/Meter#id respectively. In VLO, some unit values are also related to QUDT using rdfs:seeAlso. For example, https://id.milieuinfo.be/vocab/imjv/concept/eenheid/Meter#id rdfs:seeAlso qudt:Meter. ISTAT and Irish CSO do not define units of measure.
Table 12 presents the practices used by data portals.
Finally, Table 13 presents the four distinct practices of data portals. Using different URIs to express semantically similar units results in synonym conflicts.
C1.6: Populating the Temporal Dimension
Temporal dimensions may refer to time periods (e.g., ‘2019’) or points in time (e.g., ‘01-05-2019’). The values that are used to populate the temporal dimension can be drawn from a code list or a vocabulary or, alternatively, they can be encoded as data values (e.g., an xsd:dateTime). The QB vocabulary, for example, suggests re-using the reference.data.gov.uk vocabulary and declaring this within the data structure definition of the data cube.
DCLG re-uses values from the reference.data.gov.uk vocabulary (e.g., http://reference.data.gov.uk/id/year/2015), while the Scottish data portal defines and uses a proprietary code list for each data set (e.g., http://statistics.gov.scot/def/code-list/house-sales-prices/refPeriod) that re-uses values from the reference.data.gov.uk vocabulary. Proprietary code lists allow defining additional values related to the ones included in existing code lists. In contrast, ISTAT, e-Stat, and VLO use literal values for the temporal dimension (e.g., “2010”^^xsd:gYear in e-Stat, “2011” in ISTAT, and 2010 in VLO). Finally, Irish CSO does not define a temporal dimension. These different practices may result in synonym conflicts when different values are used for the same point or period in time.
Table 14 presents the practices used by data portals.
Table 15 presents the distinct practices of the data portals for populating the temporal dimension.
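The different encodings of the same year can be reconciled by mapping each of them to a common representation. The following Python sketch, which is only an illustration and not a method used by any of the portals, converts the four forms mentioned above (a reference.data.gov.uk year URI, a typed xsd:gYear literal, a plain string, and a bare number) into an integer year. It assumes that typed literals arrive in their Turtle-like textual form and handles calendar years only; quarters, months, and other intervals would need additional patterns.

```python
import re

YEAR_URI = re.compile(r"^http://reference\.data\.gov\.uk/id/year/(\d{4})$")
TYPED_LITERAL = re.compile(r'^"(\d{4})"\^\^xsd:gYear$')
PLAIN_YEAR = re.compile(r"^(\d{4})$")


def to_year(value):
    """Map a year URI, typed literal, plain string, or int to an int year."""
    if isinstance(value, int):
        return value
    for pattern in (YEAR_URI, TYPED_LITERAL, PLAIN_YEAR):
        match = pattern.match(value)
        if match:
            return int(match.group(1))
    raise ValueError(f"unsupported temporal value: {value!r}")


values = [
    "http://reference.data.gov.uk/id/year/2015",  # DCLG style
    '"2010"^^xsd:gYear',                          # e-Stat style
    "2011",                                       # ISTAT style
    2010,                                         # VLO style
]
print([to_year(v) for v in values])  # [2015, 2010, 2011, 2010]
```

After normalization, the e-Stat and VLO values for 2010 compare as equal, which is the alignment the differing encodings otherwise prevent.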
C1.7: Populating the Gender Dimension
As with the other dimensions, the gender dimension is also commonly populated using URIs from code lists. The QB vocabulary recommends directly re-using the sdmx-dimension:sex property for the gender dimension in order to be able to re-use its default code list, which includes sdmx-code:sex-F (female), sdmx-code:sex-M (male), sdmx-code:sex-U (undefined), sdmx-code:sex-N (not applicable), and sdmx-code:sex-T (total).
Nevertheless, the Scottish data portal, the Irish CSO, and e-Stat define proprietary code lists (http://statistics.gov.scot/def/concept-scheme/gender, http://data.cso.ie/census-2011/classification/gender, and cd-code:SexConceptScheme respectively) for the gender dimension. The first one has five values: male, female, all, unknown, and not-specified (e.g., http://statistics.gov.scot/def/concept/gender/not-specified); the second one has three values: male, female, and both (e.g., http://data.cso.ie/census-2011/classification/gender/both); and the third one has three values: female, male, and all (e.g., cd-code:sex-all). Defining a proprietary code list is usually preferred when there is a need for additional values that are not provided by the SDMX vocabulary. e-Stat’s values, however, are related to sdmx:sex values using skos:closeMatch. Finally, DCLG and ISTAT use the SDMX code list to populate gender dimensions, while VLO does not use a gender dimension.
Table 16 presents the practices used by data portals.
Table 17 presents the distinct practices of data portals for populating the gender dimension. Using different URIs for semantically similar gender values results in synonym conflicts.
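As a rough illustration of how such links could be used, the Python sketch below maps proprietary gender codes to the SDMX code list and reports the codes that remain unaligned. The two alignments in ASSUMED_CLOSE_MATCH are assumptions made for this example: the text states that e-Stat uses skos:closeMatch but does not list the individual pairs, and no such link is reported for the Irish CSO code list.

```python
SDMX_SEX_CODES = {
    "sdmx-code:sex-F",
    "sdmx-code:sex-M",
    "sdmx-code:sex-U",
    "sdmx-code:sex-N",
    "sdmx-code:sex-T",
}

# ASSUMED alignments, for illustration only (not taken from the portals).
ASSUMED_CLOSE_MATCH = {
    "cd-code:sex-all": "sdmx-code:sex-T",
    "http://data.cso.ie/census-2011/classification/gender/both": "sdmx-code:sex-T",
}


def harmonize_gender(code, close_match=ASSUMED_CLOSE_MATCH):
    """Return the SDMX code for `code`, or None if no alignment is known."""
    if code in SDMX_SEX_CODES:
        return code
    return close_match.get(code)


def split_by_alignment(codes):
    """Separate codes into an aligned mapping and a list of unaligned codes."""
    aligned, unaligned = {}, []
    for code in codes:
        target = harmonize_gender(code)
        if target is None:
            unaligned.append(code)
        else:
            aligned[code] = target
    return aligned, unaligned


aligned, unaligned = split_by_alignment([
    "sdmx-code:sex-F",
    "cd-code:sex-all",
    "http://statistics.gov.scot/def/concept/gender/not-specified",
])
print(aligned)
print(unaligned)
```

Codes such as not-specified stay in the unaligned list because it is unclear whether they correspond to sdmx-code:sex-U or sdmx-code:sex-N; deciding this requires knowledge of the publisher's intent rather than a mechanical mapping.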