The Use of National Strategic Reference Framework Data in Knowledge Graphs and Data Mining to Identify Red Flags

Abstract: Red Flags in fiscal projects are warning signs that may indicate underlying problems with their implementation. In this paper, we present how National Strategic Reference Framework (NSRF) Open Data can be combined with semantic web technologies and data mining techniques to build a knowledge-based system that identifies Red Flags. We collected the data from the Open Data API provided by the Greek Ministry of Economy and Finance. The data model consists of two ontologies: the Vocabulary of Fiscal Projects, which describes fiscal projects, and the National Strategic Reference Framework Greece Vocabulary, which describes the Greek NSRF data. We transformed the data into RDF triples and uploaded them to an OpenLink Virtuoso Server, so that we could retrieve them via SPARQL queries. Performance indicators were defined to assess the state of each project, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) was used to identify Red Flags. The user requirements are that rejected projects should raise Red Flags, so that project failure can be averted, and that the auditor should be able to organize the monitoring process efficiently by not having to examine most of the non-problematic projects. We performed a use case scenario in which an auditor has to examine NSRF projects approximately 12 months before the end of the programming period. The system retrieved the fiscal information, calculated the performance indicators and identified the Red Flags. To test whether the user requirements are satisfied, we retrieved the last update of the project status after the end of the programming period and extracted the number of rejected projects. Rejected projects constitute 3.8% of the total projects. The results of the use case scenario show that the RedFlags platform is more likely to raise Red Flags on rejected projects and more likely not to raise them on projects that were not rejected. Therefore, the RedFlags platform, using open data, assists the auditor in organizing the monitoring process better.


Introduction
A very large amount of the European Union's total budget is spent on regional policy, via the structural funds with the main purpose of reducing the economic disparities between the member states and supporting job creation, business competitiveness, economic growth, sustainable development, and improving the quality of life.
In Greece, the National Strategic Reference Framework (NSRF) establishes the priorities for spending these funds at the national level, over a time window of seven years, in order to raise the competitiveness of the economy, develop human capital and ensure higher employment and income, as well as better social integration [1]. The General Secretariat for Investments and NSRF of Greece provides online services that give all interested parties access to the NSRF project data and support the transparency of the public sector, in accordance with the provisions of Chapter A of Law 4305/2014 (Government Gazette 237/A) regarding the open disposition and further use of documents, information and data of the public sector [2].
The data shapes, and automatically discover the number of clusters [19]. DBSCAN is a robust clustering algorithm which has been compared with other data mining algorithms on a variety of datasets. Recent studies showed that it can be used as part of a system that identifies clusters to solve single-target and multi-target regression tasks on several datasets [20], and that it can be used to generate fault clustering templates for reducing the influence of noise on the diagnostic accuracy of rolling bearing datasets [21]. Additionally, it has been tested on high-dimensional datasets in which clusters are formed by both distance and density structures, where many clustering algorithms fail to identify these clusters correctly [22].
There are various approaches which combine data mining methods and knowledge discovery with Semantic Web data, which support different data mining tasks and improve the Semantic Web [23]. The purpose of this paper is to propose a framework and implement a knowledge-based system to monitor NSRF projects. The system uses open data and semantic web technologies with linked data principles, so that it can be linked with other datasets; a SPARQL endpoint to retrieve data; performance indicators to monitor the implementation; data mining techniques to identify Red Flags; and techniques to visualize the results. This knowledge-based system was developed as a web application, RedFlags. The rest of the paper is structured as follows: Section 2 provides the complete design of the knowledge-based system, from the data extraction to the data mining techniques. Section 3 reports the results, Section 4 includes the user requirements and a use case scenario, and Section 5 concludes this paper with some directions for future research.

Overview
In this section, we describe the knowledge discovery process: the NSRF data used in the RedFlags application, the vocabularies that semantically represent them, and the process of retrieving the needed data using SPARQL queries, defining performance indicators and using data mining techniques to identify Red Flags (Figure 1).

Data
The official website of the Greek Ministry for Development and Competitiveness publishes data related to the implementation process and the economic activity of the NSRF projects for the programming period at http://2013.anaptyxi.gov.gr/. In order to strengthen the transparency of the public sector, the database is updated daily and can be accessed through the Open Data API [24]. These data provide information about two main categories of actions: projects and support-grants.

• Projects: "A group of activities aiming at the realisation of a functionally complete and distinct result. Some projects may consist of other subprojects." [3].
• Support-Grants: "An advantage in any form whatsoever conferred on a selective basis to organisations involved in economic activity private or public ('undertakings') by national public authorities with the potential to distort competition and affect trade between member states of the European Union. The advantage can take different forms of assistance including the direct transfer of resources, such as grants and soft loans, and also indirect assistance, for example, relief from charges that an undertaking normally has to bear, such as a tax exemption or the provision of services, loans, at a favourable rate." [3].
There is also a category with 181 Priority Projects ..."the selection of which was made by the Greek authorities in cooperation with the qualified European Commission Services, based on criteria related to the maturity, size and importance of their social and economic impact. The Priority Projects consist of other Projects or Support-Grants" [3].
These data include information about the following: public expenditure budget, contracts signed, payment amounts, the start and end date, status, location, description of projects, number and the title of their subprojects, the thematic priority and the operational programme in which they belong, beneficiaries or other involved organisations and various related documents (pictures, pdfs and docs). Also, some projects may involve expropriations. An expropriation is defined as ..."obligatory, according to the law and based on a defined compensation, acquisition of one's property by the state, for reasons of public necessity or utility" [3]. The expropriation data consist of information about the area, the compensation money and the decisions based on which they are implemented.

Semantic Data Modeling
Existing vocabularies that could be used to describe fiscal projects and their implementation process are FRAPO, an ontology for describing the administrative information of research projects, and FP6 and FP7, which were used to model information for the European Commission's Framework Programme research projects. These ontologies are very specific to modeling information about research projects and could not be used in our case, which was to describe the properties of a financial project and its implementation process for the Greek NSRF. The Greek NSRF consists not only of research projects but also of infrastructure projects and projects regarding energy, the environment, culture and tourism.
The absence of an ontology that describes financial projects led us to develop the Vocabulary of Fiscal Projects (VFP) [25] and the National Strategic Reference Framework Greece Vocabulary (NSRF-GR) [26], ontologies that could be used as a basis for the semantic representation of fiscal projects and the Greek NSRF data, respectively. VFP is identified by the namespace URI http://purl.org/vocab/vfp#, the preferred prefix is vfp, and it is also available through the GitHub repository. The design is based on research into other EU countries' web portals that provide similar information about projects. Table 1 shows the four main classes we defined to optimise the coverage of terminology in the context of fiscal project data.
The main class of the ontology is vfp:Project. A project is always associated with some organisations (vfp:Organization), a location (vfp:Place) and some documents (vfp:Document). A more detailed cross reference of the ontology classes and properties is available on its webpage. Figure 2 depicts the classes and their relations. The NSRF-GR Vocabulary extends VFP with new classes and properties to describe NSRF data in as much detail as possible. It is identified by the namespace URI http://purl.org/vocab/nsrf-gr#, the preferred prefix is nsrf-gr, and it is also available through the GitHub repository. The classes and their relations are shown in Figure 3. For each project category we created another class, a subclass of vfp:Project. More details about the classes and the properties can be found in the cross reference section of the ontology's web page.

NSRF Knowledge Graph and Data Retrieval
We retrieved the data through the Open Data API using Python scripts and stored them in a local database. The transformation of the NSRF data into a knowledge graph was done using the UnifiedViews ETL tool. The main advantage of this tool is that it can extract data straight from relational databases and then transform them into RDF triples [27][28][29]. After the transformation process, the RDF files were uploaded to an OpenLink Virtuoso Server. We then used SPARQL queries to retrieve the data from the server and analyze them.
In order to semantically represent the information we extracted from the Open Data Portal about the Greek NSRF projects, we used properties from the VFP ontology to describe the title (vfp:title) of the project, the public expenditure budget (vfp:budget), the total amount of signed contracts (vfp:contracts), the payment amount (vfp:payments), the current status (vfp:currentStatus), the location (vfp:location), a detailed description of the project (vfp:description), its start (vfp:startDate) and end date (vfp:endDate), a status report (vfp:statusReport) and the report date (vfp:statusDate), as well as the url of the project (vfp:url) and the documents related to this project (vfp:document). We also used properties from the NSRF-GR vocabulary to represent the project's beneficiary (nsrf-gr:body), the operational programme to which it belongs (nsrf-gr:operational), its thematic priority (nsrf-gr:thematic) and the number of subprojects it has (nsrf-gr:countSubproject). Finally, all projects have a unique code, denoted as MIS, and were assigned the rdf:type nsrf-gr:Project.
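As an illustration of the mapping described above, the following minimal sketch serialises one project record as Turtle using only the Python standard library. It is not the UnifiedViews pipeline used in this work. The base IRI `http://example.org/project/`, the helper names `escape` and `project_to_turtle`, the record values and the choice of `xsd:decimal` for the amounts are all assumptions made for the example; only the two namespace URIs and the property names come from the text.

```python
# Namespace URIs as given in the text; BASE is a placeholder, not the real IRI scheme.
VFP = "http://purl.org/vocab/vfp#"
NSRF_GR = "http://purl.org/vocab/nsrf-gr#"
XSD = "http://www.w3.org/2001/XMLSchema#"
BASE = "http://example.org/project/"  # hypothetical


def escape(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def project_to_turtle(record):
    """Return a Turtle document describing one project record (a dict)."""
    title = escape(record["title"])
    lines = [
        f"@prefix vfp: <{VFP}> .",
        f"@prefix nsrf-gr: <{NSRF_GR}> .",
        f"@prefix xsd: <{XSD}> .",
        "",
        f"<{BASE}{record['mis']}> a nsrf-gr:Project ;",
        f'    vfp:title "{title}" ;',
        # Typing the amounts as xsd:decimal is an assumption of this sketch.
        f'    vfp:budget "{record["budget"]:.2f}"^^xsd:decimal ;',
        f'    vfp:contracts "{record["contracts"]:.2f}"^^xsd:decimal ;',
        f'    vfp:payments "{record["payments"]:.2f}"^^xsd:decimal .',
    ]
    return "\n".join(lines)


# Illustrative record; the amounts are made up for the example.
example = {"mis": 200000, "title": "Example project",
           "budget": 1000.0, "contracts": 800.0, "payments": 400.0}
print(project_to_turtle(example))
```

The sketch covers only literal-valued properties; properties that point to code lists would be emitted as IRIs rather than quoted strings.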
The values of the object properties vfp:currentStatus, nsrf-gr:operational, nsrf-gr:location, nsrf-gr:body and nsrf-gr:thematic were not represented as literal terms; instead, we chose to use code lists. The code lists were semantically represented using SKOS, since it is a widespread vocabulary that provides a standard way to organize knowledge using RDF and allows the hierarchical ordering of terms [28].
The query in Listing 1 can be executed on the SPARQL endpoint to retrieve information about the title, description, beneficiary, current status, location, operational programme, thematic priority and the url of the NSRF projects. All the IRIs that resulted from the SPARQL query are dereferenceable and point to HTML pages with information about the resources. For the IRI dereferencing we used the RDFBrowser [30], which is an open source Linked Data content negotiator and HTML description generator. Figure 4 shows the HTML representation of the project resource with MIS code 200000. The SPARQL query in Listing 2 can be used to retrieve information about the budget, the contracts, the payments, and the start and end dates of the NSRF projects. The results are also shown in Table 2.
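To make the retrieval step concrete, the sketch below assembles a SPARQL SELECT query over the fiscal properties named in the text and encodes it as a GET request with the `query` parameter of the SPARQL 1.1 Protocol. It is an illustrative reconstruction and not a reproduction of the paper's listings. The endpoint URL `http://example.org/sparql` and the function names are placeholders, and the exact graph pattern is an assumption based on the property names given above.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL

PREFIXES = (
    "PREFIX vfp: <http://purl.org/vocab/vfp#>\n"
    "PREFIX nsrf-gr: <http://purl.org/vocab/nsrf-gr#>\n"
)


def build_fiscal_query(limit=10):
    """Build a SELECT query for budget, contracts, payments and dates."""
    return PREFIXES + (
        "SELECT ?project ?budget ?contracts ?payments ?start ?end\n"
        "WHERE {\n"
        "  ?project a nsrf-gr:Project ;\n"
        "           vfp:budget ?budget ;\n"
        "           vfp:contracts ?contracts ;\n"
        "           vfp:payments ?payments ;\n"
        "           vfp:startDate ?start ;\n"
        "           vfp:endDate ?end .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request URL (the request itself is not sent)."""
    return endpoint + "?" + urlencode({"query": query})


query = build_fiscal_query(limit=5)
print(query)
print(build_request_url(query))
```

The request is only constructed here, not sent; the result format would normally be negotiated through the HTTP Accept header.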

Performance Indicators
Monitoring and evaluation systems are based on indicators that assess the state of a project [6,7,9,10,13,31]. We use three indicators computed from the contract, budget and payment amounts in the retrieved data. These indicators track the way in which NSRF projects evolve towards completion and constitute the input features of the clustering algorithm.
The completion index is defined by the Greek Ministry of Economy and Finance as the ratio of the payments registered at the moment of data retrieval to the updated budget amount at the moment of data retrieval [3]. We define two other indices, namely payment completion and contract completion, as follows. Payment completion is defined as the ratio of the payments registered at the moment of data retrieval to the updated contracted amount. The payment completion index shows the status of the payments over the contracts at the time we retrieved the data, while the completion index shows the status of the payments over the budget of the whole project.
Contract completion is defined as the ratio of the updated contracted amount to the updated budget at the moment of data retrieval.
The indices should lie between 0 and 1. A value over 1 means that a significant change occurred in the project that could not be covered by its fiscal plan.
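The three ratios defined above can be sketched in a few lines of Python. The function name `performance_indicators` and the example amounts are illustrative, and returning `None` for a zero denominator is an assumption of this sketch, since the text does not say how such cases are handled.

```python
def performance_indicators(budget, contracts, payments):
    """Return (completion, payment_completion, contract_completion).

    completion          = payments / budget
    payment_completion  = payments / contracts
    contract_completion = contracts / budget
    A ratio is None when its denominator is zero (assumption of this sketch).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return (
        ratio(payments, budget),
        ratio(payments, contracts),
        ratio(contracts, budget),
    )


# Illustrative amounts: budget 1000, contracts 800, payments 400.
completion, payment_completion, contract_completion = performance_indicators(
    budget=1000.0, contracts=800.0, payments=400.0
)
print(completion, payment_completion, contract_completion)
```

Whenever all three ratios are defined, the completion index equals the product of payment completion and contract completion, because payments/budget = (payments/contracts) × (contracts/budget).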
Indicators for each project can be calculated and retrieved using the SPARQL query of Listing 3.
Whether a project is a Red Flag is not recorded in the official data portal of the Ministry. The available data, described in Section 2.2, concern public expenditure budgets, contracts signed, payment amounts, the start and end date, status, location, description of projects, number and title of their subprojects, the thematic priority and the operational programme to which they belong, beneficiaries or other involved organisations and various related documents (pictures, pdfs and docs). Supervised approaches are used when there is prior knowledge of what the output values for the samples should be. Since no such labels are available here, unsupervised learning, which acts on data without categorization, is appropriate [29,32]. Partitioning and hierarchical clustering algorithms are more effective on compact and well separated clusters; however, in the presence of noise and outliers in the data, these methods are not very effective [33][34][35]. We selected the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to detect areas with high density (clusters of any shape) in the defined feature space (Figure 5), in order to eventually reveal the projects that could be considered as Red Flags.
Having defined the 3-dimensional feature space described in Section 2.5, each project is represented by one point. Let ε be the radius of a neighbourhood with respect to some point P and let MinPts be the minimum number of neighbours within this radius. The notion of density in the feature space is based on the following definitions [19,33]:
• A point P1 is a core point if at least MinPts points are within distance ε. Those points are said to be directly reachable from P1.
• A point P2 is density reachable from a point P1 with regard to ε and MinPts, if there is a path of core points where each point of the path is directly reachable from the previous one.
• A point P2 is density connected to a point P1 with regard to ε and MinPts, if there is a point P3 such that P1 and P2 are density reachable from P3 with respect to ε and MinPts.
• A group of density connected points forms a density-based cluster, and points that are not reachable from any other point are outliers.
Based on these density conditions, there are three different kinds of points: core points, density reachable points and outliers, as shown in Figure 5. Points P1 and P2 are core points, because the area within an ε radius around each of them contains at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Point P3 is not a core point, but it is reachable from P1 and thus belongs to the cluster as well. Point Pr is a noise point that is neither a core point nor directly reachable.
DBSCAN computes the Euclidean distance between an arbitrarily selected point (the starting point) and the other points, and finds the neighbours within ε-distance of the starting point. If the number of neighbours is equal to or greater than MinPts, they form a cluster. These points are considered as "visited". This process is repeated with the remaining core points until the cluster is fully expanded, and these iterations are then repeated with the unvisited points to form other clusters. If the number of neighbours is less than MinPts, the point is marked as a Red Flag.
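The procedure described above can be sketched as a compact, brute-force DBSCAN in pure Python. This is a didactic version under stated assumptions, not the implementation used by the RedFlags platform, which the text does not specify. The names `region_query`, `dbscan` and `NOISE` are illustrative, and the neighbourhood of a point includes the point itself, in line with the description of Figure 5.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


# Four mutually close points and one isolated point in a 3-D feature space.
points = [(0.10, 0.10, 0.10), (0.11, 0.10, 0.10), (0.10, 0.11, 0.10),
          (0.10, 0.10, 0.11), (0.90, 0.90, 0.90)]
print(dbscan(points, eps=0.015, min_pts=4))
```

Each region query scans all points, so the sketch runs in quadratic time; a spatial index would be the usual choice for a dataset of this size.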
A rule of thumb for specifying MinPts is to use at least the number of dimensions of the data set plus one. In this case, MinPts was set to k = dim(data) + 1 = 4 [32,36]. The optimal ε radius was specified using a 4-dimensional tree which computes the distances to the 4 nearest neighbours of every point. Figure 6 shows the points sorted by this distance in ascending order; the optimal ε is selected at the knee of the curve, the value where a sharp change occurs, which is around 0.015 [19].
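The sorted k-distance curve used to choose ε can be sketched as follows. The sketch replaces the tree-based neighbour search mentioned above with a brute-force search, which gives the same distances but scales quadratically. The function name `k_distances` is illustrative, and excluding each point from its own neighbour list is an assumption, since conventions differ on this.

```python
import math


def k_distances(points, k):
    """Sorted distance from every point to its k-th nearest neighbour.

    The point itself is excluded from its neighbours (assumption of this
    sketch). Requires more than k points.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


# One-dimensional toy data: three evenly spaced points and one far away.
print(k_distances([(0.0,), (1.0,), (2.0,), (10.0,)], k=1))
```

Plotting the returned list against its index gives a curve of the kind shown in Figure 6, and the knee can then be read off visually.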

Red Flags
Red Flags are defined as clusters of projects with extreme behaviour compared to other clusters of projects. Red Flags are warning signs that do not indicate guilt or innocence [37]. Clusters containing 5% or less of the total number of projects are characterized as extreme clusters. The 5% threshold was selected by trial and error, by testing different thresholds.
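The 5% rule can be expressed as a short post-processing step over the cluster labels. The sketch below is illustrative: the function name `red_flag_clusters` and the toy labels are assumptions, and treating points labelled as noise (-1) as one more small group is a choice made for this example rather than something stated in the text.

```python
from collections import Counter


def red_flag_clusters(labels, threshold=0.05):
    """Return the set of labels whose share of all projects is <= threshold."""
    total = len(labels)
    counts = Counter(labels)
    return {label for label, size in counts.items() if size / total <= threshold}


# Toy example: 96 projects in cluster 0, 3 in cluster 1 and 1 noise point.
labels = [0] * 96 + [1] * 3 + [-1]
flagged = red_flag_clusters(labels)
print(flagged)
print([label in flagged for label in labels].count(True), "projects flagged")
```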

Results
NSRF data for the programming period 2007-2013 consist of 11558 projects that were contracted and executed. The proposed performance indicators, as retrieved by the SPARQL query (Listing 3), are shown in Table 3. Table 4 shows the basic descriptive statistics of the performance indicators of the projects. The large standard deviation indicates the existence of extreme values. Figure 7 shows how the projects are distributed over the principal components of the feature space. The feature space consisted of completion, payment completion and contract completion. DBSCAN detected areas with high density in the defined feature space and revealed 92 groups of projects. Table 5 shows the ten most populated clusters. Cluster 1 (Figure 7) consists of 8150 projects, a share that exceeds the threshold of 0.05 (8150/11558 = 0.7051 > 0.05). These projects do not exhibit extreme behaviour and have been successfully executed.
The other clusters are Red Flag clusters, as each contains less than 5% of the total number of projects. The second most populated cluster is cluster 4, which includes 4.41% < 5% of the total projects (510/11558 = 0.0441); the third, cluster 0, consists of 4.38% < 5% of the total projects (506/11558 = 0.0438), and so forth. In total, 3408 projects were identified as Red Flags, constituting 29.49% (3408/11558 = 0.2949) of the total projects.

User Requirements and Use Case Scenario
Red Flags indicate which funded projects should be monitored during their implementation, so that the competent authorities can be guided to improve or correct weaknesses and to prevent failures in operations, accounts and systems. Therefore, the user requirements of the RedFlags platform are:

1. Rejected projects should raise Red Flags, in order to avert project failure if possible.
2. Assist competent authorities in organizing the monitoring process efficiently, without loss or waste of time, by not having to examine most of the non-problematic projects.
According to the user requirements, we performed a use case scenario. In this scenario, the competent monitoring authority has to examine NSRF projects approximately 12 months before the end of the programming period. The RedFlags platform assists the authority in organizing the monitoring process and examining first the projects that raised a Red Flag. Marking a project as a Red Flag means that the project probably has significant problems, such as higher payments than the available budget (completion index), or higher payments than the available contract amounts (payment completion index). These projects have high priority to be examined in order to avoid rejection. Since the available ground truth is the rejection status at the end of the programming period, when the data were retrieved, we evaluate the performance of the RedFlags platform on 438 rejected projects out of 11558 NSRF projects. Under these circumstances, the use case scenario shows the performance of the RedFlags platform on imbalanced data, since rejected projects constitute only 3.8% (438/11558 = 0.038) of the dataset (low prevalence).
The system retrieved the fiscal information, calculated the indicators of the NSRF projects as described in Section 2.5 and identified the Red Flags. To test whether the user requirements are satisfied, we checked the last update of the projects after the end of the NSRF programming period and extracted the number of rejected projects. The following tables show the results of this use case scenario.
The contingency table (Table 6) of rejected projects and projects classified as Red Flags shows that 312 projects raised a Red Flag and were rejected (True Positives, TP), 126 projects were rejected but did not raise a Red Flag (False Negatives, FN), 8024 projects did not raise a Red Flag and were not rejected (True Negatives, TN) and 3096 projects were classified as Red Flags but were not rejected (False Positives, FP). Out of the 11558 projects, 3408 were marked as Red Flags. According to Table 7, prevalence, defined as the proportion of rejected projects to the total number of NSRF projects, is equal to 3.8% (Prevalence = (TP+FN)/(TP+TN+FP+FN) = 0.038). Low prevalence is expected for a successful NSRF programming period, as a higher value of this metric means that the NSRF programme encountered problems and that more projects failed to complete.

Table 7. Joint probabilities of Red Flag indications and rejected projects.
                    Rejected (1)                   Not Rejected (0)               Total
Red Flag (r)        P(r,1) = 312/11558 = 0.027     P(r,0) = 3096/11558 = 0.268    P(r) = 0.295
No Red Flag (nr)    P(nr,1) = 126/11558 = 0.011    P(nr,0) = 8024/11558 = 0.694   P(nr) = 0.705
Total               P(1) = 0.038                   P(0) = 0.962                   1

In these terms, Precision (Positive Predictive Value, PPV) and Negative Predictive Value (NPV) are equal to 9% (PPV = TP/(TP+FP) = 0.09) and 98% (NPV = TN/(TN+FN) = 0.98), respectively. Precision corresponds to the estimated probability that a project randomly selected from the indicated Red Flags is rejected. Negative Predictive Value corresponds to the probability that a project randomly selected from the projects not indicated as Red Flags is not rejected. However, both metrics depend on the prevalence, which in this case is low, and they are not intrinsic to the test, as recall and true negative rate are [50]. The overall accuracy (ACC) of the RedFlags platform is equal to 72% (ACC = (TP+TN)/(TP+TN+FP+FN) = 0.72).
Based on Table 7, which presents the joint probabilities for rejected and Red Flag projects, the conditional probabilities of Table 8 were calculated (see also Figure 8). The results show that recall (Sensitivity, or True Positive Rate, TPR), which is the percentage of rejected projects that had raised a Red Flag 12 months earlier, is 71% (TPR = P(r|1) = P(r,1)/P(1) = 0.71). Recall corresponds to the estimated probability that a project randomly selected from the rejected projects raised a Red Flag. Moreover, specificity (SPC) is equal to 72% (SPC = P(nr|0) = P(nr,0)/P(0) = 0.72) and reflects the RedFlags platform's ability to correctly refrain from raising Red Flags on projects that will not be rejected at the end of the programming period.
In other words, the auditor will not have to examine first 72% of the projects that will not be rejected, but will first check the remaining 28% of them, which raise a Red Flag without being rejected (False Positive Rate, or false alarm rate). This is satisfactory according to the user requirements. Marking a project as a Red Flag does not necessarily mean that it will be rejected after 12 months, whereas a project that has been rejected should have raised a Red Flag.
Furthermore, we calculated the positive likelihood ratio (LR+), the negative likelihood ratio (LR-) and the Diagnostic Odds Ratio (DOR). LR+ is defined as the ratio P(r|1)/P(r|0) = 0.71/0.28 = 2.54. The greater the value of LR+, the more likely it is that a Red Flag indication is a warning for a rejected project. In other words, rejected projects are more likely to raise Red Flags than projects that are not rejected, since the ratio is greater than 1. An LR+ < 1 would instead imply that projects that are not rejected are more likely than rejected projects to receive Red Flags.
LR- is defined as the ratio P(nr|1)/P(nr|0) = 0.29/0.72 = 0.40. An LR- below 1 means that a project that is not rejected is more likely not to raise a Red Flag than a rejected project. A value greater than 1 would imply that rejected projects are more likely not to raise a Red Flag than projects that are not rejected.
DOR, which is independent of prevalence, measures the effectiveness of the algorithm. DOR is defined as the ratio LR+/LR- = 2.54/0.40 = 6.35. The value of DOR is greater than one, meaning that the algorithm is discriminating correctly. Therefore, the user requirements of the RedFlags platform are satisfied. In other words, RedFlags is more likely to raise Red Flags on rejected projects and more likely not to raise a Red Flag on projects that are not rejected. It thereby assists the competent authorities in organizing the monitoring process efficiently, by not having to examine most of the non-problematic projects.
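The evaluation metrics of this section can be recomputed directly from the four counts of the contingency table. The sketch below is illustrative and the function name `diagnostic_metrics` is an assumption; the formulas are the standard definitions used in the text.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Compute the evaluation metrics of a binary test from its confusion counts."""
    total = tp + fn + tn + fp
    tpr = tp / (tp + fn)          # recall / sensitivity, P(r|1)
    spc = tn / (tn + fp)          # specificity, P(nr|0)
    lr_pos = tpr / (1 - spc)      # P(r|1) / P(r|0)
    lr_neg = (1 - tpr) / spc      # P(nr|1) / P(nr|0)
    return {
        "prevalence": (tp + fn) / total,
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / total,
        "recall": tpr,
        "specificity": spc,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": lr_pos / lr_neg,
    }


# Counts reported in the contingency table of the use case scenario.
metrics = diagnostic_metrics(tp=312, fn=126, tn=8024, fp=3096)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because this sketch keeps unrounded intermediate values, the likelihood ratios and the odds ratio come out slightly different from the figures above, which are computed from two-decimal inputs: LR+ is about 2.56 and DOR about 6.42. The conclusion is unchanged, since LR+ remains above 1, LR- below 1 and DOR above 1.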

Conclusions
We presented how open data can be used with semantic web technologies and data mining techniques to identify possible failures as "Red Flags" in National Strategic Reference Framework projects. The identification is implemented by the RedFlags application, constructed as an interactive knowledge-based system. We used data from the Open Data API provided by the Greek Ministry of Economy and Finance. The semantic description of these data involved the development of two ontologies, VFP and NSRF-GR. The NSRF data were transformed into RDF triples and uploaded to an OpenLink Virtuoso Server, while RDFBrowser undertook the process of content negotiation and HTML generation. Performance indicators were defined to track the progress of NSRF projects and provided the inputs to the clustering algorithm. The DBSCAN algorithm was used to identify Red Flags.
The RedFlags platform was based on two user requirements. The first requirement is that rejected projects should raise Red Flags, in order to avoid failure if possible. The second is that auditors should be assisted in organizing the monitoring process efficiently, without loss or waste of time, by not having to examine most of the non-problematic projects. In the use case scenario, an auditor has to examine the NSRF projects in Greece approximately 12 months before the end of the programming period. The system retrieved the fiscal information, calculated the indicators of the NSRF projects and used the DBSCAN algorithm to identify the Red Flags. The RedFlags platform marked 29.5% of the projects as Red Flags. A Red Flag indication means that a project probably has significant problems, due to updates of the budget or payment amounts or due to other factors, and has high priority to be examined in order to avoid rejection. However, the available ground truth is the rejection status at the end of the programming period, when the data were retrieved, and we evaluated the performance of the RedFlags platform on 438 rejected projects out of 11558 NSRF projects.
To test whether the user requirements are satisfied, we checked the last update of the projects, which was made after the end of the NSRF programming period, and extracted the number of rejected projects. The share of rejected projects corresponds to the prevalence, which is equal to 3.8% of the total projects. In terms of rejection, low prevalence corresponds to a successful NSRF programming period, as higher values of this metric mean that more projects failed to complete.
The estimated probability that a rejected project had raised a Red Flag was 71% (Recall), and the estimated probability that a project that was not rejected did not raise a Red Flag was 72% (Specificity). Moreover, the positive likelihood ratio showed that rejected projects are more likely than projects that were not rejected to receive Red Flags, whereas the negative likelihood ratio showed that projects that were not rejected are more likely not to raise a Red Flag than rejected projects. Finally, the diagnostic odds ratio, which is independent of prevalence, showed that the RedFlags platform is discriminating correctly. Therefore, the RedFlags platform assists the auditor in organizing the monitoring process and giving high priority to the projects that raised a Red Flag, as rejected projects have a higher probability of raising a Red Flag.
Currently, the resources in our data are described using W3C open standards and have HTTP IRIs, so humans can access them and get useful information, but they do not yet have links to other datasets. Our next step will therefore include creating links to IRIs of other published data in order to achieve 5 star Linked Open Data [51,52]. Specifically, we plan to create semantic links between the NSRF projects and documents that were uploaded to Diavgeia, the official repository where all the decisions of governmental and administrative acts are posted, to expand the Greek Linked Open Data (LOD) cloud [53][54][55]. This will give us access to relevant information about the projects in order to create additional performance indicators and increase the efficiency of the data mining algorithm. In addition, we will further improve the ontology by implementing an upper ontology such as BFO and by reusing terms from other ontologies. Moreover, we will look into adding constraints and validating our graphs by using technologies such as SHACL or ShEx [56]. Finally, even though the ontologies have their specification drafts, they need to be updated with more detailed documentation and SPARQL examples, so that consumers outside of the data portal can easily compose and execute SPARQL queries using the correct properties.
The findings of this study have been included in the results of the commitment about Linked, Open and Participatory Budgets of the Third Greek Action Plan on Open Government [57]. Public bodies could efficiently adopt the RedFlags knowledge-based system as an early warning indicator, in order to devise smarter strategies for preventing possible project failures. Citizens could monitor the progress of a project to find Red Flags, while data journalists could produce data stories about EU funds and relate them to the trends of the Greek economy.