1. Introduction
Institutional scientific information systems (SISs) dedicated to universities have a long history. With the advent of the open access movement and, in particular, since the launch of open source systems such as ePrints (2000) or DSpace (2002), the institutional repository (IR) has become the main solution for building such systems.
The practical motivations for setting up an IR are many, but in most cases, it is seen as a means of improving institutional visibility. In [1], the authors point out that the main motivations for building institutional repositories are as follows:
To provide a showcase for the academic output of an institution (e.g., facilitating increased visibility; generating indicators of academic quality);
To improve dissemination of research output;
To manage research (and research information);
To ensure the long-term preservation of resources;
To break down access barriers to content.
However, many authors point out that achieving these goals with repository systems is very problematic. In particular, in [2], it is argued that dissemination fails mainly because the current journal system is perceived as the most reliable path to a scientific career. This issue alone means that institutional repositories have a very low level of completeness, which makes it particularly difficult to achieve the aforementioned objectives. Another very problematic goal is point 3 above (i.e., the management of research), mainly because the tools that a typical IR offers for assessing publications are very simple and insufficient.
Problems with the acceptance of institutional repositories among researchers have also been reported in [3,4]. On the other hand, one can observe that most researchers, including those at the top of their fields, have no problem creating and maintaining their home pages with complete publication lists, research projects, and sometimes blogs. Therefore, by the end of the first decade of this century, a new approach emerged that focused on collecting complete bibliographies of researchers, not limited to open access. This approach is not limited to publications but instead focuses on the researcher in order to present diverse facets of their scientific activities. Probably the first attempt was made by Jie Tang with ArnetMiner [5]. The proposed approach was based on data mining algorithms applied to scientific social networks. A similar approach, extended by the idea of giving researchers the possibility to create and edit their profiles, was proposed with the launch of Google Scholar and, shortly thereafter, ResearchGate and Academia.edu.
For institutional SISs, a similar approach has been observed with the development of systems known as Current Research Information Systems (CRISs), also known as RIS or RIMS. A CRIS-type system is based on a data model consisting of entities such as researchers, institutions, research (projects), and research results (publications, patents, etc.). In general, CRIS systems aim to support research institutions in the management of scientific research in the broadest sense, in particular, by using aggregated research information and evaluation measures of research results to build conditions conducive to research progress. As the main beneficiary of institutional CRIS-type systems is the university management, such a system is still not directly dedicated to researchers.
In 2010, a dedicated project was launched at Warsaw University of Technology (WUT) in order to address deficiencies in the scientific information infrastructure at Polish universities. It ended in 2012 with a first version of the OMEGA-PSIR system, integrating basic functionalities of the IR, CRIS, and Research Profiling System (RPS). Since 2013, it has been used at the University as a Research Knowledge Database; see https://repo.pw.edu.pl/index.seam?lang=en (accessed on 26 August 2025).
In parallel with the development of the system, research related to SISs was conducted. The main results of the research have been presented in [6,7,8]. In [6,7], we presented how the AI methods elaborated for the system can be used to support data acquisition and enrich the semantics of the acquired data. In [8], we discussed how focusing the system on researchers and their activities may influence the acceptance of the system within the academic community, which is, in our opinion, one of the most important factors when assessing the system’s success.
In this paper, we summarize the research on the system, focusing on technical solutions. The need to integrate the functionalities of the IR, CRIS, and RPS resulted in the development of an advanced XML database system. The proposed technical solutions have had a significant impact on the system’s capabilities (including the modeling features), its ease of maintenance, and its openness to new solutions.
The remainder of this paper is structured as follows.
Section 2 summarizes related work concerning ideas for institutional information systems and related modeling issues.
Section 3 outlines the aims of the work, including the motivations behind the system design and key functional goals.
Section 4 presents the core modeling principles applied in OMEGA-PSIR, including object-oriented XML structures, hierarchical modeling, and versioning. We show how to model objects that evolve in time while preserving historical data in the other objects that refer to them.
Section 5 discusses the outcomes of this modeling approach, especially in terms of querying, analytics, and interoperability.
Section 6 provides a case study of how the platform functions as a comprehensive university knowledge base.
Section 7 presents evidence of system acceptance and widespread adoption across multiple institutions, highlighting the effectiveness and relevance of the proposed approach.
Section 8 summarizes our contributions.
2. Related Work
The history of repositories for scientific outputs dates back to the 1980s. In [9], the evolution of the approach to institutional SIS at the University of Houston is presented. The history of repositories and their categorization is presented in [10]. The discussion of IRs and their functionality began at the start of this century, alongside the launch of the Open Access idea and the appearance of modern repository systems, such as DSpace and ePrints. In [11], an institutional repository (IR) is described as essential infrastructure for scholarship in the digital age.
Critical comments related to IRs emerged soon after (e.g., [2,3,4]). In [12], it was noted that the initial initiative to develop institutional repositories came mainly from libraries rather than from direct lobbying by research staff or academics. Although Ref. [4] was published in 2008, the paper is still cited, and many authors discuss the problems of the faculty’s acceptance level of IR (see, for example, [13,14]).
Acknowledging the low acceptance of IR by faculty staff, some librarians propose supplementing IR contents with data from internet resources. In [15], the authors propose using the websites of faculty staff to build the lists of missing document descriptions and then verifying the resulting lists for duplicates. A more effective proposal was presented in [16], where global information resources, both open (such as arXiv or PubMed) and commercial (such as WoS or Scopus), are used to import OA documents and place them in the repository. Nevertheless, the content of the repository is still incomplete, and it does not encompass the entire outcome of the faculty research. Clearly, without specialized systems, these approaches require librarians to perform a lot of manual work, making the whole process costly.
The first discussions on integrating IR with CRIS can be observed around 2010. In [17,18], the discussion shows the need for integrating CRIS and IR. A summary of the discussion at an early stage is provided by [19]. The main problems concerning the integration lie in the differing goals of the two systems and in their different approaches to modeling: on the one side, the model is flat and as simple as possible (Dublin Core), and on the other, it is the very complex CERIF data model. Therefore, in the conclusions, the authors suggest the integration of separate systems rather than the integration of the functionalities of the two system types into one platform.
One of the first systems integrating functionalities of IR with CRIS was OMEGA-PSIR [20]. Our motivations for adopting this solution are presented in [8]. The idea of integrating IR functionality into CRIS spread very rapidly: an exhaustive report by Bryan et al. from 2018 [21] indicated that 54% of CRIS installations (usually home-made systems) integrate with IR functionality.
From the beginning of the WUT project, special attention was given to issues related to open science requirements and acceptance of the system by scientists while also attempting to consider science management requirements (i.e., CRIS functionality). This approach fundamentally influenced the choice of technology. Specifically, we opted for (1) XML databases, which are better suited for modeling complex data and (2) full-text information retrieval, which is especially important for IR requirements.
Advantages of using XML for modeling complex data structures are presented in [22,23]. We considered using BaseX (v7.0 and later) [24] or eXist (v2.0.0 and later) [25]. However, both database systems are schemaless; therefore, given the CRIS requirement of high-quality data, we decided to build a dedicated system with an XSD-based schema.
A presentation of the CERIF-related modeling issues is provided in [26,27,28]. The XML-based CERIF format is proposed in [27], yet its function is limited to data exchange, rather than serving as a CRIS data model. In [28], the relational data model is discussed in detail; in addition, a treatment of the temporal aspects of the model is proposed. Problems of using CERIF in national systems are reported in [29], mainly due to specific details of the standard that make CERIF too difficult to use. In [30], the authors discuss how they adapt CERIF to their needs.
4. The OMEGA-PSIR Data Model
4.1. Core Principle: Object-Oriented Modeling Through XML-Based Structures
In the OMEGA-PSIR system, the approach to data modeling for research information management and processing is based on object modeling principles so that real-world entities are categorized as separate object types [22]. Each object type is defined by its own attributes and behaviors; objects interact with other objects through typed associations, ensuring modularity, reusability, and clear semantics. The objects are uniquely identified by object identifiers. These identifiers are constructed to be globally unique, rather than merely unique within a collection type or a specific OMEGA-PSIR instance; namely, they are generated as UUIDs (universally unique identifiers). They are also persistent, ensuring long-term stability and enabling unambiguous referencing, export, and cross-system integration of records.
As mentioned, the model is expressed in XML, and the W3C XML Schema Definition (XSD) is used to define object types. Among the types, we distinguish main object types. The main object types correspond to CERIF’s base entities (see [28]); they represent physical entities in the modeled world, and their instances appear in the database independently of any other objects, whether or not they are connected with instances of other main object types. The instances of a main object type form a type collection. We also distinguish auxiliary object types, whose instances can only be present in relation to an instance of a main object type, i.e., they do not form collections.
In addition, there are special object types—Term and Termtype. They provide a dynamic alternative to hardcoded enumerations, allowing system administrators to define and manage vocabularies directly within the application interface for admin users.
The structure of the model is hierarchical. Each object type is defined by its attributes:
Non-repeatable and repeatable attributes;
Simple attributes and the composed ones, containing complex substructures to be nested within the object.
Typed relationships between object types are thus defined by means of composed attributes. The typed relationships can be defined between two main object types or between a main object type and an auxiliary object type. In the first case, the relationship instance is implemented by nesting a local version of an object from a given type collection. A relationship between a main object type and an auxiliary type is also implemented by nesting an object; the difference is that the nested object is not a local version of any object from a collection.
These basic constructs make it possible to model quite advanced cases. In the following, we illustrate this with an example from the institutional CRIS domain, where the research output of university staff is to be collected.
Consider the relationship between a publication and its authors in an institutional CRIS. In the data model, both object types are main types. The relationship between the two objects is obvious: they may be linked by a typed relationship authorship (many-to-many), which may additionally be categorized by role as firstauthor, lastauthor, correspondingauthor, etc. Importantly, an object of the researcher type evolves over time:
An authoring person may change their surname (e.g., due to marriage or legal change);
Their institutional affiliation or position may evolve over time;
They may obtain a scientific degree;
Certain external identifiers (e.g., ORCID and Scopus ID) may be added or revised retrospectively.
Therefore, in the publication object instance, we would like to preserve the important data about the authoring person as they were valid at the time the given publication was published. Hence, in our example, the definition of the type Article, inherited from the Publication type, contains a repeatable complex attribute with Person as the defined attribute domain. As a result, in an instance of Article, corresponding versions of instances of Person are embedded. Each nested version stores the values valid for the corresponding author at the time of publication of the article, e.g., name, affiliation, scientific title, or role. The referred object in the collection may change in time, but the values stored within embedded versions are “frozen”. Note that the substructure person embedded in the article object contains an identifier (in the attribute objectID) that links to the main instance of the researcher object in the Person collection. As a result, many new and interesting features are available. For instance, querying for all publications by a given researcher using their object identifier will also return publications that the author published under a different surname. The potential of the proposed modeling will be discussed in more detail in Section 5.
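The embedding of frozen local versions can be illustrated with a minimal sketch. It is not the OMEGA-PSIR implementation: the in-memory dictionaries, the helper functions, and the sample names are assumptions introduced only for illustration; only the attribute objectID and the use of UUIDs come from the description above.

```python
import copy
import uuid

# Hypothetical in-memory stand-ins for the Person and Article collections.
persons = {}
articles = {}

def new_person(name, affiliation):
    """Create a main Person object with a globally unique identifier."""
    object_id = str(uuid.uuid4())
    persons[object_id] = {"objectID": object_id, "name": name,
                          "affiliation": affiliation}
    return object_id

def new_article(title, year, author_ids):
    """Create an Article that embeds a local version of each author."""
    object_id = str(uuid.uuid4())
    articles[object_id] = {
        "objectID": object_id,
        "title": title,
        "year": year,
        # deepcopy "freezes" the values valid at the time of publication
        "authors": [copy.deepcopy(persons[pid]) for pid in author_ids],
    }
    return object_id

def articles_by_person(person_id):
    """Find articles through the objectID kept in each embedded version."""
    return [a for a in articles.values()
            if any(p["objectID"] == person_id for p in a["authors"])]

# Made-up sample data: the surname changes between the two publications.
pid = new_person("Anna Kowalska", "Institute of Computer Science")
new_article("Early paper", 2015, [pid])
persons[pid]["name"] = "Anna Nowak"
new_article("Later paper", 2023, [pid])

found = articles_by_person(pid)
names = sorted(a["authors"][0]["name"] for a in found)
print(names)
```

Both articles are found by the same identifier, while each embedded version keeps the surname that was valid when the article was created.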
The idea of creating a new version of an object and embedding it within the hosting object is of special value for modeling time-related values as it is especially important for CRIS-type systems. In the presented approach, we do not distinguish relationship objects—the relationship is modeled by a composed attribute, where a local version of an object instance is stored.
Notably, in CERIF, the temporal aspects are handled by a special relationship entity type, where each relationship has the attributes StartDate and EndDate [28]. Such a relationship entity, typed as authorship, links the publication entity with the person entity, so there is no room for evolving attributes of the person in the relationship entity. Instead, such historical data have to be stored separately in other tables linked to the person entity. This means that, for example, searching for all publications (co-)authored by PhD students in the last 10 years would require many selections and joins in order to find the relevant records.
By allowing a local snapshot of the researcher with data valid at the time of publication to be embedded in publication records, the system maintains historical fidelity and temporal accuracy. This is particularly important for statistical analyses concerning bibliometry, funding audits, scientific careers, or inter-institutional cooperation as it provides the correct background for time-based statistical analysis (see Section 5.3). Generalizing, the same individual may be represented differently in different contexts (roles) whenever the historical information about the researcher must remain intact, e.g., as a project leader in the project record, as a conference chair, or as a seminar speaker.
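To make the contrast with the join-based approach concrete, the following sketch answers the question about publications (co-)authored by PhD students by reading only the nested person versions. The element names article, author, person, and position, the attribute year, and the sample data are assumptions made for illustration and do not reproduce the actual OMEGA-PSIR schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical article collection with embedded person versions.
ARTICLES_XML = """
<articles>
  <article objectID="ID-a-1" year="2016">
    <author role="firstauthor">
      <person objectID="ID-p-1"><position>PhD student</position></person>
    </author>
  </article>
  <article objectID="ID-a-2" year="2022">
    <author role="firstauthor">
      <person objectID="ID-p-1"><position>assistant professor</position></person>
    </author>
    <author role="lastauthor">
      <person objectID="ID-p-2"><position>PhD student</position></person>
    </author>
  </article>
  <article objectID="ID-a-3" year="2009">
    <author role="firstauthor">
      <person objectID="ID-p-3"><position>PhD student</position></person>
    </author>
  </article>
</articles>
"""

def articles_by_author_position(root, position, since_year):
    """Return IDs of articles with an author who held `position` when publishing."""
    hits = []
    for article in root.iter("article"):
        if int(article.get("year")) < since_year:
            continue
        for person in article.iterfind("author/person"):
            # The position is read from the frozen nested version,
            # not from the current entry in the persons collection.
            if person.findtext("position") == position:
                hits.append(article.get("objectID"))
                break
    return hits

root = ET.fromstring(ARTICLES_XML.strip())
recent = articles_by_author_position(root, "PhD student", since_year=2015)
print(recent)
```

No lookup in a separate history structure is needed, because each article carries the position that its authors held at the time of publication.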
A simplified example of a class diagram for the article object type is shown in Figure 1. As one can see, the article object class contains a structure for storing nested author structures, and within this structure, there is a substructure for nesting the author affiliation (a unit). An XML structure for the object instances of article and persons with their affiliations is schematically shown in Figure 2. It presents object instances for four collections: articles, persons, units, and journalseries. Each collection contains all instances of the given object type. The article object instance with ID ID-a-1 includes two nested person local object versions representing the two authors of the article, each containing a nested unit version in a composed affiliation attribute. Like the main unit instances in the units collection, these nested unit versions are hierarchical: each includes its parent unit version, recursively up to the top-level unit. Arrows represent symbolic links based on object identifiers, showing how nested objects relate to their corresponding entries in the collections. For clarity, only a few such arrows are shown, but, in principle, all objects sharing the same identifier are logically linked across the collections.
While the nested versions generally mirror their collection-level counterparts, they may differ—for example, ID-p-2 in ID-a-1 has a different affiliation than the ID-p-2 entry in the persons collection. This reflects the fact that the current affiliation of the person is ID-u-3, but in the context of the publication, the author was affiliated with ID-u-4. These symbolic links and hierarchical embeddings allow the system to preserve historical accuracy and support traceable, context-specific representations of object relationships without sacrificing model consistency or search abilities.
Although this approach introduces denormalization and leads to some redundancy, it ensures that semantic integrity of each local version is preserved independently, making it resistant to changes in the main objects (in collections) and thus enabling the tracking of changes related to the object’s history. Additional side-effects, quite important from the point of view of querying possibilities, are that
No extra structures are needed to model the history;
The hosting objects are searchable by the attributes of the nested versions;
No extra joins are needed at runtime, either when an index is generated or when relationships between the objects are presented.
Obviously, the CRIS-related data in the presented model can be easily translated to CERIF. OMEGA-PSIR participated in a euroCRIS project (2017–2018) and was one of the first systems to implement data transfer to OpenAIRE in the CERIF format (see [35]). Also, it is worth noting that the hierarchical structure of data in our model does not preclude the possibility of translating bibliographic data into flat formats, such as Dublin Core, BibTeX, MODS, or RIS, so that bibliographic records can be easily exported to bibliography managers (e.g., to Mendeley, Zotero, or EndNote). Moreover, with dedicated web browser plugins, one can easily import the data into the corresponding managers.
4.2. Extending the OMEGA-PSIR Schema to Support Functional Requirements
In the course of adapting OMEGA-PSIR to the operational realities of our institution, we identified a number of situations where the standard modeling through XML schema was insufficient. Although the base XML schema provides a solid foundation for representing objects and their relationships, it did not fully accommodate the nuances of our workflows, editorial practices, or reporting needs. To bridge this gap, we introduced a series of custom schema-level specifications. These were not intended to change the structural validation of XML documents but rather to guide system behavior in areas such as user interface logic, access control, data consistency, and validation-related workflows.
One key area where extensions were needed was in distinguishing between main and auxiliary objects. While instances of main object types, such as publications, researchers, or projects, are managed independently and can be reused across the database, auxiliary objects, such as an author’s contact information or a local affiliation note, appear only within a specific context. This clear distinction allows the system to treat them differently when storing, presenting in the interface, and indexing them. For example, some elements can be excluded from full-text search or displayed only as elements embedded in forms.
Another functional requirement involves how data are shared and reused throughout the system. In some cases, it is appropriate to include full nested structures—for instance, when the embedded content is tightly bound to its parent and the temporal context is relevant, such as the author’s affiliation at the time of a publication. In other cases, especially when the referenced object is large, independently updated, or shared across multiple records, and when temporal fidelity is not applicable, simple linking is preferred. For instance, when describing a book chapter, it is sufficient to reference the book itself without nesting its full structure since the book’s metadata does not evolve in relation to the chapter. To support these use cases, we introduced a mechanism that allows the system to treat certain fields as links rather than embedded versions. This approach ensures system-wide consistency when referenced objects are updated, prevents unnecessary duplication, and improves clarity and maintainability. It also supports lightweight associations between records, such as referencing a patent from an achievement, without embedding the entire metadata structure. This selective flattening of complex object graphs significantly improves clarity, performance, and semantic precision in large-scale deployments.
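The difference between embedding and linking can be sketched as follows. The per-field configuration FIELD_MODE, the two helper functions, and the sample records are hypothetical; the sketch only illustrates the behavior described above and does not show how OMEGA-PSIR declares it in the schema.

```python
import copy

# Hypothetical collections of main objects.
books = {"ID-b-1": {"objectID": "ID-b-1", "title": "Handbook",
                    "publisher": "Old Press"}}
persons = {"ID-p-1": {"objectID": "ID-p-1", "name": "A. Author",
                      "affiliation": "ID-u-4"}}

# Hypothetical per-field configuration: embed a frozen version or store a link.
FIELD_MODE = {"author": "embed", "book": "link"}

def store_reference(field, target):
    """Return the value stored in the hosting object for the given field."""
    if FIELD_MODE[field] == "embed":
        return copy.deepcopy(target)          # frozen local version
    return {"objectID": target["objectID"]}   # lightweight link only

def resolve(field, stored, collection):
    """Return the data presented to the user for the given field."""
    if FIELD_MODE[field] == "link":
        return collection[stored["objectID"]]  # always the current object
    return stored                              # the frozen version

chapter = {
    "title": "Chapter 3",
    "author": store_reference("author", persons["ID-p-1"]),
    "book": store_reference("book", books["ID-b-1"]),
}

# Later updates of the main objects.
books["ID-b-1"]["publisher"] = "New Press"
persons["ID-p-1"]["affiliation"] = "ID-u-3"

print(resolve("book", chapter["book"], books)["publisher"])
print(resolve("author", chapter["author"], persons)["affiliation"])
```

The linked book reflects the later update, whereas the embedded author keeps the affiliation that was valid when the chapter record was created.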
We also found it necessary to determine how deeply certain metadata structures should be nested as a complex attribute value. For instance, detailed author profiles should be shown in full when viewing a standalone person record but may be limited to a specific substructure when the same researcher is embedded within a publication record as an author. By controlling the allowed depth of such structures, we are able to reduce redundancy and ensure consistency across different relationships with given record types.
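One possible reading of such depth control is sketched below. The function prune and its max_depth parameter are assumptions introduced for illustration; below the allowed depth, only the object identifier is kept so that the link to the collection survives.

```python
def prune(value, max_depth):
    """Copy a nested structure, cutting complex values below max_depth."""
    if isinstance(value, dict):
        if max_depth == 0:
            # Keep only the identifier of the nested object, if it has one.
            return {"objectID": value["objectID"]} if "objectID" in value else {}
        return {key: prune(item, max_depth - 1) for key, item in value.items()}
    if isinstance(value, list):
        return [prune(item, max_depth) for item in value]
    return value  # simple attribute values are copied as they are

# Made-up person record with a nested affiliation hierarchy.
person = {
    "objectID": "ID-p-2",
    "name": "B. Author",
    "affiliation": {
        "objectID": "ID-u-4",
        "name": "Institute",
        "parent": {"objectID": "ID-u-1", "name": "University"},
    },
}

shallow = prune(person, 1)
deeper = prune(person, 2)
print(shallow)
print(deeper)
```

With a depth of 1 the affiliation is reduced to its identifier; with a depth of 2 the affiliation keeps its own attributes, and only its parent is reduced.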
In some contexts, data are imported or referenced from authoritative sources and must remain consistent with those origins. To preserve this integrity, certain fields were made read-only in specific contexts—visible to users but not editable. This prevents accidental overwriting of curated or centrally maintained content and ensures that updates are synchronized from a single source of truth.
Finally, to ensure data quality, we incorporated validation rules into the schema itself, which are invoked before saving an object. These include requirements for field completeness, uniqueness of key attributes (such as email addresses), and conformance to specific structures (e.g., ORCID IDs, ISBNs, or DOIs). Embedding these rules at the schema level helps streamline both user input and automated data imports while ensuring reliable output for reporting or integration.
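The three kinds of rules can be illustrated with a small sketch executed before saving. The function names, the field names, and the simplified DOI pattern are assumptions; the ORCID check uses the ISO 7064 MOD 11-2 check digit that ORCID documents for its identifiers.

```python
import re

# Simplified patterns assumed for illustration.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
ORCID_RE = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")

def orcid_is_valid(orcid):
    """Check the format and the ISO 7064 MOD 11-2 check digit of an ORCID iD."""
    if not ORCID_RE.match(orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    return digits[-1] == ("X" if check == 10 else str(check))

def doi_is_valid(doi):
    """Check that a DOI has the form 10.<registrant>/<suffix>."""
    return bool(DOI_RE.match(doi))

def validate_person(person, known_emails):
    """Return a list of rule violations; an empty list means the object may be saved."""
    errors = []
    if not person.get("name"):
        errors.append("name is required")            # completeness
    email = person.get("email")
    if email and email.lower() in known_emails:
        errors.append("email must be unique")        # uniqueness
    orcid = person.get("orcid")
    if orcid and not orcid_is_valid(orcid):
        errors.append("ORCID is malformed")          # structure
    return errors

known = {"a.author@example.org"}
ok = validate_person({"name": "B. Author", "email": "b.author@example.org",
                      "orcid": "0000-0002-1825-0097"}, known)
bad = validate_person({"name": "", "email": "A.Author@example.org",
                       "orcid": "0000-0002-1825-0098"}, known)
print(ok, bad)
```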
Overall, these schema extensions enable us to adapt OMEGA-PSIR’s data model to more accurately represent real-world usage scenarios. Instead of relying on ad hoc application logic, we encoded functional requirements directly into the schema in a portable, declarative, and maintainable way, which enables consistent data management across the system.
4.3. Dynamic Dictionaries
As mentioned above, the OMEGA-PSIR modeling tool includes a specialized mechanism for defining controlled vocabularies and vocabulary-based attributes without “hard coding” by using two basic built-in object types: Term and Termtype. The Termtype class defines a dictionary category at the schema level (e.g., “publicationCategory” in the Article object class), while Term represents individual entries (e.g., “original work,” “extended abstract,” “errata,” or “editorial”) within that type.
Attributes controlled by the dictionary in the main object class can be used to categorize a class into subclasses. An attribute defined as Term links a field to a controlled dictionary defined by the associated Termtype. For example, the attribute publicationCategory in the Article object class will link to the appropriate dictionary defined by the associated Termtype.
One of the most important applications of dynamic dictionaries is the user-controlled dictionary of external identifiers. This category plays an important role in scientific databases because many object instances are imported, either wholly or partially, from various external sources. One can define many different types of external identifiers so that an object instance can have many external identifiers, “typed” by the controlled dictionary. This approach significantly improves the semantic integrity of data. For example, the class person may have an attribute externalIdentifiers, defined as a list of TermField objects. Each TermField object is a pair (Term, String). The value of Term specifies the type of identifier (for instance, “ORCID”, “ISNI”, “Scopus ID”, or “Web of Science ID”), while the value of String holds the actual identifier. Obviously, for each object class, one can have a dynamic class-specific dictionary of external identifiers. This approach enables the system to manage various identifier types flexibly and ensures consistent classification across all objects.
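The pair (Term, String) can be sketched with a few small classes. Term, Termtype, and TermField are the names used above; the Dictionaries registry, its methods, the Termtype name personExternalIdentifier, and the sample identifier value are assumptions introduced for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Termtype:
    name: str            # dictionary category

@dataclass(frozen=True)
class Term:
    termtype: Termtype
    label: str           # individual dictionary entry

@dataclass(frozen=True)
class TermField:
    term: Term           # type of the identifier
    value: str           # the identifier itself

class Dictionaries:
    """Hypothetical runtime-editable registry of terms."""
    def __init__(self):
        self._terms = {}

    def add_term(self, termtype, label):
        term = Term(termtype, label)
        self._terms[(termtype.name, label)] = term
        return term

    def term(self, termtype, label):
        # Raises KeyError for labels that are not in the dictionary,
        # so typos and case variants cannot enter the data.
        return self._terms[(termtype.name, label)]

PERSON_IDS = Termtype("personExternalIdentifier")
dictionaries = Dictionaries()
for label in ("ORCID", "ISNI", "Scopus ID", "Web of Science ID"):
    dictionaries.add_term(PERSON_IDS, label)

external_identifiers = [
    TermField(dictionaries.term(PERSON_IDS, "ORCID"), "0000-0002-1825-0097"),
]

def identifiers_of_type(fields, label):
    """Select identifier values by the label of their controlling term."""
    return [f.value for f in fields if f.term.label == label]

print(identifiers_of_type(external_identifiers, "ORCID"))
```

A new identifier type is added by calling add_term at runtime, without touching the class definitions.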
The solution with dynamic dictionaries provides a balance between structure and flexibility. On the one hand, controlled vocabularies support semantic consistency and validation, avoiding issues with typos, upper/lower case, or inconsistent category usage; on the other hand, they can be maintained by the system administrator instead of being hardcoded in the XSD schema (e.g., via <xs:enumeration>).
The term model supports multilingualism and user interface integration. The contents of dictionaries are fully manageable via the user interface, with the options including sorting order, descriptions, and deprecation flags. Authorized users can manage term sets directly through the system interface, ensuring that dictionaries remain accurate, current, and aligned with institutional semantics. Because attribute values are linked to predefined terms by reference rather than stored as free-form text, the system provides validation and semantic consistency across records.
This approach offers several significant advantages. Most importantly, it provides runtime configurability, enabling the addition, removal, or update of classification values without the need for changes to the underlying XML schema or system redeployment. Thus, it enables agile adaptation to evolving institutional policies or external standards.
The dynamic dictionaries approach reflects OMEGA-PSIR’s model-centric philosophy: maintaining schema-aligned structure while preserving the flexibility to support institution-specific or evolving classification needs without sacrificing validation, searchability, or semantic integrity.
4.4. Hierarchical Structures
Hierarchical structures have a long history in knowledge modeling (see, e.g., [36,37]). As a matter of fact, they were one of the important motivations for research on object-oriented databases (OODBs). Clearly, hierarchical structures also play an important role in scientific information systems. For example, university units form a hierarchical structure: a research team belongs to an institute, which is part of a department. There are also many classification schemes that have to be modeled in the form of a hierarchy or even a poly-hierarchy (lattice).
For this reason, it is possible to model hierarchies or poly-hierarchical structures for any XSD-defined object class. To model tree structures, an object-type model can contain a predefined parent attribute, which serves as a special mechanism to express parent–child relationships. If a lattice structure must be modeled, the parent attribute must be repeatable. The parent attribute may be complex, storing a nested substructure of the parent object. For instance, a research team could be linked as a child to an institute, which could be linked to a department.
Let us consider the tree structure of the university units. In the data model, the object type unit is a main type. The relationship parent between units is used to model the university structure tree. As unit is a main type, there is a collection of unit instances. Each unit contains the type-specific attributes, say, namePL, nameEN, otherforms, and description. In addition, each unit instance (except for the top unit) contains the attribute parent, where a nested instance version of the parent unit is stored. So, for example, the unit instance describing Institute of Computer Science stores in the attribute parent a nested version of the unit instance describing Faculty of Electronics and Information Technology, which, in turn, in the attribute parent stores a nested version of the unit instance describing Warsaw University of Technology. Symbolically, it can be presented as follows:
<object type="unit">
  <namePL>Instytut Informatyki</namePL>
  <nameEN>Institute of Computer Science</nameEN>
  <!-- ... -->
  <parent>
    <object type="unit">
      <namePL>Wydział Elektroniki</namePL>
      <nameEN>Faculty of Electronics</nameEN>
      <address>Warszawa, Nowowiejska 18/19</address>
      <!-- ... -->
      <parent>
        <object type="unit">
          <namePL>Politechnika Warszawska</namePL>
          <nameEN>Warsaw University of Technology</nameEN>
          <address>Warszawa, Pl. Politechniki 1</address>
        </object>
      </parent>
    </object>
  </parent>
  <!-- ... -->
</object>
As in the case of OODB, the child objects that are related to a parent do not have to repeat some attribute values that are defined in the parent object. So, if the address of the Institute of Computer Science is the same as that of the Electronics Faculty, it is not provided at the Institute unit level; instead, it is inherited from the parent object.
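The inheritance of attribute values from the nested parent chain can be sketched as follows. The function inherited is a hypothetical helper, and the XML is a well-formed rendering that assumes each parent is nested inside its child, as the text describes; the names and addresses are those of the symbolic example.

```python
import xml.etree.ElementTree as ET

UNIT_XML = """
<object type="unit">
  <nameEN>Institute of Computer Science</nameEN>
  <parent>
    <object type="unit">
      <nameEN>Faculty of Electronics</nameEN>
      <address>Warszawa, Nowowiejska 18/19</address>
      <parent>
        <object type="unit">
          <nameEN>Warsaw University of Technology</nameEN>
          <address>Warszawa, Pl. Politechniki 1</address>
        </object>
      </parent>
    </object>
  </parent>
</object>
"""

def inherited(unit, attribute):
    """Return the attribute of the unit or of its nearest ancestor that defines it."""
    while unit is not None:
        value = unit.findtext(attribute)   # looks at direct children only
        if value is not None:
            return value.strip()
        unit = unit.find("parent/object")  # step up to the nested parent version
    return None

institute = ET.fromstring(UNIT_XML.strip())
print(inherited(institute, "address"))
print(inherited(institute, "nameEN"))
```

The institute has no address of its own, so the value is taken from the nested faculty version; its name is defined locally and therefore not inherited.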
Hierarchical structures created with nesting can be used efficiently in an end-user interface for browsing and searching. As one can see, all the parent attributes are present in the subordinate unit so that the typed index for the unit can be defined. This allows one to search the unit by the attributes of the higher-level units. Specifically, when querying the parent unit ID in an index built for a typed path, all subordinate subunits can be retrieved. Therefore, search operations can automatically retrieve objects belonging to all descendants of a given level without expanding queries.
Hierarchical search applies not only to the collection of objects for which a hierarchy is defined but also to other object types that embed such objects. If an object of type A nests an object of type B together with all of B’s ancestors, then objects of type A can be found through any of those ancestors. For instance, if the object type project has the attribute responsibleUnit containing a unit object, then in a project instance, the responsibleUnit attribute holds a unit instance nested with all its higher-level units. Therefore, a query for all projects associated with a given unit will return not only the projects linked directly to that unit but also those linked to any of its subunits, down to the lowest-level unit.
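A minimal sketch of such an index is given below. The dictionary-based project records, the helper unit_chain, and the identifiers are assumptions for illustration; the actual index structures of OMEGA-PSIR are not described here.

```python
from collections import defaultdict

def unit_chain(unit):
    """Yield the objectID of a nested unit version and of all its ancestors."""
    while unit is not None:
        yield unit["objectID"]
        unit = unit.get("parent")

# Hypothetical projects; each responsibleUnit nests its higher-level units.
projects = [
    {"objectID": "ID-pr-1",
     "responsibleUnit": {"objectID": "ID-u-3",
                         "parent": {"objectID": "ID-u-2",
                                    "parent": {"objectID": "ID-u-1"}}}},
    {"objectID": "ID-pr-2",
     "responsibleUnit": {"objectID": "ID-u-2",
                         "parent": {"objectID": "ID-u-1"}}},
    {"objectID": "ID-pr-3",
     "responsibleUnit": {"objectID": "ID-u-5",
                         "parent": {"objectID": "ID-u-1"}}},
]

# Index built once for the path responsibleUnit/.../objectID.
index = defaultdict(set)
for project in projects:
    for unit_id in unit_chain(project["responsibleUnit"]):
        index[unit_id].add(project["objectID"])

print(sorted(index["ID-u-2"]))
print(sorted(index["ID-u-1"]))
```

A lookup for ID-u-2 returns the project assigned to it directly as well as the project of its subunit ID-u-3, without expanding the query to the list of subunits.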
It is worth mentioning that hierarchical structures can be used for designing a more intuitive GUI. In particular, a hierarchy can be presented in the form of hierarchical facets for drill-down querying.
Figure 3 illustrates how facets can be presented for filtering projects by project type, which is a tree structure. Thanks to the explicit nesting of hierarchically higher nodes in the subordinate node, when querying for projects of a given type, the system automatically includes in the search the projects classified not only directly under the specified type but also under any of its descendant subtypes.
In addition to the parent attribute type, one can also use the attribute type related, which expresses a lateral relationship between objects. The system automatically recognizes types containing parent and related attributes and treats them according to standardized navigation and search rules. The user interface can dynamically build and render tree-based navigation components, facilitating intuitive exploration of various hierarchies, such as organizational, partOf, or thematic structures.
The implemented modeling flexibility allows OMEGA-PSIR to represent complex knowledge structures. One of the advanced structures we have modeled in OMEGA-PSIR is a poly-hierarchical thesaurus known as Medical Subject Headings (MeSH), with the structure fully compatible with the original thesaurus (
https://www.nlm.nih.gov/mesh/meshhome.html (accessed on 26 August 2025)).
The hierarchical and semantic linking approach in OMEGA-PSIR is not limited to typed objects. In the case of dynamic dictionaries, in addition to simple flat vocabularies, OMEGA-PSIR also supports the modeling of hierarchical dictionaries through the internal structure of the Term and Termtype entities. Both classes, Term and Termtype, may optionally include the attributes parent and related, allowing terms to be organized into structured taxonomies or semantic networks. The parent attribute defines a (poly-)hierarchical relationship, enabling the construction of tree-like vocabularies where broader terms group more specific child terms. The related attribute allows for the creation of lateral semantic links between terms that are associated conceptually but do not follow strict parent–child logic.
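A minimal sketch of such a dictionary entry, assuming a simplified in-memory `Term` class (the class layout, the `broader_labels` helper, and the sample vocabulary are invented for illustration and do not reproduce the OMEGA-PSIR schema or MeSH):

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """Hypothetical dictionary term with optional parent and related links."""
    label: str
    parents: list = field(default_factory=list)   # broader terms; more than one allowed
    related: list = field(default_factory=list)   # lateral, non-hierarchical links


def broader_labels(term):
    """Collect the labels of all broader terms reachable through parent links."""
    seen = set()
    stack = list(term.parents)
    while stack:
        current = stack.pop()
        if current.label not in seen:
            seen.add(current.label)
            stack.extend(current.parents)
    return seen


science = Term("Science")
computer_science = Term("Computer science", parents=[science])
statistics = Term("Statistics", parents=[science])
data_mining = Term("Data mining", parents=[computer_science])
# Poly-hierarchy: two parents, plus one lateral link that is not a parent.
machine_learning = Term("Machine learning",
                        parents=[computer_science, statistics],
                        related=[data_mining])
```

The `related` link is deliberately ignored by `broader_labels`, mirroring the distinction between hierarchical and lateral relationships.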
Such a generalized handling of hierarchical types for both typed objects and dynamic dictionaries ensures a consistent user experience and simplifies the development of both search and reporting tools, as well as providing natural navigation of complex structures in the GUI. This is explained in more detail below.
Hierarchical querying is computationally very efficient: because the hierarchy is nested in the records, once the composed structures are indexed, a simple single-term query, without any preprocessing, returns all relevant hierarchically classified objects. In contrast, implementing similar functionality over flat structures or in a traditional relational database would typically require complex, recursive, or multi-join queries, increasing both the implementation complexity and the runtime cost.
The built-in attribute types parent and related natively support thesaurus-like structures in the OMEGA-PSIR modeling approach, which enhances both the expressiveness and the performance of metadata-driven search. In addition, for specific applications, it is very simple to implement a thesaurus-supported search or to enhance the system ontology with external classification standards. The knowledge-enhancing structures may be defined locally and aligned with external classification standards.
4.5. Historical Tracking and Record Versioning
OMEGA-PSIR implements a built-in mechanism for historical tracking and versioning of all core records. Each time a record is saved—whether through user interaction or automated processing—the system captures the current state and stores it as a historical version in a dedicated archive. This snapshot includes metadata such as the modification timestamp, the user responsible for the change, and a structured representation of the differences between the current and previous version.
The archived versions are stored in a separate database collection and can be retrieved without affecting the active data store. Each historical entry functions as a complete and independently queryable snapshot, ensuring that prior states are preserved even after subsequent updates.
This versioning mechanism plays a crucial role in the overall integrity and auditability of the system. It allows administrators to track changes over time, understand who made specific modifications, and when they occurred. In the event of incorrect or undesired edits, previously stored versions can be restored, thereby serving as a safeguard against data loss or human error.
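The save-time snapshotting and restoration described above can be sketched as follows. This is a simplified in-memory illustration, not the actual implementation: the `VersionedStore` class, its field names, and the flat key-level diff are assumptions.

```python
import copy
from datetime import datetime, timezone


class VersionedStore:
    """Hypothetical store keeping active records and a separate archive."""

    def __init__(self):
        self.active = {}    # record ID -> current state
        self.archive = []   # historical snapshots, kept apart from active data

    def save(self, record_id, new_state, user):
        previous = self.active.get(record_id, {})
        # Structured difference: attribute -> (old value, new value).
        diff = {key: (previous.get(key), new_state.get(key))
                for key in set(previous) | set(new_state)
                if previous.get(key) != new_state.get(key)}
        self.archive.append({
            "record_id": record_id,
            "snapshot": copy.deepcopy(new_state),
            "modified_at": datetime.now(timezone.utc),
            "modified_by": user,
            "diff": diff,
        })
        self.active[record_id] = copy.deepcopy(new_state)

    def history(self, record_id):
        return [v for v in self.archive if v["record_id"] == record_id]

    def restore(self, record_id, version_index, user):
        """Re-save an archived snapshot, so the restoration is itself audited."""
        snapshot = self.history(record_id)[version_index]["snapshot"]
        self.save(record_id, snapshot, user)


store = VersionedStore()
store.save("a1", {"name": "J. Kowalska", "position": "PhD student"}, "admin")
store.save("a1", {"name": "J. Kowalska", "position": "Assistant professor"}, "editor")
store.restore("a1", 0, "admin")
```

Restoring through `save` is a design choice of this sketch: it keeps the archive append-only, so no historical entry is ever overwritten.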
Importantly, historical tracking also supports scenarios in which data must reflect its original context. For instance, when entering a legacy publication into the system, it is possible to associate it with a historical version of an author record—one that captures the author’s name, position, and affiliation as they were at the time of publication, not as they exist currently. This ensures temporal consistency and semantic accuracy across records.
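As a hedged illustration of this scenario, the sketch below selects the author version that was valid on the publication date and embeds a copy of it in the publication. The `author_history` list, its `(valid_from, state)` layout, and the `version_as_of` helper are hypothetical; the names and dates are invented.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history of one author record, sorted by validity start date.
author_history = [
    (date(2012, 10, 1), {"name": "A. Nowak", "position": "PhD student"}),
    (date(2017, 3, 1), {"name": "A. Nowak", "position": "Assistant professor"}),
    (date(2023, 1, 1), {"name": "A. Nowak-Lis", "position": "Associate professor"}),
]


def version_as_of(history, when):
    """Return a copy of the latest version valid on the given date, or None."""
    starts = [valid_from for valid_from, _ in history]
    pos = bisect_right(starts, when)
    if pos == 0:
        return None  # the record did not exist yet
    return dict(history[pos - 1][1])


# A legacy publication gets the author data "frozen" as of its issue date.
issued = date(2015, 6, 1)
publication = {"title": "Legacy paper", "issued": issued,
               "author": version_as_of(author_history, issued)}
```

Because the embedded author is a copy, later changes to the main author record do not alter what the publication reports.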
Through this mechanism, OMEGA-PSIR not only maintains data quality and transparency but also enables the faithful reconstruction of academic trajectories and institutional histories over time.
4.6. Syntactic and Semantic Versioning
OMEGA-PSIR distinguishes carefully between two dimensions of versioning: syntactic versioning, which refers to changes in the structure of the data model itself, and semantic versioning, which captures changes in the interpretation or intended meaning of data fields. This distinction is critical in the context of a schema-driven system that integrates tightly with external standards, long-term archiving, and analytics.
Syntactic versioning relates to modifications in the XML schema (XSD) or related validation layers. This includes the introduction of new attributes or elements, changes in data types, restructuring of nested components, or updates to validation rules. Each release of the schema is assigned an explicit version number, and the system ensures backward compatibility where feasible by maintaining appropriate conversion logic. These syntactic changes are essential for evolving the model to accommodate new research outputs, workflows, or system capabilities while ensuring that data remain valid and correctly interpretable across time.
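One common way to organize such conversion logic is a chain of per-version migration functions. The sketch below assumes this pattern; the `schemaVersion` field, the two example changes, and the function names are hypothetical and are not taken from the OMEGA-PSIR schema.

```python
def v1_to_v2(record):
    """Hypothetical restructuring: a single 'author' string becomes a list."""
    out = {k: v for k, v in record.items() if k != "author"}
    out["authors"] = [record["author"]] if "author" in record else []
    out["schemaVersion"] = 2
    return out


def v2_to_v3(record):
    """Hypothetical data type change: 'year' is stored as an integer."""
    out = dict(record)
    out["year"] = int(record["year"])
    out["schemaVersion"] = 3
    return out


MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}
CURRENT_VERSION = 3


def upgrade(record):
    """Apply the migrations one by one until the record reaches the current version."""
    while record.get("schemaVersion", 1) < CURRENT_VERSION:
        record = MIGRATIONS[record.get("schemaVersion", 1)](record)
    return record


legacy = {"schemaVersion": 1, "author": "J. Kowalski", "year": "2011"}
upgraded = upgrade(legacy)
```

Each function handles exactly one version step, so a record of any age can be brought forward without special cases.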
Semantic versioning, on the other hand, addresses changes in how a given attribute or structure is intended to be used or understood. For instance, the definition of a term like “publication category” may evolve to align with new national reporting standards, or an affiliation field may shift from reflecting formal employment to including informal collaborations. Although the underlying XML structure may remain unchanged, such semantic shifts have significant implications for how data are entered, displayed, aggregated, and analyzed.
OMEGA-PSIR tracks both forms of versioning explicitly. Schema changes are managed through XSD version control and change logs, while semantic changes are documented through updates to dictionaries or controlled vocabularies.
5. Outcomes of the Modeling Approach
The object-oriented, hierarchical data modeling approach described in
Section 4 was designed not only to capture the complexity of research information in a semantically rich and temporally accurate way but also to enable advanced system functionalities that go beyond what is possible with flat or relational data structures.
This section demonstrates how these modeling choices, along with the text database technology, directly support powerful capabilities across three areas. First, the use of structured, embedded objects with typed relationships enables expressive, semantically rich querying mechanisms, allowing users to retrieve information with precision and context-awareness. Second, these querying capabilities integrate seamlessly with analytical tools, such as pivot tables and reporting pipelines, making it easy to summarize and visualize complex datasets without flattening or transforming the model. Finally, the system’s identifier-centric design naturally supports interoperability with external standards and platforms, enabling participation in the global research infrastructure and the implementation of FAIR and Linked Data principles.
5.1. Custom Query Engine with XPath-Inspired Syntax
To support flexible querying over the complex object model, OMEGA-PSIR introduces a custom search engine layered on top of SOLR (
https://solr.apache.org/ (accessed on 26 August 2025)). This query engine accepts a structured query language that is to a large extent compatible with the XPath standard (
https://www.w3.org/TR/xpath/ (accessed on 26 August 2025)).
The query language supports hierarchical paths to attributes, logical combinations of filters, range conditions, and cross-object querying via references. This syntax allows users and services to express queries in a form that mirrors the XML structure of the data, making it intuitive for developers familiar with XML and XPath while still being mapped internally to Mongo (
https://www.mongodb.com/ (accessed on 26 August 2025)) and SOLR queries. This makes the system similar to XML databases like BaseX [
24] or eXist [
25], except that both of these databases are schemaless, whereas in OMEGA-PSIR the schema plays an important role in controlling data quality and completeness. In addition, the OMEGA-PSIR schema is responsible for the index definition of the database.
This XPath-inspired syntax not only provides an intuitive interface for developers and advanced users but also gives rise to a high degree of model awareness because queries mirror the schema’s nested structure. In addition, the use of XPath in the query layer directly synergizes with the analytical tools, namely, a pivot table mechanism, described in
Section 5.3. Both the querying and the analytical tools rely on the same path-based access model. In the pivot component, XPath expressions define the rows and columns of a table, that is, the groupings over which the aggregates in the pivot cells are calculated. This unified approach ensures consistency between querying and reporting, facilitates reuse of query logic across modules, and reduces the learning curve for power users working across both search and analytics functionalities.
5.2. Expanded Retrieval Functionality
A major advantage of the nested data structures in OMEGA-PSIR is the ability to perform expressive and precise querying. Searching in the system can be performed in two ways:
By taking into account the graph structure and indicating in the query how particular types of objects should be connected—this is a graph search, sometimes also called concept-based search;
By ignoring the graph structure and treating individual network elements as parts of a text document, for example, the text of the publication’s description is “extended” by the text of the person’s description, building a flat unstructured text, which is indexed, so we perform a search in such extended unstructured texts—we then have a text search.
For both ways, leveraging XPath as the primary query language, the system enables users to filter and retrieve records based on deeply embedded attributes, typed relationships, and context-specific values. Clearly, the graph search and text search can be integrated within one search interface.
5.2.1. Graph Search
Graph search is characterized by the fact that the search is for exact values. In particular, graph search makes it possible to search for objects connected to other objects. For example, one wants to search for publications connected to a specific person using the complex author where the author ID is nested:
This search is independent of the local values of the person name, so even if a local nested author’s name differs from the one in the main object, the publication will be retrieved. Another simple query such as
These expressions allow for highly targeted retrieval, such as identifying the number of first-author contributions by a unit (including its subunits) in a given year. Note that here the search for the affiliation is independent of the language. It is also independent of possible changes of the unit names, stored in the local version of the author’s affiliation attribute.
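A graph search by a nested identifier can be sketched with Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. The element names (`records`, `article`, `author`, `id`, `name`), the sample data, and the `articles_by_author` helper are assumptions made for illustration; this is not the OMEGA-PSIR query engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: the same person (p1) appears under two name variants.
SAMPLE = """
<records>
  <article>
    <title>Paper A</title>
    <author><id>p1</id><name>J. Kowalski</name></author>
  </article>
  <article>
    <title>Paper B</title>
    <author><id>p2</id><name>A. Nowak</name></author>
    <author><id>p1</id><name>Jan Kowalski</name></author>
  </article>
  <article>
    <title>Paper C</title>
    <author><id>p2</id><name>A. Nowak</name></author>
  </article>
</records>
""".strip()

root = ET.fromstring(SAMPLE)


def articles_by_author(records, person_id):
    """Titles of articles that have a nested author with the given ID."""
    predicate = f"author[id='{person_id}']"
    return [article.findtext("title")
            for article in records.findall("article")
            if article.find(predicate) is not None]
```

Matching on the nested `id` rather than on `name` is what makes the result independent of the locally stored name variant.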
The graph search can also be used effectively in the end-user interface. For example, in a simple search for publications, the system reacts to the text entered in the search field by providing suggestions from the main
author collection. If the user selects one of the suggestions, an XPath query with the selected person ID is attached, as in the first example, so the answer covers all the publications authored by the indicated person. If, however, the user does not select any suggestion and simply leaves the text, the system automatically interprets it as a text search.
Figure 4 illustrates this.
5.2.2. Text Search
The text search is implemented with the power of SOLR text indexing. In this case, too, the embedded structures can be used for indexing, which gives rise to powerful text searching. By adding the embedded structures to the index, the indexed objects are semantically enriched, so the quality of full-text retrieval is substantially improved. In particular, the standard SOLR functions, such as searching for similar objects, work better. In OMEGA-PSIR, logged-in users can perform a search in three modes: (1) without contextual search, (2) medium-level contextual search, and (3) high-level contextual search.
The syntax for text search reflects the possibilities of the SOLR query language, such as term masking, term adjacency, and term proximity. In addition, the SOLR capabilities power advanced user interface features, including faceted search with hierarchical filtering, saved search configurations, and guided query builders that translate user selections into executable XPath expressions.
In flat metadata repositories, like DSpace or EPrints, querying is typically limited to non-structured attributes, whereas OMEGA-PSIR enables constructing queries across nested structures, which may store time-related values and hierarchical relationships. Typed path indexing allows users to search by the attributes of related objects. For example, as an author structure is embedded in a publication, one can search for publications using embedded (historical) data, such as the country of external authors’ affiliated institutions, the scientific degree of internal authors, or even the authors’ age. Such querying is as simple as if it were for a flat structure: no runtime joins are required.
5.3. Analytical Tools and Reporting in OMEGA-PSIR
The hierarchical and semantically rich structure of OMEGA-PSIR objects also makes it possible to deliver advanced analytics and reporting features without flattening the data or introducing external processing layers.
Central to this capability is the pivot table subsystem, which allows users to create multidimensional summaries of data, grouped by any combination of attributes. Unlike traditional pivot tables built over flat relational tables (see [
38]), OMEGA-PSIR’s implementation works directly on XML-based records. Row and column dimensions are defined using XPath expressions that traverse nested structures, and aggregators apply counting, summing, or averaging logic to grouped subsets.
A pivot table consists of four conceptual parts:
Input data—a set of objects of a given type being an answer to a query;
Rows—define the grouping of data along one dimension (e.g., department, journal, and author);
Columns—define another grouping dimension (e.g., year and publication type);
Aggregators—define how the data are summarized within each row–column intersection (e.g., count of publications and sum of project budgets).
Given a query result from a standard search (Input data), the logged-in user can mark the objects for which she wants to build a table with statistics. As a result of running a pivot, a two-dimensional matrix is generated, where rows and columns correspond to the distinct values of attributes from the Rows/Columns definitions, and each cell contains an aggregated value (e.g., count, sum, and average) computed for the subset of objects from the query results that match the corresponding row–column pair of values. Below, we provide an example illustrating the idea of building statistics and, at the same time, how the proposed modeling approach makes it possible to use historical data.
Suppose we would like to analyze how WUT PhD students or diploma students contributed to articles in the period 2014–2024. The nested author attribute stores the position of the researcher. It is, therefore, possible to use the following XPath expression to find the input data for the pivot table, that is, the journal articles to which the students contributed:
article[
  author[position='Diploma student' or position='PhD student']
  and journalissue[year >= 2014 and year <= 2024]
]
Figure 5,
Figure 6,
Figure 7 and
Figure 8 illustrate pivot tables that can be obtained for the result of the above query. Each table presents journal articles for the 2014–2024 period, (co-)authored by WUT students (PhD or diploma students), grouped according to different criteria. In each table, the columns represent the year of publication.
To build the pivot table shown in
Figure 5, we have to use
journalissue/year as the column definition (Issue year) and
author/position as the row definition (Author’s position). The aggregate function used here is the XPath function
count(.), which counts the objects falling into a cell.
For the pivot table from
Figure 6, the rows are defined by the XPath formula:
journalissue/year - author[position='Diploma student' or position='PhD student']/birthYear
In addition, the age values are grouped here.
The pivot table from
Figure 7 uses two fields in the Rows definition: the publication language and the author’s status. This means publications are first grouped by language and then further subdivided by the students’ position.
Note that the initial records are distributed into table cells that are not necessarily disjoint. A single publication may have multiple authors and can, therefore, appear in multiple groups simultaneously (e.g., in both the “Diploma student” and “PhD student” categories). This is reflected in the totals presented in the last row and column, which are not simple arithmetic sums of the cell values.
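The counting pivot and its non-additive totals can be reproduced in a small Python sketch. The record layout and the sample articles are invented for illustration; the input filter mirrors the student query above in plain Python, and the row and column definitions mirror author/position and journalissue/year.

```python
from collections import defaultdict

STUDENT_POSITIONS = {"Diploma student", "PhD student"}

# Hypothetical article records with nested author data.
articles = [
    {"id": "a1", "year": 2015,
     "authors": [{"position": "PhD student"}, {"position": "Professor"}]},
    {"id": "a2", "year": 2015,
     "authors": [{"position": "PhD student"}, {"position": "Diploma student"}]},
    {"id": "a3", "year": 2016, "authors": [{"position": "Diploma student"}]},
    {"id": "a4", "year": 2013, "authors": [{"position": "PhD student"}]},
    {"id": "a5", "year": 2016, "authors": [{"position": "Professor"}]},
]

# Input data: articles from 2014-2024 with at least one student author.
selected = [a for a in articles
            if 2014 <= a["year"] <= 2024
            and any(au["position"] in STUDENT_POSITIONS for au in a["authors"])]

# Each cell holds the set of distinct articles for a (position, year) pair.
cells = defaultdict(set)
for article in selected:
    for author in article["authors"]:
        cells[(author["position"], article["year"])].add(article["id"])

counts = {key: len(ids) for key, ids in cells.items()}


def column_total(year):
    """Number of distinct articles in a column."""
    return len(set().union(*(ids for (_, y), ids in cells.items() if y == year)))


def column_cell_sum(year):
    """Arithmetic sum of the cells in a column; may exceed the total."""
    return sum(len(ids) for (_, y), ids in cells.items() if y == year)
```

For 2015, the two selected articles fall into three cells with a cell sum of four, while the column total is two, which is exactly the non-disjointness effect noted above.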
The table in
Figure 8, in addition to the standard record-counting aggregator, also defines a second aggregator that sums the WOS Impact Factor values (
“sum(indicator/IF)”). As a result, for each year in the table, two sub-columns are displayed: the number of articles (count) and the cumulated impact factor (IF).
The graphical interface enables users to define pivot configurations without the necessity of coding. To this end, the GUI provides a predefined collection of attributes specific for a given object type (including the embedded ones), together with the corresponding XPath expressions that form the basis for calculations. Drop-down menus and drag-and-drop components enable the user to select the available attributes and subsequently define columns and rows. Following this, the user can define aggregates. These definitions can be saved and reused, exported, or embedded in dashboards and reports.
More advanced users can leverage JavaScript expressions to define “virtual attributes,” conditional aggregations, or dynamic normalizations. This hybrid XPath/JavaScript approach facilitates advanced analytical queries.
In addition to pivot tables, the system includes a general-purpose reporting engine that transforms search results into custom output formats, with post-processing logic applied via JavaScript. These reports can be integrated with external platforms (e.g., Power BI) or accessed through the system’s open API for institutional reporting pipelines.
5.4. Interoperability and Linked Data in OMEGA-PSIR
The growing role of external data sources and the resulting role of global object identifiers for scientific information systems has recently been highlighted in [
39]. The object-based and identifier-centric design of OMEGA-PSIR also supports robust interoperability within the broader research information ecosystem. By treating external identifiers as first-class citizens and associating them with their semantic types via dynamic dictionaries, the system can reliably detect duplicates, reconcile records across sources, and produce Linked Open Data (LOD) relationships automatically.
Each object in OMEGA-PSIR may store multiple identifiers (e.g., DOI, ORCID, and Scopus ID), and each identifier can be associated with one or more external sources defined in the system. These sources are described using a dedicated
Source type, which includes the rules for constructing LOD links based on stored external identifiers. Having these, the system dynamically generates semantic links to external services, thereby ensuring global resolvability and supporting machine-actionable referencing, which is in line with the FAIR principles [
40]. For example, a unique DOI value stored in a publication record can be used to generate multiple semantic links to external resources, such as, e.g., OpenAlex, ORCID, doi.org, europepmc.org, crossref.org, or ncbi.nlm.nih.gov, to mention a few.
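The idea of rule-based link construction can be sketched as follows. The `Source` class, its `url_template` field, and the `lod_links` helper are hypothetical names, and the three templates are merely examples of publicly known URL patterns; the actual rules stored in the Source type may differ.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Source:
    """Hypothetical external source with a rule for building links."""
    name: str
    id_type: str        # semantic type of the identifier this source resolves
    url_template: str   # link construction rule with an {id} placeholder


SOURCES = [
    Source("doi.org", "DOI", "https://doi.org/{id}"),
    Source("Crossref REST API", "DOI", "https://api.crossref.org/works/{id}"),
    Source("ORCID", "ORCID", "https://orcid.org/{id}"),
]


def lod_links(identifiers):
    """Build one link per source that matches a stored identifier type.

    identifiers maps an identifier type (e.g., "DOI") to its stored value.
    """
    links = []
    for source in SOURCES:
        value = identifiers.get(source.id_type)
        if value:
            # Keep the slash that separates the DOI prefix from its suffix.
            links.append(source.url_template.format(id=quote(value, safe="/")))
    return links


# Placeholder DOI, not a real publication.
example_links = lod_links({"DOI": "10.1234/example"})
```

Since several sources share the DOI type, one stored identifier yields several links, and adding a new source requires only a new entry in `SOURCES`.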
These external identifiers play a crucial role in deduplication workflows. They enable the system to detect and consolidate existing records as new data are imported or entered manually. Each record is validated before saving, and the system enforces the uniqueness of external identifiers by displaying warnings in the user interface if potential conflicts are found. Since identifiers can be added or updated at any time, a background process scans the database continuously for similar records to ensure ongoing deduplication. This process is further enhanced by the system’s ability to track record versions and merge duplicates without losing provenance or object relationships.
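A minimal sketch of the uniqueness check on external identifiers, assuming a simple record layout with an `identifiers` mapping. The `find_conflicts` helper and the case-insensitive comparison of all identifier values are assumptions of this sketch, and the identifier values are placeholders.

```python
def find_conflicts(records, candidate):
    """Return IDs of existing records sharing an external identifier with the candidate."""
    index = {}
    for record in records:
        for id_type, value in record["identifiers"].items():
            # Assumption: identifiers are compared case-insensitively.
            index.setdefault((id_type, value.lower()), set()).add(record["id"])
    hits = set()
    for id_type, value in candidate["identifiers"].items():
        hits |= index.get((id_type, value.lower()), set())
    hits.discard(candidate.get("id"))  # a record never conflicts with itself
    return sorted(hits)


existing = [
    {"id": "r1", "identifiers": {"DOI": "10.1234/ABC"}},
    {"id": "r2", "identifiers": {"DOI": "10.1234/def", "PMID": "111"}},
]
incoming = {"id": "new", "identifiers": {"DOI": "10.1234/abc", "PMID": "111"}}
conflicts = find_conflicts(existing, incoming)
```

A non-empty result would correspond to the warning shown in the user interface before the record is saved; merging the duplicates is a separate step not covered here.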
OMEGA-PSIR not only resolves identifiers but also aligns structurally with major metadata standards, including CERIF, DataCite, and RDF. OMEGA-PSIR supports metadata export in RDFa and JSON-LD, participates in OpenAIRE harvesting (see [
35]), and provides automated DOI registration and updating.
The modeling approach notably enables inter-institutional data sharing between various OMEGA-PSIR instances. Using identifier matching and scheduled synchronization, local systems can transfer selected records to centralized or federated platforms, as demonstrated by the Polish Platform of Medical Research (PPM) [
41,
42]. In this model, university-level OMEGA-PSIR instances maintain autonomy while facilitating unified national aggregation, reporting, and discovery.
6. Modeling of Academic World—A Case Study
As mentioned in the introduction, the main goal of the WUT project from 2010 to 2013 was to implement a system integrating the functionalities of a CRIS-type system with an IR and an RPS. This integration was important because it made the platform valuable not only to university research management authorities (CRIS) but also to faculty for solving knowledge management and sharing problems. The solution has proven attractive for Polish universities, and from 2015 to 2025, it was deployed at 48 more institutions. There is extensive literature on the role of knowledge management systems at universities (see [
43,
44,
45]). As noted in [
45],
“KM and knowledge sharing among academics ought to be a critical factor in knowledge intensive organizations like universities.”
In this section, we will present a case study on the use of OMEGA-PSIR as a knowledge base (KB). We will demonstrate how the modeling solutions presented in the paper influence the functionalities of the integrated platform and upgrade knowledge-sharing capabilities.
The main part of a university KB is the data model that defines objects and relationships, compliant with the CERIF model. As CRIS, OMEGA-PSIR covers the life cycle of the entire research process, from project proposals through projects to various research outputs, including publications, reports, prototypes, software, patents, and research data. The data model reflects the main actors (i.e., researchers who are at the center of the model), the organizational infrastructure of the university (the units to which researchers are affiliated), laboratories, teams, projects, and forms of academic output. These objects are linked to one another, forming a semantic network that is presented schematically in
Figure 9.
CRIS requires that the research output of all researchers be complete. There are, therefore, functions for harvesting publication records from the internet. The records can be harvested from arXiv, PubMed, WoS, Scopus, or any other registered source. Additionally, thanks to the interoperability functionality, an author depositing a publication, instead of filling in all the fields, can download the bibliographic description directly into the KB from a global source (CrossRef, Scopus, PubMed, etc.). If bibliometric parameters are available (citations, impact factor, etc.), they are downloaded as well.
OMEGA-PSIR also covers RPS functionality, including the automatic creation of profiles for researchers and units that constitute the university’s formal hierarchical structure. All achievements gathered (automatically or manually) in KB are automatically assigned to the corresponding profiles. Based on the collected information, the system is able to generate a visualization of staff networking, and cooperation with the academic and external communities, making the profiles a rich source of information. Moreover, automatically generated tag clouds are used to make profiles searchable. These clouds are created using the profile’s record of achievements (e.g., publications, patents, and projects). This feature becomes an important tool for researchers in looking for research partners.
The KB platform also plays the role of IR. In this capacity, it provides researchers with the ability to deposit publications or theses. Unlike traditional IR systems, KB replaces the concept of flat collections with hierarchical unit structures, so that researchers’ achievements are automatically assigned to the profile of the affiliating unit and then automatically appear in the profiles of all higher-level units. This means that authors do not have to declare a collection for their deposited publications or theses.
The comprehensive array of information in the system fosters an environment conducive to the implementation of diverse functionalities that ultimately benefit the entire university community, especially in terms of knowledge sharing. One such development is searching for an expert in a given domain. The search is based on a real record of the researcher’s achievements, such as publications or grants awarded. The selection criteria may vary depending on the user’s needs. For example, they may be based on bibliometric measures (e.g., cumulated WOS IF, citations, or a multi-criteria assessment). The algorithm is described in more detail in [
7].
A recently concluded project sought to develop new functions, also with a particular focus on enhancing the knowledge-sharing capabilities of the system:
A mentoring module, aimed at finding a mentor who could assist in preparing a project application; a group of candidate mentors can be easily identified, and if an expert accepts the proposal, she can assist younger colleagues in preparing project applications.
A catalog of research infrastructure (equipment and laboratories) and offers of specialized research services; it gave us two important features: (1) integration of the infrastructure catalog with projects, enabling a workflow for requesting research services for running projects, so that, in effect, one can monitor the usage of the equipment; (2) the catalog became an important factor in knowledge sharing, strengthening research integration among the staff.
A specialized module on institutional partners, recently developed in order to share knowledge about the university’s cooperation with the external world; it makes it possible to search categorized partners (universities, business, health institutions, etc.) by various attributes, including the cooperation subject, and shows the common achievements, such as publications, patents, and projects (see
Figure 10); additionally, it shows a graph of key cooperating persons on both sides.
An event announcement table, being a central place where researchers can announce their seminars and workshops and, optionally, deposit presentations.
In conclusion, the proposed KB model and its functionality go significantly beyond the traditional scope of CRIS-like systems.
8. Conclusions
In this article, we have summarized the results of the research conducted at WUT, related to the design, construction, and development of OMEGA-PSIR. The system was designed as a solution for creating a knowledge base. After analyzing various solutions with institutional scientific information systems, we have decided to integrate the functionalities of CRIS, RPS, and IR systems, taking into account that the system types are functionally different, and the design assumptions may be conflicting. As a result, a specialized XML database has been implemented.
The primary focus of this study was to present the modeling approach based on XML, hierarchical structuring, and versioning. We have shown the efficacy of the presented approach in modeling the history of evolving objects. In the proposed approach, the temporal relationships between objects can be modeled by embedding local versions of main objects with “frozen data” within hosting objects. We have demonstrated how this approach enhances the search functionalities and possibilities of the statistical reporting system with pivots. In addition, the merits of employing modeling techniques to represent hierarchical structures, accompanied by the inheritance of properties and values, have been demonstrated. Furthermore, we have discussed and proposed the utilization of dynamic dictionaries, which have the potential to enhance the flexibility of the model.
The potential of OMEGA-PSIR was illustrated through the case study showing how the system functions as an integrated university knowledge base. Several advanced modules have been developed to strengthen the system’s knowledge-sharing role: expert search, a mentoring support module for grant preparation, a research infrastructure catalog, a partner institution registry, and a platform for academic event announcements. Another module is a recommendation system, currently under development. By combining the completeness of statistical data on the achievements of project leaders at the time of grant submission with the individual track record of the applying researcher, the system will be able to suggest the most appropriate project category for a given applicant.
To conclude, by implementing the platform designed for advanced research information management, we have obtained the following:
Nontrivial modeling solutions based on XML, providing many interesting features, such as modeling evolving objects, analytical tools for creating pivots with historical data and/or hierarchies, and support for multi-versioning.
A system that integrates the functionalities of CRIS, IR, and RPS, which is well accepted by universities in Poland and very positively evaluated by various groups of internal and external users.
We claim that IR, CRIS, and RPS systems are all pieces of the same puzzle. Our 12 years of experience with maintaining the system and running it shows that the applied combination of functions (CRIS, IR, and RPS) created a very positive synergy effect. In particular, system profiling significantly improves the level of acceptance of the system by the academic community, which in turn improves the completeness of the database, which, in turn, is a very important factor for the proper functioning of the CRIS system. Furthermore, the integration of IR functions with KB influences the acceptance of open science practices.