Article

Enhancing Discoverability: A Metadata Framework for Empirical Research in Theses

by Giannis Vassiliou 1,†, George Tsamis 1,†, Stavroula Chatzinikolaou 2,†, Thomas Nipurakis 1,† and Nikos Papadakis 1,*,†

1 Electrical and Computer Engineering, Hellenic Mediterranean University, 71410 Heraklion, Greece
2 Directorate of Veterinary, 81100 North Aegean, Greece
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Algorithms 2025, 18(8), 490; https://doi.org/10.3390/a18080490
Submission received: 18 June 2025 / Revised: 30 July 2025 / Accepted: 31 July 2025 / Published: 6 August 2025

Abstract

Despite the significant volume of empirical research found in student-authored academic theses—particularly in the social sciences—these works are often poorly documented and difficult to discover within institutional repositories. A key reason for this is the lack of appropriate metadata frameworks that balance descriptive richness with usability. General standards such as Dublin Core are too simplistic to capture critical research details, while more robust models like the Data Documentation Initiative (DDI) are too complex for non-specialist users and not designed for use with student theses. This paper presents the design and validation of a lightweight, web-based metadata framework specifically tailored to document empirical research in academic theses. We are the first to adapt existing hybrid Dublin Core–DDI approaches specifically for thesis documentation, with a novel focus on cross-methodological research and non-expert usability. The model was developed through a structured analysis of actual student theses and refined to support intuitive, structured metadata entry without requiring technical expertise. The resulting system enhances the discoverability, classification, and reuse of empirical theses within institutional repositories, offering a scalable solution to elevate the visibility of the gray literature in higher education.

1. Introduction

The increasing digitization of academic content and the growth of institutional repositories have profoundly reshaped how scholarly work is created, stored, and accessed. A significant portion of this content consists of student-authored theses and dissertations, which form part of the gray literature. While these works often contain valuable empirical research, they are frequently overlooked in scholarly ecosystems due to inadequate metadata practices that hinder their discoverability, classification, and reuse [1,2,3,4].
A key challenge lies in the limitations of existing metadata standards. General-purpose schemas such as Dublin Core [5,6,7,8] and MARC [9,10] have been widely adopted in academic repositories, yet they offer only basic descriptive capabilities and fail to capture the structural and methodological complexity of empirical research, particularly in the social sciences. On the other hand, more specialized models—most notably the Data Documentation Initiative (DDI) [11]—provide rich metadata structures suitable for quantitative datasets but remain too complex and domain-specific for practical use in thesis documentation by students or non-specialist staff.
This paper addresses the gap between simplicity and descriptive depth by introducing a novel, hybrid metadata framework specifically designed to support the documentation of empirical content in student theses. Our approach integrates selected elements from DDI into the familiar structure of Dublin Core, creating a lightweight but expressive model that can accommodate quantitative, qualitative, and mixed-methods research. This positions our framework as a practical solution that balances usability and metadata richness—two qualities often found at odds in existing standards.
The motivation for this work stems from the pressing need to make student-authored empirical research more visible, reusable, and systematically integrated into scholarly infrastructures. By equipping repositories with a metadata system that captures essential research characteristics without overwhelming users, we aim to bridge the gap between professional data documentation practices and everyday academic workflows. Furthermore, our framework contributes to ongoing efforts to elevate the status of the gray literature as a discoverable and credible source of scholarly knowledge.
To assess the validity and utility of the proposed model, we applied it to a representative sample of sociology theses. This empirical investigation enabled us to refine the metadata fields based on real-world documentation patterns and led to the development of a functional web-based prototype. The system supports guided metadata entry, modular form generation, and repository-ready output formats—all without requiring technical expertise.

1.1. The Problem: A Hidden Empirical Investigation on Youth Unemployment

Maria, a sociology graduate student, conducted a thesis titled “Youth Unemployment in Rural Greece: A Qualitative Study”. Her work included in-depth interviews and rich qualitative analysis. Despite being archived in her university’s repository, her thesis could not be found by other researchers, students, or policymakers searching for studies on rural unemployment or qualitative methods.
Why? Because:
  • The repository only stored basic metadata (title, author, date);
  • No fields indicated that her study used interviews, focused on rural areas, or involved qualitative analysis;
  • The search engine could not differentiate her work from hundreds of unrelated theses.
As a result, her valuable research remained invisible in academic searches and unusable for comparative or secondary studies.

1.2. The Solution: A Hybrid Metadata Framework

Using the proposed hybrid metadata framework, Maria’s thesis would be documented in the following way:
  • Research Strategy: Qualitative.
  • Data Collection Method: Semi-structured interviews.
  • Population: unemployed youth (ages 18–30).
  • Geographic Coverage: Rural Greece.
With this metadata in place,
  • A student researching qualitative methods could now filter by methodology;
  • A policymaker looking for data on rural youth unemployment would find Maria’s thesis in seconds;
  • The thesis becomes part of a discoverable, structured academic network.
The hybrid framework we propose bridges the gap between simplicity and specificity, making empirical theses like Maria’s findable, reusable, and valued in the broader research ecosystem.
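To make the scenario concrete, the following minimal sketch models the four research-specific fields listed above as a record and shows how a repository search could filter on them. The class name `ThesisRecord`, the function `filter_records`, and the second record are hypothetical illustrations, not part of the prototype described in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThesisRecord:
    """Hypothetical record holding the four fields from Maria's example."""
    title: str
    research_strategy: str
    data_collection_method: str
    population: str
    geographic_coverage: str


def filter_records(records, **criteria):
    """Return records whose fields contain every criterion (case-insensitive)."""
    def matches(rec):
        return all(
            value.lower() in getattr(rec, field).lower()
            for field, value in criteria.items()
        )
    return [rec for rec in records if matches(rec)]


maria = ThesisRecord(
    title="Youth Unemployment in Rural Greece: A Qualitative Study",
    research_strategy="Qualitative",
    data_collection_method="Semi-structured interviews",
    population="Unemployed youth (ages 18–30)",
    geographic_coverage="Rural Greece",
)
# Invented contrast record, used only to show that filtering discriminates.
other = ThesisRecord(
    title="A hypothetical survey-based thesis",
    research_strategy="Quantitative",
    data_collection_method="Questionnaire",
    population="University students",
    geographic_coverage="Athens",
)
repository = [maria, other]

# A policymaker's query: qualitative studies covering rural areas.
hits = filter_records(
    repository, research_strategy="qualitative", geographic_coverage="rural"
)
```

With title-only metadata, neither criterion could be evaluated; with the structured fields, the query isolates Maria's thesis.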
The main contributions of this paper are summarized as follows:
  • We introduce a hybrid metadata framework that merges selected elements of Dublin Core and DDI, offering a user-friendly yet expressive structure for documenting empirical research in theses across methodological types.
  • We address a well-documented gap in existing standards by proposing a model that is both granular and accessible, particularly for non-specialist users working with qualitative and mixed-methods research.
  • We validate the framework through a structured analysis of actual student theses, which informed the refinement of metadata fields and ensured alignment with real documentation practices.
  • We implement a web-based prototype using open-source technologies, enabling students and institutional staff to input structured metadata through an intuitive interface.

Novelty and Methodological Contribution

While the use of Dublin Core and DDI is not new individually, our methodology demonstrates originality in the following ways:
  • Conducting a structured, thesis-specific content analysis to empirically inform field selection.
  • Designing a hybrid model optimized for non-specialist academic contexts.
  • Operationalizing the framework as a working prototype validated by real users.
  • Addressing a documented but unresolved institutional need for empirical thesis metadata documentation.
Notably, the DDI (Data Documentation Initiative) was originally developed to support the documentation of empirical research studies, not student theses. This makes our work particularly innovative, as it applies DDI in a context for which it was not initially designed. By adapting DDI to the structure and needs of academic theses, we demonstrate its flexibility and extend its utility to educational and institutional research practices beyond its conventional domain.
Thus, the contribution is methodological in its applied rigor and practical in delivering a scalable metadata model ready for institutional adoption.
The metadata landscape is rapidly evolving, with standards like Schema.org and RDF enabling semantic web integration and AI-driven techniques such as natural language processing (NLP) facilitating automated metadata extraction. While our proposed hybrid model leverages established standards (Dublin Core, DDI) to address immediate repository needs, it is designed with future extensibility in mind. By balancing simplicity, empirical specificity, and potential integration with emerging technologies, this framework aims to position student theses as discoverable assets in both traditional and semantic scholarly ecosystems.
The remainder of this paper is organized as follows: Section 2 reviews related work on metadata standards and prior initiatives for documenting empirical research. Section 3 details the methodology used to design and validate the proposed model. Section 4 outlines the documentation requirements for various types of theses. Section 5 elaborates on the theoretical foundations of the selected standards. Section 6 presents the empirical validation process and final metadata model. Section 7 discusses key findings and their implications. Finally, Section 8 concludes with limitations and future directions.

2. Related Work

Library and information science have long been concerned with the description and organization of knowledge resources. The emergence of metadata standards has facilitated interoperability and improved access to digital content. Key frameworks—MARC and Dublin Core—have shaped how academic resources are documented and retrieved. However, multiple reviews—including a recent systematic one by Mosha and Ngulube [12]—have highlighted ongoing limitations in how metadata standards are implemented in institutional repositories, especially concerning empirical research and the educational gray literature.
Recent studies have further documented the scope of this problem across different contexts. Nicholson and Bennett [2] found that institutional repository deposit guidelines may actually deter data discovery, creating barriers rather than facilitating access to research outputs. Similarly, Osman et al. [3] evaluated the quality of Electronic Theses and Dissertations (ETDs) descriptions in Malaysian institutional repositories, revealing significant deficiencies in metadata completeness and consistency. At a global scale, Khan et al. [4] conducted an analysis of research data repositories using the re3data registry, identifying widespread gaps in research data repository practices and metadata standards implementation.
These findings collectively underscore a persistent challenge: while institutions recognize the importance of proper metadata documentation, current standards and practices remain inadequate for capturing the complexity and diversity of empirical research, particularly in student-authored academic works.

2.1. MARC

The MARC (Machine-Readable Cataloging) standard was developed in the 1960s by the Library of Congress to encode bibliographic information in a machine-readable format [10,13]. While effective for traditional library systems, MARC is limited in capturing empirical or research-specific metadata. As noted by Mosha and Ngulube [12], standards like MARC are widely implemented but not sufficient for describing the methodological complexity and reuse requirements of research data in repositories.

2.2. Dublin Core

Introduced in 1995 by the Dublin Core Metadata Initiative (DCMI), Dublin Core offers a simple, 15-element schema designed for broad applicability across digital resources [14,15]. It is widely used in institutional repositories, including systems like “Ellanikos”. However, its generality limits its ability to capture detailed empirical research characteristics such as methodology or data collection procedures [16,17]. This weakness has been reaffirmed by Mosha and Ngulube [12], who observed that although Dublin Core supports basic discoverability, it lacks the depth needed to ensure reusability and long-term preservation of research data.

2.3. Data Documentation Initiative (DDI)

The DDI standard was developed to describe quantitative research data in the social sciences [18,19]. Written in XML, DDI includes rich metadata about study design, sampling, data collection, and variable-level information. Despite its power, its complexity creates significant adoption barriers: the XML-based architecture and 300+ element tags overwhelm non-technical users, making DDI impractical for students documenting theses, especially those involving qualitative or mixed methods.
These practical limitations are well documented:
  • Technical Complexity: The standard’s reliance on XML schemas and nested hierarchies (e.g., <dataDscr> and <var> tags) demands programming knowledge [19].
  • Training Requirements: Average onboarding requires 40+ hours for competent use [18].
Mosha and Ngulube [12] noted that while many higher education institutions (HEIs) are aware of rich metadata models such as DDI, implementation is limited by technical barriers, lack of expertise, and insufficient institutional support. Their systematic review found no consistent framework in HEIs for documenting research metadata, nor widespread use of standards suited to non-specialist users—particularly for preserving, discovering, and reusing student-generated research data.
Recent initiatives have acknowledged these limitations. The Qualitative Metadata Working Group [20] proposed best practice guidelines aimed at harmonizing qualitative data documentation across European social science repositories. These emphasize capturing contextual details such as interview settings, ethical considerations, and data processing workflows—areas typically underserved by Dublin Core and only partially addressed by DDI. Similarly, NVivo, a widely used qualitative data analysis tool, offers internal metadata tagging for attributes like interview dates, participant demographics, and case classifications; however, these remain confined to proprietary environments and do not enable open, interoperable documentation [21]. While proposals such as the Qualitative Data Model for DDI [22] suggest promising extensions to existing standards, widespread adoption has yet to materialize.
Together with the findings of Mosha and Ngulube [12], these developments underscore a shared conclusion: existing standards either lack the specificity (e.g., Dublin Core) or accessibility (e.g., DDI) needed to document empirical student research effectively. This motivates the development of a lightweight, hybrid metadata framework that accommodates diverse methodological approaches while remaining usable by students and institutional staff.

2.4. Additional Metadata Standards in the Digital Library Ecosystem

While Dublin Core and DDI form the primary foundation of our hybrid approach, a comprehensive understanding of the metadata landscape requires consideration of other established standards. This section examines additional frameworks to contextualize our design decisions and justify our focus on Dublin Core and DDI integration.
The Metadata Object Description Schema (MODS), developed by the Library of Congress, represents a significant advancement over basic Dublin Core for bibliographic description [23,24]. MODS provides more granular elements, including complex title information, detailed name authorities, subject hierarchies, and sophisticated relationship modeling through elements such as <titleInfo>, <name>, <subject>, and <relatedItem>. However, despite MODS’ enhanced bibliographic capabilities, it lacks the empirical research specificity that our framework addresses, focusing on traditional library cataloging concerns rather than research methodology documentation. Similarly, the Metadata Encoding and Transmission Standard (METS) provides a framework for encoding descriptive, administrative, and structural metadata for digital library objects [25,26]. While METS excels as a “wrapper” standard that can incorporate other metadata schemas while adding structural organization, it addresses packaging and structural relationships rather than content-specific documentation needs, introducing unnecessary complexity for individual thesis documentation without providing the empirical research fields our hybrid model supplies.
Encoded Archival Description (EAD) serves as the standard for encoding archival finding aids, designed to describe collections of materials rather than individual items [27,28]. EAD supports hierarchical descriptions with provenance information and detailed scope descriptions, but its collection-oriented approach differs from our focus on individual thesis documentation. EAD assumes materials exist within established archival hierarchies, whereas student theses represent discrete intellectual products requiring methodological rather than archival description.
Schema.org represents another alternative, providing structured data vocabularies for web content using linked data principles [29,30]. While Schema.org offers potential complementarity with academic repository systems for enhancing discoverability, its broad applicability comes at the cost of domain-specific depth, lacking the granular empirical research fields necessary for comprehensive thesis documentation.
The Resource Description Framework (RDF) provides a foundational model for representing information using subject–predicate–object triples [31,32], offering theoretical advantages for semantic web integration and cross-repository interoperability. Despite these capabilities, we chose to focus on XML-based standards for practical reasons: existing institutional repositories predominantly implement XML-based metadata systems, making our approach immediately deployable without infrastructure changes; the learning curve for non-technical users is lower with form-based XML entry systems; and while RDF enables sophisticated semantic relationships, the primary challenge in thesis documentation lies in capturing empirical research details rather than complex entity relationships. Two additional standards warrant brief consideration: ONIX (Online Information Exchange) serves the commercial publishing industry by standardizing product information [33], focusing on sales and marketing metadata rather than scholarly content documentation; and PREMIS (Preservation Metadata: Implementation Strategies) addresses digital preservation requirements [34,35], documenting preservation actions and technical characteristics rather than research methodology and empirical content.
This survey of alternative metadata standards reinforces our decision to build upon Dublin Core and DDI. Dublin Core provides the necessary simplicity and widespread adoption for general thesis description, while DDI offers the only established framework specifically designed for empirical research documentation. Other standards either lack empirical research specificity (MODS, EAD, Schema.org), focus on structural rather than content concerns (METS), address commercial rather than academic needs (ONIX), serve preservation rather than description functions (PREMIS), or prioritize semantic sophistication over practical implementation (RDF). Our hybrid approach leverages the strengths of established standards while addressing the specific gap in empirical thesis documentation, creating a practical solution that institutions can implement within existing technological and organizational constraints (Table 1).

2.5. Toward a Hybrid Model

Previous efforts to bridge general and empirical standards—such as the DATORIUM project [36]—have extended Dublin Core with DDI elements. However, these focus on datasets rather than complete academic works like theses. To address this gap, we propose a hybrid metadata model that combines the accessibility of Dublin Core with selected empirical descriptors from DDI, tailored for educational contexts.

3. Methodology and Design Process

This study follows a design-based research (DBR) methodology (Figure 1) aimed at developing a context-sensitive, user-friendly metadata framework specifically suited for documenting empirical research in academic theses. Rather than proposing an entirely new theoretical model, our contribution lies in the careful adaptation and operationalization of existing metadata standards to solve a practical and widely acknowledged problem in academic repositories: the under-documentation and poor discoverability of empirical content in student-authored theses, particularly within the social sciences.
The methodology consisted of the following structured phases:

3.1. Phase 1: Problem Identification and Gap Analysis

We began with a comprehensive review of metadata standards (Dublin Core, DDI) and digital library practices related to student theses and the gray literature. Our review confirmed the following:
  • Dublin Core is widely used in institutional repositories but lacks empirical research descriptors.
  • DDI is robust but unsuitable for non-technical users and does not target theses.
  • No metadata framework exists that is both empirically expressive and accessible for students or repository staff without specialist training.

3.2. Phase 2: Empirical Content Analysis of Theses

To inform our metadata field selection, we conducted a content analysis of fourteen graduate sociology theses (6 quantitative, 6 qualitative, 2 mixed-method) from the University of the Aegean’s institutional repository. These theses were manually coded against a draft metadata template derived from the “Study Description” section of DDI Lite and from core Dublin Core fields.
While the DDI standard was not originally designed for theses, our empirical analysis revealed that many of its high-level descriptive fields (such as research questions, sample characteristics, data collection dates, and geographical scope) are sufficiently general to apply across all research methodologies. We therefore selectively adapted these fields for use in quantitative, qualitative, and mixed-method theses, focusing on those that capture universal aspects of empirical design rather than quantitative-specific constructs like variables or datasets. This approach supports the hybrid model’s goal of balancing expressiveness with usability across methodological types.
Our analysis measured the presence, absence, and usability of 25 candidate metadata fields. Fields were selected for inclusion in the final schema if they were
  • Present in at least 70% (10 out of 14) of the theses, leaving the remaining fields optional;
  • Cross-methodologically relevant (usable across qualitative, quantitative, and mixed methods);
  • Essential for understanding empirical research design (e.g., sampling, data collection);
  • Realistically identifiable and fillable by non-specialist users.
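The first criterion is a simple frequency threshold and can be sketched as below. The field names and counts are invented for illustration and are not the coding results of this study; the remaining three criteria are qualitative judgments and are not modeled here.

```python
def select_core_fields(presence, n_theses, threshold=0.70):
    """Split candidate fields into core and optional by presence ratio.

    presence maps a field name to the number of theses in which the
    field could be identified; a field is core when that count divided
    by n_theses reaches the threshold.
    """
    core, optional = [], []
    for field, count in presence.items():
        if count / n_theses >= threshold:
            core.append(field)
        else:
            optional.append(field)
    return core, optional


# Hypothetical counts out of 14 theses (not the study's actual data).
presence = {
    "research_strategy": 14,
    "data_collection_method": 13,
    "sample_size": 10,   # 10/14 ≈ 0.714, just above the 70% threshold
    "weighting": 3,
}
core, optional = select_core_fields(presence, n_theses=14)
```

With fourteen theses, the 70% threshold corresponds to a minimum of ten theses, since nine would give only about 64%.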

3.3. Phase 3: Hybrid Metadata Model Construction

Based on the analysis, we designed a 33-field hybrid metadata schema organized into two components:
  • Part A—Research-Specific Fields (20 fields): Drawn from DDI, covering strategy, population, methodology, and data production.
  • Part B—General Description Fields (13 fields): Based on Dublin Core and common thesis documentation fields (e.g., title, abstract, language).
This structure reflects an original contribution: a field-tested, pedagogically grounded, and student-usable framework that bridges descriptive and empirical metadata.

3.4. Phase 4: Prototype Implementation

We implemented the framework in a functional web-based prototype using open-source technologies (PHP, MySQL, Bootstrap). The system allows for the following:
  • Guided metadata entry via modular, type-aware forms.
  • Export to XML for integration with institutional repositories.
  • Basic filtering and search based on research method and metadata categories.

3.5. Phase 5: Informal User-Centered Validation

Nine users (five graduate students, three faculty advisors, one repository administrator) participated in an informal walkthrough and evaluation of the prototype. Feedback was collected via structured questionnaires and discussion. Participants confirmed the system’s usability and practical value, while recommending improvements such as tooltip help, field definitions, and support for qualitative variation. These enhancements have been noted for future implementation. The results of the user validation can be found in Appendix E.

4. Documentation Requirements for Theses

The documentation needs of a thesis depend heavily on the nature of the research it contains. For this purpose, theses are classified into four distinct types:
  • Bibliographic theses (non-empirical).
  • Theses with quantitative empirical research.
  • Theses with qualitative empirical research.
  • Theses with mixed-method empirical research.

4.1. Bibliographic Theses

Theses that do not include any form of empirical research require basic descriptive metadata. These works are typically literature-based, involving the synthesis of existing academic sources. The metadata fields necessary for documentation include the author’s name, title, abstract, language, and date of submission—similar to what is commonly used in library catalogs.

4.2. Theses with Quantitative Research

These theses introduce an additional layer of complexity. Beyond the general description of the thesis, detailed documentation is needed for the research component. This includes metadata on the research design, sample characteristics, methodology, and data collection process. Only the original researcher is typically equipped to document these details accurately. Tools such as DDI are designed to capture this level of detail but are often too complex for general use.

4.3. Theses with Qualitative Research

Qualitative research presents unique documentation challenges, as there is no universally adopted metadata standard. As a result, many such works remain undocumented in digital systems, which limits their discoverability and reuse for secondary analysis.

4.4. Theses with Mixed-Method Research

Mixed-method studies inherit the complexity of both quantitative and qualitative approaches. While some aspects may be documented using tools like DDI, qualitative elements remain largely unsupported. This gap further emphasizes the need for a flexible, hybrid metadata model that accommodates all types of empirical research.
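The four-way classification suggests a type-aware form: every thesis receives the general descriptive fields, and empirical theses additionally receive the research fields. A minimal sketch follows, in which the enum, the field identifiers, and the function name are hypothetical and only the grouping follows the requirements described in this section.

```python
from enum import Enum


class ThesisType(Enum):
    BIBLIOGRAPHIC = "bibliographic"
    QUANTITATIVE = "quantitative"
    QUALITATIVE = "qualitative"
    MIXED = "mixed-method"


# Descriptive fields needed by every thesis (Section 4.1).
GENERAL_FIELDS = ["author", "title", "abstract", "language", "date_submitted"]

# Additional fields for the research component (Section 4.2).
RESEARCH_FIELDS = [
    "research_design",
    "sample_characteristics",
    "methodology",
    "data_collection",
]


def form_fields(thesis_type):
    """Return the fields a metadata entry form should show for a thesis type."""
    if thesis_type is ThesisType.BIBLIOGRAPHIC:
        return list(GENERAL_FIELDS)
    # Quantitative, qualitative, and mixed-method theses share one research block.
    return GENERAL_FIELDS + RESEARCH_FIELDS
```

Using a single research block for all three empirical types reflects the goal of a model that accommodates every methodology rather than only quantitative work.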

5. Documentation Standards

The importance of documenting empirical research is clear. Each empirical study is conducted within a specific spatial and temporal context, guided by defined research questions and methods of data production and collection. Documenting the research process using metadata that accurately reflects both the nature of the study and its environmental variables is critical for future evaluation and possible reuse. Proper documentation allows these studies to contribute meaningfully to future secondary research, where past findings form integral parts of new data analysis frameworks.
Data sharing is a well-established practice in the social sciences, supported by a global network that facilitates access to and preservation of research archives. As previously noted, empirical research may be quantitative, qualitative, or mixed-method. For quantitative studies, the Data Documentation Initiative (DDI) offers a robust framework for metadata documentation. However, qualitative research, which is equally prevalent and requires proper documentation for similar reasons, is not officially supported by current versions of DDI. For bibliographic theses, documentation is generally limited to descriptive metadata using models such as Dublin Core, widely adopted in online library catalogs.

5.1. The Data Documentation Initiative (DDI)

5.1.1. Overview

The DDI is an international standard for documenting empirical research in the social sciences [18]. First released in 1995, the standard is written in XML and supports the preservation and dissemination of research data. Its primary function is to facilitate the creation of structured “codebooks” describing datasets in detail.
Today, the DDI is actively used by data specialists and archive managers worldwide. According to the official DDI website (ddialliance.org), it “provides a standard for describing statistical and social science data, enabling both human and machine interpretation of datasets.”

5.1.2. Structure

The DDI relies on XML and organizes metadata using predefined tags. These include information about data collection methods, sampling techniques, population characteristics, spatial and temporal coverage, keywords, access conditions, and more. While Version 2.0 of the DDI focuses on the final documentation stage of a study, Version 3.3 extends coverage to the entire research lifecycle—from design to dissemination—enhancing its utility for secondary research.
The five main components of a DDI XML document are as follows:
  • docDscr: Description of the XML document.
  • stdyDscr: Study-level metadata.
  • fileDscr: Information about data files.
  • dataDscr: Variable-level metadata.
  • othMat: Supplementary materials.
While DDI was primarily developed for quantitative empirical research, its open and extensible structure allows partial adaptation for other research types. Its advantages include interoperability, open access, and integration potential with standards like Dublin Core [19].
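As an illustration of this five-part layout, the sketch below builds an empty codebook skeleton and places a study title inside the study description. It is a simplified sketch that omits XML namespaces, required attributes, and schema validation, so its output should not be read as a valid DDI instance; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The five top-level sections of a DDI codebook, in document order.
DDI_SECTIONS = ["docDscr", "stdyDscr", "fileDscr", "dataDscr", "othMat"]


def build_codebook_skeleton(study_title):
    """Create a bare codeBook element containing the five main sections."""
    root = ET.Element("codeBook")
    sections = {name: ET.SubElement(root, name) for name in DDI_SECTIONS}
    # The study title sits under stdyDscr/citation/titlStmt/titl.
    citation = ET.SubElement(sections["stdyDscr"], "citation")
    title_stmt = ET.SubElement(citation, "titlStmt")
    ET.SubElement(title_stmt, "titl").text = study_title
    return root


codebook = build_codebook_skeleton(
    "Youth Unemployment in Rural Greece: A Qualitative Study"
)
print(ET.tostring(codebook, encoding="unicode"))
```

Even this stripped-down skeleton needs three nested elements to record a title, which hints at why the full standard is demanding for non-specialist users.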

5.1.3. DDI Lite

For users requiring a simpler format, DDI Lite offers a subset of the most commonly used tags. Based on DDI Version 2.0, it simplifies the codebook while maintaining compatibility with other metadata systems. Critical tags are retained to ensure essential information is documented.

5.1.4. DDI and Quantitative Research

DDI is well suited to documenting studies whose data may be reused in the future. It enables detailed documentation of variables, data collection methods, and the structure of electronic datasets, which helps future researchers analyze or repurpose data effectively. However, complex documents like theses, which may contain non-empirical or mixed content, often fall outside DDI’s formal application scope.

5.2. Dublin Core

Overview

Dublin Core is a simple and widely adopted metadata standard consisting of fifteen elements designed for describing digital objects. These include items like title, creator, subject, description, and date. Dublin Core is commonly used for documenting various digital formats such as videos, audio, images, texts, and complex web objects.
Implemented using XML and RDF, its simplicity has contributed to its popularity across digital libraries and institutional repositories. The standard comes in two levels:
  • Simple Dublin Core: Uses the original 15 core elements.
  • Qualified Dublin Core: Adds three more elements (Audience, Provenance, Rights Holder) and allows element refinements (qualifiers) to support more accurate searches.
According to the American National Standards Institute [6], Dublin Core enables effective resource discovery and has become a foundational standard in metadata management.
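As an illustration of how little structure Simple Dublin Core needs, the sketch below serializes a handful of the fifteen elements to XML. The `record` wrapper element, the helper name, and all values are hypothetical placeholders; only the `dc` namespace URI is the standard one for the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def simple_dc_record(fields: dict) -> ET.Element:
    """Wrap element/value pairs in a hypothetical <record> container."""
    record = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


# Placeholder values, not taken from any real thesis.
record = simple_dc_record({
    "title": "Example thesis title",
    "creator": "Example Author",
    "subject": "sociology",
    "description": "Example abstract text.",
    "date": "2015",
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Nothing in such a record says whether the thesis contains empirical research, which is the gap the DDI-derived fields are meant to fill.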

5.3. Bridging to the Hybrid Model

The preceding analysis of DDI (Section 5.1) and Dublin Core (Section 5.2) reveals complementary strengths that directly inform our hybrid framework design. While DDI provides granular empirical descriptors (e.g., methodology, sampling), its complexity exceeds typical thesis documentation needs. Dublin Core, conversely, offers simplicity but lacks research-specific fields. Our solution, presented in Section 6, strategically merges these standards by
  • Adopting DDI’s core empirical fields (e.g., Research Strategy, Data Collection Method) that appeared in ≥70% of the analyzed theses;
  • Preserving Dublin Core’s lightweight structure for general description.
This synthesis addresses the key limitations identified in Section 5.1 and Section 5.2 while maintaining usability for student authors.

6. Initial Metadata Model: Design and Investigation

6.1. The Empirical Investigation

This phase of the research aimed to validate the proposed metadata model by applying it to actual theses. A diverse sample of theses representing bibliographic, quantitative, qualitative, and mixed-method research types was selected for evaluation. The primary goal was to identify whether the initial documentation schema captured all necessary information for accurate and useful metadata.

6.2. Research Objective

The main objective was to verify whether the initial set of metadata fields adequately described the key components of empirical research. This included information about methodology, sample characteristics, data collection, and research questions. The effectiveness of each field was measured by how consistently and clearly it could be completed using real student work.

6.3. Initial Metadata Template

A preliminary metadata recording form was created based on insights from Dublin Core, DDI, and the needs specific to theses. This form included general descriptive fields (title, author, year, and abstract) as well as fields relevant to empirical research (sample size, methodology type, tools used, population, and data collection method).

6.4. Testing and Observations

Each thesis in the sample was manually analyzed using the initial metadata form. The results revealed a number of limitations:
  • Some fields were too broad or vague to capture meaningful distinctions between research types.
  • Other fields were frequently left incomplete because the relevant information was not present in the thesis text.
  • Certain empirical features, especially in qualitative studies, were described inconsistently.
These findings prompted a refinement of the model, aiming for better alignment between metadata expectations and the way students typically document their research.

6.5. Research Purpose and Initial Form Design

The aim of the empirical investigation was to identify the empirical elements found in theses and develop a suitable metadata entry form that could serve as a foundation for documenting these theses.
This investigation relied on content analysis of 14 theses (6 qualitative, 6 quantitative, and 2 mixed-method) selected through purposive sampling. These theses were already submitted and archived in the University of the Aegean’s “gray literature” repository. Selected theses were expected to include elements aligning with the “Study Description” section of the DDI standard, to test how DDI can be adapted beyond quantitative-only research.
Bibliographic theses were excluded since the focus was on exploring which fields from DDI are essential in documenting empirical studies. For non-empirical works, Dublin Core provides sufficient coverage.
The sample consisted exclusively of theses authored by graduate students from the Department of Sociology at the University of the Aegean up to the year 2015. The sampling was designed to include all thesis types containing empirical content. A metadata entry form was developed and refined based on findings from content review.
Due to ethical considerations, the authors of the examined theses are not named.
The method of analyzing the results involved comparing the empirical details documented in each thesis against the fields of the DDI “Study Description.”
Research Hypotheses
  • Dublin Core offers general information on theses.
  • DDI can be adapted to cover essential elements of all types of empirical research (quantitative and qualitative).
Research Question
  • What are the essential elements of empirical research that must be included in the metadata form to ensure proper documentation?

Initial Metadata Entry Form (Prototype Documentation Model)

The initial form was derived from the DDI standard’s “Study Description” section, drawing mainly on elements from the DDI Lite subset, which proved well suited to our simplified documentation purposes. The form includes 25 fields, most of which are drawn directly from DDI Lite.
These fields were used to assess whether each element was present in the theses and to what extent it contributed to understanding the research. A detailed description of each field followed, aligned with DDI definitions.
Of the 25 fields (Table 2), 18 (Table 3) were determined to be fundamental for effective metadata documentation. These include the following:
  • Research strategy.
  • Data source (primary/secondary).
  • Time/space framework.
  • Research questions.
  • Hypotheses.
  • Reference and collection dates.
  • Countries, geographic coverage and units.
  • Population and observation unit.
  • Sample size and sampling method.
  • Data collection and analysis methods.
  • Recording and analytical tools.
These fields are considered essential for a robust documentation model that can describe empirical content across diverse thesis types.
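One way to picture these essential fields in software is as a record whose empty slots can be reported back to the person filling in the form. The sketch below is hypothetical: the class name, attribute names, and the selection of ten attributes are illustrative choices based on the list above, not the schema defined by the authors.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class EmpiricalDescription:
    """Hypothetical container for a subset of the essential empirical fields."""
    research_strategy: Optional[str] = None
    data_source: Optional[str] = None  # "primary" or "secondary"
    time_space_framework: Optional[str] = None
    research_questions: Optional[str] = None
    hypotheses: Optional[str] = None
    population: Optional[str] = None
    sample_size: Optional[str] = None
    sampling_method: Optional[str] = None
    data_collection_method: Optional[str] = None
    analysis_method: Optional[str] = None


def missing_fields(entry: EmpiricalDescription) -> List[str]:
    """List the fields that are still empty in a metadata entry."""
    return [f.name for f in fields(entry) if getattr(entry, f.name) in (None, "")]


# Placeholder entry with three fields filled in.
entry = EmpiricalDescription(
    research_strategy="qualitative",
    data_source="primary",
    sample_size="12",
)
print(missing_fields(entry))
```

A check of this kind mirrors the observation in Section 6.4 that some fields are often left incomplete because the thesis text does not state the information.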

6.6. Results of the Empirical Investigation

Presented below is an analysis of 14 empirical theses—encompassing quantitative, qualitative, and mixed-method studies (see Table 4, Table 5 and Table 6).

6.7. Table 7b—Final Selection of Fields

Based on the results above and the 7 fields initially under examination (see Table 7a), we conclude that the 2 fields with a presence count of 10 to 14 (≥70% presence) are significant, while the other fields are treated as optional. The fields selected as important for the new metadata form are shown in Table 7b:
Table 7. (a) Summary presence in all theses (max count = 14). (b) Final selection of fields.
(a)
# | Field | Presence Count
1 | Title | 7
2 | Abstract | 7
3 | Purpose | 14
4 | Working Hypotheses | 12
5 | Other Sources | 3
6 | Research Problems | 2
7 | Ethics in Research | 5
(b)
# | Field | Presence Count
1 | Purpose | 14
2 | Working Hypotheses | 12

6.7.1. Justification for the 70% Threshold

We selected the 70% threshold based on three converging considerations that emerged from our empirical analysis. First, this threshold ensures fields appear in a clear majority of theses across methodological types, demonstrating consistent relevance rather than methodological specificity. Second, it balances metadata completeness with system usability for non-specialist users, avoiding the complexity that would arise from requiring fields with inconsistent presence across research approaches. Third, it allows sufficient flexibility for methodological variation while maintaining the empirical specificity necessary for meaningful thesis discovery. With 14 theses in the sample, 70% corresponds to 9.8, so a field must appear in at least 10 theses to be required; such fields demonstrate cross-methodological applicability spanning quantitative, qualitative, and mixed-method approaches.

6.7.2. Balancing Usability and Empirical Coverage

The 70% threshold strikes a critical balance between two competing needs: capturing sufficient empirical detail for meaningful thesis discovery while avoiding system complexity that could deter student adoption. Fields appearing in 70% or more of theses represent ‘core empirical elements’ that users can expect to find consistently across diverse research approaches, forming a reliable foundation for repository search and filtering capabilities. Meanwhile, fields with 50–69% presence become optional rather than excluded entirely, accommodating methodological diversity without overwhelming catalogers or creating barriers to system adoption. This tiered approach ensures the framework remains accessible to non-specialist users—a key design requirement given that students and repository staff, not metadata experts, will be the primary system users—while preserving the empirical richness necessary for distinguishing research theses from purely bibliographic works.
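The tiering rule described above can be sketched as a small classifier over the presence counts in Table 7(a). The function name and tier labels are illustrative assumptions; the rule as stated covers only the 70%-and-above and 50–69% bands, so fields under 50% are simply labeled as falling below the optional band.

```python
TOTAL_THESES = 14

# Presence counts from Table 7(a).
PRESENCE = {
    "Title": 7,
    "Abstract": 7,
    "Purpose": 14,
    "Working Hypotheses": 12,
    "Other Sources": 3,
    "Research Problems": 2,
    "Ethics in Research": 5,
}


def tier(count: int, total: int = TOTAL_THESES) -> str:
    """Classify a field by presence share, using integer arithmetic."""
    if count * 100 >= 70 * total:
        return "required"
    if count * 100 >= 50 * total:
        return "optional"
    return "below optional band"


tiers = {name: tier(count) for name, count in PRESENCE.items()}
for name, label in tiers.items():
    print(f"{name}: {PRESENCE[name]}/{TOTAL_THESES} -> {label}")
```

Comparing `count * 100` against `percent * total` avoids floating-point rounding when a share sits exactly on a boundary, as Title and Abstract do at 7 of 14.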

6.7.3. Sensitivity Analysis of Threshold Selection

Sensitivity analysis shows how threshold selection affects both field inclusion and system utility. A 50% threshold (≥7 theses) would include Title, Abstract, Purpose, and Working Hypotheses as required fields, but only the latter two demonstrate strong cross-methodological presence, while Title and Abstract show inconsistent documentation patterns that could frustrate users expecting complete records. A 60% threshold (≥9 theses, since 60% of 14 is 8.4) would require only Purpose and Working Hypotheses, the same two fields as the selected threshold. The selected 70% threshold (≥10 theses) results in these two required core fields (Purpose appearing in 100% of theses and Working Hypotheses in 86%) while maintaining Title, Abstract, and Ethics in Research as optional fields that enhance description without imposing unnecessary complexity. An 80% threshold (≥12 theses) would still yield Purpose and Working Hypotheses, whereas a 90% threshold (≥13 theses) would require Purpose alone. This analysis indicates that the required core is stable for thresholds from 60% to 80% and that the 70% threshold balances empirical coverage with cross-methodological applicability, ensuring the most consistently present and methodologically relevant elements form the required core while preserving flexibility for enhanced documentation through optional fields.
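The threshold sweep can be reproduced from the presence counts in Table 7(a). The following is a minimal sketch rather than the authors' procedure; the function name `fields_at` is introduced here for illustration.

```python
TOTAL_THESES = 14

# Presence counts from Table 7(a).
PRESENCE = {
    "Title": 7,
    "Abstract": 7,
    "Purpose": 14,
    "Working Hypotheses": 12,
    "Other Sources": 3,
    "Research Problems": 2,
    "Ethics in Research": 5,
}


def fields_at(threshold_pct: int) -> list:
    """Return the fields whose presence share meets the given percentage."""
    return [
        name
        for name, count in PRESENCE.items()
        if count * 100 >= threshold_pct * TOTAL_THESES
    ]


for pct in (50, 60, 70, 80, 90):
    print(f"{pct}%: {fields_at(pct)}")
```

In this sweep the 60%, 70%, and 80% rows list the same two fields, Purpose and Working Hypotheses, and only Purpose remains at 90%.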

6.7.4. Proposed Documentation Schema

The preceding analysis of the DDI and Dublin Core standards establishes their capabilities and limitations as a basis for a new schema that adequately describes theses, especially those containing empirical research. While Dublin Core provides general descriptive functionality, it lacks fields for empirical research. DDI, on the other hand, offers thorough descriptions for empirical (especially quantitative) studies, including elements applicable to qualitative or mixed research.
However, DDI’s depth and complexity make it less practical for a simplified metadata input system. Thus, a hybrid approach is needed—an enhanced Dublin Core. Fields drawn from DDI that focus on empirical research are integrated if deemed important and prevalent in actual theses.

New Metadata Form

The hybrid model, which combines the simplicity of Dublin Core with essential DDI-like fields to describe empirical work, is presented in Table 8a,b.
These two parts together form the final metadata entry form for possible implementation in a documentation system.

7. Discussion

The Ellanikos system, like most library systems used in academic (and other) institutions, does not support enhanced specialization in the metadata it stores. This limitation hinders the description—following a library-style approach—of theses that include empirical (quantitative, qualitative, or mixed-methods) research, particularly according to established scientific documentation standards. Traditional library systems tend to ignore or fail to adequately capture information related to empirical research.
Furthermore, the classification and cataloging of theses in systems like Ellanikos are typically carried out by specialized library staff, with no involvement from the thesis authors themselves. This separation can result in metadata that lacks essential details about the empirical aspects of the research.

7.1. Adaptation of the DDI Standard

In this work, an attempt is made to combine the DDI (specifically DDI Lite) and Dublin Core standards to form a new metadata framework that can include basic information about theses containing empirical research. DDI was chosen as the reference model for designing the initial metadata entry form because it offers a wide array of fields capable of thoroughly describing empirical studies. However, the aim here is not to implement DDI in its entirety but to selectively use only those fields that can adequately describe the empirical nature of the work while maintaining the simplicity and interoperability of Dublin Core.
Particular attention was given to fields under the stdyInfo and method subcategories of the Study Description section of the DDI standard. These contain critical metadata elements that can highlight the empirical character of a thesis, such as the research method, data collection techniques, and study scope. These fields were evaluated for their compatibility with both quantitative and qualitative research documentation.
Moreover, this initiative does not seek to achieve full metadata curation or exhaustive documentation of research projects. Most theses do not include supporting materials such as datasets, codebooks, or variable files, which are essential for proper DDI documentation. Therefore, comprehensive documentation must remain the responsibility of the research team itself and should be conducted during the research lifecycle—not retrospectively.

7.2. Combining DDI and Dublin Core

A hybrid approach is proposed, integrating selected DDI fields into a Dublin Core-based cataloging system. This strategy leverages the wide adoption and simplicity of Dublin Core while enriching it with targeted DDI elements that provide meaningful insights into the presence of empirical research. Only those DDI fields that add value to the discovery, classification, and retrieval of empirical theses are included, deliberately avoiding unnecessary complexity.
The goal is not to create a new exhaustive standard, but to enhance systems like Ellanikos with metadata that can enable users to identify whether a thesis includes empirical research and to what extent. This allows users to locate not only the text of a thesis but also, when available, any supporting data or materials.
In this envisioned system, metadata entry will be carried out either by the students themselves—who authored the work—or by designated administrative personnel. These individuals do not need to be research experts, as the recording form and metadata model are designed to be accessible and manageable for non-specialists.
Ultimately, this hybrid model provides a practical compromise between detailed documentation and operational feasibility. It strengthens existing library infrastructures without imposing the burden of full empirical data curation on cataloging systems that were never intended for such tasks.

8. Conclusions

This paper presents a context-driven, hybrid metadata framework designed to improve the discoverability and structured documentation of empirical research embedded in academic theses. Rather than proposing an abstract theoretical model, the contribution of this work lies in its applied design: combining selected elements from Dublin Core and the Data Documentation Initiative (DDI) into a practical, student-friendly schema grounded in real thesis documentation patterns.
Through a structured methodology involving content analysis of student theses, iterative field selection, and prototype development, we have demonstrated that it is possible to bridge the gap between simplicity and empirical specificity in metadata frameworks. The resulting 33-field model addresses the limitations of existing standards—Dublin Core’s generality and DDI’s complexity—by offering a usable structure tailored for qualitative, quantitative, and mixed-method theses authored by students.
Moreover, informal validation with students, faculty, and repository administrators has shown that the system is usable, adaptable, and aligned with real institutional needs. Participants affirmed the framework’s clarity and value while providing suggestions to improve metadata completeness, field clarity, and user experience.
Importantly, this work contributes not by inventing a new metadata theory, but by operationalizing a scalable, educationally grounded solution for a documented gap in institutional repositories: the underrepresentation and inconsistent documentation of empirical student research, particularly in the social sciences.

Future Work

Future development will focus on extending the prototype’s functionality, including the following:
  • Integrating the framework with institutional repository platforms (e.g., DSpace, EPrints) to enhance adoption and scalability.
  • Mapping the 33-field schema to Schema.org vocabularies (e.g., ScholarlyArticle, Dataset) to improve thesis visibility on web search engines, potentially through a JSON-LD export feature alongside the existing XML output.
  • Exploring RDF serialization of the metadata schema to enable linked data integration, allowing theses to be linked to related resources (e.g., datasets, publications, or ORCID profiles) for cross-repository queries and semantic search.
  • Developing an NLP module (or using LLMs) to extract metadata fields (e.g., methodology, sample size) from thesis PDFs, reducing manual entry for students, with pilot testing using tools like GROBID or spaCy to validate the approach for legacy theses.
  • Supporting multilingual metadata entry.
  • Expanding the framework to cover student-authored empirical papers and conference submissions beyond theses.
  • Conducting broader user testing across diverse academic departments and institutional contexts to assess usability, adoption, and scalability.
This system is intended to evolve in alignment with institutional practice and user feedback, contributing to the broader goal of making the gray literature more visible, reusable, and methodologically transparent within scholarly ecosystems.
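The JSON-LD export mentioned in the list above could look roughly like the following sketch. It is not part of the existing prototype: the choice of the Schema.org `Thesis` type, the property mapping, the input keys, and the function name are all assumptions made for illustration.

```python
import json


def to_jsonld(record: dict) -> str:
    """Serialize a flat thesis record as a Schema.org JSON-LD string."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Thesis",  # assumed target type
        "name": record["title"],
        "author": {"@type": "Person", "name": record["author"]},
        "abstract": record["abstract"],
        "keywords": record["keywords"],
        "datePublished": record["date"],
    }
    return json.dumps(doc, indent=2, ensure_ascii=False)


# Placeholder values, not taken from any real thesis.
sample = {
    "title": "Example thesis title",
    "author": "Example Author",
    "abstract": "Example abstract text.",
    "keywords": "sociology, qualitative research",
    "date": "2015",
}
print(to_jsonld(sample))
```

The empirical fields (sampling method, data collection method, and so on) have no obvious one-to-one Schema.org counterpart, so a real mapping would need to decide how to carry them.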

Author Contributions

Conceptualization, G.V. and G.T.; methodology, G.V.; software, G.T.; validation, N.P., S.C. and T.N.; formal analysis, G.V.; investigation, G.T.; resources, N.P.; data curation, T.N.; writing—original draft preparation, S.C.; writing—review and editing, G.V.; visualization, G.T.; supervision, N.P.; project administration, G.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Term | Description
AI | Artificial Intelligence: Computer systems able to perform tasks that typically require human intelligence
DCMI | Dublin Core Metadata Initiative: The organization responsible for maintaining Dublin Core metadata standards
DDI | Data Documentation Initiative: A standard for documenting quantitative empirical research
DDI Lite | A simplified version of DDI with fewer descriptive fields, intended for lighter applications
DDI Study Description | The component of DDI used to describe the structure and content of quantitative research studies
Dublin Core | A metadata standard used to describe web resources and digital objects in a general and simplified way
Dublin Core Extensions | Additional elements beyond the core Dublin Core schema to support more detailed descriptions
EAD | Encoded Archival Description: A standard for encoding archival finding aids
ELLANIKOS | The Dublin Core-based documentation system for the gray literature at the University of the Aegean
ETD | Electronic Theses and Dissertations: Digital versions of academic theses and dissertations
FAIR | Findable, Accessible, Interoperable, and Reusable: Principles for scientific data management
HEI | Higher Education Institution: Universities and colleges providing post-secondary education
JSON-LD | JavaScript Object Notation for Linked Data: A method of encoding linked data using JSON
MARC | Machine-Readable Cataloging: One of the oldest electronic cataloging systems for libraries
Metadata | Data that describe other data, typically structured using description languages like XML
METS | Metadata Encoding and Transmission Standard: A framework for encoding descriptive, administrative, and structural metadata
MODS | Metadata Object Description Schema: A bibliographic element set for describing resources
NLP | Natural Language Processing: A branch of AI that helps computers understand human language
OAI-PMH | Open Archives Initiative Protocol for Metadata Harvesting: A protocol for harvesting metadata records
ONIX | Online Information Exchange: A standard for representing and communicating book industry product information
PREMIS | Preservation Metadata: Implementation Strategies: A standard for preservation metadata
RDF | Resource Description Framework: A framework for representing information about resources on the web
Schema.org | A collaborative vocabulary for structured data markup on web pages
URI | Uniform Resource Identifier: A string of characters that unambiguously identifies a particular resource
URL | Uniform Resource Locator: A reference to a web resource that specifies its location on a computer network
W3C | World Wide Web Consortium: The main international standards organization for the World Wide Web
XML | eXtensible Markup Language: A language used for describing and exchanging data through tags (fields), commonly applied in metadata schemes
XSD | XML Schema Definition: A way to formally describe the elements in an XML document

Appendix A. Overview of the DDI Study Description (Version 2)

The DDI Study Description (version 2) in Detail
The following is an overview of the most important fields of DDI that relate to the present work:

Appendix A.1. The stdyDscr (2.0)

The stdyDscr consists of information related to data collection in general and specific elements about the research. This section includes information about who collected the data, who distributes it, keywords for the data content, an abstract of the data content, and information about data collection and processing methods.

Appendix A.2. 2.1 Citation

2.1.1 titlStmt (Title Statement): Title statement for the research description.
2.1.1.1 Full title related to the research (corresponds to Dublin Core Title element).
2.1.1.2 Secondary title, used to reinforce possible limitations of the main title. May repeat information from the main title.
2.1.1.3 Title by which we usually refer to the research.
2.1.1.4 The title translated into another language.
2.1.1.5 Identification Number: Unique number or character sequence (corresponds to Dublin Core Identifier element).
2.1.2 rspStmt (Responsibility Statement): The person responsible for creating the research.
2.1.2.1 AuthEnty (Authoring Entity/Primary Investigator): The person, group, or service responsible for the intellectual content of the work (corresponds to Dublin Core Creator element).
2.1.2.2 othId (Other Identification/Acknowledgments): Responsibility statements not recorded in the title and responsibility statement area for the work. Named here are individuals or groups related to the work or significant persons connected with previous editions who have not already been mentioned by name in the description (corresponds to Dublin Core Contributor element).
2.1.3 prodStmt (Production Statement): Statements about the production of the research.
2.1.3.1 producer: The person or organization with financial or administrative responsibility for the physical processes by which the text was created (corresponds to Dublin Core Publisher element).
2.1.3.2 copyright: Copyright statement.
2.1.3.3 prodDate: Production date, when the data was produced.
2.1.3.4 prodPlac (Place of Production): Address of the organization that produced the research.
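The Dublin Core correspondences noted in the listing above can be collected into a small crosswalk table. This is an illustrative sketch: the listing does not name the DDI elements for the full title and the identification number, so `titl` and `IDNo` are assumed from the DDI Codebook element names, and the function name is hypothetical.

```python
# DDI citation element -> Dublin Core element, following the listing above.
DDI_TO_DC = {
    "titl": "title",          # 2.1.1.1 full title (element name assumed)
    "IDNo": "identifier",     # 2.1.1.5 identification number (element name assumed)
    "AuthEnty": "creator",    # 2.1.2.1 authoring entity
    "othId": "contributor",   # 2.1.2.2 other identification
    "producer": "publisher",  # 2.1.3.1 producer
}


def crosswalk(ddi_values: dict) -> dict:
    """Translate DDI citation values to Dublin Core keys; unmapped keys are dropped."""
    return {DDI_TO_DC[key]: value for key, value in ddi_values.items() if key in DDI_TO_DC}


# Placeholder values; prodPlac has no Dublin Core counterpart in the listing.
print(crosswalk({
    "titl": "Example thesis title",
    "AuthEnty": "Example Author",
    "prodPlac": "Example address",
}))
```

Elements without a stated Dublin Core counterpart, such as `prodPlac`, are exactly the ones a hybrid form has to carry as additional fields.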

Appendix A.3. The stdyInfo (2.2)

The stdyInfo (2.2) segment of DDI comprises the following elements:
2.2.1 subject: Generally describes the content of the data.
2.2.1.1 keyword: Describes keywords that characterize the data.
2.2.2 abstract: Presents the purpose, nature, and scope of data collection, specific characteristics of their content, as well as what questions the researcher conducting the research attempts to answer. A list of the main variables is also provided.
2.2.3 sumDscr: Presents information about the chronological and geographical coverage of the research and their units. In detail:
2.2.3.1 timePrd: Analyzes the time period to which the data refers. This is neither the time when documentation was carried out, nor when the data was collected.
2.2.3.2 collDate: Contains the dates of data collection.
2.2.3.3 nation: Reports the country or countries to which the data refers.
2.2.3.4 geogCover: Provides information about the geographical coverage of the data. Contains the complete geographical scope of the data.
2.2.3.5 geogUnit: Reports the smallest geographical unit (e.g., prefecture) covered by the data.
2.2.3.6 anlyUnit: Reports the basic unit of analysis or observation that the files describe (e.g., individuals, families/households, groups, organizations, administrative units).
2.2.3.7 universe: Informs about the group of individuals or other elements that are the subject of the research, to which the results/data refer. Age, nationality, and residence usually help characterize a specific environment (universe). Many other factors may be involved, such as gender, race, income, convictions, etc. The environment (universe) may consist of elements other than persons, such as households, legal cases, deaths, countries, etc. Generally, it should be possible to determine from the description of the universe whether a specific individual or element (hypothetical or real) is a member of the research population.
2.2.3.8 dataKind: Informs about the type of data in the file: research data, census data, clinical-medical data, experimental data, psychological data, etc.

Appendix A.4. The Method (2.3)

2.3.1 dataColl: Contains information about the data collection methodology.
2.3.1.1 timeMeth: Informs about the temporal method for data collection.
2.3.1.2 dataCollector: Informs about the person or persons responsible for data collection.
2.3.1.3 frequenc: Informs about the frequency (if any) of data collection.
2.3.1.4 sampProc: Informs about the type of sample (e.g., random).
2.3.1.5 collMode: Informs about the method of data collection (e.g., telephone interviews).
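To show how the summary and method elements listed in this appendix nest, the sketch below assembles a study description fragment. It is illustrative only: element names follow the DDI Codebook spelling (`stdyDscr`, `stdyInfo`, `sampProc`), which is an assumption wherever it differs from the listing, and the helper name and all values are placeholders.

```python
import xml.etree.ElementTree as ET


def build_study_description(summary: dict, collection: dict) -> ET.Element:
    """Nest summary fields under stdyInfo/sumDscr and collection fields under method/dataColl."""
    stdy = ET.Element("stdyDscr")
    sum_dscr = ET.SubElement(ET.SubElement(stdy, "stdyInfo"), "sumDscr")
    for tag, value in summary.items():
        ET.SubElement(sum_dscr, tag).text = value
    data_coll = ET.SubElement(ET.SubElement(stdy, "method"), "dataColl")
    for tag, value in collection.items():
        ET.SubElement(data_coll, tag).text = value
    return stdy


# Placeholder values, not taken from any real thesis.
study = build_study_description(
    summary={"nation": "Greece", "anlyUnit": "individuals", "universe": "Example population"},
    collection={"sampProc": "random", "collMode": "telephone interviews"},
)
print(ET.tostring(study, encoding="unicode"))
```

The proposed form flattens this nesting into plain labeled fields, so a non-specialist never has to see the element hierarchy.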

Appendix B. Dublin Core in Detail

The following defines the fifteen (15) elements and describes how they should be used:
Title (Label: “Title”): The name given to the resource, usually by the Creator or Publisher.
Author or Creator (Label: “Creator”): The person or organization primarily responsible for creating the intellectual content of the resource. For example, authors in the case of written texts, artists, photographers, or illustrators in the case of visual resources.
Subject and Keywords (Label: “Subject”): The topic of this resource. Typically, the subject will be expressed as keywords or phrases that describe the topic or content of the resource. The use of controlled vocabularies and formal classification schemes is encouraged.
Description (Label: “Description”): A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources.
Publisher (Label: “Publisher”): The responsible entity that made the resource available in its present form, such as a publishing house, a university department, or a corporate entity.
Contributor (Label: “Contributor”): A person or organization not identified in the Creator element who has made a significant intellectual contribution to the resource but whose contribution is secondary to any person or organization identified in the Creator element (for example, editor, translator, and illustrator).
Date (Label: “Date”): A date related to the creation or availability of the resource. Such a date should not be confused with one belonging in the Coverage element, which is associated with the resource only to the extent that the intellectual content is somehow about that date.
Resource Type (Label: “Type”): The category of the resource, such as homepage, novel, poem, work report, technical report, exhibition, dictionary. For the sake of functionality, the Type must be selected from an enumerated set of terms.
Format (Label: “Format”): The data format of the resource, used to identify the software and possibly hardware that may be needed to display or operate the resource. For the sake of functionality, the Format must be selected from an enumerated set of terms.
Resource Identifier (Label: “Identifier”): A phrase or number used to uniquely identify the resource. Examples for network resources include URLs or URNs (when used). Other globally unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names are also candidates for this element.
Source (Label: “Source”): Information about a second resource from which the present resource is derived. While it is generally recommended that elements contain information about the present resource only, this element may contain the date, creator, identifier, or other data for a second resource when considered important for the discovery of the present resource. In practice, it is preferable to use the Relation element instead. For example, it is possible to use a Source from 1603 in the description of a 1996 adaptation of a Shakespeare work, but it is preferable to use the “IsBasedOn” Relation with reference to a separate resource whose description contains the Date 1603. Source is not appropriate if the present resource is in its primary form.
Language (Label: “Language”): The language of the intellectual content of the resource.
Relation (Label: “Relation”): An identifier of a second resource and its relationship to the present resource. This element allows the declaration of links between related resources and resource descriptions. Examples include a version of a work (IsVersionOf), a translation of a work (IsBasedOn), a chapter of a book (IsPartOf), and a formatting of data in an image (IsFormatOf). For the sake of functionality, relationships must be selected from an enumerated set of terms.
Coverage (Label: “Coverage”): The spatial or temporal characteristics of the intellectual content of the resource. Spatial coverage refers to a physical area (e.g., celestial sector) described using coordinates (e.g., longitude and latitude) or place names that are taken from a specified list or fully spelled out. Temporal coverage refers to what the resource is about, not when it was created or made available (the latter belongs to the Date element).
Rights (Label: “Rights”): A statement of usage rights, a code leading to a usage rights statement, or a code leading to a service providing information about the usage rights of the resource.

Appendix C. XML Description of the Proposed Documentation System

XML Schema Definition (XSD) of the Proposed Model

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">

  <!-- Root element -->
  <xs:element name="total">
    <xs:complexType>
      <xs:sequence>

        <xs:element name="Writer">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="surName" type="xs:string"/>
              <xs:element name="firstName" type="xs:string"/>
              <xs:element name="fathersName" type="xs:string"/>
            </xs:sequence>
            <xs:attribute name="gender" type="xs:string" use="optional"/>
          </xs:complexType>
        </xs:element>

        <xs:element name="Dissertation">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Title" type="xs:string"/>
              <xs:element name="nameOfSupervisor" type="xs:string"/>
              <xs:element name="dateOfSupport" type="xs:date"/>
              <xs:element name="Abstract" type="xs:string"/>
              <xs:element name="KeyWords" type="xs:string"/>
              <xs:element name="altTitle" type="xs:string"/>
              <xs:element name="typeOfDissertation" type="xs:string"/>
              <xs:element name="Program" type="xs:string"/>
              <xs:element name="nameOfCom1" type="xs:string"/>
              <xs:element name="nameOfCom2" type="xs:string"/>
              <xs:element name="otherNotes" type="xs:string"/>
              <xs:element name="libraryLink" type="xs:anyURI"/>
              <xs:element name="Institution" type="xs:string"/>
              <xs:element name="Department" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>

        <xs:element name="Survey">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Strategy" type="xs:string"/>
              <xs:element name="sampleProd" type="xs:string"/>
              <xs:element name="Spacetime">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="timeAnalysis" type="xs:string"/>
                    <xs:element name="spaceAnalysis" type="xs:string"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
              <xs:element name="Target" type="xs:string"/>
              <xs:element name="studyQuestions" type="xs:string"/>
              <xs:element name="studyAssumptions" type="xs:string"/>
              <xs:element name="workAssumptions" type="xs:string"/>

              <xs:element name="Sample">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="Size" type="xs:string"/>
                    <xs:element name="Description" type="xs:string"/>
                    <xs:element name="basicUnit" type="xs:string"/>
                    <xs:element name="collectionMethod" type="xs:string"/>
                    <xs:element name="methodOfRecording" type="xs:string"/>
                    <xs:element name="methodOfAnalysis" type="xs:string"/>
                    <xs:element name="recordingTools" type="xs:string"/>
                    <xs:element name="analysisTools" type="xs:string"/>
                    <xs:element name="sampleTimeReference" type="xs:string"/>
                    <xs:element name="sampleCollectionReference" type="xs:string"/>
                    <xs:element name="statesOfReference" type="xs:string"/>
                    <xs:element name="geographicalCover" type="xs:string"/>
                    <xs:element name="geographicalUnit" type="xs:string"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>

            </xs:sequence>
          </xs:complexType>
        </xs:element>

      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
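To make the structure of the schema more concrete, the following Python sketch builds a record whose elements appear in the order declared by the xs:sequence blocks, and then checks that order. It is a minimal sketch using only the standard library, not a substitute for validation with an XSD processor. The `SEQUENCES` table is transcribed from the schema; the helper names `build` and `order_problems` and all sample values are assumptions introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Child order transcribed from the xs:sequence declarations of the schema.
SEQUENCES = {
    "total": ["Writer", "Dissertation", "Survey"],
    "Writer": ["surName", "firstName", "fathersName"],
    "Dissertation": [
        "Title", "nameOfSupervisor", "dateOfSupport", "Abstract", "KeyWords",
        "altTitle", "typeOfDissertation", "Program", "nameOfCom1",
        "nameOfCom2", "otherNotes", "libraryLink", "Institution", "Department",
    ],
    "Survey": [
        "Strategy", "sampleProd", "Spacetime", "Target", "studyQuestions",
        "studyAssumptions", "workAssumptions", "Sample",
    ],
    "Spacetime": ["timeAnalysis", "spaceAnalysis"],
    "Sample": [
        "Size", "Description", "basicUnit", "collectionMethod",
        "methodOfRecording", "methodOfAnalysis", "recordingTools",
        "analysisTools", "sampleTimeReference", "sampleCollectionReference",
        "statesOfReference", "geographicalCover", "geographicalUnit",
    ],
}


def build(tag, values):
    """Create `tag` with children in schema order; missing leaves stay empty."""
    element = ET.Element(tag)
    for name in SEQUENCES[tag]:
        if name in SEQUENCES:  # nested complex element
            element.append(build(name, values.get(name, {})))
        else:  # simple leaf element
            ET.SubElement(element, name).text = str(values.get(name, ""))
    return element


def order_problems(element):
    """Return the tags whose children do not match the declared sequence."""
    problems = []
    expected = SEQUENCES.get(element.tag)
    if expected is not None and [c.tag for c in element] != expected:
        problems.append(element.tag)
    for child in element:
        problems.extend(order_problems(child))
    return problems


# Illustrative values only.
record = build("total", {
    "Writer": {"surName": "Doe", "firstName": "Jane", "fathersName": "John"},
    "Dissertation": {"Title": "An example thesis", "dateOfSupport": "2025-06-18"},
    "Survey": {"Strategy": "Qualitative", "Sample": {"Size": "12"}},
})
record.find("Writer").set("gender", "F")  # optional attribute in the schema

xml_text = ET.tostring(record, encoding="unicode")
print(order_problems(record))
```

Note that dateOfSupport is typed as xs:date, so a real record needs an ISO date (YYYY-MM-DD) there; the empty placeholders this sketch writes for missing fields would not pass full XSD validation for that element.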

Appendix D. Detailed Description of Fields

The model consists of three main sections: Writer, Dissertation, and Survey. Below are the field descriptions, with field names as they appear in the XML structure.
  • Writer (Author Information)
    1.1 surName – Author’s surname
    1.2 firstName – Author’s first name
    1.3 fathersName – Author’s father’s name
    1.4 gender – Gender (optional attribute of Writer)
  • Dissertation (General Information about the Work)
    2.1 Title – The title of the work
    2.2 nameOfSupervisor – The name of the supervising professor
    2.3 dateOfSupport – Date of defense
    2.4 Abstract – Brief summary
    2.5 KeyWords – Keywords or key phrases
    2.6 altTitle – Alternative title (or title in English)
    2.7 typeOfDissertation – Type of work (e.g., undergraduate, graduate)
    2.8 Program – The name of the program under which the work was conducted
    2.9 nameOfCom1 – The name of committee member (1)
    2.10 nameOfCom2 – The name of committee member (2)
    2.11 otherNotes – Other useful notes about the work
    2.12 libraryLink – The internet link to the library
    2.13 Institution – The institution where the work was conducted
    2.14 Department – The department where the work was conducted
  • Survey (General Information about Potential Empirical Research)
    3.1 Strategy – Research strategy (e.g., quantitative/qualitative/mixed)
    3.2 sampleProd – Data production (e.g., primary/secondary)
    3.3 Spacetime – Space-time management
      3.3.1 timeAnalysis – Temporal analysis (e.g., Static, Dynamic, Comparative)
      3.3.2 spaceAnalysis – Spatial analysis (e.g., Single-regional)
    3.4 Target – The purpose of the research
    3.5 studyQuestions – Research questions
    3.6 studyAssumptions – Research hypotheses
    3.7 workAssumptions – Working hypotheses
    3.8 Sample – Regarding the sample/population
      3.8.1 Size – The size
      3.8.2 Description – Description (e.g., specific group of people)
      3.8.3 basicUnit – Basic observation unit (e.g., Individual)
      3.8.4 collectionMethod – Sample collection method (e.g., Random) and description (e.g., sampling)
      3.8.5 methodOfRecording – Sample recording method (e.g., Interview)
      3.8.6 methodOfAnalysis – Analysis method (e.g., ANOVA, Relational)
      3.8.7 recordingTools – Recording tools (e.g., Interview Guide)
      3.8.8 analysisTools – Analysis tools
      3.8.9 sampleTimeReference – Data reference time period (from-to)
      3.8.10 sampleCollectionReference – Data collection time period (from-to)
      3.8.11 statesOfReference – Country or countries of data reference
      3.8.12 geographicalCover – Geographical coverage of data (e.g., regions)
      3.8.13 geographicalUnit – The smallest geographical unit (e.g., communities)
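The nesting of these fields can be illustrated by flattening a record into path/value pairs, which is one simple way to prepare the fields for indexing or filtering. The following Python sketch is an assumption-laden illustration rather than part of the described system: the `flatten` helper and the path notation are hypothetical, the fragment is deliberately incomplete (it would not validate against the full schema), and the values are taken from the examples in the list above. Element names follow the XSD in Appendix C.

```python
import xml.etree.ElementTree as ET

# Incomplete, illustrative fragment; element names follow the XSD.
FRAGMENT = """
<total>
  <Writer gender="F">
    <surName>Doe</surName>
  </Writer>
  <Survey>
    <Strategy>Qualitative</Strategy>
    <Spacetime>
      <timeAnalysis>Static</timeAnalysis>
      <spaceAnalysis>Single-regional</spaceAnalysis>
    </Spacetime>
    <Sample>
      <basicUnit>Individual</basicUnit>
      <methodOfRecording>Interview</methodOfRecording>
    </Sample>
  </Survey>
</total>
"""


def flatten(element, prefix=""):
    """Map every leaf element and attribute to a slash-separated path."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    fields = {}
    for name, value in element.attrib.items():
        fields[f"{path}/@{name}"] = value
    children = list(element)
    if not children:  # leaf element: keep its text content
        fields[path] = (element.text or "").strip()
    for child in children:
        fields.update(flatten(child, path))
    return fields


fields = flatten(ET.fromstring(FRAGMENT))
for path, value in fields.items():
    print(path, "=", value)
```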

Appendix E. User Feedback Questionnaire

User Feedback Questionnaire on Metadata Documentation Framework and System Prototype
  • Section A: General Information
1. Your role (select one):
□ Student  □ Faculty Supervisor  □ Repository Administrator
2. Have you used any metadata systems or repositories before?
□ Yes  □ No
  • Section B: Usability (for all participants)
3. Was the interface easy to navigate?
□ Very easy  □ Somewhat easy  □ Difficult  □ Very difficult
4. Were the field labels and instructions clear?
□ All clear  □ Somewhat clear  □ Mostly unclear  □ Not clear at all
5. Which fields were unclear or confusing to you?
 
6. Did the system support your documentation needs for empirical content?
□ Fully  □ Partially  □ Not at all
Comments:
 
  • Section C: Role-Specific Questions
For Students:
7. Did the system help you reflect more deeply on your research design and methodology?
□ Yes  □ No  □ Not sure
8. Which features would make the process easier for you? (check all that apply)
□ Field examples/tooltips  □ Dropdown menus  □ Field autofill
□ Metadata preview before submission  □ Other:
For Faculty:
9. Do you believe this framework improves students’ research documentation?
□ Yes  □ Somewhat  □ No
10. Would you use this tool in your supervision or teaching?
□ Yes, in both  □ Only in supervision  □ Only in teaching  □ Not interested
For Repository Administrators:
11. Is the system compatible with your current repository infrastructure (e.g., DSpace, EPrints)?
□ Yes  □ Partially  □ No
12. What features would support scalability in your institution? (check all that apply)
□ Bulk metadata import  □ Mandatory field validation
□ Role-based access control  □ Logging and audit trail
Between 24 and 27 July 2025, nine participants from the Hellenic Mediterranean University were invited to interact with a functional demo of the system. The group included:
  • Five graduate students from the Department of Social Work
  • Three faculty supervisors with thesis advising responsibilities
  • One institutional repository administrator
Participants completed a short task-based walkthrough and responded to the structured feedback form reproduced above. While this was not a formal usability study, the responses provided useful early-stage insight into the strengths and limitations of the system from a user-centered perspective.
  • Summary of Key Findings
Students:
  • Found the form-based interface intuitive and appreciated the structured separation between general and empirical metadata.
  • Requested clearer definitions for fields such as Observation Unit, Geographical Unit, and Time/Space Management.
  • Suggested adding tooltips, dropdowns for recurring values, and previews of the completed metadata record.
  • One student noted: “It really helped me clarify how to document my sample and method without being too technical.”
Faculty Supervisors:
  • Supported the framework’s role in guiding students toward more complete methodological documentation.
  • Recommended making certain fields (e.g., research hypotheses) optional for qualitative or exploratory work.
  • Proposed including a general-purpose “Notes” field to capture context not reflected in standard metadata.
Repository Administrator:
  • Validated the compatibility of the XML export structure with institutional repositories.
  • Suggested adding basic field validation and a batch import feature for legacy theses.
  • Implications and Next Steps
Based on this feedback, the following minor enhancements are planned:
  • Integration of contextual help (tooltips) for complex fields
  • Optional metadata fields for non-structured research outputs
  • Role-based access control for students and repository staff
  • Exploration of a batch-entry module for archival scalability
Note: This feedback activity was conducted internally for prototype testing and did not collect or store any personal or sensitive data.

Appendix F. The Web Application Developed

With the theoretical model of the new classification system in place, we implemented it as a working application.
We built an online web application that applies the ideas described above and is centered on facilitating the search for theses that include empirical research. Without going into implementation details, we note only the technologies used, all of which are open-source software: dynamic web pages written in PHP and served by the Apache web server, a MySQL database, Bootstrap and jQuery, and some standalone JavaScript and CSS.
Figure A1 shows a screenshot of the running application.
Figure A1. Search database with filters.
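The filter-based search shown in Figure A1 can be approximated with a few lines of code. The sketch below is written in Python with an in-memory SQLite database as a stand-in; it is not the PHP/MySQL implementation described above. The `theses` table, its column names, the `search` helper, and the sample rows are all hypothetical. The point it illustrates is that each selected filter adds one parameterized condition and that only whitelisted columns may be filtered.

```python
import sqlite3

# Columns a user may filter on (hypothetical names modeled on the Survey fields).
FILTERABLE = ("strategy", "sample_prod", "states_of_reference")


def open_demo_db():
    """Create an in-memory database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE theses ("
        "title TEXT, strategy TEXT, sample_prod TEXT, states_of_reference TEXT)"
    )
    conn.executemany(
        "INSERT INTO theses VALUES (?, ?, ?, ?)",
        [
            ("Thesis A", "Quantitative", "Primary", "Greece"),
            ("Thesis B", "Qualitative", "Primary", "Greece"),
            ("Thesis C", "Qualitative", "Secondary", "Cyprus"),
        ],
    )
    return conn


def search(conn, **filters):
    """Return titles matching all given filters, combined with AND."""
    clauses, params = [], []
    for column, value in filters.items():
        if column not in FILTERABLE:  # never interpolate unchecked names
            raise ValueError(f"unsupported filter: {column}")
        clauses.append(f"{column} = ?")
        params.append(value)
    sql = "SELECT title FROM theses"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY title"
    return [row[0] for row in conn.execute(sql, params)]


conn = open_demo_db()
print(search(conn, strategy="Qualitative", states_of_reference="Greece"))
```

Values are passed as query parameters rather than concatenated into the SQL string, which keeps user input out of the statement text.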

References

  1. Papenmeier, A.; Krämer, T.; Friedrich, T.; Hienert, D.; Kern, D. Genuine information needs of social scientists looking for data. Proc. Assoc. Inf. Sci. Technol. 2021, 58, 292–302. [Google Scholar] [CrossRef]
  2. Nicholson, S.; Bennett, T. Do institutional repository deposit guidelines deter data discovery? Evid. Based Libr. Inf. Pract. 2021, 16, 2–17. [Google Scholar] [CrossRef]
  3. Osman, R.; Yanti Idaya, A.M.K.; Abrizah, A. Metadata matters: Evaluating the quality of Electronic Theses and Dissertations (ETDs) descriptions in Malaysian institutional repositories. Malays. J. Libr. Inf. Sci. 2023, 28, 109–125. [Google Scholar] [CrossRef]
  4. Khan, A.; Loan, F.; Parray, U.; Rashid, S. Global overview of research data repositories: An analysis of re3data registry. Inf. Discov. Deliv. 2024, 52, 53–61. [Google Scholar] [CrossRef]
  5. Jowkar, A. Dublin Core Metadata Element Set usage in national libraries’ web sites. Electron. Libr. 2009, 27, 441–447. [Google Scholar] [CrossRef]
  6. Standard ANSI/NISO Z39.85-2017; The Dublin Core Metadata Element Set. NISO Press: Baltimore, MD, USA, 2017.
  7. Baker, T. A Grammar of Dublin Core; Technical Report; GMD-German National Research Center for Information Technology: St. Augustin, Germany, 2000. [Google Scholar]
  8. Dublin Core. Detailed Information About the Dublin Core, Its Activities and Metadata Sets. 2001. Available online: http://www.dublincore.org (accessed on 7 July 2025).
  9. Avram, H. Machine-Readable Cataloging (MARC); The Library of Congress: Washington, DC, USA, 2003. [Google Scholar]
  10. Avram, H. MARC: Its History and Implications; The Library of Congress: Washington, DC, USA, 1975. [Google Scholar]
  11. DDI Alliance. The Data Documentation Initiative Description; DDI Alliance: Ann Arbor, MI, USA, 2017; Available online: https://ddialliance.org/ (accessed on 30 July 2025).
  12. Mosha, N.F.; Ngulube, P. Metadata Standard for Continuous Preservation, Discovery, and Reuse of Research Data in Repositories by Higher Education Institutions: A Systematic Review. Information 2023, 14, 427. [Google Scholar] [CrossRef]
  13. Furrie, B. Follett Software Company. Understanding MARC Bibliographic: Machine-Readable Cataloging, 8th ed.; Library of Congress: Washington, DC, USA, 2009. [Google Scholar]
  14. Weibel, S. The Dublin Core: A Simple Content Description Model for Electronic Resources. Bull. Am. Soc. Inf. Sci. Technol. 1997, 24, 9–11. [Google Scholar] [CrossRef]
  15. Ward, J. Unqualified Dublin Core usage in OAI-PMH data providers. OCLC Syst. Serv. 2004, 20, 40–47. [Google Scholar] [CrossRef]
  16. Greenberg, J.; Pattuelli, M.; Parsia, B.; Robertson, W. Author-generated Dublin Core metadata for web resources: A baseline study in an organization. J. Digit. Inf. 2001, 2, 38–46. [Google Scholar] [CrossRef]
  17. Robertson, T.; Döring, M.; Guralnick, R.; Bloom, D.; Wieczorek, J.; Braak, K.; Otegui, J.; Russell, L.; Desmet, P. The GBIF Integrated Publishing Toolkit: Facilitating the Efficient Publishing of Biodiversity Data on the Internet. PLoS ONE 2014, 9, e102623. [Google Scholar] [CrossRef] [PubMed]
  18. Vardigan, M. Data Documentation Initiative: Toward a Standard for the Social Sciences. Int. J. Digit. Curation 2007, 3, 107–113. [Google Scholar] [CrossRef]
  19. Radler, B.; Lyle, J.; Johnson, J. A DDI Primer: An Overview and Examples of DDI in Action; University of Wisconsin–Madison, Data Documentation Initiative (DDI) Working Group: Madison, WI, USA, 2016. [Google Scholar]
  20. CESSDA Qualitative Metadata Working Group. *Best Practice Guidelines for Qualitative Data Documentation*. 2021. Available online: https://www.cessda.eu/ (accessed on 30 July 2025).
  21. QSR International. NVivo Qualitative Data Analysis Software. Available online: https://www.qsrinternational.com/nvivo-qualitative-data-analysis-software/home (accessed on 14 June 2025).
  22. Southall, H.; Woollard, M. A Qualitative Data Model for DDI; Working Paper; DDI Alliance: Ann Arbor, MI, USA, 2011. [Google Scholar]
  23. Library of Congress. MODS: Metadata Object Description Schema: Official Web Site. Available online: http://www.loc.gov/standards/mods/ (accessed on 7 July 2025).
  24. Guenther, R. MODS: The Metadata Object Description Schema. Cat. Classif. Quarterly 2003, 36, 81–91. [Google Scholar] [CrossRef]
  25. Library of Congress. METS: Metadata Encoding and Transmission Standard: Official Web Site. Available online: http://www.loc.gov/standards/mets/ (accessed on 7 July 2025).
  26. McDonough, J. METS: Standardized Encoding for Digital Library Objects. Int. J. Digit. Libr. 2006, 6, 148–158. [Google Scholar] [CrossRef]
  27. Library of Congress. EAD: Encoded Archival Description: Official Web Site. Available online: https://www.loc.gov/ead/ (accessed on 7 July 2025).
  28. Pitti, D.V. Encoded Archival Description: An Introduction and Overview. D-Lib Mag. 1999, 5, 11. [Google Scholar] [CrossRef]
  29. Schema.org. Available online: https://schema.org/ (accessed on 7 July 2025).
  30. Ronallo, J. HTML5 Microdata and Schema.org. Code4Lib J. 2012, 16. Available online: https://journal.code4lib.org/articles/6400 (accessed on 15 July 2025).
  31. W3C. RDF 1.1 Primer. 2014. Available online: https://www.w3.org/TR/rdf11-primer/ (accessed on 7 July 2025).
  32. Haslhofer, B.; Isaac, A. data.europeana.eu: The Europeana Linked Open Data Pilot. In Proceedings of the International Conference on Dublin Core and Metadata Applications, The Hague, The Netherlands, 21–23 September 2011; pp. 94–104. [Google Scholar]
  33. EDItEUR. ONIX for Books. Available online: https://www.editeur.org/83/Overview/ (accessed on 7 July 2025).
  34. Library of Congress. PREMIS: Preservation Metadata Maintenance Activity. Available online: https://www.loc.gov/standards/premis/ (accessed on 7 July 2025).
  35. Caplan, P. Understanding PREMIS: An Overview of the PREMIS Data Dictionary for Preservation Metadata; Library of Congress: Washington, DC, USA, 2009. [Google Scholar]
  36. Wira-Alam, A.; Dimitrov, D.; Zenk-Möltgen, W. Extending Basic Dublin Core Elements for an Open Research Data Archive. In Proceedings of the 2012 International Conference on Dublin Core and Metadata Applications, Kuching Sarawak, Malaysia, 3–7 September 2012. [Google Scholar]
Figure 1. Research methodology for metadata framework development.
Table 1. Comparison of metadata standards for thesis documentation.
Feature | Dublin Core | DDI | Schema.org | RDF | Proposed Model
Intended for Theses××××
Bibliographic Description
Empirical Research Metadata×××
Support for Qualitative Methods×partial××
Ease of Use for Students××
Semantic Web Integration××× (Future Work)
Table 2. Temporary metadata entry form.
Metadata Field | Options
Research Strategy | 0 Quantitative, 1 Qualitative, 2 Mixed
Data Source | 0 Primary, 1 Secondary, 2 Mixed
Title | 0 Absent, 1 Present
Time/Space Management | 0 Non-historical, 1 Historical-comparative
Abstract | 0 Absent, 1 Present
Purpose | 0 Absent, 1 Present
Research Questions | 0 Absent, 1 Present
Working Hypotheses | 0 Absent, 1 Present
Research Hypotheses | 0 Absent, 1 Present
Reference Date for Data | 0 Not Mentioned, 1 Mentioned
Data Collection Date | 0 Not Mentioned, 1 Mentioned
Country/Countries of Study | 0 Not Mentioned, 1 Mentioned
Geographical Coverage | 0 Not Mentioned, 1 Mentioned
Geographical Unit | 0 Not Mentioned, 1 Mentioned
Population | 0 Not Mentioned, 1 Mentioned
Observation Unit | 0 Not Mentioned, 1 Mentioned
Sample | 0 Not Mentioned, otherwise the sample size (number)
Sampling Method | 0 Not Mentioned, 1 Mentioned
Data Collection Method | 0 Not Mentioned, 1 Mentioned
Analysis Method | 0 Not Mentioned, 1 Mentioned
Other Sources | 0 Not Mentioned, 1 Mentioned
Research Problems | 0 Not Mentioned, 1 Mentioned
Research Ethics | 0 Not Mentioned, 1 Mentioned
Recording Tools | 0 Not Mentioned, 1 Mentioned
Analysis Tools | 0 Not Mentioned, 1 Mentioned
Table 3. Key metadata fields selected for final model.
# | Metadata Field | Options
1 | Research Strategy | 0 Quantitative, 1 Qualitative, 2 Mixed
2 | Data Source | 0 Primary, 1 Secondary, 2 Mixed
3 | Time/Space Management | 0 Non-historical, 1 Historical-comparative
4 | Research Questions | 0 Absent, 1 Present
5 | Research Hypotheses | 0 Absent, 1 Present
6 | Reference Date for Data | 0 Not Mentioned, 1 Mentioned
7 | Data Collection Date | 0 Not Mentioned, 1 Mentioned
8 | Country/Countries of Study | 0 Not Mentioned, 1 Mentioned
9 | Geographical Coverage | 0 Not Mentioned, 1 Mentioned
10 | Geographical Unit | 0 Not Mentioned, 1 Mentioned
11 | Population | 0 Not Mentioned, 1 Mentioned
12 | Observation Unit | 0 Not Mentioned, 1 Mentioned
13 | Sample | 0 Not Mentioned, otherwise the sample size
14 | Sampling Method | 0 Not Mentioned, 1 Mentioned
15 | Data Collection Method | 0 Not Mentioned, 1 Mentioned
16 | Analysis Method | 0 Not Mentioned, 1 Mentioned
17 | Recording Tools | 0 Not Mentioned, 1 Mentioned
18 | Analysis Tools | 0 Not Mentioned, 1 Mentioned
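The coding scheme of Table 3 can be read as a small codebook that maps each numeric code to its label. The following Python sketch shows one possible representation; the `CODEBOOK` structure and the `decode` helper are assumptions introduced for illustration, while the field names and code labels are taken from the table.

```python
STRATEGY = {0: "Quantitative", 1: "Qualitative", 2: "Mixed"}
DATA_SOURCE = {0: "Primary", 1: "Secondary", 2: "Mixed"}
TIME_SPACE = {0: "Non-historical", 1: "Historical-comparative"}
ABSENT_PRESENT = {0: "Absent", 1: "Present"}
MENTIONED = {0: "Not Mentioned", 1: "Mentioned"}

# Fields of Table 3 that use the Not Mentioned / Mentioned coding.
MENTIONED_FIELDS = (
    "Reference Date for Data", "Data Collection Date",
    "Country/Countries of Study", "Geographical Coverage",
    "Geographical Unit", "Population", "Observation Unit",
    "Sampling Method", "Data Collection Method", "Analysis Method",
    "Recording Tools", "Analysis Tools",
)

CODEBOOK = {
    "Research Strategy": STRATEGY,
    "Data Source": DATA_SOURCE,
    "Time/Space Management": TIME_SPACE,
    "Research Questions": ABSENT_PRESENT,
    "Research Hypotheses": ABSENT_PRESENT,
}
CODEBOOK.update({name: MENTIONED for name in MENTIONED_FIELDS})


def decode(field, code):
    """Translate a numeric code into its label for the given field."""
    if field == "Sample":  # 0 means not mentioned, any other value is the size
        return "Not Mentioned" if code == 0 else f"Sample size {code}"
    labels = CODEBOOK[field]  # raises KeyError for unknown fields
    if code not in labels:
        raise ValueError(f"invalid code {code!r} for {field!r}")
    return labels[code]


print(decode("Research Strategy", 2))
print(decode("Sample", 25))  # 25 is an illustrative value
```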
Table 4. Quantitative theses.
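Field | 1st | 2nd | 3rd | 4th | 5th | 6th
Research Strategy | 0 | 0 | 0 | 0 | 0 | 0
Data Source | 0 | 0 | 0 | 0 | 0 | 0
Title | 0 | 0 | 0 | 1 | 0 | 1
Time/Space Management | 0 | 0 | 0 | 0 | 0 | 0
Abstract | 1 | 1 | 0 | 0 | 0 | 1
Purpose | 1 | 1 | 1 | 1 | 1 | 1
Research Questions | 0 | 1 | 1 | 1 | 1 | 1
Working Hypotheses | 0 | 1 | 1 | 1 | 1 | 1
Research Hypotheses | 0 | 1 | 1 | 1 | 1 | 1
Reference Date | 1 | 1 | 1 | 1 | 1 | 1
Data Collection Date | 1 | 1 | 1 | 1 | 1 | 1
Countries of Reference | 1 | 1 | 1 | 1 | 1 | 1
Geographical Coverage | 1 | 1 | 1 | 1 | 1 | 1
Geographical Unit | 1 | 1 | 1 | 1 | 1 | 1
Population | 1 | 1 | 1 | 1 | 1 | 1
Observation Unit | 1 | 1 | 1 | 1 | 1 | 1
Sample | 110 | 212 | 300 | 150 | 250 | 320
Sampling Method | 1 | 1 | 1 | 1 | 1 | 1
Recording Method | 1 | 1 | 1 | 1 | 1 | 1
Analysis Method (ANOVA/Relational) | 1 | 1 | 1 | 1 | 1 | 1
Other Sources | 0 | 0 | 0 | 0 | 0 | 0
Research Problems | 0 | 0 | 0 | 0 | 0 | 0
Ethics in Research | 1 | 1 | 0 | 1 | 0 | 0
Recording Tools | 1 | 1 | 1 | 1 | 1 | 1
Analysis Tools | 0 | 1 | 1 | 1 | 1 | 1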
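One way to read Table 4 is to compute, for each binary field, the share of theses in which the field is present. The Python sketch below does this for a subset of the rows, transcribed from the table as digit strings. The `presence_rate` and `fields_at_least` helpers and the 0.8 threshold are assumptions made for illustration; they are not the selection criterion used for the final model, which is not stated in this part of the article.

```python
# Subset of the binary rows of Table 4, one digit per thesis (1st to 6th).
TABLE_4_ROWS = {
    "Title": "000101",
    "Abstract": "110001",
    "Purpose": "111111",
    "Research Questions": "011111",
    "Other Sources": "000000",
    "Research Problems": "000000",
    "Ethics in Research": "110100",
    "Analysis Tools": "011111",
}


def presence_rate(rows):
    """Return the fraction of theses in which each field is coded 1."""
    rates = {}
    for field, digits in rows.items():
        codes = [int(ch) for ch in digits]
        rates[field] = sum(codes) / len(codes)
    return rates


def fields_at_least(rows, threshold):
    """List the fields whose presence rate reaches the threshold."""
    return [f for f, rate in presence_rate(rows).items() if rate >= threshold]


rates = presence_rate(TABLE_4_ROWS)
for field, rate in rates.items():
    print(f"{field}: {rate:.2f}")

# Hypothetical threshold, chosen only to demonstrate the helper.
print(fields_at_least(TABLE_4_ROWS, 0.8))
```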
Table 5. Qualitative theses.
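Field | 1st | 2nd | 3rd | 4th | 5th | 6th
Research Strategy | 1 | 1 | 1 | 1 | 1 | 1
Data Source | 0 | 1 | 1 | 1 | 1 | 1
Title | 0 | 0 | 1 | 1 | 1 | 1
Time/Space Management | 0 | 0 | 0 | 1 | 0 | 1
Abstract | 0 | 0 | 0 | 1 | 1 | 1
Purpose | 1 | 1 | 1 | 1 | 1 | 1
Research Questions | 1 | 1 | 1 | 1 | 1 | 1
Working Hypotheses | 1 | 1 | 1 | 1 | 1 | 0
Research Hypotheses | 1 | 1 | 1 | 0 | 0 | 0
Reference Date | 1 | 1 | 1 | 1 | 1 | 1
Data Collection Date | 1 | 0 | 1 | 1 | 1 | 1
Countries of Reference | 1 | 0 | 1 | 0 | 1 | 1
Geographical Coverage | 1 | 0 | 0 | 1 | 1 | 0
Geographical Unit | 1 | 0 | 0 | 1 | 1 | 1
Population | 1 | 0 | 1 | 1 | 1 | 0
Observation Unit | 1 | 1 | 1 | 1 | 1 | 1
Sample12116310
Sampling Method | 1 | 1 | 1 | 1 | 1 | 1
Recording Method | 1 | 1 | 1 | 1 | 1 | 1
Analysis Method (ANOVA/Relational) | 1 | 1 | 1 | 0 | 0 | 0
Other Sources | 0 | 0 | 0 | 1 | 1 | 0
Research Problems | 0 | 0 | 0 | 1 | 0 | 0
Ethics in Research | 0 | 0 | 0 | 1 | 0 | 0
Recording Tools | 1 | 0 | 1 | 1 | 1 | 0
Analysis Tools | 1 | 0 | 0 | 1 | 1 | 1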
Table 6. Mixed-methods theses.
Field | Value 1 | Value 2
Research Strategy | 2 | 2
Data Source | 0 | 1
Title | 0 | 1
Time/Space Management | 0 | 0
Abstract | 1 | 0
Purpose | 1 | 1
Research Questions | 1 | 1
Working Hypotheses | 1 | 1
Research Hypotheses | 1 | 1
Reference Date | 1 | 1
Data Collection Date | 0 | 1
Countries of Reference | 0 | 1
Geographical Coverage | 0 | 1
Geographical Unit | 0 | 0
Population | 1 | 1
Observation Unit | 1 | 1
Sample612
Sampling Method | 1 | 1
Recording Method | 1 | 1
Analysis Method (ANOVA/Relational) | 1 | 1
Other Sources | 0 | 1
Research Problems | 0 | 1
Ethics in Research | 0 | 1
Recording Tools | 1 | 1
Analysis Tools | 0 | 1
Table 8. (a) Part A: Research-related fields. (b) Part B: General thesis description (Dublin Core-based).
(a)
# | Field | Description
1 | Research Strategy | Type of research conducted
2 | Data Production | How data were produced
3 | Time/Space Management | Historical or non-historical comparative approach
4 | Purpose |
5 | Research Questions | Clear presence of research questions
6 | Working Hypotheses | Clear presence of working hypotheses
7 | Research Hypotheses | Clear presence of research hypotheses
8 | Reference Time of Data | When the data refer to
9 | Data Collection Time | When the data were collected
10 | Country/Countries of Reference | Where the study was conducted
11 | Geographical Coverage | Areas covered (e.g., regions)
12 | Geographical Unit | Units (e.g., cities)
13 | Population | Which population the research refers to
14 | Observation Unit | Subsets of the sample
15 | Sample | Sample size
16 | Sampling Method | How the sample was selected (e.g., random)
17 | Data Recording Method | How data were recorded (e.g., interview)
18 | Data Analysis Method | e.g., ANOVA, relational
19 | Recording Tools | Presence of tools like interview guides
20 | Analysis Tools | Presence of tools like SPSS
(b)
# | Field
1 | Author Name
2 | Author Gender
3 | Title
4 | Supervisor Name
5 | Defense Date
6 | Number of Pages
7 | Language
8 | Abstract
9 | Keywords
10 | Alternative Title
11 | Committee Member 1
12 | Committee Member 2
13 | Additional Notes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Vassiliou, G.; Tsamis, G.; Chatzinikolaou, S.; Nipurakis, T.; Papadakis, N. Enhancing Discoverability: A Metadata Framework for Empirical Research in Theses. Algorithms 2025, 18, 490. https://doi.org/10.3390/a18080490
