Article

IT Challenges in Designing and Implementing Online Natural History Collection Systems

by Marcin Lawenda * and Paweł Wolniewicz
Poznan Supercomputing and Networking Center, Jana Pawła II 10, 61-139 Poznań, Poland
* Author to whom correspondence should be addressed.
Diversity 2025, 17(6), 388; https://doi.org/10.3390/d17060388
Submission received: 23 February 2025 / Revised: 27 May 2025 / Accepted: 28 May 2025 / Published: 30 May 2025

Abstract

Numerous institutions engaged in the management of Natural History Collections (NHC) are embracing the opportunity to digitise their holdings. The primary objective is to enhance the accessibility of specimens for interested individuals and to integrate them into the global community by contributing to an international specimen database. This initiative demands a comprehensive digitisation process and the development of an IT infrastructure that adheres to stringent functionality, reliability, and security standards. This endeavour focuses on the procedural and operational dimensions associated with accurately storing and managing taxonomic, biogeographic, and ecological data about biological specimens digitised within a conventional NHC framework. The authors suggest categorising the IT challenges into four distinct areas: requirements, digitisation, design, and technology. Each category discusses several selected topics, highlighting often underestimated essentials for implementing the NHC system. The presented analysis is supported by numerous examples of specific implementations, enabling a better understanding of the given topic. This document serves as a resource for teams developing their systems for online collections, offering post factum insights derived from implementation experiences.

Graphical Abstract

1. Introduction

Natural History Collections (NHCs) play a pivotal role in studying the diversity and variability of organisms. The digitisation of natural history collections by the institutions tasked with their upkeep and administration enhances the accessibility of the specimens they house for researchers and a wide array of nature enthusiasts. National information centres are experiencing a renaissance due to the IT revolution, including the development of Geographic Information Systems (GIS). This renaissance is expressed primarily in open access to ever-growing volumes of digital biodiversity data for all interested parties. Digital botanical collections are increasingly used in phenological research, studies on species extinction and invasion, and in species distribution modelling [1,2,3].
Systematic analysis of biodiversity data to reduce biodiversity loss and identify the essential problems that cause it is a significant challenge, one in which automation can help. However, this task remains unattainable without a global, unified observation system that would provide such data regularly [4]. This makes the initiatives of individual centres that create digital databases of specimens and integrate them with global platforms all the more valuable [5,6,7,8,9,10,11].
It is essential to recognise that the digitisation process, together with establishing infrastructure and associated services that adhere to international standards and ensure reliability and security, presents a highly intricate challenge that necessitates an interdisciplinary approach. Both the design and implementation of online systems for natural collections, such as herbaria or museum specimens, are closely associated with many IT-related challenges [12]. These challenges mainly concern technical implementation, but the ethical and logistical requirements should not be forgotten.
Many institutions are currently building systems that provide access to digitised natural collections, thus enabling both the use of the collected data and their analysis as part of synthetic environmental research. These systems differ in their scientific scope, the accuracy of specimen descriptions, the formats used, and the availability of supporting tools (e.g., exploration, presentation, visualisation). However, it can be argued that a critical mass of such systems has already been reached, allowing the potential benefits of combining the available knowledge to be seen. One of the ideas that goes in this direction and emphasises the importance of compatibility and interoperability of NHC systems is the concept of Essential Biodiversity Variables (EBVs) [4,13,14]. EBVs serve as fundamental indicators for evaluating shifts in biodiversity over time. They are instrumental in assessing adherence to biodiversity policies, monitoring advancements towards sustainable development objectives, and observing how biodiversity responds to disturbances and management strategies. The implementation of EBVs has implications for the design and implementation of databases, particularly in terms of structure, performance, and interoperability [15]. Storing different types of data (genetic, species, community, ecosystem, etc.) calls for hybrid databases supporting structured, unstructured, and geospatial data. Ensuring performance when handling large amounts of data requires scalable architectures such as distributed databases (e.g., Apache Cassandra [16], Hadoop-based systems [17]). Interoperability for collaboration between institutions requires the use of data standards and ontologies (e.g., Darwin Core [18], OGC standards [19]) and APIs to ensure integration.
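As a minimal illustration of such standards-based interoperability, the sketch below (in Python, assuming hypothetical internal field names rather than the AMUNATCOLL schema) maps an internal specimen record to Darwin Core terms before export to an external aggregator:

```python
# Illustrative only: maps a hypothetical internal record to Darwin Core terms.
# The left-hand field names are assumptions, not the actual AMUNATCOLL schema.
INTERNAL_TO_DWC = {
    "specimen_id": "occurrenceID",
    "genus": "genus",
    "species_epithet": "specificEpithet",
    "collector": "recordedBy",
    "collection_date": "eventDate",
    "latitude": "decimalLatitude",
    "longitude": "decimalLongitude",
    "institution": "institutionCode",
}

def to_darwin_core(record: dict) -> dict:
    """Return a Darwin Core-style dictionary, skipping fields that are missing."""
    return {dwc: record[src] for src, dwc in INTERNAL_TO_DWC.items()
            if record.get(src) is not None}

example = {
    "specimen_id": "AMU:POZ:000123",
    "genus": "Drosera",
    "species_epithet": "rotundifolia",
    "collector": "J. Kowalski",
    "collection_date": "1934-07-12",
    "institution": "AMU",
}
print(to_darwin_core(example))
```

Such a mapping layer keeps the internal schema free to accommodate local extensions while the exported records remain standard-compliant.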
The practical implementation of Essential Biodiversity Variables is discussed in the Bari Manifesto [20], a collection of guidelines designed for researchers and data infrastructure providers. This manifesto aims to facilitate the development of an operational framework for EBVs that is grounded in transnational and cross-infrastructure scientific workflows. It introduces ten principles for EBV data products in the following areas: data management plan, data structure, metadata, data quality, services, workflows, provenance, ontologies/vocabularies, data preservation, and accessibility. By concentrating on these areas, specialists can establish pathways for advancing technical infrastructure while allowing infrastructure providers the flexibility to determine the methods and timelines for implementation. This approach facilitates consensus and the adoption of guiding principles among the participating organisations. In the development of the AMUNATCOLL system, the creators endeavoured to adhere closely to these guidelines throughout both the design [5] and implementation (portal and database) phases [12,21].
Furthermore, another point of view on the design and implementation of NHC systems should be noted. Systems collecting information about biodiversity are usually composed of modules that reflect processes related to collecting and identifying specimens, converting analogue information to digital form (digitisation), storing it in an organised way, searching, presenting, and analysing it. Each of these stages has its own methodology and tools, together forming a research area of expertise called biodiversity informatics. Among the most critical IT techniques addressing the cycle of preparation, development, and maintenance of data, we can distinguish the following: string processing, metadata management, conceptual modelling, semantic web, machine learning, statistics, geographical information systems, and graph theory [22].
The design phase, which is the initial step of natural collection system development, has a profound impact on the overall functionality and usability of the system in its entirety [23]. To mitigate barriers to data discovery, the collected information must be organised in a manner that accurately reflects the most significant data groups and characteristics of the entities representing the digitised objects [24]. Defining the scope of processed information before the digitisation process is closely related to the choice of the documentation standard for specimens, samples, and other forms of analogue records [25]. Several standards (e.g., Darwin Core, ABCD) can, and indeed should, be the basis for defining the scope of stored data, if only for the sake of later interoperability with external repositories. In many cases, however, it turns out to be necessary to extend their scope due to the local specificity of the research being conducted [21].
The primary research goal of this work is to identify and discuss the most important challenges faced by the creators of IT systems that provide online access to natural history collections. On the one hand, this will help identify key factors that convey the level of complexity and the required scope of work to people who are only planning, or are at an early stage of implementing, such a system. On the other hand, by specifying the functionality on offer, it will contribute to the discussion on cooperation and data exchange between similar systems and thus to achieving synergy. A further goal is to systematise the identified challenges into defined categories, which allows the relationships and the impact of one area on another to be determined (e.g., the requirements of the target group on the offered functionality and, later on, the required infrastructure). In addition, the work discusses sample use cases encountered by the authors, along with sample solutions and their consequences (e.g., the method of date formatting).
Adam Mickiewicz University houses significant collections of algae, plants, fungi, and animals. This study reflects the authors’ experience developing an IT system for making data from these collections globally accessible [26,27,28].

2. Materials and Methods

The following fundamental assumptions were made in the design of the AMUNATCOLL IT system [29]: (i) including all the university’s natural history collections, (ii) addressing the needs of multiple user groups, and (iii) linking to global databases. Functionally, the system enables the collection, analysis, and open sharing of digitised data about natural specimens.
Numerous systems focus on presenting biodiversity information, differing in their objectives and the range of functionalities they provide users. In defining the functional scope of the implemented interfaces, insights from esteemed scientific institutions and IT databases were examined, particularly online databases, which are prevalent among scientific users. A comprehensive comparative analysis of the most significant websites and portals of this type is detailed in a separate publication [12]. While the findings from this analysis were highly informative, a decision was made to initiate the development of our own IT system. This decision was supported by several factors: (1) none of the known operators provide such systems to other users; (2) the highly regarded Global Biodiversity Information Facility platform offers its space and functionalities, but utilising them necessitates the development of sophisticated IT tools for the proper export of data; (3) the assessment of target groups revealed that their expectations extend beyond the functionalities available on the reviewed websites and portals. Creating a proprietary information system on biodiversity documented in the NHC necessitates collaboration between biologists and computer scientists with expertise in the relevant fields. This entails a long-term partnership between the institution responsible for the NHC and an institution possessing the requisite IT capabilities. In the AMUNATCOLL project, critical decisions were made during the conceptualisation phase of the future IT system. A team comprising representatives from the Faculty of Biology at Adam Mickiewicz University in Poznań and the Poznan Supercomputing and Networking Center was involved in preparing the project assumptions and documentation.
After examining several existing systems, we divided the challenges facing NHC system designers into four distinct but interacting domains: requirements, digitisation, design, and technology. Each domain has its related challenges (Figure 1), and sometimes, because of their interactions, a decision made to solve a challenge in one domain leads to modifications of decisions made to solve challenges in another. The following Section 4, Section 5, Section 6 and Section 7 discuss the challenges within each domain, the issues they raise, and the solutions adopted.

3. Challenges Categorisation

This work categorises the challenges encountered in NHC system development into four domains: requirements, digitisation, design, and technology (Figure 1). Nonetheless, given the nature of the analysed problems, their close interconnection should be emphasised: a given matter may belong to one or many categories, depending on the perspective. The category assignment process therefore adopted the principle of best match, with the above caveat noted. The “requirements” category includes the following challenges: using unified terminology (a common language), defining target groups and specifying their requirements, defining non-functional requirements and converting them into functional ones, and determining the scope of work. The “digitisation” category embraces the following: definition of the digitisation process along with automation mechanisms, procedures for dealing with data ambiguity, and mechanisms for detecting errors and introducing corrections/modifications. The “design” challenges focus on the following: the specification of the metadata structure used to store information about specimens, the data access policy taking into account the type of data and access roles, flexibility in defining the set of input data, taking into account limitations in defining interfaces resulting from the way the user interacts, and the use of so-called good practices when designing and implementing applications. The last, “technology”, category covers challenges related to the following: providing infrastructure adapted to project requirements, considering the requirements of the implementation process for development and production purposes, selecting tools for implementing the required applications, ensuring system security at the operational level, technological solutions for protecting copyright in data, particularly iconographic data, and securing data against loss.

4. Requirements

The implementation of an IT system begins with the specification of user requirements. Specifications and constraints outline the essential requirements that a system must fulfil, detailing its operational parameters and the conditions under which it functions. These established requirements are the foundation for a system’s design, development, testing, and implementation phases. Initially, non-functional requirements (also known as quality attributes) are articulated, focusing on aspects of a software system that do not pertain to specific functionalities or behaviours. These requirements delineate the operational characteristics of the system rather than its actions. They encompass various factors, including performance, scalability, security, availability, reliability, portability, etc. Non-functional requirements are subsequently converted into functional requirements that delineate the expected actions of the system, encompassing the necessary functions, operations, and business rules it must incorporate. Functional requirements articulate the technical specifications, data handling and processing, as well as other particular functionalities that dictate the outcomes the system is expected to produce. A detailed analysis of both types of requirements is a very laborious process that requires a systematic approach, and its description goes beyond the scope of this study. Nevertheless, selected aspects of system requirements analysis are presented below, focusing on target groups and interdisciplinarity in the context of defining the scope of work through the definition of functional requirements.

4.1. Target Groups

Identifying target groups at the outset of the NHC system specification is essential for accurately establishing non-functional requirements. These requirements serve as the foundation for a comprehensive system architecture description and are subsequently utilised to assess its performance. Furthermore, they are compared with functional requirements, which are outlined in the system design and specify particular behaviours or functions. Adaptation to target groups also includes the portal interface, which addresses users with different knowledge and interests. The approach taken in the deployment of the AMUNATCOLL system involves utilising various portal views (specifically in certain areas, such as search), tailored to the preferences of the user. At the time of registration on the portal, the user is required to choose one of the available views; however, this selection can be modified in the user’s profile at any point during regular usage.
For the AMUNATCOLL system, five distinct target groups have been identified: (i) scientists, (ii) state and local government administration, (iii) services and state officials, (iv) non-governmental organisations and society, and (v) education. Below, a description of each group is presented.

4.1.1. Scientists

This group is interested in the full information available in the specimen database. It needs the most advanced mechanisms for searching for information and for defining how the list of specimens and the details of individual specimens are presented. Systematic units are presented and searched in Latin; local names are not important for this target group. However, access to detailed data and data searching according to many criteria is necessary. For this target group, three methods of searching for data have been prepared, differing in their level of advancement and flexibility.
An important aspect of scientists’ work is using BioGIS tools to work with data. Advanced tools have been prepared for working with maps and presenting on them both data from the specimen database and data from the database of natural objects unrelated to plant, fungal, and animal specimens.

4.1.2. State and Local Government Administration

For this group, a great deal of emphasis is placed on geolocation data, both when searching and presenting data. Available data can be limited to a selected map area (county, reserve, etc.). Providing aggregated information about specimens collected in this area is also essential. Government administration works with a strictly defined subset of the database, especially with information on monitored species. For this reason, access to data is simplified.

4.1.3. Services and State Officials

This group’s interest is limited to specimens of potential relevance to individual services. This mainly concerns the status of endangered species (CITES, Red Book, etc.). The presentation should include information specific to the target group; iconographic data are essential. Databases with information on the status of species protection and databases with information intended for this target group are integrated.

4.1.4. Non-Governmental Organisations and Society

This general (default) group covers an extensive range of recipients. The way information is presented and searched for depends very much on the specific target group and typically combines elements available to the other target groups. The tools available in the view intended for this group are universal and allow access to a wide range of information. Taxonomic units are presented in the local language, but Latin systematic names are also available.

4.1.5. Education

This group includes users with potentially very different interests. Some users may have less knowledge of the field, so the method of searching for specimens and the method of presenting information are simplified. The most important element is the list of specimens in the form of slides, reviews, and thematic indexes, which can be used to conduct lessons and lectures. It was necessary to include the local language when searching and presenting data. Adding links to external sources, e.g., to a database of national systematic names and basic information about species, may be useful.
The considerable diversity of potential users is a significant advantage, increasing interest in the provided system. However, it also creates a substantial responsibility for system designers concerning customising the functionalities offered. It is essential to recognise that the diversity among users is intricately linked to varying levels of knowledge, sophistication, and expectations regarding the numerous methods of database exploration, as well as the scope and manner in which results are presented. These differences may include the search engine (simple, extended, advanced), the language used to name specimens (national, or the Latin used by scientists), information presented in a scientific or educationally appealing format, access to individual specimens versus aggregated data reports, access to maps versus creating one’s own maps, and different access methods (via a portal and/or a mobile app) [30,31]. The characteristics of each group are presented in Table 1.

4.2. Interdisciplinarity

Interdisciplinarity, characterised by collaboration among subject matter experts, representatives of target groups, and IT professionals, is essential in establishing the requirements and design of the system. The primary contributors from the first group include biologists, specifically botanists and zoologists, who are tasked with selecting and processing specimen data for digitisation. This group also encompasses taxonomists, ecologists, collection curators, and museum staff. Additionally, the role of geotagging specialists is crucial, as their data provides vital information regarding each specimen. The representatives of the target group are determined by the practical applications of the system in relation to its functionalities. Thus, it is imperative to address the question, “For whom is this system being developed, and what functionalities are of interest to these individuals?” IT professionals are responsible for translating the needs and expectations of the various interest groups into the specific functional capabilities offered by the NHC system.
It is important to recognise that engaging in discussions on the aforementioned topics with a diverse group of participants necessitates the use of soft skills. These skills are essential for conducting analyses in a constructive manner that fosters an understanding of the other party’s needs. The authors’ experience indicates that a beneficial initial step is to establish a common vocabulary that helps clarify the details of the topics being discussed and ensures that all parties have a basic understanding of the problems being discussed. Frequently, challenges arise in defining the specifics of requirements, which are closely tied to articulating the other party’s expectations. This often manifests in discussions framed as “What can you offer?” versus “What do you need?” It is also crucial to consider that potential recipients tend to articulate not their actual needs but their vision of the final product, creating a significant distinction. Lastly, it is important to highlight that even after a position is established within one of the target groups, this may ultimately result in ambiguous requirements when viewed in the broader context of services, reflecting varying needs across fields such as science, education, and other sectors.

5. Digitisation

Natural history collections, such as insect collections, often contain millions of specimens that require detailed documentation, including geolocation, taxonomy, and historical context. The huge number of specimens creates challenges in digitising physical records and maintaining high-quality metadata. The challenges encountered are not solely linked to the necessity of handling a substantial volume of data, but also pertain to aspects such as interpretation, data quality, and the rectification of inherent errors. Consequently, system developers are confronted with the challenge of optimising the entire process while ensuring high data quality. An effective approach to managing the digitisation of numerous specimens is through the implementation of automation mechanisms. Concurrently, it is essential to establish procedures for addressing data ambiguity, mechanisms for error detection, and protocols for implementing corrections or modifications.

5.1. Digitisation Process

The need to process a large number of specimens (the most common case when creating a system for the NHC) requires the development of appropriate procedures to organise and streamline the digitisation process of the collections gathered at the hosting institution.
After the digitisation operation, the graphical object (e.g., a herbarium scan or a photo from field observation) receives a unique identifier in the NHC database resources. A new record with metadata fields describing its specificity is created in the prepared format (e.g., a portal form or an Excel spreadsheet). The graphical file with metadata is placed in a work buffer based on, e.g., the Seafile system [32], software for synchronising and sharing files chosen for its reliability and efficiency. To facilitate the preparation of data in the correct format, data administrators have additional tools at their disposal, such as a converter (which allows data conversion in existing files) and a validator (which checks the compliance of data with the developed standard). During the digitisation process, two qualitatively different types of records are considered: the first contains information about the specimen only, while the second contains information about the specimen together with an associated graphic file (photo, scan, etc.).
In the next step, with a set frequency (once a day if new/changed data are detected), data from the working buffer are automatically imported into the taxonomic and iconographic databases. During the import, re-validation is performed, and a report from this process is sent by email to data administrators. This allows for the detection of possible errors (inconsistencies) and streamlines the process of possible data correction. Imported information is placed in the appropriate database tables. Iconographic data are additionally subject to a security process (see Section 7.3.3) to guarantee its creators’ copyright. After these processes, the taxonomic and iconographic data are available for use in, for example, portal displays or export to external databases. This scenario is illustrated in Figure 2.
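The import step described above can be sketched as a simple scheduled job. The following Python outline (all paths, addresses, and helper functions are hypothetical placeholders, not the actual AMUNATCOLL code) shows the general shape: scan the working buffer, re-validate, import, and email a report.

```python
import smtplib
from email.message import EmailMessage
from pathlib import Path

BUFFER_DIR = Path("/data/work-buffer")    # hypothetical shared buffer location
ADMIN_EMAIL = "data-admins@example.org"   # hypothetical address

def validate(path: Path) -> list[str]:
    """Return a list of validation errors; an empty list means the file is valid."""
    errors: list[str] = []
    # ... check metadata completeness, field formats, file integrity ...
    return errors

def import_into_database(path: Path) -> int:
    """Load validated records into the taxonomic/iconographic tables; return row count."""
    # ... INSERT statements or ORM calls would go here ...
    return 0

def send_report(report: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = "NHC import report"
    msg["From"] = "importer@example.org"
    msg["To"] = ADMIN_EMAIL
    msg.set_content(report)
    with smtplib.SMTP("localhost") as smtp:   # assumes a local mail relay
        smtp.send_message(msg)

def run_daily_import() -> None:
    lines = []
    for path in sorted(BUFFER_DIR.glob("*.xlsx")):
        errors = validate(path)
        if errors:
            lines.append(f"{path.name}: REJECTED ({'; '.join(errors)})")
        else:
            count = import_into_database(path)
            lines.append(f"{path.name}: imported {count} records")
    send_report("\n".join(lines) or "No new data detected.")

if __name__ == "__main__":
    run_daily_import()   # typically triggered once a day by cron or a scheduler
```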
The frequent scenario in developing Natural History Collection systems involves collaboration between two or more entities tasked with executing specific responsibilities. This raises the question of how to facilitate the effective exchange of outcomes from the completed sequence of tasks. Referring to the framework illustrated in the AMUNATCOLL project, it is evident that two organisations with distinct areas of expertise are working together: the entity managing the natural collection, which oversees the initial phase of specimen preparation (including digitisation, numbering, description preparation, and validation), and the IT firm tasked with integrating this data into the database, safeguarding iconographic information, and providing online access. A suitable point of interaction for both organisations appears to be a shared data repository, which would promote high efficiency and allow for seamless exchange of both textual data (metadata) and associated graphic materials (such as scans, photographs, audio files, etc.). It is anticipated that following a successful data import, the importer will delete the data, thereby creating space for future entries.

5.2. Data Quality—Date Uncertainty and Ambiguity

Digitisation of data directly from herbarium sheets causes many problems with the quality of the information that can be read. These problems increase with the age of the sheet (e.g., 18th or 19th century) and include difficulties in reading handwriting, incomplete data, and fading ink (Figure 3). This hampers automatic text recognition and its interpretation using machine learning techniques, which would otherwise be a very convenient solution considering the amount and complexity of the analysed text. In particular, these problems affect identifying the correct date or place of collection, which is fundamental information for the accurate characterisation of the catalogued specimen. Table 2 lists some examples of issues relating to the interpretation of dates.
In the event of the above or similar problems with identifying data, it is invaluable to have them verified by an experienced collection manager, who can supplement the date based on other information or their own experience. In cases where it is still not possible to fully specify the date, the imprecise information (from the point of view of format) should be recorded in the field intended for comments. For a scientist studying a given specimen, it can still be a valuable source of knowledge.
In IT systems, it is assumed that the date is recorded in accordance with accepted standards, e.g., ISO 8601 [33] or RFC 3339 [34], usually in the format “YYYY-MM-DD”. This requirement translates into the date formats used in database systems. Considering the problems presented above with incomplete date information, the question should be asked whether it is worth strictly sticking to the standard or adapting the format to the system requirements. In the first case, it will not be possible to record an incomplete date, and partial information will be in the field intended for comments. In the second case, a text format should be used (instead of a date format), allowing for partial information to be recorded. However, this requires adapting the interface functions, e.g., those related to searching.
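The second approach can be sketched as follows (Python; the record layout is a hypothetical illustration): a complete date is normalised to ISO 8601, while an incomplete one is preserved verbatim so that no information is lost.

```python
from datetime import datetime

def normalise_collection_date(raw: str) -> dict:
    """Store a full date in ISO 8601 form if possible; otherwise keep the verbatim text."""
    raw = raw.strip()
    for fmt in ("%Y-%m-%d", "%d.%m.%Y", "%d %B %Y"):
        try:
            iso = datetime.strptime(raw, fmt).date().isoformat()
            return {"event_date": iso, "verbatim_date": raw}
        except ValueError:
            continue
    # Incomplete or ambiguous dates (e.g., "July 1897") are preserved verbatim
    # and left for an experienced curator to interpret.
    return {"event_date": None, "verbatim_date": raw}

print(normalise_collection_date("12.07.1934"))  # normalised to '1934-07-12'
print(normalise_collection_date("July 1897"))   # kept only as verbatim text
```

The cost of this flexibility, as noted above, is that search and sorting functions must handle the text field explicitly rather than relying on the database's native date type.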

5.3. Iconographic Data Formats

One of the challenges associated with creating online systems for sharing natural collections is choosing the correct format for iconographic data. In a typical digitisation process, the absolute minimum is the so-called MASTER form and one presentation form (HI- or LOW-RES). Depending on the digitisation path and equipment, the RAW and MASTER CORRECTED forms may or may not be created. Most often, only one form of presentation is provided (HI- or LOW-RES), and the creators of a specific website decide which of them will be decisive. The RAW format (obtained from digital cameras) and TIFF for lossless recording of scans of natural specimens are most often used in the digitisation process. The JPG format is sometimes used, especially in field observations performed on older devices. Images are converted to the pyramidal TIFF format [35], which allows for streamlining the process of their transmission and presentation on the portal. Table 3 presents the most essential formats along with their characteristics.
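As one possible way of producing such a pyramidal TIFF presentation form, the sketch below uses the pyvips library (an assumption; any tool built on libvips or a similar imaging library could serve). The file names are placeholders.

```python
import pyvips  # Python binding for libvips; install with `pip install pyvips`

# Load a lossless MASTER scan and save it as a tiled, pyramidal TIFF,
# which viewers can fetch tile by tile at several zoom levels.
image = pyvips.Image.new_from_file("master_scan.tif", access="sequential")
image.tiffsave(
    "presentation_pyramid.tif",
    tile=True,            # write tiles instead of strips
    tile_width=256,
    tile_height=256,
    pyramid=True,         # embed successively downscaled layers
    compression="jpeg",   # compact presentation copy; the MASTER stays lossless
    Q=90,                 # JPEG quality for the presentation form
)
```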

5.4. Automation of Digitalisation Procedures

A significant operational, logistical, and technological challenge is the need to prepare a large amount of data (often hundreds of thousands or millions of specimens) that meet specific requirements during the digitisation process. Due to its time-consuming nature, this task often proves to be crucial for the success of the data preparation stage and for meeting key project indicators. It is therefore worth investing effort in improving the flow of data by developing procedures and applications that support this process. They help prepare data, check its correctness, and prepare statistics. These tools can be available in different forms (a console or web application, or a spreadsheet that facilitates formalised data preparation), depending on the planned data entry process and subsequent correction (see Table 4). Ultimately, this simplifies the management and monitoring of processes in the project and increases the chance of successfully completing the task while maintaining the highest possible effectiveness.
The approach based on spreadsheet forms, used in the first stage, allowed for easy conversion of existing data to the new format and efficient communication between the teams supplementing individual groups of specimen data (e.g., the team of botanists transferred data to the team responsible for georeferencing). Portal-based forms were added later (when the system reached the appropriate level of advancement), providing a greater level of interactivity (immediately visible changes) while maintaining high data quality (validation and reporting of changes).
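A simplified sketch of such a supporting tool is shown below (Python with openpyxl; the required column names are hypothetical and far fewer than in the real specification described in Section 6.1). It checks each spreadsheet row against a list of required columns and produces a per-row error report for the digitisation team.

```python
from openpyxl import load_workbook

# Hypothetical required columns; the real specification is far richer.
REQUIRED = ["catalogue_number", "institution", "genus",
            "collector", "collection_date", "locality"]

def validate_spreadsheet(path: str) -> list[str]:
    """Return human-readable error messages; an empty list means the file passes."""
    ws = load_workbook(path, read_only=True).active
    rows = ws.iter_rows(values_only=True)
    header = [str(c).strip() if c is not None else "" for c in next(rows)]
    errors = [f"missing column: {col}" for col in REQUIRED if col not in header]
    if errors:
        return errors
    index = {col: header.index(col) for col in REQUIRED}
    for row_no, row in enumerate(rows, start=2):
        for col in REQUIRED:
            value = row[index[col]] if index[col] < len(row) else None
            if value in (None, ""):
                errors.append(f"row {row_no}: empty required field '{col}'")
    return errors

if __name__ == "__main__":
    for message in validate_spreadsheet("specimens_batch_042.xlsx"):
        print(message)
```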

6. Design

An essential aspect of the design phase involves planning operational procedures for the appropriate storage and management of taxonomic, biogeographic, and ecological data associated with the biological specimens being digitised. In the initial phase of this process, the metadata, i.e., the formal structure of the stored information, are defined based on an analysis of existing standards [36,37]. The set of parameters derived from the standard is expanded with data important from the point of view of the specificity and functionality of the system being developed. Next, the database, as a key element of many IT systems, must be configured to store data with an appropriate structure to increase efficiency. Preparing and processing vast amounts of data requires automated procedures supported by dedicated tools. They cover a variety of routines, ranging from data preparation, which sometimes involves conversion, to aggregation and finally validation, which ensures that the data follow certain rules. First, to enable proper handling of the process, dedicated operational procedures must be defined and applied.
Before discussing the metadata structure, two critical aspects that define the approach to data management must be mentioned: the Data Management Plan and the Essential Biodiversity Variables.
A Data Management Plan (DMP) is a formal document specifying procedures for handling data in the context of its collection, processing, sharing, and presentation, and describing the data management life cycle, preservation, and metadata generation. DMPs are a key element of good data management. As part of making research data findable, accessible, interoperable, and reusable (the FAIR principles), a DMP should include information on how research data are handled during and after the project, what data will be collected, processed, and/or generated, which methodology and standards will be applied, whether data will be shared or made open access, and how data will be maintained and preserved. It should be noted that a properly prepared DMP saves project implementation time and increases research efficiency [38].
Essential Biodiversity Variables (EBVs) facilitate evaluating biodiversity changes over time. This capability supports monitoring advancements toward sustainable development goals by assessing adherence to biodiversity policies and observing how biodiversity responds to disturbances and management actions [4,20].

6.1. Metadata Definition

Given the complexities involved in establishing a metadata structure, adhering to established global biodiversity metadata standards is advisable to avoid compatibility pitfalls during future integration with external data sets. The parameters and types of data constitute critical categories of information that greatly affect effective data management. Inadequately defined metadata structures can impede data retrieval and exploration, thereby providing insufficient support for researchers in their endeavours [39,40]. Two primary standards for biodiversity informatics are widely recognised and utilised by major networks: Darwin Core [18] and Access to Biological Collections Data [41]. Examining these specifications reveals that they effectively outline the essential aspects of specimen characteristics and their categorisation, aligning with the requirements of most collections. These standards encompass areas related to the taxonomic description of specimens, their specifications, spatial attributes, descriptions of associated multimedia files, and references to information sources. Adopting such standards also enhances the interoperability of systems with similar objectives, significantly improving their overall effectiveness.
Therefore, developing a proper organisation of metadata seems extremely important, both in terms of addressing the system’s functional and non-functional requirements and defining the individual sections to which they belong. The definition of metadata was divided into four sections, covering taxonomy, biological samples, multimedia, and bibliography.
The first section of the metadata description provides details describing the taxonomically identified (named) “objects”, such as preserved specimens, iconographic documents (drawings or photographs), multimedia documents (multimedia objects), and field notes (human observations).
The second part is intended for information related to a specific area (or areas) of research. In the case of AMUNATCOLL, these are metadata describing biological samples in which biological material (mainly invertebrates) is preserved, awaiting scientific processing, in particular taxonomic identification.
The following section discusses metadata describing multimedia documents of landscapes, natural habitats, and species that do not have all the properties necessary for their inclusion in the first section but illustrate the characteristics of these taxa well. Such research material is also a valuable source of information on, for example, biotopes, and should be subject to cataloguing.
The last information group is metadata describing bibliographic items cited in the database and previously unpublished documents, including digitised copies of poorly accessible bibliographic sources. Because they often contain unique material supplementing information included in the other sections, they must not be forgotten.
Each field in the metadata specification has a unique name and is described with a metric consisting of the elements presented in Table 5.
The relevant fields for describing the above-mentioned characteristics are primarily found in the ABCD and Darwin Core standards. However, a group of characteristics goes beyond the defined scope. In the AMUNATCOLL project, the ABCD standard in version 2.06 was adopted with numerous extensions. This resulted in over 220 fields and numerous dependencies describing the relationships between characteristics (conditioning the occurrence of specific values). The metadata specification and the associated Data Management Plan were developed through numerous discussions held over several months and resulted in over 100 versions of documentation consisting of hundreds of pages. The result of the relationships between the fields suggested by the standard and the numerous project extensions is illustrated in Figure 4.
Finally, it is worth emphasising the issue of the mandatory completion of individual fields describing the specimen. Considering the specifics of work in a natural resources digitisation project, especially in the context of usually limited resources and time, it should not be expected that all fields will be completed on the first attempt (especially since there are usually a significant number of them, e.g., more than 220 in AMUNATCOLL) [21]. This has consequences for the metadata specification, which must define the fields whose completion is compulsory, either because of the importance of the information or because of links with other fields of the record. The following fields are considered required: identification (ID, institution, source), taxonomic (at least genus), details regarding the specimen’s origin (collector and designations, date and location of collection), and storage method. In this context, it is essential to keep in mind compatibility with external data repositories (such as GBIF) to ensure that the data’s completeness facilitates seamless integration in the future (refer to Section 7.2).
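To illustrate how such field descriptions and inter-field dependencies can be represented in software, the sketch below uses a Python dataclass; the attribute and field names are assumptions inspired by the description above, not the actual layout of Table 5.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class FieldMetric:
    """Description of a single metadata field in the specification (illustrative)."""
    name: str
    section: str                                   # e.g., "taxonomy", "multimedia"
    mandatory: bool = False
    allowed_values: Optional[list[str]] = None
    # Dependency rule: given the whole record, does this field become required?
    depends_on: Optional[Callable[[dict], bool]] = None

    def check(self, record: dict) -> Optional[str]:
        value = record.get(self.name)
        required = self.mandatory or (self.depends_on(record) if self.depends_on else False)
        if required and value in (None, ""):
            return f"'{self.name}' is required but empty"
        if value and self.allowed_values and value not in self.allowed_values:
            return f"'{self.name}' has unexpected value '{value}'"
        return None

# Example: the collection date is always mandatory, while 'image_licence'
# becomes required only when an image file is attached to the record.
SPEC = [
    FieldMetric("genus", "taxonomy", mandatory=True),
    FieldMetric("collection_date", "taxonomy", mandatory=True),
    FieldMetric("image_licence", "multimedia",
                depends_on=lambda r: bool(r.get("image_file"))),
]

record = {"genus": "Carex", "collection_date": "1952-06-01", "image_file": "scan_0001.tif"}
print([msg for m in SPEC if (msg := m.check(record))])
```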

6.2. Data Access Restrictions

During the system design phase, it is essential to consider the sharing of selected data by establishing a suitable access policy. For the sake of biodiversity protection, access to some data should be restricted, for example, to specimens of protected species or those collected in protected areas. Protection involves restricting data access to specific groups of recipients with an appropriate authorisation level, such as researchers, and/or offering location information with a defined level of precision, whether exact or general, including GPS coordinates and a descriptive account. A more complex challenge arises when considering the legal protection status at the time of collection. A database protection policy operating at two levels, the specimen and the field (part of the record), is proposed. Both methods of protection are discussed below.

6.2.1. Specimen Data Protection Levels

A record in the database can be marked with a protection level (0–3), indicating the users who have access to the information (see Table 6).
The specimen is visible to the user with the appropriate permissions (appropriate role) in the system.

6.2.2. Record Field Protection Levels

The user can view protected fields if their permission level (role) allows it. Some of the fields are treated in a special way. Information regarding geographic coordinates and habitat is stored in the database in both exact and approximate form. In general, it is important, at least for herbarium specimens, that images of protected species not be shared, because the label contains information on the collection location. The field value presented to the user is taken from one of the two physical fields, depending on the user’s protection level. This protects sensitive data about the collected specimens. Fields are marked with different protection levels (0–3), indicating permissions to view them (Table 7).
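The dual-field idea can be sketched as follows (Python; the threshold, flag, and field names are assumptions for illustration only): depending on the user's clearance level, either the exact or the approximate coordinates stored in the record are returned.

```python
# Hypothetical clearance levels: 0 = anonymous, 3 = trusted collaborator/coordinator.
SENSITIVE_LEVEL = 2   # assumed minimum level for exact locations of protected species

def visible_coordinates(record: dict, user_level: int) -> dict:
    """Return exact coordinates only to sufficiently authorised users."""
    if not record.get("protected_species") or user_level >= SENSITIVE_LEVEL:
        return {"lat": record["lat_exact"], "lon": record["lon_exact"]}
    # Otherwise fall back to the approximate values stored alongside the exact ones.
    return {"lat": record["lat_approx"], "lon": record["lon_approx"]}

specimen = {
    "protected_species": True,
    "lat_exact": 52.40831, "lon_exact": 16.92506,
    "lat_approx": 52.4, "lon_approx": 16.9,
}
print(visible_coordinates(specimen, user_level=0))  # approximate location
print(visible_coordinates(specimen, user_level=3))  # exact location
```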

6.2.3. User Roles

Both general users and project coordinators use the Natural History Collections portal. A user’s actions in the portal depend on the permissions assigned to them. At a general level, we can distinguish two types of users: not logged in (access to the basic functionality of the portal and data) and logged in (access to additional functionalities, depending on the roles set).
A newly logged-in user can browse content shared publicly on the portal but initially has no role with additional permissions. Authorised persons can assign users one or more roles. A role groups permissions to individual actions and defines the level of user access to individual fields describing specimens. The following roles are defined: (i) confirmed user, who can create content, in particular observations and presentations; (ii) leader, who can create teams and confirm users joining them; (iii) trusted collaborator, who has access to some sensitive information, e.g., the exact location of the specimen collection site; and (iv) coordinator, who acts as content editor, leader manager (super-leader), and project coordinator.
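One simple way to encode these roles is a mapping from role names to permission sets, as in the sketch below; the permission identifiers are illustrative, not the portal's actual vocabulary.

```python
# Illustrative role-to-permission mapping; permission names are assumptions.
ROLE_PERMISSIONS = {
    "confirmed_user": {"create_observation", "create_presentation"},
    "leader": {"create_team", "approve_team_member"},
    "trusted_collaborator": {"view_exact_locations"},
    "coordinator": {"edit_content", "manage_leaders", "coordinate_project"},
}

def permissions_for(roles: set[str]) -> set[str]:
    """Union of permissions over all roles assigned to a user."""
    return set().union(*(ROLE_PERMISSIONS.get(role, set()) for role in roles))

print(permissions_for({"confirmed_user", "trusted_collaborator"}))
```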

6.3. Data Correction

The challenge of data correction is frequently overlooked during the initial stages of the system design and development. As the digitisation process unfolds and data are entered into the database alongside routine quality control measures, it becomes evident that data inaccuracies are more common than anticipated. These inaccuracies vary, with a significant portion stemming from human mistakes. Examples include random errors, such as typographical mistakes related to individual entries, and systematic errors, which involve assigning incorrect field values across a group of entries. Also, errors may arise from incorrect data description files, particularly when re-importing entire datasets. Mechanical errors also occur, such as low-quality scans that require re-importation of specific items, including iconographic data. All these scenarios highlight the necessity for expanding the validation and import procedures and mechanisms beyond initial expectations.
The procedure for value correction should be restricted to individuals who possess the necessary authorisations (e.g., coordinator, editor, refer to Section 6.2.3). It is advisable to modify the value of records pertaining to a specimen in two distinct manners: individually and collectively. For the individual approach, it is beneficial to enable the modification of records directly through the portal interface, specifically within the form that displays the specimen characteristics (see Figure 5a,b). Conversely, a more effective method for the collective value modification process would be to import the amended records via a spreadsheet file that adheres to the established standard (see Figure 5c). In both cases, a significant advantage of such a solution is that the changes are immediately visible in the portal. It is essential to highlight that every record, irrespective of the modification method, undergoes a validation process akin to the initial digitisation. Only upon completing this validation are the data incorporated into the database.
The record editing process must be effectively managed on the backend, necessitating an expansion of the range of supported API calls. To ensure secure access to these services, all requests must include a valid authorisation token, which can be obtained by submitting a login request to the API, along with the appropriate permissions for the logged-in user. The API accommodates requests from tools associated with Excel spreadsheet files, such as file validation, conversion, and database updates, as well as requests aimed at modifying individual records through the online portal form.
Requests are queued on the portal side, ensuring that each subsequent request is processed only after receiving a response from the preceding one. For operations that alter the state of data within the database, each modification is recorded in tables designated for maintaining the history of the respective record type. This includes general information about the change, such as the date and time, the user who made the change, the specific record that was altered, and detailed descriptions of all modified fields. In instances where the database is imported or updated from a file, the name of that file is also documented in the database to assist those responsible for data modifications. Each operation generates a report detailing the results or any errors encountered.
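A hedged sketch of how a client-side tool might call such an API is shown below (Python with the requests library; the endpoint paths, payload fields, and token flow are assumptions for illustration, not the published AMUNATCOLL API).

```python
import requests

BASE_URL = "https://nhc.example.org/api"   # placeholder address

def login(username: str, password: str) -> str:
    """Obtain an authorisation token for subsequent requests."""
    response = requests.post(f"{BASE_URL}/login",
                             json={"username": username, "password": password},
                             timeout=30)
    response.raise_for_status()
    return response.json()["token"]

def update_record(token: str, record_id: str, changes: dict) -> dict:
    """Submit a single-record correction; the server validates it and logs the change history."""
    response = requests.patch(
        f"{BASE_URL}/specimens/{record_id}",
        json=changes,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()   # e.g., a report of modified fields or validation errors

if __name__ == "__main__":
    token = login("coordinator", "secret")
    report = update_record(token, "AMU:POZ:000123", {"locality": "Poznań, Morasko"})
    print(report)
```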

6.4. Graphic Design

The graphic design of a typical NHC portal should meet several rigorous requirements regarding usability, appearance, and accessibility. Work on its creation usually begins with establishing the purpose of the portal and its content architecture and defining the recipients, followed by creating a functional model and graphic design. A critical task is to determine who it is addressed to (who the end users are), what their expectations are, and how the recipients will use the designed interface.
The spectrum of recipients is often extremely wide, and users have different needs and experiences with IT solutions.
A sample list of recipients may include researchers, teachers and students, representatives of state and local government administration involved in nature conservation, state services and officials responsible for species protection, representatives of non-governmental organisations, and ordinary nature lovers. Therefore, the UX (user experience) and UI (user interface) must be designed so that each user can find exactly the content they expect. It can be assumed that the experience or, in some cases, habits of users on the web are not very different from the habits of customers in a supermarket. Website visitors browse each new page, “scan” text and photos, and click on the first link that promises to provide what they seek.
According to UX principles, it is necessary to predict the steps users take to access functionality or information, the so-called user flow. Most users search for engaging content by clicking on various interface elements until something catches their attention. If the page does not meet their expectations or they do not find the needed information, they click the “back” button and continue searching.
Numerous studies have shown that arranging the site navigation in a pattern resembling the letter “F” (F-pattern design) is advisable: when browsing pages, internet users usually scan them with their eyes, starting from the upper left corner of the page and then moving lower. As a result, the human eye makes a “journey” across the screen whose “trail” resembles the letter “F”. In accordance with this style, the page’s most important elements should be placed near the top and on the left. This principle determined where the key navigation elements are placed on the portal.
The home page should be a single page, with clear links to the various modules. A module should fill 100% of the screen. Some modules may be easier to use if broken across more than one screen—so long as it is easy to move between screens.
The basic function of the NHC portal is to offer access to the database of natural collections. To increase user engagement, it is worth considering encouraging the user to create an account. This can be done by offering more functionality and data for authorised users while maintaining a minimum offer for all visitors. However, it should be remembered that both groups should have a defined path to reach the information.
Access to the main pages should be possible from the menu level, the structure of which is intuitive. According to Krug’s first law of usability [42], a website should be intuitive and logically constructed. If the navigation and architecture of the site are not intuitive, the user’s doubts multiply. Hence the consistent arrangement of content and functions into appropriate submenu groups.
Another assumption when designing the portal concerns its appearance. It is necessary to be guided by the need to provide websites whose design will be attractive for several years, but will also be user-friendly. It should be remembered that the portal serves both its content provider and the recipient. This goal can be achieved by using the appropriate colour palette. A well-chosen colour palette enhances the positive experience of using the website. Complementary colours create balance and overall harmony. The use of contrasting colours in text and background makes reading easier.
When designing, it is worth consciously using the “whitespace” principle. Whitespace is the space that separates elements of the website from each other. Empty spaces have a very positive effect on the user’s reception of content. First of all, they give the impression of spaciousness on the website and, as a result, reduce the feeling of tightness between individual elements of its content. Thanks to the use of additional spaces, these elements are definitely more legible. Whitespace highlights the most important parts of the page and allows the reader to focus their attention on them. One such element is the “call to action” buttons on the main page in each of the modules. Forms and advanced search engines are located in empty fields, which increases the chances that they will be completed.
The typography used (e.g., Montserrat) was chosen because it is consistent with current design trends as well as being easy to read and appropriate to the content. It is a good idea to limit oneself to one font family in different weights, in order to maintain consistency and shorten the page loading time.
NHC portals often offer users a huge database of illustrations on the one hand and information in text form on the other. It is worth using the knowledge that the human brain perceives and processes images much faster than regular text. The user’s eye quickly tires of long texts. “A picture is worth a thousand words”. Therefore, wherever a photo can support the content, attractive images should be used. Examples are infographics, which explain the portal’s assumptions or the functionality of the application to the user faster than written text.
In order to reconcile different forms of communication and maintain legibility, it was necessary to draw on experience and current trends in website design. Minimalism is currently one of the most popular trends in website design. Its essence can be defined as “less is more”. The basis of this trend is to simplify the page as much as possible. Unnecessary elements should be removed, limiting the page to the necessary minimum. This applies not only to the content of the website, but also to its colours. The portal should be simple and uncomplicated. A minimalist layout minimises the distraction caused by unnecessary elements. In addition, it makes the page’s resources easier to discover.
When designing websites, it is worth considering SEO requirements. The graphic design of subpages should meet the expectations of search engines (e.g., Google) and the expectations of the user. Therefore, the expected template of subpages contains a header with the title of the page, a text lead introducing the content of the page and the actual content, where the blocks of text are divided by graphics.
The design also had to be responsive, i.e., designed and coded in such a way that it works and looks good regardless of the resolution of the monitor on which it is displayed. The portal is prepared to work properly on a large monitor, tablet, or smartphone. Responsive web design makes it possible for the user to find and easily read all the information they are looking for without the need to reduce or enlarge individual elements of the portal.
A well-designed NHC portal meets the WCAG 2.1 level AA requirements, i.e., perceivability, operability, understandability, and robustness/compatibility [43]. This matter pertains not only to best practices in website development but is also mandated by established European legislation, including the European Accessibility Act (EAA, Directive (EU) 2019/882) [44], the Web Accessibility Directive (Directive (EU) 2016/2102) [45], and the European Standard EN 301 549 [46], as well as national laws within EU member states. The content on the portal is prepared to be accessible to people with various limitations: those who want to know what is in a picture even though they cannot see it, those who cannot use a mouse and rely on the keyboard alone, and those who enlarge the page view, change its colours, or change the browser settings to make the content more legible. The portal should also include an accessibility declaration, from which users can find out to what extent the site complies with the requirements of the act on the digital accessibility of websites and mobile applications of public entities.

7. Technology

The functional model proposed in this work assumes that digital biodiversity data will be used for scientific, educational, public, and practical purposes. Therefore, designing and implementing interfaces that properly enable access, exploration, and manipulation of the data in the project database is crucial. Data can be accessed using two available interfaces: a graphical and a programming interface (API). The first one is implemented in two forms: a portal, which is the primary interface for access to data collected in the database, and a mobile application with functions of particular value when conducting field research. Providing the set of operations required by target groups involved equipping the portal with simplified and advanced search, statistical analysis, and BioGIS processing capabilities. The graphical interface is subject to numerous requirements and limitations, which are reflected in graphic design and accessibility issues related to accommodations for people with disabilities. It must appropriately address different groups of target recipients, considering their various goals and levels of knowledge, and adapt the interaction level due to the interface’s limitations.
The technological challenge was the scale of the project, both during the digitisation process and in the subsequent storage and sharing of data. The botanical collections include approximately 500,000 specimens, including over 350,000 vascular plants. Specimens of vascular plants are kept in two herbaria: POZ and POZG. The POZ herbarium (approx. 190,000 sheets) consists of many collections, mainly from Poland and various regions of Europe and North America. This herbarium contains over 240 nomenclatural types of multiple ranks. Zoological collections contain over 1,700,000 specimens of invertebrates and 50,000 chordates catalogued so far. Almost 800 TB of physical disk space is needed to store these data and the associated multimedia files, including processed data, replication, and backups (see Section 7.1.2 for details). Additional space (120 TB) is used to temporarily store data after the digitisation process and before their verification and inclusion in the system.
Data openness and cooperation with other solutions and systems are key to achieving synergy in biodiversity research. AMUNATCOLL IT responds to these challenges by enabling data export for independent processing in external tools, either through the portal functionality or directly via an application programming interface. In addition to independent export, the API also makes it possible to connect the AMUNATCOLL database with external databases, e.g., GBIF (Global Biodiversity Information Facility) [47].

7.1. Infrastructure

The IT infrastructure implemented to create the NHC online system is intended to safely store data collected in the process of digitisation of specimens and to provide computing power for services related to serving content to users. The IT infrastructure was designed to meet all project requirements, including ensuring basic functionality such as data redundancy, the ability to implement various processes, supporting both development and production purposes, and providing space for systems supporting reliable system operation, e.g., monitoring service performance.

7.1.1. Understanding the Specimen Quantity Factor

First, the project requirements must be properly assessed in terms of the number of specimens to be digitised and the time and human resources devoted to the process. Table 8 presents an estimate for the AMUNATCOLL system.
To understand the scale of the digitisation project, let us do some simple calculations. Assume the project lasts 3 years, with about 250 working days each year, which gives 750 days. From this value, we need to subtract about 6 months (125 days) for the so-called start-up at the beginning of the project, related to purchasing the necessary digitisation equipment (scanners, workstations, disk space), establishing the metadata structure, preparing forms, training employees, etc. We are left with 625 days for digitisation, which gives about 3600 specimens per day. Assuming 50 employees, each should digitise seventy-two specimens (nine per hour) during a working day. Achieving such efficiency in the initial phase is, of course, extremely difficult (newly trained staff are not yet proficient) and will require catching up on the “backlog” in the subsequent stages of the project. These numbers are intended to emphasise that without proper planning and support of the entire process with automation, the task could not be accomplished.
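The arithmetic above can be captured in a short script. The figures below simply restate the assumptions quoted in the text, so substituting a different collection size, staffing level, or project length immediately shows how the daily quota changes; this is a planning sketch, not project tooling.

```python
# Rough digitisation-throughput estimate based on the assumptions quoted above.
specimens = 2_250_000        # total objects to digitise (approx. AMUNATCOLL scale)
project_days = 3 * 250       # 3 years of roughly 250 working days each
startup_days = 125           # equipment purchases, metadata design, training
staff = 50                   # number of people digitising
hours_per_day = 8

effective_days = project_days - startup_days        # 625 days for digitisation
daily_quota = specimens / effective_days            # about 3600 specimens per day
per_person_day = daily_quota / staff                # about 72 per person per day
per_person_hour = per_person_day / hours_per_day    # about 9 per person per hour

print(f"{daily_quota:.0f} per day, {per_person_day:.0f} per person/day, "
      f"{per_person_hour:.1f} per person/hour")
```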

7.1.2. Storage Space

Disk space is another enormously important aspect that must be considered when planning the infrastructure for the NHC system. Proper planning assumes data redundancy, which protects the data in the event of a failure and allows the system to be restored efficiently, as well as a buffer for storing processed data.
The starting point for the calculations is the source material: all iconographic material from the scanning process, together with photographic, video, and sound documentation. It is not used for operational activities; instead, it makes it possible to recreate the operational material in the event of its damage or a change in the technology used.
In turn, the operational material is created by processing the source material to meet the needs of the IT system. It is processed using operations that increase its usability during presentation on the portal and provide copyright protection. The first operation is conversion to pyramidal TIFF (adaptation to the needs of the portal, enabling faster loading of graphics for presentation). The next one involves securing photos against unauthorised use by cutting off the border and adding a watermark, holograms, and a set of metadata (EXIF). The above operations increase the required capacity by an average of 160% (mainly due to the conversion to pyramidal TIFF).
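For illustration, the pyramidal-TIFF step can be performed with the libvips Python binding (pyvips). The file names, compression type, and quality setting below are placeholders rather than the exact AMUNATCOLL configuration.

```python
# A minimal sketch of converting a master scan to a tiled, pyramidal TIFF
# using pyvips (libvips). Paths and compression settings are illustrative only.
import pyvips

image = pyvips.Image.new_from_file("scan_master.tif", access="sequential")
image.tiffsave(
    "scan_pyramidal.tif",
    tile=True,           # store the image as tiles rather than strips
    pyramid=True,        # embed successively smaller resolution levels
    compression="jpeg",  # lossy compression keeps the derivative compact
    Q=85,                # JPEG quality for the derivative copy
)
```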
The next step is to decide on the storage method in terms of the disk technology used. It is suggested to use a RAID (Redundant Array of Independent Disks) solution [48], which simultaneously increases reliability and transfer performance and presents the drives as a uniform available space. A reasonable and sufficient approach seems to be a configuration based on RAID 6 (8 + 2), where eight is the number of drives used for storing actual data and two is the number of drives dedicated to redundancy (parity), providing fault tolerance; such a configuration withstands the failure of up to two disks. An alternative is a nested array in which a mirror (RAID 1, replication of data on two or more physical disk sets) is built from RAID 0 stripes. This combines the advantages of RAID 0 (speed in write and read operations) and RAID 1 (data protection in the case of a single disk failure), although after a single disk failure the affected set effectively continues as RAID 0. In the RAID 6 (8 + 2) variant, it should be noted that the additional cost is allocating 20% of the raw capacity to redundancy rather than data.
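The capacity cost of the parity scheme is easy to quantify. The snippet below assumes the RAID 6 (8 + 2) layout described above; the drive size is an arbitrary example, not a value from the project.

```python
# Usable capacity and parity overhead for a RAID 6 (8 + 2) group.
data_disks = 8
parity_disks = 2
disk_tb = 16                                  # example drive size in TB (assumption)

total_disks = data_disks + parity_disks
raw_capacity = total_disks * disk_tb          # 160 TB of raw space
usable_capacity = data_disks * disk_tb        # 128 TB available for data
parity_share = parity_disks / total_disks     # 0.20 -> 20% of raw space

print(f"raw {raw_capacity} TB, usable {usable_capacity} TB, "
      f"parity overhead {parity_share:.0%}, tolerates {parity_disks} disk failures")
```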
Considering the effort needed for potential data recovery (time and personnel costs) in serious emergencies such as floods or fires, it is necessary to consider placing the data in different locations. It is assumed that one copy of the data is kept in the same location (local copy), but on another, much cheaper medium, and a second copy in a different geographical location (geographic copy), where the data are replicated on disk arrays. This approach guarantees relatively quick system recovery, even after a serious event, by allowing a smooth switchover of the data source. In addition, the local copy can be limited to the source material only, to reduce storage costs.
It is important to remember that the storage capacities of data carriers, such as HDD and SSD drives, are defined according to the ISO standard [49] (1 kB = 1000 B, i.e., 1 kilobyte = 1000 bytes). However, the values of available space reported by the operating system use the conversion factor 1 KiB = 1024 B (KiB stands for kibibyte) [50]. While the difference seems negligible for small values, at the scale of 1 TB it reaches approximately 10% [51].
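The discrepancy can be checked directly; the snippet below compares decimal (SI) and binary (IEC) prefixes for a nominal 1 TB drive.

```python
# Decimal (SI) vs. binary (IEC) capacity prefixes for a nominal "1 TB" drive.
TB = 1000**4    # 1 terabyte = 10^12 bytes (how manufacturers label drives)
TiB = 1024**4   # 1 tebibyte = 2^40 bytes  (how operating systems often report)

nominal_bytes = 1 * TB
reported_tib = nominal_bytes / TiB    # ~0.909 TiB shown by the OS
discrepancy = (TiB - TB) / TB         # ~0.0995 -> roughly 10%

print(f"1 TB drive is reported as {reported_tib:.3f} TiB "
      f"(the binary unit is {discrepancy:.1%} larger)")
```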
Space must also be allocated for shared storage used during digitisation. It should enable an asynchronous data import mechanism by synchronising data directories in the background. A browser-based file transfer solution is not recommended due to large file sizes and transfer speed limitations. The AMUNATCOLL project effectively assumes a buffer size of about 90 TB (physically about 120 TB).
Let us consider the calculation of the required disk space in two variants, regular and economical (in which the local copy includes only the source materials), based on data from the AMUNATCOLL project (Table 9).
Given the above assumptions, safe and efficient data storage and processing requires from 9.52 to 11.63 times more space relative to the collected source material. In addition, procedures for restoring the database from the maintained backup copies should be developed to ensure the fastest possible service restoration in the event of a disk system failure.

7.1.3. Computing and Service Resources

The implemented IT infrastructure is intended to store data and to ensure the efficient and reliable operation of numerous services and processing pipelines (e.g., conversion of scans and their protection). With this in mind, some services, such as the portal, have been duplicated and divided into development and operational infrastructure. This allows new functionality to be implemented and changes to be introduced following the CI/CD (continuous integration/continuous delivery) methodology [52].
The portal implementation consists of two parts: backend and frontend. The server part (backend) was developed in Python [53] using the Django framework [54] with the Pandas [55] and Gunicorn [56] libraries. The browser part of the portal (frontend) was prepared in the SPA (Single Page Application) model in JavaScript using the ReactJS library [57]. The BioGIS tools were prepared using GeoJSON [58] on the database side and the Leaflet [59] and Pixi.js [60] libraries on the frontend side. Communication between the frontend and backend is carried out using HTTP API calls. Query authentication uses JWT tokens [61].
Maintaining the continuity of the system and safeguarding the data it houses is of utmost importance, given the accessibility of the provided data and services. To achieve a high level of service availability, it is advisable to implement a resource monitoring system, such as Zabbix [62]. This service tracks a specified list of critical applications at predetermined intervals. Notifications regarding service outages are emailed to a designated individual or group, enabling service administrators to take prompt action to restore functionality.
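As a minimal stand-in for a full monitoring system such as Zabbix, a simple availability probe can illustrate the idea of checking a defined list of critical services and alerting administrators. The URLs and timeout below are placeholders, and real deployments would send email or integrate with the monitoring platform instead of printing.

```python
# A minimal availability probe: checks a list of critical endpoints and
# reports failures. URLs, timeout, and alerting are illustrative placeholders.
import requests

SERVICES = {
    "portal": "https://example.org/",           # placeholder URL
    "api": "https://example.org/api/health",    # placeholder URL
}

def check_once() -> list[str]:
    failures = []
    for name, url in SERVICES.items():
        try:
            response = requests.get(url, timeout=10)
            if response.status_code != 200:
                failures.append(f"{name}: HTTP {response.status_code}")
        except requests.RequestException as exc:
            failures.append(f"{name}: {exc}")
    return failures

# In practice this would run at a fixed interval and notify administrators.
for failure in check_once():
    print("ALERT:", failure)
```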
The database is a fundamental component of nearly all NHC systems, necessitating considerable time and focus for its practical design and implementation. It stores metadata that outlines the attributes of specimens and retains information pertaining to organisational, technical, and support areas, thereby facilitating efficient access and management.
For the AMUNATCOLL project, the implementation utilised the PostgreSQL server [63]. This system is a free, open-source relational database management solution that prioritises extensibility and SQL compatibility. It encompasses essential functionalities tailored to the project’s requirements, including ACID properties (atomicity, consistency, isolation, durability), automatically refreshed views, foreign key triggers, and stored procedures. PostgreSQL is engineered to accommodate substantial workloads, such as data warehouses or web services that support numerous concurrent users.
It is vital to identify distinct logical areas for storing specific groups of information pertinent to various operational components of the project. Taking the AMUNATCOLL project as an example, three distinct areas have been established: “amunatcoll”, “dlibra”, and “anc_portal”. The “amunatcoll” database is dedicated to the storage of data concerning specimens, featuring tables that include taxonomic names and their synonyms, sample data, bibliographic references for publications, types of specimens, as well as collections and subcollections, along with statistics and historical records of specimen imports. The “dlibra” database serves as the repository for the dLibra digital library [64], tasked with storing data on imported multimedia objects. Within this library, the project utilises data from tables that contain metadata related to multimedia files and publication scans. In this database, attributes and their corresponding values are organised in two columns consistent across all attributes, unlike the “amunatcoll” database, which has dedicated columns for storing parameters separately. The “anc_portal” database manages information related to the portal and its functionalities. It includes tables that hold user data, user permissions, resources generated by users through the mobile application (such as projects, observations, and associated files), additional user-generated resources (including albums, filters, and base maps), as well as information about teams and their members, along with visit statistics.
The most essential elements of the AMUNATCOLL development and demonstration infrastructure and the connections between them are presented below (Figure 6).

7.2. Interoperability

Collections are often stored in different formats and databases in different institutions. Agreement on common standards and formats is required to ensure effective communication between systems.
Another aspect is access to these collections with a global reach, which is most often based on platforms integrating data from different leading institutions: museums, universities, and research institutions.
The complexity of a typical NHC teleinformatic system requires delegating operations to many cooperating modules. The functional assumptions also require a certain openness to cooperation with external applications, such as independent data repositories. These conditions entail the development of appropriate programming interfaces. The backend layer of the system provides access to information stored in the database through a standardised access interface, most often implemented in REST technology [65]. This interface is used by the WWW portal, the mobile application, and other cooperating modules. Access to individual interface methods is secured using JWT tokens [61]. To facilitate access while ensuring security, both a short-term access token and a long-term refresh token are used. This limits the use of the offered functionality to logged-in users, based on their authorisations.
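As an illustration of the access-token/refresh-token scheme, the sketch below uses the PyJWT library. The secret, token lifetimes, and claim names are placeholders, not the values used in AMUNATCOLL, and in a Django deployment this logic would typically sit behind the framework’s authentication layer.

```python
# A minimal sketch of issuing and verifying short-lived access tokens and
# long-lived refresh tokens with PyJWT. Secret and lifetimes are illustrative.
from datetime import datetime, timedelta, timezone
import jwt  # PyJWT

SECRET = "replace-with-a-real-secret"  # placeholder

def issue_tokens(user_id: str) -> dict:
    now = datetime.now(timezone.utc)
    access = jwt.encode(
        {"sub": user_id, "type": "access", "exp": now + timedelta(minutes=15)},
        SECRET, algorithm="HS256")
    refresh = jwt.encode(
        {"sub": user_id, "type": "refresh", "exp": now + timedelta(days=14)},
        SECRET, algorithm="HS256")
    return {"access": access, "refresh": refresh}

def verify(token: str) -> dict:
    # Raises jwt.ExpiredSignatureError / jwt.InvalidTokenError on failure.
    return jwt.decode(token, SECRET, algorithms=["HS256"])

tokens = issue_tokens("user-42")
print(verify(tokens["access"])["sub"])  # -> "user-42"
```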
Selected taxonomic data of specimens from the database are available to external entities. An example of a database on biodiversity to which records will be exported is GBIF. For this purpose, the “BioCASe Provider Software” (BPS) was used—a web service compatible with the “Biological Collection Access Service” [66]. The BioCASe access service is a transnational network of biodiversity repositories. It combines data on specimens from natural history collections, botanical/zoological gardens, and research institutions worldwide with information from large observational databases. In order to ensure efficient data transfer from one database to another, instructions for connecting individual fields, called mapping, must be provided. An example of such a mapping between the ABCD standard (supported by GBIF [67]) and the AMUNATCOLL database is presented in Table 10.
Correct configuration of BPS results in a URL to which queries can be sent according to the BioCASe protocol. In the case of GBIF, data access consists of sending a scan query, which returns a list of specimens in the database, followed by a series of individual queries retrieving the available data for each specimen.

7.3. Security

Designing an NHC system requires awareness of security gaps in individual applications and of methods to prevent them. This involves appropriate programming practices, the consequences of the system’s openness (access via the Internet), and the protection of copyright in the shared materials.

7.3.1. Security Programming Practices

At the beginning of the development of an NHC system, it is worth paying attention to the general recommendations for secure programming in order to take advantage of the benefits they provide while minimising the risk of system compromise, including the leakage of confidential data. The Security Development Lifecycle (SDL) [68] is a systematic approach to integrating security best practices into the software development process. This methodology enables developers to detect security vulnerabilities early in the software development lifecycle, thereby mitigating the security risks associated with software products. Microsoft, the originator of the SDL methodology, reported that its application proved successful and yielded substantial outcomes: notably, a 91% reduction in security vulnerabilities in Microsoft SQL Server 2005 compared to the 2000 version, the last release prior to the adoption of the SDL. The methodology comprises five fundamental phases: requirements, design, implementation, verification, and release. Each phase necessitates specific checks and approvals to guarantee that all security and privacy requirements, as well as best practices, are adequately met. Furthermore, two supplementary phases, concerning training and response, are carried out before and after the core phases to ensure their effective execution.
The implementation phase is particularly significant from the perspective of NHC systems (though not exclusively). During this phase, it is advisable to adhere to three key recommendations regarding tools, hazardous functions, and static analysis. Using only approved tools is linked to the publication of a list of such tools, along with associated security checks, including compiler and linker options and warnings; this approach facilitates the automation of processes and the integration of security practices at minimal cost. A thorough examination of all service-related functions, including API functions, enables dangerous functions to be identified and blocked, thereby minimising potential security vulnerabilities at low engineering expense. Specific measures may involve using header files, updated compilers, or code scanning tools to identify prohibited functions and substitute them with safer alternatives. Additionally, static code analysis should be conducted systematically, which entails reviewing the source code prior to compilation [69]. This method offers a scalable approach to assessing code security and ensures adherence to policies aimed at developing secure code.
Despite the additional cost associated with implementing SDL, it is essential to remember that early identification of vulnerabilities significantly lowers the costs of rectifying them later. Furthermore, SDL promotes adherence to regulatory requirements and industry standards (ISO 27001 [70], NIST [71], PCI-DSS [72], etc.), enhancing overall security and resilience.

7.3.2. Security System Audit

As part of the safety supervision, it is necessary to cooperate with a professional security team, which audits both code that is still in the development phase and code in its final phase. This allows potential threats to be identified and eliminated at an early stage of software development. When performing an audit, it is advisable to follow the guidelines established by reputable security organisations, which direct attention toward specific, prevalent issues. One such organisation is OWASP (Open Web Application Security Project) [73], a global non-profit entity dedicated to enhancing the security of web applications. Functioning as a community of professionals united by a common objective, OWASP produces tools and documentation based on practical experience in web application security. Due to its distinctive attributes, OWASP can offer impartial and actionable insights on web application security to individuals, corporations, academic institutions, government bodies, and various organisations worldwide. Table 11 presents the ten most critical security threats in web applications, which are also of considerable relevance to NHC systems.

7.3.3. Securing Iconographic Data

An essential aspect of NHC development is protecting the intellectual property rights to shared materials created by the hosting institution. Security considerations should therefore include the methodology for securing iconographic data, with particular emphasis on graphic files from scanning specimens and photos presenting observation data. Many methods exist for protecting graphic data using different technological solutions. Those used in the AMUNATCOLL system are presented below; they were selected after a thorough analysis of the benefits and costs of implementation. A decision was made to apply four methods simultaneously: removing external pixels, adding metadata within EXIF data, placing a visible watermark with information about the owner, and adding a digital signature, which is invisible to the user but at the same time provides the strongest protection.

Removing the Outer Pixels Around the Image

This method originated in the insurance industry for works of art, primarily paintings. It involves photographing the work both with and without the frame of the painting. The photo without the frame is not published anywhere, while the photo with the frame can be made publicly available. Any marks on the canvas obscured by the frame make it possible to verify the originality of the work in the event of forgery. Translated into the digital domain, an operational copy is created from the original image by removing several dozen outermost pixels from each edge. The original version of the scan or photo is not published anywhere and is used only for evidentiary purposes or to recreate the operational copy if it is, for example, destroyed or modified without authorisation. Additionally, the image intended for presentation in the portal will have a lower resolution (e.g., an image of 1000 × 1500 pixels created by downsampling an image of 2000 × 3000 pixels). In this way, the owner of the original image will be able to prove that the file intended for display, created by performing the above operations, comes from their database, whereas an entity that does not have the original image will not be able to prove the image’s origin. The time consumption of this method is low.
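A possible implementation of the cropping step with Pillow is sketched below; the margin width, file names, and downsampling factor are illustrative assumptions.

```python
# A minimal sketch of producing an operational copy by trimming the outer
# pixels of the master scan (Pillow). Margin and paths are illustrative.
from PIL import Image

MARGIN = 40  # number of pixels removed from each edge (assumption)

with Image.open("scan_master.tif") as master:
    width, height = master.size
    operational = master.crop((MARGIN, MARGIN, width - MARGIN, height - MARGIN))
    # Optionally downsample the derivative intended for the portal.
    operational = operational.resize((operational.width // 2, operational.height // 2))
    operational.save("scan_operational.tif")
```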

Image Metadata

The EXIF metadata standard [74] outlines a framework for describing graphic files, enabling the inclusion of details such as the photographer’s name, a description, and the geographical location. When a resource is uploaded to the server, the script processes it by eliminating unnecessary parameters or incorporating new ones based on the user’s selected preferences (for instance, photo location, camera model, author, description, etc.).
Like any metadata storage standard, EXIF is subject to modification. Securing resources at the metadata level does not provide robust protection, as this information is not only accessible to users but can also be easily altered by them. Consequently, relying solely on metadata for security purposes does not effectively prevent attempts to misappropriate intellectual property; it is essential to highlight that no reliable method restricts user access to metadata. Nevertheless, this approach can be utilised for internal processes such as data sorting, aggregation, and resource description. Examples of fields from the EXIF standard that can be employed to indicate copyright are illustrated in Table 12. The advantages of this method are its simplicity of implementation and low time consumption.
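For illustration, copyright-related EXIF fields can be written with Pillow. The tag values and file names below are placeholders, not the descriptions actually embedded by the project.

```python
# A minimal sketch of writing copyright-related EXIF fields with Pillow.
# Tag values and file names are placeholders.
from PIL import Image

COPYRIGHT_TAG = 0x8298  # EXIF "Copyright" tag
ARTIST_TAG = 0x013B     # EXIF "Artist" tag

with Image.open("scan_operational.jpg") as img:
    exif = img.getexif()
    exif[COPYRIGHT_TAG] = "2019 (c) ANC"                    # owner statement (example)
    exif[ARTIST_TAG] = "AMUNATCOLL digitisation team"       # placeholder value
    img.save("scan_with_exif.jpg", exif=exif)
```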

Watermarks

Watermarks serve as a robust means of protecting visual content, owing to their visibility and the challenges associated with their removal from an image without substantial alterations. This technique is straightforward to implement. To maximise their effectiveness, watermarks should occupy at least 40% of the central area of the image, thereby significantly hindering attempts to eliminate them. The design of the watermark should distinctly represent the owner of the content. However, a notable drawback of this approach is its potential to interfere with the original image, which may obscure critical details. When dealing with a large volume of files and varying object placements, it is often impractical to customise the watermark’s position and style (including font and colour) for each image. Therefore, it is advisable to establish specific guidelines for automatically applying watermarks following a thorough analysis.
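A simple automatic watermark can be applied with Pillow as sketched below; the text, opacity, and placement rule are assumptions rather than the guidelines adopted in AMUNATCOLL.

```python
# A minimal sketch of stamping a semi-transparent text watermark near the
# centre of an image with Pillow. Text, font, and opacity are illustrative.
from PIL import Image, ImageDraw, ImageFont

with Image.open("scan_operational.jpg") as img:
    base = img.convert("RGBA")
    overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)

    font = ImageFont.load_default()
    text = "(c) ANC / AMUNATCOLL"              # owner statement (example)
    x, y = base.width // 4, base.height // 2   # roughly central placement
    draw.text((x, y), text, fill=(255, 255, 255, 96), font=font)  # partly transparent

    watermarked = Image.alpha_composite(base, overlay).convert("RGB")
    watermarked.save("scan_watermarked.jpg")
```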

Digital Signature

A more sophisticated image protection method is adding a CGH (Computer Generated Hologram) [75] digital signature, i.e., a hologram invisible to the naked eye. There are various methods of adding holograms. One of the most popular is to perform lossy compression of the image and, in place of some of the information responsible for the colour, depth of focus, and saturation of the image, insert hidden information about the owner of the resource (e.g., “2019©ANC”) encoded in binary using Base64 (Figure 7). Reading the hologram requires specialised software that checks the checksum; however, the information read from the hologram constitutes proof of copyright ownership if the resource is used by an unauthorised party. The disadvantages of using holograms are the additional computing power required on the server, the costs associated with implementing the method (a specialised program cooperating with the rest of the system must be created), and a slight loss of quality of the resources.
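The CGH approach relies on specialised software; as a simplified stand-in, the sketch below hides a Base64-encoded owner string in the least-significant bits of the image. It illustrates the general idea of an invisible mark but is not the hologram method used in AMUNATCOLL, and it requires a lossless output format to survive.

```python
# A simplified least-significant-bit (LSB) embedding sketch: hides a
# Base64-encoded owner string in the blue channel of an RGB image.
# This illustrates invisible marking; it is not the CGH method itself.
import base64
from PIL import Image

def embed(path_in: str, path_out: str, owner: str) -> None:
    payload = base64.b64encode(owner.encode()) + b"\0"   # terminator byte
    bits = [(byte >> i) & 1 for byte in payload for i in range(7, -1, -1)]

    img = Image.open(path_in).convert("RGB")
    pixels = list(img.getdata())
    if len(bits) > len(pixels):
        raise ValueError("image too small for payload")

    stamped = [
        (r, g, (b & ~1) | bits[i]) if i < len(bits) else (r, g, b)
        for i, (r, g, b) in enumerate(pixels)
    ]
    img.putdata(stamped)
    img.save(path_out, format="PNG")  # lossless format preserves the LSBs

embed("scan_operational.png", "scan_marked.png", "2019©ANC")
```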

8. Main Aspects and Efforts of the System Implementation

To perform a comprehensive requirements analysis, it is essential to identify the most important aspects and related challenges that will serve as the foundation for developing the NHC system, while also delineating its complexity boundaries. Many of these aspects have already been discussed in this paper, and references to the relevant sections are provided in their descriptions; they are included here for completeness. In the second part of this chapter, the authors estimate the effort required to implement the areas discussed.

8.1. Aspects of Consideration Under the NHC System Implementation

A descriptive list of aspects is provided below; these can be regarded as preliminary non-functional requirements, accompanied by an indication of their optionality (R—required, O—optional) and a note clarifying their scope. The issues are divided into domains according to the scheme proposed above (digitisation, design, and technology) and at the same time constitute the source material for the considerations conducted in the requirements domain.

8.1.1. Selected Digitisation Aspects

Data processing flow (R). The data flow diagram defines the tasks, responsibilities, and resources used to conduct the digitisation process and achieve the highest possible efficiency. See more in Section 5.1.
Forms (R). Forms are used to prepare a metadata description according to an accepted structure, avoiding inconsistencies between data entered by different people. They are usually created as a spreadsheet file. The file contains all the necessary attributes according to the metadata specification, divided into sheets covering, e.g., taxa, samples, iconography, and bibliographic entries. See more in Section 5.4.
Digitisation supporting tools (R). A collection of instruments designed to facilitate the digitisation of specimens throughout different phases of their preparation. The validator enables the verification of the description’s adherence to established guidelines, while the converter aids in transforming existing specimen descriptions into a new format. The aggregator merges files containing descriptions created by various teams, and the report generator compiles the total number of records in the description files. This tool is particularly beneficial for project coordinators in their efforts to monitor the advancement of digitisation activities continuously. See more in Section 5.4.
Image converter and securer (R—in terms of converting). A tool whose task is to convert a graphic file (scan, photo) to the format required by modules for sharing and presentation in the final user interfaces. Optionally, the tool includes a set of functions for securing iconography to protect copyright. See more in Section 5.4.
Georeferencing (R). All records were geotagged based on the textual descriptions of the specimens’ locations, such as those found on herbarium sheets. This geotagging facilitates analysis through the Geographic Information System, enabling the examination of digitised records and the identification of spatial relationships. There are different quality classes of geotagging records:
  • exact coordinates—the precise location of the object was determined in the geotagging process;
  • approximate coordinates—the location of the object is approximately determined, and the geographic coordinates indicate the centroid of the area that could be assigned to the specimen;
  • approximate coordinates due to legal protection—the specimen is legally protected in accordance with national law, and the user does not have sufficient permissions to view the exact coordinates of such specimens;
  • unspecified coordinates—the specimen has not been geotagged.
Quality classes are presented in the details of the record (specimen, sample, multimedia material). They can also be used in the specimen search form in the specialist extended or advanced search engine.

8.1.2. Selected Design Aspects

Metadata (R). A set of metadata used to register unique resources of natural collections gathered by the housing organisation. Given the high complexity of the database, this description allows for unambiguous, compliant information entry with international standards. See more in Section 6.1.
Browsing (R). Data browsing should be available in a general view from the main page for both logged-in and non-logged-in users. However, the scope of information received may be varied (non-logged-in users see only part of the information about the specimens). In addition to the general view, the ability to browse data profiled, adapted to the needs of different user groups and their level of advancement, may be optionally provided.
Search engine(s) (R). Search engines serve as fundamental instruments for examining the amassed data on specimens. It is advisable to develop search engines tailored to specific target audiences, providing varying levels of complexity and functionality. A potential categorisation includes search engines such as general, specialised, collections, systematic groups, samples, multimedia, bibliography, and educational. For instance, a multimedia search engine facilitates access to information presented in formats such as images, videos, maps, or audio recordings that are either related to the stored specimens (for example, a scan of herbarium) or independent of them (such as a photograph of a habitat), thereby enabling access to a multimedia database that complements the archived specimens.
Profiled data presentation (O). Adjusting the portal view to the target group according to their knowledge and interests. Adjustment concerns the advancement of the search engine (complexity of queries), the language of phrases during the search or the scope of presented data. The view can be set depending on individual preferences. For more, please refer to Section 6.2.
Graphic design (R). Meeting WCAG requirements concerns the access interfaces on which users operate, i.e., the portal and the mobile application; implementation of WCAG principles is legally required. Effective UX (User Experience) and UI (User Interface) design allows users to locate the content they expect easily. One can argue that users’ online behaviours and preferences are quite similar to those of customers in a retail environment. In accordance with UX principles, it is essential to anticipate the actions users will take to access features or information, commonly referred to as the user flow. More in Section 6.4.
Administration module (R—scope for discussion). Composed of account management, permission levels, profile settings, reports and statistics, protected taxa, protected areas, user role management, file reports, file sources, team leaders, editor tools, Excel file tools, task history, and database change history.
Edit and correction (R—optional at portal level). Available at the portal level, providing the possibility of correcting data describing natural collections in two modes:
(a) editing a single record—in the details view of a given record (specimen, sample, bibliography, iconography, or multimedia), an authorised person can switch the details view to editing mode. It is then possible to correct all fields of the record and save it; changes are visible immediately;
(b) group editing of many taxonomic records—often, the same attribute value must be corrected for many records simultaneously. Such a change is possible from the level of the statistical tools (reports): search results can be grouped according to a selected field, and the value of the chosen field can then be changed for all records in this group. For more, see Section 6.3.
Data access rules (O). Allows access to sensitive information to be limited in order to protect biodiversity, especially legally protected species. Access to sensitive information requires appropriate authorisation. Data are protected at two levels: specimen (complete information about the specimen) and field (a specific attribute of the specimen may be restricted). More in Section 6.2.
Data exporter (O). This feature can be utilised when displaying search result data, contingent upon acquiring the necessary permissions. It enables users to download the retrieved data as a spreadsheet that adheres to the specified format. The data obtained through this method can subsequently be analysed using external analytical tools.
Data analytics (reports) (O). A tool designed to produce statistical reports utilising data from specimens stored in the database. This reporting tool enables the categorisation of data based on one or multiple parameters. It facilitates the creation of graphs that illustrate the proportion of a specific group of specimens within the results generated by the query. It serves as an invaluable resource for understanding the contents of collections, including identifying discrepancies in metadata descriptions, such as the presence of species in regions where they are not expected to occur.
Specimen comparison (O). The functionality allows for the comparison of two or more specimens (depending on the scheme used), a feature especially desired by scientists conducting in-depth analyses.
Support for team collaboration (O). A group of users working on the same topic can create a team and share their observations with the team supervisor. This feature can be used by scientists, e.g., while working in the field, as well as by schoolteachers.
Preparation of educational materials (O). Specimens, photos, and observations can be added to user-defined albums and complemented with custom descriptions. Such albums can then be shared with other users or exported as a PDF presentation.
Spatial analysis BioGIS (GIS tools) (O). The BioGeographic Information System (BioGIS) serves as a tool for the entry, collection, processing, and visualisation of geographic data. It has become extensively utilised in scientific research and decision-making activities, particularly in the field of biodiversity studies. It allows for the presentation of phenomena in space on different types of maps, e.g., dot distribution map, area class map, choropleth map, diagram map, cluster map, attribute grouping map, and time-lapse map.
Information section (O). A section at a portal containing the following information: mission, portal (info on offering), mobile application, BioGIS, our users (info on target groups), about us, how to use (guidelines) and contact.
Mobile application (O). The mobile application facilitates the documentation of natural observations through text descriptions, photographs, and audio recordings. Observations are exported to a database and become accessible from the portal, which in turn allows them to be edited and used with existing analytical and georeferencing tools. The observation form is equipped with a set of predefined fields as well as customisable open fields that users can define according to their preferences. The predefined fields encompass ordinal data, identification details of the observation (such as number, date, and author), geographical coordinates of the observation site, area size, and vegetation coverage. This allows for greater engagement of target groups, but also for reaching external users who use the NHC system to create their own collections of specimens and other natural observations.
API (R). The API is used to provide interested parties with automatic access to the contents of the database. Access to the NHC system occurs through an interface implemented in REST (REpresentational State Transfer) technology. Since the portal and the mobile application use the same programming interface, its implementation is obligatory. Most functionalities are available only to the logged-in user; therefore, access to individual interface methods is secured with JSON Web Token.
Interoperability (O). This functionality enables providing specific information from the taxonomic database to external organisations, facilitating database integration. For instance, to support integration with GBIF, the “BioCASe Provider Software” (BPS) service was developed, which is compatible with the “Biological Collection Access Service.” This global network of biodiversity repositories amalgamates data on specimens from wildlife collections, botanical gardens, zoos, and various research institutions worldwide, alongside information from extensive observational databases. Check Section 7.2 for more details.
Iconography library (R). A library for sharing digital objects and documents used to import and store multimedia. It should include a range of features that make it easier to enter, manage and use digital assets, such as serving images in the required resolution and storing related metadata to make it easier to find the desired content.
Backup operational procedures (R). A set of procedures and mechanisms focused on archiving and restoring data after a failure.
Monitoring (O). Ensuring the continuity of the system’s operation and the security of the stored data is extremely important due to the access provided to the data and services offered. Monitoring covers a defined list of key services checked at a given time interval. Information about the unavailability of services is sent by email to a designated person or group of people, allowing immediate intervention by service administrators and restoration of operation.

8.1.3. Selected Technology Aspects

Data buffer (R). A designated place in the storage system allowing for saving and storing data after the digitisation process and before the actual import into the database. Please refer to Section 7.1.2 for more information.
Data storage (R). Space on array disks or tape systems that allow storing source data and processed data along with their copies safely after the database import process. Disk arrays also serve data to design applications (portal and mobile application). Please refer to Section 7.1.2 for more information.
Database (R). The database is a core of the NHC system, the content of which is based primarily on collections of biological specimens and locations of photographs and published or previously unpublished field observations. In addition, its structure should correspond to the set of metadata defined by international standards (e.g., ABCD) and the requirements of the target groups. More in Section 7.1.3.
Database for users’ content (O). It enables storing information independent of the data that make up the Natural History Collections Database and is available only to a given user and authorised persons. It mainly concerns sections like my observations, my albums, my maps, and my teams.
Virtual servers (R). The infrastructure has been divided into a development and production part to provide customers with only ready-made and tested solutions that meet the expected requirements. Each new functionality is subjected to a validation procedure, and only after its positive result is it made available to the end customer. More in Section 7.1.3.
Authentication and authorisation service (R). A service that allows for user authentication and authorisation. Authentication verifies the user’s identity using credentials such as passwords, biometrics, or third-party authentication providers. Authorisation (access control) controls what authenticated users can access.
External authentication service (O). Enabling login using existing user credentials from platforms such as Apple, Google, or Facebook improves the user experience and reduces the need to remember multiple usernames and passwords.
Safe programming recommendations (O). Implementation of general recommendations for secure programming during code implementation in accordance with the SDLC (Software Development LifeCycle) concept [76]. Application of detailed security recommendations for programming languages used in project creation. More in Section 7.3.1.
Security audit (R). The need to conduct an audit by a qualified team of experts in at least two stages: halfway through the project (early detection of weaknesses) and at its final phase (verification of the implementation of previous recommendations and final check of the system’s vulnerabilities). Particular attention should be paid to issues related to user password management, content serving via the web server, configuring the server itself, and user registration. More in Section 7.3.2.
Intellectual Property Rights (IPR) (R). Preserving IPR to published materials. Security issues include the methodology of securing iconographic data, with particular emphasis on graphic files from both the specimen scanning process and photographs presenting observation data. Additionally, they are related to the proper way of citing the materials used by external scientists. More information in Section 7.3.3.

8.2. Estimated Efforts of Implementation

The cost of implementing the above-mentioned areas varies and depends on the emphasis placed on individual system features. An important question when defining the requirements and the design and implementation scope of the NHC system, taking into account the optionality of selected modules and the available human and time resources, is therefore how the decision to include individual areas translates into the distribution of the available resources. Figure 8 shows the estimated share of individual work areas (assuming the implementation of all of the above) in the overall scope of work, based on experience from the AMUNATCOLL project.

9. Discussion

The digitisation of fauna and flora collections is revolutionizing the way biodiversity is studied, protected, and assessed, which has a fundamental impact on understanding the processes of ongoing changes. The process of implementing the NHC system is associated with many challenges at the design and development stages in order to provide the functionality required by end users [77]. This is not an easy task considering that users may come from different target groups and therefore their expectations most often differ significantly. Considering the above, the emerging IT challenges are related to adequately addressing the overarching goal, which is the requirements of local research groups resulting from the specificity of the research conducted. This has an impact on both the scope of stored data and the tools for its analysis and presentation.
It is important to note that even the most extensive collection held by a single institution does not encompass the entirety of knowledge within a specific field. This reality necessitates collaboration with other collections and systems to ensure the reliability of research. Consequently, when developing a Natural History Collection (NHC) system, it is essential to consider the requirement for interoperability. This entails adhering to standards in specimen descriptions, ensuring proper metadata storage, implementing an API interface, and utilizing commonly accepted communication protocols. The push for interoperability among Natural History Collection systems and other biodiversity data infrastructures is exemplified by Essential Biodiversity Variables (EBVs). These variables act as a link between raw biodiversity data and high-level policy indicators, such as those employed by the Convention on Biological Diversity (CBD) [78]. By standardizing biodiversity monitoring across various regions and taxa, EBVs facilitate the understanding and prediction of changes in biodiversity, thereby aiding in the formulation of national and global policy decisions regarding biodiversity conservation.
The next step towards connecting systems is taken with the Bari Manifesto [20], which provides a comprehensive framework for interoperability and best practices in data management. This entails building systems using open standards, APIs, and agreed data formats to enable platform integration and reuse. It is also assumed that the system should prioritise usability, accessibility, and the real needs of end users (e.g., citizens, researchers, decision-makers). To maintain transparency and accountability, action logging, auditability, and explainability of system decisions are mandatory (primarily if artificial intelligence is used). Community involvement is also important, supported through collaboration tools and mechanisms for data entry by interested parties. With regard to the system architecture, long-term sustainable development is assumed through modularity and scalability, which allow the system to be expanded or modernised when necessary.
To summarise, the stage of requirements analysis and design of the NHC system should reflect not only the need of the local scientific community associated with the collections of specimens, but also a number of requirements related to effective connection with external systems.
This work results in the identification of four categories of IT challenges associated with the development of the NHC system, which facilitates the online sharing of digitised objects. For each category, specific solutions were examined and proposed for a distinct set of issues likely to arise in most designed systems. The challenge categories are presented in the sequence they are typically encountered in the conventional process of designing and developing an IT system, beginning with requirement definition, progressing through the digitisation phase, user interface design, and concluding with infrastructure and security considerations. Additionally, the challenges analysed include those necessary for ensuring the NHC system’s interoperability on a global scale. It is important to acknowledge that this work does not cover all potential difficulties that may arise; however, the list provided serves as a resource for often overlooked challenges that can influence the approach and perspective during the implementation of system modules.
It should also be noted that the presented considerations concern the creation of new systems, whose undoubted advantage is the use of the latest programming and infrastructure achievements. In the context of global interoperability, however, we should not overlook collection databases that have been in operation for some time and hold many unique specimens that cannot be ignored. Interaction with such systems often encounters problems that cannot be easily solved. Examples include attempts to connect older-generation systems that use incompatible formats, or simply the lack of appropriate digital infrastructure. Another aspect is the quality and completeness of data, where inconsistent or missing metadata can limit their usability. Taxonomic discrepancies are also a problem: synonyms, outdated names, or conflicting classifications make integration difficult.

10. Summary

This paper presents the most essential aspects related to planning and building online natural collection systems. Addressing such challenges requires an interdisciplinary approach, in which IT specialists, scientists, and stakeholders from many disciplines participate. Building an IT system starts with a thorough analysis of the goals to be achieved, considering the scope of work and production capabilities. In the case of natural collection systems, it is crucial to consider to whom such a system is addressed. Such a definition determines the overall shape of the system, influencing the scope of collected data (metadata), functions offered, presentation layer, and many others. Although the work focuses mainly on the technical aspects of implementing online systems for natural history collections, the tasks related to maintaining these systems after they have been made available cannot be omitted.
One significant aspect to consider is the challenge of sustainable development and ongoing maintenance. The initiation and upkeep of systems necessitate a continuous allocation of financial and human resources. These platforms require regular updates, not only in terms of new software versions that underpin their architecture but also for security enhancements. Such activities impose a considerable burden; thus, securing a steady stream of financial resources for the upkeep of IT infrastructure should be a critical topic of discussion before the final implementation version is delivered.
Additionally, it is essential to emphasise the importance of supporting the system’s ongoing development, with institutional collaboration being a fundamental component. Partnerships among museums, universities, and governmental organisations facilitate the broadening of available data, often introducing new functionalities. This collaboration is inherently linked to the need for coordinated efforts and ensuring compatibility among institutions, including data formats and communication protocols, which presents another technological and logistical challenge. The outcome of such initiatives often involves working within distributed programming teams, where communication and coordination can become more complex, thereby impacting development timelines and decision-making processes. Nevertheless, it is crucial to recognise that each advancement in gathering knowledge about natural collections brings us closer to establishing an integrated knowledge system regarding biodiversity. This, in turn, will empower us to manage our most precious resource—nature—consciously and judiciously.

Author Contributions

Conceptualisation, M.L.; methodology, M.L.; validation, M.L., P.W.; formal analysis, M.L.; investigation, M.L., P.W.; writing—original draft preparation, M.L., P.W.; writing—review and editing, M.L., P.W.; visualisation, M.L.; supervision, M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by the AMUNATCOLL project and has been partly funded by the European Union and Ministry of Digital Affairs from the European Regional Development Fund as part of the Digital Poland Operational Program under grant agreement number: POPC.02.03.01-00-0043/18. This paper expresses the opinions of the authors and not necessarily those of the European Commission and Ministry of Digital Affairs. The European Commission and Ministry of Digital Affairs are not liable for any use that may be made of the information contained in this paper.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Acknowledgments

The authors wish to extend their sincere appreciation to Bogdan Jackowiak for his invaluable feedback on the manuscript, which greatly enhanced its quality. They are also grateful for his unwavering support and encouragement during the entire preparation of this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ABCD	Access to Biological Collection Data
ACID	Atomicity, Consistency, Isolation, Durability
AMUNATCOLL	Adam Mickiewicz University Nature Collections
API	Application Programming Interface
BPS	BioCASe Provider Software
BioGIS	BioGeographic Information System
CGH	Computer Generated Hologram
CI/CD	Continuous Integration/Continuous Delivery
DMP	Data Management Plan
EBV	Essential Biodiversity Variables
EXIF	Exchangeable Image File Format
FAIR	Findable, Accessible, Interoperable and Reusable
GBIF	Global Biodiversity Information Facility
GDPR	General Data Protection Regulation
GIS	Geographic Information System
HDD	Hard Disk Drive
IPR	Intellectual Property Rights
ISO	International Organisation for Standardisation
IT	Information Technology
JWT	JSON Web Token
NHC	Natural History Collection
OWASP	Open Web Application Security Project
PCI DSS	PCI Data Security Standard
RAID	Redundant Array of Independent Disks
REST	REpresentational State Transfer
SDLC	Software Development LifeCycle
SDL	Security Development Lifecycle
SEO	Search Engine Optimisation
SSD	Solid State Drive
SSRF	Server-Side Request Forgery
TIFF	Tag Image File Format
UX	User Experience
UI	User Interface
WCAG	Web Content Accessibility Guidelines

References

  1. Page, L.M.; MacFadden, B.J.; Fortes, J.A.; Soltis, P.S.; Riccardi, G. Digitization of biodiversity collections reveals biggest data on biodiversity. BioScience 2015, 65, 841–842. [Google Scholar] [CrossRef]
  2. Soltis, P.S.; Nelson, G.; James, S.A. Green digitization: Online botanical collections data answering real-world questions. Appl. Plant Sci. 2018, 6, e1028. [Google Scholar] [CrossRef]
  3. Corlett, R.T. Achieving zero extinction for land plants. Trends Plant Sci. 2023, 28, 913–923. [Google Scholar] [CrossRef]
  4. Pereira, H.M.; Ferrier, S.; Walters, M.; Geller, G.N.; Jongman, R.H.; Scholes, R.J.; Bruford, M.W.; Brummitt, N.; Butchart, S.H.; Cardoso, A.C.; et al. Essential Biodiversity Variables. Science 2013, 339, 277–278. [Google Scholar] [CrossRef]
  5. Jackowiak, B.; Błoszyk, J.; Celka, Z.; Konwerski, S.; Szkudlarz, P.; Wiland-Szymańska, J. Digitization and online access to data on natural history collections of Adam Mickiewicz University in Poznan: Assumptions and implementation of the AMUNATCOLL project. Biodivers. Res. Conserv. 2022, 65, 23–34. [Google Scholar] [CrossRef]
  6. KEW. KEW Data Portal. 2025. Available online: https://data.kew.org/?lang=en-US (accessed on 25 May 2025).
  7. MNP. Muséum National d’Histoire Naturelle in Paris—Collections. 2025. Available online: https://www.mnhn.fr/en/databases (accessed on 25 May 2025).
  8. PVM. Plantes Vasculaires at Muséum National d’Histoire Naturelle. 2025. Available online: https://www.mnhn.fr/fr/collections/ensembles-collections/botanique/plantes-vasculaires (accessed on 25 May 2025).
  9. TRO. Tropicos Database. 2025. Available online: http://www.tropicos.org/ (accessed on 25 May 2025).
  10. NDP. Natural History Museum Data Portal. 2025. Available online: https://data.nhm.ac.uk/about (accessed on 25 May 2025).
  11. SMI. Smithsonian National Museum of Natural History 2025. Available online: https://collections.nmnh.si.edu/search/ (accessed on 25 May 2025).
  12. Nowak, M.M.; Lawenda, M.; Wolniewicz, P.; Urbaniak, M.; Jackowiak, B. The Adam Mickiewicz University Nature Collections IT system (AMUNATCOLL): Portal, mobile application and graphical interface. Biodivers. Res. Conserv. 2022, 65, 49–67. [Google Scholar] [CrossRef]
  13. Schmeller, D.S.; Weatherdon, L.V.; Loyau, A.; Bondeau, A.; Brotons, L.; Brummitt, N.; Geijzendorffer, I.R.; Haase, P.; Kuemmerlen, M.; Martin, C.S.; et al. A suite of essential biodiversity variables for detecting critical biodiversity change. Biol. Rev. 2018, 93, 55–71. [Google Scholar] [CrossRef] [PubMed]
  14. Jetz, W.; McGeoch, M.A.; Guralnick, R.; Ferrier, S.; Beck, J.; Costello, M.J.; Fernandez, M.; Geller, G.N.; Keil, P.; Merow, C.; et al. Essential biodiversity variables for mapping and monitoring species populations. Nat. Ecol. Evol. 2019, 3, 539–551. [Google Scholar] [CrossRef]
  15. Kissling, W.D.; Ahumada, J.A.; Bowser, A.; Fernandez, M.; Fernández, N.; García, E.A.; Guralnick, R.P.; Isaac, N.J.; Kelling, S.; Los, W.; et al. Building essential biodiversity variables (EBVs) of species distribution and abundance at a global scale. Biol. Rev. 2018, 93, 600–625. [Google Scholar] [CrossRef]
  16. Apache Cassandra. 2025. Available online: https://cassandra.apache.org/_/index.html (accessed on 25 May 2025).
  17. Apache Hadoop. 2025. Available online: https://hadoop.apache.org/ (accessed on 25 May 2025).
  18. DwC. Darwin Core Standard 2025. Available online: https://dwc.tdwg.org/ (accessed on 25 May 2025).
  19. Open Geospatial Consortium 2025. Available online: https://www.ogc.org/standards/ (accessed on 25 May 2025).
  20. Hardisty, A.R.; Michener, W.K.; Agosti, D.; García, E.A.; Bastin, L.; Belbin, L.; Bowser, A.; Buttigieg, P.L.; Canhos, D.A.; Egloff, W.; et al. The Bari Manifesto: An interoperability framework for essential biodiversity variables. Ecol. Inform. 2019, 49, 22–31. [Google Scholar] [CrossRef]
  21. Lawenda, M.; Wiland-Szymańska, J.; Nowak, M.M.; Jędrasiak, D.; Jackowiak, B. The Adam Mickiewicz University Nature Collections IT system (AMUNATCOLL): Metadata structure, database and operational procedures. Biodivers. Res. Conserv. 2022, 65, 35–48. [Google Scholar] [CrossRef]
  22. Gadelha, L.M.R., Jr.; de Siracusa, P.C.; Dalcin, E.C. A survey of biodiversity informatics: Concepts, practices, and challenges. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2021, 11, e1394. [Google Scholar] [CrossRef]
  23. Feest, A.; Van Swaay, C.; Aldred, T.D.; Jedamzik, K. The biodiversity quality of butterfly sites: A metadata assessment. Ecol. Indic. 2011, 11, 669–675. [Google Scholar] [CrossRef]
  24. Walls, R.L.; Deck, J.; Guralnick, R.; Baskauf, S.; Beaman, R.; Blum, S.; Bowers, S.; Buttigieg, P.L.; Davies, N.; Endresen, D.; et al. Semantics in Support of Biodiversity Knowledge Discovery: An Introduction to the Biological Collections Ontology and Related Ontologies. PLoS ONE 2014, 9, e89606. [Google Scholar] [CrossRef]
  25. da Silva, J.R.; Castro, J.A.; Ribeiro, C.; Honrado, J.; Lomba, Â.; Gonçalves, J. Beyond INSPIRE: An Ontology for Biodiversity Metadata Records. OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”; Springer: Berlin, Germany, 2014. [Google Scholar]
  26. ANC. AMUNatColl Project. 2025. Available online: http://anc.amu.edu.pl/eng/index.php (accessed on 25 May 2025).
  27. FBAMU. Faculty of Biology of the Adam Mickiewicz University in Poznań. 2025. Available online: http://biologia.amu.edu.pl/ (accessed on 25 May 2025).
  28. PSNC. Poznan Supercomputing and Networking Center. 2025. Available online: https://www.psnc.pl/ (accessed on 25 May 2025).
  29. ANCPortal. AMUNATCOLL Portal. 2025. Available online: https://amunatcoll.pl/ (accessed on 25 May 2025).
  30. AMUNATCOLL, Android 2025; Faculty of Biology of Adam Mickiewicz University in Poznan: Poznan, Poland.
  31. AMUNATCOLL, iOS 2021; Faculty of Biology of Adam Mickiewicz University in Poznan: Poznan, Poland.
  32. Seafile. Open-Source File Sync and Share Software; Seafile Ltd.: Beijing, China; Available online: https://www.seafile.com/en/home/ (accessed on 25 May 2025).
  33. ISO 8601 Date and Time Format. International Organization for Standardization, 2017. Available online: https://www.iso.org/iso-8601-date-and-time-format.html (accessed on 25 May 2025).
  34. RFC 3339: Date and Time on the Internet: Timestamps. 2002. Available online: https://www.rfc-editor.org/rfc/rfc3339.html (accessed on 25 May 2025).
  35. Guo, N.; Xiong, W.; Wu, Q.; Jing, N. An Efficient Tile-Pyramids Building Method for Fast Visualization of Massive Geospatial Raster Datasets. Adv. Electr. Comput. Eng. 2016, 16, 3–8. [Google Scholar] [CrossRef]
  36. BIS. Biodiversity Information Standards. 2025. Available online: https://www.tdwg.org/ (accessed on 25 May 2025).
  37. DCMI. The Dublin Core™ Metadata Initiative. 2025. Available online: https://www.dublincore.org/ (accessed on 25 May 2025).
  38. Michener, W.K.; Jones, M.B. Ecoinformatics: Supporting ecology as a data-intensive science. Trends Ecol. Evol. 2012, 27, 85–93. [Google Scholar] [CrossRef]
  39. Kacprzak, E.; Koesten, L.; Ibáñez, L.D.; Blount, T.; Tennison, J.; Simperl, E. Characterising dataset search—An analysis of search logs and data requests. J. Web Semant. 2018, 55, 37–55. [Google Scholar] [CrossRef]
  40. Löffler, F.; Wesp, V.; König-Ries, B.; Klan, F. Dataset search in biodiversity research: Do metadata in data repositories reflect scholarly information needs? PLoS ONE 2021, 16, e0246099. [Google Scholar] [CrossRef]
  41. ABCD. Access to Biological Collections Data Standard. 2025. Available online: https://www.tdwg.org/standards/abcd/ (accessed on 25 May 2025).
  42. Krug, S. Don’t Make Me Think! Web & Mobile Usability; MITP-Verlags GmbH & Co. KG: Frechen, Germany, 2018. [Google Scholar]
  43. Web Content Accessibility Guidelines (WCAG) 2.1. 2025. Available online: https://www.w3.org/TR/WCAG21/ (accessed on 25 May 2025).
  44. EUDirective2019-882. European Accessibility Act (EAA)—Directive (EU) 2019/882. 2019. Available online: https://eur-lex.europa.eu/eli/dir/2019/882/oj/eng (accessed on 25 May 2025).
  45. EUDirective2016-2102. Web Accessibility Directive—Directive (EU) 2016/2102. 2016. Available online: https://eur-lex.europa.eu/eli/dir/2016/2102/oj/eng (accessed on 25 May 2025).
  46. EN301549. European Standard EN 301 549. 2021. Available online: https://www.etsi.org/deliver/etsi_en/301500_301599/301549/03.02.01_60/en_301549v030201p.pdf (accessed on 25 May 2025).
  47. GBIF. Global Biodiversity Information Facility. 2025. Available online: https://www.gbif.org/en/ (accessed on 25 May 2025).
  48. Liu, Q.; Xing, L. Reliability Modeling of Cloud-RAID-6 Storage System. Int. J. Future Comput. Commun. 2015, 4, 415–420. [Google Scholar] [CrossRef]
  49. ISOUNITS. Standards by ISO/TC 12 Quantities and Units. 2025. Available online: https://www.iso.org/committee/46202/x/catalogue/ (accessed on 25 May 2025).
  50. ByteUnits. Wikipedia—Multiple-Byte Units. 2025. Available online: https://en.wikipedia.org/wiki/Byte#Multiple-byte_units (accessed on 25 May 2025).
  51. Seagate. Why Does my Hard Drive Report Less Capacity than Indicated on the Drive’s Label? 2025. Available online: https://www.seagate.com/gb/en/support/kb/why-does-my-hard-drive-report-less-capacity-than-indicated-on-the-drives-label-172191en/ (accessed on 25 May 2025).
  52. Mazrae, P.R.; Mens, T.; Golzadeh, M.; Decan, A. On the usage, co-usage and migration of CI/CD tools: A qualitative analysis. Empir. Softw. Eng. 2023, 28, 52. [Google Scholar] [CrossRef]
  53. PYT. Python Language. 2025. Available online: https://www.python.org/ (accessed on 25 May 2025).
  54. DJA. Django Framework. 2025. Available online: https://www.djangoproject.com/ (accessed on 25 May 2025).
  55. PAN. Pandas—Data Analysis Tool. 2025. Available online: https://pandas.pydata.org/ (accessed on 25 May 2025).
  56. GUN. Gunicorn—Python Web Server. 2025. Available online: https://gunicorn.org/ (accessed on 25 May 2025).
  57. REA. React—JavaScript Library. 2025. Available online: https://reactjs.org/ (accessed on 25 May 2025).
  58. GEO. GeoJSON Geographic Data Structures Encoding Format. 2025. Available online: https://geojson.org/ (accessed on 25 May 2025).
  59. LEA. Leaflet—Library for Mobile-Friendly Interactive Maps. 2025. Available online: https://leafletjs.com/ (accessed on 25 May 2025).
  60. PIX. PixiJS—Advanced Text Rendering Graphic. 2025. Available online: https://pixijs.com/ (accessed on 25 May 2025).
  61. JWT. JSON Web Tokens. 2025. Available online: https://jwt.io/ (accessed on 25 May 2025).
  62. Guo, F.; Chen, C.; Li, K. Research on Zabbix Monitoring System for Large-scale Smart Campus Network from a Distributed Perspective. J. Electr. Syst. 2024, 20, 631–648. [Google Scholar]
  63. PostgreSQL. PostgreSQL Database Web Site. 2025. Available online: https://www.postgresql.org/ (accessed on 25 May 2025).
  64. dLibra. Digital Library Framework. 2025. Available online: https://www.psnc.pl/digital-libraries-dlibra-the-most-popular-in-poland/ (accessed on 25 May 2025).
  65. Wilde, E.; Pautasso, C. REST: From Research to Practice; Springer: New York, NY, USA, 2011. [Google Scholar]
  66. BioCASe. Biological Collection Access Service. 2025. Available online: http://www.biocase.org/ (accessed on 25 May 2025).
  67. GBIF Data Standards. 2025. Available online: https://www.gbif.org/standards (accessed on 25 May 2025).
  68. Saeed, H.; Shafi, I.; Ahmad, J.; Khan, A.A.; Khurshaid, T.; Ashraf, I. Review of Techniques for Integrating Security in Software Development Lifecycle. Comput. Mater. Contin. 2025, 82, 139–172. [Google Scholar] [CrossRef]
  69. Ryan Dewhurst. Static Code Analysis. 2025. Available online: https://owasp.org/www-community/controls/Static_Code_Analysis (accessed on 25 May 2025).
  70. ISO 27001 Information Security Management Systems. Available online: https://www.iso.org/standard/27001 (accessed on 25 May 2025).
  71. National Institute of Standards and Technology. Available online: https://www.nist.gov/standards (accessed on 25 May 2025).
  72. PCI DSS—Payment Card Industry Data Security Standard. Available online: https://www.pcisecuritystandards.org/document_library/ (accessed on 25 May 2025).
  73. OWASP. Open Web Application Security Project. 2025. Available online: https://owasp.org/ (accessed on 25 May 2025).
  74. EXIF. Exchangeable Image File Format. 2025. Available online: https://en.wikipedia.org/wiki/Exif (accessed on 25 May 2025).
  75. Pi, D.; Wang, J.; Li, J.; Wu, J.; Zhao, W.; Wang, Y.; Liu, J. High-security holographic display with content and copyright protection based on complex amplitude modulation. Opt. Express 2024, 32, 30555–30564. [Google Scholar] [CrossRef] [PubMed]
  76. Software Development Life Cycle—SDLC. Available online: https://www.ibm.com/think/topics/sdlc (accessed on 25 May 2025).
  77. Hardisty, A.; Roberts, D. A decadal view of biodiversity informatics: Challenges and priorities. BMC Ecol. 2013, 13, 16. [Google Scholar] [CrossRef] [PubMed]
  78. Convention on Biological Diversity. 2025. Available online: https://www.cbd.int/ (accessed on 25 May 2025).
Figure 1. Categorisation of challenges in NHC system design and development.
Figure 2. The digitisation process showing the responsibilities of the NHC and IT institutions.
Figure 3. Examples of unclear date representation scans.
Figure 4. Metadata description: standard vs. extension. The arrows show the direction of expanding the scope of the data groups.
Figure 5. View of the specimen characterisation page with edit mode turned off (a) and on (b), with the mode switch marked with a red border. (c) Shows the form view for collective editing of spreadsheet data.
Figure 6. Service modules interaction in the AMUNATCOLL system.
Figure 7. A diagram showing how to encode and decode holographic information. Data are retained even after the scan is printed. The arrows show the direction of processing and data flow.
Figure 8. Estimated distribution of efforts required to analyse and implement areas and issues of NHC systems.
Table 1. Adapting the search and browsing of information to target groups.

| Action | Species Search Language | Search | Browsing View |
|---|---|---|---|
| Scientists | Latin | Simple—a few selected fields. Extended—adding multiple fields to search conditions. Advanced—any definition of complex conditions, including logical expressions | Browse as a list, a map, and aggregated reports. Full information about specimens |
| State and local government administration | National, Latin | Search among species | Aggregated reports for selected species |
| Services and state officials | National, Latin | Mainly from protected and similar species | Graphic information is essential |
| Non-governmental organisations and society | National (e.g., English, Polish) | Simplified—one field whose value is matched to many fields from the specimen description | Browse in list, map and aggregated reports. Key information about specimens |
| Education | National, including everyday language | Simplified—mainly searching from available educational materials | Educational materials about the specimen |
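The three search modes offered to scientists differ mainly in how query conditions are combined. As a rough illustration only (not the portal's actual code; the Specimen model and its field names are hypothetical), Django Q objects can express the simple, extended, and advanced cases:

```python
from django.db.models import Q

# Hypothetical Specimen model with fields genus, country, collection_year, protection_level.

# Simple search: one selected field.
simple = Q(genus__iexact="Carex")

# Extended search: several fields combined with AND.
extended = simple & Q(country__iexact="Poland") & Q(collection_year__gte=1950)

# Advanced search: arbitrary logical expressions, including OR and NOT.
advanced = (Q(genus__iexact="Carex") | Q(genus__iexact="Juncus")) & ~Q(protection_level__gt=1)

# results = Specimen.objects.filter(advanced)   # executed inside a Django view
```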
Table 2. Examples of unclearly defined dates on herbarium sheets.

| Read Value | Comment |
|---|---|
| 19.VI | Lack of year |
| 20.7.1889, 3.X.1890 | The month given in Arabic or Roman numerals |
| 7.1876 ü 1882 | Date given (probably) without day and with double year |
| mid July (in Polish "połowa lipca") 1919 | Imprecise harvest day |
| end of May (in Polish "koniec maja") 1916 | Imprecise harvest day |
| flowers (in Polish "kwiaty") 16.04, leaves (in Polish "liście") 15.07.1893 | Providing the date (without the year), and two specimens on one sheet |
| 19?5 | One of the year digits is not legible |
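Label dates of this kind have to be normalised before they can be stored in an ISO 8601-compatible field. The following is a minimal sketch of such a normalisation step, not the AMUNATCOLL importer; the function name and the returned structure are illustrative, and unresolvable entries are deliberately left for manual curation:

```python
import re

ROMAN_MONTHS = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
                "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}

def normalise_label_date(raw: str) -> dict:
    """Return the verbatim label text plus an ISO 8601 date when one can be derived."""
    raw = raw.strip()
    # Day.Month.Year with the month in Arabic or Roman numerals, e.g. 3.X.1890 or 20.7.1889
    m = re.fullmatch(r"(\d{1,2})\.([IVX]+|\d{1,2})\.(\d{4})", raw)
    if m:
        day, month, year = m.group(1), m.group(2), m.group(3)
        month_no = ROMAN_MONTHS.get(month) if month.isalpha() else int(month)
        if month_no:
            return {"verbatim": raw, "iso": f"{year}-{month_no:02d}-{int(day):02d}"}
    # Day.Month without a year, e.g. 19.VI -- cannot be converted without more context
    if re.fullmatch(r"(\d{1,2})\.([IVX]+)", raw):
        return {"verbatim": raw, "iso": None, "note": "year missing"}
    return {"verbatim": raw, "iso": None, "note": "needs manual curation"}

print(normalise_label_date("3.X.1890"))   # {'verbatim': '3.X.1890', 'iso': '1890-10-03'}
print(normalise_label_date("19.VI"))      # flagged as missing the year
```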
Table 3. Iconographic data formats with characteristics.

| Code | Format Description | Extension | Volume | Creation Method | Typical Applications | Remarks |
|---|---|---|---|---|---|---|
| RAW | Output files obtained directly from the digitising device, in a device-specific format (e.g., recording format) | Depending on the device, e.g., .CRW or .CR2 for Canon cameras | big | Created manually or semi-automatically (depending on the equipment) in the digitisation process | long-term data archiving | The format is often manufacturer-dependent and may be closed, patented, etc., so it is not recommended as the only form of long-term archiving. Some devices (e.g., scanners) can generate TIFF files immediately, without a RAW option. |
| MASTER | Lossless transformation of a RAW file to an open format, without any other changes | most often TIFF | big | Can be automated for some RAW formats if batch processing tools exist for them; otherwise manual actions are necessary, e.g., using the manufacturer's software supplied with the camera | long-term data archiving | The basic format for long-term data archiving, preserved in parallel to the RAW format. If an error occurs during the transformation from RAW to MASTER, MASTER files can be restored from RAW. |
| MASTER CORRECTED | MASTER file subjected to necessary corrections (e.g., straightening, cropping), still saved in lossless form | most often TIFF | big | Depending on the type of corrections, this may be a manual or partially automated operation | long-term data archiving; limited sharing of sample files | The base form for generating further formats for sharing purposes. It contains the file in a usable form (after processing) in the highest possible quality. If an error occurs during the corrections, one can always go back to the clean MASTER form and repeat them. |
| PRESENTATION HI-RES | MASTER CORRECTED files converted to a format dedicated to online sharing in high resolution; low-loss compression may, but does not have to, be used | e.g., TIFF (pyramidal) or JPEG2000 | big | Can be created fully automatically from the MASTER CORRECTED file | online sharing | Sharing such files online usually requires a special protocol that allows for gradual loading of details; the standard here is the IIIF protocol. |
| PRESENTATION LOW-RES | MASTER CORRECTED files converted to a format dedicated to online sharing in medium/low resolution; there is a loss of quality | e.g., JPG, PDF, PNG | small | Can be created fully automatically from the MASTER CORRECTED file | online and offline sharing (download) | Sharing by simply displaying on web pages or downloading files from the site's pages. |
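The step from MASTER CORRECTED to PRESENTATION LOW-RES is the one most easily automated. A minimal sketch, assuming the Pillow imaging library is available (it is not part of the stack described in this paper, and the file names and size limit are illustrative):

```python
from PIL import Image  # Pillow, assumed to be installed

def make_low_res(master_corrected_tiff: str, out_jpeg: str, max_px: int = 2000) -> None:
    """Derive a PRESENTATION LOW-RES JPEG from a MASTER CORRECTED TIFF."""
    with Image.open(master_corrected_tiff) as img:
        img.thumbnail((max_px, max_px))                   # lossy downscale, keeps aspect ratio
        img.convert("RGB").save(out_jpeg, "JPEG", quality=85)

make_low_res("POZ-V-0000001_master_corrected.tif", "POZ-V-0000001_web.jpg")
```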
Table 4. Tools developed for the AMUNATCOLL project that facilitate digitised data preparation in an automated way.

| Name | Type | Description |
|---|---|---|
| Converter | console or internet application | A tool for automatic formatting of Excel files, available as a web application. It assumes that previously digitised data require adaptation to the new format. Its task is to convert input files into files compliant with the metadata specification using specific configurations (sets of rules), allowing for optional editing of input data. |
| Form | spreadsheet | Proper preparation of input data requires compliance with the metadata specification, which also allows the developed applications to automate processing. To avoid inconsistencies between data filled in (often by different people), a spreadsheet form was prepared. It contains all the columns required by the metadata specification, divided into sheets for taxa, samples, iconography, and bibliography. |
| Validator | console or internet application | A tool for validating spreadsheet files, available as a web application. Its task is to check the presence and correctness of the appropriate sheets in the file. The tool returns a report with errors and details of their occurrence for each detected sheet. |
| Aggregator | console application | A program used to combine spreadsheet files that comply with the metadata description. It is used in the digitisation process, mainly at the stage of describing records by the georeferencing or translation teams. |
| Reporter | console application | A program that summarises the number of AMUNATCOLL-compliant records in a spreadsheet. Used by project coordinators primarily to monitor progress. |
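The Validator idea reduces to checking sheets and mandatory columns against the metadata specification. The following is a minimal sketch under that assumption, not the project's actual Validator; the sheet and column names are illustrative:

```python
import pandas as pd  # assumed to be available

# Illustrative subset of mandatory columns per sheet.
REQUIRED = {
    "samples": ["Genus", "Species", "CollectionDate"],
    "taxa": ["Genus", "Species"],
}

def validate_workbook(path: str) -> list[str]:
    """Return a list of validation errors for a spreadsheet import file."""
    errors = []
    workbook = pd.read_excel(path, sheet_name=None)    # dict of DataFrames, one per sheet
    for sheet, columns in REQUIRED.items():
        if sheet not in workbook:
            errors.append(f"missing sheet: {sheet}")
            continue
        frame = workbook[sheet]
        for column in columns:
            if column not in frame.columns:
                errors.append(f"{sheet}: missing column {column}")
            elif frame[column].isna().any():
                errors.append(f"{sheet}: empty values in column {column}")
    return errors

print(validate_workbook("import_batch.xlsx"))
```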
Table 5. Metadata specification field metric.

| Metric Element | Description |
|---|---|
| Field content description | Brief information about the type of information that should be entered into the field. |
| Field format | Specifies the format of the field: integer, float, text, or date in ABCD format. |
| Allowed values | The field may contain only values from the allowed list; each list item is entered on a separate line. |
| Required field | YES indicates that the field is mandatory; NO indicates that it is optional. |
| Example values | Provides example values for the field. |
| Comments | Space for any additional information related to the field. |
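Representing each field metric in code makes it directly usable by import tools. A minimal sketch of such a representation (the class and the example field are illustrative, not the AMUNATCOLL data model):

```python
from dataclasses import dataclass, field

@dataclass
class FieldMetric:
    """One metadata field metric as described in Table 5."""
    description: str
    field_format: str                           # "Integer", "Float", "Text", "Date (ABCD)"
    required: bool
    allowed_values: list[str] = field(default_factory=list)

    def check(self, value: str | None) -> bool:
        """Accept a value if it satisfies the required/allowed-values constraints."""
        if value in (None, ""):
            return not self.required
        return not self.allowed_values or value in self.allowed_values

source = FieldMetric("Origin of the record", "Text", True, ["collection", "observation"])
print(source.check("collection"), source.check(None))   # True False
```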
Table 6. Specimen data protection levels.

| Level | Name | Description |
|---|---|---|
| 0 | specimen made public | Information about the specimen (record) is not protected; full information is available to all logged-in and non-logged-in users. |
| 1 | specimen made public with restrictions | Information about the specimen (record) is partially protected; sensitive data (e.g., geographical coordinates, habitat) are protected. |
| 2 | specimen is restricted | Information about the specimen (record) is made available only to external and internal users verified in terms of competence. |
| 3 | non-public specimen | Information about the specimen (record) is made available only to authorised internal users (e.g., selected from among the employees of the hosting institution). |
Table 7. Record field protection levels.

| Level | Description |
|---|---|
| 0 | The field is not protected; everyone can see it, even without logging in (e.g., genus, species). |
| 1 | The field is available only to logged-in users (e.g., ATPOL coordinates). |
| 2 | The field is available only to trusted collaborators (e.g., exact coordinates, exact habitat). |
| 3 | The field is available only to authorised internal users (e.g., suitability for target groups, technical information regarding file import). |
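Field-level protection of this kind can be enforced at the point where a record is serialised for a given user. The sketch below illustrates the idea only; the field names, their level assignment, and the function are hypothetical, not the AMUNATCOLL implementation:

```python
# Field protection levels in the spirit of Table 7 (assignment is illustrative).
FIELD_LEVELS = {
    "genus": 0, "species": 0,
    "atpol_square": 1,
    "latitude": 2, "longitude": 2, "habitat": 2,
    "target_groups": 3, "import_notes": 3,
}

def visible_fields(record: dict, user_level: int) -> dict:
    """Return only the fields the requesting user is authorised to see."""
    return {name: value for name, value in record.items()
            if FIELD_LEVELS.get(name, 3) <= user_level}   # unknown fields: most restrictive

record = {"genus": "Carex", "species": "limosa", "latitude": 52.4, "habitat": "peat bog"}
print(visible_fields(record, user_level=0))   # {'genus': 'Carex', 'species': 'limosa'}
print(visible_fields(record, user_level=2))   # full record
```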
Table 8. Summary of the number of digitised specimens in the natural resources of the Faculty of Biology, Adam Mickiewicz University in Poznań.

| Classification of Specimens | Number of Specimens |
|---|---|
| Total number of specimens | 2.25 million |
| Botanical collections (algae and plants) | approx. 500 thousand |
| Included nomenclatural types | approx. 350 |
| Mycological collections (fungi and lichens) | approx. 50 thousand |
| Zoological collections | approx. 1.7 million |
| Included nomenclatural types | over 1000 |
Table 9. Calculation of the required disk space for the AMUNATCOLL project.

| Operation Name or Location | Factor | Regular (TB) | Economical (TB) |
|---|---|---|---|
| Source material | | 90 | 90 |
| Processed and protected material | 1.6 | 144 | 144 |
| RAID 6 | 0.2 | 46.8 | 46.8 |
| Conversion to ISO | 0.12 | 28.08 | 28.08 |
| Total in one location | | 308.88 | 308.88 |
| Geographic copy | | 308.88 | 308.88 |
| Local copy | | 308.88 | 118.8 |
| Buffer for the digitisation process | | 120 | 120 |
| Total | | 1046.64 | 856.56 |
| Conversion rate | | ~11.63 | ~9.52 |
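The figures in Table 9 follow directly from the base volume of the source material and the listed multipliers. A short script reproducing the "regular" column makes the arithmetic explicit (values as in the table, in TB):

```python
# Reproduces the "regular" column of Table 9 (all values in TB).
source = 90.0
processed = 1.6 * source                      # 144.0
raid6_overhead = 0.2 * (source + processed)   # 46.8
iso_conversion = 0.12 * (source + processed)  # 28.08

one_location = source + processed + raid6_overhead + iso_conversion   # 308.88
# one location + geographic copy + local copy, plus the digitisation buffer:
total = 3 * one_location + 120.0

print(round(one_location, 2), round(total, 2), round(total / source, 2))
# 308.88 1046.64 11.63
```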
Table 10. Mapping of ABCD and AMUNATCOLL for the BioCASE protocol.

Dataset-Related Fields

| ABCD Property Name | Link to Specification | AMUNATCOLL Property Name/Comment |
|---|---|---|
| /DataSets/DataSet/ContentContacts/ContentContact/Name | https://terms.tdwg.org/wiki/abcd2:ContentContact-Name (accessed on 26 May 2025) | Information generated automatically during export, depending on the custodian of a given collection |
| TechnicalContact/Name | https://terms.tdwg.org/wiki/abcd2:TechnicalContact-Name (accessed on 26 May 2025) | Information generated automatically during export |
| /DataSets/DataSet/ContentContacts/ContentContact/Organization/Name/Representation/@language | https://terms.tdwg.org/wiki/abcd2:DataSet-Representation-@language (accessed on 26 May 2025) | Information generated automatically during export |
| /DataSets/DataSet/Metadata/Description/Representation/Title | https://terms.tdwg.org/wiki/abcd2:DataSet-Title (accessed on 26 May 2025) | Information generated automatically during export |
| /DataSets/DataSet/Metadata/RevisionData/DateModified | https://terms.tdwg.org/wiki/abcd2:DataSet-DateModified (accessed on 26 May 2025) | Date of last import; date generated automatically by the system during export |

Specimen Fields

| ABCD Property Name | Link to Specification | AMUNATCOLL Property Name/Comment |
|---|---|---|
| /DataSets/DataSet/Units/Unit/SourceInstitutionID | https://terms.tdwg.org/wiki/abcd2:SourceInstitutionID (accessed on 26 May 2025) | Institution |
| /DataSets/DataSet/Units/Unit/SourceID | https://terms.tdwg.org/wiki/abcd2:SourceID (accessed on 26 May 2025) | Botany/Zoology |
| /DataSets/DataSet/Units/Unit/UnitID | https://terms.tdwg.org/wiki/abcd2:UnitID (accessed on 26 May 2025) | Collection/Specimen number |
| /DataSets/DataSet/Units/Unit/RecordBasis | https://terms.tdwg.org/wiki/abcd2:RecordBasis (accessed on 26 May 2025) | Source |
| /DataSets/DataSet/Units/Unit/SpecimenUnit/NomenclaturalTypeDesignations/NomenclaturalTypeDesignation/TypifiedName/NameAtomised/Botanical/GenusOrMonomial | https://terms.tdwg.org/wiki/abcd2:TaxonIdentified-Botanical-GenusOrMonomial (accessed on 26 May 2025) | Genus |
| /DataSets/DataSet/Units/Unit/Identifications/Identification/Result/TaxonIdentified/ScientificName/NameAtomised/Zoological/GenusOrMonomial | https://terms.tdwg.org/wiki/abcd2:TaxonIdentified-Zoological-GenusOrMonomial (accessed on 26 May 2025) | Genus |
| /DataSets/DataSet/Units/Unit/Identifications/Identification/Result/TaxonIdentified/ScientificName/NameAtomised/Botanical/FirstEpithet | https://terms.tdwg.org/wiki/abcd2:TaxonIdentified-FirstEpithet (accessed on 26 May 2025) | Species |
| /DataSets/DataSet/Units/Unit/Identifications/Identification/Result/TaxonIdentified/ScientificName/NameAtomised/Zoological/SpeciesEpithet | https://terms.tdwg.org/wiki/abcd2:TaxonIdentified-Zoological-SpeciesEpithet (accessed on 26 May 2025) | Species |
| /DataSets/DataSet/Units/Unit/Gathering/Agents/GatheringAgent/Person/FullName | https://terms.tdwg.org/wiki/abcd2:GatheringAgent-FullName (accessed on 26 May 2025) | Author of the collection |
| /DataSets/DataSet/Units/Unit/Identifications/Identification/Identifiers/Identifier/PersonName/FullName | https://terms.tdwg.org/wiki/abcd2:Identifier-FullName (accessed on 26 May 2025) | Author of designation |
| /DataSets/DataSet/Units/Unit/Gathering/DateTime/ISODateTimeBegin | https://terms.tdwg.org/wiki/abcd2:Gathering-DateTime-ISODateTimeBegin (accessed on 26 May 2025) | Date of specimen/sample collection |
| /DataSets/DataSet/Units/Unit/SpecimenUnit/Preparations/Preparation/PreparationType | https://terms.tdwg.org/wiki/abcd2:SpecimenUnit-PreparationType (accessed on 26 May 2025) | Storage method |
| /DataSets/DataSet/Units/Unit/Gathering/SiteCoordinateSets/SiteCoordinates/CoordinatesLatLong/LatitudeDecimal | https://terms.tdwg.org/wiki/abcd2:Gathering-LatitudeDecimal (accessed on 26 May 2025) | Latitude |
| /DataSets/DataSet/Units/Unit/Gathering/SiteCoordinateSets/SiteCoordinates/CoordinatesLatLong/LongitudeDecimal | https://terms.tdwg.org/wiki/abcd2:Gathering-LongitudeDecimal (accessed on 26 May 2025) | Longitude |
| /DataSets/DataSet/Units/Unit/Identifications/Identification/Result/TaxonIdentified/ScientificName/FullScientificNameString | https://terms.tdwg.org/wiki/abcd2:TaxonIdentified-FullScientificNameString (accessed on 26 May 2025) | Required by Darwin Core. Merging of several fields: Genus + Species + SpeciesAuthor + YearOfCollection |
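To illustrate how such a mapping is applied, the sketch below serialises one record into a handful of the ABCD unit fields listed above using Python's standard xml.etree.ElementTree. It is an illustration only: the record dictionary is hypothetical, ABCD namespaces are omitted, the element nesting from the table is abbreviated, and the actual export is performed by the BioCASE provider software:

```python
import xml.etree.ElementTree as ET

record = {  # illustrative AMUNATCOLL-side record
    "institution": "FBAMU", "collection_number": "POZ-V-0000001",
    "genus": "Carex", "species": "limosa",
    "latitude": 52.4000, "longitude": 16.9000, "collection_date": "1890-10-03",
}

unit = ET.Element("Unit")
ET.SubElement(unit, "SourceInstitutionID").text = record["institution"]
ET.SubElement(unit, "UnitID").text = record["collection_number"]

gathering = ET.SubElement(unit, "Gathering")
ET.SubElement(gathering, "ISODateTimeBegin").text = record["collection_date"]
coords = ET.SubElement(gathering, "CoordinatesLatLong")
ET.SubElement(coords, "LatitudeDecimal").text = str(record["latitude"])
ET.SubElement(coords, "LongitudeDecimal").text = str(record["longitude"])

ET.SubElement(unit, "FullScientificNameString").text = (
    f'{record["genus"]} {record["species"]}')

print(ET.tostring(unit, encoding="unicode"))
```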
Table 11. List of the most common web application security risks according to OWASP.

| Name | Description |
|---|---|
| Broken Access Control | Access controls implement policies designed to restrict users from operating beyond their designated permissions. When these controls fail, they often lead to unauthorised information disclosure, data alteration or destruction, or the execution of business functions that exceed the user's authorised limits. |
| Cryptographic Failures | Emphasises the importance of safeguarding data both during transmission and while stored. Sensitive information, including passwords, credit card details, personal data, and proprietary information, necessitates enhanced security measures, particularly when such data falls under privacy laws like the EU General Data Protection Regulation (GDPR) or financial data protection standards such as the PCI Data Security Standard (PCI DSS). |
| Injection | This vulnerability arises from the absence of verification of user-provided data, specifically the lack of filtering or sanitisation. It pertains to non-parameterised, context-sensitive calls that are executed directly by the interpreter, and encompasses the use of malicious data in SQL queries, dynamic queries, commands, or stored procedures. |
| Insecure Design | A broad category representing various weaknesses, expressed as "missing or ineffective control design". Design flaws and implementation flaws must be distinguished because they have different root causes and remedies. A secure design may still have implementation flaws that lead to exploitable vulnerabilities, whereas an insecure design cannot be fixed by a perfect implementation, because by definition the necessary security controls were never designed to defend against the specific attacks. |
| Security Misconfiguration | The application stack may be susceptible to attacks due to insufficient hardening or improper configuration of various services, such as enabling unnecessary ports, services, pages, accounts, or permissions. Error handling mechanisms may disclose excessive information, including stack traces or other overly detailed error messages. The most recent security features may be disabled or incorrectly configured, and the server might fail to transmit security headers or directives or may not be configured with secure values. |
| Vulnerable and Outdated Components | The software may be outdated or contain vulnerabilities. The scope encompasses the operating system, web or application server, database management system, applications, APIs, and all associated components, runtimes, and libraries. The underlying platform must be consistently patched and updated, and software developers must verify the compatibility of any updated, enhanced, or patched libraries. |
| Identification and Authentication Failures | User identity confirmation, authentication, and session management are key to protecting against authentication attacks. Vulnerabilities frequently arise from automated attacks in which the perpetrator possesses a compilation of legitimate usernames and passwords, from inadequate or ineffective credential recovery methods, and from the transmission of passwords in plaintext or their storage with weak hashing. The absence or ineffectiveness of multi-factor authentication further exacerbates these risks, as does improper invalidation of session identifiers or their exposure within URLs. |
| Software and Data Integrity Failures | Software and data integrity failures pertain to code and infrastructure that do not adequately safeguard against breaches of integrity. An application must not depend on plugins, libraries, or modules sourced from unverified origins. A frequent exploitation path is the auto-update feature, which allows updates to be downloaded and applied to a previously trusted application without adequate integrity checks. |
| Security Logging and Monitoring Failures | Logging and monitoring of service activity help detect, escalate, and respond to active breaches. Events such as logins, failed logins, and high-value transactions should be audited. Warnings and errors should generate clear log messages, which should themselves be monitored for suspicious activity. Alert thresholds and response escalation processes should be defined at the appropriate level, and information about exceeding them should be provided in real or near-real time. |
| Server-Side Request Forgery (SSRF) | An SSRF vulnerability arises when a web application retrieves a remote resource without validating the URL supplied by the user. This allows an attacker to manipulate the server-side application into directing the request to an unintended destination. Such vulnerabilities can affect services internal to the organisation's infrastructure as well as external systems, and may lead to the exposure of sensitive information, including authorisation credentials. |
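As a concrete illustration of the Injection item, the sketch below contrasts an exploitable string-concatenated SQL statement with a parameterised one using the psycopg2 driver (the connection string, table, and column names are illustrative, not taken from the AMUNATCOLL schema):

```python
import psycopg2

conn = psycopg2.connect("dbname=nhc user=portal")   # illustrative connection string
cur = conn.cursor()
user_input = "Carex'; DROP TABLE specimens; --"

# Vulnerable: user input concatenated straight into the SQL text.
# cur.execute("SELECT * FROM specimens WHERE genus = '" + user_input + "'")

# Safe: the driver passes the value as a bound parameter, never as SQL code.
cur.execute("SELECT * FROM specimens WHERE genus = %s", (user_input,))
print(cur.fetchall())
```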
Table 12. EXIF standard fields used to indicate copyright, along with examples.

| Tag | Exemplary Value |
|---|---|
| ID | POZ-V-0000001 |
| Image description | The image depicts a scan from the Natural History Collections of Housing Institution Name. |
| Copyright | Copyright © 2025 Housing Institution Name, City. All rights reserved. |
| Copyright note | This image or any part of it cannot be reproduced without the prior written permission of Housing Institution Name, City. |
| Additional information | Deleting or changing the image metadata is strictly prohibited. For more information on restrictions on the use of photos, please visit: https://www.domain.com/ |
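A minimal sketch of embedding such copyright metadata into a shared JPEG, assuming the third-party piexif library is available (it is not part of the stack described in this paper). Only the standard EXIF ImageDescription and Copyright tags are written here; project-specific notes such as the "Copyright note" or "Additional information" fields would need a field like UserComment or external XMP metadata:

```python
import piexif  # third-party library, assumed to be installed

IMAGE = "POZ-V-0000001_web.jpg"   # illustrative file name

exif_dict = piexif.load(IMAGE)
exif_dict["0th"][piexif.ImageIFD.ImageDescription] = (
    "The image depicts a scan from the Natural History Collections of "
    "Housing Institution Name.").encode("utf-8")
exif_dict["0th"][piexif.ImageIFD.Copyright] = (
    "Copyright (c) 2025 Housing Institution Name, City. All rights reserved."
).encode("utf-8")

piexif.insert(piexif.dump(exif_dict), IMAGE)   # writes the tags back into the file
```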
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
