1. Good and Bad Practices in Data Management Projects
This paper introduces a flexible archival and data management model that integrates recent developments across data-grid, cloud, edge, and fog computing technologies. Designed to meet the requirements of large-scale astronomical projects, the model emphasizes resilience, performance, and sustainability while avoiding typical single points of failure (SPOFs), which often arise from short-sighted political management decisions and suboptimal In-Kind Contribution (IKC) allocations.
A recurring issue in large scientific collaborations is the allocation of leadership roles based not on technical expertise, but on political convenience or financial leverage. This leads to fragmented and inefficient work organization, particularly in core areas such as data handling and archiving. It is common to see simple tasks unnecessarily divided among multiple groups, each with distinct visions and leadership, making coordination and integration difficult. In response, project leaders often “descope” activities, reducing group autonomy in favor of hierarchical control. While this may streamline decision-making, it suppresses innovation and undermines project agility.
A particularly harmful trend is the political fragmentation of archive design, where medium- to long-term data management is split across loosely defined entities without real architectural boundaries. Such divisions introduce complexity and delay, especially when multiple groups interact with a shared infrastructure. Leadership may be assigned to individuals with little or no technical background, and the final decision-making authority may reside with administrative bodies rather than developers. This practice results in systems driven by political compromise rather than technological soundness.
Effective archive development must begin with robust planning. As outlined in
Figure 1, project management strategies should reflect the project’s timeline and goals. For short-term implementations, use-case generalization and rapid prototyping are essential to test technological feasibility. For long-term projects, more detailed planning, including thorough documentation of use cases, requirements, and interfaces, should be established early on. However, premature commitment to specific technologies should be avoided, as rapid technological evolution can render early choices obsolete.
The system design phase consolidates all use cases and validated requirements into an integrated solution based on proven technologies. This is followed by code development, pre-production testing, and final deployment. A major constraint, particularly in scientific archiving, is budgetary: long-term maintenance costs are often underestimated or ignored. As a result, hardware acquisition frequently follows funding availability rather than design logic. To overcome this, a virtualized, service-based model is adopted, allowing for the decoupling of hardware from software layers.
This approach enables the implementation of Archive as a Service (AaaS), which builds upon the established paradigms of Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS), as illustrated in
Figure 2.
The archive model distinguishes two main user roles:
Data Producers, who supply content at various levels, and
Data Consumers, who access and possibly process the data (
Figure 3). While users may act in both roles, each must interact with the system through standardized, role-specific interfaces.
A good approach in data archival and handling systems is to adopt the Open Archival Information System (OAIS) reference model [
1], which logically separates user interaction from core system operations. The core functions of the archive—Ingest, Search/Browse, and Retrieve/Distribute—are built upon two foundational components: a repository and a database. The choice of technologies in these areas is dictated by the archive’s system topology and performance goals.
The key challenge in developing an effective data handling and archival system is to prioritize the intended use cases over political and personal considerations before implementing any technological solutions. Specifically, once the most common challenges are abstracted, as outlined in the OAIS model, it becomes possible to identify the most widely adopted technologies and approaches. This helps in evaluating which are suitable for our needs and which are not. By doing so, the resulting system will become an intelligent and adaptive framework that facilitates both efficient data access and its medium- to long-term preservation.
In the following sections, we will explore archive topologies, database selection, and data access interfaces as critical factors in designing scalable, user-centered archive and data handling systems. Finally, the versatile and efficient architecture of CTAARCHS will be introduced.
2. Storage Topology: Centralized vs. Distributed Approaches
One of the key design challenges in developing an astronomical archive and data handling system is understanding the data topology, starting from the storage architecture and the related storage service deployments. The decision between a centralized and a distributed model depends on the archive’s use cases, particularly when access is required across geographically distributed locations. There is no universal solution: each approach has strengths and limitations based on scalability, resilience, access latency, cost, and administrative complexity [
2].
As summarized in
Table 1, centralized architectures offer simplicity, streamlined security, and ease of management, making them suitable for small-scale or local deployments. However, they pose greater risks of failure and limited scalability. In contrast, distributed architectures support high availability (HA), redundancy, and better performance across dispersed users, though they require more sophisticated orchestration and monitoring.
The choice ultimately depends on system scale, geographic distribution, and acceptable complexity. The model presented here allows flexible configuration—from a single-node centralized instance to a distributed system with multiple nodes in a “leader + followers” configuration (i.e., primary + secondaries) or fully redundant HA configurations, ensuring no single point of failure (no SPOF) [
3].
Before discussing the details of our work, it is useful to give a general overview of the specific aspects addressed by different approaches and computing paradigms; this basic overview helps us design an end-to-end solution adaptable to various projects, each using different methodologies and paradigms, while ensuring compatibility with the tools and components required by experiments currently in production [
4].
Historically,
data-grid computing (see references) was the dominant model in research environments, where computing and storage were distributed across tiered datacenters connected by middleware for data orchestration. While effective in some contexts, its hierarchical structure limited scalability and flexibility. Over the past decades, the data-grid model has largely been complemented, and sometimes replaced, by
cloud computing, which enables horizontal scaling, service-based architecture, and global accessibility. Cloud systems offer improved resource outsourcing, built-in redundancy, and disaster recovery, making them better suited to handle complex, large-scale datasets with minimal management overhead [
5].
However, widespread data sharing via cloud platforms raises serious security and privacy concerns [
6], making robust access control and encryption critical challenges [
7].
More recently, computing paradigms have been supplemented by
edge computing, where data processing occurs closer to data sources—often at the sensor or device level. This reduces network congestion, minimizes latency, and enables real-time applications [
8]. Edge computing is particularly valuable in time-sensitive use cases, where immediate processing and decision-making are required. Enhancing this model with
edge intelligence—that is, applying AI and machine learning algorithms locally—enables automated decisions based on complex, use-case-specific criteria. This adds significant value where human intervention must be minimized [
9].
At a broader level, this leads to fog computing, a form of fine-grained distributed processing that extends computing and storage further toward the network edge. By integrating IoT devices and localized data sources, fog architectures process large volumes of unstructured data near their origin, which is essential for real-time analytics [
10].
Even though much of scientific computing has historically and technically been unable to embrace the edge (or fog) paradigm, since its data cannot be analyzed at the edge, there is no doubt that these solutions represent the near future. Given the limited computational capacity typical at the edge, adaptive AI algorithms play a critical role in optimizing performance. These systems can identify semantic patterns, adapt compression techniques, and reduce computational loads, enabling efficient data analysis and visualization. The use of optimized low-latency databases becomes essential in transforming raw data into science-ready outputs quickly and interactively.
Note: Although these models raise legitimate concerns about environmental impact—particularly related to the power demands of AI training and edge infrastructure—this paper does not address sustainability [11]. It is misleading to discuss energy use without a comprehensive life-cycle analysis of the hardware and algorithms involved. The sustainability of AI and edge computing should not be reduced to superficial claims but rather evaluated within a systemic framework, which is beyond the scope of this discussion [12].
3. Selecting the Appropriate Database Architecture for Archival Systems
The database lies at the heart of any archive and data management system, making its selection a critical component of the overall design. However, there is no universally optimal solution: the appropriate database choice depends on multiple factors, including the storage use case, system topology, data access patterns, geographical distribution of users, and desired service deployments.
In distributed storage environments, relying on a centralized database for file cataloging introduces significant risks. It creates a single point of failure (SPOF) and becomes a performance bottleneck under concurrent, geographically dispersed queries. This undermines the redundancy and resilience typically sought in distributed systems.
Conversely, centralized database architectures are well-suited for smaller or geographically constrained archives, where high availability can be ensured through network and service redundancy. These systems benefit from ACID-compliant transactions—atomicity, consistency, isolation, and durability—which are essential in contexts requiring strong data integrity, such as financial systems.
However, distributed databases cannot fully guarantee ACID properties and instead operate under the
CAP Theorem (Brewer’s Theorem), which states that a distributed system can only simultaneously satisfy two of the following: consistency, availability, and partition tolerance. Trade-offs among these properties must be carefully evaluated depending on the archive’s performance and reliability needs (see
Table 2 and
Figure 4).
In summary, the choice between centralized and distributed database architectures must align with the system’s scale, access requirements, and fault-tolerance goals. The database model must not only support efficient data access but also integrate seamlessly into the broader storage and computing infrastructure.
In distributed databases, data are replicated across multiple nodes. When network partitions occur—isolating one or more nodes—the system must prioritize among consistency, availability, and partition tolerance (the CAP Theorem).
Prioritizing consistency may halt reads or writes to prevent divergence, sacrificing availability.
Prioritizing availability ensures responsiveness but may serve outdated or inconsistent data.
Prioritizing partition tolerance allows continued operation despite communication failures, though it may compromise either consistency or availability.
Many systems dynamically balance these trade-offs based on application needs. For archival systems, using pre-assigned physical file names and a Write Once, Read Many (WORM) model minimizes consistency concerns. Once written, immutable data simplifies coherence across nodes. This permits a focus on availability and partition tolerance (AP), ensuring the system remains operational and responsive—even if some nodes are unreachable.
Partition tolerance is often the most critical factor in large-scale or globally distributed environments, as network disruptions are inevitable. Ensuring only a single version of any file exists and is replicated guarantees that if a file is accessible, it is valid and consistent system-wide.
Another key factor in choosing a database system is balancing data scalability with the complexity of the data model and queries. As illustrated in
Figure 5, certain database families are inherently unsuited to large-scale data. For instance, relational databases (SQL), while efficient for smaller datasets and simpler queries, struggle when dealing with high-complexity joins or terabyte-scale tables. At this point, only three options remain:
Simplify the data model or queries.
Scale up the hardware infrastructure.
Migrate to a different database family—such as a document-oriented (NoSQL) system.
In practice, restructuring or hardware upgrades often cause service interruptions, particularly when the database was not properly designed from the outset. This underscores the importance of selecting the appropriate architecture early in the project lifecycle.
Databases can broadly be categorized into two groups: Relational DBMS (RDBMS) and Not Only SQL (NoSQL) systems [
13]. A comparative summary is provided in
Table 3.
4. Polyglot Persistence in Modern Archive Systems
For this archival model, we focus on the versatility, schema-less nature, and aggregation capabilities of document-oriented databases. Their architecture supports scalability through replication, sharding, and clustering, depending on performance demands and availability requirements. Strategies for scaling read/write capacity and ensuring high availability are summarized in
Table 4.
If data size exceeds single-server capacity, two strategies are available: scaling up infrastructure or scaling out via clustering. Similarly, read performance can be improved through replication and caching, while write scalability benefits from partitioning and sharding. To mitigate SPOFs and ensure service resilience, especially in geographically distributed collaborations, combining clustering with cross-site replication is essential. Inter-datacenter distances of several hundred kilometers are generally sufficient to safeguard against regional failures and enable disaster recovery.
A key principle here is polyglot persistence, which leverages multiple database types, each tailored to a specific data class:
Relational databases (e.g., PostgreSQL, MariaDB) for structured data like observation proposals.
Document-oriented databases (e.g., MongoDB) for semi-structured metadata.
Column stores (e.g., Cassandra) for streaming telemetry.
Key–value stores (e.g., Voldemort) for fast-access logs.
Graph databases (e.g., Neo4j, Cosmos DB) for user interaction mapping.
Array or functional query languages for analytical pipelines.
This modular approach allows independent scaling of archive components and optimization of performance and cost. The main drawback lies in the complexity of managing diverse technologies and the associated manpower and training costs.
5. Polyglot Persistence in a Data Lake Scenario
In modern observatories, archives manage more than just raw scientific data. A data lake approach is adopted to incorporate a wide range of heterogeneous data products—proposals, schedules, weather station outputs, logs, alarms, analytics, and system monitoring.
Different database systems are better suited for handling different types:
Relational databases for structured data.
Object storage for unstructured or large datasets (e.g., images, videos, documents).
NoSQL databases for semi-structured data that do not fit into a rigid schema.
Graph databases for analyzing complex relationships and social semantic analytics.
Polyglot persistence ensures that each data type is managed by the most appropriate database and storage technology, enabling long-term flexibility and integration across services.
Figure 6 shows a generic case study of the archives commonly managed within an astronomical observatory facility.
Polyglot persistence relies on a unified access layer—a middleware abstraction that enables seamless querying, handling, and processing of heterogeneous datasets across diverse storage backends. This layer simplifies interaction with various database systems and protocols within a distributed archive.
Different data types are best served by specialized database technologies, such as the following:
Structured proposal data can be easily managed by a relational DBMS (e.g., MariaDB, PostgreSQL).
Logs and alarms require high throughput, so a key–value store (e.g., Voldemort) fits well.
JSON-based scientific metadata can rely on document-oriented DBs (e.g., MongoDB).
Streaming telemetry and event data may need a column-family database (e.g., Cassandra).
Tracking accesses and user interactions can be managed by graph databases (e.g., Neo4j, Azure Cosmos DB).
Data analytics/pipelines can easily be served by array or functional query systems.
By matching each data type to the most suitable database family, this model enables
independent scaling of archive components and
optimized performance. Object storage handles large unstructured datasets efficiently, while NoSQL systems provide high responsiveness for semi-structured content. However, this flexibility comes at the cost of increased operational complexity and a steep learning curve for the different technologies.
Figure 7 provides a high-level description of how the different database technologies, used for the various types of data within a common observatory infrastructure, can be linked and aggregated into a unified service. The core aggregator is a document-oriented distributed DB (RethinkDB in the figure). A common data lake scenario can then be implemented once all these types of data are aggregated into a single cluster: the remaining work is a matter of implementing a suitable “extractor service” for what is needed.
Extending this model, a
multi-observatory abstraction layer can integrate science-ready data products from multiple facilities into a unified archive, enabling
MOLAP-based multiwavelength research with consistent access to distributed, heterogeneous datasets, optimized and standardized by Virtual Observatory standards (see
Figure 7).
6. Distributed Strategy for a Petascale Astronomical Observatory
Consider a distributed observatory composed of mountaintop telescope arrays, multiple observing sites, and geographically dispersed data centers. Managing tens of petabytes of data annually and enabling broad scientific access—potentially to proprietary datasets—requires an archive system that is scalable, efficient, and responsive (see
Figure 8).
A data cloud paradigm must be adapted to a research context rather than a commercial one. In an astronomical observatory, the only form of “payment” is through the submission of an observing proposal. If the proposal is highly rated by the Time Allocation Committee, it is prioritized in the observing schedule. This means that the resulting data from the observatory is directly linked to the specific proposal, and the associated scientific data (i.e., high-level data produced after standard reduction) must be accessible only to the principal investigator (PI) and the co-investigators (co-PIs) of that proposal.
This PI-linked data archival can be managed quite simply if the database that provides access to the data also stores the proposal metadata at the time of acquisition and carries it through the entire data reduction pipeline. The size of the files, whether large or small, has no commercial value: only the observational data collected during the allocated time are relevant, and they must be reduced at different levels to reach science readiness before being provided to the PI according to the observatory policies.
As widely described, database technology is central to any archive solution. In such a distributed scenario, where data are generated on-site and transferred to off-site facilities for long-term storage and processing, the database architecture must mirror the data’s geographical organization. A well-designed geographical topology reduces latency between clients and databases. Geographical proximity helps ensure fast response times, which is particularly important for real-time applications or those with large traffic volumes (see references). To achieve such a versatile geographical distribution, with its flexibility and scalability characteristics, a document-oriented, schema-less database is optimal.
The key point in the database selection is the need to represent the intelligent part of the data handling system, at least ensuring the automation of standard actions to be triggered on the client side. To perform such an automation, it is possible to implement a standard “polling” mechanism or a “change-feed” mechanism.
Several open-source document-oriented databases were evaluated for this. While MongoDB and Couchbase were considered, RethinkDB was selected due to its native “change-feed” mechanism, which enables real-time triggers for any database event. This functionality supports near-automated archival operations, reducing human intervention and eliminating the need for resource-heavy polling systems (
Figure 9). Among the alternatives considered, only Azure Cosmos DB offered comparable change-feed support, but RethinkDB provided a more lightweight, open-source option with low complexity and ease of deployment.
In particular, the change-feed mechanism allows any client to perform an operation when triggered by the database: an action can be executed each time there is a change in the result of a predefined query (e.g., run a standard reduction on the client node when new data are present in the DB collection and the related calibrations are available). The alternative triggering mechanism is the standard “polling strategy”, which executes the same query repeatedly, compares the result with that of the previous query, and triggers an action if there are differences (then sleeps and repeats).
Other available database solutions, including proprietary relational ones, do not integrate change-feed functionality; to obtain similar behavior, a standard “polling” mechanism must be put in place (see
Figure 9), which consumes considerable resources and performs many unnecessary queries with the consequent I/O traffic. The change-feed should therefore be preferred over polling for real-time applications.
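To make the difference concrete, the following is a minimal sketch of both triggering strategies using the RethinkDB Python driver; the table and field names (e.g., DLFITS, status) are illustrative assumptions and do not reflect the actual CTAARCHS schema.

# Minimal sketch: change-feed vs. polling triggers (RethinkDB Python driver).
# Table/field names (DLFITS, status) are illustrative, not the real schema.
from rethinkdb import RethinkDB
import time

r = RethinkDB()
conn = r.connect(host="localhost", port=28015, db="archive")

def on_new_product(doc):
    print("Would launch standard reduction for", doc.get("filename"))

# Change-feed: the database pushes every matching change to the client.
def changefeed_trigger():
    feed = r.table("DLFITS").filter({"status": "ingested"}).changes().run(conn)
    for change in feed:                # blocks until the DB signals a change
        if change["new_val"] is not None:
            on_new_product(change["new_val"])

# Polling: the client repeats the same query and diffs the results.
def polling_trigger(interval_s=60):
    seen = set()
    while True:
        rows = list(r.table("DLFITS").filter({"status": "ingested"}).run(conn))
        for doc in rows:
            if doc["id"] not in seen:  # difference w.r.t. the previous query
                seen.add(doc["id"])
                on_new_product(doc)
        time.sleep(interval_s)         # sleep + redo: wasted queries and I/O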
In principle, when an HA database is distributed and follows the storage distribution, the data management and archival system can be easily customized and adapted to every use case. For this reason a recommended configuration for DB distribution involves deploying at least two RethinkDB instances per data center, ensuring local availability, distributed processing, and high resilience (
Figure 10). This configuration amounts to edge computing applied to database operations: database nodes follow the physical geographical distribution of the data, allowing change-feed executions on a single node/datacenter even in the case of network failure, with the output results synchronized within the cluster as soon as the network connection becomes available again.
7. FAIR Principles and VO Integration in Polyglot Persistence
In modern polyglot persistence/data lake environments handling heterogeneous data types, the
FAIR principles—Findable, Accessible, Interoperable, and Reusable—serve as foundational guidelines for enabling data discoverability and reuse. These principles, combined with the
Open Archives Initiative (OAI), support metadata standardization and cross-repository interoperability [
14].
To ensure scientific data are interoperable and accessible at the final stage, adherence to
Virtual Observatory (VO) standards is essential. These standards, defined by the International
Virtual Observatory Alliance (IVOA), require metadata to be exposed via TAP services and formatted as VO-Tables. This enables seamless integration with VO tools for accessing and analyzing high-level science products such as multi-wavelength catalogs, spectra, and images. Execution workflows are brokered via standardized APIs (e.g., OpenAPI, REST) and submitted to local resource managers such as Slurm, as shown in
Figure 11.
Note: In this paper, depending on the context, we use the VO notation both for the Virtual Observatory (for public data access) and the Virtual Organization (for managing access rights and group policies).
8. CTAARCHS Implementation
8.1. Modular Design and Data Transfer Workflow
CTAARCHS provides flexible access to its archive functionalities through the following modular access interfaces:
Command-Line Interface (CLI): executable Python (up to v.3.10) scripts with standardized input/output.
Python Library: core actions encapsulated in run_action() functions, enabling seamless integration into external applications.
REST API: web-based access via HTTP methods (POST, GET, PUT/PATCH, DELETE), allowing CRUD operations through scripts or clients (e.g., CURL, Requests).
Containerized Deployment: distributed as a software container (AMASLIB_IO) to ensure platform compatibility and ease of deployment in Kubernetes (K8s) environments.
8.2. On-Site–Off-Site Data Transfer System
In typical observatory setups, raw data are generated on-site and archived off-site. To facilitate this, CTAARCHS implements a dedicated Data Transfer System (DTS) with optimized bandwidth, error handling, and transfer resumption via client-server architecture and different communication protocols (RPC, gRPC, HTTP/REST, message–queue, etc.).
The on-site storage is treated as a passive element, exposed only to authorized services via secure authentication protocols. This avoids performance bottlenecks and long-term maintenance overhead. Data management and archiving responsibilities reside with off-site data centers, integrated into a broader grid/cloud/edge/fog infrastructure, each with its own virtual organization (VO) (see
Figure 12 for architecture).
8.2.1. Prerequisites
To enable automated data transfer from observatory sites to archival facilities, the following prerequisites must be met:
- A.
Remote Access to On-Site Storage: On-site storage must be remotely accessible via secure, standardized protocols (e.g., HTTPS or XRootD), with appropriate ports opened between datacenters. This can be achieved through object storage systems or secure web-accessible file directories.
- B.
File Monitoring and Triggering: On-site storage must monitor a designated _new_data/ directory to detect new files and trigger transfer actions. A lightweight Python watchdog script can monitor for symbolic links—created upon file completion—and initiate transfer, then remove or relocate the link upon success.
- C.
Off-Site Download Mechanism: Off-site datacenters must run an RPC service hosting the Aria2c downloader. Aria2c supports high-throughput parallel downloads, chunking, resume capability, and integrity verification via checksums. A web UI provides real-time monitoring and automatic retries.
Note: While tools such as GridFTP and GFAL2 remain valid alternatives, the use of CERN FTS is generally discouraged due to its complexity, its reliance on site-specific RSE configuration, and its historically high failure rates. For example, in Tier-0 transfers of LHC experiments using FTS 2.0, failure rates ranged from 5% to 15%, averaging around 10% for simulation chains—primarily due to network timeouts (see ATLAS Tier-0 exercise 2019). FTS 3.0 improved the situation by introducing better monitoring and enhanced retry mechanisms, leading to reduced failure rates.
For this use case, Aria2c offers better control, higher reliability, and simpler integration. In our tests, Aria2c achieved a success rate between 99% and 100%, depending on the number of retry attempts and network availability. The tests were conducted using bulk transfers of real scientific data collected over several years from different projects, as well as simulation data generated by various algorithms. The current dataset consists of approximately 1.2 million files (ranging from 1 to 2.5 GB each), totaling around 2.2 PB of data.
8.2.2. Typical Workflow
- (1)
Data Generation: Telescope systems write data to local storage; upon completion, a symbolic link is placed in _totransfer/.
- (2)
Trigger Detection: A local Python client monitors the directory and detects new links.
- (3)
Transfer Initialization:
- (a)
The symbolic link is resolved to a URI.
- (b)
The target off-site datacenter is selected based on policy rules (e.g., time-based, data level, or project ID).
- (c)
The client invokes a command to the off-site Aria2c service, initiating parallel downloads.
- (d)
Transfer progress is tracked, and completion is confirmed via status queries.
- (e)
Upon success, the symbolic link is removed.
- (4)
Post-Transfer Actions: Additional use cases, such as replication or data ingestion, can be triggered automatically on the off-site side.
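A minimal sketch of steps (1)–(3) above is shown below, combining a watchdog-based trigger with a call to the off-site Aria2c JSON-RPC interface; the hostnames, paths, and RPC secret are placeholders, and the production DTS client adds policy-based datacenter selection and robust completion tracking.

# Sketch of the on-site trigger: watch _totransfer/ for new symbolic links and
# hand each resolved URI to the off-site Aria2c daemon via JSON-RPC.
# Hostnames, paths and the RPC secret below are placeholders.
import os
import requests
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

ARIA2_RPC = "https://offsite-dc.example.org:6800/jsonrpc"   # off-site Aria2c endpoint
HTTP_BASE = "https://onsite-storage.example.org/raw"        # on-site storage exposed over HTTPS

class TransferTrigger(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory or not os.path.islink(event.src_path):
            return
        real_path = os.path.realpath(event.src_path)            # resolve the symbolic link
        uri = HTTP_BASE + "/" + os.path.basename(real_path)     # build a downloadable URI
        payload = {
            "jsonrpc": "2.0", "id": "dts",
            "method": "aria2.addUri",
            "params": ["token:SECRET", [uri], {"max-connection-per-server": "8"}],
        }
        resp = requests.post(ARIA2_RPC, json=payload, timeout=30)
        resp.raise_for_status()
        print("queued", uri, "gid:", resp.json()["result"])
        os.remove(event.src_path)   # in production, remove only after confirmed completion

observer = Observer()
observer.schedule(TransferTrigger(), path="/data/_totransfer", recursive=False)
observer.start()
observer.join()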
8.3. Dataset Ingestion
The ingestion process must adhere to the
Open Archival Information System (OAIS) model, which requires that only verified and validated data products be archived. This mandates a structured, pre-ingestion validation phase, where data integrity and metadata completeness are confirmed before registration, and for ingesting datasets, minimal
Data Product Acceptable Requirements (DPAR) apply (i.e., checksum, FITS header format, and content are verified). These verification steps cannot be postponed to on-the-fly registration, since the file catalog can be updated only when the data product is ready to be registered/stored, even for temporary data (see
Figure 13).
8.3.1. Prerequisites
- A.
The _toingest/storage-pool directory must be POSIX-accessible, even if hosted on object storage.
- B.
The Python environment must include the fitsio (or astropy), json, rucio, and rethinkdb libraries.
- C.
The external storage endpoints called Remote Storage Elements (RSEs) must be accessible via standard A&A protocols (e.g., IAM tokens or legacy credentials).
- D.
A write-enabled RethinkDB node must be reachable on the local network.
8.3.2. Typical Workflow
- (1)
Data Staging: Data products from Data Producers (pipelines, simulations, or DTS) are placed in _toingest/.
- (2)
SIP Creation: A Submission Information Package (SIP) is generated, including checksums to verify file integrity.
- (3)
Metadata Validation: FITS headers are parsed and validated to ensure required metadata fields are present, correctly typed, and semantically consistent.
- (4)
Storage Upload:
- (a)
Files are uploaded to an Object Storage path (e.g., a dCache filesystem) using RUCIO or equivalent tools such as StoRM, a common Storage Resource Manager in data-grid implementations; see references [
15].
- (b)
If already present in the storage, only a move to a final archive path is needed.
- (c)
Upload status is monitored; once confirmed, metadata (e.g., scope, dataset, RSE) is added to a corresponding JSON record.
Alternative: Use gfal2 to upload directly, guided by storage protocol settings in the ReThinkDB StoragePool collection.
- (5)
Database Registration: Finalized JSON is ingested into the RethinkDB archive, changing the file status to “ingested” and completing the Archival Information Package (AIP) creation.
- (6)
Trigger Replication: Upon new entry detection (via RethinkDB’s changefeed), the MAKE_REPLICA process is automatically launched.
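The following is a minimal sketch of steps (2)–(5), assuming illustrative table and keyword names: checksum computation for the SIP, a minimal FITS header check, and registration of the resulting JSON record in RethinkDB, whose change-feed then launches the replication step.

# Sketch of SIP creation, metadata validation and catalog registration.
# Table name (DLFITS), required keywords and paths are illustrative assumptions.
import hashlib
import os
from astropy.io import fits
from rethinkdb import RethinkDB

REQUIRED_KEYS = ["DATE-OBS", "OBS_ID", "TELESCOP"]   # minimal DPAR-like check

def md5sum(path, chunk=8 * 1024 * 1024):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def build_sip(path):
    header = fits.getheader(path)                     # primary FITS header
    missing = [k for k in REQUIRED_KEYS if k not in header]
    if missing:
        raise ValueError(f"{path}: missing mandatory keywords {missing}")
    return {
        "filename": os.path.basename(path),
        "checksum_md5": md5sum(path),
        "size_bytes": os.path.getsize(path),
        "dateobs": header["DATE-OBS"],
        "status": "staged",
        "replicas": [],
    }

r = RethinkDB()
conn = r.connect(host="localhost", port=28015, db="archive")
for name in os.listdir("/data/_toingest"):
    sip = build_sip(os.path.join("/data/_toingest", name))
    # ... upload to object storage with RUCIO/gfal2, then add scope/dataset/RSE ...
    sip["status"] = "ingested"                         # completes the AIP record
    r.table("DLFITS").insert(sip).run(conn)            # change-feed triggers MAKE_REPLICA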
8.4. Replica Management in CTAARCHS: Automation and Policy Enforcement
As part of the data ingestion process (point no. 6), automated replication ensures compliance with redundancy and long-term preservation policies. Triggered via a change-feed from the ReThinkDB file catalog, the replication logic references a DATA_POLICY_REPLICATION table to determine the required number of copies per data type and storage level. This shared DB collection is the interface to collect all Retention Policies and Custodial Rules to be assigned to data products.
If no policy rule is found, the data product is assumed to be for temporary processing only. Policies define replication support types (e.g., hot, cold, or hot + cold) and preservation intent. This mechanism fulfills key archival use cases such as tracking preservation state and monitoring physical data locations across distributed storage resources.
8.4.1. Replication Status Levels
Ingested: one off-site catalog record exists.
Archived: at least one replica stored across another RSE.
Preserved: includes a backup on cold storage.
Each replication rule specifies the data type, the number of required replicas, and the preferred storage configuration. Any record of the DATA_POLICY_REPLICATION table is called a “Replication Rule”; the following are examples:
{ "ruleid": "1", "rulename": "AMAS_dl0-raw", "datatype": "dl0.raw", "replica_lev": "2", "rule": "preserve", "supports": "hot + cold", "timeseries": [ {"RSE1": "jan-mar"}, {"RSE2": "apr-jun"}, {"RSE3": "jul-sep"}, {"RSE4": "oct-dec"} ] }
{ "ruleid": "1", "rulename": "AMAS_dl0-fits", "datatype": "dl0.fits", "replica_lev": "3", "rule": "preserve", "supports": "any" }
{ "ruleid": "2", "rulename": "AMAS_dl1-fits", "datatype": "dl1[a-c].fits", "replica_lev": "1", "rule": "ingest", "supports": "any" }
{ "ruleid": "3", "rulename": "AMAS_dl3-fits", "datatype": "dl3.fits", "replica_lev": "3", "rule": "ingest", "supports": "any" }
A generic UML of the Make_Replica is shown in
Figure 14.
8.4.2. Prerequisites
- A.
All target RSEs must be reachable over secure protocols (e.g., HTTPS, xrootd), and relevant ports must be open across data centers.
- B.
The ReThinkDB cluster must support read/write access from local clients.
- C.
Each off-site RSE must run an
ARIA2c daemon for parallel downloads and transfer monitoring, see
Appendix A.
8.4.3. Typical Workflow
- (0)
Data coming from Data Producers generate a change in the DB catalog.
- (1)
Ingestion completion updates the file catalog, triggering the replication process via the change-feed.
- (2)
The client fetches the file’s URI (2a), matches it against the replication policy (2b), and evaluates eligible RSEs based on latency, throughput, and availability (2c).
- (3)
It initiates parallel data transfers using ARIA2c RPC (3a) and monitors each transfer (3c).
- (4)
On success, the checksum is verified, a new replica record is added to the file’s JSON metadata, and the replica count is updated.
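A minimal sketch of the policy lookup behind steps (1) and (2) is given below: the change-feed delivers the newly ingested record, its data type is matched (by regular expression, to cover patterns such as dl1[a-c].fits) against the DATA_POLICY_REPLICATION rules, and the number of missing replicas is derived. Field names follow the example rules above; RSE ranking and the ARIA2c transfer calls are omitted.

# Sketch: match a newly ingested product against DATA_POLICY_REPLICATION rules.
# Field names follow the example rules above; transfer submission is omitted.
import re
from rethinkdb import RethinkDB

r = RethinkDB()
conn = r.connect(host="localhost", port=28015, db="archive")

def find_rule(datatype, rules):
    """Return the first rule whose 'datatype' pattern matches (dots are literal)."""
    for rule in rules:
        pattern = rule["datatype"].replace(".", r"\.")   # keep [a-c] classes, escape dots
        if re.fullmatch(pattern, datatype):
            return rule
    return None   # no rule: the product is treated as temporary, no replication

rules = list(r.table("DATA_POLICY_REPLICATION").run(conn))

feed = r.table("DLFITS").filter({"status": "ingested"}).changes().run(conn)
for change in feed:
    doc = change["new_val"]
    if doc is None:
        continue
    rule = find_rule(doc.get("datatype", ""), rules)
    if rule is None:
        continue
    missing = int(rule["replica_lev"]) - len(doc.get("replicas", []))
    if missing > 0:
        print(f"{doc['filename']}: schedule {missing} replica(s), "
              f"support={rule['supports']}, intent={rule['rule']}")
        # ... select eligible RSEs and start parallel ARIA2c transfers here ...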
8.5. Dataset Search
Once a data product is ingested—regardless of its archival status (“ingested”, “archived”, or “preserved”)—its metadata becomes searchable through the
ReThinkDB catalog. This enables external users to retrieve dataset identifiers and associated replica information (see
Figure 15).
8.5.1. Prerequisite
- A.
Read-only access to the ReThinkDB cluster must be available from at least one node in the local network.
8.5.2. Typical Workflow
- (1)
A user submits a query via the archive interface, specifying metadata fields of interest.
- (2)
The interface maps the request to searchable metadata intervals.
- (3)
It then queries the ReThinkDB cluster through a local node.
- (4)
The database returns a list of matching data products in JSON format, including URIs and identifiers.
- (5)
This list is delivered to the user for potential retrieval.
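As an illustration of steps (2)–(4), a date-range request could be translated into a ReQL query as sketched below; the table name, index, and fields are assumptions consistent with the indexing example in Section 9.6.

# Sketch: translate a user metadata request into a ReQL query and return JSON.
# Table (DLFITS), index (dateobs) and field names are illustrative assumptions.
import json
from rethinkdb import RethinkDB

r = RethinkDB()
conn = r.connect(host="localhost", port=28015, db="archive")

def search(date_from, date_to, datatype=None):
    query = r.table("DLFITS").between(date_from, date_to, index="dateobs")
    if datatype:
        query = query.filter({"datatype": datatype})
    docs = query.pluck("filename", "replicas", "datatype", "dateobs").run(conn)
    return list(docs)

result = search("2024-12-01", "2024-12-31", datatype="dl3.fits")
print(json.dumps({"files": result, "nfiles": len(result)}, indent=2))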
8.6. Dataset Retrieval
Once a dataset is ingested into the archive, regardless of its status (ingested, archived, or preserved), external users can query the RethinkDB metadata catalog to retrieve corresponding datasets and their available replicas. This process involves querying the catalog for metadata, translating the request into predefined searchable metadata intervals, and executing the query via a local node connection. The database returns a JSON file list containing URIs and identifiers of data products matching the query criteria, which are then provided to the user, see
Figure 16 for a generic workflow.
8.6.1. Prerequisites
- A.
Remote Storage Elements (RSEs) must be accessible across data centers via secure protocols (e.g., HTTPS, XRootD), with required ports open. Resources may be object storage pools or directories exposed via HTTPS with encryption and authentication.
- B.
The RethinkDB cluster must be accessible in read-write mode from at least one node within the local network.
8.6.2. Typical Workflow
- (1)
A Data Consumer provides a JSON list of requested data products to the retrieval interface.
- (2)
The system queries the local RethinkDB node.
- (3)
The database returns a list of replica URIs for each product.
- (4)
The interface verifies the existence of each replica.
- (5)
Valid URIs are downloaded in parallel;
(5bis) if no URI from the replica list is available, the system queries the RUCIO catalog using the DID (filename + scope) to locate the file.
- (6)
The parallel download starts for any available URI.
- (7)
Retrieved files are stored in a user-specified local or remote directory.
8.7. Search and Retrieve Integration/Concatenation
Search and Retrieve are often combined as a single use case, chaining Python methods to locate metadata and then download the associated data products efficiently, see
Figure 17.
The search.py utility interfaces with the RethinkDB cluster to locate data products based on metadata queries. Depending on the execution context, results may point to internal POSIX paths, external URIs, or any remote-storage-based identifiers (i.e., in RUCIO we have RSE + LFN + SCOPE).
The advanced AMAS Search Interface exposes a REST API via a dedicated web server, supporting fast and complex metadata-based queries across distributed data centers. Users can execute searches from any location or pipeline stage, provided they have network access; the AMAS-REST domain is public and can be browsed without authorization (see the GitLab CTAARCHS references).
A typical query can be executed with a simple curl command, specifying key–value filters such as date, run number, or filename (see Figure 18).
echo $LIST
{
  "files": [
    "… more URIs …" ],
  "nfiles": 9
}
The dataset search returns a JSON-formatted file list containing URIs pointing to RSE storage locations. Access typically requires user authentication.
The retrieve.py interface reads this list (e.g., from STDIN) and then downloads the corresponding files to a user-specified directory. It connects to the local RethinkDB node using read-only credentials to fetch replica metadata.
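For illustration, the sketch below issues such a metadata query through the REST API from Python and pipes the resulting JSON list to retrieve.py via STDIN; the endpoint URL, query parameters, and command-line options are hypothetical placeholders rather than the documented interface.

# Sketch: query the search REST API and hand the resulting list to retrieve.py.
# The endpoint URL, parameters and CLI options are hypothetical placeholders.
import json
import subprocess
import requests

resp = requests.get(
    "https://amas-rest.example.org/api/v1/search",
    params={"dateobs": "2024-12-06", "datatype": "dl3.fits"},
    timeout=60,
)
resp.raise_for_status()
file_list = resp.json()                      # e.g., {"files": [...URIs...], "nfiles": N}
print(f"found {file_list['nfiles']} files")

# Pipe the JSON list to the retrieval interface via STDIN.
subprocess.run(
    ["python", "retrieve.py", "--outdir", "/data/downloads"],
    input=json.dumps(file_list),
    text=True,
    check=True,
)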
To optimize performance, the system dynamically selects the most efficient replica for each file using a “down-cost” algorithm. This decision is based on the following site-specific parameters:
Cost(i): estimated retrieval cost from site i;
Latency(i): time to initiate transfer;
FileSize: total size of the file;
Throughput(i): nominal data rate;
Workload(i): current system load (0 = idle, 1 = saturated);
Distance(i): network or geographic distance.
These parameters are used to minimize download time and network usage. Workload reflects real-time system strain, while throughput, latency, and distance help assess the optimal retrieval path—especially important in geographically distributed storage systems or under regulatory constraints. Distance could be affected by latency or used explicitly if needed for geo-pinning or regulatory concerns.
The optimal replica for download is dynamically selected by computing the retrieval cost (Costi) in real time. The replica with the lowest cost is chosen, and its URI is returned. Final access requires authentication and authorization.
Latency(i) is easily measured via network ping; Throughput(i) and Distance(i) are typically available from infrastructure documentation. Estimating Workload(i), however, is more complex and can be approximated by comparing the MeasuredThroughput(i), obtained from a small test download, to the NominalThroughput(i).
if latency > threshold or throughput < expected * 0.5:
    Workload_i = 0.8  # heavy
elif throughput < expected * 0.8:
    Workload_i = 0.5  # moderate
else:
    Workload_i = 0.1  # low
The computation is performed periodically (hourly) by an agent module in each datacenter; a temporary ranking is assigned to each storage resource, and the result is saved in a dedicated collection of the RethinkDB cluster so that it can be queried before starting each data transfer.
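Since the exact down-cost formula is implementation specific, the following sketch shows one plausible form, combining latency, effective throughput de-rated by workload, and a distance penalty; the weights and the combination are assumptions, not the production algorithm.

# Sketch of a plausible "down-cost" replica selection. The cost formula and the
# distance weight are assumptions; the production algorithm may differ.
def down_cost(site, file_size_bytes, distance_weight=0.01):
    """Estimated retrieval time (s) from one site, plus a distance penalty."""
    effective_throughput = site["throughput_Bps"] * (1.0 - site["workload"])
    transfer_time = file_size_bytes / max(effective_throughput, 1.0)
    return site["latency_s"] + transfer_time + distance_weight * site["distance_km"]

def best_replica(replicas, file_size_bytes):
    """Return the replica URI with the lowest estimated cost."""
    return min(replicas, key=lambda s: down_cost(s, file_size_bytes))["uri"]

replicas = [
    {"uri": "https://rse1.example.org/f.fits", "latency_s": 0.02,
     "throughput_Bps": 1.0e9, "workload": 0.8, "distance_km": 30},
    {"uri": "https://rse2.example.org/f.fits", "latency_s": 0.05,
     "throughput_Bps": 1.0e9, "workload": 0.1, "distance_km": 600},
]
print(best_replica(replicas, file_size_bytes=2 * 10**9))   # lightly loaded site wins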
The interfaces (see Python implementations in
Appendix A) are readily accessible via the amas-api_1.0.2
Docker image (also usable with Podman), so that they can be run on containerd/CRI in Kubernetes or HPC environments.
docker load -i amas-api_1.0.2.tar;
docker run -it amas-environment bash;
./venv/bin/python ./search.py
9. Deployment of CTAARCHS at the CIDC and AMAS
Deploying a data center requires careful planning to ensure efficiency, scalability, and security. At the CTA Italy Data Center (CIDC), deployment of the ASTRI and Miniarray Archive System followed a structured strategy aligned with observatory goals and technical constraints. Emphasis was placed on building a secure, scalable infrastructure, minimizing risks while supporting operational demands. Initial phases included logical and physical design—rack layouts, network topology, cooling, and power—supported by the Tier-2 facility at INFN Frascati, where CIDC is currently hosted [
16,
17].
9.1. Hardware Resources
The CTAARCHS implementation is based on the AMAS archive system, supporting the ASTRI-Horn prototype and the nine-telescope ASTRI Miniarray at Teide Observatory, Tenerife. AMAS represents the complete off-site infrastructure for these projects and serves as the technical deployment of CTAARCHS.
Built on the CTAARCHS/AMAS IaaS, the CTA Italy Data Center (CIDC) forms one of four designated off-site data centers for the CTAO Project (see
Section 9.7.1). Hardware requirements for computing and storage are defined annually by each project office and reflected in a procurement plan for 2025–2026.
Software services follow a Continuous Integration/Delivery (CI/CD) model, with the exception of the archive system, which must be accessible from project initiation. Archive deployment is coordinated with collaboration partners and adapted through a virtualized abstraction layer.
The AMAS implementation of CTAARCHS relies on shared hardware located at the following three main sites:
INAF—OAR, Astronomical Observatory of Rome;
INAF—SSDC, ASI Science Data Center;
INFN—LNF, National Laboratories of Frascati.
In total, the AMAS hardware consists of a federated distributed “hot” storage of 6 PB (directly upgradable to 10 PB); around 10 PB (directly upgradable to 100 PB) of “cold” storage (a Fibre Channel tape library); an HPC@OAR cluster of about 800 cores (8.8 kHS06) with ~1 TB of RAM; and a grid HTC@LNF cluster of about 1400 cores (15.4 kHS06) with ~2.5 TB of RAM. At SSDC, only minimal services and resources (not listed here) are foreseen for sharing MWL data.
9.2. The Setup
Datacenters can join the CTAARCHS environment by registering to access repositories of Docker containers, virtual machines, and Kubernetes (K8s) orchestration for various services.
The K8s clusters at INAF-OAR and INFN-LNF sites share resources within the ReDB “resource_pools” collection, managing storage, computing, services, and user registrations. The distributed RethinkDB cluster spans multiple sites—OAR (DC1), LNF (DC2), and SSDC (DC3, pending activation)—as illustrated in
Figure 19.
9.3. Users Interfaces
The main users of the common archive and data management system are:
Pipeline/Simulation users—for low-level data products
Science users—for higher-level data products
BDMS users and Admin—for high-level operations on archives
Access type depends on the nature of the task. If specific archive or database management operations are required, the preferred method is through the command-line interface (CLI) from the BDMS or Science Archive machines. This access is POSIX-compliant, and BDMS/Science Admin roles are mapped within the authorization framework (see
Section 9.8).
Bulk database operations are restricted to administrators and can be executed via the ReThinkDB console. Portal access supports both science users (in the principal investigator role) and admins, who may also manage LDAP credentials.
CLI and POSIX access is granted to low-level users such as automated pipelines, enabling operations like bulk search and retrieval, file renaming/removal, and other data management tasks. Authorization for all access types adheres to the A&A policy and its associated services (see
Section 9.8).
Using the Data I/O Interfaces, authorized science users can search, query, and retrieve proprietary data within defined POSIX paths. Proprietary data mapping is handled at the Science Database-level, ensuring secure and structured access.
9.4. Pipeline/Simulation Users Access and Interface
Users access data through a variety of tools and workflows. Simulation and Pipeline users utilize Workload Management Systems (WMS) such as DIRAC or PANDA to execute Directed Acyclic Graphs (DAGs) on grid computing or HPC environments (e.g., Condor, Slurm), interacting with remote object storage systems.
A more recent approach involves a Kubernetes-based computing element service (CES) to orchestrate job queues and manage virtual organizations and authorization mechanisms. However, current WMS platforms like DIRAC and PANDA are not yet compatible with Kubernetes environments.
Simulation users typically write their output directly to Object Storage for asynchronous ingestion. Pipeline users, on the other hand, first query the archive using metadata to locate input datasets, and then process data close to storage to reduce data transfer overhead (this process is detailed in the “Search” use case). Higher-level data products can be generated inside the same data centers where the bulk of the dataset physically resides, using
Airflow’s DAG Editor, which delivers science-ready outputs to the
Science Archive collection (see
Section 9.7).
Note: All I/O operations must strictly adhere to defined Use Cases (UC) without custom modifications. If a WMS cannot conform to these requirements, it should be adapted or replaced. The archive design must remain unchanged: interactions between the archive and WMS must occur exclusively through defined interfaces. Customization is permitted only at the interface level, as the two systems must remain strictly decoupled.
9.5. Unconventional Challenges
International collaborations face challenges due to political mandates to use pre-existing systems or software developed as IKC and designed for other data models and/or scientific scenarios. For instance, the RUCIO Data Management System and/or DIRAC as the Workload Management System impose several limitations on CTAARCHS. These software packages often become single points of failure (SPOFs) in a no-SPOF infrastructure, forcing inefficient archive adaptations and violating the OAIS principles that mandate strict separation between Data Producers, Data Consumers, and archive submodules through standard interfaces. Modifying the archive requirements to adapt to these limitations is detrimental to the continuation of a good collaboration.
For example, RUCIO suffers from a SPOF in its centralized PostgreSQL catalog and is complex for multi-institutional sharing due to its fixed, CERN-centric data model, leading to storage overhead and high operational costs. A natural antagonist of RUCIO is
OneData, which is also a distributed data management system, designed to integrate diverse storage resources and facilitate seamless data access and sharing across institutions. Unlike RUCIO, OneData offers a storage federation model based on a distributed, document-oriented database cluster (i.e., Couchbase), which better supports metadata management, open data, and collaboration, aligning with Open Science goals (see
Table 5). Choosing storage federation technology to serve an astronomical observatory community should prioritize technical effectiveness and use case fit over political or economic pressures [
18].
Finally, because of the several points of failure involved in the RUCIO environment, it is clear that no persistent archival service can depend on potentially unstable archival software without a “plan B” ready and usable.
We therefore deprecate the wide use of RUCIO as a central storage system for a good archive and advocate relegating it to a marginal common interface, since it is optimized for different storage element protocols.
Throughout this work, the term RSE (Remote Storage Element) is used generically to denote any remote storage resource accessible via standard protocols, independent of the RUCIO framework.
9.6. Database and Data Model Interfaces
Intermediate and end users may require direct access to metadata for scientific analysis or simulation output. To support this, a dedicated read-only user role enables querying across all data levels. For FITS files, primary headers are indexed within the data model, allowing advanced search capabilities. A
sample data model and query interface are provided in
Appendix A (see
Section 9.7.1), with customizable code available for tailored use cases.
The code sample is similar to those used for the find and query interface client but can be expressly customized on demand (see
Appendix A for insights).
RethinkDB supports the creation of secondary indexes on metadata fields, enabling faster queries as datasets grow. This feature is simple to implement, with no strict limits on the number of indexes, making it highly effective for optimizing search performance over time.
r.table("DLFITS").index_create("dateobs").run(conn)  # CREATE INDEX
r.table("DLFITS").index_wait("dateobs").run(conn)  # WAIT FOR COMPLETION
# Query using the index
r.table("DLFITS").get_all("2024-12-06", index="dateobs").run(conn)
9.7. Web Archive Portal for the End-User and Other Interfaces
Science users—primarily researchers accessing high-level data products—interact with the archive via a dedicated web portal. These users are considered Data Consumers and are expected to retrieve level-3 datasets in read-only mode to conduct analyses or run customizable pipelines. Data dissemination relies on the distributed database, with pipeline execution triggered by change-feed mechanisms monitoring the level-3 collection (see
Figure 20).
9.7.1. Prerequisites
- A.
Python environment must include fitsio (via Astropy), json, rethinkdb, and rucio libraries.
- B.
Bulk and science RSEs must be accessible via supported authentication methods: IAM tokens (preferred), legacy grid certificates (deprecated), or credentials.
- C.
The ReThinkDB cluster must be accessible in read-only mode via at least one local node.
- D.
The Science Database may reside within ReThinkDB or any compatible RDBMS.
9.7.2. Typical Workflow
- (0)
A pipeline processes data and ingests new DL3 products into the archive.
- (1)
Detection of new DL3 entries triggers the get&process action.
- (2)
The associated URI is fetched from the source RSE and transferred to the Science RSE.
- (3)
DL3 metadata are extracted from ReThinkDB and written to the Science DB.
- (4)
Optional automated workflows convert DL3 to DL4 and DL5 products.
Note: Since higher-level science data (DL3–DL5) involve smaller volumes, they may be handled via
lightweight solutions such as local Airflow DAGs and executed on dedicated clusters (see Figure 21).
Figure 21.
Simple processing to pass from DL0 to science data.
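As a concrete illustration of such a lightweight solution, a minimal Airflow DAG chaining the DL3-to-DL4 and DL4-to-DL5 steps could look as follows; the task commands, schedule, and DAG name are placeholders, and the real pipelines submit their work to the Slurm queue described below.

# Minimal sketch of an Airflow DAG chaining DL3 -> DL4 -> DL5 science processing.
# Commands, schedule and dag_id are placeholders, not the production pipeline.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dl3_to_science",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dl3_to_dl4 = BashOperator(
        task_id="dl3_to_dl4",
        bash_command="python make_dl4.py --input /science/dl3 --output /science/dl4",
    )
    dl4_to_dl5 = BashOperator(
        task_id="dl4_to_dl5",
        bash_command="python make_dl5.py --input /science/dl4 --output /science/dl5",
    )
    dl3_to_dl4 >> dl4_to_dl5   # run the DL4 production before DL5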
Community LDAP or VPN access enables shared resource usage and supports defining Airflow pipeline steps. The Search and Retrieve Python APIs remain functional but require read-only access to the ReThinkDB cluster. Alternatively, REST-API endpoints can be used to bypass direct database access (see
Figure 22).
The low-level processing of large data volumes can easily be shared and distributed among DPPN datacenters, while the science processing can be concentrated at one site using a dedicated Slurm queue and automatic Airflow DAG processing.
The output Science RSE can benefit from localized access dedicated only to scientific end users, either through a web portal to browse and access proprietary “proposal” data or through a web gateway facility sharing a user-defined policy repository bucket on a cloud-based storage utility such as MinIO.
Note: Scientific end-user data access cannot rely on complicated grid-based data I/O mechanisms like IAM (grid-based certificates/tokens for authentication) required for low-level big data processing. A cloud-based approach in the style of Amazon AWS (i.e.,
a customized MinIO facility) gives the end user very simple access based on a common LDAP authentication (login + password) and permits access to proprietary data products using standard POSIX and REST API access, as well as mounting and sharing local storage areas for analysis and collaboration within research groups [19].
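A sketch of how such a MinIO-based gateway could expose a proprietary data product to an authenticated end user is shown below; the endpoint, bucket layout, and credential mapping from LDAP are illustrative assumptions.

# Sketch: expose a proprietary data product to an end user via MinIO.
# Endpoint, bucket naming and credential mapping (from LDAP) are assumptions.
from datetime import timedelta
from minio import Minio

client = Minio(
    "minio.example.org",
    access_key="USER_KEY",       # mapped from the user's LDAP account
    secret_key="USER_SECRET",
    secure=True,
)

bucket = "proposal-2025a-042"    # e.g., one bucket per approved proposal (illustrative)
obj = "dl3/run001234_dl3.fits"

# Time-limited link the PI can open in a browser or pass to analysis tools.
url = client.presigned_get_object(bucket, obj, expires=timedelta(hours=2))
print(url)

# The same bucket can also be mounted POSIX-like on the user's machine
# (e.g., with an S3 FUSE client pointed at the MinIO endpoint).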
A simple implementation for High Energy Astronomical Archives has been realized for the
ASTRI Project in the
AMAS, the ASTRI and Miniarray Archive System, containing the proposal handling system, an observing scheduler and planner, PI web-based data access for all relevant scientific data levels (and, if necessary, also for logs/alarms, housekeeping, and data quality checks), and a quicklook browser as a service to browse data, perform a simple evaluation of the goodness of a data acquisition, and view plots and graphs. In common usage, the end user accesses the system through a collaboration VPN, shares a local cluster facility, uses find/search queries and data retrieval, and uses a user-defined namespace local bucket shared on a MinIO infrastructure interface [
20,
21].
The bulk data processing is distributed among three different nodes, and the archive system is distributed as well; the science processing, instead, is principally driven through an Airflow facility running on top of a Slurm HPC queue in the OAR cluster.
As shown, storage and computing resources are built on the
AAAS (Astronomical Archive as a Service) paradigm applied to the astronomical use case; the resulting infrastructure is easily horizontally scalable and can be upgraded without service interruption when needed. See
Figure 23 for current implementations.
9.8. Authentication and Authorization (A&A) Challenges
Collaboration and interconnection with international resources require a common approach to
Authentication and Authorization (A&A), as the management and usage of resources must be orchestrated across multiple institutions, countries, and potentially varying ICT security levels. CTAARCHS is ready to implement most A&A protocols, ranging from customized LDAP-based accounting to combinations of various A&A services [
22].
Typically, the identification of who is accessing a system, software, or service relies on storing a user ID associated with standard eduGAIN credentials in a credential database. Some collaborations adopt proprietary systems, such as Microsoft Entra ID, to store user credentials and maintain a unified authentication layer.
Once a user is authenticated, their authorization level must be verified to determine whether they are permitted to access a specific platform, resource, or dataset. A common solution is to integrate Grouper as part of a Common Trusted Access Platform architecture, which enables attribute- and role-based authorization and group membership management in an auditable manner. This allows Grouper-enabled services to decentralize the management of authorization for authenticated users.
In scenarios where multiple services work together, authorization levels are often established through SSL grid certificates or via temporary (or persistent) tokens. Using Indigo IAM, services can validate these tokens to grant access to resources. This is especially useful for automated agents and pipelines that need access to storage or computing facilities. In such cases, a specialized mapping mechanism is required to distinguish between public and proprietary data.
Any high-level data produced by a pipeline process must be mapped back to its origin via the proposal database, ensuring access is granted only to the relevant Principal Investigators (PIs and Co-PIs). In CTAARCHS, this mapping is managed through the RethinkDB Science collection, which links data to the PI table within the proposal handling system and is ready to integrate PI information with any A&A service as needed. For small-scale services, CTAARCHS allows A&A through a simple and persistent LDAP-structured database, including a customizable list of user credentials, attributes, and roles.
10. Conclusions and Recommendations
The definition of archive solutions to be adopted in a wide range of scientific collaborations is a crucial step for the success of a project, especially in the astronomical fields where, differently from nuclear and subnuclear particle experiments, the number of end-users is several orders of magnitude greater.
Although political choices can be made and pushed on the basis of pre-existing economic and technological contributions, choosing the best technologies to assemble the most efficient system for the project's use cases is the most important obstacle to overcome, and it depends strongly on the project management capabilities of the different teams (working groups) identified to assemble the different modules/packages.
With CTAARCHS we present a feasible and versatile implementation of the entire archival and data management ecosystem needed for an astronomical observatory use case, together with a set of possible alternative scenarios and equally valid technological choices.
11. Software Resources and Repositories
In this section we summarize a short list of CTAARCHS software packages, modules, resources, and repositories, including third-party packages.
11.1. CTAARCHS and AMAS