1. Introduction
Digital twin (DT) platforms are designed to represent physical objects and processes in a digital environment and to support monitoring, analysis, and data-driven decision-making [1,2,3]. In contemporary research and practice, a digital twin is commonly understood not as a static digital model, but as a data-connected representation of a physical entity whose state, behavior, and operational context are updated throughout its lifecycle. The practical significance of this concept increases substantially when the twin can ingest telemetry continuously and use it for condition monitoring, deviation detection, diagnostics, and operational support across domains such as buildings, infrastructure, energy, and smart-city systems [1,2,3,4,5].
In this study, telemetry refers to measurements and service-level data received from sensors, gateways, microcontrollers, meters, and software services that generate or relay operational indicators. In large-scale deployment settings, digital twins rarely function as isolated instances; rather, they operate as interconnected elements within broader digital environments in which many assets produce heterogeneous streams that must be interpreted, normalized, and linked consistently. For this reason, telemetry ingestion is not a secondary implementation detail, but a foundational platform function. Without reliable ingestion, normalization, and validation of incoming data, digital twin systems remain difficult to scale, fragile to integrate, and limited in their ability to provide interoperable and reproducible results.
1.1. The Problem of Heterogeneity and the “Fragility” of Integrations
In most cases, the central problem of digital twin platforms is not the absence of sensors or communication channels, but data heterogeneity. Under real conditions, telemetry sources are rarely aligned with each other: the same indicator may have different names (for example, temp, temperatureC, t_air), be transmitted in different units depending on the measured phenomenon (for example, temperature in °C or K, humidity in %, and illuminance in lx), have different data types (number or string), arrive in different message formats (different fields, JSON nesting, different timestamp conventions), and follow different rules for identifying the object to which the measurement belongs (sensor identifier, device identifier, “room-1,” “greenhouse-2,” and so on). As a result, each new sensor or service connection requires a separate “manual” integration, while data quality becomes unstable because of errors, missing values, or unexpected values.
A separate critical problem arises when the rules governing the data format exchanged between parts of the system change, for example, when a device supplier renames a field, changes the unit of measurement, or modifies the message structure. If the platform does not have a formal “data contract” and automatic validation at the input stage, previously functioning integrations may silently begin to produce incorrect data or stop working entirely. The “fragility” of integrations is therefore associated not only with different formats, but also with the lack of clearly defined rules: what exactly a measurement means, what entity it belongs to, what units are permissible, and which fields are mandatory.
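The effect of such a contract can be illustrated with a minimal sketch. The contract contents, the field names (`deviceId`, `temperature`, `unit`, `ts`), and the `check_contract` helper are hypothetical and are not taken from the platform; the sketch only shows how an input-stage check turns a renamed field or a changed unit into an explicit rejection rather than a silent error.

```python
# Hypothetical data contract for one telemetry source (illustrative only):
# which fields are mandatory and which units are permissible.
CONTRACT = {
    "required": {"deviceId", "temperature", "unit", "ts"},
    "permitted_units": {"C", "K"},
}


def check_contract(message: dict) -> list[str]:
    """Return a list of contract violations; an empty list means 'accept'."""
    problems = []
    # Mandatory fields that are absent from the message.
    for name in sorted(CONTRACT["required"] - set(message)):
        problems.append(f"missing field: {name}")
    # Unit must be one of the permitted values.
    unit = message.get("unit")
    if unit is not None and unit not in CONTRACT["permitted_units"]:
        problems.append(f"unit not permitted: {unit}")
    return problems


ok = {"deviceId": "room-1", "temperature": 21.5, "unit": "C",
      "ts": "2024-01-01T00:00:00Z"}
# The supplier renamed 'temperature' to 'temp' and switched the unit.
drifted = {"deviceId": "room-1", "temp": 70.7, "unit": "F",
           "ts": "2024-01-01T00:00:00Z"}

ok_problems = check_contract(ok)
drifted_problems = check_contract(drifted)
print(ok_problems)
print(drifted_problems)
```

Without the check, the drifted message would still be syntactically valid JSON and could be stored as if nothing had changed.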
1.2. Quality of IoT Data
Even if the message format is the same, telemetry streams may contain missing values, noise, drift, duplicates, and other defects. In applied IoT scenarios, this appears as instability in time series, value “spikes,” irregular arrival patterns, and incomplete sets of fields. As a result, the platform may accumulate data that are formally accepted but, in fact, unsuitable for correct analytics or comparison across different sources. This leads to the requirement for early quality control: some problems must be filtered out at the input stage, before the data enter the storage and analytical modules [6,7,8,9,10,11].
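A minimal sketch of such input-stage screening is given below. It assumes numeric readings with hypothetical keys `sensor`, `ts`, and `value`, and a fixed jump threshold as a deliberately crude spike heuristic; none of these choices is taken from the cited studies or from the platform.

```python
import math


def screen(readings: list[dict], max_jump: float = 10.0):
    """Split readings into (accepted, rejected), where rejected holds
    (reading, reason) pairs. Assumes 'value' is numeric or None."""
    accepted, rejected = [], []
    seen = set()       # (sensor, ts) pairs already accepted
    last_value = {}    # last accepted value per sensor
    for r in readings:
        key = (r.get("sensor"), r.get("ts"))
        value = r.get("value")
        if value is None or (isinstance(value, float) and math.isnan(value)):
            rejected.append((r, "missing value"))
        elif key in seen:
            rejected.append((r, "duplicate"))
        elif (r["sensor"] in last_value
              and abs(value - last_value[r["sensor"]]) > max_jump):
            rejected.append((r, "spike"))
        else:
            accepted.append(r)
            seen.add(key)
            last_value[r["sensor"]] = value
    return accepted, rejected


readings = [
    {"sensor": "s1", "ts": 1, "value": 21.0},
    {"sensor": "s1", "ts": 1, "value": 21.0},   # same sensor and timestamp
    {"sensor": "s1", "ts": 2, "value": None},   # missing value
    {"sensor": "s1", "ts": 3, "value": 95.0},   # implausible jump
    {"sensor": "s1", "ts": 4, "value": 21.4},
]
accepted, rejected = screen(readings)
reasons = [reason for _, reason in rejected]
print(len(accepted), reasons)
```

Because the last accepted value is not updated on a rejected spike, a genuine level shift would keep being rejected; a production filter would need a more careful rule, which is outside the scope of this sketch.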
1.3. What Semantic and Context Standards Provide, and What They Do Not Cover
Semantic and context standards provide an important foundation for describing sensor observations, devices, and digital twin entities in a machine-readable and interoperable way. For example, W3C SSN/SOSA supports formal representation of sensors, observations, observable properties, and results [12], W3C Web of Things Thing Description facilitates consistent machine-readable description of device interfaces [13], and ETSI NGSI-LD provides a context-oriented model of entities, properties, and relationships that is relevant for digital twin integration [14]. In addition, OGC SensorThings API and ontology-oriented approaches such as WoTDT/DTAG contribute to interoperable representation and aggregation of sensor observations and digital twins [15,16].
At the same time, these standards primarily address representational and semantic interoperability. In practical platform development, telemetry ingestion still requires an additional operational layer that specifies how heterogeneous source fields are mapped to standardized metrics, how units are harmonized, how message structure is formally validated, and how ingestion behavior can be verified reproducibly across environments. For this reason, the role of Stage 1 in the present study is not to replace semantic or context standards, but to operationalize them as a minimal engineering layer for first-stage telemetry ingestion. A broader discussion of related standards and ontology-based approaches is provided in
Section 2.
1.4. Event Contracts, Validation, and API Description
Reliable telemetry exchange in digital twin platforms requires more than connectivity alone. Platform components must share a clear understanding of which event fields are required, how these fields should be interpreted, and how incorrect or incomplete messages should be rejected before they affect downstream storage and analytics. For this reason, first-stage telemetry ingestion benefits from a formal event structure, schema-based validation, and an explicit API description that makes acceptance rules transparent and reproducible [
17,
18,
19,
20].
At the same time, reliable first-stage telemetry ingestion also depends on several non-functional requirements that accompany, but do not replace, the semantic core itself. In particular, digital twin ingestion endpoints should consider authentication and authorization, integrity protection of exchanged data, and basic resource-governance mechanisms such as rate limiting and protection against unrestricted resource consumption, especially when requests are linked to concrete platform entities and object identifiers [
19,
21]. In smart-city and infrastructure settings, these concerns are closely connected with QoS expectations, because predictable latency and controlled request handling are needed to support reliable real-time data management; in the present study, however, such mechanisms are treated as adjacent architectural requirements of Stage 1 rather than as elements of the semantic core itself [
22].
In practice, these mechanisms are often used separately: a system may define event formats without strict validation, provide API documentation without control examples, or adopt semantic models without a reproducible ingestion procedure. This fragmentation increases the fragility of integrations when data sources evolve or message structures change. The present study addresses this problem by treating event contracts, validation, and API-described testing as coordinated elements of a minimal Stage 1 ingestion layer, while their concrete implementation is presented in
Section 4.
2. Literature Review
2.1. Data Heterogeneity and Telemetry Ingestion Challenges in Digital Twin Platforms
Digital twin platforms rely on continuous data exchange among physical assets, sensing devices, gateways, and software services. In practical deployments, however, the main challenge often lies not in the lack of telemetry sources but in the heterogeneity of the incoming data. The same measured phenomenon may be represented by different field names, units, timestamp conventions, message structures, and object identification rules. As a result, telemetry ingestion becomes one of the most fragile layers of a digital twin platform, particularly when the platform evolves and new data sources are added over time [
1,
2,
3].
This challenge becomes more pronounced in multi-source and multi-asset environments, where digital twins operate not only as isolated representations of physical objects but also as interconnected elements within larger systems. In such settings, the platform must determine what a measurement represents, to which entity it belongs, in which unit it is expressed, and whether its structure is valid for further processing. Without such alignment, integration remains dependent on source-specific custom rules, and even minor upstream changes, such as field renaming or unit reformulation, may disrupt downstream processing [
1,
2,
3,
4,
5].
Recent work on urban data platforms further shows that long-term platform usefulness depends not only on technical connectivity but also on the ability to integrate diverse datasets, support real-time data management, and sustain interoperable operational workflows across stakeholders and systems [
22]. This is important for digital twin platforms because ingestion fragility often emerges at the interface between heterogeneous data producers and platform-level operational requirements rather than at the level of isolated devices alone.
Telemetry ingestion should therefore be understood not merely as a transport step but as an operational layer in which heterogeneity is either normalized or propagated. From an engineering perspective, this layer should support standardized representation of measurements, unambiguous binding to digital twin entities, formal validation of incoming events, and repeatable verification of ingestion outcomes. These requirements are especially relevant in infrastructure and smart city settings, where telemetry supports monitoring, comparison across assets, and data-driven operational decision-making [
4,
5].
2.2. Data Quality and Early Validation of IoT Telemetry
The problem of heterogeneity is closely related to the problem of data quality. Even when telemetry is delivered successfully, IoT data streams may contain missing values, malformed fields, inconsistent units, duplicates, timestamp irregularities, noise, and incomplete records. Such defects are common in practical IoT environments because data are generated by heterogeneous hardware and software components operating under different technical and organizational conditions [
6,
7,
8,
9,
10,
11].
Prior studies have shown that data quality should not be treated solely as a post-storage analytics issue. When incorrect or incomplete telemetry is accepted without early checks, subsequent stages such as aggregation, anomaly detection, or historical analysis may be performed on invalid inputs and may therefore lead to unreliable results [
6,
7,
8,
9]. For this reason, early validation at the ingestion stage is important not only for correctness but also for the long-term maintainability and reproducibility of the platform.
For digital twin systems, this implies that the ingestion layer should perform more than syntactic acceptance of messages. It should also determine whether the event structure is valid, whether required fields are present, whether measurement values are usable, and whether the event can be linked consistently to the corresponding digital twin entity. In this sense, early validation reduces storage pollution and serves as a first barrier against silent integration failures [
6,
7,
8,
9,
11].
2.3. Semantic, Ontology, and Context-Based Approaches for Sensor and Digital Twin Data
To reduce semantic ambiguity, several standards and ontology-based approaches provide a common representational foundation for sensor observations, device descriptions, and context entities. W3C SSN/SOSA defines central concepts such as sensor, observation, observable property, feature of interest, and result, thereby enabling more formal interpretation of sensor data [12]. W3C Web of Things Thing Description supports machine-readable descriptions of devices and their interfaces, including properties, actions, and events, and thus contributes to a more consistent representation of telemetry sources [13]. ETSI NGSI-LD provides a context-oriented model of entities, properties, and relationships and is widely used in context management systems, including those related to digital twins [14,20]. OGC SensorThings API further supports interoperable representation of sensor observations and data streams [15].
In parallel, ontology-related and knowledge-graph-based approaches have been proposed to strengthen semantic interoperability and digital twin integration. Recent studies indicate that semantic models and knowledge graphs can support richer linking among assets, devices, observations, and domain relations in industrial and cyber-physical environments [23,24]. Related approaches such as DTAG further demonstrate how digital twins can be aggregated or described through ontology-driven structures [16]. Ref. [25] further argues that ontological analysis is important for addressing semantic inconsistencies, heterogeneous data sources, and interoperability challenges in cross-domain digital twin systems, while ref. [26] shows how an ontology-driven framework can connect IoT devices, real-time data processing, and service-level logic within a machine-interpretable digital twin environment. Together, these studies reinforce the view that interoperability in digital twin systems is not only a data transport problem but also a problem of meaning, structure, and context.
At the same time, these standards and approaches primarily provide a representational baseline. They define how entities, observations, and relationships can be described consistently, but they do not by themselves yield a minimal operational ingestion layer for a concrete platform implementation. In practice, a working telemetry pipeline still requires implementation-level artifacts such as metric dictionaries, field-mapping rules, schema-based validation, and reproducible testing procedures. Accordingly, semantic and ontology-based approaches should be viewed as complementary to, rather than substitutive of, practical ingestion engineering.
2.4. Engineering Approaches to Event Contracts, Schema Validation, and API-Described Ingestion
In addition to semantic representation, digital twin platforms require formal mechanisms for exchanging and validating telemetry events. In event-driven architectures, an event contract defines which fields are required, how an event is structured, and how it should be interpreted across platform components. CloudEvents is one example of a standardized event envelope that improves the portability and interoperability of event exchange among services [
17].
Formal validation of event structure is commonly supported through JSON Schema, which enables machine-readable specification of required fields, data types, and structural constraints [
18]. This is particularly valuable in telemetry ingestion because invalid events can be rejected before they reach storage or analytical modules. OpenAPI v3.1.0 complements this by providing a standardized description of endpoints, request and response structures, and the linkage between service interfaces and underlying schemas [
19]. In context-oriented environments, NGSI-LD-related API specifications further support standardized interaction patterns [
20].
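The role of schema-based rejection can be sketched as follows. The schema uses the standard JSON Schema keywords `type`, `required`, and `properties`, but the event fields are hypothetical, and the `validate` function is a hand-rolled subset written for illustration; it is not the API of any JSON Schema library and does not cover the full specification.

```python
# Schema written with standard JSON Schema keywords; the event fields
# themselves are hypothetical.
SCHEMA = {
    "type": "object",
    "required": ["entityId", "measurements"],
    "properties": {
        "entityId": {"type": "string"},
        "measurements": {"type": "array"},
    },
}

# JSON Schema type names mapped to Python types (subset).
_TYPES = {"object": dict, "array": list, "string": str,
          "number": (int, float), "boolean": bool}


def validate(instance, schema, path="$"):
    """Check only 'type', 'required', and 'properties'; return error strings."""
    expected = schema.get("type")
    if expected:
        wrong_type = not isinstance(instance, _TYPES[expected])
        # bool is a subclass of int in Python, so exclude it for "number".
        if wrong_type or (expected == "number" and isinstance(instance, bool)):
            return [f"{path}: expected {expected}"]
    errors = []
    if isinstance(instance, dict):
        for name in schema.get("required", []):
            if name not in instance:
                errors.append(f"{path}: missing required property '{name}'")
        for name, sub in schema.get("properties", {}).items():
            if name in instance:
                errors.extend(validate(instance[name], sub, f"{path}.{name}"))
    return errors


good = validate({"entityId": "s-1", "measurements": []}, SCHEMA)
bad = validate({"entityId": 5}, SCHEMA)
print(good)
print(bad)
```

An ingestion endpoint would accept the event only when the returned list is empty, so that invalid events never reach storage.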
However, prior research and practice often address these elements separately. Some systems employ semantic models without strict event contracts, others provide documented APIs without formal validation rules, and still others use schema validation without explicit mapping from raw device fields to standardized measurements. Similarly, data quality pipelines may define quality checks without integrating them tightly with semantic entity binding and digital twin context [
10]. Such fragmentation increases integration fragility because correctness depends on assumptions distributed across multiple artifacts and implementation layers.
From the perspective of digital twin telemetry ingestion, a more practical engineering question is how these elements can be combined into a minimal and reproducible first-stage framework. Such a framework should integrate semantic representation, field-level normalization, formal event validation, and API-based verification into one coherent workflow rather than treating them as isolated technical features.
2.5. Smart City and Municipal Building Digital Twins as an Application Context
Smart city digital twins extend the digital twin concept to urban and municipal environments, where heterogeneous data must be integrated across buildings, infrastructure, facilities, and public services. Recent reviews indicate that smart city and urban digital twins are increasingly used to support monitoring, simulation, resource optimization, and decision-making in urban contexts [4,5]. At the same time, broader review work across buildings, landscape, and urban environments shows that digital twin research is increasingly spanning asset, building, and urban scales rather than remaining confined to a single level of representation [27]. Ref. [28] further notes that, despite major technical progress, many urban digital twin ambitions remain only partially realized because socio-technical integration challenges remain insufficiently addressed.
Recent framework-oriented work also highlights the central role of data integration and semantic urban models in practical urban digital twin development. For example, TwinCity, proposed in [29], treats validated semantic 3D city models as a foundational element of urban digital twins and explicitly identifies data standards, data acquisition, and lifecycle-supporting software as continuing barriers to implementation. This is directly relevant to the present study because it shows that urban digital twin value depends not only on high-level modeling ambitions but also on operational mechanisms for handling heterogeneous data in implementable workflows.
Municipal buildings represent a particularly relevant class of urban assets for this purpose. Schools, hospitals, administrative buildings, and other public facilities commonly rely on environmental and operational monitoring for temperature, humidity, illumination, energy use, and equipment status. These assets are also managed through hierarchical administrative and spatial structures, such as community, facility, building, floor, and room. This makes them suitable examples for demonstrating how telemetry can be linked to digital twin entities in a structured and reusable way [
4,
5].
This application context is also supported by recent building-oriented studies. Ref. [30] describes a multiscale monitoring framework for retrofitted public schools that combines building-, room-, and device-level sensing with data processing functions such as redundancy handling, fault detection, and consistency scoring. Ref. [31] likewise emphasizes that digital twins in smart building systems can improve indoor environmental quality, energy efficiency, and operational decision-making. Together, these works strengthen the choice of a municipal-building scenario as a realistic and representative smart-city asset-level context rather than merely an illustrative example.
From the viewpoint of telemetry ingestion, a municipal building scenario is especially appropriate because it combines several common challenges at once: multiple sensor types, heterogeneous field naming, the need for unit harmonization, entity binding across a building hierarchy, and the requirement to preserve reproducibility as the platform evolves. For this reason, municipal building monitoring is not merely a convenient illustration but a representative application context for evaluating a first-stage semantic ingestion framework in a smart city setting.
2.6. Research Gap and Positioning of This Study
The reviewed literature demonstrates that several relevant strands of research are already well developed. Digital twin surveys and architecture-oriented studies explain the conceptual importance of data integration and interoperability in digital twin systems [
1,
2,
3]. IoT data quality research highlights the need for early validation and the risks associated with incomplete, noisy, or inconsistent telemetry [
6,
7,
8,
9,
10,
11]. Semantic and context standards, together with ontology-based approaches, provide strong foundations for representing sensors, observations, entities, and relationships in a machine-readable way [
12,
13,
14,
15,
16,
23,
24]. Smart city studies further show that municipal and urban applications require robust integration of heterogeneous telemetry across multiple assets and domains [
4,
5].
At the same time, the literature less frequently offers a minimal engineering framework that unifies these strands at the level of first-stage telemetry ingestion. Recent work on ontological foundations and urban digital twins reinforces the importance of semantic interoperability, cross-domain integration, and deployable urban data frameworks, but it still leaves open the practical question of how to operationalize these ideas in a compact ingestion layer that standardizes raw source fields, harmonizes units, validates event structure, binds telemetry to digital twin entities, and supports reproducible verification in implementation settings [
25,
28,
29].
Existing studies often address these aspects individually, but less often as a compact, operational, and testable combination suitable for the initial ingestion layer of a digital twin platform.
This study addresses that gap by proposing a minimal Stage 1 semantic core for sensor telemetry ingestion in digital twin platforms. The proposed approach combines a machine-readable semantic model of entities and relations, dictionaries of metrics and units, a mapping table for raw-field normalization, schema-based validation of telemetry events, and an API-based ingestion service supported by reproducibility artifacts. In this way, the study is positioned not as a replacement for semantic or context standards but as a complementary implementation layer that operationalizes them for reliable first-stage telemetry ingestion in smart city and infrastructure-oriented digital twin scenarios.
3. Contributions
The aim of this paper is to describe and validate a minimal but formally verifiable foundation for telemetry ingestion in a digital twin platform, which harmonizes data at the input stage and reduces the fragility of integrations as the platform evolves.
The essence of this approach is to treat semantics as a set of rules that are actually applied during data ingestion: the semantic layer defines permissible entity types, identification rules, dictionaries of metrics and units, as well as the structure and validation of the telemetry event.
The main contributions of this paper are as follows:
A Stage 1 semantic foundation is proposed in the form of a machine-readable catalog of entity types and relationships for the water, energy, building, and agriculture domains, as well as common platform entities.
Dictionaries of metrics and units are developed to ensure indicator normalization and the unification of measurement units.
A telemetry event contract is defined based on the separation of the event passport and payload, which makes message exchange more stable and more understandable for different components.
Formal event validation is implemented using data schemas, which automatically rejects invalid messages and reduces storage “pollution.”
Transformation rules are formalized for converting raw sensor fields into standardized measurements (name/value/unit) and for linking telemetry to a digital twin entity.
Reproducibility of validation is ensured through control event examples, a testing protocol, and integrity control of key artifacts.
The rest of the paper is organized as follows: Section 4 presents the materials and methods of Stage 1; Section 5 reports the results, including the smart city case and the minimal evaluation; Section 6 discusses the practical effect and limitations; and Section 7 provides the conclusions and directions for future work.
4. Materials and Methods
This study considers the first stage (Stage 1) of digital twin platform development, with a specific focus on sensor telemetry ingestion and harmonization at the input stage; the implementation used in this study corresponds to DTwin Stage 1: Semantic Core for Telemetry Ingestion, version 1.0.0 [32].
Stage 1 is not intended to represent a complete digital twin state core with long-term state history, subscription and notification mechanisms, or advanced analytics. Instead, it defines the minimal operational layer required to receive telemetry, align it with a semantic structure, normalize heterogeneous inputs, and verify ingestion results in a reproducible manner.
Methodologically, Stage 1 is implemented as a coordinated workflow that links several classes of artifacts and procedures into a single ingestion path. The semantic model defines which entities and relations can participate in telemetry binding; dictionaries define how metric names and units are normalized; event contracts define which structures can be formally accepted; the mapping layer defines how raw source fields are transformed into standardized measurements; and the Ingest API executes validation, canonicalization, and controlled event handling. Evidence snapshots and the testing protocol then provide reproducible confirmation that these elements operate together under the same rules.
At the repository level, this workflow is implemented through a fixed set of core artifacts. The main files include:
the semantic (context) domain model in machine-readable form (‘contracts/domain/context_model_v1.json’);
dictionaries of metrics and measurement units (‘contracts/domain/dictionaries_v1.json’);
event schemas (‘contracts/schemas/events/envelope/v1.json’ and ‘contracts/schemas/events/telemetry/v1.json’);
harmonization rules (‘docs/03_mapping/MAPPING_TELEMETRY_v1.md’);
the telemetry ingestion service (‘services/ingest-api/app/main.py’);
the Swagger-based testing protocol (‘docs/02_testing/INGEST_API_SWAGGER_PROTOCOL_v1.md’);
evidence snapshots of accepted events (‘services/ingest-api/ingest/evidence/*.json’);
and the Stage 1 architecture description (‘docs/04_architecture/STAGE1_COMPONENTS_AND_MODELS_v1.md’).
These artifacts are summarized in Figure 1 and are used together as the operational basis of the Stage 1 ingestion workflow.
In this sense, the proposed approach should be understood as complementary to semantic, ontology-based, and context-oriented digital twin approaches rather than as a replacement for them. Standards and semantic models provide the representational foundation for describing entities, observations, and relationships, whereas Stage 1 operationalizes this foundation as an engineering layer for first-stage telemetry ingestion. The purpose of this section is therefore not only to describe the repository artifacts, but also to explain how they function together as a practical ingestion framework.
The implementation builds on a set of widely used technical mechanisms for event exchange, formal schema validation, and API-described testing. In particular, Stage 1 uses a structured envelope–payload event logic, machine-readable JSON-based contracts, API documentation aligned with OpenAPI Specification 3.1.0, and repository-based evidence artifacts for repeatable verification. This combination makes it possible to treat Stage 1 not merely as a conceptual model, but as a deployable and testable ingestion layer whose behavior can be inspected through artifacts, validation rules, and controlled execution scenarios.
4.1. Examples Demonstrating the Relationship Between Artifacts
At the implementation level, Stage 1 is realized as a coordinated set of repository artifacts and procedures that jointly support telemetry ingestion, normalization, validation, and reproducible verification. In this subsection, the relationships among the main artifact groups are illustrated through examples that show how semantic structure, event contracts, mapping rules, ingestion logic, and evidence artifacts operate together within one controlled ingestion path.
The main artifact groups fixed in the repository include the semantic domain model, dictionaries of metrics and measurement units, event schemas for the envelope and telemetry payload, mapping rules for raw-field normalization, the Ingest API implementation, the Swagger-based testing protocol, evidence snapshots of accepted events, and the Stage 1 architecture description. Each artifact group has a distinct role, but their practical value emerges from their coordinated use during ingestion: the semantic model constrains entity binding, dictionaries support normalization, schemas determine formal acceptability, mapping rules transform raw inputs, the ingestion service executes the workflow, and evidence artifacts support repeatable verification.
Figure 1 summarizes this coordinated workflow from the telemetry source to the accepted-event evidence. In this view, the repository serves not only as a storage location for technical files, but also as the environment in which the rules of Stage 1, their implementation, and the corresponding evidence are preserved together. The following subsections illustrate these relationships in more detail, beginning with the separation between stable event metadata and variable telemetry content and then showing how semantic structure and ingestion logic interact operationally.
4.1.1. Envelope + Telemetry
At Stage 1, a telemetry event is organized as two coordinated parts: a stable envelope and a variable telemetry payload. The envelope contains the service-level metadata required for identification, routing, and validation of the event, whereas the payload contains the actual measurements in a standardized form. This separation makes it possible to preserve a stable event-processing logic even when the content of measurements evolves through the addition of new indicators, labels, or source-specific extensions.
From an operational perspective, the envelope contains the fields required to determine what the event is, where it comes from, and to which digital twin entity it belongs, while the payload carries the measurement content that is subject to normalization and further processing.
Table 1 summarizes this separation between stable event metadata and variable telemetry content.
A typical telemetry event in this form is shown in Figure 2.
This example illustrates the methodological role of the envelope–payload split in Stage 1. The stable part supports reproducible processing and validation, while the variable part remains extensible and suitable for the harmonized representation of telemetry from heterogeneous sources. In this way, the event structure provides a practical basis for combining contract-based validation with semantic binding and subsequent normalization procedures.
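The envelope–payload split described above can be sketched with plain data classes. Apart from `entityType` and `entityId`, which are named in the text, the field names below are assumptions made for illustration and may differ from the actual Stage 1 schemas.

```python
from dataclasses import dataclass, field, asdict


@dataclass(frozen=True)
class Envelope:
    """Stable service-level metadata (field names partly hypothetical)."""
    event_id: str
    event_type: str
    source: str
    entity_type: str   # corresponds to entityType in the text
    entity_id: str     # corresponds to entityId in the text
    time: str


@dataclass
class Measurement:
    """One standardized measurement in name/value/unit form."""
    name: str
    value: float
    unit: str


@dataclass
class TelemetryEvent:
    envelope: Envelope
    # Variable part: may grow with new indicators or labels.
    measurements: list[Measurement] = field(default_factory=list)
    labels: dict[str, str] = field(default_factory=dict)


event = TelemetryEvent(
    envelope=Envelope("evt-1", "telemetry", "gateway-7",
                      "Sensor", "sensor-001", "2024-01-01T00:00:00Z"),
    measurements=[Measurement("temperature", 21.5, "C")],
)
# A new indicator extends the payload; the envelope is untouched.
event.measurements.append(Measurement("humidity", 40.0, "%"))
as_dict = asdict(event)
print(as_dict["envelope"]["entity_id"], len(as_dict["measurements"]))
```

Marking the envelope as frozen reflects the design intent that routing and identification metadata stay stable while the measurement content evolves.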
4.1.2. Operational Relationship Between Semantics and Data Ingestion
The operational relationship between semantics and data ingestion in Stage 1 is illustrated in
Figure 3 through the way semantic constraints, mapping rules, and validation logic act together during telemetry processing. The semantic model determines which entity types and relations are admissible in the platform context, while the ingestion workflow uses these constraints to support unambiguous binding of telemetry to the corresponding digital twin entity. In this sense, semantics is not applied only for domain description, but also for controlling how incoming data are interpreted at the ingestion stage.
The mapping table complements this role by defining how heterogeneous raw source fields are transformed into standardized measurements in the form ‘[name, value, unit]’. As shown in Figure 3, raw fields are aligned with dictionaries of metrics and measurement units before becoming part of a harmonized telemetry representation. As a result, semantic structure and field-level normalization are connected operationally: the semantic model constrains what can be linked, while the mapping layer determines how source-specific telemetry is converted into a platform-compatible form.
Accordingly, Stage 1 uses semantics as an executable ingestion aid rather than as a purely descriptive layer. The practical effect of this design is that telemetry can be interpreted, normalized, and validated under one coordinated rule set, which reduces ambiguity at the input stage and prepares the event for subsequent formal validation and controlled ingestion.
4.2. Semantic Model of Entities and Relationships
At the Stage 1 level, the semantic model defines the minimal set of entity types and relationship types sufficient for linking telemetry to a specific digital twin entity in the domains of water, energy, buildings, agriculture, and the common domain.
Stage 1 artifact: contracts/domain/context_model_v1.json. The model includes:
A list of entity types with a short description and domain affiliation;
A list of relationship types with their semantic meaning;
Constraints for the consistent interpretation of entities across platform components.
The current version of the model contains 37 entity types and 9 relationship types. The semantic model is used as an operational component during event ingestion: it defines the permitted values of entityType, for example, Sensor, and supports unambiguous binding of entityId. The general model graph is shown in Figure 4, and the minimal subgraph for ingesting a telemetry-measured event is shown in Figure 5.
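The operational use of the model during ingestion can be illustrated with a minimal sketch. The dictionary layout below (an "entityTypes" list with "name" and "domain" keys), the domain assignments, and the helper names are assumptions made for illustration; the actual structure of contracts/domain/context_model_v1.json may differ.

```python
# Hypothetical excerpt of a context model. The real layout of
# contracts/domain/context_model_v1.json is not reproduced here.
CONTEXT_MODEL = {
    "entityTypes": [
        {"name": "Building", "domain": "buildings"},
        {"name": "Room", "domain": "buildings"},
        {"name": "Device", "domain": "common"},
        {"name": "Sensor", "domain": "common"},
    ]
}


def allowed_entity_types(model):
    """Collect the entity type names that the model admits."""
    return {entry["name"] for entry in model["entityTypes"]}


def check_entity_binding(entity_type, entity_id, model):
    """Return a list of binding errors; an empty list means the binding is admissible."""
    errors = []
    if entity_type not in allowed_entity_types(model):
        errors.append(f"unknown entityType: {entity_type!r}")
    if not isinstance(entity_id, str) or not entity_id.strip():
        errors.append("entityId must be a non-empty string")
    return errors


print(check_entity_binding("Sensor", "sensor-room-1", CONTEXT_MODEL))
print(check_entity_binding("Toaster", "", CONTEXT_MODEL))
```

In this sketch, an event whose entityType is absent from the model is reported as an error before any measurement is processed, which mirrors the role of the model as an ingestion constraint rather than a purely descriptive artifact.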
4.3. Dictionaries of Metrics and Measurement Units
To unify telemetry from different sources, Stage 1 uses dictionaries of standardized metric names and measurement units. This makes it possible to interpret indicators consistently regardless of their source and reduces the amount of manual configuration required when connecting new sensors.
Stage 1 artifact: contracts/domain/dictionaries_v1.json. The dictionaries are used for:
Validation and normalization of metric names (name) in measurements;
Validation and normalization of measurement units (unit), including harmonization of notations;
Consistent use of additional labels (tags), where applicable.
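A minimal sketch of dictionary-based normalization is given below. The alias tables are hypothetical and only reuse the example names and units mentioned in the Introduction (temp, temperatureC, t_air, °C, K, %, lx); the real content and layout of contracts/domain/dictionaries_v1.json are not reproduced here.

```python
# Hypothetical alias tables; keys are lower-cased source notations.
METRIC_ALIASES = {
    "temp": "temperature",
    "temperaturec": "temperature",
    "t_air": "temperature",
    "humidity": "humidity",
    "illuminance": "illuminance",
}
UNIT_ALIASES = {
    "°c": "C",
    "degc": "C",
    "c": "C",
    "k": "K",
    "%": "%",
    "lx": "lx",
    "lux": "lx",
}


def normalize_name(raw_name):
    """Return the standardized metric name, or None if the alias is unknown."""
    return METRIC_ALIASES.get(raw_name.strip().lower())


def normalize_unit(raw_unit):
    """Return the harmonized unit notation, or None if the notation is unknown."""
    return UNIT_ALIASES.get(raw_unit.strip().lower())


print(normalize_name("temperatureC"), normalize_unit("°C"))
print(normalize_name("pressure"), normalize_unit("furlong"))
```

Returning None for unknown notations, instead of passing them through, keeps the decision about rejection or extension of the dictionaries explicit.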
4.4. Telemetry Event Contracts and Formal Validation
In Stage 1, telemetry is transmitted in the form of an event with separation into a stable envelope and a payload. The envelope contains fields for identification and routing, namely eventId, time, type, source, entityId, and schemaRef, while the payload contains measurements in the standardized form measurements [{name, value, unit}] and additional tags.
Stage 1 artifacts:
During event ingestion, the envelope is validated first, and then the payload is validated. Invalid messages are rejected with an error and are not written to the runtime storage. The separation of the envelope and the payload makes it possible to extend the set of metrics or tags without changing the basic logic of identification and routing.
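The envelope-first validation order can be sketched as follows. Plain Python checks stand in for the schema validation used by the service, and the nesting of the payload under a "payload" key, the requirement of a non-empty measurement list, and all function names are assumptions made for illustration.

```python
REQUIRED_ENVELOPE = ("eventId", "time", "type", "source", "entityId", "schemaRef")
REQUIRED_MEASUREMENT = ("name", "value", "unit")


def validate_envelope(event):
    """Check that every envelope field named in the text is present."""
    return [f"missing envelope field: {f}" for f in REQUIRED_ENVELOPE if f not in event]


def validate_payload(payload):
    """Check that measurements is a non-empty list of {name, value, unit} objects."""
    measurements = payload.get("measurements") if isinstance(payload, dict) else None
    if not isinstance(measurements, list) or not measurements:
        return ["payload.measurements must be a non-empty list"]
    errors = []
    for i, item in enumerate(measurements):
        if not isinstance(item, dict):
            errors.append(f"measurements[{i}] must be an object")
            continue
        for key in REQUIRED_MEASUREMENT:
            if key not in item:
                errors.append(f"measurements[{i}] missing {key}")
    return errors


def validate_event(event):
    """Envelope first; the payload is inspected only if the envelope is valid."""
    errors = validate_envelope(event)
    if errors:
        return False, errors
    errors = validate_payload(event.get("payload", {}))
    return (not errors), errors


# Hypothetical example event; all values are illustrative.
event = {
    "eventId": "evt-0001", "time": "2025-01-01T12:00:00Z", "type": "telemetry",
    "source": "gateway-01", "entityId": "sensor-room-1", "schemaRef": "telemetry_v1",
    "payload": {"measurements": [{"name": "temperature", "value": 21.5, "unit": "C"}]},
}
print(validate_event(event))
```

Because the payload check runs only after the envelope check succeeds, new metrics or tags can be added to the payload rules without touching the identification and routing logic.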
4.5. Mapping Table for Normalizing Raw Sensor Fields
Telemetry sources often use different indicator names and different unit notations. Therefore, Stage 1 applies a mapping table that formalizes the transformation of raw source fields into standardized measurements [{name, value, unit}] and defines the rules for linking telemetry to a digital twin entity. These steps are shown in Figure 6. This approach is consistent with semantic and context standards for describing sensor observations and digital twins [10, 11, 12, 15, 24].
The mapping table defines three groups of rules:
Normalization of indicator names: source input field → standardized metric name (name);
Normalization of measurement units: harmonization of unit notations and, where necessary, conversion;
Entity identification: generation and validation of entityId and entityType to which the telemetry belongs.
For transparency and reuse of mapping rules, it is advisable to document them in the form of a table structured as “field → metric → unit → transformation.” An example of such a representation is given in Table 2.
For stable telemetry binding regardless of the specific source, a deterministic ‘entityId’ format is used. This reduces the risk of duplication and ensures unambiguous matching of events to a specific sensor or object. The ‘entityId’ formula and examples are shown in Figure 7.
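The idea of a deterministic identifier can be illustrated with a short sketch. The format used below (lower-cased components joined by colons) is purely hypothetical and is not the formula of Figure 7; it only shows how differently spelled source labels can be reduced to one stable identifier.

```python
import re


def make_entity_id(entity_type, *path_parts):
    """Build an illustrative ID such as 'sensor:building-1:room-1:temp-01'.

    The format is an assumption for demonstration, not the Stage 1 formula.
    """
    def slug(part):
        # Lower-case and collapse any run of non-alphanumeric characters to '-'.
        return re.sub(r"[^a-z0-9]+", "-", part.strip().lower()).strip("-")

    parts = [slug(entity_type)] + [slug(p) for p in path_parts]
    if not all(parts):
        raise ValueError("every entityId component must be non-empty")
    return ":".join(parts)


print(make_entity_id("Sensor", "Building 1", "Room-1", "temp_01"))
print(make_entity_id("sensor", "building-1", "room 1", "TEMP-01"))
```

Both calls produce the same identifier, so events that describe the same sensor with different label spellings are bound to one entity instead of creating duplicates.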
The procedure for applying the mapping during event ingestion includes the following steps:
Extraction of raw fields from the source message;
Matching of fields to metrics and validation of name against the dictionaries;
Normalization of unit and, where necessary, transformation of the value;
Construction of measurements and validation of entityId and entityType against the semantic model.
The result is an event with harmonized measurements, which then proceeds to formal validation and ingestion through the Ingest API.
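The first three steps of this procedure can be sketched as follows; entity validation is omitted for brevity. The rule table, the raw field names, and the Kelvin-to-Celsius conversion are hypothetical examples and do not reproduce the rules recorded in the Stage 1 mapping table.

```python
# Hypothetical rules: raw field -> (metric name, target unit, transformation).
MAPPING_RULES = {
    "temp": ("temperature", "C", lambda v: float(v)),
    "temp_k": ("temperature", "C", lambda v: float(v) - 273.15),
    "light": ("illuminance", "lx", lambda v: float(v)),
}
KNOWN_METRICS = {"temperature", "humidity", "illuminance"}


def apply_mapping(raw_message):
    """Return (measurements, unmapped_fields) for one raw source message."""
    measurements, unmapped = [], []
    for field, value in raw_message.items():
        rule = MAPPING_RULES.get(field)
        # A field without an explicit rule, or with an unknown metric, is not mapped.
        if rule is None or rule[0] not in KNOWN_METRICS:
            unmapped.append(field)
            continue
        name, unit, transform = rule
        measurements.append({"name": name, "value": round(transform(value), 2), "unit": unit})
    return measurements, unmapped


measurements, unmapped = apply_mapping({"temp_k": 294.65, "light": "300", "hum": 40})
print(measurements)
print(unmapped)
```

Fields without an explicit rule, such as "hum" in this example, are reported separately rather than guessed, so gaps in rule coverage remain visible.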
4.6. Telemetry Ingestion Service (Ingest API) and Event Processing Pipeline
The telemetry ingestion service is implemented as a web service that, at Stage 1, provides formal acceptance or rejection of events in JSON format. The service has two basic endpoints: /health, used for availability checks, and /ingest, which accepts telemetry events.
The overall event ingestion pipeline, from the source to runtime storage and the service response, is shown in Figure 8.
The event processing procedure at the ‘/ingest’ endpoint is performed sequentially:
Receive the telemetry event through ‘/ingest’.
Validate the event envelope against the envelope schema.
Validate the payload against the telemetry schema.
Build a canonical representation of the event using a unified JSON format defined by the implementation.
Compute the SHA-256 integrity hash of the canonical record.
Store the accepted event in runtime storage.
Return the ingestion result as accepted or rejected together with the integrity hash.
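The sequence above can be sketched in a few lines. The canonical form used here (sorted keys, compact separators, UTF-8) is one possible choice and is an assumption, since the canonical format is defined by the implementation; the stand-in validators, the in-memory storage, and the response fields are likewise hypothetical.

```python
import hashlib
import json


def canonicalize(event):
    """One possible canonical form: sorted keys, compact separators."""
    return json.dumps(event, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def ingest(event, validators, storage):
    """Run validators in order; hash and store the event only if all pass."""
    for validate in validators:  # e.g. envelope check, then payload check
        errors = validate(event)
        if errors:
            return {"status": "rejected", "errors": errors}  # nothing is stored
    record = canonicalize(event)
    digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
    storage[digest] = record  # in-memory stand-in for runtime storage
    return {"status": "accepted", "sha256": digest}


# Hypothetical stand-in validators; real ones would apply the event schemas.
def check_envelope(event):
    return [] if "eventId" in event else ["missing eventId"]


def check_payload(event):
    return [] if event.get("payload", {}).get("measurements") else ["no measurements"]


store = {}
checks = [check_envelope, check_payload]
good = {"eventId": "e1", "payload": {"measurements": [{"name": "temperature", "value": 21.5, "unit": "C"}]}}
accepted = ingest(good, checks, store)
rejected = ingest({"payload": {}}, checks, store)
print(accepted["status"], rejected["status"], len(store))
```

Because the hash is computed over the canonical record rather than over the raw request body, two requests that differ only in key order yield the same digest.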
Typical service responses for two scenarios, successful ingestion and rejection due to a validation error, are shown in Figure 9.
4.7. Evidence Snapshots and Integrity Control for Reproducibility by Design
To ensure reproducibility, Stage 1 deliberately separates operational runtime data from repository-based evidence snapshots. This makes it possible to run the service in any environment while verifying its operability against a fixed reference set of artifacts.
Runtime data: accepted events accumulated during service operation in the directory services/ingest-api/ingest/inbox/; these data are not committed.
Evidence snapshots: a reference set of events and artifacts stored in the directory services/ingest-api/ingest/evidence/; these artifacts are committed as evidence and as a baseline for repeated validation.
SHA-256 integrity hashes are used as a seal of integrity: if the contracts, dictionaries, or reference examples are changed, the computed hash also changes. Combined with the testing protocol, this provides a simple and reproducible way to confirm that Stage 1 operates under the same rules and on the same reference examples.
Stage 1 artifacts that ensure reproducibility:
services/ingest-api/ingest/inbox/—runtime storage, ignored by the version control system;
services/ingest-api/ingest/evidence/—reference examples and evidence, committed to the repository;
SHA-256 for the canonical event record—integrity control of the ingestion result.
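A baseline comparison of evidence artifacts can be sketched as follows. Hashing whole files is an assumption made for illustration (Stage 1 computes SHA-256 over the canonical event record), and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot_manifest(evidence_dir):
    """Map each relative file path under evidence_dir to its SHA-256 digest."""
    root = Path(evidence_dir)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(baseline, current):
    """List files that were added, removed, or modified relative to the baseline."""
    keys = set(baseline) | set(current)
    return sorted(k for k in keys if baseline.get(k) != current.get(k))
```

A manifest computed for the committed evidence directory can be stored once and compared with a manifest computed after a later run; an empty result from changed_files indicates that the reference artifacts are unchanged.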
4.8. Testing Protocol and Acceptance Criteria for Stage 1
The operability of Stage 1 is evaluated through a standardized testing protocol executed via the API documentation in Swagger UI and recorded in the file ‘docs/02_testing/INGEST_API_SWAGGER_PROTOCOL_v1.md’. The protocol is designed to verify not only whether the ingestion service is reachable, but also whether the main Stage 1 functions—event acceptance, rejection of invalid inputs, portability of execution, and separation of runtime and evidence artifacts—operate in a controlled and reproducible manner.
For clarity, the protocol is organized as a set of test scenarios, abbreviated in this section as ‘TS’ (Test Scenario). A summary of the Stage 1 test scenarios is presented in Table 3.
The detailed execution steps are described in ‘docs/02_testing/INGEST_API_SWAGGER_PROTOCOL_v1.md’. Together, the scenarios TS-01 to TS-05 verify service availability, successful ingestion of valid telemetry, rejection of invalid events, portability of execution, and the correct separation of runtime storage from repository-based evidence artifacts. In this way, the protocol supports repeatable confirmation that Stage 1 operates under the same validation and evidence-handling rules across repeated executions.
To ensure consistent acceptance of Stage 1 results, the protocol is complemented by an acceptance checklist summarized in Figure 10.
The success criterion for Stage 1 is the successful completion of the defined test scenarios together with the availability of repository-based evidence snapshots and integrity hashes, which provide reproducible confirmation of the ingestion outcome under the current Stage 1 rules.
5. Results
This section summarizes the Stage 1 results across three interconnected dimensions: the development of the semantic model, the normalization of telemetry through the mapping table, and the implementation of event ingestion through the Ingest API with reproducibility verification based on the testing protocol.
At Stage 1, a semantic context model of the platform was developed, which defines the entity types and permissible relationship types among them for the domains of water, energy, buildings, agriculture, and the common domain. The current version of the model, recorded in the artifact contracts/domain/context_model_v1.json, contains 37 entity types and 9 relationship types. The practical significance of this result lies not only in describing the domain but also in the operational use of the model during telemetry ingestion, since the entityType values in events are aligned with the entity types defined in the model, which ensures correct binding of data to digital twin entities.
To reduce the heterogeneity of telemetry streams, a mapping table was implemented and recorded in the artifact docs/03_mapping/MAPPING_TELEMETRY_v1.md. It formalizes the transformation of raw sensor fields into standardized measurements in the format measurements [{name, value, unit}] and supports the construction of links to the corresponding entityId and entityType. Thus, the result of normalization is not merely a unified representation of individual indicators, but a harmonized telemetry event in which metric names, measurement units, and the binding to the digital twin entity are standardized.
A separate result of Stage 1 is the implementation of the telemetry ingestion service, the Ingest API, in services/ingest-api/app/main.py. The service provides the endpoints /health and /ingest and performs sequential formal processing of events, including validation of the envelope and payload against the envelope and telemetry schemas, creation of a canonical event record, computation of the SHA-256 integrity hash, storage of the accepted event in runtime storage, and return of the ingestion result. At the same time, a clear rejection policy is implemented: invalid events are rejected with HTTP 400 or with the status ‘rejected’ and are not written to runtime storage. In addition, execution portability is ensured through the use of the environment variables DTWIN_REPO_ROOT and DTWIN_INGEST_DIR, which reduces the dependence of the service on the current working directory.
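The portability mechanism can be illustrated with a short sketch of path resolution. Only the variable names DTWIN_REPO_ROOT and DTWIN_INGEST_DIR and the directory services/ingest-api/ingest/ come from the text; the precedence between the two variables and the fallback to the current working directory are assumptions.

```python
import os
from pathlib import Path


def resolve_ingest_dir(env=None):
    """Resolve the ingest directory without relying on the working directory.

    Assumed precedence: DTWIN_INGEST_DIR, then DTWIN_REPO_ROOT, then cwd.
    """
    env = os.environ if env is None else env
    explicit = env.get("DTWIN_INGEST_DIR")
    if explicit:
        return Path(explicit)
    repo_root = Path(env.get("DTWIN_REPO_ROOT") or Path.cwd())
    return repo_root / "services" / "ingest-api" / "ingest"


print(resolve_ingest_dir({"DTWIN_REPO_ROOT": "/repo"}))
print(resolve_ingest_dir({"DTWIN_INGEST_DIR": "/data/ingest"}))
```

Passing the environment as a parameter keeps the function testable without modifying the process environment.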
The operability and reproducibility of Stage 1 were confirmed through a testing protocol executed via Swagger UI and recorded in docs/02_testing/INGEST_API_SWAGGER_PROTOCOL_v1.md. The protocol covers service availability checks, positive and negative event ingestion scenarios, execution portability, and rules for handling evidence artifacts. Another important result is the creation of a set of evidence artifacts in the directory services/ingest-api/ingest/evidence/, where reference event examples and their associated validation artifacts are stored. Taken together, this makes it possible to consider the successful completion of the test scenarios TS-01 to TS-05, along with the availability of evidence snapshots and SHA-256 integrity hashes, as the acceptance criterion for the successful completion of Stage 1.
5.1. Smart City Case: A Municipal Building as a Digital Twin
A municipal building was selected as the case-study scenario because it represents a typical city-managed public asset in which heterogeneous telemetry is directly relevant to monitoring, maintenance, and operational decision-making. In this study, the scenario is represented by a school or hospital equipped with temperature, humidity, and light sensors connected through an MQTT- or HTTP-based gateway. Within Stage 1, the semantic model defines the hierarchy of containers as Community–Facility–Building–Floor–Room, as well as the technical context Asset–Device–Sensor, while telemetry is linked to a specific sensor through the attributes ‘entityType’ and ‘entityId’. Such a configuration is consistent with common approaches to digital twins in smart cities and urban infrastructure [16, 17] and provides a realistic context for evaluating first-stage telemetry harmonization, entity binding, and schema-level event validation in a public-infrastructure-oriented digital twin setting.
In practical implementation terms, the Stage 1 workflow can be deployed using the currently available laboratory infrastructure of the project team rather than specialized proprietary equipment. The available environment includes general-purpose laptops and desktop workstations, embedded and edge-development platforms such as Raspberry Pi, Banana Pi, Jetson Nano, and Arduino, as well as server resources for storage, replay-based testing, and controlled execution of the ingestion service. Within this setup, telemetry can be emulated or received from standard sensor and gateway interfaces, while the software side relies on the Stage 1 ingestion service, JSON-based event contracts, schema validation, and repository-based evidence artifacts. In this way, the municipal-building case should be understood as a realistic deployment-oriented scenario supported by currently available laboratory infrastructure, while additional project equipment is planned for subsequent extension rather than assumed as already operational.
In this scenario, Stage 1 is applied as a coordinated ingestion workflow. The context model determines how the sensor is situated within the municipal-building hierarchy and how the observed telemetry is associated with the corresponding asset and room context. The mapping table then transforms raw source fields, such as ‘t’, ‘hum’, and ‘light’, into standardized measurements in the form ‘measurements[{name, value, unit}]’, provided that explicit mapping rules are available. After that, the Ingest API validates the event envelope and payload against the corresponding schemas, applies canonicalization to the accepted event record, and produces evidence artifacts for reproducible verification. In this way, the case demonstrates how Stage 1 connects semantic structure, field-level normalization, and formal event handling within one municipal smart-building telemetry path.
At the case-study level, the main result is that Stage 1 provides a structured and reproducible ingestion path for municipal-building telemetry rather than a collection of source-specific integration rules. Within the limits of the present scenario, the framework supports consistent entity binding, schema-level acceptance or rejection of events, and standardized representation of mapped measurements. Thus, the case serves as a focused demonstration of first-stage telemetry harmonization in a smart-city-oriented public asset context, while broader-scale deployment and long-term operational validation remain outside the scope of Stage 1.
5.2. Minimal Evaluation and Baseline Comparison
This subsection reports a minimal proof-of-concept evaluation of Stage 1 in the municipal-building scenario introduced above. The purpose of the evaluation is not to provide a large-scale performance benchmark, but to characterize the behavior of the proposed ingestion layer with respect to five practical aspects that are critical at the first stage of digital twin telemetry integration: formal rejection of invalid events, normalization of heterogeneous sensor fields, consistency of entity binding, repeatability of evidence artifacts, and request processing time at the ‘/ingest’ endpoint. In this way, the evaluation is intended to show whether Stage 1 provides a controlled and reproducible ingestion path for heterogeneous telemetry in a realistic smart-city-oriented asset context.
The evaluation was performed through repeated replay of a reference event set using the same Stage 1 implementation environment and the same ‘/ingest’ endpoint. In total, N = 250 requests were executed, including both valid telemetry events and synthetically corrupted variants intended to test schema-level rejection behavior. The replay-based setup, summarized in Table 4, was selected in order to examine the response of the ingestion layer under controlled and reproducible conditions, while keeping the evaluation aligned with the repository artifacts and the testing protocol used in Stage 1 validation.
Across the replay-based evaluation, the proportion of accepted events was 0.800, whereas the proportion of rejected events was 0.200. For the ‘/ingest’ endpoint, the median processing latency was 16.99 ms, and the 95th percentile latency reached 31.88 ms. Taken together, these results indicate that Stage 1 can enforce schema-level rejection of invalid events while maintaining low request-processing latency within the limited reference-event setting used in this study.
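For N = 250, proportions of 0.800 and 0.200 correspond to 200 accepted and 50 rejected requests. The sketch below shows how such summary indicators could be derived from a replay log; the log format, the synthetic latencies, and the nearest-rank percentile definition are assumptions, and the sketch does not reproduce the measured values.

```python
import math
import statistics


def summarize(results):
    """results: list of (status, latency_ms) tuples from one replay run."""
    n = len(results)
    accepted = sum(1 for status, _ in results if status == "accepted")
    latencies = sorted(latency for _, latency in results)
    p95_index = math.ceil(0.95 * n) - 1  # nearest-rank 95th percentile
    return {
        "n": n,
        "accepted_share": accepted / n,
        "rejected_share": (n - accepted) / n,
        "median_ms": statistics.median(latencies),
        "p95_ms": latencies[p95_index],
    }


# Synthetic replay log: 200 accepted and 50 rejected requests with
# placeholder latencies of 1.0, 2.0, ..., 250.0 ms.
replay_log = [("accepted" if i < 200 else "rejected", float(i + 1)) for i in range(250)]
print(summarize(replay_log))
```

Other percentile definitions, for example interpolated ones, would give slightly different 95th percentile values on the same data, so the definition should be stated together with the result.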
A separate point concerns the handling of unmapped raw fields during normalization. In the ‘raw_samples’ subset, the median mapping coverage reached 0.67, while the fields ‘t’ and ‘hum’ remained unmapped in two raw examples. In the current Stage 1 evaluation, such fields were treated as inputs that did not participate in the construction of standardized ‘measurements’ unless an explicit mapping rule was available. Accordingly, unmapped aliases did not contribute to the harmonized output and were interpreted as evidence of incomplete field-level normalization rather than as successful standardized transformations. This result indicates that Stage 1 reduces heterogeneity only within the scope of the defined mapping rules and that additional aliases require explicit extension of the mapping table.
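One way to make the notion of mapping coverage concrete is to compute, for each raw sample, the share of raw fields for which an explicit mapping rule exists. This definition, the field names, and the sample set below are assumptions made for illustration; they are not taken from the ‘raw_samples’ subset.

```python
import statistics


def mapping_coverage(raw_fields, mapped_fields):
    """Share of distinct raw fields that have an explicit mapping rule."""
    raw = set(raw_fields)
    if not raw:
        return 0.0
    return len(raw & set(mapped_fields)) / len(raw)


# Hypothetical rule set and raw samples.
RULE_FIELDS = {"temp", "light", "humidity"}
samples = [
    ["temp", "hum", "light"],   # 'hum' has no rule -> 2 of 3 fields mapped
    ["temp", "humidity"],       # fully mapped
    ["t", "hum", "light"],      # only 'light' is mapped
]
coverages = [mapping_coverage(s, RULE_FIELDS) for s in samples]
print([round(c, 2) for c in coverages], round(statistics.median(coverages), 2))
```

Under this definition, a sample with three raw fields of which two are mapped yields a coverage of about 0.67, and adding a rule for a missing alias raises the coverage of every sample that contains it.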
The interpretation of entity binding and evidence repeatability should also be kept within the limits of the current evaluation setup. In the reference-event set used in Stage 1 validation, no ‘entityId’ collisions or instability were observed; however, this result should be understood as a case-level observation within a limited replay-based setting rather than as a statistically strong proof of collision resistance in large-scale multi-source deployments. Similarly, matching SHA-256 hashes across repeated runs confirm the repeatability of the canonicalized evidence snapshots produced by the current implementation, but they do not by themselves provide exhaustive proof of semantic correctness for all possible mapping changes or business-logic variations. In this sense, the present evaluation supports implementation-level repeatability and controlled event handling, while broader robustness testing remains a task for subsequent stages.
For the baseline comparison, two ingestion modes were considered within the same reference-event setting. The first mode corresponds to a simplified raw-ingest configuration in which telemetry events are passed forward without field-level mapping and without formal schema-based rejection at the Stage 1 level. The second mode corresponds to the proposed Stage 1 configuration, which applies mapping rules, metric and unit dictionaries, envelope and payload validation, and canonicalization of accepted events. This comparison was intended to distinguish the additional control introduced by Stage 1 from a minimally constrained ingestion path operating on the same event set.
To provide a minimal quantitative contrast, both ingestion modes were evaluated under the same replay-based setting, using the same reference-event set (N = 250), the same execution environment, and the same request pattern. The raw-ingest mode was defined as a simplified configuration in which incoming telemetry events were accepted as valid JSON and written to storage without field-level mapping, metric and unit normalization, schema-based rejection, canonicalization, or evidence-snapshot generation. The Stage 1 mode corresponded to the full ingestion workflow described in Section 4.
Table 5 summarizes the minimal comparison between these two modes.
This comparison should be interpreted as a bounded implementation-level contrast rather than as a full-scale benchmark. Its purpose is to show, under the same controlled replay conditions, how the additional validation and harmonization logic of Stage 1 affects event acceptance behavior and request-processing time relative to a minimally constrained ingestion path. In this comparison, Stage 1 introduced a median latency increase of 12.64 ms and a 95th percentile increase of 5.72 ms, while providing schema-level rejection of invalid events and controlled harmonization of accepted inputs.
The present evaluation is limited to a controlled replay of reference events and is intended to verify first-stage ingestion behavior rather than to provide exhaustive throughput benchmarking or large-scale robustness assessment. In particular, the current results do not yet constitute statistically strong evidence regarding collision resistance of ‘entityId’ generation, long-run behavior under heterogeneous multi-source traffic, or advanced temporal plausibility checks such as delayed arrival, out-of-order delivery, or timestamp anomalies. These aspects remain outside the scope of Stage 1 and should be addressed in subsequent stages through broader-scale experimental validation.
6. Discussion
The results obtained in the municipal-building case study indicate that Stage 1 can function as a coherent first-stage telemetry ingestion layer rather than as a collection of isolated technical artifacts. In the present scenario, the framework combined semantic structure, field-level normalization, schema-based event handling, and evidence-oriented verification within one controlled ingestion path. Accordingly, the discussion below interprets these results in relation to the main Stage 1 components introduced in Section 4 and the case-study and evaluation outcomes reported in Section 5, rather than treating the proposed framework only as an abstract architectural construct.
First, the case-study results support the practical role of the semantic model as an operational component of telemetry ingestion rather than only as a descriptive artifact. In the municipal-building scenario, the hierarchy Community–Facility–Building–Floor–Room and the technical chain Asset–Device–Sensor provided a structured basis for linking incoming events to the corresponding digital twin entities through ‘entityType’ and ‘entityId’. This indicates that even at Stage 1, semantic structure can improve the consistency of telemetry binding in a public-asset setting. At the same time, the current evaluation supports this conclusion only at the level of the limited reference-event case and should not yet be interpreted as a large-scale validation of cross-domain entity-binding robustness.
Second, the results clarify the practical role and the limits of the mapping layer in reducing telemetry heterogeneity. In the current case, the mapping table enabled transformation of raw source fields into standardized measurements only where explicit normalization rules were available, which confirms its value as an operational mechanism for first-stage harmonization. At the same time, the presence of unmapped aliases in the evaluation shows that heterogeneity is not eliminated automatically once semantic structure is defined; rather, it is reduced incrementally through the extension and maintenance of mapping rules. In particular, the unmapped aliases ‘t’ and ‘hum’ observed in the replay-based evaluation illustrate that first-stage harmonization remains dependent on explicit rule coverage. Therefore, the contribution of Stage 1 in this respect lies not in claiming full semantic closure over all possible source formats, but in providing a controlled and transparent procedure for standardizing heterogeneous inputs within a defined rule set.
Third, the replay-based evaluation highlights the practical value of schema-level validation as an input-stage control mechanism. The observed rejection of invalid events, together with the low request-processing latency reported for the ‘/ingest’ endpoint in the controlled replay setting, indicates that Stage 1 can introduce formal acceptance criteria without making the ingestion path operationally impractical at this scale. In this respect, the results support the view that first-stage validation is useful not only for structural correctness but also for preventing invalid telemetry from entering downstream storage and processing chains. At the same time, these findings should be interpreted within the limits of the present proof-of-concept evaluation rather than as a broad performance claim under large-scale or highly heterogeneous deployment conditions.
Fourth, the evidence artifacts and the SHA-256 hashes of canonicalized records should be interpreted as mechanisms of implementation-level repeatability rather than as exhaustive proof of semantic correctness. In the present evaluation, matching hashes across repeated runs indicate that accepted events can be reproduced in a stable canonical form under the same controlled conditions, which is useful for auditability, change tracking, and repeatable verification of ingestion outputs. At the same time, this result does not imply that all possible future modifications of mapping logic, semantic rules, or business-level interpretation would preserve equivalent meaning. Therefore, the contribution of Stage 1 in this respect lies in providing a transparent and verifiable evidence layer for accepted events, while deeper semantic validation remains beyond the current scope.
From a practical perspective, these results suggest that Stage 1 can serve as a useful first ingestion layer for smart-city and public-asset scenarios in which telemetry originates from heterogeneous sensors and must be linked to structured municipal contexts. In settings such as schools, hospitals, and administrative buildings, the main value of this approach lies not only in accepting telemetry but in making the initial ingestion process more transparent, testable, and less dependent on ad hoc source-specific integration rules. In this sense, Stage 1 is relevant as an operational foundation for municipal-building monitoring, where consistent entity binding, controlled normalization, and input-stage validation are required before broader digital twin functions can be introduced. At the same time, the present results support this implication at the level of a first-stage smart-building case rather than as a complete smart-city deployment claim.
These results should also be interpreted together with the explicit limitations of Stage 1. The current framework is limited to first-stage telemetry ingestion and does not yet include a full digital twin state core, long-term state history, subscription and notification mechanisms, advanced temporal plausibility checks, or large-scale validation under sustained multi-source traffic. Likewise, the present study does not claim exhaustive benchmarking against alternative ingestion implementations, statistically strong evidence of ‘entityId’ collision resistance, or complete coverage of all possible raw telemetry aliases. Therefore, Stage 1 should be understood as a minimal but operational ingestion foundation whose value lies in controlled structure, validation, and reproducibility at the input stage, rather than as a complete end-to-end digital twin platform.
Taken together, the present findings indicate that the next development step should focus on extending Stage 1 from a controlled first-ingestion framework toward broader operational validation and richer digital twin functionality. This includes expansion of mapping coverage, introduction of more advanced temporal and semantic plausibility checks, support for additional event types, and stronger comparative evaluation against alternative ingestion configurations. It also includes integration with state-oriented platform components, such as historical state management, subscription mechanisms, audit functions, and compatibility testing across evolving contracts and mapping rules. In this way, the current results do not close the problem of digital twin telemetry integration, but establish a reproducible first-stage foundation on which subsequent platform stages can be built and evaluated more rigorously.
7. Conclusions and Future Work
This paper proposed and implemented a semantic core for sensor telemetry ingestion as a minimal, formally verifiable, and reproducible foundation for a digital twin platform. At the first stage, a machine-readable semantic model of entities and relationships was developed for the domains of water, energy, buildings, agriculture, and the common domain, together with dictionaries of metrics and measurement units for data unification. A two-layer event contract based on the separation of envelope and payload was proposed, and formal input validation of events against data schemas was implemented, making it possible to automatically reject invalid messages before they enter storage. For the practical harmonization of heterogeneous sources, a mapping table was developed to normalize indicator names and units and to provide deterministic entity identification through entityId and entityType. An Ingest API service with the /health and /ingest endpoints was implemented to perform validation, canonicalization of records, and computation of SHA-256 hashes for integrity control of accepted events. The operability of the approach was confirmed through a standardized testing protocol executed via Swagger UI and through repository-based evidence snapshots, which ensure repeatability of validation.
In the replay-based proof-of-concept evaluation (N = 250), Stage 1 accepted 0.800 of events and rejected 0.200, while the ‘/ingest’ endpoint showed a median latency of 16.99 ms and a 95th percentile latency of 31.88 ms; relative to the raw-ingest baseline, this corresponded to an additional median overhead of 12.64 ms and a 95th percentile overhead of 5.72 ms.
Future work includes extending the set of event types, in particular, state events, commands, and notifications. The next step is the introduction of semantic validation, covering checks for valid ranges, consistency of measurement units, correctness of temporal attributes, and anomaly detection. Integration with historical storage and the digital twin state core is also planned, together with the implementation of subscription mechanisms, audit functions, and access control.
A separate direction for further development is related to the formalization of version management processes for contracts and mapping rules, as well as the introduction of automated compatibility tests and validation procedures within a continuous integration environment.