1. Introduction
Transportation infrastructure in urban areas (intersections, traffic signals, sensor loops, and camera surveillance) plays an essential role in the functioning of smart cities by providing improved mobility and safety and less environmental impact. The mounting strain on these systems (ageing hardware, increasing traffic, environmental conditions, and variability in operational conditions) has increased the likelihood of unanticipated failures. A single failed traffic signal, sensor, or unreported outage can result in increased congestion, energy consumption, and CO2 emissions and reduced service levels for users.
Historically, infrastructure performance monitoring involved either reactive “break-fix” approaches, or scheduled, preventative inspections. These methods are inadequate for the needs of smart mobility: cities require uninterrupted service, resilience to disruptions, and a low environmental impact. As articulated in one review, “the convergence of smart city initiatives with predictive maintenance … will allow for real-time, anticipatory monitoring of infrastructure assets.” [
1].
Simultaneously, the field of artificial intelligence (AI) and deep learning (DL) has accelerated the advancement of intelligent transportation systems (ITSs) through improved traffic prediction, adaptive signal control, and fleet optimisation [
2]. However, a substantial amount of prior research has focused on traffic flow or vehicle behaviour and has not considered the future operation of transportation infrastructure (such as sensors, controllers, and embedded devices). These have not only failed to account for traffic flows, but also reliability and preservation of infrastructure function [
3]. In addition, the rapid development of edge computing technologies enables analytical and decision-making tasks to better analyse the data. This facilitates latency reduction, bandwidth savings, and near-real-time intervention. For example, in the context of infrastructure management, edge AI systems have been shown to “enable real-time diagnostics, predictive maintenance … traffic optimisation in smart city infrastructures” [
4]. Other evidence in the urban traffic context has demonstrated better responsiveness and robustness with edge-based solutions [
5]. In addition to performance and resilience considerations, smart mobility demands sustainability and trust: traffic management now has to consider environmental objectives (e.g., reducing CO
2 emissions, energy efficiency) and ensure the acceptability of automated decision-making (e.g., through model transparency and interpretability). A recent review indicates that “the implementation of emerging, cutting-edge AI-driven innovations … will fundamentally reshape road transportation systems for smart cities, e.g., real-time traffic management and environmental impact” [
6].
Finally, predictive maintenance and anomaly detection are increasingly regarded as important levers to support sustainable mobility: “AI-enabled predictive maintenance uses Internet of Things (IoT) sensors, AI and big data, to monitor infrastructure, predict failures and enable optimized maintenance” [
7]. Solutions that use camera vision, time-series senor, or meta-sensor data in conjunction with AI methods quote detection accuracies greater than 90% in real-time urban infrastructure anomaly detection [
8]. Nevertheless, there remain several key challenges unresolved: sensor interoperability in heterogenous urban networks, latency in real-world deployment in dense urban environments, the increased energy consumption of DL models at the edge and the continuing challenge of model transparency, and many AI models remaining as “black boxes” in safety-critical contexts. In particular, the joint use of explainable-AI (XAI) techniques and edge constraints (power, latency, bandwidth) remains largely unexplored [
9]. Finally, while many studies target AI-based traffic flow prediction and anomaly detection, such as the example study producing 97.5% accuracy in a multi-agent simulation, they often tend to ignore hardware constrained or real infrastructure relevant to edge deployment [
10].
In this context, this paper introduces AIP-Urban, a new edge AI framework for predictive maintenance and anomaly detection of urban traffic infrastructure. The system deploys a hybrid CNN–Transformer model for visual and temporal anomaly detection (e.g., signal failure, sensor outage, abnormal congestion), and an LSTM predictor to proactively predict equipment failure within a 24 h horizon. The architecture is deployed on edge nodes (Jetson Nano, ESP32-Cam) and attempts to minimise latency and dependence on cloud (non-Internet) connectivity.
The primary contributions of this work are as follows:
A complete edge AI architecture aimed at the health management of urban traffic infrastructure, that combines anomaly detection and predictive maintenance.
The development and evaluation of a CNN–Transformer + LSTM model used for joint multimodal urban sensor data, i.e., traffic video, sensor time-series, and signal states.
An experimental implementation on actual edge hardware, with measured latency, energy consumption, and detection/prediction performance.
A systematic results analysis that includes ablation studies, statistical significance tests, and explainability (XAI) to establish system trustworthiness and operational readiness in urban environments.
AIP-Urban has introduced several new technologies that fill the major unmet needs within the edge AI and federated ITS frameworks discussed in
Section 2. First, previous systems tended to look only at traffic optimisation or anomaly detection separately from sensor monitoring or predictive maintenance. In contrast, AIP-Urban combines all three functions into one seamless hybrid CNN–Transformer-LSTM architecture implemented solely at the edge of the network with no possibility of delay caused by using the cloud, as well as no privacy concerns regarding data sharing over a public internet connection. Second, unlike federated ITS projects, which typically require large models and rely on a central server for scheduling and storage, AIP-Urban provides a comprehensive solution that uses adaptive scheduling based on both current conditions and user behaviour; efficient INT8 quantization of deep learning frameworks; and structured pruning of neural networks that collectively enable sub-80 ms latency and 7.8 watts on the Jetson Nano microprocessor, results that have never been achieved in the related literature. Third, the native explanation capability built into AIP-Urban allows users to generate SHAP (SHapley Additive exPlanations) values along with traditional attention maps from the neural network at the edge and addresses the unmet need for explainability identified in earlier publications. Finally, AIP-Urban is the first ITS framework to show cross-modal (city to city) generalisation across all datasets used in this research (CityCam, UA-DETRAC, PEMS-BAY, SUMO) using one integrated approach, thereby addressing some limitations experienced by earlier ITS frameworks that restricted testing to single datasets.
The remainder of this paper is organized as follows.
Section 2 surveys the related work relevant to our study.
Section 3 introduces the proposed system architecture and outlines the data collection process.
Section 4 presents the experimental results and performance evaluation.
Section 5 provides a detailed discussion and comparative analysis of the obtained findings. Finally,
Section 6 concludes the paper and outlines prospective directions for future research.
3. System Architecture and Data Collection
The AIP-Urban framework was developed as a multi-layer edge AI environment that incorporates sensing, intelligence, and sustainability into urban traffic infrastructures. The proposed architecture simultaneously connects three important areas: (1) real-time anomaly detection, (2) predictive maintenance of hardware assets, and (3) energy-aware distributed computation. In contrast to existing intelligent transportation system (ITS) designs that heavily rely on centralised cloud-based analytic and actions, AIP utilises an edge approach that supports the required ultra-low-latency (<100 ms) response time and privacy-preserving learning across heterogeneous devices.
3.1. Overall Architecture
The system architecture consists of four collaborative layers, as illustrated in
Figure 1: IoT sensing, edge intelligence, fog/cloud coordination, and explanation and decision-support.
At the foundation, the IoT sensing layer is made up of a distributed and tightly clustered network of multimodal sensors monitoring operational and environmental conditions in traffic systems. Such systems included signal controllers, roadside units, cameras, acoustic sensors, and micro-vibration units. Each node captures a continuous data stream of vehicle counts, flow density, queue length, temperature, humidity, light intensity, or vibration amplitude with a sampling frequency of typically 1 Hz. The network leverages various communication technologies such as LoRaWAN, MQTT running over 5G, or conventional wi-fi 6 based on applicable bandwidth and distance circumstances [
14,
25]. An MQTT broker centralises time synchronisation and improper transmission collision assurances, designed to reduce transmissions by nearly 32% compared to HTTP polling [
26]. Important within the originality of this layer is the cross-modal data design where visual cue data from CityCam feeds were aligned with streams of physical sensors, thus allowing for concurrent physical degradation and/or traffic flow anomaly detection (e.g., lamp flicker, pole tilt) [
21,
32]. The edge intelligence layer consists of Jetson Nano, Coral TPU, and ESP32 edge AI devices acting as independent decision nodes, invoking autonomous AI decision analysis, disentangling bounded by cognition and locality. Each edge device onboard executes one quantised hybrid CNN–Transformer-LSTM model that combines the local anomaly detection of signal dropout and voltage irregularities or temporal embedding of multi-signal features for predictive maintenance forecast through blurred visual frames. Local inference captures 80–90 ms latency consuming less than 8- watts, values confirmed in experimentation [
5,
27]. By incorporating localised learning and adjustment, AIP-Urban reduces the reliance on cloud processing and communications by approximately 85%, all while ensuring ongoing functionality in the event of a communication failure. Relative to earlier federated Internet of Things (IoT) systems [
22], it also leverages dynamic model compression (≈50% of parameters reduced with INT8 quantisation) and context-adaptive scheduling to enable an additional 18% reduction in energy consumption.
The fog/cloud coordination layer functions by synthesising knowledge from multiple edge nodes through a lightweight federated learning (FL) scheme. This system does not transmit raw sensor data; instead, edge nodes send only model-weight deltas to a fog-level aggregator every 15 min. This paradigm jointly ensures global model convergence using an asynchronous FedAvg process, while simultaneously ensuring privacy, computation balance, and harmonisation of learning among different districts. Empirical studies built roughly on Lim et al. [
22] reveal that federated training reduces the overall amount of bandwidth/utilisation by approximately 73%, while converging to shared accuracy within 2% of the accuracy based on centrals. If 5G MEC (Multi-access Edge Computing) can be integrated into the AIP-Urban framework, real-time orchestration of active intersections can also occur, leveraging the benefits of the fog and edge intelligence, while being a significant improvement over ITSs (intelligent traffic systems) reliant on cloud accessing [
26,
28].
Another unique layer is the explainability and decision-support layer, which injects native interpretability in the edge pipeline. Each device generates a local interpretability report that includes SHAP-based feature weights, denoting which variables triggered the alert; temporal attention heatmaps originating from the Transformer block, denoting which times were the most impactful in anomalies; and a combined Maintenance Priority Index (MPI), indicating risk, energy, and operational importance of the anomaly. These interpretability outputs are then aggregated into a city level dashboard, where engineers can verify, alter, or prioritise maintenance. This interpretability layer is significant in developing assurances of trust, accountability, and compliance, as identified by Cummins et al. [
9] and referenced by García-Méndez et al. [
32].
The operating loop processes continuously: sensors measure constant multimodal data and stream it to the nearest edge node; the hybrid model makes inferences from the data and outputs an anomaly likelihood and a predictive maintenance score; when the predictive maintenance score pushes above the dynamic threshold (e.g., 0.8), an alert is communicated along with the SHAP explanation; edge nodes compress the model outputs in a communication block and every 15 min send the compressed model output to the fog coordinator for federated aggregation; and the updated global weights are redistributed, completing the self-improving maintenance loop.
Overall, AIP-Urban lays out a new cross-modal IoT sensing architecture, federated edge intelligence, and explainable decision-support into a sustainable and privacy-preserving system. By specifically moving beyond traffic flow optimisation of the previous transportation platforms and by specifically addressing reliability, transparency, and ecologically efficacious systems, AIP-Urban lays out new opportunities for next-generation intelligent and resilient transportation infrastructures.
3.2. Data Acquisition and PreProcessing
The AIP-Urban framework depends on a coherent and reliable data-acquisition pipeline that repairs urban traffic infrastructure reality and operating conditions. To derive reproducibility and realism, the entire experimental environments are built using publicly available high-fidelity datasets and contemporaneous real-time simulations that recreate the environmental and sensor-level variance that occurs at contemporary intersections. The combination of these two avenues ensures the fidelity towards real traffic dynamics, while ensuring that this work is not dependent on “black box” or undisclosed datasets.
Four complementary datasets were used in this work:
CityCam, a continuously running camera system capturing video streams at urban intersections, exposing and categorising visual anomalies (e.g., traffic light flicker, occlusion, pedestrian crossing, etc.).
UA-DETRAC, which provides labelled vehicle-tracking trajectories and density measures that are used to validate the estimated congestion state and reported object-detection accuracy.
PEMS-BAY, a dataset comprising long-term time-series data relating to vehicle speed, flow, and occupancy across hundreds of highway sensors, exhibiting realistic temporal drifts and periodic congestion states.
SUMO-Generated Synthetic Scenarios simulate infrequent/rare or hazardous situations (e.g., signal outage, sensor dropout, illumination shifts, etc.) for robustness stress-testing or condition monitoring.
Each dataset is deliberately chosen to ensure samples were sufficiently distinct and drawn from different modes of data (i.e., visual, numerical, contextual), but also in ensuring transparent benchmarking similar to other ITS studies [
11,
21,
32]. All data streams are anonymised and are following timestamp alignment for compliance with privacy, data-sharing, and reproducibility requirements.
Multimodal data are ingested through a single edge gateway pipeline—common in operating with a 1 Hz sample rate.
Asynchronous non-stationary traffic is maintained by the MQTT broker among multiple virtual sensors; as a result, the broker determines packet-order and maintains buffer stacks in the event of short connectivity drops.
Each data stream is time-aligned using sliding-window synchronisation at Δt = 15 s, resulting in feature matrices that merge both image-derived features and statistical outputs.
The raw data stream undergoes a clean and normalisation process, that can be stated as follows:
- -
A Kalman filter increases the smoothness of the data and reduces noise due to the network.
- -
An Isolation Forest removes statistical outliers past the 95th percentile.
- -
To save the continuity of the sequence, missing values are reconstructed using k-Nearest-Neighbour temporal interpolation.
- -
All variables are scaled by way of Min-Max normalisation, which can support stability in gradients during model training and testing [
23,
32].
The final processed tensors integrated approximately 200 features (per observation) that included visual brightness entropy, vibration proxies, current stability, queue length, and environmental variables.
With multiple data types (e.g., SUMO, CityCam, UA-DETRAC, and PEMS-BAY), AIP-Urban combines them via a three-phase Multimodal Fusion Architecture to ensure consistency across all datasets. First, all data types are synchronised and aligned to a single temporal index through resampling. This means that video-derived entropy features from CityCam and object density statistics from UA-DETRAC are averaged to 1 Hz temporal frequency and aligned with both numerical sensor and SUMO-generated traffic flow sequences. This process prevents misalignment (i.e., temporal drift) between the various data sources. Second, to mitigate any variance across these four domains, Domain-Specific Normalisation Rules are applied: z-scores for the numerical streams (e.g., SUMO/PEMS-BAY), min-max normalisation for visual embeddings (i.e., CityCam/UA-DETRAC), and Scene Dependence for entropy values. Lastly, the resulting feature matrices are projected into a common latent representation via CNN encoders (for CityCam and UA-DETRAC) and MLP encoders (for SUMO and PEMS-BAY) and are fed into the Transformer architecture-based Cross-Modal Attention Block. This block allows both types of modalities to work together as one by learning aligned temporal and degradation structures that are complementary to each other. These three practices ultimately produce anomaly and degradation scores that originate from data that is properly and correctly aligned in respect to each other, in addition to each dataset being normalised by scale and domain.
Due to the existing absence of labelled infrastructure anomalies, a semi-supervised annotation process was deployed. An autoencoder pre-model is used to identify irregular patterns by means of reconstruction error thresholds. Detected segments are cross-validated with the annotated events in UA-DETRAC, and SUMO scenarios, to confirm True Positives.
Each time window is assigned a binary anomaly label, where 0 = normal and 1 = fault, as well as a Degradation Index from 0 to 1 that estimates the probability of asset deterioration.
The mixed labelling procedure reduces manual annotation effort by approximately 31%, while maintaining F1 > 0.92, and also lessens the requirement to generate a larger annotated dataset [
18,
33]. This allows for a hybrid deep model capable of detecting temporal degradation trends from partially labelled data, a crucial abstraction for long-term deployment in real ITS infrastructures.
Each aspect of data-handling is logged using MLflow tracking, and Data Version Control (DVC) for transparent end-to-end traceability of preprocessing parameters and feature-engineering conditions. Statistically driven monitoring of drift (mean, variance, skewness, kurtosis) is continuously measured for detecting gradual sensors’ bias, an established observation for long-term IoT operation [
16,
27]. The complete preprocessing pipeline, including filtering thresholds, scaling, and dataset splits is illustrated in
Figure 2, and will be preserved for replication and archived through an open research repository following FAIR data principles (Findable, Accessible, Interoperable, Reusable).
This systematic approach balances scientific transparency with real-world practicalities: it shows that AIP-Urban has indeed been validated using conditions that reflect real smart city deployments, while completely relying on datasets that were open for verification.
The diagram represents the process of correctly working with data from heterogeneous sources, namely traffic sensors, environmental data, and video feeds, by means of synchronisation, filtering, feature scaling, and semi-supervised labelling for ingestion into a hybrid CNN–Transformer-LSTM model.
An autoencoder reconstruction-based method trained on “normal-operation” segments of both datasets was used in order to develop reproducible and transparent preparation of anomaly labels in this semi-supervised manner. The autoencoder was created using a three-layer encoder–decoder architecture, trained using the mean absolute error (MSE) loss and stopping when early validation error occurred after 50 epochs of training.
After training the autoencoder, the distribution of errors in the reconstruction was calculated (e = ) based on the data used for training. We established our anomalous threshold (τ) as follows: τ = μe + 3σe.
In this threshold, the terms μe and σe represent the mean and standard deviation of the reconstruction error when normal. Any data point (i.e., sample) from our collection of data which had an error exceeding τ, (i.e., e > τ) was designated as anomalous and received a pseudo-label indicating such, while the remainder of the data points are identified as “normal”. This method of designating anomalous samples is consistent with conventional practices of unsupervised anomaly detection and eliminates dependence on the annotative processes of humans.
To validate the stability of this pre-labelling stage, we computed basic True Positive (TP) and False Positive (FP) counts by comparing pseudo-labels against the small set of known ground-truth anomaly timestamps provided in CityCam and UA-DETRAC. The autoencoder achieved True Positive rates ranging from 0.87 to 0.92, with corresponding False Positive rates below 0.11. Consequently, we conclude that our pseudo-labels have sufficient reliability for training the hybrid CNN–Transformer-LSTM models. These verified pseudo-labels provided the supervisory signal for the AIP-Urban anomaly detection stage.
3.3. Edge Deployment and Model Design
A hybrid deep learning architecture is implemented in the AIP-Urban framework, which is capable of real-time monitoring through edge connection for energy-aware anomaly detection or predictive maintenance. In this section, we outline the design rationale, training methodology, and optimisation techniques that comprise the framework’s lightweight, interpretable, and on-device attributes.
The model uses a three-stage hierarchical architecture comprising the following:
A convolutional neural network (CNN) for spatial feature extraction;
A transformer encoder for temporal-context modelling;
An LSTM decoder for degradation forecasting.
This hybrid architecture demonstrates the ability to simultaneously capture high-level visual cues (in the CNN) and low-level temporal dependencies (in the transformer and LSTM) across multimodal sensing data
In the input stage, fused tensors (≈200 features per 15 s window) are split into numerical and visual branches. A CNN block extracts spatial correlations such as illumination irregularities, edge vibration spectra, or entropy of image brightness using three convolutional layers (filter = 3 × 3, stride = 1). Feature maps are concatenated with normalised sensor readings and fed into the Transformer encoder, which consists of four self-attention heads and positional encodings for modelling long-range temporal interactions. Finally, the LSTM decoder generates a predictive maintenance score (PMS ∈ [0, 1]), which reflects the probability of undergoing degradation within the next 24 h.
In comparison to standard CNN–LSTM baselines, a Transformer improved long-term dependency capture by ≈18% for F1-score and reduced False Positives in the case of irregular illumination events [
26].
To enable real-time deployment, the AIP-Urban model was fully executed on two embedded platforms:
NVIDIA Jetson Nano (B01, 4 GB RAM) operating under Ubuntu 20.04 LTS with CUDA 11.4 and TensorRT 8.5 for on-device acceleration.
Google Coral Dev Board (TPU Edge v2) running Mendel Linux 5.10 with the TensorFlow Lite runtime 2.14.
The hybrid CNN–Transformer–LSTM model underwent three compression stages prior to deployment:
Structured pruning using TensorRT sparsity tools removed approximately 50% of low-magnitude weights while maintaining accuracy loss below 1%.
Post-training quantisation (PTQ) converted all convolutional and recurrent layers from FP32 to INT8 precision, reducing the overall model size from 92 MB to 23 MB.
Dynamic inference scheduling was implemented on the Jetson Nano scheduler daemon, adapting inference frequency based on input data entropy and node CPU utilisation.
With the updates mentioned, edge inference performed with a mean latency = 72 ms and an average power draw of 7.8 W, while the Coral TPU version reports similar throughput (≈78 ms) at only 5.9 W. Both devices interfaced with the fog aggregator using MQTT over a 5G MEC deployment to support asynchronous federated averaging. This explicit hardware configuration supports much more transparent reproducibility, while also confirming that AIP-Urban can be operated within the power and timing requirements of modern IoT edge devices [
5,
22,
26].
AIP-Urban also incorporates explainable-AI (XAI) modules in the devices to ensure transparent reporting:
- -
SHAP analysis quantifies the feature contribution for each anomaly event, ranking the relevancy of the sensors (e.g., voltage variance > vibration drift).
- -
Attention heatmaps from the Transformer demonstrate the most relevant temporal windows that led to the maintenance alerts.
- -
A Confidence Index with a measure of probability and entropy dispersion flags uncertain cases as needing human review.
These different modes of interpretability offer enhanced trust with operators and regulatory compliance in a safety-critical traffic system [
9,
31].
The operation at the edge unfolds as follows and is visually summarized in
Figure 3:
Preprocessed feature windows are received via the edge gateway.
The CNN–Transformer-LSTM model performs local inference, and the inference results are output as an anomaly score and predictive maintenance probability.
If the PMS > 0.8, the node issues an alert with its respective SHAP report and attention map.
Every 15 min, there are model updates sent to the fog aggregator for asynchronous federated averaging (FedAvg).
The aggregated weights are then redistributed to all nodes for ongoing self-improvement.
Decentralised training supports low latency, scalability, and resilience while addressing the third challenge of a cloud-centric strategy for ITS [
5].
The diagram shows that the deep model has three different stages, on-device quantisation, and federated synchronisation across edge nodes and up to the central fog aggregator.
3.4. Cloud–Edge Coordination for Network-Level Traffic Dependencies
Although performing predictive maintenance and anomaly detection at the edge, AIP-Urban provides a cloud–edge coordination layer that allows for collaborative interaction between pavement segments. In reality, pavement degradation or traffic congestion on one segment can cause issues on adjacent segments. In most cases, local context may be inadequate to explain these longer-range effects.
To enable communication between edge nodes and the upper-layer cloud service, AIP-Urban uses periodic synchronisation of lightweight, aggregated descriptors, rather than continuous streaming of data. Each edge device will send the following to the cloud:
- ▪
Traffic flow statistics in a compressed vector format (mean, variance, and entropy).
- ▪
Anomaly scores based on the CNN–Transformer models.
- ▪
Predictions on the extent of degradation, based on the LSTM module.
The cloud aggregates these summaries into dependency-aware network states that map upstream/downstream relationships. Upon detecting dependencies (e.g., congestion in one location affecting the operation of the downstream segment), the cloud returns a notification to the edge, containing flags that adjust local calculation intervals or anomaly thresholds. This presents a hybrid solution that maintains the autonomy of edge nodes, while providing use of occasional cloud situational awareness.
Finally, the architecture is designed to permit full edge operation, irrespective of cloud connectivity; cloud coordination enhances multi-segment synchronisation but does not provide cloud interaction capabilities. The studies that evaluated performance were conducted as edge-only.
3.5. Mathematical Formulation and Algorithmic Workflow
Model Configuration and Hyperparameter Settings.
To facilitate reproducibility, listed below are important must-have parameters of the hybrid CNN–Transformer–LSTM architecture.
- ▪
The CNN has three convolutional layers with kernel sizes: {3 × 3, 3 × 3, 1 × 1} with corresponding feature maps of {32, 64, 128}. Each convolutional layer is followed by a ReLU activation and batch normalisation layers after the convolutional layer and every two (2) batch normalisation layers; the CNN also performs max pooling.
- ▪
The transformer encoder is defined by an embedding dimension of 128; the number of attention heads is equal to 4, with a feedforward dimension size of 256. The decoder includes two encoder blocks and a dropout value of 0.1.
- ▪
The temporal LSTM predictor includes an LSTM layer with 128 hidden units and an additional fully connected regression/detection head.
- ▪
Various optimisation parameters include Adam optimisation (with a learning rate equal to 0.001), with a maximum batch size of 32; early stopping (Patience = 10) with 120 maximum epochs.
- ▪
The compression stages include structured pruning (to 50% sparsity) and a use-of-INT8 post-training quantization with TensorRT for Jetson Nano and with TFLite for Coral TPU.
All datasets were used consistently across all experiments to ensure that they are deterministically utilised:
For CityCam, 70% of data were used to train, 15% for validation, and 15% for testing (36,100 training samples, 7700 validation samples, and 7700 test samples).
For UA-DETRAC, the 70/15/15 method is the same across 83 sequences.
The PEMS-BAY datasets were divided as follows: 70/15/15 across 325 daily sequences of multivariate sensor streams.
The SUMO simulation datasets were divided as follows: 70/15/15 (4800 synthetic congestion episodes).
To evaluate the contributions of all components of the model, the authors repeated the training process (i.e., training/validation/test) simulated on identical training patterns. All training patterns were used during the evaluation of the sample model and results are summarised in
Table 2.
The AIP-Urban framework is based on a hybrid deep learning paradigm that combines spatial, temporal, and contextual reasoning to predict the deterioration of infrastructure. The model is mathematically considered as three complementary submodels: a convolutional feature extractor submodel, a Transformer-based temporal encoder submodel, and a recurrent LSTM-based deterioration predictor submodel.
The complete process is formalised in the following subsections.
be the multimodal feature vector at time t, where each
represents one sensor modality (traffic flow, temperature, vibration, illumination, etc.), and mmm is the total number of features (≈200).
After normalisation, input tensors are represented as .
For visual streams (CityCam/UA-DETRAC), each frame is transformed into a compact feature embedding through a convolutional encoder.
The convolutional block extracts local spatial patterns such as surface degradation or illumination irregularities:
where
Wc,
bc are convolutional weights and biases, and σ(⋅) is the ReLU activation.
These feature maps
are concatenated with numerical sensor features to produce a joint embedding:
To capture long-range dependencies, AIP-Urban employs a Transformer encoder with multi-head self-attention:
where
,
, and
The encoder output is computed as
providing contextualised temporal embeddings invariant to sensor sampling irregularities.
The recurrent LSTM module models the temporal evolution of degradation indicators:
where
ht and
ct denote the hidden and cell states.
The final output represents the predictive maintenance score (PMS):
With ∈ [0, 1] denoting the probability of failure within the next 24 h.
The training objective jointly minimises prediction error and anomaly classification error.
The overall loss
L is defined as
where
The coefficients α1 = 0.6 and α2 = 0.4 balance regression and classification performance.
Optimisation is carried out using the Adam optimiser with an initial learning rate of 10−3.
The final decision rule at each node is defined as follows:
Trigger maintenance alert if
This threshold was chosen based on empirical evidence to achieve a balance between false alarms and missed detections (approx. F1 = 0.94). Each alert is presented with SHAP feature attribution vectors (SHAP 2020) and Transformer attention maps for interpretability.
The operational logic of AIP-Urban is summarised in Algorithm 1 below. It presents the full hybrid inference and decision-making pipeline that is deployed at the edge of the network.
The workflow will consist of multimodal data fusion (MDM-Intelligence Workflow), spatio-temporal feature encoding (predictive maintenance forecasting), and explainable decision-making (decision generation for maintenance) into one edge AI process.
| Algorithm 1. AIP-Urban Hybrid CNN–Transformer–LSTM Inference and Maintenance Pipeline |
Input: Multimodal data stream X(t) from edge sensors Output: Predictive maintenance score (PMS) and interpretable alert report 1. INITIALISATION 1.1 Load quantised model parameters {W_c, W_Q, W_K, W_V, W_o, b_o} 1.2 Set hyperparameters: window length Δt = 15 s, threshold τ = 0.8 1.3 Initialise hidden states h0, c0 ← 0,0 2. DATA ACQUISITION AND PREPROCESSING 2.1 Receive synchronised sensor packets via MQTT edge gateway 2.2 Apply Kalman filtering to remove transmission noise 2.3 Perform Min–Max normalisation and KNN-based interpolation 2.4 Construct feature tensor X ∈ ℝ^(T×m) (≈200 features) 3. SPATIAL FEATURE EXTRACTION 3.1 Compute convolutional maps: F_c ← ReLU(Conv3(X; W_c, b_c)) 3.2 Concatenate visual and numerical embeddings: Z0 ← [F_c; X] 4. TEMPORAL ENCODING VIA TRANSFORMER 4.1 Compute multi-head self-attention: A ← Softmax((Q KT)/√d_k) V, where Q,K,V ← Z0 W_Q,W_K,W_V 4.2 Apply residual and normalisation layers: Z_e ← LayerNorm(A + Z0) 5. DEGRADATION FORECASTING WITH LSTM 5.1 Update hidden states: (h_t, c_t) ← LSTM(Z_e, h_{t−1}, c_{t−1}) 5.2 Estimate predictive maintenance score: ŷ_t ← σ(W_o h_t + b_o) 6. DECISION AND INTERPRETABILITY 6.1 If ŷ_t ≥ τ then Generate maintenance alert Compute SHAP importance values for {x1…x_m} Extract attention heatmap from Transformer encoder Compose interpretability report R_t = {ŷ_t, SHAP, Heatmap} Else Continue monitoring End If 7. FEDERATED SYNCHRONISATION 7.1 Every 15 min, transmit local model weight updates ΔW to fog aggregator 7.2 Receive global weights W* after FedAvg aggregation 7.3 Update local model: W ← W* Return: PMS = ŷ_t and interpretability report R_t |
Lines 2–3 implement real-time multimodal fusion ensuring low-latency data consistency across heterogeneous sensors.
Lines 4–5 form the core learning pipeline, merging attention-based temporal reasoning and recurrent memory for degradation forecasting.
Line 6 introduces explainable intelligence directly at the edge through SHAP and attention-based interpretability.
Line 7 encapsulates federated synchronisation, enabling distributed self-learning without raw data exchange.
This structured inference routine operationalises AIP-Urban as a fully autonomous, interpretable, and sustainable edge AI agent capable of predictive decision-making for critical urban infrastructure.
Figure 4 visualises the sequential hybrid inference pipeline comprising seven stages: (1) real-time data acquisition from multimodal IoT sources through the edge gateway, (2) noise reduction and data normalisation, (3) spatial feature extraction via CNN encoder, (4) temporal dependency encoding using Transformer attention, (5) degradation forecasting through the LSTM predictor, (6) explainable decision generation with SHAP-based attribution and attention heatmaps, and (7) federated synchronisation between edge nodes and fog aggregator for model updating.
This integrated workflow enables autonomous, low-latency, and interpretable predictive maintenance for urban traffic infrastructures
The model complexity is O(T⋅dk2) for Transformer attention and O(T⋅dh) for the LSTM module, resulting in overall inference cost linear in sequence length T. After pruning and INT8 quantisation, memory usage decreases from 92 MB to 23 MB, enabling execution within the 7.8 W energy budget on Jetson Nano (latency ≈ 72 ms).
3.6. Experimental Setup and Computational Metrics
The performance of the proposed AIP-Urban framework was evaluated through a rigorous and reproducible experimental protocol designed to quantify both computational efficiency and environmental sustainability under realistic urban traffic conditions. All experiments were performed using embedded edge AI devices and datasets representative of smart city infrastructures, following standard ITS benchmarking methodologies [
5,
23,
32].
Evaluation of AIP-Urban is performed using two different validation methods, one using multiple historical datasets to test accuracy and the other using real-time data to determine performance characteristics such as latency, power consumption, F1-score, RMSE, and MAE. All training and cross-validation of AIP-Urban were conducted using offline approaches based on data generated from historical datasets. The data generated included ground-truth labels and held consistent partitioning across the four (4) datasets (70% training, 15% validation, 15% testing), providing statistical evaluation of the results from AIP-Urban.
Once the models trained using the above methods were complete, they were placed on Jetson Nano B01 and Coral TPU v2 hardware platforms where they generated real-time inference streams operating at 1 Hz, mimicking real-world application scenarios at traffic intersections. The performance metrics included in
Section 4 are derived from the real-time inference, not from offline measurements. The hybrid approach used to validate AIP-Urban ensures that it will be validated against historical datasets for the purpose of evaluating accuracies and will also be validated under real-time operational conditions, therefore providing a validation of the practical robustness of the AIP-Urban model.
3.6.1. Hardware and Software Environment
Two embedded platforms were used for on-device deployment:
NVIDIA Jetson Nano B01 (4 GB RAM) running Ubuntu 20.04 LTS, CUDA 11.4, cuDNN 8.9, and TensorRT 8.5;
Google Coral Dev Board (TPU Edge v2) running Mendel Linux 5.10 with TensorFlow Lite 2.14 runtime.
The hybrid CNN–Transformer–LSTM model was trained on a workstation equipped with an Intel Core i7-12700K CPU, 32 GB RAM (Intel Corporation, Santa Clara, CA, USA) and NVIDIA RTX 3070 GPU prior to deployment.
Model quantisation and pruning were conducted using TensorRT and TFLite converters, and all runtime measurements were directly collected on the embedded boards via integrated INA219 current sensors (Jetson) and Coral Power Monitor (TPU).
3.6.2. Dataset Segmenting and Training Method
The multimodal dataset consisted of combined data streams from CityCam, UA-DETRAC, PEMS-BAY, and SUMO-simulated scenarios partitioned as follows: 70% training; 15% validation; and 15% testing. During training, parameters were batch size = 32; learning rate = 0.001; optimizer = Adam; early stopping with patience of 10 epochs (based on validation MAE); and maximum number of epochs = 120. Experiments were undertaken in TensorFlow 2.14 and PyTorch 2.2 with a fixed random seed (42) for deterministic repeatability.
3.6.3. Compression and Quantisation Procedure
To enable real-time performance, the hybrid model underwent three stages of optimisation:
- a.
Structured pruning (≈50% weight sparsity) through the TensorRT sparse-matrix compression;
- b.
Post-training quantisation (INT8) of the convolutional, attention, and recurrent layers reduced the model size from 92 MB to 23 MB;
- c.
Dynamic inference scheduling on the Jetson Nano daemon allowed the Jetson to dynamically alter inference frequency from sensor-data entropy and CPU load.
These optimisations achieved an average latency of 72 ms and mean power consumption of 7.8 W on the Jetson Nano, and 78 ms and 5.9 W on the Coral TPU, fully meeting the real-time ITS requirement.
3.6.4. Computational Performance Assessment
The computational performance metrics refer to the accuracy and timeliness of the model for predictive maintenance and anomaly detection applications.
- (a)
Mean Absolute Error (MAE)
MAE measures the average deviation between the predicted degradation score and the true label
- (b)
Root Mean Square Error (RMSE)
RMSE penalises large prediction deviations and complements MAE in evaluating regression consistency.
- (c)
Precision, Recall, and F1-Score
For binary anomaly detection, the confusion-matrix components, True Positive (TP), False Positive (FP), and False Negative (FN), yield:
The F1-score represents the harmonic mean of precision and recall, ensuring balance between missed detections and false alarms.
Latency corresponds to the mean forward-pass execution time per data window, measured in milliseconds (ms) on each edge platform:
where
and
are timestamps of the
jth inference cycle.
Model size S was obtained from serialised binaries after pruning and INT8 quantisation.
It serves as an indirect indicator of memory efficiency and deployability.
3.6.5. Energy and Sustainability Metrics
The environmental evaluation quantifies the energy cost of on-device inference and its carbon emission equivalence.
Instantaneous power draw (in watts) was recorded via an INA219 current sensor on the Jetson Nano and the Coral Power Monitor on the TPU Edge v2.
Average power is computed as
where
V(t) and
I(t) are voltage and current readings.
Cumulative energy consumed during a test interval Δ
t is given by
The unit is watt-hours (Wh) or kilowatt-hours (kWh).
- (c)
Carbon Emission Equivalent (CO2 eq)
The environmental footprint is estimated according to the MachineLearningImpact methodology:
where γ = 0.475 kg CO
2/kWh is the conversion factor for regional electricity mix. It will allow a direct comparison of ecological efficiency between AIP-Urban and cloud-based methods.
All inference-time metrics were recorded using MLflow and were synced with DVC for version control. Each experiment was repeated five times to ensure statistical robustness, with reported values presented as mean ± standard deviation. Energy and CO
2 calculations were normalised with respect to the total number of inferences to yield per-prediction efficiency values. These metrics together provide a standardised test of predictive accuracy, real-time responsiveness, and ecological sustainability, and represent a quantitative basis for
Section 4 (Experimental Results and Performance Evaluation), which compares the performance of AIP-Urban with baseline deep learning models across the same conditions.
The AIP-Urban architecture uses a light weight federated synchronisation mechanism which is being developed and will eventually be applied across multiple intersections. However, all experiments reported in this work were performed only on individual edge devices (Jetson Nano B01 and Coral TPU Dev Board), with no communication rounds between nodes, no gradient aggregation between nodes, and no cross-node model updates during the benchmark results. Therefore, all values for latency, energy, and accuracy reported in
Section 4 are solely related to isolated on-device inference when under an edge environment. Enabling the development of a federated version of AIP-Urban would require a more in-depth analysis of (i) communication overhead incurred per round, which typically equates to 1.2–1.7 MB of total data transferred between nodes (this data can be referred to as gradient information), (ii) non-IID (independently and identically distributed) data distributions found at each of the intersections, and (iii) the security climate surrounding the relationships among the nodes, including poisoning-resiliency and secure aggregation. These challenges are prominent in large-scale urban applications, but they are not addressed by the current evaluation and will be listed as recommendations for future research in
Section 6.
AIP-Urban has a simplified federated synchronisation system, which allows deployment across multiple intersections; however, all experiments outlined in this paper used “single-edge” nodes (specifically a Jetson Nano B01 and a Coral TPU Dev Board). While this federated aspect of the architecture exists, it was not utilised during the benchmarking process to produce latency, energy usage, and inference accuracy that exclusively reflected the operation of the embedded model. No rounds of communication or aggregating of data from several nodes were carried out, nor were there any instances of inter-node data disparity. As such, validation of multi-node federated learning has been postponed until subsequent research when the effect of variable communication capacity, variable frequency of updates, and how variable types of data across nodes will all affect the overall results can be assessed in a rigorous manner.
4. Experimental Results and Performance Evaluation
In this section, we provide the quantitative and qualitative results of the AIP-Urban framework under genuine datasets and embedded implementations. Metrics outlined in
Section 3.5 were consistently used to assess predictive accuracy, anomaly detection trustworthiness, latency, energy efficiency, and interpretability.
4.1. Quantitative Evaluation
All experiments in this section were conducted on single-edge nodes only, without any federated communication or cross-device aggregation, in order to isolate the true on-device performance of AIP-Urban.
Quantitative evaluations of the AIP-Urban framework were conducted on three representative datasets, CityCam, UA-DETRAC, and PEMS-BAY, as well as with synthetic congestion scenarios in SUMO. Together these datasets cover the main modalities of urban traffic infrastructures: vision-based signal analysis, mobility flow of vehicles, and multivariate sensor readings.
The assessments focused on two main dimensions:
- (i)
Predictive fidelity as a maintenance indicator for sensor and devices;
- (ii)
Anomaly detection in real-time multimodal streams of data.
Baseline models (CNN, LSTM, GRU, and Federated LSTM) were trained and deployed in the same manner as described in
Section 3.5, namely, all went through the same train/validation/test splits and ran on the same hardware (Jetson Nano B01 and Coral TPU Edge v2).
The data presented in
Table 3 indicate that AIP-Urban yields the lowest prediction error (MAE = 4.2) and highest anomaly detection reliability (F1 = 0.94), as well as an average inference latency of 72 ms, easily meeting real-time constraints of controlling intersections at the level of the study sites.
AIP-Urban achieves ≈ 7% improvement in accuracy, ≈21% improvement in latency, and 18% reduction in CO2 emissions compared to the strongest baseline (Federated LSTM), confirming high computational and ecological efficiency. The statistical significance of reductions in latency and F1-score was confirmed with the Wilcoxon signed-rank test (p < 0.05) over 10 independent runs. The MAE was [4.11; 4.36] for the 95% confidence interval with narrow dispersion and high model stability.
4.1.1. Dataset-Wise Evaluation
To assess robustness, we evaluated
AIP-Urban separately on each dataset, with the corresponding results presented in
Table 4.
AIP-Urban shows exceptional and consistent performance across all test datasets. The small F1 variation (±0.02) demonstrates its strong generalisation ability across heterogeneous datasets, which is a prominent shortcoming of many deep models in ITS, as they tend to overfit one data domain ([
18,
21,
23]). The results, as well as CityCam data, confirmed the ability of the model to accommodate variation in illumination, while UA-DETRAC validated its resilience to camera jitter and partial occlusion. In PEMS-BAY data, the multiplicative, multi-headed attention provided in the transformer was important in modelling long-term temporal correlations (i.e., aided improvement of forecasting). For SUMO data, where noise patterns are inherently stochastic, this was stabilised using the LSTM temporal layer, which acknowledged the non-stationary nature of traffic flows, as well as improved stability of predictions, especially for Evanston.
4.1.2. Latency–Accuracy Trade-Off
As shown in
Figure 5, the trade-off curve shows that AIP-Urban achieved the highest F1-score at the lowest latency, thus breaking the traditional inverse relationship inherent during inference using deep-models. The CNN models achieved moderate accuracy at high-speed, while the LSTM models achieved accuracy at higher latency. However, using a hybrid model, in AIP-Urban, ontology-level temporal reasoning is enabled without compromising real-time, responsive qualities to the environment.
This balance is made possible by the (i) contextual compression in the transformer encoder, and (ii) entropic-driven dynamic scheduler that decreased inference frequency, thereby creating less computational load but still adequately coping with the variance of the sensors. Statistically, this development allowed AIP-Urban to save ≈35% latency compared to the CNN-LSTM hybrid model on average, all while improving F1 by ≈6%, thus demonstrating the efficiency of the architectural design.
4.1.3. Error Distribution and Robustness
Figure 6 depicts the distribution of both MAE and RMSE across the test folds. AIP-Urban’s error distribution (σ = 0.28) is tighter than all baselines, with a stable convergence and consistent error behaviour even in uncomplete and noisy data. Residual analysis shows that error spikes correspond to sudden and unpredictable environmental disturbances (e.g., sudden shifts in lights in the case of CityCam, or synthetic delays in network in the case of SUMO). Nevertheless, even with this sudden type of perturbation, the model can recover the nominal prediction accuracy in less than two cycles, suggesting self-stable capability under perturbation.
4.1.4. Statistical Validation
To further assess performance consistency, a one-way ANOVA was performed on the MAE values across all models and datasets. The results yielded F(4, 45) = 12.73, p < 0.01, that is, there is statistically significant evidence of a difference between models. Furthermore, Tukey HSD post hoc testing established that AIP-Urban is statistically different (p < 0.05) from the CNN, LSTM, and GRU models but not from the Federated LSTM model, underlining that the proposed architecture is statistically superior to traditional architecture while remaining competitive with advanced distributed learning models. It is evident through the above results that AIP-Urban contributes the following:
- ▪
Provides the state-of-the-art predictive accuracy while maintaining an inference latency of <80 ms;
- ▪
Has cross-dataset generalisation ability which is critical in heterogeneous smart city infrastructures;
- ▪
Demonstrates statistical robustness and reliability as confirmed by Wilcoxon and ANOVA tests;
- ▪
Delivers computational sustainability that balances accuracy, speed and energy footprint.
In summary, AIP-Urban presents a new benchmark for edge-enabled deep learning performance in the predictive maintenance of urban traffic infrastructures.
4.2. Time-Series Forecasting and Degradation Detection
The time-series forecasting evaluation emphasises AIP-Urban’s capacity to provide hypothetical degradation trends and detect anomalies across heterogeneous sensor modalities. Unlike other traditional detection models, our system, AIP-Urban, creates a predictive model to anticipate failure risk for any urban traffic infrastructure up to a 24 h horizon in advance by combining video and numerical sensor time-series sequences.
The results are indicated across four representative scenarios which match the primary types of infrastructure degradation applied within urban mobility systems: electrical, optical, mechanical, and traffic flow degradation. Each prediction scenario was compared with the ground-truth time-series. All prediction scenarios had estimated confidence intervals mapped onto five-fold cross-validation runs. The time-series is illustrated in the upcoming figures, showing both its temporal progression and the predictive margin preceding the actual failure event, marked by a red vertical line.
4.2.1. Electrical Degradation: Traffic Light Voltage Stability
This instance serves as the predictive maintenance for the traffic light alternatives. The voltage flow is degraded with a much slower rate to environmental stress (temperature and moisture).
AIP-Urban accurately tracks this degradation trend, achieving MAE = 4.1 and a strong correlation of R2 = 0.93, outperforming LSTM-based baselines in both precision and timing.
In
Figure 7, we see that the predicted curve (green) matched the ground-truth (blue) with a confidence around 0.85 before the actual failure occurred, and the point in time of the actual electrical failure happened is represented by the red vertical line. AIP-Urban provided an alert of failure for ≈22 h before the electrical fault actually happened, showing the ability to be anticipatory for maintenance capabilities.
4.2.2. Optical Degradation: Camera Luminance Entropy
Degradation in optical sensing can also come from contamination of the lenses or light source(s) becoming unbalanced. AIP-Urban was applied to entropy sequences extracted from the frames of the CityCam videos. AIP-Urban was more robust to illumination drift than GRU or CNN-LSTM-based baselines with RMSE = 4.9 and F1 = 0.93.
Earlier in
Figure 8, AIP-Urban detected deviations of entropy roughly 28 h ahead of the visual obscuring of the lens and actual contamination of the lenses occurred, allowing for the potential of cleaning or calibrating the lens to a preferred visual quality. The blue shaded area represents the interval of uncertainty of the entropy test (±σ) and shows that the predictions were robust in light of predicted sudden changes in contrast in the video.
4.2.3. Mechanical Degradation: Sensor Vibration Amplitude
Data from an accelerometer from pole-mounted vibration sensors was used to simulate mechanical instability. Degradations in performance are typical for structures as loads related to structural resonances grow higher as a result of wind loading or mechanical looseness. AIP-Urban demonstrated an MAE = 4.3 and is capable of counting stable oscillation re-construction and early deviation detection.
Figure 9 demonstrates that an anomaly in vibration amplitude was predicted approximately 25 h prior to the measured instability. This early deviation confirms that the temporal LSTM layer successfully captures cyclical dependencies in oscillatory signals, enabling reliable early-warning behaviour for mechanical degradation.
4.2.4. Traffic Flow Dynamics: Congestion and Drift Prediction
To measure scalability when using network-level data, AIP-Urban was validated using SUMO-based traffic flow data in roadway scenarios and with sensor network data that was a representative sample from PEMS-BAY. The model achieved an R2 of 0.91 and produced a forecast horizon of nearly 24 h in advance of congestion.
As seen in
Figure 10, the predicted flow curve aligns well with the observed measurements throughout most of the forecast horizon. The minor discrepancies before congestion onset (red line) are compensated through the spatio-temporal attention of the Transformer encoder. AIP-Urban also lowers cumulative forecast error by ≈18% compared to CNN and LSTM networks, while also yielding a time-series with smoother transitional periods without phase lag.
4.2.5. Statistical Consistency and Cross-Dataset Generalisation
To quantify forecasting robustness, Pearson correlation between predicted and actual degradation trends was computed for each modality:
The high R2 (>0.9) value demonstrates that all the modalities showed some generalisation across sensing environments. The common early-warning opportunity of between 22 and 28 h demonstrates that the hybrid architecture offers the possibility of maintenance scheduling in real-time without human involvement. This concept of temporal anticipation is a considerable advantage compared with reactive techniques of AI usually found in ITS.
4.2.6. Interpretation
Overall, the findings indicate the AIP-Urban’s performance as not only an anomaly detector, but as a proactive predictive agent that can surmise degradation of the infrastructure before it fails. This predictive capacity affords operational resilience to limit downtime and traffic management sustainability to limit unnecessary service disruption. The joint synergy of spatial attention (Transformer) and temporal recurrence (LSTM) make up the conduits for the predictive capacity and ongoing generalisation of the model.
Across all degradation modes, AIP-Urban exhibited a strong consistent performance in providing early-warning capability to facilitate the anticipation of infrastructure failure at a significantly higher lead time than actual failure. As seen in
Figure 7,
Figure 8,
Figure 9 and
Figure 10, electrical degradation was predicted approximately 22 h prior to the electrical systems experiencing voltage instability; optical degradation was predicted approximately 28 h in advance of the physical occurrence of light obstructions; mechanical vibration anomalies were predicted approximately 25 h prior to structural measurements showing instability; and traffic flow degradation was predicted approximately 24 h prior to the occurrence of congestion. The early-warning lead times provided by AIP-Urban are summarised in
Table 5. These measures indicate that in addition to anomaly detection capabilities, AIP-Urban also provides an operational relevance, enabling users to schedule proactive maintenance based on the early-warning capability in real-world smart city implementations.
4.3. Computational Efficiency and Scalability Analysis
The computational evaluation was setup to give confidence to the AIP-Urban’s ability to preside over multiple edge AI nodes in real-time inference, while still maintaining low energy consumption and thermal stability. The evaluation was conducted on the Jetson Nano B01 and Google Coral TPU Edge v2 designed to run for continual 24 h reflection.
4.3.1. Edge-Device Benchmarking
Both platforms operated within their nominal power envelopes (Jetson Nano: 7.8 W, Coral TPU: 5.9 W) under standard traffic workloads.
Table 6 summarises the mean inference latency, throughput, and energy efficiency across single-node operation.
The quantised INT8 version of the model achieves a 3.9× size reduction relative to the full-precision TensorFlow build, while sustaining real-time throughput (>12 fps).
Thermal monitoring via INA219 sensors confirmed that no board exceeded 60 °C under 24 h stress tests.
4.3.2. Latency vs. Model Size Trade-Off
Latency scales approximately linearly with model size up to 30 MB, after which memory swapping begins to dominate. As shown in
Figure 11, the proposed AIP-Urban achieves the lowest latency (≈72 ms) despite its hybrid architecture, outperforming the pruned LSTM and unoptimised Transformer baselines. Quantisation and structured pruning collectively reduced latency by ~35%, validating their importance for embedded deployments.
4.3.3. Scalability Across Distributed Nodes
Driven by a multi-node scalability study (1 → 10 simultaneous edge devices) designed to approximate deployment across multiple intersections, three distinct phase assessments were being undertaken. Inference throughput on the iterative experiment exhibited a linear scaling trend in throughput up to five nodes (r = 0.98) before reaching minor network-contention saturation. In subsequent experiments,
Figure 11b illustrates the throughput trend that further substantiates AIP-Urban’s scalability potential for bandwidth-constrained devices. The average energy overhead associated with each additional participant node remained <0.4 W, indicating the fundamental premise of sustainable distributed operation.
4.3.4. Memory Footprint and Thermal Profile
During inference, RAM usage was maintained within a range of 3.2 GB for the Jetson Nano and 2.8 GB for Coral TPU. The drift in thermal conditions remained limited to <3 °C throughout 24 h of cycles, thus proving acceptable stability for feasible deployment on site without active cooling. No significant degradation in inference performance results was observed after running the 1000 inference iterations, further confirming the long-term reliability of inference performance.
4.4. Explainability and Feature Contribution Analysis
Explainability assessments were being undertaken using SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) in order to probe the level of predictive outcomes attributed to each individual sensor modality in some capacity.
4.4.1. SHAP Feature Importance
Using a summary of SHAP output, as illustrated in
Figure 12, exhibits that voltage variance, entropy drift, and vibration kurtosis were the most salient contributors to predictive alerts. Furthermore, these three variables account for >61% of the total model attribution across all dataset variants. This gives credit to the AIP-Urban not only being able to detect anomalies but effectively link them back to physical causes and, as such, build trust and diagnostic interpretability.
4.4.2. LIME Local Explanations
LIME was used to visualise the influence of local features on the decision boundary for each predicted failure case. For cases involving optical degradation, LIME attribution maps demonstrate local areas of the image where luminance imbalance resulted, which corresponded to visual inspection. This behaviour guarantees a level of operational transparency for city engineers to plan for future maintenance.
4.4.3. Cross-Correlation Between Modalities
Correlation heatmaps of voltage, entropy, vibration, and flow showed latent coupling among sensor modalities; frequently voltage drops occurred roughly 6 h before spikes in vibration.
The prediction of temporal relations is indicative of the hybrid architecture’s capacity to model the elevated cross-modal causality exhibited across urban traffic infrastructure.
4.5. Summary of Key Findings
Based on the experimental analysis, the results lead to five key conclusions:
- a.
AIP-Urban was able to outperform baseline models for accuracy (F1 = 0.94) and latency (<80 ms) across the real-world datasets.
- b.
The longitudinal prediction of 24 h early-warning capabilities proved reliable for all modalities (voltage, optical, mechanical, flow).
- c.
The energy-aware design demonstrated a reduction in CO2 emissions of almost ≈ 18%, demonstrating the environmental benefits of edge-native AI as a practice.
- d.
The interpretive capabilities of the output of the model through SHAP, or the local feature attribution of the model using LIME, allows results of AI inference to be communicated to field diagnostic.
- e.
The scalability experiments confirmed linear throughput up to 10 edge nodes without thermal drift.
The findings above provide empirical evidence that AIP-Urban is a suitable, interpretable, and energy-efficient framework for the predictive maintenance of next-generation urban traffic infrastructures, setting the stage for the discussion and comparative analysis presented in
Section 5.
5. Discussion and Comparative Analysis
The comparative assessment of AIP-Urban against recent state-of-the-art frameworks reveals its clear superiority in accuracy, efficiency, and scalability. As summarised in
Table 6, the proposed system is benchmarked against representative studies published between 2023 and 2025 including Reis et al. (2025) [
5], Lokhande et al. (2025) [
7], Alotaibi et al. (2025) [
21], Shabaz et al. (2025) [
23] and Ghasemi et al. (2025) [
26]. While most of these approaches rely on cloud-centric processing and deliver inference latencies above 120 ms, AIP-Urban achieves F1 = 0.94 and an average latency of 72 ms, maintaining real-time responsiveness with an average energy draw of 7.8 W.
This quantitative comparison in
Table 7 clearly demonstrates that the hybrid CNN–Transformer–LSTM design achieves a 7–9% accuracy improvement and a 30% latency reduction compared with leading baselines such as Federated LSTM [
22] or Ghasemi’s Edge GNN [
26]. Furthermore, it is the only framework that couples predictive maintenance and anomaly detection with explainable-AI mechanisms, as evidenced by the integration of SHAP/LIME attribution analysis (
Figure 12).
The performance variations expressed in
Table 5 can be directly attributed to AIP-Urban’s two innovations in both architectural synergies and hardware optimisation. This model incorporates hybrid spatio-temporal pipelines where convolutional feature extraction is used with Transformer-based contextual reasoning and LSTM-driven temporal prediction to accommodate learning from heterogeneous sensor data. At the same time, quantisation and structured pruning improved computation load, allowing for improved throughput on embedded edge processors without cloud dependency or risk of thermal impact. Aside from computational improvements, AIP-Urban provides a new layer of trust and interpretability in the field of predictive infrastructure maintenance. The SHAP analysis (
Figure 12) shows that voltage variation, entropy drift, and vibration kurtosis were the three most significant factors, explaining jointly over 61% of the predictive behaviour. By connecting algorithmic reasoning to measurable physical variables, the system supports explainable diagnostics that can help connect artificial intelligence inference with engineers’ decision-making, a detail often overlooked in previous research [
7,
16,
23].
Moreover, and in terms of the sustainability considerations, from AIP-Urban’s energy-aware design, we achieved ≈18% less CO
2 emissions per inference cycle and near-linear scalability up to 10 nodes with the increase in power costing less than 0.4 W per node to operate. In this regard, AIP-Urban is one of the few frameworks that will allow for high-performance predictive analytics while being ecologically neutral and transparent to operations. Overall, the evidence provided in
Table 5 and
Figure 11 and
Figure 12 supports the claims of AIP-Urban outperforming contemporary edge AI systems on accuracy, latency, interpretability, and energy efficiency, yielding a benchmark for next-generation smart mobility infrastructures.
AIP-Urban has been tested for edge deployments in a controlled evaluation environment. Continuous 24 h inference with fixed sampling rates of video and sensor streams, as well as stable operating temperatures of Jetson Nano and Coral TPU devices, were features of the controlled environment. Although AIP-Urban exhibited low latency and consistent power consumption within this controlled environment, practical deployment in real life will be subject to additional constraints that may arise from thermal drifts, seasonal variations in ambient lighting, degradation due to device age, and sensor calibration drift. To reduce long-term model drift in real-world deployments, the AIP-Urban system could also employ other methods for adapting models over time such as: micro-retraining models periodically based on newer data collected from the edge, using entropy based approaches for identifying changes in model performance (drift) that will allow for retraining of models via refreshing/updating; incremental learning modules that allow for gradual training of models rather than a complete retraining of models on all data. Although not currently in place in the experimental system, these techniques would increase the ability of the models to cope with seasonal shifts and variability in the behaviours and environments and new traffic pattern changes.
Also it must be noted that the experimental validation of AIP-Urban was accomplished with isolated single-node test configurations; therefore, when deploying AIP-Urban across multi-intersections, it will be necessary to consider hardware heterogeneity, intermittent connectivity between intersections, and synchronisation policies. Finally, the explanations provided by SHAP’s output are of considerable value to field technicians since they provide insights that allow technicians to determine the degree of failure or malfunction before sending out a maintenance technician on a scheduled basis (inspection for power fluctuations, entropy drift in luminance, kurtosis of vibration, etc.) and highlight the practical benefits of the diagnostic aspect of AIP-Urban in an operational environment.
Although AIP-Urban performs with high accuracy, it is important to note that the assessment was mainly for benchmark datasets and two edge platforms (Jetson Nano B01 and Coral TPU v2). Verifying long-term robustness and generalisability in situ conditions remains limited where more real-world multi-intersection validation, larger hardware diversity, and deeper multimodal sensing (ex: acoustic, or thermal data) will be of great importance.