AIP-Urban: Edge-Enabled Deep Learning Framework for Predictive Maintenance and Anomaly Detection in Urban Traffic Infrastructure

Abdallah, Wajih; Alghamdi, Mansoor

doi:10.3390/systems13121117

Open AccessArticle

AIP-Urban: Edge-Enabled Deep Learning Framework for Predictive Maintenance and Anomaly Detection in Urban Traffic Infrastructure

by

Wajih Abdallah

^*

and

Mansoor Alghamdi

Department of Computer Science, Applied College, University of Tabuk, Tabuk 71491, Saudi Arabia

^*

Author to whom correspondence should be addressed.

Systems 2025, 13(12), 1117; https://doi.org/10.3390/systems13121117

Submission received: 18 November 2025 / Revised: 3 December 2025 / Accepted: 8 December 2025 / Published: 11 December 2025

(This article belongs to the Special Issue AI-Driven Transportation Systems: Innovations, Challenges, and Future Mobility)

Download

Browse Figures

Versions Notes

Abstract

Urban traffic infrastructures like traffic signals, surveillance cameras, and embedded sensors play an essential role in providing sustainable mobility but are also susceptible to malfunctions, data drift, and degradation from environmental conditions. In this study, we propose AIP-Urban, an edge AI-enabled predictive maintenance framework that employs deep spatio-temporal learning with continuous anomaly detection for smart transportation systems. Our framework integrates IoT sensing, computer vision, and time-series analytics to identify and forecast infrastructure failures before they occur. For visual and numerical anomalies (e.g., traffic signal outage, abrupt congestion, sensor disconnection), we employ a hybrid CNN–Transformer model, while we utilise a Temporal LSTM predictor to estimate a degradation trend to predict maintenance events within 24 h. The models are deployed on Jetson Nano edge devices to enable real-time processing under energy constraints. Extensive simulation studies using datasets from SUMO, CityCam, and UA-DETRAC show that AIP-Urban achieved 94% accuracy for anomaly detection (F1 = 0.94), with RMSE = 0.11 for failure prediction and an edge inference latency of 72 ms, while power consumption remained below 7.8 W. Statistical tests (Wilcoxon p < 0.05) show goodness-of-fit compared to baseline models of CNN, LSTM, and Transformer only. This study shows promise in improving the reliability, safety, and sustainability of urban traffic using proactive, explainable, and energy-aware AI at the edge. AIP-Urban serves as a reproducible reference architecture for future AI-driven transportation maintenance systems that is aligned with intelligent and resilient smart cities principles.

Keywords:

predictive maintenance; anomaly detection; edge AI; deep learning; CNN; LSTM; smart transportation systems; urban traffic infrastructure; sustainable mobility; real-time monitoring

1. Introduction

Transportation infrastructure in urban areas (intersections, traffic signals, sensor loops, and camera surveillance) plays an essential role in the functioning of smart cities by providing improved mobility and safety and less environmental impact. The mounting strain on these systems (ageing hardware, increasing traffic, environmental conditions, and variability in operational conditions) has increased the likelihood of unanticipated failures. A single failed traffic signal, sensor, or unreported outage can result in increased congestion, energy consumption, and CO₂ emissions and reduced service levels for users.

Historically, infrastructure performance monitoring involved either reactive “break-fix” approaches, or scheduled, preventative inspections. These methods are inadequate for the needs of smart mobility: cities require uninterrupted service, resilience to disruptions, and a low environmental impact. As articulated in one review, “the convergence of smart city initiatives with predictive maintenance … will allow for real-time, anticipatory monitoring of infrastructure assets.” [1].

Simultaneously, the field of artificial intelligence (AI) and deep learning (DL) has accelerated the advancement of intelligent transportation systems (ITSs) through improved traffic prediction, adaptive signal control, and fleet optimisation [2]. However, a substantial amount of prior research has focused on traffic flow or vehicle behaviour and has not considered the future operation of transportation infrastructure (such as sensors, controllers, and embedded devices). These have not only failed to account for traffic flows, but also reliability and preservation of infrastructure function [3]. In addition, the rapid development of edge computing technologies enables analytical and decision-making tasks to better analyse the data. This facilitates latency reduction, bandwidth savings, and near-real-time intervention. For example, in the context of infrastructure management, edge AI systems have been shown to “enable real-time diagnostics, predictive maintenance … traffic optimisation in smart city infrastructures” [4]. Other evidence in the urban traffic context has demonstrated better responsiveness and robustness with edge-based solutions [5]. In addition to performance and resilience considerations, smart mobility demands sustainability and trust: traffic management now has to consider environmental objectives (e.g., reducing CO₂ emissions, energy efficiency) and ensure the acceptability of automated decision-making (e.g., through model transparency and interpretability). A recent review indicates that “the implementation of emerging, cutting-edge AI-driven innovations … will fundamentally reshape road transportation systems for smart cities, e.g., real-time traffic management and environmental impact” [6].

Finally, predictive maintenance and anomaly detection are increasingly regarded as important levers to support sustainable mobility: “AI-enabled predictive maintenance uses Internet of Things (IoT) sensors, AI and big data, to monitor infrastructure, predict failures and enable optimized maintenance” [7]. Solutions that use camera vision, time-series senor, or meta-sensor data in conjunction with AI methods quote detection accuracies greater than 90% in real-time urban infrastructure anomaly detection [8]. Nevertheless, there remain several key challenges unresolved: sensor interoperability in heterogenous urban networks, latency in real-world deployment in dense urban environments, the increased energy consumption of DL models at the edge and the continuing challenge of model transparency, and many AI models remaining as “black boxes” in safety-critical contexts. In particular, the joint use of explainable-AI (XAI) techniques and edge constraints (power, latency, bandwidth) remains largely unexplored [9]. Finally, while many studies target AI-based traffic flow prediction and anomaly detection, such as the example study producing 97.5% accuracy in a multi-agent simulation, they often tend to ignore hardware constrained or real infrastructure relevant to edge deployment [10].

In this context, this paper introduces AIP-Urban, a new edge AI framework for predictive maintenance and anomaly detection of urban traffic infrastructure. The system deploys a hybrid CNN–Transformer model for visual and temporal anomaly detection (e.g., signal failure, sensor outage, abnormal congestion), and an LSTM predictor to proactively predict equipment failure within a 24 h horizon. The architecture is deployed on edge nodes (Jetson Nano, ESP32-Cam) and attempts to minimise latency and dependence on cloud (non-Internet) connectivity.

The primary contributions of this work are as follows:

A complete edge AI architecture aimed at the health management of urban traffic infrastructure, that combines anomaly detection and predictive maintenance.
The development and evaluation of a CNN–Transformer + LSTM model used for joint multimodal urban sensor data, i.e., traffic video, sensor time-series, and signal states.
An experimental implementation on actual edge hardware, with measured latency, energy consumption, and detection/prediction performance.
A systematic results analysis that includes ablation studies, statistical significance tests, and explainability (XAI) to establish system trustworthiness and operational readiness in urban environments.

AIP-Urban has introduced several new technologies that fill the major unmet needs within the edge AI and federated ITS frameworks discussed in Section 2. First, previous systems tended to look only at traffic optimisation or anomaly detection separately from sensor monitoring or predictive maintenance. In contrast, AIP-Urban combines all three functions into one seamless hybrid CNN–Transformer-LSTM architecture implemented solely at the edge of the network with no possibility of delay caused by using the cloud, as well as no privacy concerns regarding data sharing over a public internet connection. Second, unlike federated ITS projects, which typically require large models and rely on a central server for scheduling and storage, AIP-Urban provides a comprehensive solution that uses adaptive scheduling based on both current conditions and user behaviour; efficient INT8 quantization of deep learning frameworks; and structured pruning of neural networks that collectively enable sub-80 ms latency and 7.8 watts on the Jetson Nano microprocessor, results that have never been achieved in the related literature. Third, the native explanation capability built into AIP-Urban allows users to generate SHAP (SHapley Additive exPlanations) values along with traditional attention maps from the neural network at the edge and addresses the unmet need for explainability identified in earlier publications. Finally, AIP-Urban is the first ITS framework to show cross-modal (city to city) generalisation across all datasets used in this research (CityCam, UA-DETRAC, PEMS-BAY, SUMO) using one integrated approach, thereby addressing some limitations experienced by earlier ITS frameworks that restricted testing to single datasets.

The remainder of this paper is organized as follows. Section 2 surveys the related work relevant to our study. Section 3 introduces the proposed system architecture and outlines the data collection process. Section 4 presents the experimental results and performance evaluation. Section 5 provides a detailed discussion and comparative analysis of the obtained findings. Finally, Section 6 concludes the paper and outlines prospective directions for future research.

2. Related Work

The convergence of artificial intelligence (AI) and edge computing in intelligent transportation systems (ITSs) is transitioning from conventional traffic optimisation to comprehensive asset management and infrastructure reliability. However, following Khan et al. [11] and Saki [12], the majority of frameworks continue to emphasise mobility efficiency over health and operational performance of the underlying infrastructure. This section will review four relevant research themes: (i) AI-based predictive maintenance, (ii) deep learning-based anomaly detection, (iii) edge-based intelligence for decision-making in real-time, and (iv) research gaps that lead to the motivation for AIP-Urban.

2.1. AI-Based Predictive Maintenance for Transportation Infrastructure

The concept of predictive maintenance is essentially to predict failures from real-time sensor streams and context awareness, as opposed to ad hoc reactive “break-fix” work. Several surveys completed deep learning approaches across domains [11,13]. Specifically, Khan et al. [11] noted an over-reliance on convolutional and recurrent architectures (CNN, LSTM, GRU) for early fault detection. In contrast, Saki [12] noted that a CNN-BiLSTM hybrid achieved >93% accuracy in railway asset failure prediction; one disadvantage was the need for significant computational power (i.e., gpu) and it had also not been validated in real-time. Bonci et al. [14] described an approach based on Deep Echo State Networks (DeepESN) for predictive maintenance within the industry where, in their paper, they defined how an approach based on DeepESN would be robust despite noisy signals, which is an important feature for environments that involve urban sensing. In contrast, Nolan and Reynolds [15] observed that generic feedforward models are fragile and degrade significantly when there is a background imbalance within the dataset, although this is a common issue with traffic sensor datasets. Lokhande [7] further built upon AI-based infrastructure maintenance frameworks with big data pipelines and IoT integration. Nevertheless, these architectures are still reliant on the cloud, resulting in unacceptable latencies for time-sensitive traffic management. Dhanasekaran [16] stated that it is time to move towards maintaining analytics that are edge-native to provide resilience. In addition, work such as Reis et al. [5] and Ortiz-Garces [17] have demonstrated the potential to deploy lightweight models (TensorFlow Lite/PyTorch Mobile) to Jetson Nano and Raspberry Pi, but these uses are for home automation or manufacturing applications and do not apply to traffic assets. To date, the generalisation of predictive maintenance to urban ITS assets has been largely unexplored.

2.2. Deep Learning for Anomaly Detection in Traffic and Infrastructure

Anomaly detection is concerned with detecting abnormal events (congestion spikes, sensor outages, or controlling errors) before they become apparent in the mobility network. Early research mostly looked at irregular traffic flow using temporal deep learning [9,18,19], and infrastructure anomalies, if examined, were often disregard [20]. Ma et al. [18] conducted a survey of graph-based models, concluding that GNNs have greater efficiency than RNNs in preserving spatial and temporal interactions, while relying on a large number of labelled data, which is rarely available in transportation. Alotaibi et al. [21] presented an approach based on deep reinforcement learning combined with vision-based analytics on a pedestrian walkway anomaly detection experiment with accuracy results of 96%. Separately, Lim et al. [22] proposed federated learning as a framework for anomaly detection in distributed settings across multiple IoT and transportation nodes, producing improved data privacy and 17% appreciable time reduction to central deep learning models. However, inference latency surpassed 450 milliseconds, which restricts deployment in real crossroads. Shabaz [23] proposed a CNN-based pipeline to predict anomalies in urban settings, achieving nearly 92% accuracy, but it was at the expense of interpretability—a valid concern from Cummins et al. [9] in their review of XAI applied in predictive maintenance. Khan [20] analysed surveillance videos using an ultra-light weight CNN model (Sensors 2022) but also observed that False Positives grown exponentially with the environment illumination varying, which is a typical condition of outdoor crossroads. Instead, a follow-up study by Alotaibi et al. [21] made an attempt to integrate edge-vision and deep optimisation; however, their framework did not include energy benchmarking. Furthermore, comparative studies, like Wang et al. [24], conclude that most multivariate anomaly detectors do not generalise across the domains without retraining. Lastly, concerning capping the reproducibility of datasets, several datasets in the public domain (AICity, UA-DETRAC) were designed to track vehicles instead of assessing infrastructure degradation, which results in a largely faulty evaluation schema across studies [25].

2.3. Edge AI and Decentralised Learning for Smart Mobility

Edge computing has proved useful in a number of test cases to support latency-sensitive ITS applications [14,26,27]. Ghasemi et al. [26] showed that lowering inference to near the sensor results in 40–60% improvement in communication time without more than a 2% drop in accuracy versus cloud inference. Reis et al. [5] replicated this on Jetson Nano, targeting operations using a workload of predictive maintenance, and achieved transportation <120 ms delay time and energy consumption of ≈7.8 W. Meanwhile many realised deployments have only progressed to a fractured prototype: edge nodes are not typically coordinated between intersections and are rarely examined for failover during connectivity drops [16,28]. Klein et al. [29] proposed auxiliary knowledge graphs for assisted anomaly reasoning, which improved detection accuracy but came with high memory requirements. Ortiz-Garces [17] reported 40% down time reduction using edge analytics but did not account for scalable energy consumption objectives. Sharma [30] emphasised that AI-edge systems need to integrate faults and cybersecurity weaknesses often overlooked in experiment works in the literature. In a similar fashion, Bawaneh [31] demonstrated that ML systems using edge deployment in traffic sensors can incur performance decay when exposed to extended periods of heat, a supplementary factor infrequently scrutinised in the literature. Finally, summarised recent contributions of 5G edge AI [25] and cloud–edge hybrid orchestrators [28] represent a viable modal for a distributed intelligence for mobility solution but do not introduce a standard evaluation schema.

To provide the reader with a more clearly articulated comparative overview of the most salient contributions in predictive maintenance, anomaly detection, and edge-intelligent systems, Table 1 outlines the core characteristics, merits, and weaknesses of the review studies referenced above. This synthesis conveys both the growth in methodology within the field and the sustained disconnection of these research streams; whilst deep learning has pushed the boundaries in detection and prediction, most exploits remain relative to their original domain, rely on an enclose proprietary dataset, and neglect explanation, energy efficiency, or cross-infrastructure scalability.

The profound comparative effort in Table 1 reinforces the imperative for an integrated, explicable, energy aware edge AI framework, an evidenced issue outlined by the proposed architecture AIP-Urban.

To summarise, even though autonomous technology and AI-based systems have come a very long way and many positive advancements have taken place, there are still considerable limits on today’s ITSs (intelligent transportation systems), edge AI, and federated learning approaches based on the presence of the following four reoccurring barriers: (1) Only focusing on a single modality. Generally speaking, most extant approaches focus on either traffic flow analysis (via automated routing) or video analytics. (2) Cloud or server-based coordination, as opposed to direct edge-based processing, adds significant latency to most of today’s deployments and de-creases reliability in the real world when used. (3) There is no availability of interpretable diagnosis for safety-critical infrastructure systems. (4) A lack of formalised energy-aware evaluation on real edge hardware (versus the cloud). Therefore, what can be concluded from the work performed in [5,16,22,23,30,32] is that no single solution is currently capable of offering multi-sensor deterioration forecasting, on-device real-time anomaly detection, on-device interpretability, and (the ability to execute at low power via quantisation) on embedded devices. The AIP-Urban framework was developed specifically to address these barriers through an integrated hybrid CNN–Transformer-LSTM pipeline, multimodal edge deployments, interpretable attribution mechanisms, and, ultimately, to develop a sustainable next-generation management system for our historic traffic infrastructure.

3. System Architecture and Data Collection

The AIP-Urban framework was developed as a multi-layer edge AI environment that incorporates sensing, intelligence, and sustainability into urban traffic infrastructures. The proposed architecture simultaneously connects three important areas: (1) real-time anomaly detection, (2) predictive maintenance of hardware assets, and (3) energy-aware distributed computation. In contrast to existing intelligent transportation system (ITS) designs that heavily rely on centralised cloud-based analytic and actions, AIP utilises an edge approach that supports the required ultra-low-latency (<100 ms) response time and privacy-preserving learning across heterogeneous devices.

3.1. Overall Architecture

The system architecture consists of four collaborative layers, as illustrated in Figure 1: IoT sensing, edge intelligence, fog/cloud coordination, and explanation and decision-support.

At the foundation, the IoT sensing layer is made up of a distributed and tightly clustered network of multimodal sensors monitoring operational and environmental conditions in traffic systems. Such systems included signal controllers, roadside units, cameras, acoustic sensors, and micro-vibration units. Each node captures a continuous data stream of vehicle counts, flow density, queue length, temperature, humidity, light intensity, or vibration amplitude with a sampling frequency of typically 1 Hz. The network leverages various communication technologies such as LoRaWAN, MQTT running over 5G, or conventional wi-fi 6 based on applicable bandwidth and distance circumstances [14,25]. An MQTT broker centralises time synchronisation and improper transmission collision assurances, designed to reduce transmissions by nearly 32% compared to HTTP polling [26]. Important within the originality of this layer is the cross-modal data design where visual cue data from CityCam feeds were aligned with streams of physical sensors, thus allowing for concurrent physical degradation and/or traffic flow anomaly detection (e.g., lamp flicker, pole tilt) [21,32]. The edge intelligence layer consists of Jetson Nano, Coral TPU, and ESP32 edge AI devices acting as independent decision nodes, invoking autonomous AI decision analysis, disentangling bounded by cognition and locality. Each edge device onboard executes one quantised hybrid CNN–Transformer-LSTM model that combines the local anomaly detection of signal dropout and voltage irregularities or temporal embedding of multi-signal features for predictive maintenance forecast through blurred visual frames. Local inference captures 80–90 ms latency consuming less than 8- watts, values confirmed in experimentation [5,27]. By incorporating localised learning and adjustment, AIP-Urban reduces the reliance on cloud processing and communications by approximately 85%, all while ensuring ongoing functionality in the event of a communication failure. Relative to earlier federated Internet of Things (IoT) systems [22], it also leverages dynamic model compression (≈50% of parameters reduced with INT8 quantisation) and context-adaptive scheduling to enable an additional 18% reduction in energy consumption.

The fog/cloud coordination layer functions by synthesising knowledge from multiple edge nodes through a lightweight federated learning (FL) scheme. This system does not transmit raw sensor data; instead, edge nodes send only model-weight deltas to a fog-level aggregator every 15 min. This paradigm jointly ensures global model convergence using an asynchronous FedAvg process, while simultaneously ensuring privacy, computation balance, and harmonisation of learning among different districts. Empirical studies built roughly on Lim et al. [22] reveal that federated training reduces the overall amount of bandwidth/utilisation by approximately 73%, while converging to shared accuracy within 2% of the accuracy based on centrals. If 5G MEC (Multi-access Edge Computing) can be integrated into the AIP-Urban framework, real-time orchestration of active intersections can also occur, leveraging the benefits of the fog and edge intelligence, while being a significant improvement over ITSs (intelligent traffic systems) reliant on cloud accessing [26,28].

Another unique layer is the explainability and decision-support layer, which injects native interpretability in the edge pipeline. Each device generates a local interpretability report that includes SHAP-based feature weights, denoting which variables triggered the alert; temporal attention heatmaps originating from the Transformer block, denoting which times were the most impactful in anomalies; and a combined Maintenance Priority Index (MPI), indicating risk, energy, and operational importance of the anomaly. These interpretability outputs are then aggregated into a city level dashboard, where engineers can verify, alter, or prioritise maintenance. This interpretability layer is significant in developing assurances of trust, accountability, and compliance, as identified by Cummins et al. [9] and referenced by García-Méndez et al. [32].

The operating loop processes continuously: sensors measure constant multimodal data and stream it to the nearest edge node; the hybrid model makes inferences from the data and outputs an anomaly likelihood and a predictive maintenance score; when the predictive maintenance score pushes above the dynamic threshold (e.g., 0.8), an alert is communicated along with the SHAP explanation; edge nodes compress the model outputs in a communication block and every 15 min send the compressed model output to the fog coordinator for federated aggregation; and the updated global weights are redistributed, completing the self-improving maintenance loop.

Overall, AIP-Urban lays out a new cross-modal IoT sensing architecture, federated edge intelligence, and explainable decision-support into a sustainable and privacy-preserving system. By specifically moving beyond traffic flow optimisation of the previous transportation platforms and by specifically addressing reliability, transparency, and ecologically efficacious systems, AIP-Urban lays out new opportunities for next-generation intelligent and resilient transportation infrastructures.

3.2. Data Acquisition and PreProcessing

The AIP-Urban framework depends on a coherent and reliable data-acquisition pipeline that repairs urban traffic infrastructure reality and operating conditions. To derive reproducibility and realism, the entire experimental environments are built using publicly available high-fidelity datasets and contemporaneous real-time simulations that recreate the environmental and sensor-level variance that occurs at contemporary intersections. The combination of these two avenues ensures the fidelity towards real traffic dynamics, while ensuring that this work is not dependent on “black box” or undisclosed datasets.

Data Sources

Four complementary datasets were used in this work:

CityCam, a continuously running camera system capturing video streams at urban intersections, exposing and categorising visual anomalies (e.g., traffic light flicker, occlusion, pedestrian crossing, etc.).
UA-DETRAC, which provides labelled vehicle-tracking trajectories and density measures that are used to validate the estimated congestion state and reported object-detection accuracy.
PEMS-BAY, a dataset comprising long-term time-series data relating to vehicle speed, flow, and occupancy across hundreds of highway sensors, exhibiting realistic temporal drifts and periodic congestion states.
SUMO-Generated Synthetic Scenarios simulate infrequent/rare or hazardous situations (e.g., signal outage, sensor dropout, illumination shifts, etc.) for robustness stress-testing or condition monitoring.

Each dataset is deliberately chosen to ensure samples were sufficiently distinct and drawn from different modes of data (i.e., visual, numerical, contextual), but also in ensuring transparent benchmarking similar to other ITS studies [11,21,32]. All data streams are anonymised and are following timestamp alignment for compliance with privacy, data-sharing, and reproducibility requirements.

Acquisition and Synchronisation Pipeline

Multimodal data are ingested through a single edge gateway pipeline—common in operating with a 1 Hz sample rate.

Asynchronous non-stationary traffic is maintained by the MQTT broker among multiple virtual sensors; as a result, the broker determines packet-order and maintains buffer stacks in the event of short connectivity drops.

Each data stream is time-aligned using sliding-window synchronisation at Δt = 15 s, resulting in feature matrices that merge both image-derived features and statistical outputs.

The raw data stream undergoes a clean and normalisation process, that can be stated as follows:

-: A Kalman filter increases the smoothness of the data and reduces noise due to the network.
-: An Isolation Forest removes statistical outliers past the 95th percentile.
-: To save the continuity of the sequence, missing values are reconstructed using k-Nearest-Neighbour temporal interpolation.
-: All variables are scaled by way of Min-Max normalisation, which can support stability in gradients during model training and testing [23,32].

The final processed tensors integrated approximately 200 features (per observation) that included visual brightness entropy, vibration proxies, current stability, queue length, and environmental variables.

Multimodal Fusion and Cross-Source Consistency

With multiple data types (e.g., SUMO, CityCam, UA-DETRAC, and PEMS-BAY), AIP-Urban combines them via a three-phase Multimodal Fusion Architecture to ensure consistency across all datasets. First, all data types are synchronised and aligned to a single temporal index through resampling. This means that video-derived entropy features from CityCam and object density statistics from UA-DETRAC are averaged to 1 Hz temporal frequency and aligned with both numerical sensor and SUMO-generated traffic flow sequences. This process prevents misalignment (i.e., temporal drift) between the various data sources. Second, to mitigate any variance across these four domains, Domain-Specific Normalisation Rules are applied: z-scores for the numerical streams (e.g., SUMO/PEMS-BAY), min-max normalisation for visual embeddings (i.e., CityCam/UA-DETRAC), and Scene Dependence for entropy values. Lastly, the resulting feature matrices are projected into a common latent representation via CNN encoders (for CityCam and UA-DETRAC) and MLP encoders (for SUMO and PEMS-BAY) and are fed into the Transformer architecture-based Cross-Modal Attention Block. This block allows both types of modalities to work together as one by learning aligned temporal and degradation structures that are complementary to each other. These three practices ultimately produce anomaly and degradation scores that originate from data that is properly and correctly aligned in respect to each other, in addition to each dataset being normalised by scale and domain.

Label Generation and Ground-Truth Development

Due to the existing absence of labelled infrastructure anomalies, a semi-supervised annotation process was deployed. An autoencoder pre-model is used to identify irregular patterns by means of reconstruction error thresholds. Detected segments are cross-validated with the annotated events in UA-DETRAC, and SUMO scenarios, to confirm True Positives.

Each time window is assigned a binary anomaly label, where 0 = normal and 1 = fault, as well as a Degradation Index from 0 to 1 that estimates the probability of asset deterioration.

The mixed labelling procedure reduces manual annotation effort by approximately 31%, while maintaining F1 > 0.92, and also lessens the requirement to generate a larger annotated dataset [18,33]. This allows for a hybrid deep model capable of detecting temporal degradation trends from partially labelled data, a crucial abstraction for long-term deployment in real ITS infrastructures.

Experimental Integrity

Each aspect of data-handling is logged using MLflow tracking, and Data Version Control (DVC) for transparent end-to-end traceability of preprocessing parameters and feature-engineering conditions. Statistically driven monitoring of drift (mean, variance, skewness, kurtosis) is continuously measured for detecting gradual sensors’ bias, an established observation for long-term IoT operation [16,27]. The complete preprocessing pipeline, including filtering thresholds, scaling, and dataset splits is illustrated in Figure 2, and will be preserved for replication and archived through an open research repository following FAIR data principles (Findable, Accessible, Interoperable, Reusable).

This systematic approach balances scientific transparency with real-world practicalities: it shows that AIP-Urban has indeed been validated using conditions that reflect real smart city deployments, while completely relying on datasets that were open for verification.

The diagram represents the process of correctly working with data from heterogeneous sources, namely traffic sensors, environmental data, and video feeds, by means of synchronisation, filtering, feature scaling, and semi-supervised labelling for ingestion into a hybrid CNN–Transformer-LSTM model.

An autoencoder reconstruction-based method trained on “normal-operation” segments of both datasets was used in order to develop reproducible and transparent preparation of anomaly labels in this semi-supervised manner. The autoencoder was created using a three-layer encoder–decoder architecture, trained using the mean absolute error (MSE) loss and stopping when early validation error occurred after 50 epochs of training.

After training the autoencoder, the distribution of errors in the reconstruction was calculated (e =

‖x - \hat{x}‖

) based on the data used for training. We established our anomalous threshold (τ) as follows: τ = μ_e + 3σ_e.

In this threshold, the terms μ_e and σ_e represent the mean and standard deviation of the reconstruction error when normal. Any data point (i.e., sample) from our collection of data which had an error exceeding τ, (i.e., e > τ) was designated as anomalous and received a pseudo-label indicating such, while the remainder of the data points are identified as “normal”. This method of designating anomalous samples is consistent with conventional practices of unsupervised anomaly detection and eliminates dependence on the annotative processes of humans.

To validate the stability of this pre-labelling stage, we computed basic True Positive (TP) and False Positive (FP) counts by comparing pseudo-labels against the small set of known ground-truth anomaly timestamps provided in CityCam and UA-DETRAC. The autoencoder achieved True Positive rates ranging from 0.87 to 0.92, with corresponding False Positive rates below 0.11. Consequently, we conclude that our pseudo-labels have sufficient reliability for training the hybrid CNN–Transformer-LSTM models. These verified pseudo-labels provided the supervisory signal for the AIP-Urban anomaly detection stage.

3.3. Edge Deployment and Model Design

A hybrid deep learning architecture is implemented in the AIP-Urban framework, which is capable of real-time monitoring through edge connection for energy-aware anomaly detection or predictive maintenance. In this section, we outline the design rationale, training methodology, and optimisation techniques that comprise the framework’s lightweight, interpretable, and on-device attributes.

The model uses a three-stage hierarchical architecture comprising the following:

A convolutional neural network (CNN) for spatial feature extraction;
A transformer encoder for temporal-context modelling;
An LSTM decoder for degradation forecasting.

This hybrid architecture demonstrates the ability to simultaneously capture high-level visual cues (in the CNN) and low-level temporal dependencies (in the transformer and LSTM) across multimodal sensing data

Hybrid Deep-Learning Architecture

In the input stage, fused tensors (≈200 features per 15 s window) are split into numerical and visual branches. A CNN block extracts spatial correlations such as illumination irregularities, edge vibration spectra, or entropy of image brightness using three convolutional layers (filter = 3 × 3, stride = 1). Feature maps are concatenated with normalised sensor readings and fed into the Transformer encoder, which consists of four self-attention heads and positional encodings for modelling long-range temporal interactions. Finally, the LSTM decoder generates a predictive maintenance score (PMS ∈ [0, 1]), which reflects the probability of undergoing degradation within the next 24 h.

In comparison to standard CNN–LSTM baselines, a Transformer improved long-term dependency capture by ≈18% for F1-score and reduced False Positives in the case of irregular illumination events [26].

Edge Optimisation and Quantisation

To enable real-time deployment, the AIP-Urban model was fully executed on two embedded platforms:

NVIDIA Jetson Nano (B01, 4 GB RAM) operating under Ubuntu 20.04 LTS with CUDA 11.4 and TensorRT 8.5 for on-device acceleration.
Google Coral Dev Board (TPU Edge v2) running Mendel Linux 5.10 with the TensorFlow Lite runtime 2.14.

The hybrid CNN–Transformer–LSTM model underwent three compression stages prior to deployment:

Structured pruning using TensorRT sparsity tools removed approximately 50% of low-magnitude weights while maintaining accuracy loss below 1%.
Post-training quantisation (PTQ) converted all convolutional and recurrent layers from FP32 to INT8 precision, reducing the overall model size from 92 MB to 23 MB.
Dynamic inference scheduling was implemented on the Jetson Nano scheduler daemon, adapting inference frequency based on input data entropy and node CPU utilisation.

With the updates mentioned, edge inference performed with a mean latency = 72 ms and an average power draw of 7.8 W, while the Coral TPU version reports similar throughput (≈78 ms) at only 5.9 W. Both devices interfaced with the fog aggregator using MQTT over a 5G MEC deployment to support asynchronous federated averaging. This explicit hardware configuration supports much more transparent reproducibility, while also confirming that AIP-Urban can be operated within the power and timing requirements of modern IoT edge devices [5,22,26].

Explainability and Reliability at the Edge

AIP-Urban also incorporates explainable-AI (XAI) modules in the devices to ensure transparent reporting:

-: SHAP analysis quantifies the feature contribution for each anomaly event, ranking the relevancy of the sensors (e.g., voltage variance > vibration drift).
-: Attention heatmaps from the Transformer demonstrate the most relevant temporal windows that led to the maintenance alerts.
-: A Confidence Index with a measure of probability and entropy dispersion flags uncertain cases as needing human review.

These different modes of interpretability offer enhanced trust with operators and regulatory compliance in a safety-critical traffic system [9,31].

Deployment Workflow

The operation at the edge unfolds as follows and is visually summarized in Figure 3:

Preprocessed feature windows are received via the edge gateway.
The CNN–Transformer-LSTM model performs local inference, and the inference results are output as an anomaly score and predictive maintenance probability.
If the PMS > 0.8, the node issues an alert with its respective SHAP report and attention map.
Every 15 min, there are model updates sent to the fog aggregator for asynchronous federated averaging (FedAvg).
The aggregated weights are then redistributed to all nodes for ongoing self-improvement.

Decentralised training supports low latency, scalability, and resilience while addressing the third challenge of a cloud-centric strategy for ITS [5].

The diagram shows that the deep model has three different stages, on-device quantisation, and federated synchronisation across edge nodes and up to the central fog aggregator.

3.4. Cloud–Edge Coordination for Network-Level Traffic Dependencies

Although performing predictive maintenance and anomaly detection at the edge, AIP-Urban provides a cloud–edge coordination layer that allows for collaborative interaction between pavement segments. In reality, pavement degradation or traffic congestion on one segment can cause issues on adjacent segments. In most cases, local context may be inadequate to explain these longer-range effects.

To enable communication between edge nodes and the upper-layer cloud service, AIP-Urban uses periodic synchronisation of lightweight, aggregated descriptors, rather than continuous streaming of data. Each edge device will send the following to the cloud:

▪: Traffic flow statistics in a compressed vector format (mean, variance, and entropy).
▪: Anomaly scores based on the CNN–Transformer models.
▪: Predictions on the extent of degradation, based on the LSTM module.

The cloud aggregates these summaries into dependency-aware network states that map upstream/downstream relationships. Upon detecting dependencies (e.g., congestion in one location affecting the operation of the downstream segment), the cloud returns a notification to the edge, containing flags that adjust local calculation intervals or anomaly thresholds. This presents a hybrid solution that maintains the autonomy of edge nodes, while providing use of occasional cloud situational awareness.

Finally, the architecture is designed to permit full edge operation, irrespective of cloud connectivity; cloud coordination enhances multi-segment synchronisation but does not provide cloud interaction capabilities. The studies that evaluated performance were conducted as edge-only.

3.5. Mathematical Formulation and Algorithmic Workflow

Model Configuration and Hyperparameter Settings.

To facilitate reproducibility, listed below are important must-have parameters of the hybrid CNN–Transformer–LSTM architecture.

▪: The CNN has three convolutional layers with kernel sizes: {3 × 3, 3 × 3, 1 × 1} with corresponding feature maps of {32, 64, 128}. Each convolutional layer is followed by a ReLU activation and batch normalisation layers after the convolutional layer and every two (2) batch normalisation layers; the CNN also performs max pooling.
▪: The transformer encoder is defined by an embedding dimension of 128; the number of attention heads is equal to 4, with a feedforward dimension size of 256. The decoder includes two encoder blocks and a dropout value of 0.1.
▪: The temporal LSTM predictor includes an LSTM layer with 128 hidden units and an additional fully connected regression/detection head.
▪: Various optimisation parameters include Adam optimisation (with a learning rate equal to 0.001), with a maximum batch size of 32; early stopping (Patience = 10) with 120 maximum epochs.
▪: The compression stages include structured pruning (to 50% sparsity) and a use-of-INT8 post-training quantization with TensorRT for Jetson Nano and with TFLite for Coral TPU.

All datasets were used consistently across all experiments to ensure that they are deterministically utilised:

For CityCam, 70% of data were used to train, 15% for validation, and 15% for testing (36,100 training samples, 7700 validation samples, and 7700 test samples).
For UA-DETRAC, the 70/15/15 method is the same across 83 sequences.
The PEMS-BAY datasets were divided as follows: 70/15/15 across 325 daily sequences of multivariate sensor streams.
The SUMO simulation datasets were divided as follows: 70/15/15 (4800 synthetic congestion episodes).

To evaluate the contributions of all components of the model, the authors repeated the training process (i.e., training/validation/test) simulated on identical training patterns. All training patterns were used during the evaluation of the sample model and results are summarised in Table 2.

The AIP-Urban framework is based on a hybrid deep learning paradigm that combines spatial, temporal, and contextual reasoning to predict the deterioration of infrastructure. The model is mathematically considered as three complementary submodels: a convolutional feature extractor submodel, a Transformer-based temporal encoder submodel, and a recurrent LSTM-based deterioration predictor submodel.

The complete process is formalised in the following subsections.

Multimodal Input Representation

Let X = \{x_{t}^{(1)}, x_{t}^{(2)}, \dots, x_{t}^{(m)}\}, t = 1, \dots, T

(1)

be the multimodal feature vector at time t, where each

x_{i}^{(t)}

represents one sensor modality (traffic flow, temperature, vibration, illumination, etc.), and mmm is the total number of features (≈200).

After normalisation, input tensors are represented as

X \in R^{T \times m}

.

For visual streams (CityCam/UA-DETRAC), each frame

I_{t} \in R^{H \times W \times 3}

is transformed into a compact feature embedding through a convolutional encoder.

CNN-Based Spatial Feature Extraction

The convolutional block extracts local spatial patterns such as surface degradation or illumination irregularities:

F_{c} = σ ({C o n v}_{3} (X; W_{c}, b_{c}))

(2)

where Wc, b_c are convolutional weights and biases, and σ(⋅) is the ReLU activation.

These feature maps

F_{c} \in R^{T \times d_{c}}

are concatenated with numerical sensor features to produce a joint embedding:

Z_{0} = [F_{c}; X]

(3)

Transformer Temporal Encoding

To capture long-range dependencies, AIP-Urban employs a Transformer encoder with multi-head self-attention:

A t t e n t i o n (Q, K, V) = s o f t m a x (\frac{Q K}{\sqrt{d_{k}}}) V

(4)

where

= Z_{0} W_{Q}

,

K = Z_{0} W_{K}

, and

V = Z_{0} W_{V}

The encoder output is computed as

Z_{e} = L a y e r N o r m (A t t e n t i o n (Q, K, V) + Z_{0})

(5)

providing contextualised temporal embeddings invariant to sensor sampling irregularities.

LSTM-Based Degradation Prediction

The recurrent LSTM module models the temporal evolution of degradation indicators:

h_{t}, c_{t} = L S T M (Z_{e}, h_{t - 1}, c_{t - 1})

(6)

where h_t and c_t denote the hidden and cell states.

The final output represents the predictive maintenance score (PMS):

\hat{y_{t}} = σ (W_{o} h_{t} + b_{o})

(7)

With

\hat{y_{t}}

∈ [0, 1] denoting the probability of failure within the next 24 h.

Loss Function and Objective Optimisation

The training objective jointly minimises prediction error and anomaly classification error.

The overall loss L is defined as

L = α_{1} L_{M A E} + α_{2} L_{B C E}

(8)

where

L_{M A E} = \frac{1}{N} \sum_{i = 1}^{N} |y_{i} - \hat{y_{i}}|, L_{B C E} = - \frac{1}{N} \sum_{i = 1}^{N} [y_{i} \log (\hat{y_{i}}) + (1 - y_{i}) \log (1 - \hat{y_{i}})]

(9)

The coefficients α₁ = 0.6 and α₂ = 0.4 balance regression and classification performance.

Optimisation is carried out using the Adam optimiser with an initial learning rate of 10⁻³.

Predictive Maintenance Decision Logic

The final decision rule at each node is defined as follows:

Trigger maintenance alert if

\hat{y_{t}} \geq τ, w h e r e τ = 0.8

This threshold was chosen based on empirical evidence to achieve a balance between false alarms and missed detections (approx. F1 = 0.94). Each alert is presented with SHAP feature attribution vectors (SHAP 2020) and Transformer attention maps for interpretability.

Algorithmic Workflow

The operational logic of AIP-Urban is summarised in Algorithm 1 below. It presents the full hybrid inference and decision-making pipeline that is deployed at the edge of the network.

The workflow will consist of multimodal data fusion (MDM-Intelligence Workflow), spatio-temporal feature encoding (predictive maintenance forecasting), and explainable decision-making (decision generation for maintenance) into one edge AI process.

Algorithm 1. AIP-Urban Hybrid CNN–Transformer–LSTM Inference and Maintenance Pipeline

Input: Multimodal data stream X(t) from edge sensors
Output: Predictive maintenance score (PMS) and interpretable alert report

1. INITIALISATION
1.1 Load quantised model parameters {W_c, W_Q, W_K, W_V, W_o, b_o}
1.2 Set hyperparameters: window length Δt = 15 s, threshold τ = 0.8
1.3 Initialise hidden states h₀, c₀ ← 0,0

2. DATA ACQUISITION AND PREPROCESSING
2.1 Receive synchronised sensor packets via MQTT edge gateway
2.2 Apply Kalman filtering to remove transmission noise
2.3 Perform Min–Max normalisation and KNN-based interpolation
2.4 Construct feature tensor X ∈ ℝ^(T×m) (≈200 features)

3. SPATIAL FEATURE EXTRACTION
3.1 Compute convolutional maps:
F_c ← ReLU(Conv3(X; W_c, b_c))
3.2 Concatenate visual and numerical embeddings:
Z₀ ← [F_c; X]

4. TEMPORAL ENCODING VIA TRANSFORMER
4.1 Compute multi-head self-attention:
A ← Softmax((Q K^T)/√d_k) V, where Q,K,V ← Z₀ W_Q,W_K,W_V
4.2 Apply residual and normalisation layers:
Z_e ← LayerNorm(A + Z₀)

5. DEGRADATION FORECASTING WITH LSTM
5.1 Update hidden states:
(h_t, c_t) ← LSTM(Z_e, h_{t−1}, c_{t−1})
5.2 Estimate predictive maintenance score:
ŷ_t ← σ(W_o h_t + b_o)

6. DECISION AND INTERPRETABILITY
6.1 If ŷ_t ≥ τ then
Generate maintenance alert
Compute SHAP importance values for {x₁…x_m}
Extract attention heatmap from Transformer encoder
Compose interpretability report R_t = {ŷ_t, SHAP, Heatmap}
Else
Continue monitoring
End If

7. FEDERATED SYNCHRONISATION
7.1 Every 15 min, transmit local model weight updates ΔW to fog aggregator
7.2 Receive global weights W* after FedAvg aggregation
7.3 Update local model: W ← W*

Return: PMS = ŷ_t and interpretability report R_t

Lines 2–3 implement real-time multimodal fusion ensuring low-latency data consistency across heterogeneous sensors.
Lines 4–5 form the core learning pipeline, merging attention-based temporal reasoning and recurrent memory for degradation forecasting.
Line 6 introduces explainable intelligence directly at the edge through SHAP and attention-based interpretability.
Line 7 encapsulates federated synchronisation, enabling distributed self-learning without raw data exchange.

This structured inference routine operationalises AIP-Urban as a fully autonomous, interpretable, and sustainable edge AI agent capable of predictive decision-making for critical urban infrastructure.

Figure 4 visualises the sequential hybrid inference pipeline comprising seven stages: (1) real-time data acquisition from multimodal IoT sources through the edge gateway, (2) noise reduction and data normalisation, (3) spatial feature extraction via CNN encoder, (4) temporal dependency encoding using Transformer attention, (5) degradation forecasting through the LSTM predictor, (6) explainable decision generation with SHAP-based attribution and attention heatmaps, and (7) federated synchronisation between edge nodes and fog aggregator for model updating.

This integrated workflow enables autonomous, low-latency, and interpretable predictive maintenance for urban traffic infrastructures

The model complexity is O(T⋅dk2) for Transformer attention and O(T⋅dh) for the LSTM module, resulting in overall inference cost linear in sequence length T. After pruning and INT8 quantisation, memory usage decreases from 92 MB to 23 MB, enabling execution within the 7.8 W energy budget on Jetson Nano (latency ≈ 72 ms).

3.6. Experimental Setup and Computational Metrics

The performance of the proposed AIP-Urban framework was evaluated through a rigorous and reproducible experimental protocol designed to quantify both computational efficiency and environmental sustainability under realistic urban traffic conditions. All experiments were performed using embedded edge AI devices and datasets representative of smart city infrastructures, following standard ITS benchmarking methodologies [5,23,32].

Evaluation of AIP-Urban is performed using two different validation methods, one using multiple historical datasets to test accuracy and the other using real-time data to determine performance characteristics such as latency, power consumption, F1-score, RMSE, and MAE. All training and cross-validation of AIP-Urban were conducted using offline approaches based on data generated from historical datasets. The data generated included ground-truth labels and held consistent partitioning across the four (4) datasets (70% training, 15% validation, 15% testing), providing statistical evaluation of the results from AIP-Urban.

Once the models trained using the above methods were complete, they were placed on Jetson Nano B01 and Coral TPU v2 hardware platforms where they generated real-time inference streams operating at 1 Hz, mimicking real-world application scenarios at traffic intersections. The performance metrics included in Section 4 are derived from the real-time inference, not from offline measurements. The hybrid approach used to validate AIP-Urban ensures that it will be validated against historical datasets for the purpose of evaluating accuracies and will also be validated under real-time operational conditions, therefore providing a validation of the practical robustness of the AIP-Urban model.

3.6.1. Hardware and Software Environment

Two embedded platforms were used for on-device deployment:

NVIDIA Jetson Nano B01 (4 GB RAM) running Ubuntu 20.04 LTS, CUDA 11.4, cuDNN 8.9, and TensorRT 8.5;
Google Coral Dev Board (TPU Edge v2) running Mendel Linux 5.10 with TensorFlow Lite 2.14 runtime.

The hybrid CNN–Transformer–LSTM model was trained on a workstation equipped with an Intel Core i7-12700K CPU, 32 GB RAM (Intel Corporation, Santa Clara, CA, USA) and NVIDIA RTX 3070 GPU prior to deployment.

Model quantisation and pruning were conducted using TensorRT and TFLite converters, and all runtime measurements were directly collected on the embedded boards via integrated INA219 current sensors (Jetson) and Coral Power Monitor (TPU).

3.6.2. Dataset Segmenting and Training Method

The multimodal dataset consisted of combined data streams from CityCam, UA-DETRAC, PEMS-BAY, and SUMO-simulated scenarios partitioned as follows: 70% training; 15% validation; and 15% testing. During training, parameters were batch size = 32; learning rate = 0.001; optimizer = Adam; early stopping with patience of 10 epochs (based on validation MAE); and maximum number of epochs = 120. Experiments were undertaken in TensorFlow 2.14 and PyTorch 2.2 with a fixed random seed (42) for deterministic repeatability.

3.6.3. Compression and Quantisation Procedure

To enable real-time performance, the hybrid model underwent three stages of optimisation:

a.: Structured pruning (≈50% weight sparsity) through the TensorRT sparse-matrix compression;
b.: Post-training quantisation (INT8) of the convolutional, attention, and recurrent layers reduced the model size from 92 MB to 23 MB;
c.: Dynamic inference scheduling on the Jetson Nano daemon allowed the Jetson to dynamically alter inference frequency from sensor-data entropy and CPU load.

These optimisations achieved an average latency of 72 ms and mean power consumption of 7.8 W on the Jetson Nano, and 78 ms and 5.9 W on the Coral TPU, fully meeting the real-time ITS requirement.

3.6.4. Computational Performance Assessment

The computational performance metrics refer to the accuracy and timeliness of the model for predictive maintenance and anomaly detection applications.

(a): Mean Absolute Error (MAE)

M A E = \frac{1}{N} \sum_{i = 1}^{N} |y_{i} - \hat{y_{i}}|

(10)

MAE measures the average deviation between the predicted degradation score

\hat{y_{i}}

and the true label

y_{i}

(b): Root Mean Square Error (RMSE)

R M S E = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(y_{i} - \hat{y_{i}})}^{2}}

(11)

RMSE penalises large prediction deviations and complements MAE in evaluating regression consistency.

(c): Precision, Recall, and F1-Score

For binary anomaly detection, the confusion-matrix components, True Positive (TP), False Positive (FP), and False Negative (FN), yield:

P r e c i s i o n = \frac{T P}{T P + F P ’}

(12)

R e c a l l = \frac{T P}{T P + F N ’}

(13)

F 1 = \frac{2 \times P r e c i s i o n \times R e c a l l}{P r e c i s i o n + R e c a l l}

(14)

The F1-score represents the harmonic mean of precision and recall, ensuring balance between missed detections and false alarms.

(d): Inference Latency (L)

Latency corresponds to the mean forward-pass execution time per data window, measured in milliseconds (ms) on each edge platform:

L = \frac{1}{M} \sum_{j = 1}^{M} (t_{j}^{e n d} - t_{j}^{s t a r t})

(15)

where

t_{j}^{s t a r t}

and

t_{j}^{e n d}

are timestamps of the jth inference cycle.

(e): Model Footprint (S)

Model size S was obtained from serialised binaries after pruning and INT8 quantisation.

It serves as an indirect indicator of memory efficiency and deployability.

3.6.5. Energy and Sustainability Metrics

The environmental evaluation quantifies the energy cost of on-device inference and its carbon emission equivalence.

(a): Power Consumption (P)

Instantaneous power draw (in watts) was recorded via an INA219 current sensor on the Jetson Nano and the Coral Power Monitor on the TPU Edge v2.

Average power is computed as

P = \frac{1}{T} \int_{0}^{T} V (t) I (t) d t

(16)

where V(t) and I(t) are voltage and current readings.

(b): Energy Usage (E)

Cumulative energy consumed during a test interval Δt is given by

E = P_{a v g} \times ∆ t

(17)

The unit is watt-hours (Wh) or kilowatt-hours (kWh).

(c): Carbon Emission Equivalent (CO₂ eq)

The environmental footprint is estimated according to the MachineLearningImpact methodology:

C O_{2 e q} = E \times γ

(18)

where γ = 0.475 kg CO₂/kWh is the conversion factor for regional electricity mix. It will allow a direct comparison of ecological efficiency between AIP-Urban and cloud-based methods.

All inference-time metrics were recorded using MLflow and were synced with DVC for version control. Each experiment was repeated five times to ensure statistical robustness, with reported values presented as mean ± standard deviation. Energy and CO₂ calculations were normalised with respect to the total number of inferences to yield per-prediction efficiency values. These metrics together provide a standardised test of predictive accuracy, real-time responsiveness, and ecological sustainability, and represent a quantitative basis for Section 4 (Experimental Results and Performance Evaluation), which compares the performance of AIP-Urban with baseline deep learning models across the same conditions.

The AIP-Urban architecture uses a light weight federated synchronisation mechanism which is being developed and will eventually be applied across multiple intersections. However, all experiments reported in this work were performed only on individual edge devices (Jetson Nano B01 and Coral TPU Dev Board), with no communication rounds between nodes, no gradient aggregation between nodes, and no cross-node model updates during the benchmark results. Therefore, all values for latency, energy, and accuracy reported in Section 4 are solely related to isolated on-device inference when under an edge environment. Enabling the development of a federated version of AIP-Urban would require a more in-depth analysis of (i) communication overhead incurred per round, which typically equates to 1.2–1.7 MB of total data transferred between nodes (this data can be referred to as gradient information), (ii) non-IID (independently and identically distributed) data distributions found at each of the intersections, and (iii) the security climate surrounding the relationships among the nodes, including poisoning-resiliency and secure aggregation. These challenges are prominent in large-scale urban applications, but they are not addressed by the current evaluation and will be listed as recommendations for future research in Section 6.

AIP-Urban has a simplified federated synchronisation system, which allows deployment across multiple intersections; however, all experiments outlined in this paper used “single-edge” nodes (specifically a Jetson Nano B01 and a Coral TPU Dev Board). While this federated aspect of the architecture exists, it was not utilised during the benchmarking process to produce latency, energy usage, and inference accuracy that exclusively reflected the operation of the embedded model. No rounds of communication or aggregating of data from several nodes were carried out, nor were there any instances of inter-node data disparity. As such, validation of multi-node federated learning has been postponed until subsequent research when the effect of variable communication capacity, variable frequency of updates, and how variable types of data across nodes will all affect the overall results can be assessed in a rigorous manner.

4. Experimental Results and Performance Evaluation

In this section, we provide the quantitative and qualitative results of the AIP-Urban framework under genuine datasets and embedded implementations. Metrics outlined in Section 3.5 were consistently used to assess predictive accuracy, anomaly detection trustworthiness, latency, energy efficiency, and interpretability.

4.1. Quantitative Evaluation

All experiments in this section were conducted on single-edge nodes only, without any federated communication or cross-device aggregation, in order to isolate the true on-device performance of AIP-Urban.

Quantitative evaluations of the AIP-Urban framework were conducted on three representative datasets, CityCam, UA-DETRAC, and PEMS-BAY, as well as with synthetic congestion scenarios in SUMO. Together these datasets cover the main modalities of urban traffic infrastructures: vision-based signal analysis, mobility flow of vehicles, and multivariate sensor readings.

The assessments focused on two main dimensions:

(i): Predictive fidelity as a maintenance indicator for sensor and devices;
(ii): Anomaly detection in real-time multimodal streams of data.

Baseline models (CNN, LSTM, GRU, and Federated LSTM) were trained and deployed in the same manner as described in Section 3.5, namely, all went through the same train/validation/test splits and ran on the same hardware (Jetson Nano B01 and Coral TPU Edge v2).

The data presented in Table 3 indicate that AIP-Urban yields the lowest prediction error (MAE = 4.2) and highest anomaly detection reliability (F1 = 0.94), as well as an average inference latency of 72 ms, easily meeting real-time constraints of controlling intersections at the level of the study sites.

AIP-Urban achieves ≈ 7% improvement in accuracy, ≈21% improvement in latency, and 18% reduction in CO₂ emissions compared to the strongest baseline (Federated LSTM), confirming high computational and ecological efficiency. The statistical significance of reductions in latency and F1-score was confirmed with the Wilcoxon signed-rank test (p < 0.05) over 10 independent runs. The MAE was [4.11; 4.36] for the 95% confidence interval with narrow dispersion and high model stability.

4.1.1. Dataset-Wise Evaluation

To assess robustness, we evaluated AIP-Urban separately on each dataset, with the corresponding results presented in Table 4.

AIP-Urban shows exceptional and consistent performance across all test datasets. The small F1 variation (±0.02) demonstrates its strong generalisation ability across heterogeneous datasets, which is a prominent shortcoming of many deep models in ITS, as they tend to overfit one data domain ([18,21,23]). The results, as well as CityCam data, confirmed the ability of the model to accommodate variation in illumination, while UA-DETRAC validated its resilience to camera jitter and partial occlusion. In PEMS-BAY data, the multiplicative, multi-headed attention provided in the transformer was important in modelling long-term temporal correlations (i.e., aided improvement of forecasting). For SUMO data, where noise patterns are inherently stochastic, this was stabilised using the LSTM temporal layer, which acknowledged the non-stationary nature of traffic flows, as well as improved stability of predictions, especially for Evanston.

4.1.2. Latency–Accuracy Trade-Off

As shown in Figure 5, the trade-off curve shows that AIP-Urban achieved the highest F1-score at the lowest latency, thus breaking the traditional inverse relationship inherent during inference using deep-models. The CNN models achieved moderate accuracy at high-speed, while the LSTM models achieved accuracy at higher latency. However, using a hybrid model, in AIP-Urban, ontology-level temporal reasoning is enabled without compromising real-time, responsive qualities to the environment.

This balance is made possible by the (i) contextual compression in the transformer encoder, and (ii) entropic-driven dynamic scheduler that decreased inference frequency, thereby creating less computational load but still adequately coping with the variance of the sensors. Statistically, this development allowed AIP-Urban to save ≈35% latency compared to the CNN-LSTM hybrid model on average, all while improving F1 by ≈6%, thus demonstrating the efficiency of the architectural design.

4.1.3. Error Distribution and Robustness

Figure 6 depicts the distribution of both MAE and RMSE across the test folds. AIP-Urban’s error distribution (σ = 0.28) is tighter than all baselines, with a stable convergence and consistent error behaviour even in uncomplete and noisy data. Residual analysis shows that error spikes correspond to sudden and unpredictable environmental disturbances (e.g., sudden shifts in lights in the case of CityCam, or synthetic delays in network in the case of SUMO). Nevertheless, even with this sudden type of perturbation, the model can recover the nominal prediction accuracy in less than two cycles, suggesting self-stable capability under perturbation.

4.1.4. Statistical Validation

To further assess performance consistency, a one-way ANOVA was performed on the MAE values across all models and datasets. The results yielded F(4, 45) = 12.73, p < 0.01, that is, there is statistically significant evidence of a difference between models. Furthermore, Tukey HSD post hoc testing established that AIP-Urban is statistically different (p < 0.05) from the CNN, LSTM, and GRU models but not from the Federated LSTM model, underlining that the proposed architecture is statistically superior to traditional architecture while remaining competitive with advanced distributed learning models. It is evident through the above results that AIP-Urban contributes the following:

▪: Provides the state-of-the-art predictive accuracy while maintaining an inference latency of <80 ms;
▪: Has cross-dataset generalisation ability which is critical in heterogeneous smart city infrastructures;
▪: Demonstrates statistical robustness and reliability as confirmed by Wilcoxon and ANOVA tests;
▪: Delivers computational sustainability that balances accuracy, speed and energy footprint.

In summary, AIP-Urban presents a new benchmark for edge-enabled deep learning performance in the predictive maintenance of urban traffic infrastructures.

4.2. Time-Series Forecasting and Degradation Detection

The time-series forecasting evaluation emphasises AIP-Urban’s capacity to provide hypothetical degradation trends and detect anomalies across heterogeneous sensor modalities. Unlike other traditional detection models, our system, AIP-Urban, creates a predictive model to anticipate failure risk for any urban traffic infrastructure up to a 24 h horizon in advance by combining video and numerical sensor time-series sequences.

The results are indicated across four representative scenarios which match the primary types of infrastructure degradation applied within urban mobility systems: electrical, optical, mechanical, and traffic flow degradation. Each prediction scenario was compared with the ground-truth time-series. All prediction scenarios had estimated confidence intervals mapped onto five-fold cross-validation runs. The time-series is illustrated in the upcoming figures, showing both its temporal progression and the predictive margin preceding the actual failure event, marked by a red vertical line.

4.2.1. Electrical Degradation: Traffic Light Voltage Stability

This instance serves as the predictive maintenance for the traffic light alternatives. The voltage flow is degraded with a much slower rate to environmental stress (temperature and moisture).

AIP-Urban accurately tracks this degradation trend, achieving MAE = 4.1 and a strong correlation of R² = 0.93, outperforming LSTM-based baselines in both precision and timing.

In Figure 7, we see that the predicted curve (green) matched the ground-truth (blue) with a confidence around 0.85 before the actual failure occurred, and the point in time of the actual electrical failure happened is represented by the red vertical line. AIP-Urban provided an alert of failure for ≈22 h before the electrical fault actually happened, showing the ability to be anticipatory for maintenance capabilities.

4.2.2. Optical Degradation: Camera Luminance Entropy

Degradation in optical sensing can also come from contamination of the lenses or light source(s) becoming unbalanced. AIP-Urban was applied to entropy sequences extracted from the frames of the CityCam videos. AIP-Urban was more robust to illumination drift than GRU or CNN-LSTM-based baselines with RMSE = 4.9 and F1 = 0.93.

Earlier in Figure 8, AIP-Urban detected deviations of entropy roughly 28 h ahead of the visual obscuring of the lens and actual contamination of the lenses occurred, allowing for the potential of cleaning or calibrating the lens to a preferred visual quality. The blue shaded area represents the interval of uncertainty of the entropy test (±σ) and shows that the predictions were robust in light of predicted sudden changes in contrast in the video.

4.2.3. Mechanical Degradation: Sensor Vibration Amplitude

Data from an accelerometer from pole-mounted vibration sensors was used to simulate mechanical instability. Degradations in performance are typical for structures as loads related to structural resonances grow higher as a result of wind loading or mechanical looseness. AIP-Urban demonstrated an MAE = 4.3 and is capable of counting stable oscillation re-construction and early deviation detection.

Figure 9 demonstrates that an anomaly in vibration amplitude was predicted approximately 25 h prior to the measured instability. This early deviation confirms that the temporal LSTM layer successfully captures cyclical dependencies in oscillatory signals, enabling reliable early-warning behaviour for mechanical degradation.

4.2.4. Traffic Flow Dynamics: Congestion and Drift Prediction

To measure scalability when using network-level data, AIP-Urban was validated using SUMO-based traffic flow data in roadway scenarios and with sensor network data that was a representative sample from PEMS-BAY. The model achieved an R² of 0.91 and produced a forecast horizon of nearly 24 h in advance of congestion.

As seen in Figure 10, the predicted flow curve aligns well with the observed measurements throughout most of the forecast horizon. The minor discrepancies before congestion onset (red line) are compensated through the spatio-temporal attention of the Transformer encoder. AIP-Urban also lowers cumulative forecast error by ≈18% compared to CNN and LSTM networks, while also yielding a time-series with smoother transitional periods without phase lag.

4.2.5. Statistical Consistency and Cross-Dataset Generalisation

To quantify forecasting robustness, Pearson correlation between predicted and actual degradation trends was computed for each modality:

The high R² (>0.9) value demonstrates that all the modalities showed some generalisation across sensing environments. The common early-warning opportunity of between 22 and 28 h demonstrates that the hybrid architecture offers the possibility of maintenance scheduling in real-time without human involvement. This concept of temporal anticipation is a considerable advantage compared with reactive techniques of AI usually found in ITS.

4.2.6. Interpretation

Overall, the findings indicate the AIP-Urban’s performance as not only an anomaly detector, but as a proactive predictive agent that can surmise degradation of the infrastructure before it fails. This predictive capacity affords operational resilience to limit downtime and traffic management sustainability to limit unnecessary service disruption. The joint synergy of spatial attention (Transformer) and temporal recurrence (LSTM) make up the conduits for the predictive capacity and ongoing generalisation of the model.

Across all degradation modes, AIP-Urban exhibited a strong consistent performance in providing early-warning capability to facilitate the anticipation of infrastructure failure at a significantly higher lead time than actual failure. As seen in Figure 7, Figure 8, Figure 9 and Figure 10, electrical degradation was predicted approximately 22 h prior to the electrical systems experiencing voltage instability; optical degradation was predicted approximately 28 h in advance of the physical occurrence of light obstructions; mechanical vibration anomalies were predicted approximately 25 h prior to structural measurements showing instability; and traffic flow degradation was predicted approximately 24 h prior to the occurrence of congestion. The early-warning lead times provided by AIP-Urban are summarised in Table 5. These measures indicate that in addition to anomaly detection capabilities, AIP-Urban also provides an operational relevance, enabling users to schedule proactive maintenance based on the early-warning capability in real-world smart city implementations.

4.3. Computational Efficiency and Scalability Analysis

The computational evaluation was setup to give confidence to the AIP-Urban’s ability to preside over multiple edge AI nodes in real-time inference, while still maintaining low energy consumption and thermal stability. The evaluation was conducted on the Jetson Nano B01 and Google Coral TPU Edge v2 designed to run for continual 24 h reflection.

4.3.1. Edge-Device Benchmarking

Both platforms operated within their nominal power envelopes (Jetson Nano: 7.8 W, Coral TPU: 5.9 W) under standard traffic workloads. Table 6 summarises the mean inference latency, throughput, and energy efficiency across single-node operation.

The quantised INT8 version of the model achieves a 3.9× size reduction relative to the full-precision TensorFlow build, while sustaining real-time throughput (>12 fps).

Thermal monitoring via INA219 sensors confirmed that no board exceeded 60 °C under 24 h stress tests.

4.3.2. Latency vs. Model Size Trade-Off

Latency scales approximately linearly with model size up to 30 MB, after which memory swapping begins to dominate. As shown in Figure 11, the proposed AIP-Urban achieves the lowest latency (≈72 ms) despite its hybrid architecture, outperforming the pruned LSTM and unoptimised Transformer baselines. Quantisation and structured pruning collectively reduced latency by ~35%, validating their importance for embedded deployments.

4.3.3. Scalability Across Distributed Nodes

Driven by a multi-node scalability study (1 → 10 simultaneous edge devices) designed to approximate deployment across multiple intersections, three distinct phase assessments were being undertaken. Inference throughput on the iterative experiment exhibited a linear scaling trend in throughput up to five nodes (r = 0.98) before reaching minor network-contention saturation. In subsequent experiments, Figure 11b illustrates the throughput trend that further substantiates AIP-Urban’s scalability potential for bandwidth-constrained devices. The average energy overhead associated with each additional participant node remained <0.4 W, indicating the fundamental premise of sustainable distributed operation.

4.3.4. Memory Footprint and Thermal Profile

During inference, RAM usage was maintained within a range of 3.2 GB for the Jetson Nano and 2.8 GB for Coral TPU. The drift in thermal conditions remained limited to <3 °C throughout 24 h of cycles, thus proving acceptable stability for feasible deployment on site without active cooling. No significant degradation in inference performance results was observed after running the 1000 inference iterations, further confirming the long-term reliability of inference performance.

4.4. Explainability and Feature Contribution Analysis

Explainability assessments were being undertaken using SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) in order to probe the level of predictive outcomes attributed to each individual sensor modality in some capacity.

4.4.1. SHAP Feature Importance

Using a summary of SHAP output, as illustrated in Figure 12, exhibits that voltage variance, entropy drift, and vibration kurtosis were the most salient contributors to predictive alerts. Furthermore, these three variables account for >61% of the total model attribution across all dataset variants. This gives credit to the AIP-Urban not only being able to detect anomalies but effectively link them back to physical causes and, as such, build trust and diagnostic interpretability.

4.4.2. LIME Local Explanations

LIME was used to visualise the influence of local features on the decision boundary for each predicted failure case. For cases involving optical degradation, LIME attribution maps demonstrate local areas of the image where luminance imbalance resulted, which corresponded to visual inspection. This behaviour guarantees a level of operational transparency for city engineers to plan for future maintenance.

4.4.3. Cross-Correlation Between Modalities

Correlation heatmaps of voltage, entropy, vibration, and flow showed latent coupling among sensor modalities; frequently voltage drops occurred roughly 6 h before spikes in vibration.

The prediction of temporal relations is indicative of the hybrid architecture’s capacity to model the elevated cross-modal causality exhibited across urban traffic infrastructure.

4.5. Summary of Key Findings

Based on the experimental analysis, the results lead to five key conclusions:

a.: AIP-Urban was able to outperform baseline models for accuracy (F1 = 0.94) and latency (<80 ms) across the real-world datasets.
b.: The longitudinal prediction of 24 h early-warning capabilities proved reliable for all modalities (voltage, optical, mechanical, flow).
c.: The energy-aware design demonstrated a reduction in CO₂ emissions of almost ≈ 18%, demonstrating the environmental benefits of edge-native AI as a practice.
d.: The interpretive capabilities of the output of the model through SHAP, or the local feature attribution of the model using LIME, allows results of AI inference to be communicated to field diagnostic.
e.: The scalability experiments confirmed linear throughput up to 10 edge nodes without thermal drift.

The findings above provide empirical evidence that AIP-Urban is a suitable, interpretable, and energy-efficient framework for the predictive maintenance of next-generation urban traffic infrastructures, setting the stage for the discussion and comparative analysis presented in Section 5.

5. Discussion and Comparative Analysis

The comparative assessment of AIP-Urban against recent state-of-the-art frameworks reveals its clear superiority in accuracy, efficiency, and scalability. As summarised in Table 6, the proposed system is benchmarked against representative studies published between 2023 and 2025 including Reis et al. (2025) [5], Lokhande et al. (2025) [7], Alotaibi et al. (2025) [21], Shabaz et al. (2025) [23] and Ghasemi et al. (2025) [26]. While most of these approaches rely on cloud-centric processing and deliver inference latencies above 120 ms, AIP-Urban achieves F1 = 0.94 and an average latency of 72 ms, maintaining real-time responsiveness with an average energy draw of 7.8 W.

This quantitative comparison in Table 7 clearly demonstrates that the hybrid CNN–Transformer–LSTM design achieves a 7–9% accuracy improvement and a 30% latency reduction compared with leading baselines such as Federated LSTM [22] or Ghasemi’s Edge GNN [26]. Furthermore, it is the only framework that couples predictive maintenance and anomaly detection with explainable-AI mechanisms, as evidenced by the integration of SHAP/LIME attribution analysis (Figure 12).

The performance variations expressed in Table 5 can be directly attributed to AIP-Urban’s two innovations in both architectural synergies and hardware optimisation. This model incorporates hybrid spatio-temporal pipelines where convolutional feature extraction is used with Transformer-based contextual reasoning and LSTM-driven temporal prediction to accommodate learning from heterogeneous sensor data. At the same time, quantisation and structured pruning improved computation load, allowing for improved throughput on embedded edge processors without cloud dependency or risk of thermal impact. Aside from computational improvements, AIP-Urban provides a new layer of trust and interpretability in the field of predictive infrastructure maintenance. The SHAP analysis (Figure 12) shows that voltage variation, entropy drift, and vibration kurtosis were the three most significant factors, explaining jointly over 61% of the predictive behaviour. By connecting algorithmic reasoning to measurable physical variables, the system supports explainable diagnostics that can help connect artificial intelligence inference with engineers’ decision-making, a detail often overlooked in previous research [7,16,23].

Moreover, and in terms of the sustainability considerations, from AIP-Urban’s energy-aware design, we achieved ≈18% less CO₂ emissions per inference cycle and near-linear scalability up to 10 nodes with the increase in power costing less than 0.4 W per node to operate. In this regard, AIP-Urban is one of the few frameworks that will allow for high-performance predictive analytics while being ecologically neutral and transparent to operations. Overall, the evidence provided in Table 5 and Figure 11 and Figure 12 supports the claims of AIP-Urban outperforming contemporary edge AI systems on accuracy, latency, interpretability, and energy efficiency, yielding a benchmark for next-generation smart mobility infrastructures.

AIP-Urban has been tested for edge deployments in a controlled evaluation environment. Continuous 24 h inference with fixed sampling rates of video and sensor streams, as well as stable operating temperatures of Jetson Nano and Coral TPU devices, were features of the controlled environment. Although AIP-Urban exhibited low latency and consistent power consumption within this controlled environment, practical deployment in real life will be subject to additional constraints that may arise from thermal drifts, seasonal variations in ambient lighting, degradation due to device age, and sensor calibration drift. To reduce long-term model drift in real-world deployments, the AIP-Urban system could also employ other methods for adapting models over time such as: micro-retraining models periodically based on newer data collected from the edge, using entropy based approaches for identifying changes in model performance (drift) that will allow for retraining of models via refreshing/updating; incremental learning modules that allow for gradual training of models rather than a complete retraining of models on all data. Although not currently in place in the experimental system, these techniques would increase the ability of the models to cope with seasonal shifts and variability in the behaviours and environments and new traffic pattern changes.

Also it must be noted that the experimental validation of AIP-Urban was accomplished with isolated single-node test configurations; therefore, when deploying AIP-Urban across multi-intersections, it will be necessary to consider hardware heterogeneity, intermittent connectivity between intersections, and synchronisation policies. Finally, the explanations provided by SHAP’s output are of considerable value to field technicians since they provide insights that allow technicians to determine the degree of failure or malfunction before sending out a maintenance technician on a scheduled basis (inspection for power fluctuations, entropy drift in luminance, kurtosis of vibration, etc.) and highlight the practical benefits of the diagnostic aspect of AIP-Urban in an operational environment.

Although AIP-Urban performs with high accuracy, it is important to note that the assessment was mainly for benchmark datasets and two edge platforms (Jetson Nano B01 and Coral TPU v2). Verifying long-term robustness and generalisability in situ conditions remains limited where more real-world multi-intersection validation, larger hardware diversity, and deeper multimodal sensing (ex: acoustic, or thermal data) will be of great importance.

6. Conclusions and Future Work

In conclusion, we have developed AIP-Urban, a deep spatio-temporal framework capable of predictive maintenance and anomaly detection in urban traffic infrastructures. By combining CNN–Transformer–LSTM hybrid learning and on-device inference at the edge, we achieved real-time fault prediction and visual anomaly detection at a latency of ≈72 ms and power of ≈7.8 W, which is ideal for embedded ITS operations. The experiments conducted across the CityCam, UA-DETRAC, PEMS-BAY, and SUMO datasets show consistency in accuracy (F1 = 0.94) and improved robustness to baseline deep models. Besides the empirical performance, the major contribution of AIP-Urban is it is interpretable, energy-aware architecture, which will improve the operational reliability of smart mobility systems.

In the direction for future work, this framework should be developed toward federated edge coordination across multi-intersections, self-adaptive model compression, and additional sensing modalities (acoustic, LIDAR, environmental). Each facet will help to evolve AIP-Urban towards a fully autonomous, ecologically neutral and transparent AI ecosystem for next-gen connected cities.

Author Contributions

Conceptualization, W.A.; methodology, W.A. and M.A.; software, W.A.; validation, W.A. and M.A.; formal analysis, W.A. and M.A.; data curation, W.A.; writing—original draft preparation, W.A. and M.A.; writing—review and editing, W.A. and M.A.; visualization, W.A. and M.A.; All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Deanship of Research and Graduate Studies at the University of Tabuk grant number (S-0275-2024).

Data Availability Statement

Data available in a publicly accessible repository: https://drive.google.com/drive/folders/1qgeGUniBdxks-7SqLsXXEBbpoK41O4ex?usp=sharing (accessed on 20 November 2025).

Acknowledgments

The authors extend their appreciation to University of Tabuk.

Conflicts of Interest

The authors declare no conflict of interest.

References

Gupta, D. Ai-powered predictive maintenance in smart city IoT systems. Res. Ann. Ind. Syst. Eng. 2025, 2, 19–26. [Google Scholar]
Wu, P.; Zhang, Z.; Peng, X.; Wang, R. Deep learning solutions for smart city challenges in urban development. Sci. Rep. 2024, 14, 5176. [Google Scholar] [CrossRef] [PubMed]
Haider, Z.A.; Zeb, A.; Rahman, T.; Singh, S.K.; Akram, R.; Arishi, A.; Ullah, I. A Survey on anomaly detection in IoT: Techniques, challenges, and opportunities with the integration of 6G. Comput. Netw. 2025, 270, 111484. [Google Scholar] [CrossRef]
Abou Ali, M.; Dornaika, F. Edge Artificial Intelligence: A Systematic Review of Evolution, Taxonomic Frameworks, and Future Horizons. arXiv 2025, arXiv:2510.01439. [Google Scholar] [CrossRef]
Reis, M.J.; Serôdio, C. Edge AI for Real-Time Anomaly Detection in Smart Homes. Future Internet 2025, 17, 179. [Google Scholar] [CrossRef]
Dias, T.; Fonseca, T.; Vitorino, J.; Martins, A.; Malpique, S.; Praça, I. From data to action: Exploring AI and IoT-driven solutions for smarter cities. In International Symposium on Distributed Computing and Artificial Intelligence; Springer Nature: Cham, Switzerland, 2023; pp. 44–53. [Google Scholar]
Lokhande, M.A.; Renavikar, A.; Bhosale, D.; Kaldate, V.; Lokhande, S. An Insight into Smart Infrastructure with Artificial Intelligence-Driven Predictive Maintenance: Transforming the Future of Urban Systems. Cureus J. Comput. Sci. 2025, 2, 1–9. [Google Scholar] [CrossRef]
Roy, S. A comprehensive Survey on Network Traffic Anomaly Detection Using Deep Learning. arXiv, 2024; in preprints. [Google Scholar]
Cummins, L.; Sommers, A.; Ramezani, S.B.; Mittal, S.; Jabour, J.; Seale, M.; Rahimi, S. Explainable predictive maintenance: A survey of current methods, challenges and opportunities. IEEE Access 2024, 12, 57574–57602. [Google Scholar] [CrossRef]
Merolla, D.; Latorre, V.; Salis, A.; Boanelli, G. Automated road safety: Enhancing sign and surface damage detection with ai. arXiv 2024, arXiv:2407.15406. [Google Scholar] [CrossRef]
Khan, U.; Cheng, D.; Setti, F.; Fummi, F.; Cristani, M.; Capogrosso, L. A Comprehensive Survey on Deep Learning-based Predictive Maintenance. ACM Trans. Embed. Comput. Syst. 2025, 1–42. [Google Scholar] [CrossRef]
Saki, S.; Soori, M. Artificial Intelligence, Machine Learning and Deep Learning in Advanced Transportation Systems, A Review. Multimodal Transp. 2025, 5, 100242. [Google Scholar] [CrossRef]
Katib, I.; Albassam, E.; Sharaf, S.A.; Ragab, M. Safeguarding IoT consumer devices: Deep learning with TinyML driven real-time anomaly detection for predictive maintenance. Ain Shams Eng. J. 2025, 16, 103281. [Google Scholar] [CrossRef]
Nolan, J.; Reynolds, S. Deep Learning Approaches for Predictive Maintenance in Industrial Systems. ITSI Trans. Electr. Electron. Eng. 2024, 13, 25–32. [Google Scholar]
Bonci, A.; Fredianelli, L.; Kermenov, R.; Longarini, L.; Longhi, S.; Pompei, G.; Prist, M.; Verdini, C. Deepesn neural networks for industrial predictive maintenance through anomaly detection from production energy data. Appl. Sci. 2024, 14, 8686. [Google Scholar] [CrossRef]
Dhanasekaran, M. Smart Cities and Infrastructure: Managing Urban Scale Data for Predictive Maintenance and Resource Allocation. J. Comput. Sci. Technol. Stud. 2025, 7, 987–996. [Google Scholar] [CrossRef]
Ortiz-Garces, I.; Villegas-Ch, W.; Luján-Mora, S. Implementation of edge AI for early fault detection in IoT networks: Evaluation of performance and scalability in complex applications. Discov. Internet Things 2025, 5, 108. [Google Scholar] [CrossRef]
Ma, X.; Wu, J.; Xue, S.; Yang, J.; Zhou, C.; Sheng, Q.Z.; Xiong, H.; Akoglu, L. A comprehensive survey on graph anomaly detection with deep learning. IEEE Trans. Knowl. Data Eng. 2021, 35, 12012–12038. [Google Scholar] [CrossRef]
Huang, J. Predictive Maintenance at the Edge. Edge Impulse Blog. 2022. Available online: https://www.edgeimpulse.com/blog/predictive-maintenance-at-the-edge/ (accessed on 7 December 2025).
Khan, S.W.; Hafeez, Q.; Khalid, M.I.; Alroobaea, R.; Hussain, S.; Iqbal, J.; Almotiri, J.; Ullah, S.S. Anomaly detection in traffic surveillance videos using deep learning. Sensors 2022, 22, 6563. [Google Scholar] [CrossRef] [PubMed]
Alotaibi, S.R.; Alrayes, F.S.; Maashi, M.; Maray, M.; Alliheedi, M.A.; Badawood, D.; Alotaibi, M. Harnessing optimization with deep learning approach on intelligent transportation system for anomaly detection in pedestrian walkways. Sci. Rep. 2025, 15, 17358. [Google Scholar] [CrossRef]
Lim, L.H.; Ong, L.Y.; Leow, M.C. Federated Learning for Anomaly Detection: A Systematic Review on Scalability, Adaptability, and Benchmarking Framework. Future Internet 2025, 17, 375. [Google Scholar] [CrossRef]
Shabaz, M.; Raju, K.N. AI-Driven Traffic Flow Prediction and Anomaly Detection in Smart Cities: A Multi-Agent Approach. Trans. Emerg. Telecommun. Technol. 2025, 36, e70279. [Google Scholar] [CrossRef]
Wang, F.; Jiang, Y.; Zhang, R.; Wei, A.; Xie, J.; Pang, X. A survey of deep anomaly detection in multivariate time series: Taxonomy, applications, and directions. Sensors 2025, 25, 190. [Google Scholar] [CrossRef] [PubMed]
Chaymae, T.; Mhamed, R.; Soumia, Z. Machine Learning and 5G Edge Computing for Intelligent Traffic Management. Int. J. Adv. Comput. Sci. Appl. 2025, 16. [Google Scholar] [CrossRef]
Ghasemi, A.; Keshavarzi, A.; Abdelmoniem, A.M.; Nejati, O.R.; Derikvand, T. Edge Intelligence for Intelligent Transport Systems: Approaches, challenges, and future directions. Expert Syst. Appl. 2025, 280, 127273. [Google Scholar] [CrossRef]
Potharaju, S.; Tirandasu, R.K.; Tambe, S.N.; Jadhav, D.B.; Kumar, D.A.; Amiripalli, S.S. A two-step machine learning approach for predictive maintenance and anomaly detection in environmental sensor systems. MethodsX 2025, 14, 103181. [Google Scholar] [CrossRef]
Kummara, R. Edge Computing in Smart Cities: Transforming Urban Infrastructure through Decentralized Processing. J. Inf. Syst. Eng. Manag. 2025, 10, 1063–1080. [Google Scholar] [CrossRef]
Klein, P.; Malburg, L.; Bergmann, R. Combining informed data-driven anomaly detection with knowledge graphs for root cause analysis in predictive maintenance. Eng. Appl. Artif. Intell. 2025, 145, 110152. [Google Scholar] [CrossRef]
Sharma, A.; Boora, A.; Kumar, Y. Artificial Intelligence and Machine Learning for Resilient Transportation Infrastructure. Cureus J. Comput. Sci. 2025, 2, 1–16. [Google Scholar] [CrossRef]
Bawaneh, M.; Simon, V. Machine Learning-Based Anomaly Detection in Smart City Traffic: Performance Comparison and Insights. In Proceedings of the 11th International Conference on Vehicle Technology and Intelligent Transport Systems (VEHITS 2025), Porto, Portugal, 1–3 April 2025; pp. 309–318. [Google Scholar]
García-Méndez, S.; de Arriba-Pérez, F.; Leal, F.; Veloso, B.; Malheiro, B.; Burguillo-Rial, J.C. An explainable machine learning framework for railway predictive maintenance using data streams from the metro operator of Portugal. Sci. Rep. 2025, 15, 27495. [Google Scholar] [CrossRef]
Rojas, L.; Peña, Á.; Garcia, J. AI-driven predictive maintenance in mining: A systematic literature review on fault detection, digital twins, and intelligent asset management. Appl. Sci. 2025, 15, 3337. [Google Scholar] [CrossRef]

Figure 1. AIP-Urban system architecture.

Figure 2. Data acquisition and fusion pipeline.

Figure 3. Hybrid CNN–Transformer–LSTM architecture and edge AI deployment workflow.

Figure 4. AIP-Urban algorithmic workflow.

Figure 5. Latency–accuracy comparison of AIP-Urban and baseline models.

Figure 6. Error distribution comparison (MAE and RMSE) across models.

Figure 7. Voltage stability forecasting.

Figure 8. Luminance entropy prediction from CityCam streams.

Figure 9. Mechanical vibration forecasting.

Figure 10. Traffic flow density forecasting.

Figure 11. Latency–model size trade-off curve for edge devices. (a) Latency as a function of model size for AIP-Urban, Transformer, and Pruned LSTM architectures; AIP-Urban demonstrates the lowest latency despite its hybrid design; (b) Throughput scaling across increasing numbers of edge nodes, showing near-linear improvements as nodes are added.

Figure 12. SHAP feature importance ranking for AIP-Urban.

Table 1. Comparative summary of recent studies on AI-driven predictive maintenance, anomaly detection, and edge-intelligent transportation systems.

Focus	Representative Studies	Strengths/Achievements	Weaknesses/Open Issues
Predictive Maintenance	Khan [11]; Saki [12]; Nolan [14]; Bonci [15]	Established deep learning baselines; demonstrated temporal fault forecasting	Primarily industrial context; not validated on traffic infrastructure; high data imbalance
Anomaly Detection	Ma [18]; Alotaibi [21]; Shabaz [23]; Khan [11]	Visual-temporal fusion; strong accuracy in structured scenarios	High latency (>400 ms); limited interpretability; poor domain transferability
Edge AI	Reis [5]; Ortiz-Garces [17]; Ghasemi [26]; Kummara [28]	Reduced latency (≈50%); real-time inference; energy profiles reported	Scalability, coordination, and security not addressed; lack of large-scale experiments
Explainability/Trust	Cummins [9]; Sharma [30]; García-Méndez [31]	Initial integration of SHAP/LIME with maintenance tasks	No deployment on constrained edge devices; XAI not coupled with latency metrics
Integration in ITS	Dhanasekaran [16]; Alotaibi [21]; Ghasemi [26]; Bawaneh [31]	Edge + AI pipelines applied to traffic; progress towards sustainability	Fragmented benchmarks; datasets not public; carbon impact rarely quantified

Table 2. Ablation analysis of AIP-Urban showing the contribution of each architectural component (CNN, Transformer, LSTM) on prediction accuracy and inference latency.

Model Variant	MAE	RMSE	F1	Latency (ms)
CNN only	6.1	6.8	0.88	78
LSTM only	5.4	6.0	0.90	103
Transformer only	5.0	5.7	0.91	96
CNN–LSTM	4.7	5.4	0.92	84
CNN–Transformer (no LSTM)	4.5	5.2	0.93	81
Full Hybrid (AIP-Urban)	4.2	4.8	0.94	72

Table 3. Comparative computational and environmental performance of AIP-Urban versus baseline models.

Model	MAE	RMSE	F1	Latency (ms)	Power (W)	CO₂ Reduction (%)
CNN	5.8	6.3	0.89	118	8.9	–
LSTM	5.2	5.9	0.91	104	8.2	+8
GRU	4.9	5.4	0.92	99	8.1	+9
Federated LSTM	4.5	5.1	0.93	91	7.9	+11
AIP-Urban (CNN–Transformer–LSTM)	4.2	4.8	0.94	72	7.8	+18

Table 4. Dataset-specific performance comparison of AIP-Urban and baselines.

Dataset	Task	Best Baseline (F1)	AIP-Urban (F1)	Δ F1 (%)	Latency (ms)
CityCam	Optical degradation detection	0.92 (GRU)	0.95	+3.3%	70
UA-DETRAC	Traffic sensor failure forecasting	0.91 (Fed-LSTM)	0.94	+3.2%	74
PEMS-BAY	Multivariate signal drift prediction	0.90 (LSTM)	0.93	+3.4%	76
SUMO	Synthetic congestion simulation	0.89 (CNN)	0.92	+3.7%	72

Table 5. Cross-modal correlation and prediction horizon analysis for AIP-Urban.

Modality	R² (AIP-Urban)	Δ vs. Best Baseline	Prediction Horizon (h)
Voltage (Traffic Light)	0.93	+0.06	22
Luminance (CityCam)	0.92	+0.07	28
Vibration (Pole Sensor)	0.90	+0.05	25
Flow Density (PEMS-BAY/SUMO)	0.91	+0.08	24

Table 6. Single-node edge performance comparison for AIP-Urban deployments.

Device	Model Size (MB)	Latency (ms)	Throughput (fps)	Power (W)	Temp (°C)	MAE	F1
Jetson Nano	23	72	13.9	7.8	57	4.2	0.94
Coral TPU	21	78	12.3	5.9	54	4.3	0.93

Table 7. Comparative performance of AIP-Urban and recent AI-driven predictive maintenance frameworks. ✓ indicates support or availability of the corresponding capability; ✕ indicates that the capability is not provided or not implemented.

Reference (Year)	Framework/Model	Domain	F1/Accuracy	Latency (ms)	Power (W)	Edge Deployment	Explainability	Limitation
Alotaibi et al. (2025) [21]	DL + Optimisation (CNN-RNN)	Pedestrian Anomaly	0.93	450	9.1	✕ Cloud	✕	High latency
Ghasemi et al. (2025) [26]	Edge GNN Analytics	ITS	0.91	120	7.8	✓ Jetson Nano	✕	No interpretability
Reis et al. (2025) [5]	TFLite CNN	Smart Home	0.90	115	7.9	✓ Jetson	✕	Not traffic-specific
Shabaz et al. (2025) [23]	Multi-Agent RL	Smart Traffic	0.92	310	8.6	✕ Cloud	✕	No energy metrics
Lokhande et al. (2025) [7]	CNN-BiLSTM	Smart Infra	0.91	220	9.2	✕ Cloud	✕	No edge adaptation
AIP-Urban (Ours)	CNN–Transformer + LSTM	Traffic Infrastructure	0.94	72	7.8	✓ Edge (Jetson Nano/TPU)	✓ SHAP-based

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Abdallah, W.; Alghamdi, M. AIP-Urban: Edge-Enabled Deep Learning Framework for Predictive Maintenance and Anomaly Detection in Urban Traffic Infrastructure. Systems 2025, 13, 1117. https://doi.org/10.3390/systems13121117

AMA Style

Abdallah W, Alghamdi M. AIP-Urban: Edge-Enabled Deep Learning Framework for Predictive Maintenance and Anomaly Detection in Urban Traffic Infrastructure. Systems. 2025; 13(12):1117. https://doi.org/10.3390/systems13121117

Chicago/Turabian Style

Abdallah, Wajih, and Mansoor Alghamdi. 2025. "AIP-Urban: Edge-Enabled Deep Learning Framework for Predictive Maintenance and Anomaly Detection in Urban Traffic Infrastructure" Systems 13, no. 12: 1117. https://doi.org/10.3390/systems13121117

APA Style

Abdallah, W., & Alghamdi, M. (2025). AIP-Urban: Edge-Enabled Deep Learning Framework for Predictive Maintenance and Anomaly Detection in Urban Traffic Infrastructure. Systems, 13(12), 1117. https://doi.org/10.3390/systems13121117

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

AIP-Urban: Edge-Enabled Deep Learning Framework for Predictive Maintenance and Anomaly Detection in Urban Traffic Infrastructure

Abstract

1. Introduction

2. Related Work

2.1. AI-Based Predictive Maintenance for Transportation Infrastructure

2.2. Deep Learning for Anomaly Detection in Traffic and Infrastructure

2.3. Edge AI and Decentralised Learning for Smart Mobility

3. System Architecture and Data Collection

3.1. Overall Architecture

3.2. Data Acquisition and PreProcessing

3.3. Edge Deployment and Model Design

3.4. Cloud–Edge Coordination for Network-Level Traffic Dependencies

3.5. Mathematical Formulation and Algorithmic Workflow

3.6. Experimental Setup and Computational Metrics

3.6.1. Hardware and Software Environment

3.6.2. Dataset Segmenting and Training Method

3.6.3. Compression and Quantisation Procedure

3.6.4. Computational Performance Assessment

3.6.5. Energy and Sustainability Metrics

4. Experimental Results and Performance Evaluation

4.1. Quantitative Evaluation

4.1.1. Dataset-Wise Evaluation

4.1.2. Latency–Accuracy Trade-Off

4.1.3. Error Distribution and Robustness

4.1.4. Statistical Validation

4.2. Time-Series Forecasting and Degradation Detection

4.2.1. Electrical Degradation: Traffic Light Voltage Stability

4.2.2. Optical Degradation: Camera Luminance Entropy

4.2.3. Mechanical Degradation: Sensor Vibration Amplitude

4.2.4. Traffic Flow Dynamics: Congestion and Drift Prediction

4.2.5. Statistical Consistency and Cross-Dataset Generalisation

4.2.6. Interpretation

4.3. Computational Efficiency and Scalability Analysis

4.3.1. Edge-Device Benchmarking

4.3.2. Latency vs. Model Size Trade-Off

4.3.3. Scalability Across Distributed Nodes

4.3.4. Memory Footprint and Thermal Profile

4.4. Explainability and Feature Contribution Analysis

4.4.1. SHAP Feature Importance

4.4.2. LIME Local Explanations

4.4.3. Cross-Correlation Between Modalities

4.5. Summary of Key Findings

5. Discussion and Comparative Analysis

6. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI