Article

MemRoadNet: Human-like Memory Integration for Free Road Space Detection

1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 Sichuan Artificial Intelligence Research Institute, Yibin 644000, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(21), 6600; https://doi.org/10.3390/s25216600
Submission received: 27 August 2025 / Revised: 16 October 2025 / Accepted: 20 October 2025 / Published: 27 October 2025

Abstract

Detecting available road space is a fundamental task for autonomous driving vehicles, requiring robust image feature extraction methods that operate reliably across diverse sensor-captured scenarios. However, existing approaches process each input independently without leveraging Accumulated Experiential Knowledge (AEK), limiting their adaptability and reliability. In order to explore the impact of AEK, we introduce MemRoadNet, a Memory-Augmented (MA) semantic segmentation framework that integrates human-inspired cognitive architectures with deep-learning models for free road space detection. Our approach combines an InternImage-XL backbone with a UPerNet decoder and a Human-like Memory Bank system implementing episodic, semantic, and working memory subsystems. The memory system stores road experiences with emotional valences based on segmentation performance, enabling intelligent retrieval and integration of relevant historical patterns during training and inference. Experimental validation on the KITTI road, Cityscapes, and R2D benchmarks demonstrates that our single-modality RGB approach achieves competitive performance with complex multimodal systems while maintaining computational efficiency and achieving top performance among single-modality methods. The MA framework represents a significant advancement in sensor-based computer vision systems, bridging computational efficiency and segmentation quality for autonomous driving applications.

1. Introduction

As sensor technology continues to advance and integrate with computer vision systems, the development of efficient image feature extraction methods becomes increasingly crucial. These methods are essential for applications across various domains, including autonomous driving, smart cities, and intelligent transportation systems. Such applications require robust models capable of accurate drivable region detection across diverse environmental conditions [1]. The precision of road detection/segmentation directly influences navigation safety, path planning efficiency, and overall autonomous vehicle performance in real-world scenarios. Contemporary deep-learning approaches, particularly Convolutional Neural Networks (CNNs), have demonstrated substantial progress [2] in addressing this perception task through hierarchical feature learning and multi-scale spatial reasoning [3,4].
Despite these advances, current methodologies face several limitations that constrain their practical effectiveness. Traditional CNN architectures [5] process each input independently. Such architectures do not leverage accumulated experiences or contextual knowledge from previously encountered scenarios. This approach becomes particularly problematic in dynamic driving environments where road conditions, lighting variations, weather patterns, and infrastructure configurations exhibit significant diversity across geographical regions and temporal contexts [6]. While achieving impressive accuracy benchmarks, these models often require substantial computational resources that may not be readily available in embedded vehicle systems [7]. Additionally, most current approaches lack mechanisms to learn from both successful and unsuccessful prediction attempts, and this limitation causes them to miss opportunities to build comprehensive knowledge bases that could inform future segmentation decisions.
Recent advances in Memory-Augmented (MA) neural networks [8,9] have demonstrated promising potential for addressing these fundamental limitations. Drawing inspiration from human cognitive processes, researchers have begun exploring neural architectures that incorporate persistent memory mechanisms. Human cognitive processes effectively leverage episodic and semantic memory systems to inform current decision-making. These new architectures aim for enhanced pattern recognition and contextual reasoning. However, the application of MA approaches to semantic segmentation tasks, particularly free road space detection, remains largely unexplored. Most contemporary approaches rely on multimodal sensor fusion or complex ensemble architectures to achieve competitive performance, increasing computational overhead and system complexity.
For this purpose, we introduce a Human-like Memory Bank that draws inspiration from cognitive neuroscience theories of human memory systems [10,11], while implementing these concepts through established deep-learning mechanisms. The episodic memory module uses similarity-based retrieval to store and recall specific driving experiences, analogous to how humans retrieve context-specific memories. The working memory employs attention mechanisms to maintain task-relevant information during inference. While these modules use conventional machine learning techniques (attention, similarity metrics, weighting), their architectural organization and interaction patterns are inspired by cognitive models of human memory. The emotional valence mechanism implements a quality-based weighting system that prioritizes high-quality experiences, drawing from theories of emotional memory consolidation [10,11]. We emphasize that these are engineering implementations inspired by neuroscience concepts, rather than direct computational models of neural processes.
The memory system implements three interconnected subsystems: episodic memory for storing specific experiences, semantic memory for maintaining generalized knowledge patterns, and working memory for preserving recent contextual information. The MA architecture enables learning from both successful and unsuccessful prediction attempts. This builds a comprehensive knowledge base that informs future segmentation decisions. We implement memory consolidation and forgetting mechanisms that prioritize important experiences. The framework employs contextual encoding that captures visual, spatial, and performance-related information, and this enables rich associative memory retrieval based on current scene characteristics.
Motivated by these observations, this paper addresses the fundamental gap in how neural networks can accumulate and utilize experiential learning for improved free road space detection. We propose a framework that bridges this critical gap by integrating accumulated experiences to enhance current prediction accuracy while operating exclusively on single-modality RGB input. Despite this computational constraint, our approach achieves performance comparable to sophisticated multimodal systems through intelligent memory utilization. The framework combines an InternImage-XL backbone [12] with a UPerNet decoder architecture (similar to [13]), augmented by a Human-like Memory Bank system inspired by cognitive neuroscience principles.
The system extracts compressed feature representations from deep network activations, associates them with contextual information, including visual statistics and performance outcomes, and stores these experiences with emotional valence based on prediction accuracy. Memory retrieval employs similarity computation considering pattern alignment, contextual relevance, temporal recency, and learned importance scores to identify relevant past experiences for current prediction tasks. The integration of retrieved memories with current network processing occurs through attention-based mechanisms that enable selective incorporation of relevant historical knowledge while preserving the network’s ability to process novel scenarios [14].
Experimental validation on challenging road segmentation benchmarks demonstrates the effectiveness of our approach. The framework achieves superior performance compared with single-modality-based methods and competitive results with multimodal approaches. We conduct comprehensive ablation studies to reveal the individual contributions of different memory subsystems and provide insights into memory dynamics that drive performance improvements. The primary contributions of this work can be summarized as follows:
  • We present a framework that integrates human-inspired cognitive architectures implementing episodic, semantic, and working memory subsystems with biologically inspired consolidation and forgetting mechanisms for enhanced performance.
  • Our comprehensive experiments demonstrate superior performance among state-of-the-art single-modality-based methods and competitive performance approaching multimodal systems on challenging road segmentation benchmarks. Additionally, we present a detailed analysis of memory dynamics, retrieval mechanisms, and their impact on performance.
The rest of this paper is organized as follows. Section 2 reviews related work in semantic segmentation, MA neural networks, and cognitive architectures. Section 3 presents our framework, including the InternImage-XL backbone, UPerNet decoder, and Human-like Memory Bank system. Section 4 provides comprehensive experimental validation and ablation studies. Section 5 discusses limitations, environmental impact considerations, and future research directions. Section 6 concludes with a discussion of results and future research directions. The code, weights, and other materials can be found at our GitHub webpage (https://github.com/abdkhanstd/MemRoadNet, accessed on 19 October 2025).

2. Related Work

Free road space detection has witnessed substantial advancement through deep-learning methodologies. It spans both complex multimodal frameworks and computationally efficient single-modality approaches. However, existing architectures fundamentally lack mechanisms for leveraging Accumulated Experiential Knowledge (AEK), processing each input independently without benefiting from previously encountered patterns or contextual associations.

2.1. Multimodal Approaches

Modern multimodal methods demonstrate remarkable performance through sophisticated sensor fusion strategies, integrating complementary information from LiDAR, RGB imagery, depth sensors, and surface normal estimations. For example, Evi-RoadSeg [15] presents an evidence-based approach for real-time road segmentation, enhancing performance through RGB-D data augmentation. Similarly, LFD-RoadSeg [16] introduces an ultra-fast road segmentation method that leverages low-level representations to enhance efficiency and performance. Moreover, CLCFNet [17] implements cascaded LiDAR-camera processing pipelines, while PLARD [18] achieves state-of-the-art performance through adaptive LiDAR data integration. USNet [19] delivers low-latency processing via lightweight symmetric network architectures, addressing real-time deployment requirements for autonomous systems.
Surface normal estimation methodologies, including SNE-RoadSeg [20] and its enhanced variant SNE-RoadSegV2 [21], advance heterogeneous feature fusion through fallibility-aware processing mechanisms. Recent depth-aware approaches have further expanded multimodal capabilities. For example, DiPFormer [22] explores deep RGB-D interactions for traffic scene segmentation, demonstrating how depth information enhances spatial understanding in complex driving environments. Transformer-based approaches such as RoadFormer [23] and RoadFormer+ [24] employ scale-aware information decoupling for RGB-Normal semantic parsing, showcasing attention mechanisms’ effectiveness in multimodal integration.
Additional multimodal frameworks, including DFM-RTFNet [25], 3MT-RoadSeg [26], Pseudo-LiDAR [27], TEDNet [28], CLRD [29], and LRDNet [14], present diverse multimodal integration strategies, each addressing specific challenges in sensor fusion and computational efficiency trade-offs. Contemporary approaches have expanded to include LiDAR-image fusion methodologies, exemplified by UdeerLID+ [30], which integrates LiDAR, image, and relative depth information through semi-supervised learning paradigms. Furthermore, pseudo-LiDAR techniques have emerged as cost-effective alternatives, with [27] demonstrating effective road detection through depth-derived LiDAR representations, bridging the gap between expensive sensor suites and accessible RGB-based systems.

2.2. Methods Based on Single-Modality

Single-modality approaches prioritize computational efficiency while maintaining competitive performance through architectural innovations and algorithmic optimizations. LC-CRF [31] leverages conditional random field frameworks for structured prediction, while LFD-RoadSeg [16] employs bilateral network structures for enhanced feature representation. Additionally, specialized resource-efficient architectures, including RoadNet3 [32], ChipNet [33], DEEP-DIG [34], HA-DeepLabv3+ [35], RBANet [36], and Hadamard-FCN [37] demonstrate various strategies for reducing computational overhead while preserving segmentation accuracy. These approaches typically focus on architectural efficiency through lightweight convolutions, parameter reduction, or specialized design patterns.

2.3. Research Gap

Despite substantial progress in both multimodal and single-modality road segmentation, existing methodologies share a fundamental limitation, i.e., the inability to leverage AEK for enhanced prediction accuracy. Current architectures process each input independently, failing to capitalize on patterns, contextual associations, and performance feedback accumulated during training and deployment. Furthermore, while multimodal approaches achieve superior performance, they require complex sensor suites and substantial computational resources that may not be practical for all deployment scenarios. Single-modality methods, though more efficient, typically sacrifice performance to meet computational constraints. Thus, this paper addresses these limitations by introducing MA mechanisms that enable single-modality approaches to achieve performance comparable with multimodal systems through intelligent utilization of AEK.

3. Methodology

This section presents our semantic segmentation framework that integrates human-inspired cognitive architectures. Figure 1 provides a simplified overview of the complete framework, illustrating how features flow from the backbone through the decoder while being enhanced by accumulated memory experiences.

3.1. Overall Architecture

The proposed framework orchestrates three interconnected components as illustrated in Figure 2, i.e., an InternImage-XL backbone for hierarchical feature extraction, a UPerNet decoder for multi-scale feature fusion, and our human-inspired Memory Bank that enables experiential learning through accumulated knowledge. Figure 2 shows the memory system’s operational dynamics. It shows how specific road experiences, ranging from clear highway segments to challenging shadowed intersections, are encoded with emotional valences based on segmentation performance outcomes. This detailed representation demonstrates the categorization mechanisms that enable our framework to distinguish between highly successful predictions (very positive emotional valence) and problematic scenarios (negative valence), building a rich experience repository that fundamentally changes how the network approaches road segmentation tasks. The framework operates through an interplay between perception and memory, formulated as:
$\hat{Y} = \Omega\Big(\Psi_{\text{decoder}}\big(\Phi_{\text{backbone}}(X) \oplus \mathcal{R}_M(X, C(X))\big)\Big),$
where $X \in \mathbb{R}^{B \times 3 \times H \times W}$ represents input imagery, $C(X)$ extracts contextual information, $\oplus$ denotes memory-guided feature enhancement, $\mathcal{R}_M$ represents our memory recall function, and $\Omega$ applies sigmoid activation for binary road segmentation. The architecture’s novelty lies in its dual operational modes: the Memory Bank continuously accumulates experiential patterns from prediction outcomes and performance feedback, building a comprehensive repository of associations, and this accumulated knowledge guides current predictions during both training and inference through intelligent retrieval and integration of relevant historical experiences.
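For readers who prefer code, the following minimal PyTorch-style sketch illustrates this dual-path formulation; the module interfaces (for example, a `memory_bank.enhance` method and a backbone that returns a list of four feature levels) are illustrative placeholders rather than the released implementation.

```python
import torch
import torch.nn as nn

class MemRoadNetSketch(nn.Module):
    """Minimal sketch of the memory-guided forward pass (hypothetical module names)."""
    def __init__(self, backbone, decoder, memory_bank):
        super().__init__()
        self.backbone = backbone        # InternImage-XL: assumed to return a list [F0, F1, F2, F3]
        self.decoder = decoder          # UPerNet head: fuses the pyramid into a logit map
        self.memory_bank = memory_bank  # Human-like Memory Bank (episodic/semantic/working)

    def forward(self, x, context):
        feats = self.backbone(x)                       # hierarchical features
        # Recall relevant experiences and enhance the deepest level (Level 3).
        feats[3] = self.memory_bank.enhance(feats[3], context)
        logits = self.decoder(feats)                   # single-channel road logits
        return torch.sigmoid(logits)                   # sigmoid for the binary road mask
```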

3.2. InternImage-XL Backbone with DCNv3

The InternImage-XL backbone serves as our primary feature extraction mechanism, employing Deformable Convolution v3 (DCNv3) operations [38] for adaptive spatial modeling. This architectural choice stems from DCNv3’s superior capability in modeling long-range dependencies while capturing fine-grained spatial details through dynamic receptive field adaptation, making it particularly effective for complex segmentation scenarios where object boundaries exhibit significant geometric variation. As depicted in Figure 2, the backbone implements a hierarchical structure across four distinct levels, each engineered to capture features at different semantic granularities. The channel progression follows $\{192, 384, 768, 1536\}$ with corresponding spatial resolution reductions achieved through progressive downsampling operations. Each InternImage block/level (i.e., $\mathcal{B}_i$) processes features through a sequence represented as:
$F_{i+1} = \mathcal{B}_i(F_i) = F_i + \gamma_1 \cdot \mathrm{DCNv3}(\Lambda_1(F_i)) + \gamma_2 \cdot \mathrm{MLP}(\Lambda_2(F_i)),$
where $F_i \in \mathbb{R}^{C_i \times H_i \times W_i}$ represents features at level $i$, layer normalization operations $\Lambda_1$ and $\Lambda_2$ ensure stable training dynamics, and learnable layer scale parameters $\gamma_1, \gamma_2 \in \mathbb{R}^{C_i}$ provide fine-grained control over feature integration. The DCNv3 operation forms the computational core of our backbone, adaptively sampling features based on learned offset and mask predictions:
$\mathrm{DCNv3}(F) = \sum_{k=1}^{K} w_k \cdot F(p_0 + p_k + \Delta p_k) \cdot m_k \cdot \sigma(\alpha_k),$
where $p_0$ represents the reference position, $\Delta p_k \in \mathbb{R}^2$ represents learned spatial offsets enabling adaptive sampling, $m_k \in [0, 1]$ represents attention masks modulating feature importance, $w_k$ denotes standard convolution weights, and $\sigma(\alpha_k)$ represents learned modulation scalars. Level 3, as highlighted in our architecture diagram, receives particular attention as it serves dual purposes: providing the deepest semantic representations for decoder processing while simultaneously contributing to memory formation through feature compression operations. This dual utilization ensures that our memory system operates on the most semantically rich representations available from the backbone network.
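The residual structure of each block can be sketched as below. A depthwise convolution stands in for the actual DCNv3 operator (which in practice comes from the InternImage ops library), so this illustrates the layer-scale residual layout rather than the deformable sampling itself.

```python
import torch
import torch.nn as nn

class InternImageBlockSketch(nn.Module):
    """Sketch of the layer-scaled residual block; a depthwise conv replaces DCNv3."""
    def __init__(self, channels, mlp_ratio=4, layer_scale_init=1e-5):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        # Placeholder for the real DCNv3 operator (learned offsets/masks omitted here).
        self.spatial_op = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, mlp_ratio * channels), nn.GELU(),
            nn.Linear(mlp_ratio * channels, channels))
        # Learnable layer-scale parameters gamma_1, gamma_2 over the channel dimension.
        self.gamma1 = nn.Parameter(layer_scale_init * torch.ones(channels))
        self.gamma2 = nn.Parameter(layer_scale_init * torch.ones(channels))

    def forward(self, x):                       # x: (B, C, H, W)
        t = x.permute(0, 2, 3, 1)               # channels-last for LayerNorm
        y = self.spatial_op(self.norm1(t).permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        t = t + self.gamma1 * y                 # F_i + gamma_1 * spatial_op(LN(F_i))
        t = t + self.gamma2 * self.mlp(self.norm2(t))  # + gamma_2 * MLP(LN(F_i))
        return t.permute(0, 3, 1, 2)
```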

3.3. UPerNet Decoder Head

The UPerNet decoder implements multi-scale feature fusion through its Feature Pyramid Network (FPN) [39] foundation, enhanced with Pyramid Scene Parsing (PSP) modules [40]. As illustrated in Figure 2, the decoder processes backbone features through distinct pathways: Levels 0-2 undergo lateral convolution followed by FPN refinement, while Level 3 (potentially enhanced by memory) passes through the PSP module for global context aggregation. The decoder establishes lateral connections that transform backbone features to unified channel dimensions:
$L_i = \Xi_i(F^{(i)}) + \Upsilon(L_{i+1}),$
where lateral operations $\Xi_i$ implement $1 \times 1$ convolutions with batch normalization and ReLU activation, and the upsampling function $\Upsilon$ employs bilinear interpolation for spatial alignment across pyramid levels. The PSP module captures global contextual information through multi-scale pooling operations:
$\mathrm{PSP}(F) = \mathrm{Bottleneck}\Big(\mathrm{Cat}\big(F, \{\Theta_s(\mathrm{Pool}_s(F))\}_{s \in \{1,2,3,6\}}\big)\Big),$
where adaptive pooling operations $\mathrm{Pool}_s$ reduce spatial dimensions to $s \times s$ grids, projection functions $\Theta_s$ perform channel-wise transformations, and $\mathrm{Cat}$ accomplishes concatenation along the channel dimension. The feature concatenation stage, as shown in our architecture diagram, unifies all processed features through spatial alignment and channel concatenation before the final convolution operation produces the segmentation mask. This multi-scale integration ensures that both fine-grained local details and global contextual information contribute to the final prediction.
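As an illustration, a minimal PSP module following the pooling scales above could be written as follows; the branch channel width and bottleneck design are assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PSPModuleSketch(nn.Module):
    """Pyramid pooling at scales {1, 2, 3, 6} plus a bottleneck fusion (illustrative widths)."""
    def __init__(self, in_ch=1536, branch_ch=512, scales=(1, 2, 3, 6)):
        super().__init__()
        self.scales = scales
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1, bias=False),
                          nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))
            for _ in scales)                     # Theta_s projections
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_ch + branch_ch * len(scales), branch_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))

    def forward(self, x):                        # x: deepest backbone features F^(3)
        h, w = x.shape[2:]
        pyramids = [x]
        for s, branch in zip(self.scales, self.branches):
            pooled = F.adaptive_avg_pool2d(x, s)            # Pool_s -> s x s grid
            pyramids.append(F.interpolate(branch(pooled), size=(h, w),
                                          mode='bilinear', align_corners=False))
        return self.bottleneck(torch.cat(pyramids, dim=1))  # Cat + Bottleneck
```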

3.4. Human-like Memory Bank System

Our Memory Bank system is fundamentally different from traditional neural architectures. It incorporates persistent, adaptive memory mechanisms inspired by human cognitive processes. The system addresses the inherent limitation of feed-forward networks that process each input independently without benefiting from AEK or contextual associations developed through training.

3.4.1. Memory Architecture Design

The Memory Bank, as shown in Figure 2, implements six interconnected components that collectively enable experiential learning. The system architecture comprises:
$\mathcal{M} = \{\mathcal{M}_{\text{working}}, \mathcal{M}_{\text{episodic}}, \mathcal{M}_{\text{semantic}}, \mathcal{A}_{\text{attention}}, \Sigma_{\text{compress}}, \Phi_{\text{fusion}}\}.$
Working Memory maintains recent experiential context through a limited-capacity buffer. It is implemented as $\mathcal{M}_{\text{working}} = \mathrm{Deque}(\{e_{t-9}, \ldots, e_t\}, \text{maxlen}=10)$, ensuring immediate contextual information remains readily accessible for rapid integration with current processing. Furthermore, Episodic Memory serves as the primary repository for specific experiences. It stores comprehensive representations as tuples $e_i = (\phi_i, \kappa_i, \epsilon_i, \tau_i, \alpha_i, \iota_i)$. Here, $\phi_i \in \mathbb{R}^{128}$ represents compressed feature patterns, $\kappa_i$ encodes rich contextual information, and $\epsilon_i$ captures emotional valence based on performance outcomes. Additional metadata ($\tau_i$, $\alpha_i$, $\iota_i$) tracks temporal and access characteristics. Semantic Memory organizes generalized knowledge patterns grouped by performance characteristics and emotional categories. This enables rapid access to successful strategies and pattern recognition approaches that have proven effective across diverse scenarios. The practical implementation of our memory architecture, as demonstrated in the detailed Memory Bank visualization of Figure 2, operates on concrete road experiences that exemplify the system’s categorization capabilities. Each stored experience represents a specific encounter with varying road conditions. These range from pristine highway segments that achieve very positive emotional valence through exceptional segmentation accuracy to challenging scenarios involving shadows, occlusions, or complex geometric configurations that receive correspondingly lower valence scores.
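A compact sketch of these subsystems as plain data structures is given below; the tuple fields and the capacity of 200 episodic entries follow the description in this paper, while the class and attribute names are illustrative.

```python
from collections import deque
from dataclasses import dataclass
import torch

@dataclass
class Experience:
    """One episodic entry e_i = (phi, kappa, epsilon, tau, alpha, iota)."""
    pattern: torch.Tensor        # phi_i: compressed 128-d feature
    context: dict                # kappa_i: visual/spatial/training/performance context
    valence: str                 # epsilon_i: 'very_positive' ... 'very_negative'
    timestamp: int               # tau_i: global step when stored
    access_count: int = 0        # alpha_i: retrieval frequency
    importance: float = 1.0      # iota_i: learned importance score

class MemoryBankSketch:
    """Sketch of the three subsystems; interfaces are illustrative."""
    def __init__(self, episodic_capacity=200):
        self.working = deque(maxlen=10)          # recent experiences only
        self.episodic = []                       # bounded list of Experience tuples
        self.semantic = {}                       # valence category -> stored patterns
        self.capacity = episodic_capacity

    def store(self, exp: Experience):
        self.working.append(exp)
        self.episodic.append(exp)
        self.semantic.setdefault(exp.valence, []).append(exp.pattern.detach())
        if len(self.episodic) > self.capacity:   # importance-based forgetting (Section 3.5)
            victim = min(self.episodic, key=lambda e: e.importance)
            self.episodic.remove(victim)
```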

3.4.2. Experience Encoding and Memory Formation

During training, our system continuously accumulates experiential knowledge through encoding mechanisms. Feature compression, as depicted in our architecture, transforms high-dimensional backbone activations into compact memory representations:
$\phi = \Sigma\big(\mathrm{GAP}(F^{(3)})\big) = \mathrm{Linear}\Big(\frac{1}{H_3 W_3} \sum_{h=1}^{H_3} \sum_{w=1}^{W_3} F^{(3)}_{:,h,w}\Big),$
where $\Sigma: \mathbb{R}^{1536} \rightarrow \mathbb{R}^{128}$ represents a learned transformation that preserves semantic content through dimensionality reduction. Contextual encoding captures multimodal scene properties that provide rich environmental information for memory association:
$\kappa = \{\kappa_{\text{visual}}, \kappa_{\text{spatial}}, \kappa_{\text{training}}, \kappa_{\text{performance}}\}.$
Visual context encompasses statistical image properties, including brightness characteristics, contrast metrics, and per-channel statistics. Spatial context encodes geometric properties, while training context maintains meta-information regarding the current training state. Performance context captures prediction quality indicators when ground-truth annotations are available. Emotional valence quantifies experience significance through IoU (Intersection over Union)-based categorization: very positive ($\mathrm{IoU} > 0.8$), positive ($0.6 < \mathrm{IoU} \leq 0.8$), neutral ($0.4 < \mathrm{IoU} \leq 0.6$), negative ($0.2 < \mathrm{IoU} \leq 0.4$), and very negative ($\mathrm{IoU} \leq 0.2$). This enables prioritizing both highly successful experiences and significant failures as valuable learning signals.
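The compression and valence rules above can be summarized in a few lines; the encoder dimensions follow the text, while the helper names are illustrative.

```python
import torch
import torch.nn as nn

class ExperienceEncoderSketch(nn.Module):
    """Global average pooling over Level-3 features followed by a 1536 -> 128 projection."""
    def __init__(self, in_ch=1536, mem_dim=128):
        super().__init__()
        self.compress = nn.Linear(in_ch, mem_dim)   # the learned transformation Sigma

    def forward(self, f3):                          # f3: (B, 1536, H3, W3)
        pooled = f3.mean(dim=(2, 3))                # global average pooling
        return self.compress(pooled)                # phi: (B, 128)

def emotional_valence(iou: float) -> str:
    """IoU-based categorization used to weight stored experiences."""
    if iou > 0.8:
        return "very_positive"
    if iou > 0.6:
        return "positive"
    if iou > 0.4:
        return "neutral"
    if iou > 0.2:
        return "negative"
    return "very_negative"
```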

3.4.3. Memory Recall and Integration

The memory recall process, as illustrated through the concrete examples in Figure 2, demonstrates how our system leverages accumulated road experiences to inform current predictions. The episodic memory component maintains detailed records of specific encounters, clear road boundaries that achieved exceptional performance, challenging shadow patterns that required discrimination, and complex intersection geometries that tested the network’s spatial reasoning capabilities. This rich experiential foundation enables the attention module to identify relevant historical patterns that share semantic or contextual similarity with current input scenarios, creating intelligent associations that guide enhanced feature representations toward more accurate segmentation outcomes.
During training, the memory system transitions from accumulation mode to utilization mode, leveraging AEK to enhance current predictions. Figure 3 illustrates the detailed memory recall and integration processes that enable contextual reasoning during road segmentation tasks. The attention module, as shown in Figure 2, implements retrieval mechanisms that identify relevant historical experiences through multi-faceted similarity computation that extends beyond simple cosine similarity to incorporate human-like memory characteristics. The enhanced memory retrieval employs a comprehensive similarity metric that combines pattern matching, contextual relevance, temporal dynamics, and learned importance:
$s_{\text{total}}(e_i, q) = \alpha \cdot s_{\text{pattern}}(e_i, q) + \beta \cdot s_{\text{context}}(e_i, q) + \gamma \cdot s_{\text{recency}}(e_i) + \delta \cdot s_{\text{importance}}(e_i),$
where query $q = (\phi_q, \kappa_q)$ represents current compressed features and contextual information, and the weighting coefficients $\alpha = 0.4$, $\beta = 0.2$, $\gamma = 0.2$, $\delta = 0.2$ reflect the relative importance of different similarity aspects based on cognitive psychology principles. The weighting coefficients were determined through empirical validation on held-out data, balancing the contributions of different similarity aspects for optimal retrieval performance. Pattern similarity $s_{\text{pattern}}(e_i, q)$ employs cosine similarity between compressed feature representations:
$s_{\text{pattern}}(e_i, q) = \frac{\phi_i \cdot \phi_q}{\|\phi_i\|_2 \, \|\phi_q\|_2}.$
Contextual similarity $s_{\text{context}}(e_i, q)$ measures alignment between environmental and situational factors:
$s_{\text{context}}(e_i, q) = \frac{1}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} \mathrm{sim}\big(\kappa_i^{(k)}, \kappa_q^{(k)}\big),$
where $\mathcal{K}$ represents the set of contextual features (brightness, contrast, spatial properties) and $\mathrm{sim}(\cdot, \cdot)$ computes normalized similarity for each contextual dimension. Recency boost $s_{\text{recency}}(e_i)$ implements exponential decay that prioritizes recently formed memories:
$s_{\text{recency}}(e_i) = \exp\!\Big(-\frac{t_{\text{current}} - t_i}{100}\Big),$
where $t_{\text{current}}$ represents the current global time step and $t_i$ denotes the timestamp when experience $e_i$ was stored. Importance score $s_{\text{importance}}(e_i)$ reflects the learned significance of each memory based on emotional valence, access frequency, and consolidation strength:
$s_{\text{importance}}(e_i) = \frac{w_{\text{emotion}}(e_i) + w_{\text{access}}(e_i) + w_{\text{novelty}}(e_i)}{3},$
where emotional weighting $w_{\text{emotion}}(e_i) \in [0.3, 1.0]$ assigns higher importance to very positive (1.0) and very negative (0.9) experiences, access weighting $w_{\text{access}}(e_i)$ increases with retrieval frequency following $w_{\text{access}}(e_i) = 1.0 + 0.1 \cdot \text{access\_count}_i$, and novelty weighting captures the uniqueness of the stored pattern relative to the existing Memory Bank. The system retrieves the top-$k = 9$ most relevant experiences, which undergo integration through multi-head attention mechanisms. The memory attention mechanism enables pattern matching between current queries and historical experiences:
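A sketch of this combined scoring function is shown below. It assumes scalar context features normalized to [0, 1] and fixes the novelty weight to a constant, since the exact novelty computation is not given here; emotion weights outside the very positive/very negative categories are likewise set to the lower bound of the stated range as an assumption.

```python
import math
import torch.nn.functional as F

def total_similarity(exp, query_pattern, query_context, t_current,
                     alpha=0.4, beta=0.2, gamma=0.2, delta=0.2):
    """Weighted combination of pattern, context, recency, and importance terms."""
    # Pattern similarity: cosine between compressed 128-d features (1-D tensors).
    s_pattern = F.cosine_similarity(exp.pattern, query_pattern, dim=0).item()
    # Contextual similarity: mean per-key agreement over shared scalar context features.
    keys = set(exp.context) & set(query_context)
    s_context = sum(1.0 - min(abs(exp.context[k] - query_context[k]), 1.0)
                    for k in keys) / max(len(keys), 1)
    # Recency boost: exponential decay with a time constant of 100 steps.
    s_recency = math.exp(-(t_current - exp.timestamp) / 100.0)
    # Importance: average of emotional, access, and (here constant) novelty weights.
    emotion_w = {"very_positive": 1.0, "very_negative": 0.9}.get(exp.valence, 0.3)
    access_w = 1.0 + 0.1 * exp.access_count
    s_importance = (emotion_w + access_w + 1.0) / 3.0
    return alpha * s_pattern + beta * s_context + gamma * s_recency + delta * s_importance
```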
$A_{\text{memory}} = \mathrm{MultiHead}\big(\phi_q, \{\phi_1, \ldots, \phi_k\}, \{\phi_1, \ldots, \phi_k\}\big),$
where the multi-head attention employs 8 attention heads, each computing:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{QK^{T}}{\sqrt{d_k}}\Big)V,$
with $Q = \phi_q W_Q$, $K = [\phi_1, \ldots, \phi_k] W_K$, and $V = [\phi_1, \ldots, \phi_k] W_V$, where $W_Q, W_K, W_V \in \mathbb{R}^{128 \times 16}$ are learned projection matrices for each attention head, and $d_k = 16$ is the dimension per head. The attended memory representations are then fused with current query features through a learned combination:
$\phi_{\text{enhanced}} = \Theta_{\text{fusion}}\big(\mathrm{Cat}(\phi_q, A_{\text{memory}})\big),$
where $\Theta_{\text{fusion}}: \mathbb{R}^{256} \rightarrow \mathbb{R}^{128}$ integrates current and historical information through a learned linear transformation, enabling the memory system to adaptively weight the contribution of recalled experiences based on their relevance to the current segmentation task.
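This retrieval-and-fusion step maps directly onto PyTorch's multi-head attention, as sketched below with 8 heads over 128-dimensional memory patterns; the class name and wiring are illustrative.

```python
import torch
import torch.nn as nn

class MemoryFusionSketch(nn.Module):
    """Attend over the k recalled patterns, then fuse with the current query."""
    def __init__(self, mem_dim=128, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(mem_dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * mem_dim, mem_dim)   # fusion: R^256 -> R^128

    def forward(self, phi_q, recalled):
        # phi_q: (B, 128) current query; recalled: (B, k, 128) retrieved memory patterns.
        q = phi_q.unsqueeze(1)                          # (B, 1, 128)
        a_memory, _ = self.attn(q, recalled, recalled)  # keys = values = recalled patterns
        a_memory = a_memory.squeeze(1)                  # (B, 128)
        return self.fuse(torch.cat([phi_q, a_memory], dim=-1))  # phi_enhanced
```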

3.4.4. Memory-Guided Feature Enhancement

The Memory Enhancement component, as illustrated in Figure 2, modifies backbone features through learned projection of memory-derived insights. This process represents the culmination of memory system processing, where AEK directly influences current feature representations:
$F^{(3)}_{\text{enhanced}} = F^{(3)} + \lambda \cdot \Theta_{\text{influence}}(\phi_{\text{enhanced}}) \odot \mathbf{1}_{H_3 \times W_3},$
where $\Theta_{\text{influence}}: \mathbb{R}^{128} \rightarrow \mathbb{R}^{1536}$ projects memory information to match backbone feature dimensions, $\lambda = 0.2$ controls memory influence strength, and $\odot$ represents element-wise multiplication broadcast across spatial dimensions. This enables the memory system to provide contextual guidance that improves segmentation accuracy, particularly in challenging scenarios where current visual information alone may prove insufficient for reliable road area detection. The enhanced features then proceed through the standard UPerNet decoder processing, benefiting from memory-informed representations that capture learned associations and contextual patterns accumulated during training.
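A minimal sketch of this enhancement step, assuming only the projection and spatial broadcast described above:

```python
import torch
import torch.nn as nn

class MemoryEnhancementSketch(nn.Module):
    """Project the fused memory vector to Level-3 channels and add it as spatial guidance."""
    def __init__(self, mem_dim=128, feat_ch=1536, lam=0.2):
        super().__init__()
        self.influence = nn.Linear(mem_dim, feat_ch)    # projection: R^128 -> R^1536
        self.lam = lam                                  # memory influence strength lambda

    def forward(self, f3, phi_enhanced):
        # f3: (B, 1536, H3, W3); phi_enhanced: (B, 128)
        bias = self.influence(phi_enhanced)[:, :, None, None]  # broadcast over H3 x W3
        return f3 + self.lam * bias                     # additive memory guidance
```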

3.5. Training Strategy and Memory Dynamics

Our training methodology carefully orchestrates the dual objectives of accurate segmentation and effective memory formation through multi-objective optimization. The system operates in continuous learning mode, where each prediction contributes to both immediate performance evaluation and long-term knowledge accumulation. The training objective combines segmentation accuracy with memory utilization effectiveness:
$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{seg}} + \mu \cdot \mathcal{L}_{\text{memory}}.$
Segmentation loss balances pixel-wise accuracy with boundary preservation through a weighted combination of Binary Cross-Entropy (BCE) and Dice formulations:
$\mathcal{L}_{\text{seg}} = 0.4 \cdot \mathcal{L}_{\text{BCE}}(\hat{Y}, Y) + 0.6 \cdot \mathcal{L}_{\text{Dice}}(\hat{Y}, Y),$
where the BCE loss is computed as:
$\mathcal{L}_{\text{BCE}}(\hat{Y}, Y) = -\frac{1}{N} \sum_{i=1}^{N} \Big[ Y_i \log\big(\sigma(\hat{Y}_i)\big) + (1 - Y_i) \log\big(1 - \sigma(\hat{Y}_i)\big) \Big],$
and the Dice loss encourages spatial coherence and is computed as:
$\mathcal{L}_{\text{Dice}}(\hat{Y}, Y) = 1 - \frac{2 \sum_{i=1}^{N} \sigma(\hat{Y}_i)\, Y_i + \epsilon}{\sum_{i=1}^{N} \sigma(\hat{Y}_i) + \sum_{i=1}^{N} Y_i + \epsilon},$
where $\sigma$ denotes the sigmoid activation and $\epsilon = 1 \times 10^{-6}$ provides numerical stability. Furthermore, memory regularization encourages effective utilization of retrieved experiences through explicit alignment between current query features and recalled memory representations:
$\mathcal{L}_{\text{memory}} = \frac{1}{K} \sum_{k=1}^{K} \big\| \phi_{\text{query}} - \phi_{\text{memory}}^{(k)} \big\|_2^2,$
where $\phi_{\text{query}} \in \mathbb{R}^{128}$ represents the current compressed query features, $\phi_{\text{memory}}^{(k)} \in \mathbb{R}^{128}$ denotes the $k$-th recalled memory pattern, and $K$ is the number of retrieved memories (typically $K = 9$). The memory regularization weight $\mu = 0.1$ balances segmentation accuracy with memory coherence, ensuring that retrieved experiences provide meaningful contextual guidance without overwhelming the primary segmentation objective and promoting alignment between current features and recalled memories when such alignment benefits prediction accuracy.
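The combined objective can be sketched as follows; the weights (0.4/0.6 for BCE/Dice and μ = 0.1) follow the text, while the batch reduction is an assumption.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits, target, eps=1e-6):
    """0.4 * BCE + 0.6 * Dice, computed on sigmoid probabilities."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)
    return 0.4 * bce + 0.6 * dice

def memory_loss(phi_query, recalled_patterns):
    """Mean squared distance between the query and the K recalled patterns."""
    # phi_query: (B, 128); recalled_patterns: (B, K, 128)
    return ((phi_query.unsqueeze(1) - recalled_patterns) ** 2).sum(dim=-1).mean()

def total_loss(logits, target, phi_query, recalled_patterns, mu=0.1):
    """Segmentation loss plus memory regularization."""
    return segmentation_loss(logits, target) + mu * memory_loss(phi_query, recalled_patterns)
```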
Beyond basic storage and retrieval, our memory system implements dynamics that mirror human memory processes. The forgetting mechanism applies gradual decay with access-dependent preservation, ensuring that frequently accessed and important memories persist while less relevant experiences fade naturally:
$\iota_{t+1}^{(i)} = \iota_t^{(i)} \cdot \rho_{\text{decay}} \cdot \big(1 + 0.1 \cdot \alpha_t^{(i)}\big),$
where $\rho_{\text{decay}} = 0.995$ represents the base decay rate, $\alpha_t^{(i)}$ counts access frequency, and $\iota_t^{(i)}$ denotes the importance score of memory $i$ at time $t$. This implements a “use it or lose it” principle that maintains system efficiency while preserving valuable experiential knowledge. Memory consolidation occurs through sleep-like processes implemented at regular intervals during training, strengthening important memories while optimizing storage efficiency:
$\iota_{\text{consolidated}}^{(i)} = \iota_{\text{current}}^{(i)} \cdot \begin{cases} 1.1 & \text{if } \epsilon_i \in \{\text{very\_positive}, \text{very\_negative}\}, \\ 1.0 & \text{otherwise}, \end{cases}$
where emotional experiences receive importance boosts during consolidation, reflecting the psychological principle that emotionally significant events are preferentially retained in long-term memory. The Memory Bank capacity management employs intelligent forgetting based on comprehensive importance metrics rather than simple temporal ordering:
$\mathcal{F}_i = \arg\min_i \; \iota_t^{(i)} \cdot s_{\text{recency}}(e_i) \cdot w_{\text{emotion}}(e_i),$
where $\mathcal{F}_i$ is the index of the item to be forgotten, ensuring that the least important memories are removed when capacity constraints require memory optimization. The training process employs progressive memory integration, beginning with reduced memory influence during early epochs and gradually increasing memory utilization as the system develops proficiency in both segmentation and memory management. This curriculum approach ensures stable convergence while encouraging memory-prediction interactions as training progresses.
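These dynamics reduce to a few simple update rules, sketched below for the `Experience` records introduced earlier; reusing the recency time constant and emotion weights from the retrieval section is an assumption about how the released code organizes them.

```python
import math

def decay_importances(experiences, rho_decay=0.995):
    """Gradual importance decay softened by access frequency."""
    for e in experiences:
        e.importance *= rho_decay * (1.0 + 0.1 * e.access_count)

def consolidate(experiences):
    """Sleep-like consolidation: boost emotionally salient memories."""
    for e in experiences:
        if e.valence in ("very_positive", "very_negative"):
            e.importance *= 1.1

def forget_if_full(experiences, capacity, t_current):
    """Drop the least important memory whenever the bank exceeds its capacity."""
    def emotion_w(e):
        return {"very_positive": 1.0, "very_negative": 0.9}.get(e.valence, 0.3)

    def score(e):
        recency = math.exp(-(t_current - e.timestamp) / 100.0)
        return e.importance * recency * emotion_w(e)

    while len(experiences) > capacity:
        experiences.remove(min(experiences, key=score))
```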

4. Experiments and Results

This section presents a comprehensive experimental validation of the proposed framework. We demonstrate its effectiveness through systematic evaluation against state-of-the-art methodologies and detailed ablation studies examining individual component contributions.

4.1. Datasets

Virtual KITTI 2 [6] provides approximately 21,000 photo-realistic synthetic images spanning diverse weather conditions, lighting variations, and camera perspectives. This dataset reproduces five original KITTI sequences with comprehensive multimodal data, including RGB imagery, depth information, and precise semantic segmentation annotations, making it particularly valuable for memory system training due to its controlled environmental variations.
KITTI road [4] comprises 289 training and 290 test images captured through car-mounted stereo camera systems in real urban environments. This dataset is organized into three distinct categories: UM (Urban Marked) featuring 95 training and 96 test images of roads with clearly visible lane markings, UMM (Urban Multiple Marked lanes) containing 96 training and 94 test images depicting complex multi-lane scenarios with multiple lane markings, and UU (Urban Unmarked) encompassing 98 training and 100 test images of roads without visible lane markings. The dataset features high-resolution imagery with meticulously annotated ground-truth road/non-road binary masks for the training set, representing the gold standard benchmark for road area segmentation evaluation.
Cityscapes [3] encompasses approximately 5000 finely annotated images supplemented by 20,000 coarsely annotated samples across 50 European cities. This dataset provides pixel-level annotations for 30 distinct semantic classes, including detailed road surface annotations, offering extensive diversity in urban road scenarios for comprehensive generalization assessment.
R2D [41] offers synthetic imagery with precise road area annotations enhanced by surface normal information. This dataset complements real-world data with controlled synthetic scenarios, enabling systematic evaluation of MA approaches across varied environmental conditions.

4.2. Training and Testing Protocol

We establish a unified training corpus by combining Virtual KITTI 2, KITTI road, Cityscapes, and R2D training images, creating a comprehensive dataset that balances synthetic diversity with real-world authenticity. This configuration enables our memory system to accumulate experiences across both controlled synthetic environments and authentic driving scenarios. Some samples from the datasets, along with images and their ground-truth masks, are shown in Figure 4. Evaluation employs the official KITTI road test set through provider-based evaluation protocols, supplemented by results on the Cityscapes and R2D datasets to assess generalization capabilities.
To enhance model robustness and facilitate effective memory formation across diverse scenarios, we employ comprehensive data augmentation strategies. These transformations include random horizontal flips with 50% probability, random rotations spanning −15° to +15°, and color jittering with brightness, contrast, saturation, and hue adjustments of up to 20%. Additionally, we apply geometric transformations including translations up to 10% of image dimensions and scaling factors ranging from 0.9 to 1.1, supplemented by random cropping and selective blurring with 30% probability. It should be noted that our framework processes only monocular RGB images as input, without utilizing stereo depth information or any additional modalities.
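An illustrative torchvision configuration of the image-side augmentations is given below; in practice the geometric transforms must be applied jointly to the image and its mask with shared parameters, and random cropping is omitted here for brevity. Exact parameterization in the released code may differ.

```python
import torchvision.transforms as T

# Image-side augmentation pipeline matching the strategies listed above (illustrative).
train_transforms = T.Compose([
    T.Resize((640, 640)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=5)], p=0.3),
    T.ToTensor(),
])
```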

4.3. Experimental Setup

All experiments were conducted using the PyTorch version 2.9.0 deep-learning framework on a computational system equipped with 8 shared NVIDIA RTX A6000 GPUs (NVIDIA, Santa Clara, CA, USA) and 755 GB RAM, powered by 2 Intel Xeon Gold 6330 processors (Intel, Santa Clara, CA, USA) with 112 cores. The InternImage-XL backbone was initialized using official pretrained weights to leverage established feature representations, while our Memory Bank system was trained from scratch to develop domain-specific associative patterns. Hyperparameter configuration was carefully optimized for training, as shown in Table 1. The Adam optimizer was employed with an initial learning rate of $1 \times 10^{-4}$, complemented by ReduceLROnPlateau scheduling with 2-epoch patience and a 0.1 reduction factor. Our combined Dice and BCE loss formulation balanced pixel-wise accuracy with boundary preservation. Training employed a batch size of 2 to accommodate the computational overhead of memory operations, with input images resized to $640 \times 640$ pixels to preserve spatial detail essential for effective memory formation and retrieval.
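A minimal sketch of this optimizer and scheduler configuration follows; a placeholder module stands in for the full network so the snippet runs standalone.

```python
import torch
import torch.nn as nn

# Illustrative optimizer/scheduler setup following the configuration above.
model = nn.Conv2d(3, 1, 1)  # placeholder for the full MemRoadNet network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=2)

# After each epoch, step the scheduler on the monitored validation loss:
# scheduler.step(val_loss)
```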
We follow standard evaluation protocols using official test splits with fixed random seeds for reproducibility. The consistent performance across various datasets shows our method works reliably beyond single test results.

4.4. Evaluation Metrics

We adopt the standard evaluation protocol established by [4] for comprehensive performance assessment. Primary metrics include Precision (PRE), Average Precision (AP), Recall (REC), Intersection over Union (IoU), Accuracy (ACC), False Positive Rate (FPR), False Negative Rate (FNR), and Maximum F1-Score (MaxF) for quantitative segmentation quality evaluation. MaxF represents the Maximum F1-Score across all confidence thresholds and serves as the primary performance indicator following KITTI road benchmark conventions.

4.5. Comparison with State-of-the-Art Multimodal Methods

To assess the effectiveness of our framework, we conducted extensive comparisons with cutting-edge multimodal road segmentation methodologies. These approaches incorporate sensor fusion techniques, integrating diverse information sources including RGB imagery, LiDAR point clouds, surface normal estimations, and depth data to achieve superior segmentation performance through complementary modality exploitation.
Table 2 presents a comprehensive performance evaluation on the official KITTI road benchmark, demonstrating our framework’s competitive positioning within the established state-of-the-art landscape. Our approach achieves a MaxF score of 0.9666, placing it among the top-performing methodologies (on KITTI leaderboard) while operating exclusively on single-modality, i.e., RGB input only. The framework exhibits particularly robust precision (0.9395), indicating excellent discriminative capabilities that effectively minimize false road predictions, a critical requirement for autonomous driving safety.
The comparative analysis reveals that despite operating without the informational advantages of multimodal sensor fusion, our approach maintains performance levels that are remarkably competitive with approaches leveraging LiDAR, depth, and surface normal information. This achievement underscores the effectiveness of MA inference in compensating for single-modality constraints through intelligent utilization of AEK. The slightly elevated false positive and false negative rates reflect the inherent challenges of single-modality processing, yet the overall performance demonstrates that memory mechanisms can effectively bridge the information gap typically addressed through sensor diversity. While multimodal approaches achieve marginally superior performance, they require complex sensor suites, sophisticated calibration procedures, and substantial computational resources that may not be practical for all autonomous vehicle configurations.
Figure 5 provides a detailed visual analysis of our framework’s performance on the KITTI road benchmark, demonstrating segmentation quality through both perspective and Bird’s Eye View (BEV) representations. The comprehensive evaluation reveals that our approach achieves robust road detection across diverse scenarios, with minimal false negatives (red regions) and well-controlled false positives (blue regions), while maintaining excellent true positive coverage (green regions) that accurately delineates drivable road areas.

4.6. Comparison with State-of-the-Art Single-Modality Methods

To provide a comprehensive evaluation context, we conducted a detailed comparison with leading single-modality road segmentation approaches. Our method shows competitive performance through architectural innovations and algorithmic optimizations.
The single-modality comparison presented in Table 3 reveals that our framework achieves notable performance advantages over existing RGB-only approaches. With a MaxF score of 0.9666, our method substantially outperforms the leading single-modality approaches, including RBANet (0.9630) and LC-CRF (0.9568), while demonstrating strong precision capabilities (0.9395) that exceed most compared methodologies.
This performance comparison highlights the transformative impact of memory augmentation in single-modality processing. Traditional RGB-only approaches are fundamentally constrained by the limited information available in individual frames, requiring architectural innovations to achieve competitive performance. Our framework transcends these limitations by accumulating and leveraging experiential knowledge from previously encountered scenarios, effectively expanding the informational context available for segmentation decisions. Furthermore, the competitive precision achieved by our approach indicates that memory-guided inference provides valuable contextual discrimination that helps distinguish genuine road areas from ambiguous regions that might confuse traditional feed-forward processing. This capability is particularly valuable in challenging scenarios, including shadowed road surfaces, complex geometric configurations, and partial occlusions, where accumulated experiential patterns can provide decisive contextual guidance. Additionally, our framework demonstrates strong recall performance while maintaining competitive precision compared with single-modality approaches, indicating a favorable balance between comprehensive road detection and false positive minimization. This characteristic is essential for autonomous driving applications where both missed road areas and incorrectly identified obstacles can compromise navigation safety.
In order to provide a better understanding, Table 4 shows the performance of our framework across different KITTI road categories. The results for the KITTI dataset are calculated using the official evaluation server and can be viewed at the official website (https://www.cvlibs.net/datasets/kitti/eval_road_detail.php?result=d9e4ae6781d5c6dbf01d5799bfbf1665afd89a8b, accessed on 19 October 2025). Furthermore, Table 4 reports the official precision and recall values for each category.
To provide better insights to the readers, the official precision and recall curves are presented in Figure 6.

4.7. Comprehensive Ablation Studies

We conducted systematic ablation studies to evaluate the individual contributions of different memory system components and design choices. These experiments were performed on a carefully curated evaluation subset combining Virtual KITTI 2 and KITTI road datasets, enabling controlled assessment of memory dynamics and their impact on segmentation performance.

4.7.1. Impact of Memory System Integration

Figure 7 demonstrates the practical effectiveness of memory augmentation through direct visual comparison across challenging road scenarios.
The comparative analysis reveals that memory integration provides substantial improvements in challenging scenarios. Green-highlighted regions demonstrate successful recovery of road segments missed by the baseline approach, particularly in complex geometric configurations and shadowed areas. Blue regions indicate reduced false positives through AEK, while black regions show comparable performance between approaches. Memory augmentation proves especially effective in ambiguous boundary conditions where visual information alone proves insufficient. The system leverages accumulated contextual associations to distinguish genuine road areas from challenging background regions, demonstrating the practical value of experiential knowledge accumulation in spatial reasoning tasks.

4.7.2. Memory Bank Capacity Analysis

Memory capacity analysis reveals that larger memory banks provide enhanced performance through increased diversity of stored experiences. The 200-memory configuration achieves an optimal balance between experiential coverage and computational efficiency, enabling comprehensive knowledge retention while preserving real-time processing capabilities essential for autonomous driving applications. The progressive improvement with increased capacity demonstrates the memory system’s ability to leverage richer experiential knowledge for enhanced segmentation accuracy.

4.7.3. Memory Influence Weight Optimization

Memory weight analysis demonstrates that a stronger memory influence ( λ = 0.5 ) yields optimal performance with a MaxF of 0.9907 and exceptional average precision of 0.9995. This configuration effectively balances current visual information with AEK, enabling the memory system to provide substantial contextual guidance while preserving the network’s ability to process novel scenarios. The reduced false positive rate (0.0024) at higher memory weights indicates that accumulated experiences help discriminate challenging road boundaries more effectively.

4.7.4. Loss Function Component Analysis

Loss function analysis in Table 5 reveals that Binary Cross-Entropy (BCE) achieves superior individual performance (for our method), while the combined BCE+Dice formulation provides balanced optimization for both pixel-wise accuracy and boundary preservation. The combined approach demonstrates robust performance across all metrics, making it optimal for MA training where both precise classification and sharp boundary delineation are essential for effective experiential knowledge accumulation.

4.7.5. Pretrained Weight Initialization Impact

Pretrained weight initialization analysis in Table 6 indicates that both approaches achieve comparable performance, with training from scratch showing marginal advantages in recall and false negative rate. This suggests that the MA framework can effectively develop domain-specific representations regardless of initialization strategy, highlighting the robustness of the memory-guided learning process in adapting to road segmentation tasks.

4.8. Performance on R2D

The R2D dataset evaluation examines our framework’s effectiveness on synthetic road scenarios enhanced with surface normal information, providing a controlled assessment of memory-guided segmentation across varied geometric configurations and lighting conditions.
The R2D evaluation results presented in Table 7 demonstrate our framework’s robust cross-domain adaptation capabilities. Achieving a MaxF score of 0.9490 with particularly strong precision, our approach maintains competitive performance despite the domain shift from real-world KITTI training data to synthetic R2D scenarios. This performance indicates that the AEK successfully generalizes across different visual characteristics and geometric configurations. The memory system’s effectiveness in synthetic environments suggests that the learned associative patterns capture fundamental road segmentation principles that transcend specific imaging conditions. The competitive precision performance (0.9545) compared with several established multimodal approaches highlights the Memory Bank’s discriminative capabilities in distinguishing genuine road areas from challenging background regions, even when encountering novel synthetic visual characteristics not present in the original training distribution.

4.9. Performance on Cityscapes

Cityscapes evaluation provides a comprehensive assessment of memory-guided segmentation across diverse European urban environments, examining the framework’s adaptability to varied architectural styles, road configurations, and metropolitan driving scenarios.
The Cityscapes evaluation results demonstrated in Table 8 reveal compelling evidence of our memory system’s robust generalization across diverse metropolitan environments. Achieving a MaxF score of 0.9189 with comparable recall performance, our framework successfully adapts to the complex urban road configurations characteristic of European cities while maintaining competitive positioning among established methodologies. The competitive recall performance indicates that our approach excels at comprehensive road area detection across the varied geometric configurations, intersection patterns, and architectural contexts present in Cityscapes imagery. This capability suggests that AEK effectively captures generalizable road segmentation patterns that transcend specific geographical and architectural characteristics, enabling robust performance across diverse urban environments.

4.10. Computational Efficiency Analysis

Our framework achieves a favorable performance-efficiency trade-off. While requiring additional memory operations and computational resources (358 M parameters, 476 G FLOPs), the framework achieves near state-of-the-art performance through intelligent utilization of accumulated experiences. Compared with top-performing methods such as SNE-RoadSeg (1950.2G FLOPs) and PLARD (1147.6G FLOPs), our approach demonstrates competitive segmentation quality, making it suitable for autonomous driving applications where both accuracy and resource considerations are important.
Real-time performance is crucial for autonomous driving applications. Our framework achieves an average inference speed of 18.5 FPS on a shared NVIDIA RTX A6000 GPU for 640 × 640 input images. The memory system’s computational overhead remains acceptable for real-time deployment, as the quality improvements justify the speed reduction compared with baseline approaches.

4.11. Qualitative Analysis

Figure 8 presents a qualitative comparison of road segmentation results demonstrating the effectiveness of MA inference across challenging scenarios. The memory system particularly excels in complex road geometries, shadowed regions, and ambiguous boundary conditions where AEK provides valuable contextual guidance for accurate segmentation decisions.
The qualitative results reveal that memory augmentation enhances segmentation consistency and boundary precision, particularly in challenging scenarios where traditional feed-forward processing encounters ambiguity. The memory system’s ability to recall relevant experiential patterns enables more confident and accurate predictions in complex driving environments.

5. Limitations, Environmental Impact, and Future Directions

While our framework demonstrates competitive performance, several limitations warrant discussion. The current implementation achieves 18.5 FPS on an NVIDIA RTX A6000 GPU. This may fall short of ultra-high-speed requirements for certain applications. Evaluation on embedded automotive-grade hardware (e.g., NVIDIA Xavier, Orin) would provide valuable deployment insights. However, consistent with other state-of-the-art methods in the literature, our results are reported on non-embedded GPU platforms. The memory system introduces computational overhead. However, this remains competitive with other high-performing methods such as SNE-RoadSeg and PLARD, as shown in Table 9.
The fixed memory capacity (200 experiences) with importance-based forgetting may introduce catastrophic forgetting risks during extended deployment when encountering continuously diverse scenarios. The current mechanism may inadvertently discard valuable rare-event patterns crucial for handling edge cases. Future work will investigate replay-based consolidation strategies and hierarchical memory architectures to address this limitation.
Framework performance depends on several hyperparameters, including memory size, influence weight ( λ ), and top-k retrieval count. While ablation studies (Table 5, Table 10 and Table 11) demonstrate robustness across tested configurations, optimal settings may vary across different deployment scenarios. The emotional valence categorization relies on fixed IoU thresholds. More sophisticated adaptive quality metrics could improve memory formation effectiveness.
Generalization to significantly different real-world conditions beyond our training distribution remains challenging. Our training data primarily covers public datasets. Performance may degrade when encountering extreme weather conditions, unpaved rural roads, or novel infrastructure designs not present in training data. Future research will explore domain adaptation techniques and active learning strategies to improve robustness across diverse operational conditions.
Regarding CO₂ emissions, training our model for 100 epochs produces approximately 12–15 kg CO₂ equivalent (assuming a grid intensity of 0.429 kg CO₂/kWh [42]).
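Under the stated grid intensity, the reported emission range implies a total training energy of roughly:

```latex
E_{\text{train}} \approx \frac{12\text{--}15\ \text{kg CO}_2}{0.429\ \text{kg CO}_2/\text{kWh}} \approx 28\text{--}35\ \text{kWh}.
```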
Future work could explore memory compression and efficient retrieval mechanisms. Adaptive capacity management could address memory constraints. Finally, investigating meta-learning approaches for automated hyperparameter optimization would enable self-tuning memory systems capable of adjusting to diverse operational conditions.

6. Conclusions

In this paper, we present a free road space detection framework that integrates human-inspired cognitive architectures with deep-learning models for enhanced image feature extraction, contributing to the advancement of sensor-based computer vision systems. The proposed Human-like Memory Bank system implements episodic, semantic, and working memory subsystems with biologically inspired consolidation and forgetting mechanisms. We demonstrate the effectiveness of experiential knowledge accumulation in improving detection performance. Experimental results indicate that our single-modality RGB approach achieves superior performance among all single-modality methods and competitive performance approaching state-of-the-art multimodal systems through intelligent memory utilization. The comprehensive ablation studies confirm the individual contributions of different memory components. The framework’s ability to maintain this competitive performance while operating exclusively on RGB input demonstrates a favorable performance-efficiency trade-off compared with methods with significantly higher computational requirements, effectively narrowing the gap between single-modality and multimodal approaches. Future research directions include extending memory mechanisms to multi-task learning scenarios, investigating adaptive memory consolidation strategies, and exploring applications to other computer vision tasks requiring contextual reasoning and experiential knowledge accumulation.

Author Contributions

Conceptualization, S.S. and A.A.K.; methodology, S.S. and J.S.; software, S.S. and A.A.K.; validation, A.A.K. and J.S.; writing—original draft preparation, S.S. and A.A.K.; writing—review and editing, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Foreign Expert Project of China (No. H20240938) and Sichuan Science and Technology Program (No. 2025HJRC0021).

Data Availability Statement

Training and evaluation are conducted on publicly available datasets (KITTI road, Cityscapes, Virtual KITTI 2, R2D) with standard evaluation protocols. The code, weights, and other materials can be found at https://github.com/abdkhanstd/MemRoadNet (accessed on 19 October 2025).

Acknowledgments

During the preparation of this work, the author(s) used a personally fine-tuned version of BART (2019-10-29) and Qwen 2.5 in order to perform language correction and enhance the clarity of technical writing. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, X.; Dong, Y.; Li, X.; Zheng, X.; Liu, H.; Li, T. Drivable area recognition on unstructured roads for autonomous vehicles using an optimized bilateral neural network. Sci. Rep. 2025, 15, 13533. [Google Scholar] [CrossRef]
  2. Zhao, J.; Wu, Y.; Deng, R.; Xu, S.; Gao, J.; Burke, A.F. A Survey of Autonomous Driving from a Deep Learning Perspective. ACM Comput. Surv. 2025, 57, 263. [Google Scholar] [CrossRef]
  3. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  4. Fritsch, J.; Kühnl, T.; Geiger, A. A new performance measure and evaluation benchmark for road detection algorithms. In Proceedings of the 16th International IEEE Conference on Intelligent Transportation Systems, ITSC 2013, The Hague, The Netherlands, 6–9 October 2013; pp. 1693–1700. [Google Scholar]
  5. Lin, H.Y.; Chang, C.K.; Tran, V.L. Lane detection networks based on deep neural networks and temporal information. Alex. Eng. J. 2024, 98, 10–18. [Google Scholar] [CrossRef]
  6. Cabon, Y.; Murray, N.; Humenberger, M. Virtual KITTI 2. arXiv 2020, arXiv:2001.10773. [Google Scholar]
  7. Amponis, G.; Lagkas, T.; Argyriou, V.; Radoglou-Grammatikis, P.I.; Kyranou, K.; Makris, I.; Sarigiannidis, P.G. Channel-Aware QUIC Control for Enhanced CAM Communications in C-V2X Deployments Over Aerial Base Stations. IEEE Trans. Veh. Technol. 2024, 73, 9320–9333. [Google Scholar] [CrossRef]
  8. Rae, J.W.; Hunt, J.J.; Danihelka, I.; Harley, T.; Senior, A.W.; Wayne, G.; Graves, A.; Lillicrap, T. Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes. In Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, Barcelona, Spain, 5–10 December 2016; pp. 3621–3629. [Google Scholar]
  9. Santoro, A.; Bartunov, S.; Botvinick, M.M.; Wierstra, D.; Lillicrap, T.P. Meta-Learning with Memory-Augmented Neural Networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York, NY, USA, 19–24 June 2016; JMLR: Norfolk, MA, USA, 2016; pp. 1842–1850. [Google Scholar]
  10. Tulving, E. Episodic and semantic memory. Organ. Mem. 1972, 1, 381–403. [Google Scholar]
  11. Baddeley, A. The episodic buffer: A new component of working memory? Trends Cogn. Sci. 2000, 4, 417–423. [Google Scholar] [CrossRef] [PubMed]
  12. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 14408–14419. [Google Scholar]
  13. Wang, Z. Combining UPerNet and ConvNeXt for Contrails Identification to reduce Global Warming. arXiv 2023, arXiv:2310.04808. [Google Scholar]
  14. Khan, A.A.; Shao, J.; Rao, Y.; She, L.; Shen, H.T. LRDNet: Lightweight LiDAR Aided Cascaded Feature Pools for Free Road Space Detection. IEEE Trans. Multim. 2025, 27, 652–664. [Google Scholar] [CrossRef]
  15. Xue, F.; Chang, Y.; Xu, W.; Liang, W.; Sheng, F.; Ming, A. Evidence-Based Real-Time Road Segmentation With RGB-D Data Augmentation. IEEE Trans. Intell. Transp. Syst. 2025, 26, 1482–1493. [Google Scholar] [CrossRef]
  16. Zhou, H.; Xue, F.; Li, Y.; Gong, S.; Li, Y.; Zhou, Y. Exploiting Low-Level Representations for Ultra-Fast Road Segmentation. IEEE Trans. Intell. Transp. Syst. 2024, 25, 9909–9919. [Google Scholar] [CrossRef]
  17. Gu, S.; Yang, J.; Kong, H. A Cascaded LiDAR-Camera Fusion Network for Road Detection. In Proceedings of the IEEE International Conference on Robotics and Automation, ICRA 2021, Xi’an, China, 30 May–5 June 2021; pp. 13308–13314. [Google Scholar]
  18. Chen, Z.; Zhang, J.; Tao, D. Progressive LiDAR adaptation for road detection. IEEE/CAA J. Autom. Sin. 2019, 6, 693–702. [Google Scholar] [CrossRef]
  19. Chang, Y.; Xue, F.; Sheng, F.; Liang, W.; Ming, A. Fast Road Segmentation via Uncertainty-aware Symmetric Network. In Proceedings of the 2022 International Conference on Robotics and Automation, ICRA 2022, Philadelphia, PA, USA, 23–27 May 2022; pp. 11124–11130. [Google Scholar]
  20. Wang, H.; Fan, R.; Cai, P.; Liu, M. SNE-RoadSeg+: Rethinking Depth-Normal Translation and Deep Supervision for Freespace Detection. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2021, Prague, Czech Republic, 27 September–1 October 2021; pp. 1140–1145. [Google Scholar]
  21. Feng, Y.; Ma, Y.; Andreev, S.; Chen, Q.; Dvorkovich, A.V.; Pitas, I.; Fan, R. SNE-RoadSegV2: Advancing Heterogeneous Feature Fusion and Fallibility Awareness for Freespace Detection. IEEE Trans. Instrum. Meas. 2025, 74, 2512109. [Google Scholar] [CrossRef]
  22. Chen, S.; Han, T.; Zhang, C.; Liu, W.; Su, J.; Wang, Z.; Cai, G. Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes. arXiv 2024, arXiv:2409.07995. [Google Scholar]
  23. Li, J.; Zhang, Y.; Yun, P.; Zhou, G.; Chen, Q.; Fan, R. RoadFormer: Duplex Transformer for RGB-Normal Semantic Road Scene Parsing. IEEE Trans. Intell. Veh. 2024, 9, 5163–5172. [Google Scholar] [CrossRef]
  24. Huang, J.; Li, J.; Jia, N.; Sun, Y.; Liu, C.; Chen, Q.; Fan, R. RoadFormer+: Delivering RGB-X Scene Parsing through Scale-Aware Information Decoupling and Advanced Heterogeneous Feature Fusion. IEEE Trans. Intell. Veh. 2024, 10, 3156–3165. [Google Scholar] [CrossRef]
  25. Wang, H.; Fan, R.; Sun, Y.; Liu, M. Dynamic Fusion Module Evolves Drivable Area and Road Anomaly Detection: A Benchmark and Algorithms. IEEE Trans. Cybern. 2022, 52, 10750–10760. [Google Scholar] [CrossRef] [PubMed]
  26. Milli, E.; Erkent, Ö.; Yilmaz, A.E. Multi-Modal Multi-Task (3MT) Road Segmentation. IEEE Robot. Autom. Lett. 2023, 8, 5408–5415. [Google Scholar] [CrossRef]
  27. Sun, L.; Zhang, H.; Yin, W. Pseudo-LiDAR-Based Road Detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5386–5398. [Google Scholar] [CrossRef]
  28. Bayón-Gutiérrez, M.; García-Ordás, M.T.; Alaiz-Moretón, H.; Aveleira-Mata, J.; Rubio-Martín, S.; Benítez-Andrades, J.A. TEDNet: Twin Encoder Decoder Neural Network for 2D Camera and LiDAR Road Detection. Log. J. IGPL 2024, 33, jzae048. [Google Scholar] [CrossRef]
  29. Bayón-Gutiérrez, M.; Benítez-Andrades, J.A.; Rubio-Martín, S.; Aveleira-Mata, J.; Alaiz-Moretón, H.; García-Ordás, M.T. Roadway Detection Using Convolutional Neural Network Through Camera and LiDAR Data. In Proceedings of the Hybrid Artificial Intelligent Systems—17th International Conference, HAIS 2022, Salamanca, Spain, 5–7 September 2022; pp. 419–430. [Google Scholar]
  30. Ni, T.; Zhan, X.; Luo, T.; Liu, W.; Shi, Z.; Chen, J. UdeerLID+: Integrating LiDAR, Image, and Relative Depth with Semi-Supervised. arXiv 2024, arXiv:2409.06197. [Google Scholar]
  31. Gu, S.; Zhang, Y.; Tang, J.; Yang, J.; Kong, H. Road Detection through CRF based LiDAR-Camera Fusion. In Proceedings of the International Conference on Robotics and Automation, ICRA 2019, Montreal, QC, Canada, 20–24 May 2019; pp. 3832–3838. [Google Scholar]
  32. Lyu, Y.; Bai, L.; Huang, X. Road Segmentation using CNN and Distributed LSTM. In Proceedings of the IEEE International Symposium on Circuits and Systems, ISCAS 2019, Sapporo, Japan, 26–29 May 2019; pp. 1–5. [Google Scholar]
  33. Lyu, Y.; Bai, L.; Huang, X. ChipNet: Real-Time LiDAR Processing for Drivable Region Segmentation on an FPGA. IEEE Trans. Circuits Syst. I Regul. Pap. 2019, 66-I, 1769–1779. [Google Scholar] [CrossRef]
  34. Muñoz-Bulnes, J.; Fernández, C.; Parra, I.; Llorca, D.F.; Sotelo, M.Á. Deep fully convolutional networks with random data augmentation for enhanced generalization in road detection. In Proceedings of the 20th IEEE International Conference on Intelligent Transportation Systems, ITSC 2017, Yokohama, Japan, 16–19 October 2017; pp. 366–371. [Google Scholar]
  35. Fan, R.; Wang, H.; Cai, P.; Wu, J.; Bocus, M.J.; Qiao, L.; Liu, M. Learning Collision-Free Space Detection from Stereo Images: Homography Matrix Brings Better Data Augmentation. IEEE/ASME Trans. Mechatron. 2022, 27, 225–233. [Google Scholar] [CrossRef]
  36. Sun, J.; Kim, S.; Lee, S.; Kim, Y.; Ko, S. Reverse and Boundary Attention Network for Road Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, ICCV Workshops 2019, Seoul, Republic of Korea, 27–28 October 2019; pp. 876–885. [Google Scholar]
  37. Oeljeklaus, M. An integrated approach for traffic scene understanding from monocular cameras: Towards resource-constrained perception of environment representations with multi-task convolutional neural networks. PhD Thesis, Technical University of Dortmund, Dortmund, Germany, 2021. [Google Scholar]
  38. Li, H.; Zhang, Y.; Zhang, Y.; Li, H.; Sang, L. DCNv3: Towards Next Generation Deep Cross Network for CTR Prediction. arXiv 2024, arXiv:2407.13349. [Google Scholar]
  39. Lin, T.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  40. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
  41. Fan, R.; Wang, H.; Cai, P.; Liu, M. SNE-RoadSeg: Incorporating Surface Normal Information into Semantic Segmentation for Accurate Freespace Detection. In Proceedings of the Computer Vision—ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 340–356. [Google Scholar]
  42. Lacoste, A.; Luccioni, A.; Schmidt, V.; Dandres, T. Quantifying the Carbon Emissions of Machine Learning. arXiv 2019, arXiv:1910.09700. [Google Scholar]
Figure 1. Simplified architectural overview showing the interaction of the InternImage-XL backbone (with DCNv3 blocks), UPerNet decoder (with FPN and PSP modules), and Memory Bank system.
Figure 2. Comprehensive overview of our framework illustrating both architectural components and experiential learning processes. The upper panel demonstrates the technical integration of the InternImage (XL) backbone, UPerNet decoder, and Human-like Memory Bank. The lower panel illustrates the memory system’s internal mechanisms, showcasing actual road experience categorization with emotional valences ranging from very positive (clear road scenarios) to negative (occluded conditions), episodic memory organization, semantic clustering, and the continuous learning feedback loop that enables experiential knowledge accumulation throughout training and inference phases.
Figure 3. Detailed illustration of memory usage, showing episodic memory retrieval, attention-based pattern matching, and memory-guided feature enhancement processes. ★ and the dashed red ellipse indicate the most similar memory.
Figure 4. Representative samples illustrating the diversity of training data, including input images and corresponding ground-truth road area masks across various environmental conditions and road configurations. Zoom in for a better view.
Figure 5. Evaluation visualization on the KITTI road dataset showing perspective view results (odd rows) and corresponding Bird’s Eye View (BEV) analysis (even rows). The color coding represents segmentation quality: green areas indicate true positives (correctly identified road regions), red areas denote false negatives (missed road areas), and blue areas correspond to false positives (incorrectly classified road regions). The BEV representation provides a comprehensive spatial assessment of segmentation accuracy across diverse road geometries and environmental conditions. Zoom in for a better view.
Figure 6. Precision-recall curves across KITTI road categories (computed by the official evaluation server): (a) UM road, (b) UMM road, (c) UU road, and (d) URBAN road.
Figure 7. Qualitative comparison of MA versus baseline segmentation performance. From left: input images, ground-truth, MA predictions, baseline predictions, and difference analysis. Green regions indicate successful road recovery through memory guidance, blue regions show false positive reduction, and black regions represent comparable performance. Zoom in for a better view.
Figure 8. Qualitative comparison of road segmentation results from various state-of-the-art methods on the KITTI road dataset. Green regions indicate predicted drivable areas, while red boxes highlight areas of segmentation difficulty or failure. Our approach demonstrates robust performance across diverse road scenarios. Zoom in for a better view.
Table 1. Hyperparameter settings for training.
Parameter | Value
Optimizer | Adam
Learning rate | 1.00 × 10⁻⁴
LR scheduler | ReduceLROnPlateau
Scheduler patience | 2
Scheduler factor | 0.1
Memory weight | 0.2
Loss function | Combined (Dice + BCE)
Epochs | 100
Batch size | 2
Image size | 640 × 640
Memory size | 200
Top-k memories | 9
Table 2. Performance comparison with state-of-the-art multimodal road segmentation methods on official KITTI road test set. (↑) indicates higher values are preferred, while (↓) signifies lower values are optimal. Results are arranged in descending order of performance based on the official KITTI leaderboard, with our method listed in the last row.
Method | MaxF (↑) | AP (↑) | PRE (↑) | REC (↑) | FPR (↓) | FNR (↓)
DiPFormer [22] | 0.9757 | 0.9294 | 0.9734 | 0.9779 | 0.0147 | 0.0221
RoadFormer+ [24] | 0.9756 | 0.9374 | 0.9743 | 0.9769 | 0.0142 | 0.0231
SNE-RoadSegV2 [21] | 0.9755 | 0.9398 | 0.9757 | 0.9753 | 0.0134 | 0.0247
UdeerLID+ [30] | 0.9755 | 0.9398 | 0.9746 | 0.9765 | 0.0140 | 0.0235
RoadFormer [23] | 0.9750 | 0.9385 | 0.9716 | 0.9784 | 0.0157 | 0.0216
SNE-RoadSeg+ [20] | 0.9750 | 0.9398 | 0.9741 | 0.9758 | 0.0143 | 0.0242
Pseudo-LiDAR [27] | 0.9742 | 0.9409 | 0.9730 | 0.9754 | 0.0149 | 0.0246
Evi-RoadSeg [15] | 0.9708 | 0.9354 | 0.9657 | 0.9759 | 0.0191 | 0.0241
PLARD [18] | 0.9703 | 0.9403 | 0.9719 | 0.9688 | 0.0154 | 0.0312
LRDNet+ [14] | 0.9695 | 0.9222 | 0.9688 | 0.9702 | 0.0172 | 0.0298
USNet [19] | 0.9689 | 0.9325 | 0.9651 | 0.9727 | 0.0194 | 0.0273
LRDNet (L) [14] | 0.9687 | 0.9191 | 0.9673 | 0.9701 | 0.0181 | 0.0299
DFM-RTFNet [25] | 0.9678 | 0.9405 | 0.9662 | 0.9693 | 0.0187 | 0.0307
SNE-RoadSeg [41] | 0.9675 | 0.9407 | 0.9690 | 0.9661 | 0.0170 | 0.0339
LRDNet (S) [14] | 0.9674 | 0.9254 | 0.9679 | 0.9669 | 0.0176 | 0.0331
3MT-RoadSeg [26] | 0.9660 | 0.9390 | 0.9646 | 0.9673 | 0.0195 | 0.0327
TEDNet [28] | 0.9462 | 0.9305 | 0.9428 | 0.9496 | 0.0317 | 0.0504
CLRD [29] | 0.9420 | 0.9266 | 0.9425 | 0.9414 | 0.0316 | 0.0586
CLCFNet [17] | 0.9638 | 0.9085 | 0.9638 | 0.9639 | 0.0199 | 0.0361
LFD-RoadSeg [16] | 0.9521 | 0.9371 | 0.9535 | 0.9508 | 0.0256 | 0.0492
Ours | 0.9666 | 0.9395 | 0.9646 | 0.9687 | 0.0196 | 0.0313
Table 3. Comparison with state-of-the-art single-modality road segmentation methods on the official KITTI road test set. Results are presented in descending order according to performance on the official KITTI leaderboard, with our method listed in the final row.
Method | MaxF (↑) | AP (↑) | PRE (↑) | REC (↑) | FPR (↓) | FNR (↓)
RBANet [36] | 0.9630 | 0.8972 | 0.9514 | 0.9750 | 0.0275 | 0.0250
CLCFNet (LiDAR) [17] | 0.9597 | 0.9061 | 0.9612 | 0.9582 | 0.0213 | 0.0418
LC-CRF [31] | 0.9568 | 0.8834 | 0.9362 | 0.9783 | 0.0367 | 0.0217
Hadamard-FCN [37] | 0.9485 | 0.9148 | 0.9481 | 0.9489 | 0.0286 | 0.0511
HA-DeepLabv3+ [35] | 0.9483 | 0.9324 | 0.9477 | 0.9489 | 0.0288 | 0.0511
DEEP-DIG [34] | 0.9398 | 0.9365 | 0.9426 | 0.9369 | 0.0314 | 0.0631
LFD-RoadSeg [16] | 0.9349 | 0.9219 | 0.9346 | 0.9352 | 0.0213 | 0.0648
RoadNet3 [32] | 0.9295 | 0.9193 | 0.9332 | 0.9258 | 0.0216 | 0.0742
ChipNet [33] | 0.9291 | 0.8495 | 0.9098 | 0.9491 | 0.0306 | 0.0509
Ours | 0.9666 | 0.9395 | 0.9646 | 0.9687 | 0.0196 | 0.0313
(↑) indicates that higher values are better; (↓) indicates that lower values are better.
Table 4. Performance comparison of the KITTI official evaluation across categories.
Benchmark | MaxF (↑) | AP (↑) | PRE (↑) | REC (↑) | FPR (↓) | FNR (↓)
UM road | 0.9655 | 0.9358 | 0.9646 | 0.9664 | 0.0161 | 0.0336
UMM road | 0.9746 | 0.9557 | 0.9707 | 0.9786 | 0.0325 | 0.0214
UU road | 0.9537 | 0.9276 | 0.9508 | 0.9566 | 0.0161 | 0.0434
Urban road | 0.9666 | 0.9395 | 0.9646 | 0.9687 | 0.0196 | 0.0313
(↑) indicates that higher values are better; (↓) indicates that lower values are better.
Table 5. Comparative analysis of different loss function formulations on MA training effectiveness.
Loss Function | MaxF (↑) | AP (↑) | PRE (↑) | REC (↑) | FPR (↓) | FNR (↓)
BCE loss | 0.9905 | 0.9995 | 0.9920 | 0.9891 | 0.0024 | 0.0109
Dice loss | 0.9873 | 0.9941 | 0.9880 | 0.9866 | 0.0036 | 0.0134
Combined (BCE + Dice) | 0.9905 | 0.9995 | 0.9912 | 0.9898 | 0.0026 | 0.0102
(↑) indicates that higher values are better; (↓) indicates that lower values are better.
Table 6. Evaluation of pretrained weight initialization versus training from scratch on the MA framework performance.
Initialization | MaxF (↑) | AP (↑) | PRE (↑) | REC (↑) | FPR (↓) | FNR (↓)
Training from scratch | 0.9896 | 0.9991 | 0.9905 | 0.9887 | 0.0028 | 0.0113
Pretrained InternImage | 0.9893 | 0.9992 | 0.9908 | 0.9878 | 0.0027 | 0.0122
(↑) indicates that higher values are better; (↓) indicates that lower values are better.
Table 7. Comprehensive performance comparison on the R2D dataset demonstrating cross-domain generalization capabilities of our approach versus established methodologies.
Method | MaxF (↑) | PRE (↑) | REC (↑)
SNE-RoadSeg [41] | 0.9505 | 0.9450 | 0.9561
LRDNet+ [14] | 0.9459 | 0.9382 | 0.9538
LRDNet (L) [14] | 0.9406 | 0.9462 | 0.9350
LRDNet (S) [14] | 0.9373 | 0.9325 | 0.9421
USNet [19] | 0.9366 | 0.9310 | 0.9423
RBANet [36] | 0.9329 | 0.9354 | 0.9305
DFM-RTFNet [25] | 0.9298 | 0.9275 | 0.9321
3MT-RoadSeg [26] | 0.9287 | 0.9312 | 0.9263
TEDNet [28] | 0.9156 | 0.9089 | 0.9225
CLCFNet [17] | 0.9145 | 0.9201 | 0.9090
Ours | 0.9490 | 0.9545 | 0.9436
(↑) indicates that higher values are better.
Table 8. Detailed performance analysis on the Cityscapes dataset revealing memory system effectiveness across diverse urban road scenarios and architectural environments.
Method | MaxF (↑) | PRE (↑) | REC (↑)
SNE-RoadSeg [41] | 0.9275 | 0.9290 | 0.9261
USNet [19] | 0.9269 | 0.9201 | 0.9337
LRDNet+ [14] | 0.9265 | 0.9228 | 0.9302
LRDNet (L) [14] | 0.9247 | 0.9098 | 0.9401
HA-DeepLabv3+ [35] | 0.9233 | 0.9277 | 0.9189
LRDNet (S) [14] | 0.9176 | 0.8805 | 0.9580
DFM-RTFNet [25] | 0.9134 | 0.9156 | 0.9112
3MT-RoadSeg [26] | 0.9089 | 0.9123 | 0.9055
RBANet [36] | 0.8982 | 0.9014 | 0.8950
TEDNet [28] | 0.8945 | 0.8976 | 0.8914
CLCFNet [17] | 0.8923 | 0.8845 | 0.9003
CLRD [29] | 0.8867 | 0.8901 | 0.8834
Ours | 0.9189 | 0.9259 | 0.9120
(↑) indicates that higher values are better.
Table 9. Computational efficiency comparison in terms of parameters and floating-point operations.
Model | Params. (M) | FLOPs (G)
LRDNet+ | 28.5 | 336
LRDNet (L) | 19.5 | 173
SNE-RoadSeg | 201.3 | 1950.2
USNet | 30.7 | 78.2
PLARD | 76.9 | 1147.6
RBANet | 42.1 | 156.8
Ours | 358 | 476
Table 10. Investigation of memory influence strength on segmentation performance, revealing optimal weighting for memory-guided feature enhancement.
Memory Weight | MaxF (↑) | AP (↑) | PRE (↑) | REC (↑) | FPR (↓) | FNR (↓)
λ = 0.1 | 0.9905 | 0.9992 | 0.9914 | 0.9896 | 0.0026 | 0.0104
λ = 0.3 | 0.9900 | 0.9991 | 0.9908 | 0.9892 | 0.0027 | 0.0108
λ = 0.5 | 0.9907 | 0.9995 | 0.9918 | 0.9895 | 0.0024 | 0.0105
(↑) indicates that higher values are better; (↓) indicates that lower values are better.
Table 11. Performance sensitivity analysis across different Memory Bank capacities, demonstrating optimal size selection for balancing experiential coverage.
Memory Size | MaxF (↑) | AP (↑) | PRE (↑) | REC (↑) | FPR (↓) | FNR (↓)
50 memories | 0.9896 | 0.9994 | 0.9910 | 0.9882 | 0.0027 | 0.0118
100 memories | 0.9893 | 0.9994 | 0.9904 | 0.9882 | 0.0029 | 0.0118
200 memories | 0.9899 | 0.9993 | 0.9911 | 0.9886 | 0.0026 | 0.0114
(↑) indicates that higher values are better; (↓) indicates that lower values are better.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
