Vehicles
  • Article
  • Open Access

31 January 2025

Vehicle Localization in IoV Environments: A Vision-LSTM Approach with Synthetic Data Simulation

School of Artificial Intelligence, China University of Mining and Technology (Beijing), Beijing 100083, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Intelligent Connected Vehicles

Abstract

With the rapid development of the Internet of Vehicles (IoV) and autonomous driving technologies, robust and accurate visual pose perception has become critical for enabling smart connected vehicles. Traditional deep learning-based localization methods face persistent challenges in real-world vehicular environments, including occlusion, lighting variations, and the prohibitive cost of collecting diverse real-world datasets. To address these limitations, this study introduces a novel approach by combining Vision-LSTM (ViL) with synthetic image data generated from high-fidelity 3D models. Unlike traditional methods reliant on costly and labor-intensive real-world data, synthetic datasets enable controlled, scalable, and efficient training under diverse environmental conditions. Vision-LSTM enhances feature extraction and classification performance through its matrix-based mLSTM modules and advanced feature aggregation strategy, effectively capturing both global and local information. Experimental evaluations in independent target scenes with distinct features and structured indoor environments demonstrate significant performance gains, achieving matching accuracies of 91.25% and 95.87%, respectively, and outperforming state-of-the-art models. These findings underscore the innovative advantages of integrating Vision-LSTM with synthetic data, highlighting its potential to overcome real-world limitations, reduce costs, and enhance accuracy and reliability for connected vehicle applications such as autonomous navigation and environmental perception.

1. Introduction

The advent of the Internet of Vehicles (IoV) and autonomous driving technologies has revolutionized modern transportation systems, paving the way for smart connected vehicles to become integral components of intelligent mobility [1]. These vehicles rely heavily on robust and accurate perception systems to ensure safe navigation, real-time decision making, and efficient communication with surrounding infrastructure and other vehicles [2,3]. Visual pose perception, which determines the precise position and orientation of a vehicle based on visual data, has emerged as a critical enabling technology in this domain [4]. Its applications span autonomous navigation, obstacle detection, and vehicle-to-everything (V2X) communication, making it a cornerstone of modern transportation innovation [5].
However, the deployment of visual pose perception systems in real-world vehicular environments faces significant challenges [6]. Environmental factors such as occlusions, varying lighting conditions, and dynamic backgrounds can severely degrade the performance of traditional deep learning models. Furthermore, the acquisition of high-quality, large-scale image datasets under diverse real-world conditions is both time-consuming and expensive, particularly in scenarios requiring complex traffic simulations or adverse weather environments [7]. These limitations hinder the scalability and adaptability of existing systems in meeting the demands of next-generation transportation technologies.
To address these challenges, the use of synthetic image datasets generated from high-fidelity 3D models has gained increasing attention [8]. This approach offers several key advantages, including reduced data collection costs, the ability to simulate diverse conditions, and greater control over environmental variables. By leveraging synthetic datasets, researchers can train and validate models under scenarios that are otherwise difficult to replicate in real-world experiments [9]. Additionally, advancements in deep learning architectures such as Vision-LSTM (ViL) provide robust tools for feature extraction and matching, even in complex and dynamic environments [10]. ViL, with its enhanced memory and parallel processing capabilities, is particularly well-suited for visual localization tasks in vehicular scenarios.
In this study, we propose an active visual pose perception method that integrates synthetic image generation with the Vision-LSTM network to address the limitations of traditional approaches. Our method is validated through two experimental scenarios: independent target matching, which evaluates the system’s ability to recognize objects with distinct features, and corridor-like structured environments, which simulate indoor navigation challenges. These scenarios represent simplified yet critical conditions for testing the robustness and adaptability of visual localization systems. The experimental results demonstrate significant improvements in pose estimation accuracy, highlighting the potential for broader applications in IoV environments, such as autonomous navigation, V2V communication, and intelligent transportation systems.
The main contributions of this study are summarized as follows:
  • Active Visual Pose Perception: The proposed method integrates synthetic images generated from high-fidelity 3D models with Vision-LSTM to enhance visual pose estimation.
  • Cost-Effective Data Generation: By using synthetic data, this approach mitigates the challenges of collecting large-scale real-world datasets, making it a scalable solution for autonomous navigation.
  • Robustness in Complex Environments: The proposed method is shown to perform well in both independent object detection and structured indoor environments, demonstrating its applicability in autonomous vehicle systems.

3. Method

3.1. Construction of 3D Models

Synthetic image datasets play a pivotal role in advancing IoV-compatible systems. The datasets mimic real-world vehicular environments, including diverse lighting conditions, adverse weather, and occlusions. This capability not only reduces the dependency on costly real-world data collection but also ensures controlled experimentation for the development of IoV technologies.
The first step is to construct the 3D model of the localization scene. In this study, the selected 3D indoor model is derived from the BIM of the second floor of the Chemical and Environmental Building at China University of Mining and Technology (Beijing). The external contour of the scene is obtained from its floorplan dimensions. Using 3ds Max, the wireframe model is constructed; then, the details are added based on material texture maps to capture the fine features of the objects. This process allows for the acquisition of relevant geographic information data of the corridor. The model is rendered using the Corona Renderer to enhance its realism. The process is illustrated in Figure 1.
Figure 1. Model construction process.
Realistic object appearance depends on numerous low-level cues, including object shape, pose, surface color, reflectance, light-source position, spectral distribution, background scene attributes, and camera characteristics. The 3D model of the corridor includes specific measurement data of the entire building, material images for various parts, and orientation maps of different components. The height and other relevant information pertaining to the building were obtained from the school’s official website and on-site measurements. Material images and orientation maps of various parts were obtained by taking photographs and processing these images. The specific measurement data of the building interior are shown in Table 1.
Table 1. Specific information on building features.
The model textures are derived primarily from photographs, which are processed with tools supporting image segmentation, color correction, camera mapping, raster and vector painting, motion blur, depth of field, and stereoscopic effects. The photographs are then cleaned to remove impurities and adjusted for size, contrast, brightness, and sharpness. After processing, they are converted into texture maps and stored in a texture library.
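To illustrate the kind of photo-to-texture preprocessing described above, the following Pillow sketch resizes a photograph, suppresses small impurities, and adjusts contrast, brightness, and sharpness before saving it to the texture library. File paths, target size, and enhancement factors are illustrative assumptions, not values from the paper.

```python
from PIL import Image, ImageEnhance, ImageFilter

def photo_to_texture(src_path, dst_path, size=(1024, 1024),
                     contrast=1.1, brightness=1.05, sharpness=1.2):
    """Turn a photograph into a texture map: resize, denoise, and adjust
    contrast/brightness/sharpness (all parameter values are illustrative)."""
    img = Image.open(src_path).convert("RGB")
    img = img.resize(size, Image.LANCZOS)           # normalize texture resolution
    img = img.filter(ImageFilter.MedianFilter(3))   # remove small impurities/noise
    img = ImageEnhance.Contrast(img).enhance(contrast)
    img = ImageEnhance.Brightness(img).enhance(brightness)
    img = ImageEnhance.Sharpness(img).enhance(sharpness)
    img.save(dst_path)                              # store in the texture library

# photo_to_texture("photos/corridor_wall.jpg", "textures/corridor_wall.png")
```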

3.2. Active Visual Position Detection

Our detection method is divided into two main stages. In the offline stage, we collect synthetic image data from the highly realistic 3D model and use them to construct an offline image database for training. In the online stage, we retrieve the scenes most similar to the current scene from a set of known scenes and take the pose of the most similar reference image as an approximation of the pose of the currently observed image, thereby achieving pose detection in the current scene. The detailed process flow is illustrated in Figure 2.
Figure 2. Diagram of active vision position detection.
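The online stage reduces to a nearest-neighbor search over the offline database: the query image is embedded with the trained backbone and assigned the pose of the most similar reference image. The sketch below (PyTorch) shows this retrieval step under the assumption that the backbone returns one global feature vector per image; names such as estimate_pose, db_feats, and db_poses are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def estimate_pose(query_img, backbone, db_feats, db_poses):
    """Approximate the pose of the current observation by the pose of the
    most similar reference image in the offline database.

    query_img: (1, 3, H, W) tensor of the currently observed image
    backbone:  trained feature extractor (e.g., the ViL network)
    db_feats:  (N, D) L2-normalized embeddings of the synthetic reference images
    db_poses:  list of N reference poses (position/orientation labels)
    """
    backbone.eval()
    q = F.normalize(backbone(query_img), dim=-1)   # (1, D) query embedding
    sims = q @ db_feats.T                          # cosine similarity to every reference
    best = sims.argmax(dim=-1).item()              # index of the most similar reference
    return db_poses[best], sims[0, best].item()
```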

3.3. Model Structure

We primarily use Vision-LSTM (ViL) as the backbone network for visual feature extraction. First, the two-dimensional image is divided into patches of a given size. Each patch is mapped to a one-dimensional vector through linear projection to form a token, and a learnable positional embedding is added to each patch token. The patch-token sequence is then fed into a stack of 24 mLSTM blocks (as shown in Figure 3) for feature extraction, with each block processing the full token sequence to ensure robust spatial and temporal feature learning. After the 24 mLSTM blocks, bilateral concat pooling is applied for feature aggregation. This pooling method is analogous to the [CLS] and [AVG] pooling designs used in ViT: it averages the first and last patch tokens in the sequence to form a new feature representation, which is then passed through a linear layer for classification. The specific structure is illustrated in Figure 3.
Figure 3. Vision-LSTM network structure.
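A compact PyTorch sketch of this pipeline is given below: patch embedding via a strided convolution, learnable positional embeddings, a stack of token-to-token mLSTM blocks, bilateral pooling of the first and last tokens, and a linear classifier. The block modules are passed in from outside, and hyperparameters such as image size, patch size, and class count are illustrative; this is a sketch of the described architecture, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ViLClassifier(nn.Module):
    """Sketch of the ViL pipeline in Section 3.3 (illustrative)."""

    def __init__(self, blocks, img_size=512, patch_size=16, dim=384, num_classes=5):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # split the image into patches and linearly project each patch to a token
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # learnable positional embedding added to every patch token
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.blocks = nn.ModuleList(blocks)          # e.g., 24 mLSTM blocks
        self.head = nn.Linear(dim, num_classes)      # final linear classifier

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, M, D) tokens
        tokens = tokens + self.pos_embed
        for block in self.blocks:                    # alternating-direction mLSTM blocks
            tokens = block(tokens)
        # bilateral pooling: average the first and last patch tokens
        pooled = 0.5 * (tokens[:, 0] + tokens[:, -1])
        return self.head(pooled)

# stand-in blocks only check shapes; replace with real mLSTM blocks:
# model = ViLClassifier(blocks=[nn.Identity() for _ in range(24)])
# logits = model(torch.randn(2, 3, 512, 512))
```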

3.4. ViL Encoder

The core of Vision-LSTM (ViL) is composed of alternating mLSTM (matrix LSTM) modules. mLSTM significantly enhances the model’s memory capacity and parallel processing capabilities by extending vector operations in traditional LSTM to matrix operations. The core innovation of mLSTM is the extension of traditional scalar memory cells to matrix memory cells, allowing mLSTM to store more information in matrix form, thereby increasing the model’s storage capacity.
In mLSTM, each state is no longer a single vector but a matrix, enabling it to capture more complex data relationships and patterns within a single time step. mLSTM uses covariance update rules to store and retrieve information. This rule is implemented by storing key–value pairs as rows or columns of a matrix, thereby improving retrieval separability and the signal-to-noise ratio. The design of mLSTM is inspired by Bidirectional Associative Memories (BAMs), which use matrix multiplication for information storage and retrieval.
mLSTM is particularly well-suited for handling large-scale datasets and tasks that require highly complex data pattern recognition. Unlike sLSTM, the design of mLSTM allows for fully parallelized processing because it eliminates the recurrent connections (memory mixing) between hidden layers. This enables mLSTM to be more efficiently trained and inferred on modern hardware.
The input gate and forget gate of mLSTM can use exponential activation functions, while the output gate continues to use the sigmoid function. These gating mechanisms allow the model to control the inflow and forgetting of information effectively.
The cell state update equation is given by Equation (1):
$C_t = f_t \, C_{t-1} + i_t \, v_t k_t^{\top}$
In Equation (1), $C_t$ is the current cell state, $f_t$ is the forget-gate activation at time step t, $i_t$ is the input-gate activation at time step t, $v_t$ is the value vector at time step t, and $k_t$ is the key vector at time step t.
The normalizer state update equation is expressed as Equation (2):
$n_t = f_t \, n_{t-1} + i_t \, k_t$
In Equation (2), $n_t$ is the current normalizer state, $f_t$ is the forget-gate activation at time step t, $n_{t-1}$ is the previous normalizer state, $k_t$ is the key vector at time step t, and $i_t$ is the input-gate activation at time step t.
The hidden state equation is given by Equation (3):
$h_t = o_t \odot \tilde{h}_t, \qquad \tilde{h}_t = C_t q_t \,/\, \max\{\lvert n_t^{\top} q_t \rvert,\, 1\}$
In Equation (3), $h_t$ is the current hidden state, $o_t$ is the output-gate activation at time step t, $\tilde{h}_t$ is the intermediate hidden state, $C_t$ is the cell state at time step t, $q_t$ is the query vector at time step t, $n_t$ is the current normalizer state, and the term $\max\{\lvert n_t^{\top} q_t \rvert, 1\}$ ensures numerical stability.
The gate computations are defined by Equations (4)–(6):
$q_t = W_q x_t + b_q$
$k_t = \tfrac{1}{\sqrt{d}}\, W_k x_t + b_k$
$v_t = W_v x_t + b_v$
In Equations (4)–(6), $q_t$, $k_t$, and $v_t$ are the query, key, and value vectors, respectively, at time step t; $W_q$, $W_k$, and $W_v$ are the corresponding weight matrices; $b_q$, $b_k$, and $b_v$ are the biases; $x_t$ is the input vector at time step t; and $d$ is the dimensionality of the model.
The input gate is computed as shown in Equation (7):
$i_t = \exp(\tilde{i}_t), \qquad \tilde{i}_t = w_i^{\top} x_t + b_i$
In Equation (7), $i_t$ is the input-gate activation, $\tilde{i}_t$ is the intermediate input-gate value, $w_i$ is the weight vector for the input gate, $x_t$ is the input vector, and $b_i$ is the bias.
The forget gate is calculated as shown in Equation (8):
$f_t = \sigma(\tilde{f}_t) \ \text{or} \ \exp(\tilde{f}_t), \qquad \tilde{f}_t = w_f^{\top} x_t + b_f$
In Equation (8), $f_t$ is the forget-gate activation, $\tilde{f}_t$ is the intermediate forget-gate value, $w_f$ is the weight vector for the forget gate, $x_t$ is the input vector, and $b_f$ is the bias.
The output gate is defined as shown in Equation (9):
$o_t = \sigma(\tilde{o}_t), \qquad \tilde{o}_t = W_o x_t + b_o$
In Equation (9), $o_t$ is the output-gate activation, $\tilde{o}_t$ is the intermediate output-gate value, $W_o$ is the weight matrix for the output gate, $x_t$ is the input vector, and $b_o$ is the bias.
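The following PyTorch sketch implements one recurrence step of Equations (1)–(9) for a single head, using the sigmoid variant of the forget gate; it is written for clarity rather than speed and is not the parallelized form used in practice. The params dictionary of weights and biases is an assumed container, not part of the original formulation.

```python
import torch

def mlstm_step(x_t, C_prev, n_prev, params, d):
    """One mLSTM time step following Eqs. (1)-(9) (single head, unbatched sketch).

    x_t:    (d_in,) input vector        C_prev: (d, d) matrix memory cell
    n_prev: (d,) normalizer state       params: dict of weights and biases
    """
    q = params["W_q"] @ x_t + params["b_q"]                    # Eq. (4): query
    k = params["W_k"] @ x_t / d ** 0.5 + params["b_k"]         # Eq. (5): scaled key
    v = params["W_v"] @ x_t + params["b_v"]                    # Eq. (6): value

    i = torch.exp(params["w_i"] @ x_t + params["b_i"])         # Eq. (7): exponential input gate
    f = torch.sigmoid(params["w_f"] @ x_t + params["b_f"])     # Eq. (8): forget gate (sigmoid variant)
    o = torch.sigmoid(params["W_o"] @ x_t + params["b_o"])     # Eq. (9): output gate

    C = f * C_prev + i * torch.outer(v, k)                     # Eq. (1): matrix cell update
    n = f * n_prev + i * k                                     # Eq. (2): normalizer update
    h_tilde = (C @ q) / torch.clamp((n @ q).abs(), min=1.0)    # Eq. (3): stabilized retrieval
    return o * h_tilde, C, n                                   # gated hidden state, new states
```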
Integrating the improved mLSTM layer into residual blocks allows for more efficient handling of complex sequential data. The specific structure is shown in Figure 4.
Figure 4. mLSTM block structure.
As shown in Figure 4, the input token sequence first passes through a normalization layer. The normalized sequence is then linearly projected into two vectors of dimension E, x and y. For odd-numbered mLSTM blocks, x passes directly through the mLSTM layer to generate x_forward, which is gated by y and combined with the input token sequence through a residual connection. For even-numbered mLSTM blocks, x is flipped before passing through the mLSTM layer and flipped again afterwards to generate x_forward, which is likewise gated by y and combined with the input token sequence through a residual connection.
Odd-numbered mLSTM blocks process the input tokens from top-left to bottom-right, while even-numbered mLSTM blocks process them from bottom-right to top-left, as shown in Figure 5. Finally, a bidirectional average pooling operation is performed by taking the average of the first and last positions of the sequence to generate a new feature representation. This operation captures the global information of the input data. The pseudo-code for the mLSTM algorithm is in Algorithm 1.
Algorithm 1. mLSTM Block Process
Input: token sequence T_l : (B, M, D)
Output: new sequence T_k : (B, M, D)
1:  /* normalize the input sequence T_l */
2:  T_l ← Norm(T_l)
3:  x : (B, M, E) ← Linear_x(T_l)
4:  y : (B, M, E) ← Linear_y(T_l)
5:  y : (B, M, E) ← SiLU(y)
6:  /* process with alternating traversal directions */
7:  for i in range(depth):
8:      if i % 2 == 0:
9:          x1 : (B, M, E) ← flip(x)
10:         x2 : (B, M, E) ← mLSTM(x1)
11:         x : (B, M, E) ← flip(x2)
12:     else:
13:         x : (B, M, E) ← mLSTM(x)
14: end for
15: /* gated output z */
16: z : (B, M, E) ← x ⊙ y
17: /* bilateral pooling */
18: z ← (z[:, 0] + z[:, −1]) / 2
19: /* residual connection */
20: T_k : (B, M, D) ← z + T_l
Figure 5. Visualization of traversal paths.
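A PyTorch sketch of one block following Algorithm 1 and Figure 4 is shown below. The inner mLSTM layer is passed in as any (B, M, E) → (B, M, E) sequence module, a projection back to the token dimension is added so that the residual connection type-checks, and the bilateral pooling of line 18 is left to the final aggregation stage rather than applied per block; all of these are illustrative choices, not the authors' reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLSTMBlock(nn.Module):
    """Sketch of one mLSTM block (Algorithm 1 / Figure 4): LayerNorm, two
    linear projections, SiLU gating, optionally reversed mLSTM traversal,
    and a residual connection back to the input tokens."""

    def __init__(self, mlstm_layer, dim, expand_dim, reverse=False):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj_x = nn.Linear(dim, expand_dim)    # Linear_x in Algorithm 1
        self.proj_y = nn.Linear(dim, expand_dim)    # Linear_y in Algorithm 1
        self.proj_out = nn.Linear(expand_dim, dim)  # back to token dimension D
        self.mlstm = mlstm_layer
        self.reverse = reverse                      # flipped traversal for alternate blocks

    def forward(self, tokens):                      # tokens: (B, M, D)
        residual = tokens
        t = self.norm(tokens)
        x = self.proj_x(t)                          # (B, M, E)
        y = F.silu(self.proj_y(t))                  # (B, M, E) gating branch
        if self.reverse:                            # flip, run mLSTM, flip back
            x = torch.flip(self.mlstm(torch.flip(x, dims=[1])), dims=[1])
        else:
            x = self.mlstm(x)
        z = x * y                                   # gated output
        return self.proj_out(z) + residual          # residual connection

# usage with a stand-in sequence layer (replace with a real mLSTM layer):
# block = MLSTMBlock(nn.Identity(), dim=384, expand_dim=384, reverse=True)
# out = block(torch.randn(2, 1024, 384))
```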

4. Experimentation and Evaluation

4.1. Datasets and Training Setups

To better align with IoV-specific challenges, we reframed our experimental scenarios to simulate real-world constraints commonly encountered by connected vehicles, such as limited space. Although the experiments were conducted in a structured corridor environment, we specifically chose this setup to reflect real-world scenarios like parking garages, tunnels, or narrow urban streets where navigation is hindered by confined spaces and complex traffic conditions. This simulation allows for testing of the robustness of visual localization systems in constrained environments, which are prevalent in IoV settings. By replicating these real-world challenges, we aim to demonstrate the applicability and effectiveness of our system under conditions that closely mirror the challenges of vehicle localization in IoV environments.
The experimental data used in this study are divided into two main parts: a virtual image dataset and a real image dataset, which are used for model training and testing, respectively. In order to simulate the feature-matching requirements in different scenarios, we designed two experimental scenarios: an independent object scenario and a complex object environment. The virtual image dataset is generated by 3D modeling tools (e.g., 3ds Max). Taking the corridor on the second floor of the Chemical Environment Building as an example, we constructed a high-fidelity 3D simulation model using architectural blueprints and simulated the light reflection, shadows, and other characteristics of the actual environment by adding material maps. In addition, to verify the effectiveness of the virtual image generation method, we used an iPhone 13 Pro to capture real image data in the same environment.
In the independent object scene, we acquired virtual images of a single target in four directions, with 500 images in each direction, generating a total of 2000 images for training. The real image data were acquired by controlling the viewing angle and offset angle (±10°) with 480 high-resolution images for testing, as shown in Figure 6. In the complex environment scene, we further introduced the virtual images of the corridor background to simulate multiple targets and complex lighting conditions. A total of five orientations of image data were collected, each with a 10° left–right swing from the object’s center axis and a height of 1–1.2 m. A total of 360 virtual images with a resolution of 512 × 512 were collected in each direction, totaling 1800 images; 450 real images were collected at the same time as controls, as shown in Figure 7.
Figure 6. Independent object datasets.
Figure 7. Dataset for the modeling of drinking fountains in a corridor.
The hardware platform for the experiments comprised 256-core NVIDIA GPUs and dual-core NVIDIA CPUs (NVIDIA, Santa Clara, CA, USA), and the software environment consisted of Python 3.8, PyTorch 1.10, and CUDA 12.3. The specific device and software tool parameters are listed in Table 2. Model training used the AdamW optimizer, with the initial learning rate set to 0.001 and adjusted via a cosine annealing schedule. Training ran for 150 epochs with a batch size of 16. To enhance the generalization ability of the model, we also applied data augmentation strategies such as random cropping and horizontal flipping during data processing. The main model training hyperparameters are shown in Table 3, and a minimal configuration sketch follows Table 3.
Table 2. Experimental devices and software tools.
Table 3. Main hyperparameters for model training.
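A minimal sketch of this configuration is shown below: AdamW with an initial learning rate of 0.001, cosine annealing over 150 epochs, and random cropping plus horizontal flipping as augmentation. The crop size and the absence of weight decay are assumptions, since they are not stated in the text.

```python
import torch
from torch import nn, optim
from torchvision import transforms

# augmentation as described: random cropping and horizontal flipping
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),     # crop size is an illustrative assumption
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def build_training(model, epochs=150, lr=1e-3):
    """Optimizer and scheduler per Table 3: AdamW, initial LR 0.001,
    cosine annealing; batch size 16 is set on the DataLoader."""
    optimizer = optim.AdamW(model.parameters(), lr=lr)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.CrossEntropyLoss()
    return optimizer, scheduler, criterion
```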

4.2. Design of Experiments

The experimental design is divided into three parts: feature-matching validation in independent object scenes, robustness evaluation in complex environment scenes, and validation of synthetic-data generalization and the effectiveness of fine tuning. Each experiment trained the model on virtual image data and tested it on real image data. The performance advantages of the ViL model and the potential of virtual images in practical applications are explored through comparative analysis across the different scenes.
In Experiment 1, we selected a water dispenser as an independent object and simulated the feature changes of the object through multi-view images generated from virtual data. The core objective of the experiment was to verify whether ViL is able to accurately match independent target images under conditions of low background interference. We used real image data to evaluate the model and performed comparative analysis with other mainstream models (e.g., Swin Transformer, MobileViT, etc.) on metrics such as top-1 accuracy, precision, recall, and F1 score.
Experiment 2, on the other hand, considered complex environmental scenarios to explore the performance of the model under multi-target and complex background interference conditions. Taking the corridor of the Chemical Environment Building as the test scene, we used virtual data to simulate the actual situation, including factors such as lighting changes and occlusion. The real image data introduced in the test contain multiple feature points, such as the water dispenser and background wall, and the robustness of ViL was verified through multi-model comparative analysis. In addition, we designed a background-dependency analysis to explore the impact of environmental factors on feature-matching accuracy.
In Experiment 3, we focused on the generalization ability of synthetic data and the effectiveness of fine tuning with real data. The model was first trained solely on synthetic data generated from virtual environments, then evaluated on real-world data. To further enhance the model’s performance, a fine-tuning process was introduced using a small subset (10–20%) of real-world data to adapt the model to real-world conditions. Specifically, we froze the lower-level feature extraction layers of the pre-trained model while fine tuning the higher-level layers and classifier. The experiment evaluated the model’s performance before and after fine tuning on metrics such as top-1 accuracy, precision, recall, and F1 score. Comparative analysis highlights the significance of fine tuning in bridging the domain gap between synthetic and real data.
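The freezing scheme can be sketched as follows, assuming a model laid out like the ViLClassifier sketch in Section 3.3 (a patch embedding, a list of blocks, and a classifier head); the split point between frozen and trainable blocks and the fine-tuning learning rate are illustrative.

```python
from torch import optim

def setup_finetuning(model, n_frozen_blocks=12, lr=1e-4):
    """Freeze the low-level feature extraction layers and fine-tune only the
    higher blocks and the classifier head (Experiment 3)."""
    for p in model.patch_embed.parameters():
        p.requires_grad = False                 # freeze the patch embedding
    for block in model.blocks[:n_frozen_blocks]:
        for p in block.parameters():
            p.requires_grad = False             # freeze the lower mLSTM blocks
    trainable = [p for p in model.parameters() if p.requires_grad]
    return optim.AdamW(trainable, lr=lr)        # update higher blocks + head only
```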
These three experiments not only cover multi-scenario requirements ranging from simple to complex but also systematically verify the applicability of ViL and its optimization capability for feature-matching tasks by introducing multiple comparison models and detailed metric analysis.

4.3. Analysis of Experimental Results

4.3.1. Analysis of Experimental Results (Independent Object Scene)

In the independent object scene, we conducted a comprehensive comparison of ViL with the latest mainstream models, including EfficientNetV2-S, Swin Transformer V2-Tiny, ConvNeXt V2, MobileViT, and MAE (Masked Autoencoder). Table 4 shows the performance comparison results for each model.
Table 4. Comparative data for each network of the independent marker model.
In the independent object-matching scenario, the experimental results show that Vision-LSTM (ViL) performs the best among all the compared models, with a top-1 accuracy of 91.25%, demonstrating its significant advantages in feature extraction and matching tasks. The ViL loss–accuracy curve is shown in Figure 8. The mLSTM module, the core of ViL, enhances adaptability to complex-texture targets by modeling the relationships between global features through matricization operations while maintaining high matching accuracy for samples with large changes in viewpoint. In particular, the F1 score reaches 90.96%, further validating the comprehensiveness and robustness of ViL in feature modeling.
Figure 8. The ViL loss–accuracy curve (independent object scene).
Swin Transformer V2-Tiny ranks second, with a 90.15% top-1 accuracy. Its dynamic window division strategy optimizes local feature extraction and global information fusion and performs stably on samples with uniform feature distributions. However, the windowing mechanism is slightly less capable of global modeling for long-range targets or complex texture targets and is, thus, slightly inferior to ViL for samples with large viewpoint offsets.
MAE (Masked Autoencoder) is ranked fourth, with a top-1 accuracy of 87.15%. This result shows that although MAE performs well on self-supervised learning tasks (e.g., image reconstruction), its design mainly focuses on the complementation of occluded regions rather than the learning of global structures. In feature extraction tasks, MAE’s ability to model local details is slightly insufficient. In addition, MAE’s feature extraction relies on a large-scale pre-training process, and its ability to model fine details for small-sample datasets was not as good as that of ViL in this experiment.
EfficientNetV2-S and ConvNeXt V2 achieved similar top-1 accuracies of 89.42% and 88.96%, respectively. EfficientNetV2-S improves computational efficiency through its compound scaling strategy, but its feature extraction is oriented more toward local structures, leaving room for improvement in global information modeling. ConvNeXt V2, as an improved convolutional network, increases local feature extraction efficiency by optimizing the convolutional kernel design, but its performance is slightly inferior to that of the Transformer-based Swin V2 and ViL models when matching samples with complex textures and multi-view changes.
The lightweight MobileViT model performs relatively weakly in this scenario, with a top-1 accuracy of only 86.29%. MobileViT is designed for low-computing-resource scenarios, and its lightweight Transformer structure limits high-complexity feature extraction for independent targets. Although MobileViT is suitable for resource-constrained deployments, its ability to integrate global information is limited in tasks that require high-precision matching. Specific model comparison data are shown in Figure 9.
Figure 9. Independent object scene: model comparison.
In summary, the performance of ViL proves its excellent fine-grained feature modeling capability and its ability to capture global dependency in independent object scenes. This ability makes it highly applicable in real-world applications such as industrial inspection and automated equipment recognition. Meanwhile, although models such as MAE perform well in self-supervised learning tasks, they need to be further optimized for local modeling in specific feature-matching tasks before they can approach the performance level of ViL. In the future, we can try to combine the self-supervised feature of MAE with the mLSTM module design of ViL to explore more efficient global–local feature combination models.

4.3.2. Analysis of Experimental Results (Complex Environment Scene)

We similarly evaluated the performance of a variety of mainstream models in a complex background environment, and the results are shown in Table 5.
Table 5. Comparative data for each network on the water dispenser marker model.
In complex background environments, ViL again outperforms the other comparison models, with a top-1 accuracy of 95.87%, and its performance advantage is even more pronounced. The ViL loss–accuracy curve is shown in Figure 10. This is primarily due to the mLSTM module's superior ability to model global information. Environmental scenes often contain distractions such as background complexity, illumination variations, and multiple targets. ViL effectively captures global dependencies between features through matricization operations while suppressing the impact of background noise. This allows ViL to accurately recognize and match target features despite background complexities. As shown in the heatmap comparisons between real and synthetic images, ViL successfully identifies and prioritizes key regions in both scenarios, demonstrating its robustness in feature selection. Additionally, the feature point-matching comparison between real and synthetic images further highlights ViL's consistent ability to match key features, even with variations in image type and background conditions. Feature point matching between a real image and a synthesized image is shown in Figure 11.
Figure 10. The ViL loss–accuracy curve (the complex environment scene).
Figure 11. Feature point matching between a real image and synthesized image.
Swin Transformer V2-Tiny and MobileViT rank second and third, with top-1 accuracies of 94.56% and 91.43%, respectively. Both models perform relatively well in complex environments. Swin Transformer V2-Tiny optimizes the fusion of local and global information through its dynamic window division strategy, but the heatmap comparison shows that it struggles slightly with certain background complexities and fails to focus fully on some important features. MobileViT, while lightweight and efficient, has difficulty processing high-resolution inputs and lacks the capacity to comprehensively model complex backgrounds and multi-target features; the heatmap comparison shows that it focuses less reliably on the most important features in cluttered environments, which lowers its overall matching accuracy. MAE (Masked Autoencoder), in contrast, emphasizes self-supervised representation learning rather than lightweight design; while this improves representation learning in an unsupervised manner, it adapts less well to dynamic or complex scenes, resulting in lower accuracy in such environments. Specific model comparison data are shown in Figure 12.
Figure 12. Complex environment scene: model comparison.
While ViL excels in feature modeling, its computational complexity—especially due to the mLSTM-based architecture—may make it unsuitable for real-time applications, particularly in resource-constrained environments like drones or mobile robots. The mLSTM module, while capturing global dependencies, requires extensive matrix operations and the processing of long sequences, leading to high memory demands and slower inference speeds. Additionally, ViL’s parameter count (23M) results in significant memory consumption, which may cause performance bottlenecks in embedded systems or real-time inference tasks.
To address these challenges, we introduce ViL-T, an optimized lightweight version designed for resource-constrained environments. Specifically, ViL-T reduces the latent dimension from 384 to 192 to lower computational complexity, and the architecture is optimized to retain much of ViL's performance while reducing memory usage and computational demand. The parameter count drops from 23M to 6M, significantly lowering memory consumption and computational resource requirements and making the model more suitable for deployment on embedded platforms and for real-time inference, especially in drones and mobile robots, where fast response times and low power consumption are critical. The ViL-T loss–accuracy curve is shown in Figure 13.
Figure 13. The ViL-T loss–accuracy curve.
Although ViL-T experiences a slight drop in top-1 accuracy (92.47%), it still outperforms MobileViT, particularly in complex background conditions and multi-target recognition. Despite having a smaller parameter count (5.6M), MobileViT struggles with complex backgrounds and noise suppression, leading to a lower top-1 accuracy of 91.43%, which is inferior to that of ViL-T. A comparison of the characteristic heat maps of the different networks is shown in Figure 14 and Figure 15.
Figure 14. Perceived performance of ViL on a real image (left) and synthesized image (right).
Figure 15. Perceived performance of Swin Transformer V2-Tiny (left) and MobileViT (right).
Thus, while ViL-T has slightly lower accuracy compared to ViL, its reduced latent dimension and optimized architecture effectively lower memory usage and computational complexity. It continues to provide efficient performance in embedded devices and real-time applications, making it well-suited for tasks requiring complex background processing and multi-target detection.

4.3.3. Analysis of Experimental Results (Fine-Tuning Model)

The data in Table 6 comparing the synthetic data only model and the synthetic data + fine-tuning model highlight the significant advantages of incorporating real-world data into the training process. The synthetic data only model achieved strong performance in structured and controlled environments, with a top-1 accuracy of 95.87%, a precision of 95.62%, a recall of 95.81%, and an F1 score of 95.71%. However, its performance degraded when tested on real-world data, primarily due to the domain gap between synthetic and real data. Synthetic datasets often lack the fine-grained textures, lighting variations, and background complexities found in natural images, limiting the model's ability to generalize to unstructured scenarios.
Table 6. Comparative data for synthetic data only and synthetic data + fine-tuning models.
In contrast, the synthetic data + fine-tuning model demonstrated substantial improvements across all metrics after being fine-tuned with a small subset (10–20%) of real-world data. The top-1 accuracy increased to 97.23%, precision improved to 96.91%, recall rose to 97.10%, and the F1 score reached 97.00%. These results reflect the effectiveness of fine tuning in bridging the domain gap. By freezing the lower-level feature extraction layers and updating the higher-level layers and classifier, the model retained the general features learned from synthetic data while adapting to real-world characteristics such as texture details, lighting variations, and environmental occlusions.
This comparison underscores the importance of fine tuning as a critical step in enhancing the model’s generalization ability and robustness in real-world applications. While the synthetic data only model provides a strong baseline for pre-training, it is the incorporation of fine tuning that enables the model to adapt to diverse, real-world conditions and significantly outperform mainstream models such as Swin Transformer and MobileViT. The findings confirm that combining pre-training on synthetic data with fine tuning using real-world data is a practical and effective approach for tasks requiring domain adaptation.

5. Conclusions

The experimental results demonstrate that Vision-LSTM (ViL) significantly outperforms state-of-the-art deep learning models in both independent object scenes and complex background environments. This performance advantage is primarily attributed to the mLSTM module’s effective modeling of high-dimensional feature relationships and the enhanced data diversity provided by virtual image generation. However, further research is needed to evaluate ViL’s applicability in real-world IoV scenarios involving dynamic traffic systems and complex conditions.
Virtual image generation plays a critical role in feature-matching tasks, but its adaptability under extreme conditions such as low-light environments, severe occlusion, and high mobility requires improvement. Future research could integrate techniques such as Domain Adaptation (DA) and Generative Adversarial Networks (GANs) to bridge the gap between synthetic and real-world data, thereby improving the model’s robustness and accuracy in complex IoV scenarios.
The current experiments focus on specific scenarios such as indoor environments and static backgrounds, which limits the model’s generalizability to more diverse and dynamic IoV environments. Expanding the diversity of training data, such as by generating 3D models and datasets that include urban, rural, and various traffic conditions, could significantly improve the robustness and applicability of the model, addressing real-world vehicle localization challenges more comprehensively.
Although ViL demonstrates strong feature extraction capabilities, its performance remains limited in highly dynamic environments with rapidly changing objects or backgrounds and in GPS-denied conditions. Future research should explore sensor fusion and advanced data integration techniques to address these limitations.
As IoV technologies evolve, privacy and security issues related to the processing of visual and positional data are becoming increasingly important. Future efforts should prioritize privacy-preserving methods such as federated learning or encrypted data processing while also examining legal and regulatory frameworks. Looking forward, ViL’s application in Vehicle-to-Infrastructure (V2I) systems and traffic coordination platforms holds great potential for optimizing traffic flow, enhancing infrastructure integration, and supporting autonomous navigation, contributing to smarter and more interconnected transportation networks.

Author Contributions

Conceptualization, Y.L.; methodology, Y.L. and J.J.; formal analysis, J.J.; data curation, J.J.; writing—original draft preparation, J.J.; writing—review and editing, Y.L. and Z.T.; supervision, Y.L. and Z.T.; project administration, Y.L. and Z.T.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant numbers 52364017 and 52074305.

Data Availability Statement

Data are available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dey, K.C.; Rayamajhi, A.; Chowdhury, M.; Bhavsar, P.; Martin, J. Vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication in a heterogeneous wireless network–Performance evaluation. Transp. Res. Part C Emerg. Technol. 2016, 68, 168–184. [Google Scholar] [CrossRef]
  2. Zhu, X.G.; Hua, Q.Z.; Yan, W.J.; Guo, Z.W.; Yu, K.P. A Vehicle-Road Urban Sensing Framework for Collaborative Content Delivery in Free way-Oriented Vehicular Networks. IEEE Sens. J. 2024, 24, 5662–5674. [Google Scholar] [CrossRef]
  3. Dai, S.H.; Li, S.K.; Tang, H.C.; Ning, X.; Fang, F.; Fu, Y.X.; Wang, Q.L.; Cheng, L. MARP: A Cooperative Multiagent DRL System for Connected Autonomous Vehicle Platooning. IEEE Internet Things J. 2024, 11, 32454–32463. [Google Scholar] [CrossRef]
  4. Yi, S.; Zhang, H.; Liu, K. V2IViewer: Towards Efficient Collaborative Perception via Point Cloud Data Fusion and Vehicle-to-Infrastructure Communications. IEEE Trans. Netw. Sci. Eng. 2024, 11, 6219–6230. [Google Scholar] [CrossRef]
  5. Biswas, A.; Wang, H.C. Autonomous Vehicles Enabled by the Integration of IoT, Edge Intelligence, 5G, and Blockchain. Sensors 2023, 23, 60. [Google Scholar] [CrossRef]
  6. Liu, T.; Sun, D.; Bi, C.K.; Sun, Y.; Chen, S.M. Dynamic-Scene-Graph-Supported Visual Understanding of Autonomous Driving Scenarios. In Proceedings of the 17th Pacific Visualization Conference (PacificVis), Tokyo, Japan, 23–26 April 2024; IEEE Computer Soc: Tokyo, Japan, 2024; pp. 82–91. [Google Scholar]
  7. Song, Z.H.; He, Z.M.; Li, X.Y.; Ma, Q.M.; Ming, R.B.; Mao, Z.Q.; Pei, H.X.; Peng, L.H.; Hu, J.M.; Yao, D.Y.; et al. Synthetic Datasets for Autonomous Driving: A Survey. IEEE Trans. Intell. Veh. 2024, 9, 1847–1864. [Google Scholar] [CrossRef]
  8. Acharya, D.; Tatli, C.J.; Khoshelham, K. Synthetic-real image domain adaptation for indoor camera pose. ISPRS J. Photogramm. Remote Sens. 2023, 202, 405–421. [Google Scholar] [CrossRef]
  9. Su, H.; Qi, C.R.; Li, Y.; Guibas, L.J. Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3D model views. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2686–2694. [Google Scholar]
  10. Alkin, B.; Beck, M.; Poeppel, K.; Hochreiter, S.; Brandstetter, J. Vision-LSTM: xLSTM as Generic Vision Backbone. arXiv 2024, arXiv:2406.04303. [Google Scholar]
  11. Hyeon, J.; Jang, B.; Choi, H.; Kim, J.; Kim, D.; Doh, N. Photo-realistic 3D model based accurate visual positioning system for large-scale indoor spaces. Eng. Appl. Artif. Intell. 2023, 123, 106256. [Google Scholar] [CrossRef]
  12. Xu, W.L.; Zhou, G.Y.; Zhou, Y.Z.; Zou, Z.B.; Wang, J.L.; Wu, W.F.; Li, X.M. A Vision-Based Tactile Sensing System for Multimodal Contact Information Perception via Neural Network. IEEE Trans. Instrum. Meas. 2024, 73, 11. [Google Scholar] [CrossRef]
  13. Dong, J.; Noreikis, M.; Xiao, Y.; Ylä-Jääski, A. ViNav: A vision-based indoor navigation system for smartphones. IEEE Trans. Mob. Comput. 2018, 18, 1461–1475. [Google Scholar] [CrossRef]
  14. Peng, J.S.; Chen, D.H.; Yang, Q.; Yang, C.J.; Xu, Y.; Qin, Y. Visual SLAM Based on Object Detection Network: A Review. CMC-Comput. Mater. Contin. 2023, 77, 3209–3236. [Google Scholar] [CrossRef]
  15. Ahmed, M.F.; Masood, K.; Fremont, V.; Fantoni, I. Active slam: A review on last decade. Sensors 2023, 23, 8097. [Google Scholar] [CrossRef] [PubMed]
  16. Peng, J.; Yang, Q.; Chen, D.; Yang, C.; Xu, Y.; Qin, Y. Dynamic SLAMVisual Odometry Based on Instance Segmentation: A Comprehensive Review. Comput. Mater. Contin. 2024, 78, 1. [Google Scholar]
  17. Yasuda, Y.D.V.; Martins, L.E.G.; Cappabianco, F.A.M. Autonomous Visual Navigation for Mobile Robots: A Systematic Literature Review. ACM Comput. Surv. 2020, 53, 34. [Google Scholar] [CrossRef]
  18. Kendall, A.; Grimes, M.; Cipolla, R. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2938–2946. [Google Scholar]
  19. Laskar, Z.; Melekhov, I.; Kalia, S.; Kannala, J. Camera relocalization by computing pairwise relative poses using convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 929–938. [Google Scholar]
  20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  21. Ashraf, M.H.; Jabeen, F.; Alghamdi, H.; Zia, M.S.; Almutairi, M.S. HVD-Net: A Hybrid Vehicle Detection Network for Vision-Based Vehicle Tracking and Speed Estimation. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 19. [Google Scholar] [CrossRef]
  22. Gao, R.; Tian, Y.; Ye, F.; Luo, G.; Bian, K.; Wang, Y.; Wang, T.; Li, X. Sextant: Towards ubiquitous indoor localization service by photo-taking of the environment. IEEE Trans. Mob. Comput. 2015, 15, 460–474. [Google Scholar] [CrossRef]
  23. Liu, Z.; Xiong, J.; Ma, Y.; Liu, Y. Scene recognition for device-free indoor localization. IEEE Sens. J. 2023, 23, 6039–6049. [Google Scholar] [CrossRef]
  24. Alarfaj, M.; Su, Z.; Liu, R.; Al-Humam, A.; Liu, H. Image-tag-based indoor localization using end-to-end learning. Int. J. Distrib. Sens. Netw. 2021, 17, 15501477211052371. [Google Scholar] [CrossRef]
  25. Xu, S.; Chou, W.; Dong, H. A robust indoor localization system integrating visual localization aided by CNN-based image retrieval with Monte Carlo localization. Sensors 2019, 19, 249. [Google Scholar] [CrossRef]
  26. Ha, I.; Kim, H.; Park, S.; Kim, H. Image retrieval using BIM and features from pretrained VGG network for indoor localization. Build. Environ. 2018, 140, 23–31. [Google Scholar] [CrossRef]
  27. Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5297–5307. [Google Scholar]
  28. Acharya, D.; Khoshelham, K.; Winter, S. BIM-PoseNet: Indoor camera localisation using a 3D indoor model and deep learning from synthetic images. ISPRS J. Photogramm. Remote Sens. 2019, 150, 245–258. [Google Scholar] [CrossRef]
  29. Shinde, S.S.; Tarchi, D. Collaborative Reinforcement Learning for Multi-Service Internet of Vehicles. IEEE Internet Things J. 2023, 10, 2589–2602. [Google Scholar] [CrossRef]
  30. Qi, J.; Liu, Y.L.; Ling, Y.C.; Xu, B.; Dong, Z.J.; Sun, Y.F. Research on an Intelligent Computing Offloading Model for the Internet of Vehicles Based on Blockchain. IEEE Trans. Netw. Serv. Manag. 2022, 19, 3908–3918. [Google Scholar] [CrossRef]
  31. Xie, Q.; Ding, Z.X.; Zheng, P.P. Provably Secure and Anonymous V2I and V2V Authentication Protocol for VANETs. IEEE Trans. Intell. Transp. Syst. 2023, 24, 7318–7327. [Google Scholar] [CrossRef]
  32. Fan, W.H.; Su, Y.; Liu, J.; Li, S.M.; Huang, W.; Wu, F.; Liu, Y.A. Joint Task Offloading and Resource Allocation for Vehicular Edge Computing Based on V2I and V2V Modes. IEEE Trans. Intell. Transp. Syst. 2023, 24, 4277–4292. [Google Scholar] [CrossRef]
  33. Chung, Y.C.; Chang, H.Y.; Chang, R.Y.; Chung, W.H. Deep Reinforcement Learning-Based Resource Allocation for Cellular V2X Communications. In Proceedings of the 97th IEEE Vehicular Technology Conference (VTC-Spring), Florence, Italy, 20–23 June 2023; IEEE: Florence, Italy, 2023. [Google Scholar]
  34. Wang, B.Y.; Han, Y.; Wang, S.Y.; Tian, D.; Cai, M.J.; Liu, M.; Wang, L.J. A Review of Intelligent Connected Vehicle Cooperative Driving Development. Mathematics 2022, 10, 31. [Google Scholar] [CrossRef]
  35. Tian, D.; Li, J.B.; Lei, J.Y. Multi-sensor information fusion in Internet of Vehicles based on deep learning: A review. Neurocomputing 2025, 614, 18. [Google Scholar] [CrossRef]
  36. Shiri, F.M.; Perumal, T.; Mustapha, N.; Mohamed, R. A comprehensive overview and comparative analysis on deep learning models: CNN, RNN, LSTM, GRU. arXiv 2023, arXiv:2305.17473. [Google Scholar]
  37. Van Houdt, G.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955. [Google Scholar] [CrossRef]
  38. Beck, M.; Poeppel, K.; Spanring, M.; Auer, A.; Prudnikova, O.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. xLSTM: Extended Long Short-Term Memory. arXiv 2024, arXiv:2405.04517. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
