Vehicles
  • Article
  • Open Access

31 January 2025

Vehicle Localization in IoV Environments: A Vision-LSTM Approach with Synthetic Data Simulation

School of Artificial Intelligence, China University of Mining and Technology (Beijing), Beijing 100083, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Intelligent Connected Vehicles

Abstract

With the rapid development of the Internet of Vehicles (IoV) and autonomous driving technologies, robust and accurate visual pose perception has become critical for enabling smart connected vehicles. Traditional deep learning-based localization methods face persistent challenges in real-world vehicular environments, including occlusion, lighting variations, and the prohibitive cost of collecting diverse real-world datasets. To address these limitations, this study introduces a novel approach by combining Vision-LSTM (ViL) with synthetic image data generated from high-fidelity 3D models. Unlike traditional methods reliant on costly and labor-intensive real-world data, synthetic datasets enable controlled, scalable, and efficient training under diverse environmental conditions. Vision-LSTM enhances feature extraction and classification performance through its matrix-based mLSTM modules and advanced feature aggregation strategy, effectively capturing both global and local information. Experimental evaluations in independent target scenes with distinct features and structured indoor environments demonstrate significant performance gains, achieving matching accuracies of 91.25% and 95.87%, respectively, and outperforming state-of-the-art models. These findings underscore the innovative advantages of integrating Vision-LSTM with synthetic data, highlighting its potential to overcome real-world limitations, reduce costs, and enhance accuracy and reliability for connected vehicle applications such as autonomous navigation and environmental perception.

1. Introduction

The advent of the Internet of Vehicles (IoV) and autonomous driving technologies has revolutionized modern transportation systems, paving the way for smart connected vehicles to become integral components of intelligent mobility [1]. These vehicles rely heavily on robust and accurate perception systems to ensure safe navigation, real-time decision making, and efficient communication with surrounding infrastructure and other vehicles [2,3]. Visual pose perception, which determines the precise position and orientation of a vehicle based on visual data, has emerged as a critical enabling technology in this domain [4]. Its applications span autonomous navigation, obstacle detection, and vehicle-to-everything (V2X) communication, making it a cornerstone of modern transportation innovation [5].
However, the deployment of visual pose perception systems in real-world vehicular environments faces significant challenges [6]. Environmental factors such as occlusions, varying lighting conditions, and dynamic backgrounds can severely degrade the performance of traditional deep learning models. Furthermore, the acquisition of high-quality, large-scale image datasets under diverse real-world conditions is both time-consuming and expensive, particularly in scenarios requiring complex traffic simulations or adverse weather environments [7]. These limitations hinder the scalability and adaptability of existing systems in meeting the demands of next-generation transportation technologies.
To address these challenges, the use of synthetic image datasets generated from high-fidelity 3D models has gained increasing attention [8]. This approach offers several key advantages, including reduced data collection costs, the ability to simulate diverse conditions, and greater control over environmental variables. By leveraging synthetic datasets, researchers can train and validate models under scenarios that are otherwise difficult to replicate in real-world experiments [9]. Additionally, advancements in deep learning architectures such as Vision-LSTM (ViL) provide robust tools for feature extraction and matching, even in complex and dynamic environments [10]. ViL, with its enhanced memory and parallel processing capabilities, is particularly well-suited for visual localization tasks in vehicular scenarios.
In this study, we propose an active visual pose perception method that integrates synthetic image generation with the Vision-LSTM network to address the limitations of traditional approaches. Our method is validated through two experimental scenarios: independent target matching, which evaluates the system’s ability to recognize objects with distinct features, and corridor-like structured environments, which simulate indoor navigation challenges. These scenarios represent simplified yet critical conditions for testing the robustness and adaptability of visual localization systems. The experimental results demonstrate significant improvements in pose estimation accuracy, highlighting the potential for broader applications in IoV environments, such as autonomous navigation, V2V communication, and intelligent transportation systems.
The main contributions of this study are summarized as follows:
  • Active Visual Pose Perception: The proposed method integrates synthetic images generated from high-fidelity 3D models with Vision-LSTM to enhance visual pose estimation.
  • Cost-Effective Data Generation: By using synthetic data, this approach mitigates the challenges of collecting large-scale real-world datasets, making it a scalable solution for autonomous navigation.
  • Robustness in Complex Environments: The proposed method is shown to perform well in both independent object detection and structured indoor environments, demonstrating its applicability in autonomous vehicle systems.

3. Method

3.1. Construction of 3D Models

Synthetic image datasets play a pivotal role in advancing IoV-compatible systems. The datasets mimic real-world vehicular environments, including diverse lighting conditions, adverse weather, and occlusions. This capability not only reduces the dependency on costly real-world data collection but also ensures controlled experimentation for the development of IoV technologies.
The first step is to construct the 3D model of the localization scene. In this study, the selected 3D indoor model is derived from the BIM of the second floor of the Chemical and Environmental Building at China University of Mining and Technology (Beijing). The external contour of the scene is obtained from its floorplan dimensions. Using 3ds Max, the wireframe model is constructed; then, the details are added based on material texture maps to capture the fine features of the objects. This process allows for the acquisition of relevant geographic information data of the corridor. The model is rendered using the Corona Renderer to enhance its realism. The process is illustrated in Figure 1.
Figure 1. Model construction process.
Realistic object appearance depends on numerous low-level cues, including object shape, pose, surface color, reflectance, light-source position, spectral distribution, background scene attributes, and camera characteristics. The 3D model of the corridor includes specific measurement data of the entire building, material images for various parts, and orientation maps of different components. The height and other relevant information pertaining to the building were obtained from the school’s official website and on-site measurements. Material images and orientation maps of various parts were obtained by taking photographs and processing these images. The specific measurement data of the building interior are shown in Table 1.
Table 1. Specific information on building features.
The model textures are derived primarily from photographs, which are processed with tools supporting image segmentation, color correction, camera mapping, raster and vector painting, motion blur, depth of field, and stereoscopic effects. The photographs are then cleaned to remove impurities and adjusted for size, contrast, brightness, and sharpness. After processing, they are converted into texture maps and stored in a texture library.
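To illustrate the kind of photo-to-texture preprocessing described above, the following Pillow sketch resizes a photograph, suppresses small impurities, and adjusts contrast, brightness, and sharpness before saving it to the texture library. File paths, target size, and enhancement factors are illustrative assumptions, not values from the paper.

```python
from PIL import Image, ImageEnhance, ImageFilter

def photo_to_texture(src_path, dst_path, size=(1024, 1024),
                     contrast=1.1, brightness=1.05, sharpness=1.2):
    """Turn a photograph into a texture map: resize, denoise, and adjust
    contrast/brightness/sharpness (all parameter values are illustrative)."""
    img = Image.open(src_path).convert("RGB")
    img = img.resize(size, Image.LANCZOS)           # normalize texture resolution
    img = img.filter(ImageFilter.MedianFilter(3))   # remove small impurities/noise
    img = ImageEnhance.Contrast(img).enhance(contrast)
    img = ImageEnhance.Brightness(img).enhance(brightness)
    img = ImageEnhance.Sharpness(img).enhance(sharpness)
    img.save(dst_path)                              # store in the texture library

# photo_to_texture("photos/corridor_wall.jpg", "textures/corridor_wall.png")
```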

3.2. Active Visual Position Detection

Our detection method is divided into two main stages. In the offline stage, we collect synthetic image data from the highly realistic 3D model and use them to construct an offline image database for training. In the online stage, we retrieve the scenes most similar to the current scene from a set of known scenes and take the pose of the most similar reference image as an approximation of the pose of the currently observed image, thereby achieving pose detection in the current scene. The detailed process flow is illustrated in Figure 2.
Figure 2. Diagram of active vision position detection.
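The online stage reduces to a nearest-neighbor search over the offline database: the query image is embedded with the trained backbone and assigned the pose of the most similar reference image. The sketch below (PyTorch) shows this retrieval step under the assumption that the backbone returns one global feature vector per image; names such as estimate_pose, db_feats, and db_poses are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def estimate_pose(query_img, backbone, db_feats, db_poses):
    """Approximate the pose of the current observation by the pose of the
    most similar reference image in the offline database.

    query_img: (1, 3, H, W) tensor of the currently observed image
    backbone:  trained feature extractor (e.g., the ViL network)
    db_feats:  (N, D) L2-normalized embeddings of the synthetic reference images
    db_poses:  list of N reference poses (position/orientation labels)
    """
    backbone.eval()
    q = F.normalize(backbone(query_img), dim=-1)   # (1, D) query embedding
    sims = q @ db_feats.T                          # cosine similarity to every reference
    best = sims.argmax(dim=-1).item()              # index of the most similar reference
    return db_poses[best], sims[0, best].item()
```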

3.3. Model Structure

We primarily use Vision-LSTM (ViL) as the backbone network for visual feature extraction. First, the two-dimensional image is divided into patches of a given size. Each patch is mapped to a one-dimensional vector through linear projection to form a token, and a learnable positional embedding is added to each patch token. The patch-token sequence is then fed into a stack of 24 mLSTM blocks (as shown in Figure 3) for feature extraction, with each block processing the full token sequence to ensure robust spatial and temporal feature learning. After the 24 mLSTM blocks, bilateral concat pooling is applied for feature aggregation. This pooling method is analogous to the [CLS] and [AVG] pooling designs used in ViT: it averages the first and last patch tokens in the sequence to form a new feature representation, which is then passed through a linear layer for classification. The specific structure is illustrated in Figure 3.
Figure 3. Vision-LSTM network structure.
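A compact PyTorch sketch of this pipeline is given below: patch embedding via a strided convolution, learnable positional embeddings, a stack of token-to-token mLSTM blocks, bilateral pooling of the first and last tokens, and a linear classifier. The block modules are passed in from outside, and hyperparameters such as image size, patch size, and class count are illustrative; this is a sketch of the described architecture, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ViLClassifier(nn.Module):
    """Sketch of the ViL pipeline in Section 3.3 (illustrative)."""

    def __init__(self, blocks, img_size=512, patch_size=16, dim=384, num_classes=5):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # split the image into patches and linearly project each patch to a token
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # learnable positional embedding added to every patch token
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.blocks = nn.ModuleList(blocks)          # e.g., 24 mLSTM blocks
        self.head = nn.Linear(dim, num_classes)      # final linear classifier

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, M, D) tokens
        tokens = tokens + self.pos_embed
        for block in self.blocks:                    # alternating-direction mLSTM blocks
            tokens = block(tokens)
        # bilateral pooling: average the first and last patch tokens
        pooled = 0.5 * (tokens[:, 0] + tokens[:, -1])
        return self.head(pooled)

# stand-in blocks only check shapes; replace with real mLSTM blocks:
# model = ViLClassifier(blocks=[nn.Identity() for _ in range(24)])
# logits = model(torch.randn(2, 3, 512, 512))
```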

3.4. ViL Encoder

The core of Vision-LSTM (ViL) is composed of alternating mLSTM (matrix LSTM) modules. mLSTM significantly enhances the model’s memory capacity and parallel processing capabilities by extending vector operations in traditional LSTM to matrix operations. The core innovation of mLSTM is the extension of traditional scalar memory cells to matrix memory cells, allowing mLSTM to store more information in matrix form, thereby increasing the model’s storage capacity.
In mLSTM, each state is no longer a single vector but a matrix, enabling it to capture more complex data relationships and patterns within a single time step. mLSTM uses covariance update rules to store and retrieve information. This rule is implemented by storing key–value pairs as rows or columns of a matrix, thereby improving retrieval separability and the signal-to-noise ratio. The design of mLSTM is inspired by Bidirectional Associative Memories (BAMs), which use matrix multiplication for information storage and retrieval.
mLSTM is particularly well-suited for handling large-scale datasets and tasks that require highly complex data pattern recognition. Unlike sLSTM, the design of mLSTM allows for fully parallelized processing because it eliminates the recurrent connections (memory mixing) between hidden layers. This enables mLSTM to be more efficiently trained and inferred on modern hardware.
The input gate and forget gate of mLSTM can use exponential activation functions, while the output gate continues to use the sigmoid function. These gating mechanisms allow the model to control the inflow and forgetting of information effectively.
The cell state update equation is given by Equation (1):
$C_t = f_t \, C_{t-1} + i_t \, v_t k_t^{\top}$
In Equation (1), $C_t$ is the current cell state, $f_t$ is the forget-gate activation at time step t, $i_t$ is the input-gate activation at time step t, $v_t$ is the value vector at time step t, and $k_t$ is the key vector at time step t.
The normalizer state update equation is expressed as Equation (2):
$n_t = f_t \, n_{t-1} + i_t \, k_t$
In Equation (2), $n_t$ is the current normalizer state, $f_t$ is the forget-gate activation at time step t, $n_{t-1}$ is the previous normalizer state, $k_t$ is the key vector at time step t, and $i_t$ is the input-gate activation at time step t.
The hidden state equation is given by Equation (3):
$h_t = o_t \odot \tilde{h}_t, \qquad \tilde{h}_t = C_t q_t \,/\, \max\{\lvert n_t^{\top} q_t \rvert,\, 1\}$
In Equation (3), $h_t$ is the current hidden state, $o_t$ is the output-gate activation at time step t, $\tilde{h}_t$ is the intermediate hidden state, $C_t$ is the cell state at time step t, $q_t$ is the query vector at time step t, $n_t$ is the current normalizer state, and the term $\max\{\lvert n_t^{\top} q_t \rvert, 1\}$ ensures numerical stability.
The gate computations are defined by Equations (4)–(6):
$q_t = W_q x_t + b_q$
$k_t = \tfrac{1}{\sqrt{d}}\, W_k x_t + b_k$
$v_t = W_v x_t + b_v$
In Equations (4)–(6), $q_t$, $k_t$, and $v_t$ are the query, key, and value vectors, respectively, at time step t; $W_q$, $W_k$, and $W_v$ are the corresponding weight matrices; $b_q$, $b_k$, and $b_v$ are the biases; $x_t$ is the input vector at time step t; and $d$ is the dimensionality of the model.
The input gate is computed as shown in Equation (7):
$i_t = \exp(\tilde{i}_t), \qquad \tilde{i}_t = w_i^{\top} x_t + b_i$
In Equation (7), $i_t$ is the input-gate activation, $\tilde{i}_t$ is the intermediate input-gate value, $w_i$ is the weight vector for the input gate, $x_t$ is the input vector, and $b_i$ is the bias.
The forget gate is calculated as shown in Equation (8):
$f_t = \sigma(\tilde{f}_t) \ \text{or} \ \exp(\tilde{f}_t), \qquad \tilde{f}_t = w_f^{\top} x_t + b_f$
In Equation (8), $f_t$ is the forget-gate activation, $\tilde{f}_t$ is the intermediate forget-gate value, $w_f$ is the weight vector for the forget gate, $x_t$ is the input vector, and $b_f$ is the bias.
The output gate is defined as shown in Equation (9):
$o_t = \sigma(\tilde{o}_t), \qquad \tilde{o}_t = W_o x_t + b_o$
In Equation (9), $o_t$ is the output-gate activation, $\tilde{o}_t$ is the intermediate output-gate value, $W_o$ is the weight matrix for the output gate, $x_t$ is the input vector, and $b_o$ is the bias.
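The following PyTorch sketch implements one recurrence step of Equations (1)–(9) for a single head, using the sigmoid variant of the forget gate; it is written for clarity rather than speed and is not the parallelized form used in practice. The params dictionary of weights and biases is an assumed container, not part of the original formulation.

```python
import torch

def mlstm_step(x_t, C_prev, n_prev, params, d):
    """One mLSTM time step following Eqs. (1)-(9) (single head, unbatched sketch).

    x_t:    (d_in,) input vector        C_prev: (d, d) matrix memory cell
    n_prev: (d,) normalizer state       params: dict of weights and biases
    """
    q = params["W_q"] @ x_t + params["b_q"]                    # Eq. (4): query
    k = params["W_k"] @ x_t / d ** 0.5 + params["b_k"]         # Eq. (5): scaled key
    v = params["W_v"] @ x_t + params["b_v"]                    # Eq. (6): value

    i = torch.exp(params["w_i"] @ x_t + params["b_i"])         # Eq. (7): exponential input gate
    f = torch.sigmoid(params["w_f"] @ x_t + params["b_f"])     # Eq. (8): forget gate (sigmoid variant)
    o = torch.sigmoid(params["W_o"] @ x_t + params["b_o"])     # Eq. (9): output gate

    C = f * C_prev + i * torch.outer(v, k)                     # Eq. (1): matrix cell update
    n = f * n_prev + i * k                                     # Eq. (2): normalizer update
    h_tilde = (C @ q) / torch.clamp((n @ q).abs(), min=1.0)    # Eq. (3): stabilized retrieval
    return o * h_tilde, C, n                                   # gated hidden state, new states
```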
Integrating the improved mLSTM layer into residual blocks allows for more efficient handling of complex sequential data. The specific structure is shown in Figure 4.
Figure 4. mLSTM block structure.
As shown in Figure 4, the input token sequence first passes through a normalization layer. The normalized sequence is then linearly projected into two vectors of dimension E, x and y. For odd-numbered mLSTM blocks, x passes directly through the mLSTM layer to generate x_forward, which is gated by y and combined with the input token sequence through a residual connection. For even-numbered mLSTM blocks, x is flipped before passing through the mLSTM layer and flipped again afterwards to generate x_forward, which is likewise gated by y and combined with the input token sequence through a residual connection.
Odd-numbered mLSTM blocks process the input tokens from top-left to bottom-right, while even-numbered mLSTM blocks process them from bottom-right to top-left, as shown in Figure 5. Finally, a bidirectional average pooling operation is performed by taking the average of the first and last positions of the sequence to generate a new feature representation. This operation captures the global information of the input data. The pseudo-code for the mLSTM algorithm is in Algorithm 1.
Algorithm 1. mLSTM Block Process
Input: token sequence T_l : (B, M, D)
Output: new sequence T_k : (B, M, D)
1:  /* normalize the input sequence T_l */
2:  T_l ← Norm(T_l)
3:  x : (B, M, E) ← Linear_x(T_l)
4:  y : (B, M, E) ← Linear_y(T_l)
5:  y : (B, M, E) ← SiLU(y)
6:  /* process with alternating traversal directions */
7:  for i in range(depth):
8:      if i % 2 == 0:
9:          x1 : (B, M, E) ← flip(x)
10:         x2 : (B, M, E) ← mLSTM(x1)
11:         x : (B, M, E) ← flip(x2)
12:     else:
13:         x : (B, M, E) ← mLSTM(x)
14: end for
15: /* gated output z */
16: z : (B, M, E) ← x ⊙ y
17: /* bilateral pooling */
18: z ← (z[:, 0] + z[:, −1]) / 2
19: /* residual connection */
20: T_k : (B, M, D) ← z + T_l
Figure 5. Visualization of traversal paths.
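A PyTorch sketch of one block following Algorithm 1 and Figure 4 is shown below. The inner mLSTM layer is passed in as any (B, M, E) → (B, M, E) sequence module, a projection back to the token dimension is added so that the residual connection type-checks, and the bilateral pooling of line 18 is left to the final aggregation stage rather than applied per block; all of these are illustrative choices, not the authors' reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLSTMBlock(nn.Module):
    """Sketch of one mLSTM block (Algorithm 1 / Figure 4): LayerNorm, two
    linear projections, SiLU gating, optionally reversed mLSTM traversal,
    and a residual connection back to the input tokens."""

    def __init__(self, mlstm_layer, dim, expand_dim, reverse=False):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj_x = nn.Linear(dim, expand_dim)    # Linear_x in Algorithm 1
        self.proj_y = nn.Linear(dim, expand_dim)    # Linear_y in Algorithm 1
        self.proj_out = nn.Linear(expand_dim, dim)  # back to token dimension D
        self.mlstm = mlstm_layer
        self.reverse = reverse                      # flipped traversal for alternate blocks

    def forward(self, tokens):                      # tokens: (B, M, D)
        residual = tokens
        t = self.norm(tokens)
        x = self.proj_x(t)                          # (B, M, E)
        y = F.silu(self.proj_y(t))                  # (B, M, E) gating branch
        if self.reverse:                            # flip, run mLSTM, flip back
            x = torch.flip(self.mlstm(torch.flip(x, dims=[1])), dims=[1])
        else:
            x = self.mlstm(x)
        z = x * y                                   # gated output
        return self.proj_out(z) + residual          # residual connection

# usage with a stand-in sequence layer (replace with a real mLSTM layer):
# block = MLSTMBlock(nn.Identity(), dim=384, expand_dim=384, reverse=True)
# out = block(torch.randn(2, 1024, 384))
```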

4. Experimentation and Evaluation

4.1. Datasets and Training Setups

To better align with IoV-specific challenges, we reframed our experimental scenarios to simulate real-world constraints commonly encountered by connected vehicles, such as limited space. Although the experiments were conducted in a structured corridor environment, we specifically chose this setup to reflect real-world scenarios like parking garages, tunnels, or narrow urban streets where navigation is hindered by confined spaces and complex traffic conditions. This simulation allows for testing of the robustness of visual localization systems in constrained environments, which are prevalent in IoV settings. By replicating these real-world challenges, we aim to demonstrate the applicability and effectiveness of our system under conditions that closely mirror the challenges of vehicle localization in IoV environments.
The experimental data used in this study are divided into two main parts: a virtual image dataset and a real image dataset, which are used for model training and testing, respectively. In order to simulate the feature-matching requirements in different scenarios, we designed two experimental scenarios: an independent object scenario and a complex object environment. The virtual image dataset is generated by 3D modeling tools (e.g., 3ds Max). Taking the corridor on the second floor of the Chemical Environment Building as an example, we constructed a high-fidelity 3D simulation model using architectural blueprints and simulated the light reflection, shadows, and other characteristics of the actual environment by adding material maps. In addition, to verify the effectiveness of the virtual image generation method, we used an iPhone 13 Pro to capture real image data in the same environment.
In the independent object scene, we acquired virtual images of a single target in four directions, with 500 images in each direction, generating a total of 2000 images for training. The real image data were acquired by controlling the viewing angle and offset angle (±10°) with 480 high-resolution images for testing, as shown in Figure 6. In the complex environment scene, we further introduced the virtual images of the corridor background to simulate multiple targets and complex lighting conditions. A total of five orientations of image data were collected, each with a 10° left–right swing from the object’s center axis and a height of 1–1.2 m. A total of 360 virtual images with a resolution of 512 × 512 were collected in each direction, totaling 1800 images; 450 real images were collected at the same time as controls, as shown in Figure 7.
Figure 6. Independent object datasets.
Figure 7. Dataset for the modeling of drinking fountains in a corridor.
The hardware platform for the experiments comprised 256-core NVIDIA GPUs and dual-core NVIDIA CPUs (NVIDIA, Santa Clara, CA, USA), and the software environment consisted of Python 3.8, PyTorch 1.10, and CUDA 12.3. The specific device and software tool parameters are listed in Table 2. Model training used the AdamW optimizer, with the initial learning rate set to 0.001 and adjusted via a cosine annealing schedule. Training ran for 150 epochs with a batch size of 16. To enhance the generalization ability of the model, we also applied data augmentation strategies such as random cropping and horizontal flipping during data processing. The main model training hyperparameters are shown in Table 3, and a minimal configuration sketch follows Table 3.
Table 2. Experimental devices and software tools.
Table 3. Main hyperparameters for model training.
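A minimal sketch of this configuration is shown below: AdamW with an initial learning rate of 0.001, cosine annealing over 150 epochs, and random cropping plus horizontal flipping as augmentation. The crop size and the absence of weight decay are assumptions, since they are not stated in the text.

```python
import torch
from torch import nn, optim
from torchvision import transforms

# augmentation as described: random cropping and horizontal flipping
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),     # crop size is an illustrative assumption
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def build_training(model, epochs=150, lr=1e-3):
    """Optimizer and scheduler per Table 3: AdamW, initial LR 0.001,
    cosine annealing; batch size 16 is set on the DataLoader."""
    optimizer = optim.AdamW(model.parameters(), lr=lr)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.CrossEntropyLoss()
    return optimizer, scheduler, criterion
```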

4.2. Design of Experiments

The experimental design is divided into three parts: feature-matching validation in independent object scenes, robustness evaluation in complex environment scenes, and validation of synthetic-data generalization and the effectiveness of fine tuning. Each experiment trained the model on virtual image data and tested it on real image data. The performance advantages of the ViL model and the potential of virtual images in practical applications are explored through comparative analysis across the different scenes.
In Experiment 1, we selected a water dispenser as an independent object and simulated the feature changes of the object through multi-view images generated from virtual data. The core objective of the experiment was to verify whether ViL is able to accurately match independent target images under conditions of low background interference. We used real image data to evaluate the model and performed comparative analysis with other mainstream models (e.g., Swin Transformer, MobileViT, etc.) on metrics such as top-1 accuracy, precision, recall, and F1 score.
Experiment 2, on the other hand, considered complex environmental scenarios to explore the performance of the model under multi-target and complex background interference conditions. Taking the corridor of the Chemical Environment Building as the test scene, we used virtual data to simulate the actual situation, including factors such as lighting changes and occlusion. The real image data introduced in the test contain multiple feature points, such as the water dispenser and background wall, and the robustness of ViL was verified through multi-model comparative analysis. In addition, we designed a background-dependency analysis to explore the impact of environmental factors on feature-matching accuracy.
In Experiment 3, we focused on the generalization ability of synthetic data and the effectiveness of fine tuning with real data. The model was first trained solely on synthetic data generated from virtual environments, then evaluated on real-world data. To further enhance the model’s performance, a fine-tuning process was introduced using a small subset (10–20%) of real-world data to adapt the model to real-world conditions. Specifically, we froze the lower-level feature extraction layers of the pre-trained model while fine tuning the higher-level layers and classifier. The experiment evaluated the model’s performance before and after fine tuning on metrics such as top-1 accuracy, precision, recall, and F1 score. Comparative analysis highlights the significance of fine tuning in bridging the domain gap between synthetic and real data.
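The freezing scheme can be sketched as follows, assuming a model laid out like the ViLClassifier sketch in Section 3.3 (a patch embedding, a list of blocks, and a classifier head); the split point between frozen and trainable blocks and the fine-tuning learning rate are illustrative.

```python
from torch import optim

def setup_finetuning(model, n_frozen_blocks=12, lr=1e-4):
    """Freeze the low-level feature extraction layers and fine-tune only the
    higher blocks and the classifier head (Experiment 3)."""
    for p in model.patch_embed.parameters():
        p.requires_grad = False                 # freeze the patch embedding
    for block in model.blocks[:n_frozen_blocks]:
        for p in block.parameters():
            p.requires_grad = False             # freeze the lower mLSTM blocks
    trainable = [p for p in model.parameters() if p.requires_grad]
    return optim.AdamW(trainable, lr=lr)        # update higher blocks + head only
```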
These three experiments not only cover multi-scenario requirements ranging from simple to complex but also systematically verify the applicability of ViL and its optimization capability for feature-matching tasks by introducing multiple comparison models and detailed metric analysis.

4.3. Analysis of Experimental Results

4.3.1. Analysis of Experimental Results (Independent Object Scene)

In the independent object scene, we conducted a comprehensive comparison of ViL with the latest mainstream models, including EfficientNetV2-S, Swin Transformer V2-Tiny, ConvNeXt V2, MobileViT, and MAE (Masked Autoencoder). Table 4 shows the performance comparison results for each model.
Table 4. Comparative data for each network of the independent marker model.
In the independent object-matching scenario, the experimental results show that Vision-LSTM (ViL) performs the best among all the compared models, with a top-1 accuracy of 91.25%, demonstrating its significant advantages in feature extraction and matching tasks. The ViL loss–accuracy curve is shown in Figure 8. The mLSTM module, the core of ViL, enhances adaptability to complex-texture targets by modeling the relationships between global features through matricization operations while maintaining high matching accuracy for samples with large changes in viewpoint. In particular, the F1 score reaches 90.96%, further validating the comprehensiveness and robustness of ViL in feature modeling.
Figure 8. The ViL loss–accuracy curve (independent object scene).
Swin Transformer V2-Tiny ranks second, with a 90.15% top-1 accuracy. Its dynamic window division strategy optimizes local feature extraction and global information fusion and performs stably on samples with uniform feature distributions. However, the windowing mechanism is slightly less capable of global modeling for long-range targets or complex texture targets and is, thus, slightly inferior to ViL for samples with large viewpoint offsets.
MAE (Masked Autoencoder) is ranked fourth, with a top-1 accuracy of 87.15%. This result shows that although MAE performs well on self-supervised learning tasks (e.g., image reconstruction), its design mainly focuses on the complementation of occluded regions rather than the learning of global structures. In feature extraction tasks, MAE’s ability to model local details is slightly insufficient. In addition, MAE’s feature extraction relies on a large-scale pre-training process, and its ability to model fine details for small-sample datasets was not as good as that of ViL in this experiment.
EfficientNetV2-S and ConvNeXt V2 achieved similar top-1 accuracies of 89.42% and 88.96%, respectively. EfficientNetV2-S improves computational efficiency through its compound scaling strategy, but its feature extraction is oriented more toward local structures, leaving room for improvement in global information modeling. ConvNeXt V2, as an improved convolutional network, increases local feature extraction efficiency by optimizing the convolutional kernel design, but its performance is slightly inferior to that of the Transformer-based Swin V2 and ViL models when matching samples with complex textures and multi-view changes.
The lightweight MobileViT model performs relatively weakly in this scenario, with a top-1 accuracy of only 86.29%. MobileViT is designed for low-computing-resource scenarios, and its lightweight Transformer structure limits high-complexity feature extraction for independent targets. Although MobileViT is suitable for resource-constrained deployments, its ability to integrate global information is limited in tasks that require high-precision matching. Specific model comparison data are shown in Figure 9.
Figure 9. Independent object scene: model comparison.
In summary, the performance of ViL proves its excellent fine-grained feature modeling capability and its ability to capture global dependency in independent object scenes. This ability makes it highly applicable in real-world applications such as industrial inspection and automated equipment recognition. Meanwhile, although models such as MAE perform well in self-supervised learning tasks, they need to be further optimized for local modeling in specific feature-matching tasks before they can approach the performance level of ViL. In the future, we can try to combine the self-supervised feature of MAE with the mLSTM module design of ViL to explore more efficient global–local feature combination models.

4.3.2. Analysis of Experimental Results (Complex Environment Scene)

We similarly evaluated the performance of a variety of mainstream models in a complex background environment, and the results are shown in Table 5.
Table 5. Comparative data for each network on the water dispenser marker model.
In complex background environments, ViL again outperforms the other comparison models, with a top-1 accuracy of 95.87%, and its performance advantage is even more pronounced. The ViL loss–accuracy curve is shown in Figure 10. This is primarily due to the mLSTM module's superior ability to model global information. Environmental scenes often contain distractions such as background complexity, illumination variations, and multiple targets. ViL effectively captures global dependencies between features through matricization operations while suppressing the impact of background noise. This allows ViL to accurately recognize and match target features despite background complexities. As shown in the heatmap comparisons between real and synthetic images, ViL successfully identifies and prioritizes key regions in both scenarios, demonstrating its robustness in feature selection. Additionally, the feature point-matching comparison between real and synthetic images further highlights ViL's consistent ability to match key features, even with variations in image type and background conditions. Feature point matching between a real image and a synthesized image is shown in Figure 11.
Figure 10. The ViL loss–accuracy curve (the complex environment scene).
Figure 11. Feature point matching between a real image and synthesized image.
Swin Transformer V2-Tiny and MobileViT rank second and third, with top-1 accuracies of 94.56% and 91.43%, respectively. Both models perform relatively well in complex environments. Swin Transformer V2-Tiny optimizes the fusion of local and global information through its dynamic window division strategy, but the heatmap comparison shows that it struggles slightly with certain background complexities and fails to focus fully on some important features. MobileViT, while lightweight and efficient, has difficulty processing high-resolution inputs and lacks the capacity to comprehensively model complex backgrounds and multi-target features; the heatmap comparison shows that it focuses less reliably on the most important features in cluttered environments, which lowers its overall matching accuracy. MAE (Masked Autoencoder), in contrast, emphasizes self-supervised representation learning rather than lightweight design; while this improves representation learning in an unsupervised manner, it adapts less well to dynamic or complex scenes, resulting in lower accuracy in such environments. Specific model comparison data are shown in Figure 12.
Figure 12. Complex environment scene: model comparison.
While ViL excels in feature modeling, its computational complexity—especially due to the mLSTM-based architecture—may make it unsuitable for real-time applications, particularly in resource-constrained environments like drones or mobile robots. The mLSTM module, while capturing global dependencies, requires extensive matrix operations and the processing of long sequences, leading to high memory demands and slower inference speeds. Additionally, ViL’s parameter count (23M) results in significant memory consumption, which may cause performance bottlenecks in embedded systems or real-time inference tasks.
To address these challenges, we introduce ViL-T, an optimized lightweight version designed for resource-constrained environments. Specifically, ViL-T reduces the latent dimension from 384 to 192 to lower computational complexity, and the architecture is optimized to retain much of ViL's performance while reducing memory usage and computational demand. The parameter count drops from 23M to 6M, significantly lowering memory consumption and computational resource requirements and making the model more suitable for deployment on embedded platforms and for real-time inference, especially in drones and mobile robots, where fast response times and low power consumption are critical. The ViL-T loss–accuracy curve is shown in Figure 13.
Figure 13. The ViL-T loss–accuracy curve.
Although ViL-T experiences a slight drop in top-1 accuracy (92.47%), it still outperforms MobileViT, particularly in complex background conditions and multi-target recognition. Despite having a smaller parameter count (5.6M), MobileViT struggles with complex backgrounds and noise suppression, leading to a lower top-1 accuracy of 91.43%, which is inferior to that of ViL-T. A comparison of the characteristic heat maps of the different networks is shown in Figure 14 and Figure 15.
Figure 14. Perceived performance of ViL on a real image (left) and synthesized image (right).
Figure 15. Perceived performance of Swin Transformer V2-Tiny (left) and MobileViT (right).
Thus, while ViL-T has slightly lower accuracy compared to ViL, its reduced latent dimension and optimized architecture effectively lower memory usage and computational complexity. It continues to provide efficient performance in embedded devices and real-time applications, making it well-suited for tasks requiring complex background processing and multi-target detection.

4.3.3. Analysis of Experimental Results (Fine-Tuning Model)

The data in Table 6 comparing the synthetic data only model and the synthetic data + fine-tuning model highlight the significant advantages of incorporating real-world data into the training process. The synthetic data only model achieved strong performance in structured and controlled environments, with a top-1 accuracy of 95.87%, a precision of 95.62%, a recall of 95.81%, and an F1 score of 95.71%. However, its performance degraded when tested on real-world data, primarily due to the domain gap between synthetic and real data. Synthetic datasets often lack the fine-grained textures, lighting variations, and background complexities found in natural images, limiting the model's ability to generalize to unstructured scenarios.
Table 6. Comparative data for synthetic data only and synthetic data + fine-tuning models.
In contrast, the synthetic data + fine-tuning model demonstrated substantial improvements across all metrics after being fine-tuned with a small subset (10–20%) of real-world data. The top-1 accuracy increased to 97.23%, precision improved to 96.91%, recall rose to 97.10%, and the F1 score reached 97.00%. These results reflect the effectiveness of fine tuning in bridging the domain gap. By freezing the lower-level feature extraction layers and updating the higher-level layers and classifier, the model retained the general features learned from synthetic data while adapting to real-world characteristics such as texture details, lighting variations, and environmental occlusions.
This comparison underscores the importance of fine tuning as a critical step in enhancing the model’s generalization ability and robustness in real-world applications. While the synthetic data only model provides a strong baseline for pre-training, it is the incorporation of fine tuning that enables the model to adapt to diverse, real-world conditions and significantly outperform mainstream models such as Swin Transformer and MobileViT. The findings confirm that combining pre-training on synthetic data with fine tuning using real-world data is a practical and effective approach for tasks requiring domain adaptation.

5. Conclusions

The experimental results demonstrate that Vision-LSTM (ViL) significantly outperforms state-of-the-art deep learning models in both independent object scenes and complex background environments. This performance advantage is primarily attributed to the mLSTM module’s effective modeling of high-dimensional feature relationships and the enhanced data diversity provided by virtual image generation. However, further research is needed to evaluate ViL’s applicability in real-world IoV scenarios involving dynamic traffic systems and complex conditions.
Virtual image generation plays a critical role in feature-matching tasks, but its adaptability under extreme conditions such as low-light environments, severe occlusion, and high mobility requires improvement. Future research could integrate techniques such as Domain Adaptation (DA) and Generative Adversarial Networks (GANs) to bridge the gap between synthetic and real-world data, thereby improving the model’s robustness and accuracy in complex IoV scenarios.
The current experiments focus on specific scenarios such as indoor environments and static backgrounds, which limits the model’s generalizability to more diverse and dynamic IoV environments. Expanding the diversity of training data, such as by generating 3D models and datasets that include urban, rural, and various traffic conditions, could significantly improve the robustness and applicability of the model, addressing real-world vehicle localization challenges more comprehensively.
Although ViL demonstrates strong feature extraction capabilities, its performance remains limited in highly dynamic environments with rapidly changing objects or backgrounds and in GPS-denied conditions. Future research should explore sensor fusion and advanced data integration techniques to address these limitations.
As IoV technologies evolve, privacy and security issues related to the processing of visual and positional data are becoming increasingly important. Future efforts should prioritize privacy-preserving methods such as federated learning or encrypted data processing while also examining legal and regulatory frameworks. Looking forward, ViL’s application in Vehicle-to-Infrastructure (V2I) systems and traffic coordination platforms holds great potential for optimizing traffic flow, enhancing infrastructure integration, and supporting autonomous navigation, contributing to smarter and more interconnected transportation networks.

Author Contributions

Conceptualization, Y.L.; methodology, Y.L. and J.J.; formal analysis, J.J.; data curation, J.J.; writing—original draft preparation, J.J.; writing—review and editing, Y.L. and Z.T.; supervision, Y.L. and Z.T.; project administration, Y.L. and Z.T.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant numbers 52364017 and 52074305.

Data Availability Statement

Data are available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dey, K.C.; Rayamajhi, A.; Chowdhury, M.; Bhavsar, P.; Martin, J. Vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication in a heterogeneous wireless network–Performance evaluation. Transp. Res. Part C Emerg. Technol. 2016, 68, 168–184. [Google Scholar] [CrossRef]
  2. Zhu, X.G.; Hua, Q.Z.; Yan, W.J.; Guo, Z.W.; Yu, K.P. A Vehicle-Road Urban Sensing Framework for Collaborative Content Delivery in Free way-Oriented Vehicular Networks. IEEE Sens. J. 2024, 24, 5662–5674. [Google Scholar] [CrossRef]
  3. Dai, S.H.; Li, S.K.; Tang, H.C.; Ning, X.; Fang, F.; Fu, Y.X.; Wang, Q.L.; Cheng, L. MARP: A Cooperative Multiagent DRL System for Connected Autonomous Vehicle Platooning. IEEE Internet Things J. 2024, 11, 32454–32463. [Google Scholar] [CrossRef]
  4. Yi, S.; Zhang, H.; Liu, K. V2IViewer: Towards Efficient Collaborative Perception via Point Cloud Data Fusion and Vehicle-to-Infrastructure Communications. IEEE Trans. Netw. Sci. Eng. 2024, 11, 6219–6230. [Google Scholar] [CrossRef]
  5. Biswas, A.; Wang, H.C. Autonomous Vehicles Enabled by the Integration of IoT, Edge Intelligence, 5G, and Blockchain. Sensors 2023, 23, 60. [Google Scholar] [CrossRef]
  6. Liu, T.; Sun, D.; Bi, C.K.; Sun, Y.; Chen, S.M. Dynamic-Scene-Graph-Supported Visual Understanding of Autonomous Driving Scenarios. In Proceedings of the 17th Pacific Visualization Conference (PacificVis), Tokyo, Japan, 23–26 April 2024; IEEE Computer Soc: Tokyo, Japan, 2024; pp. 82–91. [Google Scholar]
  7. Song, Z.H.; He, Z.M.; Li, X.Y.; Ma, Q.M.; Ming, R.B.; Mao, Z.Q.; Pei, H.X.; Peng, L.H.; Hu, J.M.; Yao, D.Y.; et al. Synthetic Datasets for Autonomous Driving: A Survey. IEEE Trans. Intell. Veh. 2024, 9, 1847–1864. [Google Scholar] [CrossRef]
  8. Acharya, D.; Tatli, C.J.; Khoshelham, K. Synthetic-real image domain adaptation for indoor camera pose. ISPRS J. Photogramm. Remote Sens. 2023, 202, 405–421. [Google Scholar] [CrossRef]
  9. Su, H.; Qi, C.R.; Li, Y.; Guibas, L.J. Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3D model views. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2686–2694. [Google Scholar]
  10. Alkin, B.; Beck, M.; Poeppel, K.; Hochreiter, S.; Brandstetter, J. Vision-LSTM: xLSTM as Generic Vision Backbone. arXiv 2024, arXiv:2406.04303. [Google Scholar]
  11. Hyeon, J.; Jang, B.; Choi, H.; Kim, J.; Kim, D.; Doh, N. Photo-realistic 3D model based accurate visual positioning system for large-scale indoor spaces. Eng. Appl. Artif. Intell. 2023, 123, 106256. [Google Scholar] [CrossRef]
  12. Xu, W.L.; Zhou, G.Y.; Zhou, Y.Z.; Zou, Z.B.; Wang, J.L.; Wu, W.F.; Li, X.M. A Vision-Based Tactile Sensing System for Multimodal Contact Information Perception via Neural Network. IEEE Trans. Instrum. Meas. 2024, 73, 11. [Google Scholar] [CrossRef]
  13. Dong, J.; Noreikis, M.; Xiao, Y.; Ylä-Jääski, A. ViNav: A vision-based indoor navigation system for smartphones. IEEE Trans. Mob. Comput. 2018, 18, 1461–1475. [Google Scholar] [CrossRef]
  14. Peng, J.S.; Chen, D.H.; Yang, Q.; Yang, C.J.; Xu, Y.; Qin, Y. Visual SLAM Based on Object Detection Network: A Review. CMC-Comput. Mater. Contin. 2023, 77, 3209–3236. [Google Scholar] [CrossRef]
  15. Ahmed, M.F.; Masood, K.; Fremont, V.; Fantoni, I. Active slam: A review on last decade. Sensors 2023, 23, 8097. [Google Scholar] [CrossRef] [PubMed]
  16. Peng, J.; Yang, Q.; Chen, D.; Yang, C.; Xu, Y.; Qin, Y. Dynamic SLAMVisual Odometry Based on Instance Segmentation: A Comprehensive Review. Comput. Mater. Contin. 2024, 78, 1. [Google Scholar]
  17. Yasuda, Y.D.V.; Martins, L.E.G.; Cappabianco, F.A.M. Autonomous Visual Navigation for Mobile Robots: A Systematic Literature Review. ACM Comput. Surv. 2020, 53, 34. [Google Scholar] [CrossRef]
  18. Kendall, A.; Grimes, M.; Cipolla, R. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2938–2946. [Google Scholar]
  19. Laskar, Z.; Melekhov, I.; Kalia, S.; Kannala, J. Camera relocalization by computing pairwise relative poses using convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 929–938. [Google Scholar]
  20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  21. Ashraf, M.H.; Jabeen, F.; Alghamdi, H.; Zia, M.S.; Almutairi, M.S. HVD-Net: A Hybrid Vehicle Detection Network for Vision-Based Vehicle Tracking and Speed Estimation. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 19. [Google Scholar] [CrossRef]
  22. Gao, R.; Tian, Y.; Ye, F.; Luo, G.; Bian, K.; Wang, Y.; Wang, T.; Li, X. Sextant: Towards ubiquitous indoor localization service by photo-taking of the environment. IEEE Trans. Mob. Comput. 2015, 15, 460–474. [Google Scholar] [CrossRef]
  23. Liu, Z.; Xiong, J.; Ma, Y.; Liu, Y. Scene recognition for device-free indoor localization. IEEE Sens. J. 2023, 23, 6039–6049. [Google Scholar] [CrossRef]
  24. Alarfaj, M.; Su, Z.; Liu, R.; Al-Humam, A.; Liu, H. Image-tag-based indoor localization using end-to-end learning. Int. J. Distrib. Sens. Netw. 2021, 17, 15501477211052371. [Google Scholar] [CrossRef]
  25. Xu, S.; Chou, W.; Dong, H. A robust indoor localization system integrating visual localization aided by CNN-based image retrieval with Monte Carlo localization. Sensors 2019, 19, 249. [Google Scholar] [CrossRef]
  26. Ha, I.; Kim, H.; Park, S.; Kim, H. Image retrieval using BIM and features from pretrained VGG network for indoor localization. Build. Environ. 2018, 140, 23–31. [Google Scholar] [CrossRef]
  27. Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5297–5307. [Google Scholar]
  28. Acharya, D.; Khoshelham, K.; Winter, S. BIM-PoseNet: Indoor camera localisation using a 3D indoor model and deep learning from synthetic images. ISPRS J. Photogramm. Remote Sens. 2019, 150, 245–258. [Google Scholar] [CrossRef]
  29. Shinde, S.S.; Tarchi, D. Collaborative Reinforcement Learning for Multi-Service Internet of Vehicles. IEEE Internet Things J. 2023, 10, 2589–2602. [Google Scholar] [CrossRef]
  30. Qi, J.; Liu, Y.L.; Ling, Y.C.; Xu, B.; Dong, Z.J.; Sun, Y.F. Research on an Intelligent Computing Offloading Model for the Internet of Vehicles Based on Blockchain. IEEE Trans. Netw. Serv. Manag. 2022, 19, 3908–3918. [Google Scholar] [CrossRef]
  31. Xie, Q.; Ding, Z.X.; Zheng, P.P. Provably Secure and Anonymous V2I and V2V Authentication Protocol for VANETs. IEEE Trans. Intell. Transp. Syst. 2023, 24, 7318–7327. [Google Scholar] [CrossRef]
  32. Fan, W.H.; Su, Y.; Liu, J.; Li, S.M.; Huang, W.; Wu, F.; Liu, Y.A. Joint Task Offloading and Resource Allocation for Vehicular Edge Computing Based on V2I and V2V Modes. IEEE Trans. Intell. Transp. Syst. 2023, 24, 4277–4292. [Google Scholar] [CrossRef]
  33. Chung, Y.C.; Chang, H.Y.; Chang, R.Y.; Chung, W.H. Deep Reinforcement Learning-Based Resource Allocation for Cellular V2X Communications. In Proceedings of the 97th IEEE Vehicular Technology Conference (VTC-Spring), Florence, Italy, 20–23 June 2023; IEEE: Florence, Italy, 2023. [Google Scholar]
  34. Wang, B.Y.; Han, Y.; Wang, S.Y.; Tian, D.; Cai, M.J.; Liu, M.; Wang, L.J. A Review of Intelligent Connected Vehicle Cooperative Driving Development. Mathematics 2022, 10, 31. [Google Scholar] [CrossRef]
  35. Tian, D.; Li, J.B.; Lei, J.Y. Multi-sensor information fusion in Internet of Vehicles based on deep learning: A review. Neurocomputing 2025, 614, 18. [Google Scholar] [CrossRef]
  36. Shiri, F.M.; Perumal, T.; Mustapha, N.; Mohamed, R. A comprehensive overview and comparative analysis on deep learning models: CNN, RNN, LSTM, GRU. arXiv 2023, arXiv:2305.17473. [Google Scholar]
  37. Van Houdt, G.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955. [Google Scholar] [CrossRef]
  38. Beck, M.; Poeppel, K.; Spanring, M.; Auer, A.; Prudnikova, O.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. xLSTM: Extended Long Short-Term Memory. arXiv 2024, arXiv:2405.04517. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
