Article

Few-Shot Learning with Multimodal Fusion for Efficient Cloud–Edge Collaborative Communication

1 School of Sino-German Robotics, Shenzhen Institute of Information Technology, Shenzhen 518172, China
2 School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(4), 804; https://doi.org/10.3390/electronics14040804
Submission received: 8 January 2025 / Revised: 12 February 2025 / Accepted: 18 February 2025 / Published: 19 February 2025
(This article belongs to the Special Issue Computation Offloading for Mobile-Edge/Fog Computing)

Abstract

As demand for high-capacity, low-latency communication rises, mmWave systems are essential for enabling ultra-high-speed transmission in fifth-generation mobile communication technology (5G) and upcoming 6G networks, especially in dynamic, data-scarce environments. However, deploying mmWave systems in dynamic environments presents significant challenges, especially in beam selection, where limited training data and environmental variability hinder optimal performance. In such scenarios, computation offloading has emerged as a key enabler, allowing computationally intensive tasks to be shifted from resource-constrained edge devices to powerful cloud servers, thereby reducing latency and optimizing resource utilization. This paper introduces a novel cloud–edge collaborative approach integrating few-shot learning (FSL) with multimodal fusion to address these challenges. By leveraging data from diverse modalities—such as red-green-blue (RGB) images, radar signals, and light detection and ranging (LiDAR)—within a cloud–edge architecture, the proposed framework effectively captures spatiotemporal features, enabling efficient and accurate beam selection with minimal data requirements. The cloud server is tasked with computationally intensive training, while the edge node focuses on real-time inference, ensuring low-latency decision making. Experimental evaluations confirm the model’s robustness, achieving high beam selection accuracy under one-shot and five-shot conditions while reducing computational overhead. This study highlights the potential of combining cloud–edge collaboration with FSL and multimodal fusion for next-generation wireless networks, paving the way for scalable, intelligent, and adaptive mmWave communication systems.

1. Introduction

Millimeter-wave (mmWave) communication systems have emerged as a key enabler for next-generation wireless networks, including fifth-generation mobile communication technology (5G) and 6G networks, by offering significant spectral resources and extensive bandwidth [1,2,3]. However, the propagation of mmWave signals is significantly affected by severe attenuation, reflection, and blockage during transmission, which complicates the communication channel and creates a dynamic environment [4,5]. These characteristics pose substantial challenges for reliable communication, particularly in scenarios where real-time processing and low-latency decision making are critical.
To mitigate these issues, technologies such as large-scale multiple-input multiple-output (MIMO) antenna arrays and beamforming techniques have been developed. These methods improve spectral efficiency and overall system performance by focusing signal energy in specific directions [6,7]. However, the effectiveness of beamforming hinges on accurate beam selection, which remains a critical bottleneck. Beam selection involves determining the optimal beam direction from a predefined codebook to maximize communication reliability and efficiency [8]. Traditional beam selection methods often require substantial computational resources and high-quality training data, creating challenges in resource-constrained environments such as mobile edge devices [9].
To address these challenges, computation offloading has emerged as a promising approach, where computationally intensive tasks, such as beam selection, are offloaded from resource-limited edge devices to powerful cloud servers [10]. There have been recent advancements in artificial intelligence (AI)-assisted techniques for solving edge computing problems, particularly in vehicular networks. For instance, non-orthogonal multiple-access (NOMA)-assisted secure offloading, combined with asynchronous deep reinforcement learning (ADRL), has been demonstrated to improve resource allocation and network security in dynamic vehicular edge computing environments [11]. These approaches allow for more efficient use of available resources and enable intelligent decision making in real time, considering various factors such as network dynamics and security constraints. By distributing computational workloads, cloud–edge collaboration ensures low-latency inference at the edge while leveraging the cloud for complex training and optimization tasks [12]. This paradigm enables mmWave systems to adapt to dynamic environments without overwhelming local resources.
Advancements in multimodal sensor technologies further enhance the potential of computation offloading by integrating data from diverse sources such as red-green-blue (RGB) cameras, radar, and light detection and ranging (LiDAR). Multimodal data fusion provides a richer representation of the communication environment, enhancing adaptability and robustness under dynamic conditions [13,14]. Through cloud–edge architectures, multimodal data can be processed collaboratively, enabling more accurate beam selection and reducing reliance on a single data source [15,16].
Despite these advancements, achieving high-accuracy beam selection in data-scarce environments remains a significant challenge. Few-shot learning (FSL) is a machine learning approach that excels in environments with limited labeled data and offers a promising solution [17]. Within a cloud–edge framework, FSL enables models to generalize to new tasks with only a few training samples, reducing the need for large-scale data collection at the edge while maintaining high inference accuracy [18]. This combination of FSL, multimodal fusion, and computation offloading paves the way for scalable, low-latency, and adaptive mmWave communication systems capable of addressing the demands of next-generation wireless networks.
The contributions of this work are as follows:
  • We propose a novel framework that integrates few-shot learning with multimodal data fusion to achieve robust beam selection under data-limited conditions. By leveraging complementary features from RGB images, radar signals, and LiDAR data, the model effectively captures spatiotemporal features, enabling efficient and adaptive communication.
  • The multimodal fusion approach enhances the model’s robustness against environmental variability and noise. By combining few-shot learning with multimodal inputs, the proposed method can quickly adapt to new scenarios, maintaining high accuracy in dynamic communication environments.
  • By reducing the dependence on large-scale labeled datasets, the proposed framework optimizes computational efficiency, enabling real-time beam selection and satisfying the low-latency requirements of mmWave communication systems.

2. Related Works

2.1. Beam Selection Methods

Beam selection methods for mmWave communication systems typically rely on large-scale datasets and optimization algorithms, including signal feature-based prediction, codebook search, and radar-assisted techniques [19,20,21]. While these methods can achieve high precision in controlled settings with abundant data and computational resources [7,22], their real-world application is constrained by environmental variability, user mobility, and high computational demands [23].
For instance, traditional codebook-based beam selection methods enable faster beam alignment but often struggle with flexibility in dynamic environments [24]. Exhaustive search techniques, although precise, incur significant computational overhead. Recent advancements have explored radar- and vision-assisted methods, which integrate auxiliary sensor data with signal features to improve performance, particularly in non-line-of-sight (NLOS) scenarios [25,26].
However, these traditional approaches face limitations in highly dynamic communication environments. Frequent updates to training data are required to maintain accuracy, which can be time-intensive and costly. Additionally, environmental factors, such as weather, obstructions, and interference, introduce further complexity, degrading the effectiveness of signal-based methods [27].

2.2. Multimodal Fusion

Multimodal fusion integrates information from multiple sources, such as RGB images, radar, and LiDAR, to offer a more comprehensive representation of complex environments, for example, in urban scenarios where different sensor inputs complement each other [28,29]. By leveraging the strengths of each modality, fusion enhances system robustness and adaptability, especially in dynamic settings.
For example, RGB images capture visual details, radar provides velocity and range information, and LiDAR delivers precise three-dimensional (3D) spatial data. Combining these modalities mitigates the limitations of relying on a single data source, enabling more accurate and robust decision making [30]. In mmWave systems, multimodal fusion is particularly effective in urban environments, where multipath propagation and channel dynamics are prevalent [31].
Fusion approaches include early fusion, late fusion, and hybrid fusion. Early fusion combines features from different modalities at the input stage, allowing for shared representation learning but increasing feature dimensionality [32]. Late fusion aggregates independent predictions from each modality, often through weighted averaging or voting. While computationally efficient, it may miss intricate cross-modal interactions [33]. Hybrid fusion integrates both feature-level and decision-level fusion, balancing interaction modeling with computational efficiency [34].
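To make the distinction concrete, the short sketch below contrasts early and late fusion for two modalities. The feature dimensions, two-modality setup, and 32-class output are illustrative assumptions rather than details taken from the cited works.

```python
import torch
import torch.nn as nn

# Illustrative feature tensors for two modalities (batch of 8, 128-dim features each).
rgb_feat = torch.randn(8, 128)
radar_feat = torch.randn(8, 128)

# Early fusion: concatenate features, then classify from the shared representation.
early_classifier = nn.Linear(128 + 128, 32)            # 32 beam classes (assumed)
early_logits = early_classifier(torch.cat([rgb_feat, radar_feat], dim=-1))

# Late fusion: classify each modality independently, then average the predictions.
rgb_classifier = nn.Linear(128, 32)
radar_classifier = nn.Linear(128, 32)
late_logits = 0.5 * (rgb_classifier(rgb_feat) + radar_classifier(radar_feat))

print(early_logits.shape, late_logits.shape)            # both torch.Size([8, 32])
```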

2.3. Few-Shot Learning

FSL is designed to address scenarios with limited training data. Techniques like meta-stochastic gradient descent (Meta-SGD), Model-Agnostic Meta-Learning (MAML), and fine-tuning have demonstrated success across various domains [35,36,37]. Meta-SGD optimizes both model parameters and learning rates, enabling rapid adaptation but requiring high training complexity [35]. MAML learns a set of initial parameters that adapt efficiently to new tasks with minimal updates, although it incurs computational overhead due to repeated gradient calculations [36]. Fine-tuning transfers knowledge from a pre-trained model to a target task, performing well when the target domain shares similarities with the source domain [37]. In the context of mmWave communication, FSL reduces the reliance on large-scale datasets, enabling efficient beam selection with minimal labeled data. This makes it a compelling solution for dynamic and resource-constrained environments.

3. System Model

3.1. Problem Formulation

As illustrated in Figure 1, we propose an mmWave MIMO communication system in which the communication task is distributed between the edge base station (BS) and the cloud server. The edge BS handles real-time, local data processing, while the cloud server provides computational resources for more complex, centralized tasks. The edge BS interacts with the cloud server to ensure optimal beam selection under the constraints of limited data and varying environmental conditions, and this collaboration allows the system to adapt efficiently to dynamic communication environments. The proposed architecture focuses on achieving accurate beam selection under few-sample conditions at the edge BS.
Specifically, the observed signal, $Y(t)$, is influenced by a dynamic channel $H(t)$ and additive white Gaussian noise $\eta(t)$. The relationship can be expressed as
$$Y(t) = H(t) \cdot u(t) + \eta(t),$$
where $u(t)$ represents the transmitted signal. In highly dynamic mmWave environments, predicting the channel state $H(t)$ and selecting an optimal beam with limited data pose significant challenges. Traditional deep learning methods often demand extensive annotated datasets, which are impractical to collect in such scenarios.
To address these challenges, this work integrates FSL with multimodal data fusion. By utilizing limited labeled data across multiple modalities (e.g., RGB images, radar signals, and LiDAR), the method creates a rich representation of the environment, capturing crucial spatial, temporal, and contextual features for precise beam selection.
The proposed system includes a base station equipped with mmWave antenna arrays and various sensors, such as 3D LiDAR, RGB cameras, and frequency-modulated continuous-wave (FMCW) radar. These sensors provide real-time data, including visual and spatial information, as well as velocity measurements. The radar operates by transmitting chirped signals starting at frequency $f_s$ and increasing linearly with chirp slope $\nu$, reaching $f_s + \nu t$ at time $t$. The received radar signal is represented as
$$Z_r(t) = G_c G_t\, e^{j\left(2\nu\tau t + 2 f_s \tau - \nu\tau^2\right)},$$
where $G_c$ denotes the channel gain, $G_t$ is the transmitter gain, and $\tau$ is the round-trip time of the radar signal.
Radar sampling at a rate of $f_s$ produces $N$ samples per period. For a radar system with $M$ receiving antennas and $B$ chirps, the collected raw data form a matrix of dimensions $M \times N \times B$. A key feature, the range-angle map $R_a$, is derived using a two-dimensional Fourier transform:
$$R_a = \mathcal{F}_{2D}\!\left\{\sum_{b=1}^{B} Z_r(m, n, b)\right\} = \iint \sum_{b=1}^{B} Z_r(m, n, b)\, e^{2\pi j (m u + n v)}\, \mathrm{d}m\, \mathrm{d}n.$$
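As a rough illustration of this step, the NumPy sketch below forms a range-angle map by summing the radar cube over chirps and applying a 2D FFT across the antenna and sample axes. The cube dimensions follow the 4 × 256 × 128 format reported for the FMCW radar in Section 4, while the axis ordering and FFT conventions are assumptions.

```python
import numpy as np

# Assumed raw FMCW radar cube: M receive antennas x N samples per chirp x B chirps.
M, N, B = 4, 256, 128
radar_cube = np.random.randn(M, N, B) + 1j * np.random.randn(M, N, B)

# Accumulate over chirps (slow-time axis), as in the range-angle expression above.
accumulated = radar_cube.sum(axis=2)                      # shape (M, N)

# 2D FFT across antennas (angle axis) and samples (range axis) yields the range-angle map.
range_angle = np.fft.fftshift(np.fft.fft2(accumulated), axes=0)
range_angle_magnitude = np.abs(range_angle)               # range-angle feature R_a
print(range_angle_magnitude.shape)                        # (4, 256)
```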
On the user side, a single-antenna transmitter is employed. The base station utilizes $N$ OFDM subcarriers and applies beamforming using a predefined codebook $\mathcal{G} = \{\mathbf{g}_d\}_{d=1}^{D}$, where $\mathbf{g}_d \in \mathbb{C}^{M \times 1}$ is the precoding beam vector and $D$ is the total number of available beamforming vectors. The channel between the base station and the user at time $t$ is given by
$$\mathbf{h}_n(t) = \sqrt{\frac{N_t N_r}{N_c N_p}} \sum_{i=1}^{N_c} \sum_{j=1}^{N_p} \delta\, \boldsymbol{\phi}_t(i, j)\, \boldsymbol{\phi}_r(i, j),$$
where $N_t$ and $N_r$ denote the numbers of transmitting and receiving antennas, $N_c$ is the number of scattering clusters, $N_p$ is the number of propagation paths, $\delta$ is the path loss factor, and $\boldsymbol{\phi}_t$ and $\boldsymbol{\phi}_r$ are the response vectors of the antenna array. The received signal at the base station is expressed as
$$y_n(t) = \mathbf{h}_n^{T}(t)\, \mathbf{g}_d(t)\, x + s_n(t),$$
where $x \in \mathbb{C}$ represents the transmitted signal with power $\mathbb{E}[|x|^2] = P$ and $s_n(t)$ denotes Gaussian noise $\mathcal{N}(0, \sigma^2)$.
Additionally, we formulate the beam selection optimization problem in mmWave communication systems, where the goal is to select the optimal beam from a predefined beam codebook in a dynamic environment with limited training data. The problem can be mathematically expressed as follows:
$$\text{Maximize} \quad G_{\text{beam}} = \left| \mathbf{h}_d^{T} \mathbf{g}_d \right|^2, \quad d = 1, \dots, D,$$
where $G_{\text{beam}}$ represents the beamforming gain, $\mathbf{h}_d$ is the channel state vector corresponding to the $d$-th beam, and $\mathbf{g}_d$ is the beamforming vector from the codebook. The objective is to maximize the beamforming gain, and thus the communication quality, by selecting the most suitable beam $\mathbf{g}_d$ from the predefined codebook based on the current channel conditions. The challenge lies in making this selection in environments with limited data and high variability, which is addressed by the integration of few-shot learning and multimodal fusion.
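As a minimal illustration of this selection rule (not the paper's implementation), the sketch below evaluates the beamforming gain for every codeword of a randomly generated channel and DFT-style codebook and picks the maximizing index. The 16-element array and 64-beam codebook sizes mirror the dataset description in Section 4; the channel model is a placeholder.

```python
import numpy as np

M, D = 16, 64                                     # 16-element array, 64-beam codebook (assumed)
rng = np.random.default_rng(0)

# Illustrative channel vector and a DFT-style beam codebook (one column per beam g_d).
h = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)
codebook = np.exp(-2j * np.pi * np.outer(np.arange(M), np.arange(D)) / D) / np.sqrt(M)

# Beamforming gain |h^T g_d|^2 for each candidate beam, and the maximizing index.
gains = np.abs(h @ codebook) ** 2
best_beam = int(np.argmax(gains))
print(best_beam, gains[best_beam])
```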

3.2. Cloud–Edge Collaboration Framework

The proposed algorithm employs a few-shot learning strategy alongside a multimodal data fusion architecture. As illustrated in Figure 2a, the framework consists of the components outlined below.

3.2.1. Multimodal Input Layer

As described in Table 1, the proposed system utilizes multimodal data from RGB cameras, radar, and LiDAR sensors: RGB data capture visual context, radar provides velocity and distance measurements, and LiDAR offers precise 3D spatial information, collectively enriching the model’s understanding of the environment for accurate beam selection. Each modality captures a different aspect of the environment—visual, spatial, velocity, and location information—all of which are processed to form a unified input representation. Preprocessing steps include normalizing and synchronizing the data to maintain temporal consistency across modalities. Let the input data from the $i$-th modality be denoted as
$$X_i = \{x_i^k \mid k = 1, \dots, K\},$$
where $k$ represents the time step and $K$ is the total number of time steps considered for temporal consistency.
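A hedged sketch of this preprocessing step is given below: each modality is normalized and stacked over K = 5 synchronized time steps, matching the [5, 256, 256] per-modality layout reported in Section 4. The per-modality channel counts follow Table 1; the normalization scheme itself is an assumption.

```python
import numpy as np

K, H, W = 5, 256, 256                             # time steps and spatial resolution

def normalize(frames):
    """Scale a modality's frame stack to zero mean and unit variance."""
    frames = frames.astype(np.float32)
    return (frames - frames.mean()) / (frames.std() + 1e-6)

# Synchronized frame stacks for each modality (channel counts follow Table 1):
rgb   = normalize(np.random.rand(3, K, H, W))     # 3-channel RGB frames
lidar = normalize(np.random.rand(1, K, H, W))     # projected LiDAR point cloud
radar = normalize(np.random.rand(2, K, H, W))     # range-angle + range-velocity maps

# Unified multimodal input: concatenate the modalities along the channel axis.
x = np.concatenate([rgb, lidar, radar], axis=0)   # shape (6, 5, 256, 256)
print(x.shape)
```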

3.2.2. CNN for Feature Extraction

The CNN module is leveraged to extract spatiotemporal features from the image, LiDAR, and radar data. The three-dimensional convolutions capture both spatial and temporal dynamics, crucial for identifying movement patterns and environmental changes. Let $F_i$ represent the feature map extracted from the $i$-th modality. The feature extraction process can be expressed as
$$F_i = \mathrm{ReLU}\left(\mathrm{BN}\left(W_{\mathrm{CNN}} * X_i + b_{\mathrm{CNN}}\right)\right),$$
where $W_{\mathrm{CNN}}$ and $b_{\mathrm{CNN}}$ denote the weights and biases of the convolutional layer and $*$ represents the 3D convolution operation. Batch normalization (BN) and ReLU activation functions are applied to ensure stability and efficiency during training.
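The block below is a minimal PyTorch sketch of one such 3D convolutional stage (Conv3d + BatchNorm3d + ReLU). The channel sizes mirror "3D Module 1" in Table 1, while the kernel size and stride are assumptions chosen only to reproduce the listed output shape.

```python
import torch
import torch.nn as nn

class Conv3DBlock(nn.Module):
    """One spatiotemporal feature-extraction stage: 3D convolution, BN, ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3,
                              stride=(1, 2, 2), padding=1)   # downsample spatially only
        self.bn = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Example: a 64-channel embedding with 5 time steps and 128 x 128 frames (batch of 2).
x = torch.randn(2, 64, 5, 128, 128)
features = Conv3DBlock(64, 64)(x)
print(features.shape)   # torch.Size([2, 64, 5, 64, 64]), matching "3D Module 1" in Table 1
```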

3.2.3. Transformer Module for Multimodal Fusion

The extracted features from each modality, $F_i$, are processed through a Transformer module designed for cross-modal attention. This module effectively captures the relationships and dependencies between different modalities, enabling the fusion of complementary information. The attention mechanism is defined as
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V,$$
where $Q$, $K$, and $V$ are the query, key, and value matrices derived from the features $F_i$ and $d_k$ is the dimensionality of the key vectors. The Transformer module aggregates these attention outputs to compute the fused multimodal representation:
$$F_{\mathrm{fused}} = \mathrm{Transformer}(F_1, F_2, \dots, F_n),$$
where $n$ represents the number of modalities.
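A minimal sketch of the cross-modal attention step is shown below using PyTorch's built-in multi-head attention over a short sequence of per-modality feature tokens. Treating each modality's pooled feature as a single token, and averaging the attended tokens to obtain the fused representation, are simplifying assumptions made here for brevity.

```python
import torch
import torch.nn as nn

d_model, n_modalities, batch = 512, 3, 2

# One pooled feature token per modality (RGB, radar, LiDAR): (batch, tokens, d_model).
tokens = torch.randn(batch, n_modalities, d_model)

# Scaled dot-product cross-modal attention: Softmax(QK^T / sqrt(d_k)) V across modalities.
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
fused_tokens, attn_weights = attn(tokens, tokens, tokens)

# Aggregate the attended tokens into a single fused multimodal representation F_fused.
f_fused = fused_tokens.mean(dim=1)                # (batch, d_model)
print(f_fused.shape, attn_weights.shape)          # torch.Size([2, 512]) torch.Size([2, 3, 3])
```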

3.2.4. Few-Shot Learning Module

The core of the few-shot learning strategy involves a relation network that measures the similarity between the current test sample $X_{\mathrm{test}}$ and a set of labeled support samples $\{X_{\mathrm{support}}^{k} \mid k = 1, \dots, K\}$. This metric-based approach computes the relation scores for beam selection. The relation score $r_k$ for the $k$-th support sample is computed as
$$r_k = \sigma\left(W_{\mathrm{rel}} \cdot \mathrm{concat}\left(F_{\mathrm{test}}, F_{\mathrm{support}}^{k}\right) + b_{\mathrm{rel}}\right),$$
where $\sigma$ denotes the sigmoid activation function, $W_{\mathrm{rel}}$ and $b_{\mathrm{rel}}$ are the weights and biases of the relation network, and $\mathrm{concat}$ represents the concatenation of test and support features.
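The relation-score computation can be sketched as below, where the fused test feature is concatenated with each support feature and passed through a small fully connected network ending in a sigmoid. The layer widths loosely follow the FC layers in Table 1 but are otherwise assumptions.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Metric module: maps concat(F_test, F_support_k) to a relation score in (0, 1)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, f_test, f_support):
        # f_test: (1, feat_dim); f_support: (K, feat_dim) -> relation scores r_k: (K,)
        pair = torch.cat([f_test.expand_as(f_support), f_support], dim=-1)
        return self.net(pair).squeeze(-1)

f_test = torch.randn(1, 512)
f_support = torch.randn(32, 512)                  # one support feature per beam class (32-way)
scores = RelationHead()(f_test, f_support)
print(scores.shape)                               # torch.Size([32])
```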

3.2.5. Beam Selection Layer

Based on the relation scores $\{r_k \mid k = 1, \dots, K\}$ computed in the previous step, the system selects the beam direction corresponding to the support sample with the highest relation score:
$$\mathrm{Beam}_{\mathrm{selected}} = \arg\max_{k}\; r_k.$$
This ensures that the system selects the beam direction that maximizes the similarity with the current communication environment.
The proposed algorithm integrates a few-shot learning strategy alongside a multimodal data fusion architecture and incorporates a cloud–edge collaborative mechanism. In this framework, computationally intensive training tasks are offloaded to the cloud server, while the edge node performs efficient inference by comparing test samples with a few labeled support samples to determine the optimal beam codeword. This design minimizes the computational overhead at the edge and ensures real-time responsiveness.

3.2.6. Cloud–Edge Collaboration Mechanism

The cloud–edge collaboration mechanism consists of two distinct phases:
  • Cloud Training Phase: The cloud server trains the multimodal fusion model and the relation network on the global dataset to generate a pre-trained model.
  • Edge Inference Phase: The edge node receives the pre-trained model from the cloud and uses it to compare local test samples against the support set, determining the most suitable beam codeword for the given scenario.
Let the global dataset $\mathcal{D}_{\mathrm{global}}$ consist of multimodal data $X_i$ and their corresponding labels $Y_i$. The cloud server optimizes the model parameters $\Theta_{\mathrm{global}}$ by minimizing the following training loss:
$$\mathcal{L}_{\mathrm{cloud}} = \frac{1}{\left|\mathcal{D}_{\mathrm{global}}\right|} \sum_{(X_i, Y_i) \in \mathcal{D}_{\mathrm{global}}} \ell\left(\mathcal{M}_{\mathrm{global}}(X_i), Y_i\right),$$
where $\mathcal{M}_{\mathrm{global}}$ represents the multimodal fusion and relation network model and $\ell(\cdot, \cdot)$ is the cross-entropy loss.
Then, the trained parameters $\Theta_{\mathrm{global}}$ are transmitted to the edge node for inference. At the edge, the node does not perform any additional training but instead utilizes the received model to perform fast comparisons between the test sample $X_{\mathrm{test}}$ and the support set $\mathcal{D}_{\mathrm{support}} = \{(X_{\mathrm{support}}^{c}, Y_{\mathrm{support}}^{c}) \mid c = 1, \dots, C\}$.
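A minimal sketch of this edge-side inference step is given below. The simplified feature extractor, relation head, and input shapes are stand-in assumptions (the actual architecture follows Table 1); only the compare-against-support-and-argmax logic reflects the mechanism described above.

```python
import torch
import torch.nn as nn

# Stand-ins for the pre-trained modules received from the cloud (architectures simplified).
feature_extractor = nn.Sequential(nn.Flatten(), nn.Linear(6 * 5 * 32 * 32, 512))
relation_head = nn.Sequential(nn.Linear(1024, 1), nn.Sigmoid())
feature_extractor.eval()
relation_head.eval()

# Local support set (one labeled sample per beam class) and one incoming test sample.
support_x = torch.randn(32, 6, 5, 32, 32)         # C = 32 classes, one shot each (assumed shape)
support_beams = torch.arange(32)                  # beam codeword index of each support sample
test_x = torch.randn(1, 6, 5, 32, 32)

with torch.no_grad():                             # the edge node performs inference only
    f_support = feature_extractor(support_x)      # (32, 512)
    f_test = feature_extractor(test_x)            # (1, 512)
    pairs = torch.cat([f_test.expand_as(f_support), f_support], dim=-1)
    scores = relation_head(pairs).squeeze(-1)     # relation score per support sample
    selected_beam = support_beams[scores.argmax()].item()

print("selected beam codeword:", selected_beam)
```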

3.3. Training and Optimization

Optimization of the Few-Shot Component

The few-shot learning component of the proposed framework efficiently incorporates new support samples, avoiding the need for extensive retraining. This approach significantly reduces computational overhead while maintaining real-time responsiveness, as described in Algorithm 1. The proposed multimodal few-shot learning algorithm for beam prediction utilizes a structured dataset divided into three subsets: a training set ( S t ), a support set ( S s ), and a test set ( S n ). The training set is used to pre-train the model on classes distinct from those in the support and test sets, while the support set provides labeled examples for the few-shot learning process, and the test set contains query samples used for evaluation. In a C-way Q-shot configuration, the support set contains Q samples for each of the C classes, and the input data dimensions are formatted as ( C × Q + Q ) × m × 256 × 256 , where m represents the number of input modalities. This structured data format ensures efficient representation and processing within the relation network, enabling the model to adapt effectively to dynamic communication environments.  
Algorithm 1: Training procedure of C-way one-shot
To optimize the relation network, the mean squared error (MSE) loss function is employed. This loss function minimizes the squared difference between the predicted output and the true labels, ensuring accurate and reliable predictions for beam selection tasks. The loss is defined as
$$\mathcal{L}\left(Y_n, \mathcal{M}(S_n, S_s)\right) = \frac{1}{2N} \sum_{n=1}^{N} \left\| Y_n - \mathcal{M}(S_n, S_s) \right\|^2,$$
where $S_n$ and $S_s$ represent the test and support datasets, $Y_n$ denotes the ground-truth labels for the test samples, $\mathcal{M}(S_n, S_s)$ is the predicted output from the relation network, and $N$ is the total number of test samples. By minimizing this loss, the model learns to generalize effectively under few-shot conditions, making it robust for beam selection tasks in dynamic scenarios.
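A minimal sketch of one C-way one-shot training episode with the MSE loss above is given below. The random feature vectors stand in for the multimodal extractor outputs, and the relation-head architecture and optimizer settings are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

C = 32                                            # 32-way beam classification
relation_head = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(),
                              nn.Linear(256, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(relation_head.parameters(), lr=1e-3)
mse = nn.MSELoss()

# One episode: C one-shot support features and C query features (stand-ins for the
# multimodal feature extractor outputs), with matching class labels.
f_support = torch.randn(C, 512)
f_query = torch.randn(C, 512)
query_labels = torch.arange(C)

# Compare every query against every support sample: (C queries x C supports, 1024).
pairs = torch.cat([f_query.unsqueeze(1).expand(C, C, 512),
                   f_support.unsqueeze(0).expand(C, C, 512)], dim=-1).reshape(C * C, 1024)
scores = relation_head(pairs).reshape(C, C)       # predicted relation matrix

# Ground truth: relation 1 for matching classes, 0 otherwise; train with the MSE loss.
targets = torch.eye(C)[query_labels]
loss = mse(scores, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```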

4. Experiments and Discussion

4.1. Dataset and Settings

The proposed framework was evaluated using the large-scale multimodal DeepSense 6G dataset, whose data were collected using a range of advanced sensors and equipment, as follows:
  • mmWave Receiver: The system incorporates a 60 GHz phased-array receiver equipped with a 16-element Uniform Linear Array (ULA) with a half-power beamwidth of 5°. The receiver supports a 64-beam codebook, enabling adaptive beamforming: it dynamically selects the optimal beam for communication, and the system outputs a 64-element vector corresponding to the receive power at each beam, capturing essential information about the communication environment.
  • Radar System: The radar used in the system is an FMCW radar. It features three transmit (Tx) antennas and four receive (Rx) antennas, operating within a frequency range of 76–81 GHz. The radar has a 4 GHz bandwidth, allowing a maximum range of approximately 100 m with a range resolution of 60 cm. The radar collects 3D complex I/Q radar measurements, which are stored in the format 4 × 256 × 128 (number of Rx antennas × samples per chirp × chirps per frame), providing rich environmental data for beam prediction.
  • LiDAR: A LiDAR system is also employed for data collection, with a range of up to 40 m and a resolution of 3 cm. It features a 360° field of view (FoV) and provides point cloud sampling data. These data are crucial for detailed environmental mapping, thereby enhancing the overall dataset for beam prediction tasks.
Scenes 31–34, as outlined in Table 2, are taken from the DeepSense 6G dataset, representing street-level vehicle-to-infrastructure communication under various conditions. These scenarios include both daytime and nighttime settings to account for different environmental factors that influence beam selection and prediction performance. Each scene in this dataset provides valuable multimodal data, including GPS coordinates, RGB images, and radar and LiDAR sensor readings, alongside power vector measurements obtained from beamforming with a 64-beam codebook.
To ensure the integrity of few-shot learning, the dataset was partitioned into two subsets. Odd-numbered codewords were allocated to the training set, while even-numbered codewords were used for both the support and test sets. This partitioning ensured that there was no overlap between the training and evaluation datasets, enabling a rigorous assessment of the model’s performance. Each sample in the dataset consisted of five consecutive time-series observations collected across the RGB, LiDAR, and radar modalities. The data sequence included frames from the current time step and the four preceding steps, i.e., [f−4, …, f]. Each modality’s data were resized to a uniform dimension of [5, 256, 256] to facilitate preprocessing. Using a 32-beam codebook for beam training, a power vector of 1 × 32 was generated for each sample, allowing for efficient evaluation of beam prediction accuracy under the proposed framework.
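To make the partitioning rule concrete, the short sketch below splits synthetic beam-codeword labels into odd-numbered classes for training and even-numbered classes for the support and test sets. Only the split logic reflects the procedure described above; the label array itself is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
labels = rng.integers(0, 32, size=10_000)             # synthetic beam-codeword labels (0-31)

train_idx = np.where(labels % 2 == 1)[0]              # odd-numbered codewords -> training set
eval_idx = np.where(labels % 2 == 0)[0]               # even-numbered codewords -> support/test

# Sanity check: the class sets are disjoint between training and evaluation.
assert set(labels[train_idx]).isdisjoint(set(labels[eval_idx]))
print(len(train_idx), len(eval_idx))
```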

4.2. Performance Evaluation

The performance of the proposed framework was assessed under one-shot and five-shot configurations for a 32-way beam selection task. As shown in Figure 3, the accuracy of the model progressively improved with the number of training iterations. Under the one-shot configuration, the model began with a relatively low accuracy but stabilized between 48% and 55% as the training iterations reached 100. The early stages exhibited pronounced oscillations, reflecting the inherent challenge of generalizing with minimal support samples. In contrast, the five-shot configuration showed significantly better performance. The model started at a higher accuracy baseline and converged more rapidly, stabilizing between 60% and 70%. The additional support samples provided in the five-shot setup enhanced the model’s robustness and generalization capabilities, highlighting the importance of leveraging additional support data in few-shot learning tasks.

4.2.1. Impact of Different Transformer Modules on Beam Prediction Accuracy

Under few-shot conditions, the number of Transformer modules significantly affects beam prediction accuracy. Transformer modules play a vital role in capturing dependencies and relationships within multimodal data, enabling the model to effectively integrate contextual information across modalities such as RGB images, radar, and LiDAR. However, selecting an appropriate number of Transformer modules is critical for balancing feature extraction capacity and generalization performance.
As shown in Figure 4, the experimental results indicate that with two Transformer modules, the model achieved Top-1, Top-2, and Top-3 accuracies of 58.15%, 75.13%, and 80.25%, respectively. This suggests that a moderate number of Transformer modules effectively captures spatiotemporal features from multimodal inputs without introducing excessive model complexity. The configuration with two modules performed particularly well in the Top-3 task, indicating its suitability for complex tasks where capturing broad contextual relationships is essential.
In contrast, increasing the number of Transformer modules to three led to a notable decrease in performance, with the Top-1, Top-2, and Top-3 accuracies dropping to 52.76%, 71.45%, and 75.92%, respectively. This decline in accuracy may have resulted from overfitting, as the model’s complexity exceeded the capacity of the limited training data available under the few-shot conditions. The additional Transformer module likely caused the model to capture spurious patterns in the support set, failing to generalize well to the test dataset.
Using four Transformer modules degraded performance further, yielding Top-1, Top-2, and Top-3 accuracies of 48.12%, 68.07%, and 70.88%, respectively. The gap relative to the two-module configuration widened, confirming diminishing returns as more Transformer modules were added. This effect was especially pronounced in the Top-1 and Top-2 tasks, suggesting that additional layers do not meaningfully enhance the model’s ability to capture relevant patterns in the data and instead aggravate overfitting under few-shot conditions.

4.2.2. Impact of Convolutional Layer Count on Beam Prediction Accuracy

As illustrated in Figure 5, under few-shot conditions, the number of 3D convolutional layers significantly influenced beam prediction accuracy, and the performance metrics reveal a clear downward trend as the convolutional layer count increased. With two convolutional layers, the model achieved Top-1, Top-2, and Top-3 accuracies of 58.15%, 75.12%, and 80.28%, respectively, the best results among the tested depths.
Adding a third convolutional layer reduced performance, with the Top-1 accuracy falling to 56.23% and the Top-2 and Top-3 accuracies dropping to 73.45% and 76.21%. This decline suggests that the additional layer increases model complexity beyond what the limited support data can reliably constrain, weakening the model’s grasp of the spatial and temporal features within the data.
The degradation was most pronounced with four convolutional layers, which yielded a Top-1 accuracy of 52.62%, a Top-2 accuracy of 70.76%, and a Top-3 accuracy of 72.32%. These findings indicate that, under few-shot conditions, additional depth does not improve accuracy; instead, the extra parameters encourage overfitting, which is particularly visible in the Top-3 prediction task. This trend highlights the necessity of balancing model complexity with the amount of available training data in few-shot learning scenarios.

4.2.3. Impact of Multimodal Fusion on Beam Prediction Accuracy

As illustrated in Figure 6, the application of multimodal fusion under few-shot conditions markedly enhanced beam prediction accuracy. When evaluating individual modalities, the model’s performance varied as follows: using RGB data yielded a Top-1 accuracy of 45.32%, while LiDAR and radar modalities achieved Top-1 accuracies of 48.23% and 50.18%, respectively. These figures reflect the limitations inherent in relying on a single modality, where the complexities of the environment can lead to misinterpretations in the data.
In contrast, the multimodal fusion approach demonstrated a substantial improvement in accuracy across all tasks, achieving Top-1, Top-2, and Top-3 accuracies of 58.12%, 75.13%, and 80.25%, respectively. The fusion method effectively capitalized on the strengths of each modality, integrating diverse data sources to create a more comprehensive understanding of the environment. This synergy not only enhanced predictive accuracy but also demonstrated resilience against weaknesses that might arise from any single data source.
The superior performance of multimodal fusion, particularly in the Top-3 task, highlights its adaptability in complex scenarios. By combining information from RGB, LiDAR, and radar, the model leverages complementary features, enabling it to better navigate ambiguous situations. This approach not only addresses the challenges posed by few-shot conditions but also sets a new benchmark for accuracy in beam prediction tasks, emphasizing the importance of multimodal strategies in advanced machine learning applications.

4.2.4. Comparison

To verify the effectiveness of the proposed algorithm, we conducted experiments by comparing the proposed algorithm with the following existing baseline algorithms:
  • Meta-SGD: Learns both the initial parameters and learning rates for each parameter, allowing the model to quickly adapt to new tasks. It is well suited for highly customized few-shot tasks.
  • MAML: Optimizes the initial parameters so that the model can quickly adapt with few updates under few-shot conditions, making it adaptable to different model architectures.
  • Fine-Tuning: Involves pre-training a model on a large dataset and then fine-tuning it on the target few-shot task. It performs best when the target task has similar characteristics to the source dataset.
As illustrated in Figure 7, the experimental results demonstrated the effectiveness of the proposed model under both one-shot and five-shot conditions. The proposed method achieved the best overall performance across all training-to-test ratios. Starting with an accuracy of 37.98% under the 1/9 ratio, it improved rapidly as the amount of training data increased, reaching 88.37% under the 9/1 ratio. This significant performance boost highlights the method’s strong generalization ability, especially in leveraging multimodal data and few-shot learning techniques for robust beam selection.
For the baseline methods, Meta-SGD achieved the best overall performance among the three methods. Starting with an accuracy of 31.87% under the 1/9 ratio, it exhibited a steady improvement as the training data increased, ultimately stabilizing at an accuracy of 85.12% under the 9/1 ratio. This demonstrates Meta-SGD’s strong adaptability and its ability to effectively leverage the available training data to optimize performance. MAML also showed consistent improvement with increasing training data, starting slightly lower than Meta-SGD at an accuracy of 30.02% under the 1/9 ratio and reaching 83.08% under the 9/1 ratio. Although MAML’s performance was close to that of Meta-SGD, particularly in high training-to-test ratios, it slightly lagged behind Meta-SGD across all scenarios, indicating that it is less efficient in adapting to few-shot learning tasks. Fine-tuning started at the lowest accuracy of 28.12% under the 1/9 ratio but showed significant improvements with more training data. By the 9/1 ratio, it achieved an accuracy of 82.09%, approaching the performance of MAML. However, fine-tuning generally required more training data to reach comparable performance levels, reflecting its reliance on pre-training and its limited adaptability in few-shot learning scenarios.
As shown in Table 3 and Figure 8, Meta-SGD had a time complexity of O(N × Q × D) and required updates to both the initial parameters and the learning rates for each task, where N is the number of data points, Q denotes the number of queries or samples per task, and D represents the dimensionality of the input data. This resulted in a time cost of 0.223 s, making it more computationally expensive than some other methods. In comparison, MAML had a higher time complexity due to the need for multiple gradient updates per task, and the extra iteration factor E further increased the computational burden. This resulted in a time cost of 0.324 s, which was higher than that of Meta-SGD, making MAML more time-consuming overall. Fine-tuning, however, was the least time-consuming approach, with a time cost of just 0.085 s. This efficiency stems from the fact that fine-tuning only involves adjusting the pre-trained weights, without requiring multiple iterations of parameter updates per task. As a result, it was much faster than Meta-SGD and MAML. In terms of space complexity, Meta-SGD and fine-tuning had lower space requirements, as they do not store gradients or task-specific updates, whereas MAML required extra memory for storing gradients and updates for each task, leading to a higher space complexity. Regarding convergence speed, Meta-SGD and MAML typically converged faster due to their optimized initialization, although MAML’s convergence was more task-dependent because of the extra iterations. Fine-tuning typically converged more slowly, owing to fewer initial updates, but adapted efficiently to new tasks, as reflected in its relatively low time cost of 0.085 s.
These results underline the importance of sample size in model performance for beam prediction tasks. While all models showed improved accuracy with more training samples, the proposed model excelled in few-shot scenarios and maintained robust performance with larger sample sizes. The comparative analysis highlights that increasing the number of support samples significantly enhances prediction capabilities. For example, moving from the one-shot to the five-shot configuration led to noticeable improvements in accuracy, emphasizing the value of leveraging additional support data in practical applications.

5. Conclusions

This paper introduces a novel framework for beam selection in mmWave communication systems, integrating few-shot learning and multimodal fusion within a cloud–edge collaboration paradigm. By leveraging RGB, radar, and LiDAR data, the framework demonstrates robust beam selection capabilities under data-scarce conditions. The division of computational tasks between the cloud and edge ensures scalability and low-latency performance, making the framework well suited for dynamic communication environments. Experimental results showed significant performance improvements in both one-shot and five-shot configurations, with the proposed approach outperforming traditional methods in terms of accuracy and adaptability. Future work will focus on extending the framework to ultra-dense networks and incorporating additional sensor modalities to further enhance scalability and efficiency in next-generation wireless communication systems.

Author Contributions

B.G. proposed the research concept, collected and organized the data, developed the algorithms, and wrote the first draft of the manuscript, including the design of the experimental procedures. X.L. conducted the theoretical analysis, supervised the research process, and revised the manuscript. Q.Z. provided guidance on the thesis structure, reviewed and refined the manuscript, and proposed key solutions for the research challenges. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62303327 and 92467204).

Data Availability Statement

The data presented in this study are available from the corresponding author on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, G.; Du, J.; Yuan, X.; Zhang, K. Differential Privacy-Based Location Privacy Protection for Edge Computing Networks. Electronics 2024, 13, 3510. [Google Scholar] [CrossRef]
  2. Rappaport, T.S.; Sun, S.; Mayzus, R.; Zhao, H.; Azar, Y.; Wang, K.; Wong, G.N.; Schulz, J.K.; Samimi, M.; Gutierrez, F. Millimeter wave mobile communications for 5G cellular: It will work! IEEE Access 2013, 1, 335–349. [Google Scholar] [CrossRef]
  3. Pi, Z.; Khan, F. An introduction to millimeter-wave mobile broadband systems. IEEE Commun. Mag. 2011, 49, 101–107. [Google Scholar] [CrossRef]
  4. Roh, W.; Seol, J.Y.; Park, J.; Lee, B.; Lee, J.; Kim, Y.; Cho, J.; Cheun, K.; Aryanfar, F. Millimeter-wave beamforming as an enabling technology for 5G cellular communications: Theoretical feasibility and prototype results. IEEE Commun. Mag. 2014, 52, 106–113. [Google Scholar] [CrossRef]
  5. Rappaport, T.S.; Heath, R.W.; Daniels, R.C.; Murdock, J.N. Wireless Communications: Principles and Practice; Pearson Education: New York, NY, USA, 2014. [Google Scholar]
  6. Heath, R.W.; González-Prelcic, N.; Rangan, S.; Roh, W.; Zhang, C. An overview of signal processing techniques for millimeter wave MIMO systems. IEEE J. Sel. Top. Signal Process. 2016, 10, 436–453. [Google Scholar] [CrossRef]
  7. Andrews, J.G.; Buzzi, S.; Choi, W.; Hanly, S.V.; Lozano, A.; Soong, A.C.K.; Zhang, J.C. What will 5G be? IEEE J. Sel. Areas Commun. 2014, 32, 1065–1082. [Google Scholar] [CrossRef]
  8. Alkhateeb, A.; Leus, G.; Heath, R.W. Channel estimation and hybrid precoding for millimeter wave cellular systems. IEEE J. Sel. Top. Signal Process. 2014, 8, 831–846. [Google Scholar] [CrossRef]
  9. Kaur, J.; Khan, M.A.; Iftikhar, M.; Imran, M.; Haq, Q.E.U. Machine learning techniques for 5G and beyond. IEEE Access 2021, 9, 23472–23488. [Google Scholar] [CrossRef]
  10. Fernando, N.; Shrestha, S.; Loke, S.W.; Lee, K. On Edge-Fog-Cloud Collaboration and Reaping Its Benefits: A Heterogeneous Multi-Tier Edge Computing Architecture. Future Internet 2025, 17, 22. [Google Scholar] [CrossRef]
  11. Ju, Y.; Cao, Z.; Chen, Y.; Liu, L.; Pei, Q.; Mumtaz, S. NOMA-Assisted Secure Offloading for Vehicular Edge Computing Networks with Asynchronous Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2024, 25, 2627–2640. [Google Scholar] [CrossRef]
  12. Van Anh Duong, D.; Akter, S.; Yoon, S. Task Offloading and Resource Allocation for Augmented Reality Applications in UAV-Based Networks Using a Dual Network Architecture. Electronics 2024, 13, 3590. [Google Scholar] [CrossRef]
  13. Zhang, Y.; Osman, T.; Alkhateeb, A. Online beam learning with interference nulling for millimeter wave MIMO systems. IEEE Trans. Wirel. Commun. 2024, 23, 5109–5124. [Google Scholar] [CrossRef]
  14. Elbir, A.M.; Mishra, K.V. A deep learning approach for hybrid beamforming in multi-cluster millimeter-wave MIMO. IEEE Trans. Veh. Technol. 2019, 68, 4132–4141. [Google Scholar] [CrossRef]
  15. Dokhanchi, S.H.; Mysore, B.S.; Mishra, K.V.; Ottersten, B. A mmWave automotive joint radar-communications system. IEEE Trans. Aerosp. Electron. Syst. 2019, 55, 1241–1260. [Google Scholar] [CrossRef]
  16. Elsanhoury, M.; Zhang, Y.; Farooq, M.U.; Otoum, Y. Precision positioning for smart logistics using ultra-wideband technology-based indoor navigation: A review. IEEE Access 2022, 10, 44413–44445. [Google Scholar] [CrossRef]
  17. Cheng, J.; Hao, F.; He, F.; Liu, L.; Zhang, Q. Mixer-based semantic spread for few-shot learning. IEEE Trans. Multimed. 2023, 25, 191–202. [Google Scholar] [CrossRef]
  18. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 4077–4087. [Google Scholar]
  19. Sun, S.; Rappaport, T.S.; Heath, R.W., Jr.; Nix, A.; Rangan, S. Beamforming for millimeter-wave communications: An overview. IEEE Commun. Mag. 2018, 56, 124–131. [Google Scholar] [CrossRef]
  20. Zhang, C.; Zhang, W.; Wang, W.; Yang, L.; Zhang, W. Research challenges and opportunities of UAV millimeter-wave communications. IEEE Wirel. Commun. 2019, 26, 58–62. [Google Scholar] [CrossRef]
  21. Lim, S.H.; Kim, S.; Shim, B.; Choi, J.W. Deep learning-based beam tracking for millimeter-wave communications under mobility. IEEE Trans. Commun. 2021, 69, 7458–7469. [Google Scholar] [CrossRef]
  22. Alkhateeb, A.; Charan, G.; Osman, T.; Hredzak, A.; Srinivas, N. DeepMIMO: A generic dataset for mmWave and massive MIMO applications. In Proceedings of the IEEE Global Communications Conference (GLOBECOM), Abu Dhabi, United Arab Emirates, 9–13 December 2018; pp. 1–6. [Google Scholar]
  23. González-Prelcic, J.; González-Prelcic, N.; Venugopal, K.; Heath, R.W. Channel Estimation and Hybrid Precoding for Frequency Selective Multiuser mmWave MIMO Systems. IEEE J. Sel. Top. Signal Process. 2018, 12, 353–367. [Google Scholar] [CrossRef]
  24. Giordani, M.; Polese, M.; Roy, A.; Castor, D.; Zorzi, M. A tutorial on beam management for 3GPP NR at mmWave frequencies. IEEE Commun. Surv. Tutor. 2018, 21, 173–196. [Google Scholar] [CrossRef]
  25. Niu, Y.; Li, Y.; Jin, D.; Su, L.; Vasilakos, A.V. A survey of millimeter wave communications (mmWave) for 5G: Opportunities and challenges. Wirel. Netw. 2015, 21, 2657–2676. [Google Scholar] [CrossRef]
  26. Zhou, Q.; Gong, Y.; Nallanathan, A. Radar-Aided Beam Selection in MIMO Communication Systems: A Federated Transfer Learning Approach. IEEE Trans. Veh. Technol. 2024, 73, 12172–12177. [Google Scholar] [CrossRef]
  27. Hur, S.; Kim, T.; Love, D.J.; Vook, J.; Ghosh, A. Millimeter wave beamforming for wireless backhaul and access in small cell networks. IEEE Trans. Commun. 2013, 61, 4391–4403. [Google Scholar] [CrossRef]
  28. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, USA, 28 June–2 July 2011; pp. 689–696. [Google Scholar]
  29. Baltrušaitis, T.; Ahuja, C.; Morency, L. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef] [PubMed]
  30. Kumar, A.; Stephen, K.; Sabitha, A.S. A systematic review on sensor fusion technology in autonomous vehicles. In Proceedings of the 2023 4th International Conference on Electronics and Sustainable Communication Systems (ICESC), Coimbatore, India, 20–21 July 2023; pp. 42–48. [Google Scholar] [CrossRef]
  31. Martin-Vega, F.J.; Aguayo-Torres, M.C.; Gomez, G.; Entrambasaguas, J.T.; Duong, T.Q. Key technologies, modeling approaches, and challenges for millimeter-wave vehicular communications. IEEE Commun. Mag. 2018, 56, 28–35. [Google Scholar] [CrossRef]
  32. Atrey, P.K.; Hossain, M.A.; El Saddik, A.; Kankanhalli, M.S. Multimodal fusion for multimedia analysis: A survey. Multimed. Syst. 2010, 16, 345–379. [Google Scholar] [CrossRef]
  33. Ramachandram, D.; Taylor, G.W. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Process. Mag. 2017, 34, 96–108. [Google Scholar] [CrossRef]
  34. Calhoun, V.D.; Adali, T. Feature-based fusion of medical imaging data. IEEE Trans. Inf. Technol. Biomed. 2009, 13, 711–720. [Google Scholar] [CrossRef] [PubMed]
  35. Li, Z.; Zhou, F.; Chen, F.; Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 3681–3691. [Google Scholar]
  36. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
  37. Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks? In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montréal, QC, Canada, 8 December 2014; pp. 3320–3328. [Google Scholar]
Figure 1. The proposed beam selection for cloud–edge collaboration in MIMO communication systems.
Figure 2. Illustration of the proposed few-shot beam prediction model and its components. (a) The feature extraction module. (b) The proposed beam prediction model.
Figure 3. Accuracy fluctuations of the proposed model under 1-shot and 5-shot conditions, with a training-to-test-set ratio of 1:1.
Figure 4. Accuracies of different Transformer modules (multi-attention layers) under 1-shot conditions.
Figure 5. The impact of the number of CNN layers on the results.
Figure 6. Accuracy of different models and multimodal fusion under 1-shot conditions.
Figure 7. Accuracies of the proposed approach and various baseline methods.
Figure 8. The proposed algorithm was evaluated against the baseline algorithms in terms of inference time for a single test data point following the training phase.
Table 1. Structure of the proposed model and dimension changes.

| Module | Input Size | Output Size |
| --- | --- | --- |
| Image Enh. and Segmentation | 3 × 5 × 256 × 256 | 3 × 5 × 256 × 256 |
| Point Cloud Filtering | 1 × 5 × 256 × 256 | 1 × 5 × 256 × 256 |
| Angle + Speed FFT | 2 × 5 × 256 × 256 | 2 × 5 × 256 × 256 |
| Embedding Layer | 64 × 5 × 128 × 128 | 64 × 5 × 128 × 128 |
| 3D Module 1 | 64 × 5 × 128 × 128 | 64 × 5 × 64 × 64 |
| 3D Module 2 | 64 × 5 × 64 × 64 | 128 × 5 × 32 × 32 |
| Transformer Layer | 128 × 5 × 32 × 32 | 128 × 5 × 32 × 32 |
| 3D Module 3 | 128 × 5 × 32 × 32 | 256 × 5 × 16 × 16 |
| Transformer Layer | 256 × 5 × 16 × 16 | 512 × 5 × 8 × 8 |
| Pooling Layer | 512 × 5 × 8 × 8 | 5 × 512 |
| Summation Operation | 5 × 512 | 512 |
| Feature Extraction Module | Test Data | - |
| Feature Sharing Module | Support Data | - |
| Concatenate Layer | 512 + 512 = 1024 | 1024 |
| FC Layer 1 | 1024 | 512 |
| FC Layer 2 | 512 | 256 |
| FC Layer 3 | 256 | 128 |
| FC Layer 4 | 128 | 32 |
| Relation Score | 32 | 1 (Scalar) |
Table 2. The details of the few-shot experimental dataset.

| Parameter | Assignment |
| --- | --- |
| Training set | Scenarios 31, 32, 33, 34 |
| Support set | Scenarios 31, 32, 33, 34 |
| Test set | Scenarios 31, 32, 33, 34 |
| Number of samples | 120,000 |
| Number of support samples | One-shot: 1 × C; Q-shot: Q × C |
| C-way classification | 32 classes |
| Few-shot configurations | 1-shot, 5-shot |
| Input data dimensions | (C × Q + Q) × 5 × 256 × 256 |
Table 3. Complexity analysis of the Meta-SGD, MAML, and fine-tuning algorithms.

| Algorithm | Time Complexity | Space Complexity | Convergence Speed | Scalability |
| --- | --- | --- | --- | --- |
| Meta-SGD | O(N × Q × D) | O(N × D) | Fast | Moderate |
| MAML | O(N × Q × D × E) | O(N × D × E) | Moderate, task-dependent | Medium, affected by task count |
| Fine-Tuning | O(N × D) | O(N × D) | Slower | High, adaptable to various tasks |
| Relation Networks | O(N × Q × D²) | O(N × D) | Moderate | Moderate, adaptable to metric-based tasks |