Multi-Channel Physical Feature Convolution and Tri-Branch Fusion Network for Automatic Modulation Recognition

Changkai Zhang; Junyi Luo; Kaibo Shi; Tao Liu; Chenyu Ling

doi:10.3390/electronics14244847

,

and

¹

School of Electronic Information and Electrical Engineering, Chengdu University, Chengdu 610106, China

²

Chengdu Kinyea Technologies Co., Ltd., Chengdu 610299, China

^*

Authors to whom correspondence should be addressed.

Electronics2025, 14(24), 4847;https://doi.org/10.3390/electronics14244847

Version Notes

Order Reprints

Abstract

Automatic modulation recognition (AMR) plays a critical role in intelligent wireless communication systems, particularly under conditions with a low signal-to-noise ratio (SNR) and complex channel environments. To address these challenges, this paper proposes a three-branch fusion network that integrates complementary features from the time, frequency, and spatial domains to enhance classification performance. The model consists of three specialized branches: a multi-channel convolutional branch designed to extract discriminative local features from multiple signal representations; a bidirectional long short-term memory (BiLSTM) branch capable of capturing long-range temporal dependencies; and a vision transformer (ViT) branch that processes constellation diagrams to exploit global structural information. To effectively merge these heterogeneous features, a path attention module is introduced to dynamically adjust the contribution of each branch, thereby achieving optimal feature fusion and improved recognition accuracy. Extensive experiments on the two popular benchmarks, RML2016.10a and RML2018.01a, show that the proposed model consistently outperforms baseline approaches. These results confirm the effectiveness and robustness of the proposed approach and highlight its potential for deployment in next-generation intelligent modulation recognition systems operating in realistic wireless communication environments.

Keywords:

automatic modulation recognition; convolutional neural network; multi-channel features; vision transformer; classification; time–frequency features

1. Introduction

AMR is a crucial process for determining the specific modulation format applied to a given signal, serving a pivotal function in diverse signal processing domains like cognitive communications, spectrum surveillance, and electronic warfare. In cognitive communication frameworks, AMR facilitates the automatic adjustment of modulation parameters by communication devices in response to varying channel conditions, enabling real-time adaptation to environmental changes. Conversely, in non-collaborative settings such as electronic warfare, AMR aids in identifying the modulation scheme of intercepted signals, thereby supporting demodulation, target recognition, and optimal interference management [1].

When faced with radio frequency signals of unknown characteristics, AMR plays a crucial role as an initial step, enabling devices to demodulate both standard and non-standard modulation schemes. Effective AMR techniques optimize the utilization of transmission medium resources, thereby improving the interference resilience of contemporary cognitive radio systems. Systems employing adaptive modulation schemes can utilize AMR to continuously monitor channel conditions, dynamically adjusting the modulation scheme to achieve optimal resource allocation [2,3,4]. The advancement of efficient AMR algorithms is considered a promising and essential approach to effectively address increasingly intricate application scenarios. As a result, the progress in these algorithms has drawn substantial interest from researchers and professionals in the field of communication systems. In conventional practice, AMR methods are typically classified into two major groups: likelihood-driven and feature-driven approaches. Likelihood-based approaches formulate AMC as a composite hypothesis testing task, in which multiple hypotheses are evaluated to represent various modulation schemes. These methods are grounded in the Bayesian framework, which guarantees optimal performance in terms of classification accuracy under the minimum error misclassification criterion [5,6]. Under the likelihood-based (LB) framework, the likelihood of the observed signal is evaluated for each possible modulation hypothesis, and the selection of the modulation type is carried out via a likelihood ratio test [7].

Feature-based (FB) methods focus on obtaining different features in the signal, which are then used for classification [8]. Common LB techniques include the ALRT [9,10], GLRT [11], and HLRT [12]. In the FB approach, recent advancements have incorporated techniques such as phase diagram entropy. For example, the authors of [13] use phase diagram entropy to effectively characterize modulations in underwater acoustic communication, enhancing recognition and classification accuracy. The authors of [14] applied this entropy method to spread spectrum modulation recognition, demonstrating its robustness in the presence of noise and interference. The authors of [15,16] introduced a Wavelet-based Adaptive Modulation Recognition Network (WAN) that synergistically integrates Digital Signal Processing (DSP) with deep learning, significantly improving modulation recognition in low-SNR conditions. The authors of [17] utilized compressive sensing to directly reconstruct signal features for digital modulation recognition, enhancing efficiency while maintaining accuracy. The authors of [18] proposed a modulation classification method that combines compressed sensing with high-order cumulants and cyclic spectrum, achieving high accuracy in cognitive radio signal recognition under low-SNR conditions.

However, traditional methods discussed above have notable limitations. They often lack sufficient discriminative power when dealing with complex models and are heavily reliant on predefined features [19]. To address these limitations, AI methodologies—especially machine learning (ML) and deep learning (DL)—have been widely adopted in recent years across computer vision, speech classification, and information retrieval. Methods grounded in machine learning learn informative representations directly from raw samples to encode task-relevant information. Over diverse application domains—computer vision [20], natural language processing [21], recommender systems [22], object detection [23], and anomaly detection [24,25]—they have demonstrated substantial success.

A variety of deep-learning families—CNNs [26], RNNs [27], GNNs [28], and Transformers [29]—have been introduced in recent years. Taken together, they have shown notable effectiveness in the previously discussed application areas. Then, deep-learning-based AMR techniques have attracted growing interest. Unlike traditional approaches that depend on manual feature extraction, modern deep models learn representations and perform classification simultaneously in a unified end-to-end system, facilitating automatic feature learning and classification within a single model. The specific meanings of the abbreviations used in this article are shown in Table 1.

Table 1. List of abbreviations and acronyms.

To strengthen architectural design, neural networks are now often optimized with deep learning methods; notable families are CNNs, RNNs, Transformers, and hybrid variants. For instance, in refs. [30,31], multi-layer CNNs are utilized to extract features from IQ signals for classification and recognition tasks. In [32], complex convolutional techniques are applied to extract multi-dimensional features, improving classification accuracy. The method in [33] relies on a dual-layer LSTM to derive AP features from the received signal and drive the classifier. Moreover, refs. [34,35,36] combine RNNs with CNNs to simultaneously capture temporal and spatial information, facilitating feature complementarity. In [37], a custom convolutional block is designed to capture spatiotemporal correlations within modulated signals through various asymmetric convolutional kernels. Finally, ref. [38] presents a dual-path network, where one path utilizes CNNs for time domain and spectral feature extraction, and the other employs LSTM to capture temporal dependencies.

In addition to optimizing network architectures, signal preprocessing is also a critical step in AMR [39]. A range of preprocessing strategies has been introduced, including the extraction of spectral features through specific processing methods [40], statistical features [41], constellation diagrams [42,43], SPWVD images, and IQ signals. Other approaches focus on extracting amplitude/phase features [38,44], and combining IQ signals with separate I and Q components. For instance, introduced a feature fusion method that integrates manual features, the SPWVD, and the BJD using CNNs, which significantly enhances the discriminative power of the extracted signal features.

Originally proposed by Vaswani et al., the Transformer architecture is based on attention mechanisms and was designed for NLP sequence-modeling tasks. Through concurrent cross-position attention—absent in classic recurrent or convolutional models—the Transformer effectively captures distant dependencies and global contextual information. In [45], a combination of CNN for spatial feature extraction and Transformer for modeling global dependencies was used to improve recognition performance. Similarly, ref. [19] employed CNNs for local feature extraction, followed by a Transformer to capture sequence dependencies, which are then input to a GNN for classification, yielding strong recognition reliability.

To adapt the Transformer framework for image-level categorization, the ViT [46] partitions each image into fixed-dimension patches, which are subsequently represented as tokens and fed into the Transformer as a sequence. By operating on this sequence of patch embeddings, ViT can capture global contextual relationships across the image, effectively overcoming the restricted receptive field inherent to traditional convolution-based approaches.

As evident from the literature, most current network models process signal inputs either as standalone IQ signals or as preprocessed IQ signals. Additionally, model architectures are predominantly implemented in a fusion format, where classification outputs are generated through simple concatenation. Given these limitations, along with the reliance on single-path architectures and fixed feature fusion strategies in existing modulation recognition methods, this paper proposes a new framework designed to improve the performance of modulation recognition. In the designed architecture, a CNN ingests multi-channel inputs containing physical prior features, while a Bi-LSTM layer is employed to capture two-way spatiotemporal correlations and modulation cycle characteristics, and a three-branch fusion pathway using ViT for constellation diagram.

The main contributions and innovations of this work are as follows:

Multi-channel physical prior feature fusion input: The original complex signal is converted into multiple physically meaningful channels, including IQ (real and imaginary parts), phase, phase difference, amplitude, second-order spectrum fast Fourier transform(FFT), and fourth-order spectrum (higher-order spectrum). Four CNN channels process the IQ signal, together with one original-signal channel, emphasizing physical interpretability rather than treating IQ or spectra in isolation.
Three-branch joint structure (CNN, BiLSTM, ViT): Each branch processes different features for final comprehensive discrimination. The CNN branch uses the above multi-channel inputs; the BiLSTM branch captures modulation periodicity and long-term temporal dependencies; and the IQ signal is converted into a constellation diagram for modulation-type identification via a Vision Transformer. Fusing signal- and image-based cues yields more accurate results.
Path attention fusion module: Instead of simple concatenation, this module adapts to varying feature dependencies across modulation types. It enables the model to adjust under diverse SNR conditions while ensuring stable feature quality. The three path outputs are adaptively weighted to generate the final feature vector for classification and identification.

The remainder of this paper is organized as follows. Section 2 introduces the signal and identification tasks. Section 3 introduces the architecture of each branch in the proposed three-branch architecture and the signal processing methods in CNN. In Section 4, the complete three-branch architecture is outlined. The experiments and their discussion are presented in Section 5, while Section 6 contains the conclusions.

2. Signal Model and Task

Modulation is a fundamental operation in wireless communication systems, in which the parameters of a high-frequency carrier are varied according to the information-bearing base band signal. By mapping the information onto the carrier, modulation enables reliable and accurate data transmission over practical communication channels. The choice of modulation format strongly influences spectral efficiency and implementation complexity, as well as the system’s robustness to noise and interference. In a conventional wireless link, the signal observed at the receiver can be modeled as follows:

x (t) = s (t) * h (t) + n (t),

(1)

where

s (t)

is the noise-free complex base-band envelope of the received signal; ∗ denotes the convolution operation;

n (t)

is Additive White Gaussian Noise with zero mean and variance

σ_{n}^{2}

;

h (t)

is the impulse response. The IQ representation is highly flexible. For mathematical operations, it simplifies computation and transformation processes; for hardware design, it enables compact architectures and efficient implementation, the signal can be expressed using orthogonal decomposition as

x (t) = I (t) + j Q (t)

, where

I (t) = Re [x (t)]

denotes the in-phase component, and

Q (t) = Im [x (t)]

represents the quadrature component. To analyze various modulation schemes, the transmitted signal can be represented as

s (t)

, where the characteristics of the modulated signal are identified under different conditions.

s (t) = [A_{m} \sum_{n} a_{n} g (t - n T_{s})] \cdot \cos (2 π (f_{c} + f_{m}) t + ϕ_{0} + ϕ_{m}),

(2)

In the parameter set,

A_{m}

refers to the modulation amplitude and

a_{n}

to the symbol sequence.

g (t)

denotes the pulse-shaping waveform within one symbol duration. The symbol period is denoted by

T_{s}

, with

f_{c}

and

f_{m}

specifying the carrier and modulation frequencies, and

ϕ_{0}

and

ϕ_{m}

indicating the initial and modulation phases.

For AMR tasks, the receiver discerns the modulation method of the sent signal by evaluating either the raw received waveform or related statistical measures. This process can be mathematically represented as follows.

y = \arg \max_{i \in {1, 2, \dots, N}} \Pr (y = i ∣ Q (r (t))),

(3)

Q (r (t))

represents the feature extraction function for the received signal,

P r (\cdot)

represents the posterior probability, and y is the final predicted class result [15]. Accurately identifying the modulation format in modulation recognition relies on extracting discriminative features from the received signal. The effectiveness of such algorithms is usually assessed through their classification accuracy and computational efficiency.

In practice, we do not reconstruct a full physical transmitter to synthesize

s (t)

within this work; instead, we directly use the complex base band IQ sequences provided by the public datasets RML2016.10a and RML2018.01a as realizations of the received signal

x (t)

. During the construction of these datasets, a noise-free complex base band signal

s (t)

is first generated under a predefined modulation scheme (e.g., BPSK, QPSK, 8PSK, 16QAM, 64QAM, etc.) using simulated or experimental communication chains, and then passed through channel impairments

h (t)

and corrupted by additive white Gaussian noise

n (t)

to obtain the received waveform

x (t) = s (t) * h (t) + n (t)

, with each sample annotated by its true modulation label. Here,

s (t)

can be regarded as the complex base band envelope corresponding to different modulation schemes; its discrete-time counterpart forms a zero-IF, normalized IQ sequence, and the subsequent multi-branch network discriminates among modulation types by exploiting the resulting differences in their temporal and IQ-plane features.

3. The Proposed Method

We outline the detail of the proposed three-branch modulation classification model in this section, which combines domain knowledge with deep learning-driven feature extraction. Specifically, the model consists of three parallel branches—CNN, BiLSTM, and ViT —each designed to capture unique and complementary features from raw IQ signals.

The CNN branch utilizes multiple handcrafted features derived from the IQ sequence, such as amplitude, phase, differential phase, and higher-order frequency components. These features are processed through multi-path convolutions to extract comprehensive representations from over both temporal and frequency domain representations. The BiLSTM branch processes the original IQ sequence directly, learning long-range temporal dependencies using a bidirectional recurrent structure. Simultaneously, the ViT branch converts the IQ signal into a 2D constellation diagram and employs Transformer-based self-attention mechanisms to capture global spatial relationships.

Each branch generates a high-level feature vector that represents distinct aspects of the input signal. These features are then integrated using a Path Attention Module (PAM), which ensures robust and accurate modulation classification across a variety of wireless communication scenarios.

3.1. IQ Signal Processing for Multi-Channel CNN

Wireless receivers often model incoming signals as complex numbers composed of in-phase (I) and quadrature (Q) elements. Amplitude and phase information—key to distinguishing various modulation formats—are jointly represented by these components. In this study, the IQ signal is modeled as a two-dimensional real-valued matrix

X_{IQ} \in R^{L \times 2}

, in which L stands for the temporal sequence length, and the columns map to the in-phase and quadrature components, respectively.

To further enhance the model’s discriminative ability, we compute several additional features derived from the original IQ sequence. Specifically, the amplitude is calculated as

A (t) = \sqrt{I {(t)}^{2} + Q {(t)}^{2}}

; the phase as

ϕ (t) = \arctan (\frac{Q (t)}{I (t)})

, and the instantaneous frequency, such as the phase difference, is given by

Δ ϕ (t) = ϕ (t) - ϕ (t - 1)

. These features are concatenated channel-wise to form a multi-channel time-series matrix

X_{CNN} \in R^{C \times L}

, where

C \in {5, 6, 7}

represents the number of derived features used. This enriched representation is then fed as input to the CNN branches.

3.2. Multi-Channel CNN

As can be seen in Figure 1, the proposed model employs a multi-branch CNN architecture, which processes a seven-channel input derived from the original complex IQ signal. These channels include the real and imaginary components, amplitude, phase, differential phase, and the real as well as imaginary components of the FFT spectrum. To capture a wide range of signal characteristics from both time and frequency domains, the model utilizes five parallel branches. The first branch employs three 1D convolutional stages to operate on five time domain features, where each stage includes batch normalization, a ReLU activation function, an SE mechanism, and max pooling. The SE module adaptively reweighs the relevance level of every feature channel to the target task, allowing the model to focus on modulation-relevant components while suppressing irrelevant or noisy features. This branch generates 256 output channels, with a temporal resolution reduced to

L / 8

.

Figure 1. Multi-channel CNN branch graph.

The second branch extracts frequency domain features using four convolutional layers applied to the FFT spectrum, producing 512 channels at a resolution of

L / 16

. The third and fourth branches handle higher-order spectral features, derived by applying the FFT to the squared and fourth-power IQ signals, respectively. These branches capture nonlinear signal structures and weak harmonics that are often overlooked in standard spectral analysis, which is particularly valuable for distinguishing complex modulation schemes. The fifth branch, in contrast, directly down-samples the original five time domain features without convolution, preserving the raw information.

To expose frequency domain cues to the CNN branch, each IQ segment is transformed by a fast Fourier transform (FFT). The resulting spectral magnitude and phase features are concatenated with time domain channels (I, Q, amplitude, phase, and phase difference) to form a unified multi-channel input. This design allows the convolutions to learn temporal and spectral patterns within a single representation, improving separability for modulation families that appear similar in the time domain but differ in their spectral signatures. In practice, the spectral channels also highlight bandwidth usage, line spectra, and weak harmonics at moderate-to-high SNRs, thereby strengthening the model’s robustness without incurring prohibitive computational cost.

Once the temporal dimensions of all branches are aligned, their outputs are concatenated into a unified 1797-channel feature map. A subsequent 1 × 1 convolution operation decreases the number of channels to 512, followed by global average pooling and flattening to produce the final CNN output, denoted as

F_{CNN}

. All convolutional layers in the model use the ReLU activation function to introduce nonlinearity. The modular structure of the model allows for easy expansion or modification, such as integrating attention mechanisms or cyclostationary feature extractors, which strengthens the model’s capability to perform across various wireless signal recognition scenarios.

The ReLU function, which was introduced in earlier research, has become the most widely adopted choice for modulation classification tasks. As opposed to traditional activation functions exemplified by sigmoid and tanh, ReLU provides faster training, improved classification accuracy, better generalization, and effectively addresses the vanishing gradient problem. By introducing sparsity into the hidden units, as shown in Equation (4), the ReLU function enhances the efficiency of representation and reinforces the model’s robustness and adaptability when applied to classification tasks.

f (x) = \max (0, x) = \{\begin{matrix} x, & if x \geq 0 \\ 0, & if x < 0 \end{matrix} .

(4)

This architecture leverages both physical signal priors and deep representation learning, allowing for reliable and precise classification across various modulation types. The dimensional representation of the process is summarized in Table 2.

Table 2. Multi-channel hierarchical output process table.

3.3. Vision Transformer

In this paper’s design, the second branch, ViT, as illustrated in Figure 2, processes the input through patch embedding and positional encoding applied to the constellation graph derived from the IQ sequence. The [CLS] token, an additional learnable global vector, is prepended to the input token sequence before each sample. This token participates in all computations throughout the process, allowing the transformer to incorporate information from the entire input. As a result, the token helps represent the overall feature output. The subsequent sections describe this process in detail, and the overall architecture is shown Figure 3.

Figure 2. ViT* (where the asterisk denotes the extra learnable [CLS] token) with CLS token.

Figure 3. ViT- based feature extraction framework.

The first four modules in the diagram primarily focus on converting the IQ signals into constellation diagrams. These diagrams are then transformed into sequences, with positional information added, in preparation for input into the transformer.

To capture the structural modulation characteristics, the raw IQ signal sequence is converted into a 2D constellation map. Each IQ pair(

I_{t}

,

Q_{t}

) is treated as a complex symbol and plotted onto a 2D grid. A 64 × 64 grayscale image is then created using normalized binning and density mapping, providing a visual representation that is suitable for input into the Vision Transformer branch.

The constellation map

X \in R^{H \times W}

is divided into non-overlapping patches, where each patch has a spatial size of

p \times p

. Consequently, the map is partitioned into

N = \frac{H W}{p^{2}}

patches. Let

X_{i} \in R^{p \times p}

denote the i-th patch, and

Flatten (\cdot)

be the vectorization operator that reshapes a patch into a column vector. Then, the patch embeddings are obtained as

x_{i} = Flatten (X_{i}) \in R^{p^{2}}, i = 1, 2, \dots, N,

(5)

The 64 × 64 image is sliced into 8 × 8 patches to become a patch sequence. Then, to enable the model to have sequence awareness, position encoding was introduced. The encoded sequences were concatenated, and a learnable token was positioned at the head of the sequence to generate the final global representation, resulting in a sequence length of 65.

Each patch is embedded by a linear mapping as:

z_{i} = E x_{i} + p_{i} \in R^{d},

(6)

where

E \in R^{d \times P^{2}}

is a learnable projection matrices;

p_{i} \in R^{d}

is position code of the ith patch; the construction of input sequences is achieved by adding learnable classification tokens:

Z_{o} = [z_{cls}; z_{1}; z_{2}; \dots; z_{N}] \in R^{(N + 1) \times d},

(7)

It is then input into the ViT and processed by the transformer encoder module. The structure of the ViT follows the same principles as the original transformer. Every transformer layer includes:

Attention (Q, K, V) = softmax (\frac{Q K^{⊤}}{\sqrt{d_{k}}}),

(8)

for every head:

{head}_{i} = Attention (Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}),

(9)

Through this design, each

{head}_{i}

concentrates on specific information patterns and can model relationships between positions. Once concatenated, these outputs are multiplied by matrix

W^{o}

(size

n_{h} d_{v} \times d

) to form the final matrix.

Multi-Head:

MultiHead (Q, K, V) = Concat ({head}_{1}, \dots, {head}_{n}) W^{o},

(10)

where Q, K, and V denote the query, key, and value matrices;

W_{i}^{Q}

,

W_{i}^{K}

, and

W_{i}^{V}

are the learnable projection matrices for the i-th attention head; and

W^{o}

is the output projection matrix. The dimensionalities d and

d_{k}

represent the embedding dimension and the per-head projection dimension, respectively.

Multiple identical layers form the Transformer encoder, where each layer integrates a multi-head self-attention block with a feed-forward neural network. In order to counteract the vanishing gradient phenomenon prevalent in deep architectures, each sub-layer incorporates residual connections alongside layer normalization. The feed-forward sub-layer follows a standard fully connected network structure. As shown in Figure 4, by employing multi-head attention, every token in the sequence can access information from all other tokens, with computed weights quantifying the relevance of their positions. This process effectively captures dependencies across sequence elements. By distributing computations across multiple attention heads, each independently calculating attention weights, the model is enabled by the mechanism to attend to different characteristics of the sequence. The outputs from all heads are then aggregated, producing a rich, multi-perspective representation.

Figure 4. Illustration of the multi-head self-attention mechanism.

MLP (Multi-Layer Perceptron), which functions similarly to a feedforward network (FFN), is made up of two dense layers with a GELU nonlinearity in between. Consisting of two fully connected layers, this part applies the GELU activation after both layers to introduce nonlinearity. This notably strengthens the model’s capability to model intricate transformations of token embeddings. The structure of this component is as follows:

MLP (x) = GELU (x W_{1} + b_{1}) W_{2} + b_{2},

(11)

where

x \in R^{d}

denotes the input token embedding,

W_{1} \in R^{d \times d_{ff}}

and

W_{2} \in R^{d_{ff} \times d}

are learnable weight matrices,

b_{1}

and

b_{2}

are bias terms, and

d_{ff}

is the intermediate hidden dimension, which is typically set to four times the embedding dimension d (e.g.,

512 \to 2048

). For deep neural architectures, each sublayer is further equipped with residual connections and layer normalization to facilitate stable optimization and reliable gradient propagation. For deep neural architectures, each sublayer is designed with residual pathways and layer normalization, facilitating stable optimization and reliable gradient transmission. Specifically, the update rules for each encoder block are as follows:

Z^{'} = Z + MultiHead (LN (Z)) Z_{out} = Z^{'} + FFN (LN (Z^{'})),

(12)

where

L N (\cdot)

denotes layer normalization. The transformer’s output is a sequence of token embeddings enriched with contextual information. This design not only retains the original signal features but also alleviates issues like vanishing gradients in deeper networks. To obtain a global representation, the token at position 0 in the output sequence is extracted and used as the overall feature vector:

F_{Trans} = LN (Z_{out} [0]) \in R^{d},

(13)

This vector is subsequently used in the Path Attention Module (PAM) to align with the CNN and BiLSTM outputs. The specific dimensions of these are shown in Table 3.

Table 3. ViT-branch-specific input and output process table.

3.4. BiLSTM

The framework diagram of LSTM is shown in Figure 5. To effectively capture the long-term dependencies inherent in modulated signals, we integrate a BiLSTM network as the third feature extraction branch. Unlike traditional CNNs, which focus on local patterns, RNNs, particularly LSTMs, are well-suited for learning temporal structures within sequential data. The bidirectional architecture of the network enables it to process both past and future contextual information, making it particularly effective for recognizing modulation types with periodic or time-varying characteristics.

Figure 5. Architecture of the BiLSTM.

The input to this branch is the original IQ signal matrix

X_{IQ} \in R^{L \times 2}

, where

L = 128

denotes the sequence length, and the two channels indicate the in-phase and quadrature components. Next, the sequence is input into a BiLSTM layer configured with

h = 256

hidden units in each direction. The resulting forward and backward outputs are merged, yielding a combined temporal feature vector:

F_{BiLSTM} = [h_{t + 1}; {\overset{\leftarrow}{h}}_{t - 1}] \in R^{512},

(14)

where

h_{t + 1}

and

{\overset{\leftarrow}{h}}_{t - 1}

denote the final hidden states from the forward and backward directions.

This 512-dimensional feature vector captures the high-level temporal dynamics of the input signal, complementing the local spatial features extracted by the CNN branch and the global structural features obtained through the ViT branch. The BiLSTM features are then passed to the path attention module, where they are adaptively fused with the outputs from the other branches. The bidirectional structure of the LSTM improves the model’s capacity to learn transitions, repetitions, and time-varying frequency domain patterns. This makes BiLSTM particularly effective for identifying modulation schemes with distinct waveform cycles or timing characteristics, such as PSK and QAM, when compared to unidirectional LSTM.

Following the detailed explanation of each branch’s functionality, the overall architecture of the model is presented to complete the system design.

4. Three-Branch Fusion Classification Method

In AMR, variations in modulation type and SNR can lead to considerable differences in signal characterization. Relying on a single feature extraction method may result in limited robustness or insufficient generalization. To resolve this limitation, a multi-branch fusion network is introduced, allowing features to be extracted in collaboration across three distinct routes. This approach not only addresses the issue of robustness but also improves recognition accuracy. For instance, when a modulation type is identified in the multi-channel CNN branch and the same modulation type is later confirmed through the constellation diagram in the transformer, the agreement between the two results increases our confidence in the correctness of the recognition output.

The three feature vectors are then adaptively integrated using a Path Attention Module (PAM), which allows dynamic fusion based on the signal type and quality. With its multi-path design, the AMR system achieves a marked increase in robustness.

The main processing flow of the proposed three-branch fusion network can be summarized as follows:

IQ signal preprocessing: The raw complex base band IQ samples are normalized and reshaped into the required formats for sequence-based processing and constellation construction.
Parallel feature extraction: The preprocessed IQ sequence is fed simultaneously into three branches. The CNN branch extracts local multi-scale time–frequency features, the BiLSTM branch models long-range temporal dependencies and modulation periodicity, and the ViT branch takes the corresponding constellation image as input to learn global spatial structures based on the [CLS] token representation.
Feature aggregation: The high-level feature vectors produced by the three branches are concatenated into a joint feature representation, which collects complementary information from the time domain, sequence dynamics, and constellation space.
Adaptive path fusion: The concatenated feature is passed through the path attention module (PAM), which generates branch-wise weights and produces a fused feature by assigning different importance to the CNN, BiLSTM, and ViT representations.
Classification: The fused feature is fed into the final fully connected layers and softmax to output the posterior probabilities of all modulation types, and the class with the highest probability is taken as the predicted modulation label.

To provide a clearer overview of the proposed architecture, Figure 6 illustrates the overall structure of the three-branch fusion network. The IQ signal is simultaneously fed into the CNN, BiLSTM, and ViT branches to obtain three complementary feature representations, which are then fused and passed to the final classifier. The detailed implementation of each branch and the fusion module is further shown in Figure 7.

Figure 6. Overall structure of the proposed three-branch fusion network.

Figure 7. Detailed architecture of the three-branch feature extraction and fusion module.

4.1. Overall Framework Description

As depicted in Figure 7, the proposed AMR framework’s overall architecture is outlined. The specific details have been outlined in the previous section. The input IQ signal sequence, represented as the matrix

X_{IQ} \in R^{128 \times 2}

, is processed in parallel by three distinct feature extraction branches.

The CNN branch first transforms the IQ sequence into a multi-channel temporal representation by incorporating additional physically meaningful features, such as amplitude, phase, and derivative. This results in a signal tensor

X_{CNN} \in R^{C \times L}

, where C ranges from 5 to 7 channels. The branch then applies a series of 1D convolutional layers along the time axis to extract localized temporal patterns, followed by max-pooling and channel attention mechanisms. The output is a flattened feature vector

F_{CNN} \in R^{512}

.

The BiLSTM branch treats the original IQ sequence as a temporal input, learning long-range dependencies through bidirectional LSTM layers, resulting in a feature representation

F_{BiLSTM} \in R^{512}

.

The ViT branch transforms the IQ sequence into a 2D constellation diagram

X_{img} \in R^{1 \times 64 \times 64}

, which is then embedded into patch sequences and fed into Transformer encoders. The final token output serves as the global feature

F_{Trans} \in R^{512}

.

4.2. The Path Attention Module and Classifier

Unlike traditional fixed fusion methods, such as simple series or averaging, the Path Attention Module (PAM) is designed to adaptively assign weights to each branch according to the properties of the input signal. This adaptability enables the model to prioritize features with the highest informational value under varying modulation schemes and SNRs. For instance, frequency-sensitive modulations like FSK benefit more from features extracted by the CNN branch, while modulations with strong temporal structure, such as PSK, are more effectively captured by the BiLSTM branch.

To implement this adaptive mechanism, the outputs from the three branches are first passed through a shared, lightweight attention estimator that computes soft attention weights. These weights, denoted as

α

,

β

, and

γ

, are then used by the Path Attention Module to dynamically adjust the contribution of each path. The final representation is derived through the attention-weighted fusion of outputs from the three parallel branches.

F_{Final} = α \cdot F_{CNN} + β \cdot F_{BiLSTM} + γ \cdot F_{Trans}, α + β + γ = 1,

(15)

Firstly, the outputs of the three branches are spliced one by one to obtain

F_{cat} = [F_{CNN}; F_{BiLSTM}; F_{Trans}] \in R^{3 \times 512}

, in which we use an MLP and softmax to obtain the fusion weights,

[α, β, γ] = Softmax (MLP (F_{cat}))

; and finally, we obtain the outputs,

F_{Final} \in R^{512}

, which is used as the input to the classification layer.

Two fully connected layers were used in the classification layer. The first layer, FC1, reduces the dimensionality from 512 to 256, applying ReLU activation for further compression and nonlinear transformation. The output is then passed through the second layer, FC2. In the last step, the output is processed using the softmax activation function.

4.3. Loss Function

The entire network is trained jointly, where the softmax classifier output serves as the basis for computing the cross-entropy loss. Since all three branches contribute to the final feature vector

F_{Final}

, the loss is back-propagated through the path attention module and each branch. This allows the model to optimize jointly, ensuring that the classification performance guides the feature learning across the entire architecture.

To complete the optimization, We use the standard cross-entropy loss function:

L = - \sum_{i = 1}^{C} y_{i} \log ({\hat{y}}_{i}),

(16)

where y is the one-hot encoded true label and

{\hat{y}}_{i}

is the softmax output for class i; C is the number of categories. The model is trained to minimize this loss using stochastic gradient descent.

4.4. Data Flow and Representation

The data flows and internal transformations throughout the network can be summarized as follows in Table 4:

Table 4. Overall input–output dimensions of the proposed three-branch framework.

5. Experiments and Results

5.1. Dataset Description

We evaluate the effectiveness of automatic AMR algorithms on the RML2018.01a dataset [47], which is widely used for comparative studies due to its scale and diversity of channel conditions. The corpus covers multiple impairments—including additive white Gaussian noise (AWGN), carrier frequency offset (CFO), timing offset, and sampling rate offset (SRO) [15]—making performance assessment more challenging as the data volume increases. Unless otherwise specified, signals are provided as complex base band IQ sequences (zero-IF); therefore, the dataset does not define a fixed RF carrier or absolute modulation/symbol frequency, and all models operate on discrete-time, normalized samples. Each example in RML2018.01a contains 1024 samples per I and Q channel, We follow the standard SNR grid of

- 20 : 2 : 30

dB and evaluate all 24 modulation types. For complementary validation, we also report results on the smaller RML2016.10a dataset with SNR

- 20 : 2 : 18

dB and 11 modulation types; each example there contains 128 IQ samples. In both datasets, the injected noise model is AWGN, while RML2018.01a additionally includes CFO, timing offset, and SRO to emulate practical channel distortions.

5.2. Experimental Details

In the experiment, the RML2016.10a dataset and the RML2018.01a dataset were divided into training and validation sets in a 8:2 ratio. The dataset was divided according to the aforementioned ratio at all SNR levels and signal modulation formats to ensure the balance of the dataset. The experiment process uses an NVIDIA A10 GPU, the environment are pytorch2.3.1 and the language is python3. During training, the initial learning rate was set to 0.001, and the cross-entropy loss function (CrossEntropyLoss) was used as the loss function. Terminate training when the validation accuracy fails to improve over ten epochs. To optimize the learning process, we used the Adam optimizer. To validate the performance of the proposed algorithm, we evaluated it based on classification performance metrics. To confirm the superiority of the proposed method, we compared and evaluated it against several high-performance baseline methods, specifically: MCLDNN [35], IC-AMCNET [48], MCNET [37], CGDNET [36], and ResNet [49]. For consistency in comparison, all methods were conducted under identical conditions.

5.3. Results and Analysis

As shown in Figure 8, the training loss consistently decreases over 30 epochs, indicating effective optimization and stable convergence. The loss drops rapidly during the initial stages and gradually stabilizes at a low value, demonstrating the robustness of the proposed framework without signs of overfitting.

Figure 8. Training loss curve over 30 epochs in RML2016.10a.

Figure 9 presents the confusion matrix of the proposed modulation recognition framework evaluated at an SNR of 10 dB, providing a quantitative depiction of the classification outcomes across eleven modulation categories: QPSK, PAM4, AM-DSB, GFSK, QAM64, AM-SSB, 8PSK, QAM16, WBFM, CPFSK, and BPSK. The diagonal components indicate correctly identified samples, whereas the off-diagonal elements represent instances of misclassification. In general, the framework exhibits remarkably high classification performance under this moderate SNR condition. The results demonstrate the robustness and reliability of the adopted feature extraction and decision-making strategies, even when subjected to low-level noise disturbances.

Figure 9. Confusion matrix at SNR = 10 dB in RML2016.10a.

Despite achieving near-perfect recognition performance for most modulation types, the proposed model consistently exhibits a higher misclassification rate for AM-SSB across all SNR levels. This phenomenon primarily arises from the inherent spectral similarity between AM-SSB and AM-DSB, as well as the potential feature overlap with WBFM.

In the RML2016.10a dataset, the recognition performance generally stabilizes once the SNR exceeds 10 dB. Within the high-SNR range (10–18 dB), the proposed method consistently achieves an accuracy of approximately 95%, whereas MCLDNN fluctuates around 92%, IC-AMCNET fluctuates around 86%, MCNET fluctuates around 83%, CGDNET fluctuates around 80%, and ResNet maintains roughly 85%. By comparison, the proposed method demonstrates superior accuracy, confirming its improvement.

Figure 10 depicts the classification performance of a selected set of PSK and APSK modulation formats across a wide SNR range. In the severely noisy regime (e.g., SNR (≤

- 10

dB)), all modulation types perform poorly, reflecting the expected degradation in signal discriminability under strong noise interference. Nonetheless, some schemes such as 16PSK begin to show noticeable improvements earlier than others, suggesting favorable feature separability under moderate distortions.

Figure 10. Accuracy versus SNR for selected PSK/APSK modulations in RML2018.01a.

As the SNR increases into the transitional region, differences among modulation types become more evident. Lower-order PSK signals (e.g., BPSK, QPSK) achieve rapid gains in accuracy, approaching near-perfect recognition at moderate SNR levels. In contrast, higher-order APSK schemes (such as 64APSK and 128APSK) exhibit slower convergence, indicating their increased sensitivity to noise due to denser symbol constellations.

Beyond 10 dB, the majority of modulations reach stable classification, with most curves flattening near unity. However, subtle discrepancies persist—particularly among high-order APSK formats—where residual misclassifications hint at inherent ambiguity in feature spaces despite favorable SNR conditions.

As presented in Figure 11, the classification performance of several modulation formats—including OOK, FM, GMSK, OQPSK, and four AM variants—is analyzed over a wide SNR range.

Figure 11. Accuracy vs. SNR curves in low-order and analog modulation performance in RML2018.01a.

At extremely low SNR levels (e.g., below

- 10 dB

), all schemes exhibit low recognition accuracy, reflecting the challenges of distinguishing signal structures submerged in heavy noise. However, as the SNR improves, accuracy increases rapidly for most formats. Notably, AM-SSB-WC and AM-SSB-SC demonstrate faster convergence to high accuracy, likely benefiting from their relatively distinct spectral and envelope features. OOK also displays a steep performance rise, attributable to its binary nature and amplitude-based distinguishability.

In contrast, AM-DSB-SC shows persistent instability and under-performance even at high SNRs, suggesting a more ambiguous signal representation under the current model. GMSK and OQPSK follow intermediate trends, with smoother curves and stable classification from mid-to-high SNRs.

Figure 12 illustrates the classification accuracy of the proposed model across a range of SNR values for a subset of amplitude-based modulation schemes, including both amplitude shift keying (ASK) and quadrature amplitude modulation (QAM) variants. This focused evaluation provides deeper insight into the model’s ability to discriminate among modulations that share similar amplitude domain characteristics yet differ in constellation complexity.

Figure 12. Accuracy versus SNR for QAM/ASK-style modulation performance in RML2018.01a.

As the figure demonstrates, lower-order ASK and QAM types (e.g., 4ASK, 8ASK, and 16QAM) achieve rapid convergence to high classification accuracy as the SNR surpasses 0 dB, benefiting from relatively distinct constellation structures and reduced symbol ambiguity. In contrast, higher-order QAM schemes (e.g., 128QAM and 256QAM) exhibit more gradual performance improvement with increasing SNR, requiring significantly higher signal quality to achieve comparable accuracy levels. This behavior can be attributed to the reduced Euclidean distance between constellation points and heightened sensitivity to noise in these densely packed modulations.

Notably, despite the intrinsic complexity of high-order QAM formats, the proposed architecture maintains consistent accuracy progression across the entire SNR range, ultimately achieving near-saturation accuracy beyond 10 dB for most schemes. These results suggest that the feature extraction and classification layers are sufficiently expressive and robust to manage the finer granularity required for high-order modulations.

In summary, the comparative analysis across the three modulation groups reveals several consistent trends. First, lower-order modulation schemes (e.g., BPSK, 4ASK, OOK) demonstrate superior robustness to noise and rapidly achieve high recognition accuracy even in low-to-moderate SNR regimes. Second, the classification difficulty increases markedly with modulation order and constellation complexity, as evidenced by the delayed convergence and lower peak accuracies of higher-order formats such as 256QAM, 128APSK, and AM-DSB-SC. Third, modulation types characterized by spectral redundancy or envelope variation (e.g., AM and FM) present non-monotonic or irregular trajectories, indicating a greater challenge in discriminative feature extraction under noisy conditions.

These findings collectively highlight the importance of adaptive and hierarchical feature learning strategies in automatic modulation recognition. The ability to extract both local and global patterns is essential not only for handling noise-robust modulations but also for resolving fine-grained distinctions among visually similar, high-order schemes. Therefore, designing architectures capable of multi-scale representation is key to achieving generalizable performance across diverse modulation families.

Figure 13 presents the row-normalized confusion matrix obtained at a SNR of 10 dB, providing a comprehensive visualization of the model’s classification behavior under moderately high channel quality. The matrix enables detailed assessment of the model’s ability to discriminate among the 24 modulation categories, highlighting both strengths in identification and patterns of confusion.

Figure 13. Confusion matrix under 10 dB SNR in RML2018.01a.

As evident from the strong diagonal dominance, the proposed architecture achieves near-perfect classification accuracy for the majority of modulation types. Particularly, lower-order PSK and ASK classes (e.g., BPSK, QPSK, 4ASK) exhibit almost ideal separability, reflecting their distinct signal structures and robustness to noise at this SNR level. Moreover, analog modulation formats such as AM-DSB-SC and AM-SSB-WC also exhibit slight confusion, which may stem from their signal envelope similarity in the time–frequency domain. Importantly, no major systematic bias or model instability is observed across the matrix, affirming the generalization capacity of the model across both the digital and analog domains.

This confusion matrix not only confirms the model’s high discriminative power at SNR = 10 dB but also provides actionable insights into class-specific vulnerabilities, guiding future refinement in feature extraction or attention calibration. Overall, the results validate the effectiveness of the proposed fusion architecture in achieving fine-grained modulation recognition under realistic wireless conditions.

5.4. Comparison of Methods

To thoroughly evaluate the performance of the proposed model under varying channel conditions, Figure 14 illustrates the classification accuracy of six representative models across a wide range of SNR levels. Due to differences in the environmental configuration and data processing employed in this paper, the results of the comparative methods differ from those provided in the original paper. As expected, the recognition performance of all models improves with increasing SNR. However, it is evident that the proposed method consistently achieves the highest accuracy across most SNR values, particularly under high-SNR scenarios (above 10 dB), where precise modulation classification is essential for reliable communication.

Figure 14. Accuracy achieved by different AMR approaches across a range of SNRs in RML2018.01a.

The embedded zoom-in subplot provides a magnified view of the top-4 performing models—namely, the proposed method, IC-AMCNET, MCNET, and MCLDNN—in the high-SNR region. This local comparison reveals that the proposed method not only achieves the best peak accuracy but also exhibits the most stable and steadily improving performance, reflecting its enhanced generalization ability and effective suppression of residual noise. In contrast, competing methods show performance saturation or fluctuations, suggesting potential limitations in handling signal variations or overlapping features.

Furthermore, the superior performance of the proposed model in low-to-mid SNR regimes (e.g., from

- 10

dB to 10 dB) underscores its strong robustness under adverse channel conditions. This can be attributed to its hybrid architectural design and efficient feature extraction mechanisms, which facilitate better discrimination of modulation patterns even under heavy noise contamination.

Although the accuracy curves in Figure 14 show only a modest improvement over baseline models at medium–high SNR levels, a more comprehensive comparison reveals the advantages of the proposed method. As summarized in Table 5, our tri-branch fusion architecture achieves the highest overall accuracy, as well as the best macro-F1 and Cohen’s Kappa scores among all evaluated methods. These indicators reflect not only improved recognition capability but also a more balanced performance across different modulation types. Furthermore, the proposed approach exhibits enhanced stability in low-to-mid SNR conditions, where competing models often suffer from pronounced fluctuations or class-specific degradation. This combined evidence demonstrates that, despite the seemingly small gap in the raw accuracy curves, the proposed method provides consistently superior and more robust classification performance across the entire SNR range.

Table 5. Performance comparison of different models.

Collectively, these results highlight the model’s dual advantage: robustness to noise in challenging conditions and high-fidelity classification under clean signals. This performance advantage makes it well-suited for practical deployment AMR systems, particularly in dynamic and noisy wireless environments such as cognitive radio networks, UAV-based communications, and battlefield spectrum monitoring.

Table 5 provides a comparative analysis of the proposed method against several established baselines in terms of accuracy, macro-F1, Kappa, training time, and convergence epochs. The proposed model consistently demonstrates superior performance, achieving a favorable trade-off between predictive precision and statistical consistency. Compared to alternative methods, such as MCLDNN and MCNET, which show partial strengths in individual metrics, the proposed approach maintains a more balanced profile across all evaluation criteria. However, the approach proposed herein involves a greater number of branches, resulting in relatively longer training times. In addition to achieving competitive or superior overall accuracy, the proposed model exhibits more stable behavior across the SNR range and a more balanced recognition capability over different modulation types, as reflected by the accompanying quantitative metrics. From the perspective of computational complexity, IC-AMCNet requires approximately 8.6 M parameters, while MCLDNN, MCNet, and CGDNet use about 0.40 M, 0.13 M, and 0.65 M parameters, respectively, and ResNet involves an exceptionally large number of parameters. Although the total number of parameters of the proposed tri-branch architecture remains non-negligible, its complexity stays within a reasonable range and can be further reduced to around 4 M by employing different weighting strategies at runtime, without changing the overall performance pattern. Taken together, these observations demonstrate that the proposed approach attains a favorable trade-off between classification performance, robustness, and computational cost.

5.5. Ablation Experiment

To assess the functional contribution of each architectural branch, an ablation experiment was performed by progressively removing key components from the proposed model. As shown in Figure 15, the recognition performance of four model configurations was evaluated across a broad SNR range. The complete framework (“the proposed”), which combines the multi-channel physical feature convolutional module, a BiLSTM branch, and a ViT-based transformer path, consistently achieved the highest accuracy at nearly all SNR levels. In contrast, the “only cnn” variant—relying solely on the convolutional physical features—exhibited notably lower performance, especially in higher SNR regimes, reflecting its inability to capture sequential or global contextual cues. The “cnn+bilstm” and “cnn+vit” models, each incorporating one auxiliary branch, demonstrated moderate improvements due to their ability to extract either temporal dependencies or long-range representations. Nevertheless, both remained inferior to the fully fused architecture.

Figure 15. Ablation study of the proposed architecture in RML2018.01a.

Although Figure 15 shows that the accuracy improvement of the full tri-branch model over the CNN-only baseline is numerically modest at some SNR levels, the additional BiLSTM and ViT branches are introduced to enhance the model in several complementary aspects rather than to create a completely different accuracy curve. The BiLSTM branch focuses on temporal feature modeling, enabling the network to capture symbol-to-symbol dependencies and subtle temporal patterns that are easily smeared by noise, which leads to more stable decisions in low- and mid-SNR regimes. In parallel, the ViT-based constellation branch emphasizes global spatial relationships in the IQ plane, allowing the model to exploit long-range correlations and class-specific constellation structures that are difficult to represent with local CNN kernels alone. As confirmed by the ablation results in Figure 15, removing either of these two branches causes a consistent decrease in performance across the SNR range, while the full tri-branch model achieves a better balance between robustness and representation capacity with only a moderate increase in complexity. This indicates that the proposed architecture improves not only raw accuracy but also the reliability and stability of modulation recognition under varying channel conditions.

The embedded zoom-in window in Figure 15 further magnifies the results for SNR

\geq 10

dB, revealing that the proposed model retains a consistent performance margin over its reduced counterparts, even when noise effects are minimal. This validates the synergistic effect of integrating all three feature pathways, highlighting that temporal modeling and global attention mechanisms jointly enhance the baseline convolutional encoding. Collectively, these results affirm the necessity of the complete tri-branch fusion strategy for maximizing classification robustness across varying channel conditions.

6. Conclusions

This paper proposes a three-branch fusion classification model comprising three distinct branches: a multi-channel physical feature convolution branch, a bidirectional long-short time series branch, and a ViT branch. This approach extracts multiple features from the time–frequency domain for recognition, utilises BiLSTM for auxiliary discrimination, and employs ViT to analyse signal constellation diagrams for final classification validation, thereby achieving high-performance classification. Experiments conducted on the RML2016.10a and RML2018.01a datasets demonstrate that the proposed method achieves superior recognition performance with enhanced accuracy.

Future research still faces several challenges. First, applying the proposed method to real-world systems and verifying its effectiveness using practical measurement data remain essential, as most existing studies are still based on simulation results. Second, enhancing the robustness of modulation recognition in harsh environments is equally important. In particular, achieving reliable accuracy under low signal-to-noise ratio conditions remains a key issue that requires further investigation.

Author Contributions

Conceptualization, C.Z. and K.S.; Methodology, J.L. and K.S.; Software, C.Z. and C.L.; Validation, J.L.; Investigation, T.L. and C.L.; Resources, T.L.; Writing—original draft, C.Z.; Writing—review and editing, C.Z. and J.L.; Supervision, K.S. and T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external research funding. The APC was funded by the Graduate Education Reform Project Fund of Chengdu University (Grant No. CDJGY2024004).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Tao Liu was employed by the company Chengdu Kinyea Technologies Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Wang, X.; Zhao, Y.; Huang, Z. A Survey of Deep Transfer Learning in Automatic Modulation Classification. IEEE Trans. Cogn. Commun. Netw. 2025, 11, 1357–1381. [Google Scholar] [CrossRef]
Nandi, A.; Azzouz, E. Algorithms for automatic modulation recognition of communication signals. IEEE Trans. Commun. 1998, 46, 431–436. [Google Scholar] [CrossRef]
Dobre, O.A.; Abdi, A.; Bar-Ness, Y.; Su, W. Survey of automatic modulation classification techniques: Classical approaches and new trends. IET Commun. 2007, 1, 137–156. [Google Scholar] [CrossRef]
Zhao, Y.; Wang, X.; Huang, Z. Multi-Function Radar Modeling: A Review. IEEE Sens. J. 2024, 24, 31658–31680. [Google Scholar] [CrossRef]
Xu, J.L.; Su, W.; Zhou, M. Likelihood-Ratio Approaches to Automatic Modulation Classification. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2011, 41, 455–469. [Google Scholar] [CrossRef]
Zheng, J.; Lv, Y. Likelihood-Based Automatic Modulation Classification in OFDM With Index Modulation. IEEE Trans. Veh. Technol. 2018, 67, 8192–8204. [Google Scholar] [CrossRef]
Zhu, Z.; Nandi, A.K. Automatic Modulation Classification: Principles, Algorithms and Applications; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar] [CrossRef]
Ge, Z.; Jiang, H.; Guo, Y.; Zhou, J. Accuracy Analysis of Feature-Based Automatic Modulation Classification via Deep Neural Network. Sensors 2021, 21, 8252. [Google Scholar] [CrossRef]
Hameed, F.; Dobre, O.A.; Popescu, D.C. On the likelihood-based approach to modulation classification. IEEE Trans. Wirel. Commun. 2009, 8, 5884–5892. [Google Scholar] [CrossRef]
Dobre, O.A.; Hameed, F. Likelihood-Based Algorithms for Linear Digital Modulation Classification in Fading Channels. In Proceedings of the 2006 Canadian Conference on Electrical and Computer Engineering, Ottawa, ON, Canada, 7–10 May 2006; pp. 1347–1350. [Google Scholar] [CrossRef]
Panagiotou, P.; Anastasopoulos, A.; Polydoros, A. Likelihood ratio tests for modulation classification. In Proceedings of the MILCOM 2000 Proceedings, 21st Century Military Communications, Architectures and Technologies for Information Superiority (Cat. No. 00CH37155), Los Angeles, CA, USA, 22–25 October 2000; Volume 2, pp. 670–674. [Google Scholar]
Derakhtian, M.; Tadaion, A.; Gazor, S. Modulation classification of linearly modulated signals in slow flat fading channels. IET Signal Process. 2011, 5, 443–450. [Google Scholar] [CrossRef]
Stanescu, D.; Digulescu, A.; Ioana, C.; Serbanescu, A. Modulation Recognition of Underwater Acoustic Communication Signals Based on Phase Diagram Entropy. In Proceedings of the OCEANS, Hampton Roads, VA, USA, 17–20 October 2022; pp. 1–7. [Google Scholar] [CrossRef]
Stanescu, D.; Digulescu, A.; Ioana, C.; Serbanescu, A. Corrigendum: Spread spectrum modulation recognition based on phase diagram entropy. Front. Signal Process. 2023, 3, 1334782. [Google Scholar] [CrossRef]
Qin, X.; Jiang, W.; Gui, G.; Li, D.; Niyato, D.; Lu, J. Multilevel Adaptive Wavelet Decomposition Network-Based Automatic Modulation Recognition: Exploiting Time-Frequency Multiscale Correlations. IEEE Trans. Cogn. Commun. Netw. 2025, 11, 3218–3231. [Google Scholar] [CrossRef]
Li, Y.; Tan, H.; Shi, X.; Zhou, W.; Zhou, F. Wavelet-based Adaptive Network for Automatic Modulation Recognition under Low SNR. In Proceedings of the 2024 IEEE 35th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Valencia, Spain, 2–5 September 2024; pp. 1–6. [Google Scholar] [CrossRef]
Sun, Z.; Wang, S.; Chen, X. Feature-Based Digital Modulation Recognition Using Compressive Sampling. Mob. Inf. Syst. 2016, 2016, 9754162. [Google Scholar] [CrossRef]
Sun, X.; Su, S.; Zuo, Z.; Guo, X.; Tan, X. Modulation Classification Using Compressed Sensing and Decision Tree–Support Vector Machine in Cognitive Radio System. Sensors 2020, 20, 1438. [Google Scholar] [CrossRef] [PubMed]
Wang, D.; Lin, M.; Zhang, X.; Huang, Y.; Zhu, Y. Automatic Modulation Classification Based on CNN-Transformer Graph Neural Network. Sensors 2023, 23, 7281. [Google Scholar] [CrossRef]
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
Gao, C.; Wang, X.; He, X.; Li, Y. Graph Neural Networks for Recommender System. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Virtual, 21–25 February 2022. [Google Scholar]
Qian, X.; Lin, S.; Cheng, G.; Yao, X.; Ren, H.; Wang, W. Object Detection in Remote Sensing Images Based on Improved Bounding Box Regression and Multi-Level Features Fusion. Remote Sens. 2020, 12, 143. [Google Scholar] [CrossRef]
Lin, S.; Zhang, M.; Cheng, X.; Zhou, K.; Zhao, S.; Wang, H. Hyperspectral Anomaly Detection via Sparse Representation and Collaborative Representation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 946–961. [Google Scholar] [CrossRef]
Lin, S.; Zhang, M.; Cheng, X.; Zhou, K.; Zhao, S.; Wang, H. Dual Collaborative Constraints Regularized Low-Rank and Sparse Representation via Robust Dictionaries Construction for Hyperspectral Anomaly Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 2009–2024. [Google Scholar] [CrossRef]
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
Graves, A.; Mohamed, A.r.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649. [Google Scholar] [CrossRef]
Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4–24. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
O’Shea, T.J.; Corgan, J.; Clancy, T.C. Convolutional Radio Modulation Recognition Networks. In Proceedings of the Engineering Applications of Neural Networks, Aberdeen, UK, 2–5 September 2016; pp. 213–226. [Google Scholar]
Meng, F.; Chen, P.; Wu, L.; Wang, X. Automatic Modulation Classification: A Deep Learning Enabled Approach. IEEE Trans. Veh. Technol. 2018, 67, 10760–10772. [Google Scholar] [CrossRef]
Wang, Y.; Fang, S.; Fan, Y.; Wang, M.; Xu, Z.; Hou, S. A complex-valued convolutional fusion-type multi-stream spatiotemporal network for automatic modulation classification. Sci. Rep. 2024, 14, 22401. [Google Scholar] [CrossRef] [PubMed]
Rajendran, S.; Meert, W.; Giustiniano, D.; Lenders, V.; Pollin, S. Distributed deep learning models for wireless signal classification with low-cost spectrum sensors. arXiv 2017, arXiv:1707.08908. [Google Scholar] [CrossRef]
West, N.E.; O’Shea, T. Deep architectures for modulation recognition. In Proceedings of the 2017 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), Baltimore, MD, USA, 6–9 March 2017; pp. 1–6. [Google Scholar] [CrossRef]
Xu, J.; Luo, C.; Parr, G.; Luo, Y. A Spatiotemporal Multi-Channel Learning Framework for Automatic Modulation Recognition. IEEE Wirel. Commun. Lett. 2020, 9, 1629–1632. [Google Scholar] [CrossRef]
Njoku, J.N.; Morocho-Cayamcela, M.E.; Lim, W. CGDNet: Efficient Hybrid Deep Learning Model for Robust Automatic Modulation Recognition. IEEE Netw. Lett. 2021, 3, 47–51. [Google Scholar] [CrossRef]
Huynh-The, T.; Hua, C.H.; Pham, Q.V.; Kim, D.S. MCNet: An Efficient CNN Architecture for Robust Automatic Modulation Classification. IEEE Commun. Lett. 2020, 24, 811–815. [Google Scholar] [CrossRef]
Zhang, Z.; Luo, H.; Wang, C.; Gan, C.; Xiang, Y. Automatic Modulation Classification Using CNN-LSTM Based Dual-Stream Structure. IEEE Trans. Veh. Technol. 2020, 69, 13521–13531. [Google Scholar] [CrossRef]
Liu, X.; Li, C.J.; Jin, C.T.; Leong, P.H.W. Wireless Signal Representation Techniques for Automatic Modulation Classification. IEEE Access 2022, 10, 84166–84187. [Google Scholar] [CrossRef]
Mendis, G.J.; Wei, J.; Madanayake, A. Deep learning-based automated modulation classification for cognitive radio. In Proceedings of the 2016 IEEE International Conference on Communication Systems (ICCS), Shenzhen, China, 14–16 December 2016; pp. 1–6. [Google Scholar] [CrossRef]
Lee, J.; Kim, B.; Kim, J.; Yoon, D.; Choi, J.W. Deep neural network-based blind modulation classification for fading channels. In Proceedings of the 2017 International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 18–20 October 2017; pp. 551–554. [Google Scholar] [CrossRef]
Peng, S.; Jiang, H.; Wang, H.; Alwageed, H.; Zhou, Y.; Sebdani, M.M.; Yao, Y.D. Modulation Classification Based on Signal Constellation Diagrams and Deep Learning. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 718–727. [Google Scholar] [CrossRef]
Huang, S.; Chai, L.; Li, Z.; Zhang, D.; Yao, Y.; Zhang, Y.; Feng, Z. Automatic Modulation Classification Using Compressive Convolutional Neural Network. IEEE Access 2019, 7, 79636–79643. [Google Scholar] [CrossRef]
Chang, S.; Huang, S.; Zhang, R.; Feng, Z.; Liu, L. Multitask-Learning-Based Deep Neural Network for Automatic Modulation Classification. IEEE Internet Things J. 2022, 9, 2192–2206. [Google Scholar] [CrossRef]
Ruikar, J.D.; Park, D.H.; Kwon, S.Y.; Kim, H.N. HCTC: Hybrid Convolutional Transformer Classifier for Automatic Modulation Recognition. Electronics 2024, 13, 3969. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 ×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
O’Shea, T.J.; Roy, T.; Clancy, T.C. Over-the-Air Deep Learning Based Radio Signal Classification. IEEE J. Sel. Top. Signal Process. 2018, 12, 168–179. [Google Scholar] [CrossRef]
Hermawan, A.P.; Ginanjar, R.R.; Kim, D.S.; Lee, J.M. CNN-Based Automatic Modulation Classification for Beyond 5G Communications. IEEE Commun. Lett. 2020, 24, 1038–1041. [Google Scholar] [CrossRef]
Liu, X.; Yang, D.; Gamal, A.E. Deep neural network architectures for modulation classification. In Proceedings of the 2017 51st Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 29 October–1 November 2017; pp. 915–919. [Google Scholar] [CrossRef]

Figure 1. Multi-channel CNN branch graph.

Figure 2. ViT* (where the asterisk denotes the extra learnable [CLS] token) with CLS token.

Figure 3. ViT- based feature extraction framework.

Figure 4. Illustration of the multi-head self-attention mechanism.

Figure 5. Architecture of the BiLSTM.

Figure 6. Overall structure of the proposed three-branch fusion network.

Figure 7. Detailed architecture of the three-branch feature extraction and fusion module.

Figure 8. Training loss curve over 30 epochs in RML2016.10a.

Figure 9. Confusion matrix at SNR = 10 dB in RML2016.10a.

Figure 10. Accuracy versus SNR for selected PSK/APSK modulations in RML2018.01a.

Figure 11. Accuracy vs. SNR curves in low-order and analog modulation performance in RML2018.01a.

Figure 12. Accuracy versus SNR for QAM/ASK-style modulation performance in RML2018.01a.

Figure 13. Confusion matrix under 10 dB SNR in RML2018.01a.

Figure 14. Accuracy achieved by different AMR approaches across a range of SNRs in RML2018.01a.

Figure 15. Ablation study of the proposed architecture in RML2018.01a.

Table 1. List of abbreviations and acronyms.

Abbreviation	Full Form
ALRT	Average likelihood ratio test
GLRT	Generalized likelihood ratio test
HLRT	Hybrid likelihood ratio test
ANN	Artificial neural network
SVM	Support vector machine
HMM	Hidden Markov model
CNNs	Convolutional neural network
RNNs	Recurrent neural network
GNNs	Graph neural network
LSTM	Long short-term memory network
SPWVD	Smooth pseudo-Wigner–Ville distribution
BJD	Born–Jordan distribution

Table 2. Multi-channel hierarchical output process table.

Layer (Name)	Output Shape	Description
Input	[B, 7, L]	7-channel IQ signal input
TimeConv1	[B, 64, L/2]	Conv1d (5 → 64, k = 3, p = 1) + BN + ReLU + SELayer + MaxPool (2) (BRNM)
TimeConv2	[B, 128, L/4]	Conv1d (64 → 128, k = 3, p = 1) + BRNM
TimeConv3	[B, 256, L/8]	Conv1d (128 → 256, k = 3, p = 1) + BRNM
FFTConv1	[B, 64, L/2]	Conv1d (2 → 64, k = 3, p = 1) + BRNM
FFTConv2	[B, 128, L/4]	Conv1d (64 → 128, k = 3, p = 1) + BRNM
FFTConv3	[B, 256, L/8]	Conv1d (128 → 256, k = 3, p = 1) + BRNM
FFTConv4	[B, 512, L/16]	Conv1d (256 → 512, k = 3, p = 1) + BRNM
FFT2Conv1	[B, 64, L/2]	Same as FFTConv1, input is FFT of squared IQ
FFT2Conv2	[B, 128, L/4]	Same as FFTConv2
FFT2Conv3	[B, 256, L/8]	Same as FFTConv3
FFT2Conv4	[B, 512, L/16]	Same as FFTConv4
FFT4Conv1	[B, 64, L/2]	Same as FFTConv1, input is FFT of ${IQ}^{4}$
FFT4Conv2	[B, 128, L/4]	Same as FFTConv2
FFT4Conv3	[B, 256, L/8]	Same as FFTConv3
FFT4Conv4	[B, 512, L/16]	Same as FFTConv4
RawFeatures	[B, 5, L]	Raw input channels: real, imag, phase, dphase, amplitude
Concat	[B, 1797, min_len]	Concatenate all branches, align time dimension
FusionConv	[B, 512, min_len]	Conv1d (1797 → 512, kernel = 1) + BN + ReLU
GlobalAvgPool	[B, 512, 1]	AdaptiveAvgPool1d (1)
Flatten	[B, 512]	Flatten layer

⁴ Input is the Fast Fourier Transform of the IQ signal raised to the fourth power.

Table 3. ViT-branch-specific input and output process table.

Module Name	Input/Output Dimension	Operation/Functional Description
Constellation Map Generation	IQ signal	Complex mapping, 2D histogram generation, and Gaussian smoothing; converts the IQ sequence into a 2D spatial modulation distribution.
Patch Embedding	Input: constellation image	Divide the image into patches, flatten them, and project via an MLP; embeds each patch into a fixed-length vector representation.
Add [CLS] + Positional Encoding	Embedded patches	Add a learnable [CLS] token and positional encodings; provides a global representation and spatial position awareness for the Transformer encoder.
Transformer Encoder	Embedded sequence	Stacked Transformer encoder layers that model long-range dependencies in the constellation map using self-attention mechanisms.
Extract Token	Transformer output	Take the output corresponding to the [CLS] token; produces an image-level global representation of the modulation signal.
Branch Output	–	Fused with the CNN and LSTM branch outputs; represents global spatial features of modulation patterns.

Table 4. Overall input–output dimensions of the proposed three-branch framework.

Branch	Input Shape	Output Shape	Key Feature
CNN	$2 \times 128$	512	Extracts multi-scale time–frequency features from the IQ signals.
BiLSTM	$128 \times 2$	512	Models temporal dependencies and modulation periodicity.
ViT	$1 \times 64 \times 64$	512	Captures global constellation structures via Transformer attention.
PAM + FC1 + FC2	$3 \times 512 \to 512$	C (classes)	Fuses multi-branch features adaptively and performs classification.

Table 5. Performance comparison of different models.

Method	Overall Accuracy	Macro-F1	Kappa	Time (s/Epoch)	Epochs
Proposed	0.590	0.592	0.574	190	65
MCLDNN	0.561	0.562	0.549	162	50
MCNET	0.566	0.565	0.551	81	75
CGDNET	0.505	0.510	0.483	203	45
ResNet	0.535	0.537	0.518	132	80
IC-AMCNET	0.582	0.584	0.566	104	70

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Multi-Channel Physical Feature Convolution and Tri-Branch Fusion Network for Automatic Modulation Recognition

Abstract

1. Introduction

2. Signal Model and Task

3. The Proposed Method

3.1. IQ Signal Processing for Multi-Channel CNN

3.2. Multi-Channel CNN

3.3. Vision Transformer

3.4. BiLSTM

4. Three-Branch Fusion Classification Method

4.1. Overall Framework Description

4.2. The Path Attention Module and Classifier

4.3. Loss Function

4.4. Data Flow and Representation

5. Experiments and Results

5.1. Dataset Description

5.2. Experimental Details

5.3. Results and Analysis

5.4. Comparison of Methods

5.5. Ablation Experiment

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics