1. Introduction
Radar plays a crucial role in contemporary warfare as a real-time information acquisition device. To prevent hostile radars from detecting, tracking, or imaging targets, numerous radar jamming techniques have been developed [1,2]. Efficient radar jamming recognition not only guides the radar anti-jamming strategy but is also a prerequisite for radar survival in an increasingly complex electromagnetic environment. Research on radar jamming recognition technology therefore has vital military significance and practical value.
Radar jamming recognition is currently the subject of extensive research both domestically and overseas. Traditional jamming recognition methods manually design features of the radar jamming signal in the time, frequency, and time–frequency domains, and then implement recognition with classification techniques such as threshold matching, support vector machines (SVMs) [3], and decision trees [4]. For example, the authors of [5] identified the radar jamming mode by extracting time-domain kurtosis ratio, moment kurtosis coefficient, envelope undulation, and spectral similarity coefficient features and comparing them against thresholds in a fixed order. The effectiveness of traditional methods relies on the researcher's subjective design of jamming signal features, and deep abstract features of jamming signals are difficult to extract by hand; traditional methods therefore struggle to cope with increasingly complex jamming patterns and compound jamming scenarios.
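As a toy illustration of this feature-and-threshold pipeline (a simplified sketch; the feature definitions and the threshold value below are assumptions for illustration, not the exact quantities used in [5]), time-domain kurtosis and envelope undulation already separate a noise-like signal from a constant-envelope tone:

```python
import numpy as np

rng = np.random.default_rng(0)

def time_domain_features(x):
    """Two illustrative time-domain features of a jamming signal:
    kurtosis (peakedness of the amplitude distribution) and envelope
    undulation (normalized standard deviation of |x|)."""
    x = x - x.mean()
    kurtosis = np.mean(x**4) / np.mean(x**2) ** 2  # 4th moment / variance^2
    env = np.abs(x)
    env_undulation = env.std() / env.mean()
    return kurtosis, env_undulation

# Gaussian noise jamming vs. a constant-envelope sinusoid
noise = rng.standard_normal(4096)
tone = np.cos(2 * np.pi * 0.1 * np.arange(4096))

k_noise, _ = time_domain_features(noise)
k_tone, _ = time_domain_features(tone)

# Gaussian noise has kurtosis ~3, a pure sinusoid ~1.5, so a simple
# hand-picked threshold (2.25, assumed here) separates the two classes.
label = "noise-like" if k_noise > 2.25 else "tone-like"
```

The fragility of such hand-picked thresholds under compound jamming is precisely what motivates the learned-feature approaches discussed next.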
In recent years, a number of new radar jamming recognition methods have emerged. Owing to deep learning's strong capacity to automatically learn features from input data, it has been extensively applied to digital image and signal recognition [6]. A radar jamming signal can be converted from a 1D time-domain signal into the 2D time–frequency domain by a time–frequency transformation, yielding an image of the jamming signal's time–frequency distribution [7]. Using jamming time–frequency images as input data [8,9,10], deep learning-based image recognition methods are therefore being widely migrated to radar jamming recognition to compensate for the limited feature extraction capability of traditional methods. The authors of [8] used the short-time Fourier transform (STFT) to obtain jamming signal time–frequency images, established a time–frequency image training dataset, and designed a simple convolutional neural network (CNN) to recognize nine kinds of jamming under 0–8 dB jamming-to-noise ratio (JNR) conditions. Ref. [9] also used the STFT to obtain jamming time–frequency images and ran recognition experiments with two convolutional neural networks, AlexNet [11] and VGG-16 [12], achieving significantly higher recognition accuracy than traditional models. The research above demonstrates that implementing jamming recognition with neural network models and time–frequency domain data of the jamming signal is practical and efficient. However, the recognition networks of [8,9] are mature computer vision models that were not designed around the characteristics of time–frequency images; they therefore cannot make full use of the information in time–frequency images and require a large number of training samples. Ref. [10] extracts the real part, imaginary part, modulus, and phase of the jamming time–frequency map to construct multiple datasets and uses an ensemble CNN with weighted voting and transfer learning to achieve excellent recognition performance, even under small-sample training conditions. However, Ref. [10] does not consider the lightweight design of the recognition network: its network requires multi-dimensional data to determine the jamming type, resulting in a complex structure with a large number of parameters, which hinders deployment on actual devices.
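To make the data pipeline concrete, the following minimal sketch (with assumed window length, hop size, and a toy chirp-like signal; real datasets would use carefully chosen STFT parameters and realistic jamming models) converts a 1D signal into the kind of 2D time–frequency magnitude image these networks consume:

```python
import numpy as np

def stft_magnitude(x, n_fft=128, hop=32):
    """Naive short-time Fourier transform: slide a Hann window over the
    signal and take the FFT magnitude of each frame. The resulting 2D
    array is the time-frequency image fed to a recognition network."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.asarray(frames), axis=1))
    return spec.T  # shape: (n_fft//2 + 1 frequency bins, n_frames)

# A linear-frequency-modulated (chirp-like) toy signal, 1024 samples at 1024 Hz
fs = 1024
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * (50 * t + 200 * t**2))  # frequency sweeps 50 -> 450 Hz

img = stft_magnitude(x)
# img is a (65, 29) frequency-by-time image; the chirp shows up as a
# ridge whose peak frequency bin rises from early frames to late frames.
```

Because the image axes are frequency and time rather than arbitrary spatial dimensions, the ridge structure carries strong global context, which is the property the next paragraph argues CNN-only recognizers fail to exploit.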
In addition, there are still some problems with the above deep learning-based radar jamming recognition methods. First, the existing CNN-based studies that take jamming time–frequency images as input do not exploit the global representation of those images. A time–frequency distribution image is distinct from a natural image: it reflects how the jamming signal's frequency changes over time and carries significant global context information. Capturing the global representation of the jamming time–frequency image can therefore enhance the feature extraction ability of the recognition network. However, the structural limitations of CNNs prevent them from fully utilizing this global representation. Convolution is a straightforward and efficient way to obtain local information from images, but it struggles to capture global representations [13]. CNNs must expand the receptive field by continuously stacking convolutional layers and using pooling operations in order to extract global information, a mechanism that leads to a bloated recognition network with significantly increased computation and parameter counts. The Vision Transformer (ViT) [14] can establish long-range dependencies in the input data using self-attention and has an outstanding ability to capture global representations [15]; it is gradually becoming an approach to replace CNNs. However, ViT is heavyweight: the performance gains of ViT-based models come at the cost of increased parameters and latency [16]. Moreover, in contrast to CNNs, the self-attention module in ViT ignores local feature details [13]. Evidently, if the complementary characteristics of CNNs and ViT can be combined for local information and global representation extraction, the performance of jamming recognition networks can be further enhanced by fusing the global representation and local information of time–frequency images. Therefore, this paper proposes a recognition network that fuses the global representation and local information of the jamming time–frequency domain, extracting local information from the jamming time–frequency image with convolutional operations and capturing the global representation with the self-attention mechanism of ViT.
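The structural contrast can be seen in a few lines of NumPy (a generic single-head self-attention sketch with random weights, not the exact JR-TFViT block): one self-attention layer mixes every patch with every other patch, whereas a k×k convolution mixes a patch only with its k² neighbors.

```python
import numpy as np

rng = np.random.default_rng(1)

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention: every output token is a weighted sum
    over ALL input tokens, so one layer already captures global context."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

n_patches, d = 16, 8  # e.g., 16 flattened time-frequency patches, dim 8
x = rng.standard_normal((n_patches, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

out, attn = self_attention(x, Wq, Wk, Wv)
# Every entry of attn is non-zero: each output patch attends to all 16
# input patches, unlike a 3x3 convolution's 9-neighbor receptive field.
```

This global mixing is what lets a single attention layer model the long-range time–frequency structure that a CNN only reaches after many stacked convolution and pooling layers.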
In addition, the jamming recognition task has strict real-time requirements [17], and a recognition network with a large model size and many parameters is difficult to deploy on devices with limited resources and power. Therefore, in this paper, local information is first extracted with fewer convolutional parameters by adjusting the operational mechanism of the convolution. Second, ViT blocks are fused between the convolution modules, which implicitly combines convolutional characteristics throughout the network while handling the global representation. This way of applying ViT models both the local information and the global representation of the input tensor with fewer parameters [16]. Based on the above improvements, this paper proposes a lightweight jamming recognition network that achieves better performance than parameter-heavy CNN networks while using very few parameters.
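To give a sense of the savings available from adjusting the convolution mechanism (the paper's exact design is detailed in Section 3; the depthwise separable factorization below is a common lightweight substitute used here only as an assumed example), compare parameter counts for a single 3×3 layer:

```python
# Parameter counts (ignoring biases) for one 3x3 convolution layer with
# 128 input and 128 output channels.
k, c_in, c_out = 3, 128, 128

# Standard convolution: every output channel filters every input channel.
standard = k * k * c_in * c_out                    # 147,456 parameters

# Depthwise separable: one k x k filter per input channel (depthwise),
# then a 1x1 convolution to mix channels (pointwise).
depthwise_separable = k * k * c_in + c_in * c_out  # 17,536 parameters

ratio = standard / depthwise_separable             # roughly 8.4x fewer
```

The reduction approaches a factor of k² as the channel count grows, which is why such factorizations are popular in resource-constrained deployments.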
The main contributions of this paper are as follows. To address the problem that existing CNN-based radar jamming recognition networks cannot fully utilize the global representation of jamming signals in the time–frequency domain, we propose the JR-TFViT, which fuses the global representation of jamming in the time–frequency domain with local information to improve recognition performance.
To meet the lightweight requirement of the jamming recognition network, the traditional convolutional operation mechanism is adjusted and ViT is fused between the convolutional structures, which significantly reduces the number of parameters in the recognition network.
The rest of the paper is organized as follows.
Section 2 describes the construction method of the radar jamming dataset required for the experiments.
Section 3 presents the principle and details of the proposed JR-TFViT construction.
Section 4 presents the details of the experiments, the experimental results, and the analysis of the results.
Section 5 summarizes the work of this paper and discusses the future research outlook.