2D-WinSpatt-Net: A Dual Spatial Self-Attention Vision Transformer Boosts Classification of Tetanus Severity for Patients Wearing ECG Sensors in Low- and Middle-Income Countries

Tetanus is a life-threatening bacterial infection that is often prevalent in low- and middle-income countries (LMIC), Vietnam included. Tetanus affects the nervous system, leading to muscle stiffness and spasms. Moreover, severe tetanus is associated with autonomic nervous system (ANS) dysfunction. To ensure early detection and effective management of ANS dysfunction, patients require continuous monitoring of vital signs using bedside monitors. Wearable electrocardiogram (ECG) sensors offer a more cost-effective and user-friendly alternative to bedside monitors. Machine learning-based ECG analysis can be a valuable resource for classifying tetanus severity; however, existing ECG signal analysis is excessively time-consuming. Because traditional convolutional neural networks (CNNs) use fixed-size kernel filters, they are limited in their ability to capture global context information. In this work, we propose 2D-WinSpatt-Net, a novel Vision Transformer that contains both local spatial window self-attention and global spatial self-attention mechanisms. The 2D-WinSpatt-Net boosts the classification of tetanus severity in intensive-care settings for LMIC using wearable ECG sensors. The one-dimensional ECG signal is converted into a time series image, the continuous wavelet transform (CWT), which is input to the proposed 2D-WinSpatt-Net. In the classification of tetanus severity levels, 2D-WinSpatt-Net surpasses state-of-the-art methods. It achieves an F1 score of 0.88 ± 0.00, precision of 0.92 ± 0.02, recall of 0.85 ± 0.01, specificity of 0.96 ± 0.01, accuracy of 0.93 ± 0.02 and AUC of 0.90 ± 0.00.


Introduction
The life-threatening infectious disease tetanus is prevalent in low- and middle-income countries (LMIC); although this disease is unusual in high-income countries, it continues to be seen in these settings [1-3]. Tetanus is caused by a bacterium called Clostridium tetani [4]. Despite the availability of tetanus vaccinations and antitoxin for acute treatment, an estimated 213,000 to 293,000 tetanus patients die worldwide each year [5].
Tetanus toxin hinders the transmission of signals at synapses within the central nervous system, leading to painful muscle spasms and stiffness. Cardiovascular system instability occurs in severe cases due to the effect of the toxin on the autonomic nervous system (ANS). Over a period of 2 to 5 days, approximately 50% of patients will advance to severe disease, and, in the absence of treatment, muscle spasms impede the ability to breathe. These patients need strong muscle relaxants to counteract spasms and mechanical ventilation to support breathing. Approximately a quarter of all tetanus patients encounter ANS dysfunction, which causes blood pressure and heart rate instability. This ANS dysfunction is the primary cause of mortality among tetanus patients in facilities equipped with mechanical ventilation. However, managing this condition remains challenging. The detection of severe tetanus at its early stages is extremely helpful, because it allows prompt intervention and facilitates more efficient allocation of resources [6]. The Ablett score is widely used to classify tetanus severity and spans grades 1 to 4 [2], allocating a severity grade based on the impact on the respiratory and cardiovascular systems. Patients experiencing mild or moderate tetanus (grades 1 and 2) can be managed through non-invasive clinical approaches. Patients with severe tetanus (grades 3 and 4) require full intensive care unit (ICU) care, including mechanical ventilation. For patients with severe tetanus (grade 4), extra organ support may be required to address the effects of ANS involvement. The conventional Ablett grading system relies on a combination of clinical characteristics (e.g., tachycardia, fever and hypertension). In clinical settings with high patient volumes or limited staffing experience, achieving precise classification can be challenging.
Advanced continuous monitoring systems in intensive care and high staff-to-patient ratios in high-income countries are associated with enhanced tetanus outcomes [7,8]. However, the cost of delivering ICU treatment is high across all nations, including LMIC. Furthermore, in LMIC, inadequate equipment and limited time are also frequently mentioned as obstacles in delivering superior care to patients affected by tetanus.
In most limited-resource settings, close monitoring and timely emergency treatment are frequently only available in high-dependency wards or ICUs, as these facilities possess the necessary staff and equipment to provide such services. This large burden of additional cases results in suboptimal use of already limited resources and potentially leads to poorer outcomes for individuals in need of intensive care [7,9,10]. Furthermore, numerous patients in LMIC (e.g., Vietnam) bear the out-of-pocket medical expenses, and the additional costs associated with ICU care are considerably higher in comparison to standard ward care. Previous research has provided information about the direct medical costs for ICU patients with tetanus, dengue and sepsis in Vietnam [7,9,10].
Affordable wearable sensors have been suggested as a viable alternative approach for tetanus in settings with limited resources. The wearable sensors operate wirelessly and are small and lightweight. These sensors can provide real-time, continuous monitoring of vital signs, with the aim of facilitating the early detection of patient deterioration [7,11]. Our previous work has shown that electrocardiogram (ECG) monitoring alone can be used to classify the severity of tetanus [12,13]. Using affordable wearable sensors is still challenging due to inherent inaccuracies in the collected continuous physiological data. This is mainly attributed to missing data and the substantial amount of noise generated by various factors, diminishing its reliability [7].
This study employs ECG data obtained from wearable sensors utilised in an ICU in Vietnam and suggests a rapid triage tool, developed through deep learning techniques, to categorize tetanus severity based on the Ablett score. We design a dual self-attention Vision Transformer named 2D-WinSpatt-Net. The proposed 2D-WinSpatt-Net outperforms the previous methods of 1D and 2D convolutional neural networks (CNNs), 2D CNNs with different attention mechanisms (e.g., 2D-CNN + Channel-wise Attention Network [12]), ViT and a hybrid CNN-Transformer Network [13]. We investigate time series imaging, specifically the continuous wavelet transform (CWT), as the input for the 2D-WinSpatt-Net. Moreover, we show the difference in generating the CWT and log-spectrogram image based on the tetanus ECG data and discuss why CWT works better in the proposed 2D-WinSpatt-Net. This study provides the following contributions:
The structure of the paper is as follows: Section 2 presents an overview of related work in the tetanus diagnosis in LMIC, time series imaging and machine learning techniques. Section 3 outlines the proposed 2D-WinSpatt-Net network (see Figure 1). Section 4 provides comprehensive information about the collected tetanus dataset, implementation specifics, a comparison of baseline methods and evaluations. Sections 5 and 6 present the results and discuss the experimental findings. Finally, Section 7 delivers the final conclusions drawn from our research.

Related Work
The severity level of tetanus infection is linked to the functioning of the ANS [12]. Heart rate variability (HRV) measures the fluctuations in the time intervals between consecutive heartbeats (RR intervals). The HRV variations are regulated by the ANS and serve as an indicator of ANS activity [12]. Alterations in conventional HRV parameters obtained from electrocardiography (ECG) have been demonstrated to be associated with the severity of tetanus infection. To classify tetanus severity, HRV-based methods require an additional pre-processing stage for extracting RR intervals and the QRS complex [14-17]. The conventional techniques for detecting HRV require high-cost equipment and expertise, making them usually inaccessible in ICUs or limited-resource settings. Van et al. [7] suggested extracting RR intervals from tetanus ECG data using wearable devices. However, it remains a persistent challenge to reliably extract accurate RR intervals [18].
Healthcare has undergone a profound transformation with artificial intelligence, encompassing machine learning (ML) and deep learning (DL) techniques [19]. Conventional ML methods require the manual extraction of features. For instance, RR intervals are extracted from the dataset [20]. The support vector machine (SVM) is applied to automatically identify the degree of ANS dysfunction in tetanus [21]. DL methods have demonstrated superior performance compared to traditional machine-learning techniques such as SVM [18]. The experimental results of previous research were limited because of the small datasets, which contain synchronised physiological data obtained from a group of 10 patients diagnosed with tetanus [18,21] and PPG data collected from 19 tetanus patients [19]. In our most recent study [12,13], ECG data were collected from 110 tetanus patients, using the low-cost wearable monitor.
Time series imaging is a technique that converts temporal data into visual representations, commonly employed in 2D convolutional neural networks (CNNs) for classification purposes [12,13,18,22-25]. Time series imaging can be a Gramian angular field, recurrence plot, spectrogram or continuous wavelet transform [26]. One-dimensional (1D) CNNs have been utilised in various biomedical signal processing tasks, including the classification of biomedical data and the early detection of medical conditions [27,28]. However, an image-based ECG signal classification structure using time series imaging (2D spectrograms) surpasses the performance of traditional 1D CNN models [29]. Utilising spectrograms, transfer learning and a combination of ECG and PPG data, researchers have successfully classified the severity of two infectious diseases: HFMD and tetanus [18]. Lu et al. [12,13] suggested representing the ECG signal with a logarithmic spectrogram. Results showed that the image-based ECG signal classification networks, the 2D-CNN-Transformer/8 [13] and the 2D-CNN + Channel-wise Attention Network [12], achieve better performance than the 1D CNN.
The Transformer [30] is notable for capturing global or long-range dependencies through parallel self-attention mechanisms. This has proven to be highly effective in a wide range of natural language processing (NLP) tasks. The achievements observed in the field of NLP using the Transformer model have inspired researchers to explore its application in the domain of computer vision [31]. Vision Transformer (ViT) [32] is an extension of the Transformer that achieved state-of-the-art performance in image classification. An input image is split into a set of 16 × 16 non-overlapping image patches, named visual tokens. Next, these patches are combined with positional encoding and fed into transformer blocks to capture global relationships for the classification. Multiple variations of Vision Transformers (ViTs) have been proposed with the aim of enhancing performance in vision tasks. For instance, Swin Transformer is a hierarchical ViT using shifted windows [33], which achieved better performance than the ViT and CNN-based architectures. Data-efficient image Transformer (DeiT) [34] employs knowledge distillation for image classification [35]. TNT [36] processes the relationship between sub-patches via an inner transformer block and captures the interconnections among patch-level embeddings via an outer transformer block.
Transformers have emerged as a significant breakthrough in the field of computer vision and image analysis [13,37-41]. Our previous work [13] is the initial implementation of a transformer-based method for categorizing tetanus severity levels, which can help to triage patients wearing ECG sensors quickly in LMIC. The hybrid CNN-Transformer Network [13] is inspired by transformers on audio spectrograms [42-44]. Tetanus ECG is represented by a log-spectrogram. The ViT Encoder is employed in this hybrid CNN-Transformer Network. Transformers have great potential for stratifying tetanus severity levels, which has not been fully investigated. Hence, we need to explore further methodology based on Transformers.

Data Preprocessing
During the pre-processing step, the crucial objective is to denoise an ECG signal. There are two primary types of noise, low-frequency noise [45] and high-frequency noise [45], which disturb the ECG signal analysis. The presence of low-frequency noise arises from patient muscle movement, while high-frequency noise stems from the electrical source that powers the ECG monitor.
In this study, we obtain one-lead ECG signals from an affordable wearable monitor. To enhance the data quality, we employ a Butterworth filter to eliminate background noise and refine the signals. The high-pass filter is set at a cutoff frequency of 0.05 Hz, while the low-pass filter is set at a cutoff frequency of 100 Hz. We utilise the SciPy package [46] to implement the data preprocessing step.
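As a minimal sketch of this filtering step, the snippet below applies a 0.05 Hz high-pass and a 100 Hz low-pass Butterworth stage with SciPy at the 256 Hz sampling rate reported for the ePatch. The filter order, the use of second-order sections and the zero-phase (forward-backward) filtering are assumptions made for illustration; the text does not specify them, and `denoise_ecg` is a hypothetical helper name.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 256.0  # ePatch sampling rate in Hz, as reported in the dataset description


def denoise_ecg(ecg, fs=FS, low_hz=0.05, high_hz=100.0, order=2):
    """Apply high-pass and low-pass Butterworth stages to a 1-D ECG trace.

    order=2 and zero-phase filtering are illustrative assumptions.
    """
    ecg = np.asarray(ecg, dtype=float)
    # Second-order sections stay numerically stable for the very low 0.05 Hz cutoff.
    sos_hp = butter(order, low_hz, btype="highpass", fs=fs, output="sos")
    sos_lp = butter(order, high_hz, btype="lowpass", fs=fs, output="sos")
    # sosfiltfilt filters forwards and backwards, so no phase shift is introduced.
    return sosfiltfilt(sos_lp, sosfiltfilt(sos_hp, ecg))
```

With these cutoffs the constant offset and very slow drift are suppressed, while the band up to 100 Hz is kept.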

Continuous Wavelet Transform
In this work, we visualised the ECG waveform in its time-frequency representation using the (discretised) continuous wavelet transform (CWT). CWT is a technique employed to assess the similarity between a signal and an analysing function, enabling a refined depiction of the signal's time-frequency characteristics [47,48], as compared to computing a spectrogram with consecutive Fourier transforms over windowed time. The CWT of a discrete time signal, $x_n$ ($n = 0, \ldots, N-1$), with a constant sampling period, $\delta_t$, can be expressed as the outcome of convolving $x_n$ with a mother wavelet that has been scaled and translated:

$$W_n(s) = \sum_{n'=0}^{N-1} x_{n'} \, \psi^{*}\!\left[\frac{(n'-n)\,\delta_t}{s}\right]$$
where $(^{*})$ denotes the complex conjugate, $s$ is the wavelet scaling factor and $n$ is the localised time index. The subscript 0 on $\psi$ has been dropped to indicate that $\psi_0$ has been multiplied by $(\delta_t/s)^{1/2}$, in order to normalise $\psi$ to have unit energy. This ensures that the wavelet transforms, $W_n(s)$, at each scale, $s$, are directly comparable to each other and to the transforms of other time series; see [49]. In this work, we used a Morlet mother wavelet, which has previously been shown to effectively capture the morphology of various biomedical signals, including ECG [26,50-52]. A Morlet wavelet consists of a plane wave modulated by a Gaussian:

$$\psi_0(\eta) = \pi^{-1/4} \, e^{i\omega_0\eta} \, e^{-\eta^2/2}$$

where $\eta$ is a non-dimensional time parameter and $\omega_0$ is the non-dimensional frequency, here taken to be 6, as per [49], to satisfy the admissibility condition. The total signal energy at a specific scale can be measured by the scale-dependent energy density spectrum, $E_s$:

$$E_s = \sum_{n=0}^{N-1} |W_n(s)|^2$$

where $N$ is the number of time samples, $s \in [1, S]$ and $|W_n(s)|^2$ represents the scalogram, a 2-D wavelet energy density that captures and quantifies the complete energy distribution of the signal. The frequency, $f$, in Hz, can be approximated from the wavelet scaling factor, $s$, such that [53]:

$$f = \frac{f_c}{s\,\delta_t}$$

where the center frequency in Hz can be defined by [49]:

$$f_c = \frac{\omega_0 + \sqrt{2 + \omega_0^2}}{4\pi}$$

Figure 2 shows an example of a tetanus ECG (in 5 s) and the subsequent time-frequency resolution using CWT.

MLP. The MLP means multi-layer perceptron or multiple fully connected layers, which can be described as $\mathrm{MLP}(X) = \mathrm{FC}(\sigma(\mathrm{FC}(X)))$, where $\sigma(\cdot)$ represents the GELU activation function [54] and FC means a fully-connected layer.
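Returning to the Morlet CWT defined earlier in this section, the following NumPy sketch computes a scalogram via the FFT, following the Fourier-domain formulation commonly used for this wavelet. It is an illustration under stated assumptions rather than the pipeline used in our experiments: `morlet_scalogram` is a hypothetical helper, the scales are assumed to be dimensionless (in samples, so the time scale is $s\,\delta_t$), and the FFT implies circular boundary handling.

```python
import numpy as np


def morlet_scalogram(x, fs, scales, w0=6.0):
    """Return |W_n(s)|^2 with shape (len(scales), len(x)) and the frequency in Hz per scale.

    Assumption: `scales` are dimensionless (in samples); the time scale is s * dt.
    """
    x = np.asarray(x, dtype=float)
    scales = np.asarray(scales, dtype=float)
    dt = 1.0 / fs
    # Angular frequency of every FFT bin (negative bins included).
    omega = 2.0 * np.pi * np.fft.fftfreq(x.size, d=dt)
    x_hat = np.fft.fft(x)
    s_time = scales[:, None] * dt  # scales in seconds, shape (S, 1)
    # Fourier transform of the Morlet wavelet; zero for non-positive frequencies.
    psi_hat = (np.pi ** -0.25) * np.exp(-0.5 * (s_time * omega - w0) ** 2) * (omega > 0)
    # Per-scale normalisation so that each scaled wavelet has unit energy.
    psi_hat = psi_hat * np.sqrt(2.0 * np.pi * s_time / dt)
    # Multiplication in the Fourier domain is (circular) convolution in time.
    coeffs = np.fft.ifft(x_hat * psi_hat, axis=1)
    f_c = (w0 + np.sqrt(2.0 + w0 ** 2)) / (4.0 * np.pi)
    freqs = f_c / (scales * dt)
    return np.abs(coeffs) ** 2, freqs
```

Summing the returned power over the time axis gives the scale-dependent energy $E_s$; for a pure sinusoid, the scale with the largest energy maps to a frequency close to that of the sinusoid.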
LN. Layer normalization [55] enhances the stability of hidden state dynamics within the training network, resulting in expedited training time and improved convergence. The equation is given by

$$\mathrm{LN}(x) = \frac{x - \mu}{\delta} \circ \gamma + \beta$$

where $\mu$ and $\delta$ are the average value and standard deviation of the elements in $x$, $\gamma$ and $\beta$ are learnable parameters and $\circ$ represents the element-wise dot.

W-MSA. The attention is calculated within each window, which is different from the standard MSA. In our previous work [13], we chose the standard MSA to compute global self-attention. The global self-attention considers the relationship between each patch in an image. Each patch is compared to all other patches in an image. However, the computational cost increases remarkably when the size of the image grows. If the window size is fixed, the complexity of window-based MSA is linear in the number of patches, which is determined by the size of the image.
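In a local window with $M \times M$ patches, a group of relative position biases $B_i \in \mathbb{R}^{M^2 \times M^2}$, $i \in \{1, \ldots, n_{win}\}$, is added to compute the similarity of each head of W-MSA. In the $i$-th local window, the W-MSA can be described as

$$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{SoftMax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d}} + B_i\right) V_i$$

where $Q_i, K_i, V_i \in \mathbb{R}^{M_i^2 \times d}$ are queries, keys and values, and $M_i^2$ is the number of patches in the $i$-th local window. The scale factor $1/\sqrt{d}$ leads to stable gradients.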
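To make the window-restricted computation concrete, the sketch below partitions a grid of patch tokens into non-overlapping windows and applies single-head scaled dot-product attention inside each window. It is a simplified NumPy illustration, not our PyTorch implementation: the function names, the single head, the random projection matrices used in the usage note and the zero bias are all assumptions.

```python
import numpy as np


def window_partition(tokens, m):
    """Split an (H, W, d) grid of patch tokens into (n_win, m*m, d) non-overlapping windows."""
    h, w, d = tokens.shape
    assert h % m == 0 and w % m == 0, "grid must tile evenly into m x m windows"
    t = tokens.reshape(h // m, m, w // m, m, d).transpose(0, 2, 1, 3, 4)
    return t.reshape(-1, m * m, d)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def window_attention(windows, w_q, w_k, w_v, bias):
    """Single-head attention computed independently inside each window."""
    q, k, v = windows @ w_q, windows @ w_k, windows @ w_v  # each (n_win, m*m, d)
    d = q.shape[-1]
    # Similarity scores are scaled by 1/sqrt(d) and shifted by the position bias.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d) + bias  # (n_win, m*m, m*m)
    return softmax(scores) @ v
```

For a 224 × 224 image with 4 × 4 patches the grid is 56 × 56, so `window_partition(tokens, 7)` yields 64 windows of 49 tokens each; changing the tokens of one window leaves the outputs of all other windows untouched, which is exactly the missing cross-window connection discussed in the text.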

Local Spatial Window Attention
Given an input CWT image $x \in \mathbb{R}^{W \times H \times C}$, where $C$ represents the channel quantity and $W$ and $H$ indicate the width and height of the feature map, we first split $x$ into flattened non-overlapping patches. Each patch is treated as a "token" and its feature is set as a concatenation of the raw pixel RGB values. The raw-valued feature undergoes a linear embedding process that maps it to a vector of an arbitrary dimension, $D$. Secondly, the windows are organised to partition these image patches evenly. The local spatial window-based attention works on these patch tokens, and it is calculated within each local window. In our work, we use a patch size of 4 × 4. The window-based attention module maintains the number of tokens ($\frac{H}{4} \times \frac{W}{4}$). After the embeddings, we employ $L$ transformer layers. The output of the $l$-th layer is as follows:

$$\hat{m}^{l} = \text{W-MSA}\left(\mathrm{LN}\left(m^{l-1}\right)\right) + m^{l-1}$$

$$m^{l} = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{m}^{l}\right)\right) + \hat{m}^{l}$$

where $\hat{m}^{l}$ and $m^{l}$ denote the output features of the W-MSA module and the MLP module for the $l$-th layer, each applied after the LN operation. In our implementation, one transformer layer can achieve an optimal result.
We assume each local window contains $M \times M$ patches; the computational complexities of a global MSA module and a window-based W-MSA module on an image of $h \times w$ patches are as follows:

$$\Omega(\mathrm{MSA}) = 4hwC^2 + 2(hw)^2 C$$

$$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2 hwC$$

where the global self-attention computation is quadratic in the patch number, $hw$, and the window-based W-MSA is linear when $M$ is fixed (the default value is 7). Global self-attention computation is too expensive for a large $hw$, while the window-based self-attention offers scalability. The W-MSA computation is reduced compared to standard global MSA. However, the window-based self-attention module does not have connections across windows, which forfeits the capacity to model the global information. In order to deal with this challenge, we propose a global spatial attention after the local spatial window attention module.
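The difference between the two cost expressions can be checked numerically. The short sketch below evaluates both for a hypothetical setting (a 56 × 56 patch grid, as obtained from a 224 × 224 image with 4 × 4 patches, and an assumed channel width of C = 96, which is not a value taken from our configuration).

```python
def msa_cost(h, w, c):
    """Global MSA cost on an h x w patch grid with C channels: 4hwC^2 + 2(hw)^2 C."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c


def wmsa_cost(h, w, c, m=7):
    """Window-based MSA cost with M x M windows: 4hwC^2 + 2 M^2 hw C."""
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c


# Hypothetical setting: 56 x 56 tokens, C = 96 channels (illustrative only).
ratio = msa_cost(56, 56, 96) / wmsa_cost(56, 56, 96)
print(f"global / windowed cost ratio: {ratio:.1f}")
```

Doubling the grid side quadruples the number of patches; the windowed cost then grows by exactly a factor of four, whereas the attention term of the global cost grows by a factor of sixteen.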

Global Spatial Attention
Simple and effective attention modules, such as the squeeze and excitation (SE) block [56] and the convolutional block attention module (CBAM) [57], have been designed to boost the performance of convolutional neural networks (CNNs). Inspired by these attention modules in CNNs, we build a spatial attention map based on the inter-spatial relationship of the output features of the local spatial window attention. To calculate the spatial attention, we initially perform global average-pooling, $F_{AvgPool}(m^{l})$, and global max-pooling, $F_{MaxPool}(m^{l})$, operations along the channel axis and then concatenate them to produce an efficient feature vector. Next, we generate a spatial attention map, $M_{spatial}(m^{l})$, using a convolution layer on the concatenated feature vector:

$$M_{spatial}(m^{l}) = \sigma\left(f^{7 \times 7}\left(\left[F_{AvgPool}(m^{l}); F_{MaxPool}(m^{l})\right]\right)\right)$$

where $\sigma$ represents the sigmoid function and $f^{7 \times 7}$ means a convolution operation with the filter size of 7 × 7.
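The spatial attention map described above can be sketched in NumPy as follows. This is an illustration in the style of CBAM, not our trained module: `spatial_attention` is a hypothetical helper, the kernel weights are supplied by the caller rather than learned, zero padding is assumed, and rescaling the input features by the map is an assumption borrowed from CBAM.

```python
import numpy as np


def spatial_attention(feat, kernel, bias=0.0):
    """CBAM-style spatial attention on a (C, H, W) feature map.

    `kernel` has shape (2, k, k): one k x k filter per pooled map (k = 7 in the text).
    Returns the re-weighted features and the (H, W) attention map.
    """
    _, h, w = feat.shape
    # Pool along the channel axis, then stack the two maps: (2, H, W).
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding keeps H x W
    logits = np.full((h, w), float(bias))
    for i in range(k):
        for j in range(k):
            # Cross-correlation, which is what deep-learning "convolution" layers compute.
            logits += (kernel[:, i, j, None, None] * padded[:, i:i + h, j:j + w]).sum(axis=0)
    attn = 1.0 / (1.0 + np.exp(-logits))  # sigmoid, values in (0, 1)
    return feat * attn, attn
```

Because the map has a single channel, every feature channel at a given spatial position is scaled by the same weight, which lets information from the whole token grid influence each location.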

Experiments

ECG Data from Tetanus Patients
The collection of tetanus data has obtained approval from both the Oxford Tropical Research Ethics Committee and the Ethics Committee of the Hospital for Tropical Diseases. This dataset is obtained from the Hospital for Tropical Diseases, located in Ho Chi Minh City, Vietnam. This tetanus dataset was published in 2021 [7].
For our study, we utilised ECG data obtained from patients diagnosed with tetanus. The ePatch, a low-cost wearable monitor, was chosen as the monitoring device (ePatch V.1.0, BioTelemetry, Malvern, PA, USA) (see Figure 1). The 7 g ePatch (ePatch. https://www.philips.co.uk/healthcare/resources/landing/epatch, accessed on 1 September 2023) sensor was securely pressed onto the patient's chest skin, ensuring firm adhesion. The ePatch device captures ECG readings in two channels at a sampling rate of 256 Hz.
The two channels (channel 1 and 2) of the ePatch device are not directly correlated with the conventional bedside monitor's ECG leads 1 and 2 [13]. The recorded continuous ECG data were stored within the ePatch and later exported upon completion of the recording period. The study focused on adult tetanus patients (age ≥ 16 years), who were admitted to the ICU at the Hospital for Tropical Diseases in Ho Chi Minh City. Collection of vital-sign monitoring data included the recording of two approximately 24-h ECG datasets: one on the day of enrolment (1st day in the ICU) and another on the 5th day of hospitalization. For our experiment, we only utilised ECG signals captured from channel 1 of the ePatch device. To ensure signal stability, we trimmed the initial and final five minutes of each ECG recording [7].
The dataset comprises a total of 178 ECG waveform example files collected from 110 patients during their enrolment and on the 5th day of hospitalization (referred to as days 1 and 5). To ensure data separation, the dataset is divided into training, validation and test sets in a ratio of 141/19/18, respectively. Importantly, data from the same patient never appear in more than one set. The time-series ECG waveform is divided into a sequence of non-overlapping ECG samples, with each window length set to 20 s. This duration is shorter than the 60-s window length used in our previous work [12,13].
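The trimming and windowing described above can be sketched as follows. `segment_ecg` is a hypothetical helper; the defaults follow the values given in the text (256 Hz sampling, five minutes trimmed from each end, 20-s non-overlapping windows), and dropping a trailing partial window is an assumption.

```python
import numpy as np


def segment_ecg(ecg, fs=256, trim_s=300, win_s=20):
    """Trim both ends of a recording and cut it into non-overlapping windows.

    Returns an array of shape (n_windows, win_s * fs); a trailing partial
    window is discarded (an illustrative choice).
    """
    ecg = np.asarray(ecg)
    trim = int(trim_s * fs)
    core = ecg[trim: ecg.size - trim]  # drop the first and last five minutes
    win = int(win_s * fs)
    n_win = core.size // win
    return core[: n_win * win].reshape(n_win, win)
```

A 15-minute recording, for example, keeps its central 5 minutes and yields 15 windows of 5120 samples each.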

Implementation Details
Pre-processing. From each ECG example file, we selected 30 ECG time series, each lasting 20 s. Consequently, the training set contains a total of 4230 (141 × 30) ECG continuous wavelet transform (CWT) samples, comprising 2370 samples of mild disease and 1860 samples of severe disease. The validation set consists of 540 ECG CWT samples (18 × 30), with 270 samples representing mild disease and 270 samples representing severe disease. Similarly, the test set includes 570 ECG CWT samples (19 × 30), with 360 samples denoting mild disease and 210 samples representing severe disease. The labelling of mild and severe tetanus cases was performed by clinicians at the Hospital for Tropical Diseases. For a detailed overview, please refer to Table 1. Based on our previous experiments [13], we employed a resizing and stitching process to transform the CWT into a square image format. The resulting square CWT was saved as a JPG image file, utilizing the default 'hsv' colour map. This square CWT image has a resolution of 224 × 224 pixels, capturing the CWT over every 20 s of ECG data. These processed CWT images are then utilised as input for the proposed 2D-WinSpatt-Net architecture (refer to Figure 3).
The model is trained for 100 epochs, employing the Adam optimizer with a learning rate set at 0.001 and a batch size of 32. The torch.nn.BCEWithLogitsLoss is selected as the loss function. The implementation of the suggested 2D-WinSpatt-Net was carried out in Python 3.7 using PyTorch. The experiments were conducted on computational hardware consisting of the NVIDIA RTX A6000 48GB GPU.
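For readers unfamiliar with the loss named above, the following NumPy sketch reproduces the quantity that binary cross-entropy with logits computes under default settings (a sigmoid fused with binary cross-entropy, averaged over the batch). It is an illustrative re-implementation, not the PyTorch code; `bce_with_logits` is a hypothetical name, and the labels are assumed to be 1 for severe and 0 for mild tetanus.

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)), which equals
    -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))] without overflow.
    """
    x = np.asarray(logits, dtype=float)
    y = np.asarray(targets, dtype=float)
    loss = np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return loss.mean()
```

Working on logits rather than probabilities avoids taking the logarithm of values that have been rounded to 0 or 1 by the sigmoid.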

Baselines
In our work, we compare the proposed method, 2D-WinSpatt-Net, with six different baseline methods. These baseline methods encompass five 2D deep learning approaches, namely 2D-CNN, 2D-CNN + Dual Attention, 2D-CNN + Channel-wise Attention [12], 2D-CNN-Transformer/8 [13] and Swin Transformer, alongside a 1D-CNN method. Furthermore, we evaluate the performance of the proposed 2D-WinSpatt-Net by employing two different types of time series imaging, namely log-spectrogram and continuous wavelet transform (CWT), as input representations.

Evaluation Metrics
In this study, we employed several performance metrics to assess the effectiveness of the binary classification task. These metrics include F1-score, precision, recall, specificity, accuracy [18] and the area under the curve (AUC) [58]. To ensure robustness, each model was executed five times, and the average and standard deviation of the performance metrics were computed and reported using an independent test dataset. A higher AUC indicates superior model performance in accurately distinguishing between severe and mild cases of tetanus.
The F1-score is a metric that quantifies the balanced combination of precision and recall. A higher F1-score indicates better model performance in both precision and recall in classification tasks. The F1-score is defined by the following formula:

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

The precision rate calculates the percentage of true positives among the data that the model predicted as positive. Recall rate represents the model's ability to correctly identify all positive cases in the data, and it is also called the sensitivity rate. Precision rate is often reported with the recall rate, both useful in evaluating how precisely a method predicts the true positive labels. True positive (TP) refers to the accurate prediction of severe tetanus cases, while true negative (TN) signifies the correct identification of mild tetanus cases. False positive (FP) refers to the instances where mild tetanus is inaccurately classified as severe tetanus, while false negative (FN) refers to cases where severe tetanus is inaccurately classified as mild tetanus. Precision and recall are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

Specificity measures the percentage of true negatives that a model correctly classifies out of all the negative instances in the data. The specificity is defined as:

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

The accuracy rate measures the proportion of classifications that a method generates correctly among all the instances in the data, regardless of the specific type of error (false positives or false negatives). The accuracy rate is defined as:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
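These definitions translate directly into code. The sketch below computes the five count-based metrics from confusion-matrix entries, with severe tetanus as the positive class; `binary_metrics` is a hypothetical helper and the counts in the usage line are invented for illustration, not results from our experiments.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute precision, recall, specificity, accuracy and F1 from confusion counts.

    Positive class = severe tetanus, negative class = mild tetanus.
    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Invented counts for illustration only (not taken from the reported results).
example = binary_metrics(tp=180, fp=20, tn=340, fn=30)
```

The AUC is not included because it depends on the ranking of the predicted scores rather than on a single set of thresholded counts.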

Window-Based Self-Attention Module Selection
The proposed 2D-WinSpatt-Net method is inspired by the Swin Transformer [33,59]. Our method does not have a shifting window partition operation, which is different from the Swin Transformer. The core concept of the Swin Transformer involves the dynamic displacement of the window partition between successive self-attention blocks. Table 2 shows how the proposed 2D-WinSpatt-Net outperforms Swin Transformer V2 [59]. The AUC and the accuracy of the proposed 2D-WinSpatt-Net increase by 4% and 4%, respectively, compared to Swin Transformer V2.
Table 2. A quantitative analysis of the proposed 2D-WinSpatt-Net, utilizing resized and stitched continuous wavelet transform (CWT) with a 20-s window duration as input, compared to baseline methods that employ either resized and stitched log-spectrograms with a 60-s window duration or original 60-s window length ECG as input. The results are displayed as mean ± standard deviation, with the highest performance emphasised in bold.

Figure 4 and Table 3 show the comparison of two methods. One method uses the local spatial window attention module only (abbreviated to window attention). The other is the proposed 2D-WinSpatt-Net, using a combination of the local spatial window attention module and the global spatial attention module. The suggested 2D-WinSpatt-Net works better than the window attention alone.

Different Attention Methods
The attention module is built based on window attention. In our experiments, we test different attention methods. Figure 5 shows details of the different attention modules built on window attention. Table 4 shows that the proposed 2D-WinSpatt-Net achieves better performance compared to other attention methods.

Comparisons
We compare the introduced 2D-WinSpatt-Net with six different DL techniques, including one- and two-dimensional convolutional neural networks. In light of the experimental outcomes presented in Table 2, the image-based 2D-WinSpatt-Net method using CWT as input achieves the best performance in diagnosing tetanus. The proposed 2D-WinSpatt-Net works better than our previous 2D-CNN-Transformer/8 method [13].

Time Series Imaging
Figure 6 shows two types of time series imaging that are used as input in our 2D DL methods. We also compare the 2D-WinSpatt-Net using two different time series images as input. Table 2 shows that the proposed 2D-WinSpatt-Net using resized and stitched CWT as input outperforms the same network using resized and stitched log-spectrograms as input. The shorter resized and stitched CWT (20 s) as input achieves better performance than the resized and stitched log-spectrogram (60 s).

Relation to Swin Transformer
We make a comparison with one representative baseline method, Swin Transformer V2 [59]. Both the proposed 2D-WinSpatt-Net and Swin Transformer V2 use window attention as an element of the network. The key concept of the Swin Transformer [33,59] is to shift the window partition between consecutive self-attention blocks. This approach, however, is not employed in the 2D-WinSpatt-Net. Table 2 shows that the proposed 2D-WinSpatt-Net achieves better performance, using resized and stitched CWT inputs with a window length of 20 s.

Discussion
The proposed 2D-WinSpatt-Net is a novel transformer-based method. It captures both local and global attention information at the image patch token level. The local and global ECG information of the CWT boosts the classification of the tetanus severity level, which works better than our previous work, the 2D-CNN + Channel-wise Attention [12] and the 2D-CNN-Transformer/8 [13]. Moreover, the proposed 2D-WinSpatt-Net using resized and stitched 20-s window length CWT as input outperforms 2D-CNN and 2D-CNN + Dual Attention using resized and stitched 60-s window length log-spectrograms as input. Furthermore, the proposed 2D-WinSpatt-Net (imaging in machine learning (ML)) beats 1D-CNN (non-imaging in ML). In addition, it outperforms a traditional ML method, Random Forest, using HRV time domain features [13].
We believe that the work presented here is the first to explore the benefits of using CWT-based transforms of wearable ECG signals as inputs for tetanus severity classification. Our results indicated that richer ECG time-frequency information could be captured using CWTs as compared to log-spectrograms, which significantly boosted downstream tetanus severity classification for all models explored. The CWT can provide a more informative time-frequency representation than the short-time Fourier transform (STFT), which is computed when generating a spectrogram. Without the need to coarsely window the signal, and thereby easing the fixed time-frequency trade-off that the uncertainty principle imposes on the STFT, the CWT can obtain dynamic time-frequency resolutions directly from the entire ECG sequence by decomposing the signal into varying scales over time. For more information on the benefits of utilising CWT representations with respect to ECG classification, we refer the reader to Wang et al. (2021) [48] and Al et al. (2018) [60].
We believe that the encoding of the key characteristics of our ECG signals in the time-frequency domain was better captured by the CWT methodology and therefore yielded a richer representation to our downstream 2D-WinSpatt-Net model. For example, the log-spectrogram, using the STFT, provides a uniform view of the time-frequency space, using a fixed window size, leading to a constant time-frequency resolution across all frequencies. In contrast, CWT is advantageous in that it provides a multi-resolution analysis, yielding good time resolution at high frequencies and good frequency resolution at low frequencies, which is achieved by varying the width of the wavelet. For instance, the QRS complex is a high-frequency event that lasts for a short duration, while the T-wave is a lower-frequency event spread over a longer time. Furthermore, ECGs are non-stationary signals; we are interested in characterising non-stationary properties, such as heart rate variability (HRV), morphological variations, baseline wander and artefacts. Due to this adaptability in resolution, the CWT can be better at handling non-stationary signals and can also be better at detecting transient events in an ECG signal, such as P-waves or T-waves, especially when these events exist at different scales. Finally, the adaptability of the chosen wavelet often results in better removal of common edge effects in ECG as compared to STFT, for example, when analysing ECG signals where beginnings and endings (or abrupt changes) carry significant information which we want to characterize.
We use the discretised version of the CWT so that it can be implemented in a computational environment. Given its redundant nature, the CWT (especially in its discretised form) is preferred for signal analysis tasks where precise time-frequency localisation is crucial, such as the detection of transient features in our ECG signal. The CWT offers dense sampling in both the time and frequency domains, making it ideal for visualising our ECG signal and preserving features that might be missed or inadequately represented by the coarser, dyadic scales of the discrete wavelet transform (DWT). Furthermore, the continuous nature of the CWT allows for flexibility in choosing scales, which can be important as we are interested in visualising features that do not necessarily align with dyadic or other discrete scale sets.
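As an illustration of a discretised CWT evaluated on an arbitrary, non-dyadic set of scales, the sketch below implements a Morlet-wavelet transform with NumPy FFTs. The wavelet choice, the centre frequency `w0 = 6`, the sampling rate and the function name `morlet_cwt` are assumptions made for this example. The wavelet and settings actually used to produce the CWT images in this work are not restated here.

```python
import numpy as np

def morlet_cwt(signal, freqs_hz, fs, w0=6.0):
    """Discretised CWT with an analytic Morlet wavelet, computed via the FFT.

    Returns one row of complex coefficients per requested frequency.
    """
    n = len(signal)
    omega = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / fs)   # angular frequency (rad/s)
    signal_hat = np.fft.fft(signal)
    # Scale (in seconds) whose Morlet passband is centred on each frequency.
    scales = w0 / (2.0 * np.pi * np.asarray(freqs_hz, dtype=float))
    coeffs = np.empty((len(scales), n), dtype=complex)
    for i, s in enumerate(scales):
        # Fourier transform of the wavelet dilated to scale s (positive frequencies only).
        psi_hat = np.pi ** -0.25 * np.exp(-0.5 * (s * omega - w0) ** 2) * (omega > 0)
        # Scale-dependent factor keeps the wavelet energy comparable across scales.
        coeffs[i] = np.fft.ifft(signal_hat * np.sqrt(2.0 * np.pi * s * fs) * psi_hat)
    return coeffs

fs = 250                                   # assumed sampling rate (Hz)
t = np.arange(4 * fs) / fs
toy_signal = np.sin(2 * np.pi * 5.0 * t)   # synthetic stand-in for an ECG segment
freqs_hz = np.arange(1.0, 41.0)            # linearly spaced, i.e. not dyadic
scalogram = np.abs(morlet_cwt(toy_signal, freqs_hz, fs))
print(scalogram.shape)                     # (number of scales, number of samples)
```

Because the scales are passed in explicitly, any grid (linear, logarithmic or hand-picked) can be used, which is the flexibility a dyadic DWT does not offer.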
Our methods could be applied to other infectious diseases, for example sepsis or dengue. The signal processing technique, time series imaging in a square shape, can be used in other fields, such as seismic signal analysis. The novel deep learning model, 2D-WinSpatt-Net, can also be applied to image processing and medical imaging.
In future work, we will explore various window durations of the raw ECG data for generating the CWT, such as window lengths of 60 s, 50 s, 40 s, 30 s, 20 s, 10 s and 5 s. We would like to find the shortest window length whose CWT still maintains the accuracy of tetanus severity classification.
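A window-length sweep of this kind starts by cutting the raw ECG into fixed-duration segments. The following is a minimal sketch assuming non-overlapping windows and a hypothetical sampling rate of 256 Hz. The helper name `segment_ecg` and the zero-filled placeholder recording are illustrative only.

```python
import numpy as np

def segment_ecg(ecg, fs, window_s):
    """Split a 1-D ECG array into non-overlapping windows of window_s seconds.

    Trailing samples that do not fill a whole window are dropped.
    """
    samples = int(round(fs * window_s))
    n_windows = len(ecg) // samples
    return ecg[: n_windows * samples].reshape(n_windows, samples)

fs = 256                          # hypothetical sampling rate (Hz)
ecg = np.zeros(fs * 120)          # placeholder for a 2 min recording
for window_s in (60, 50, 40, 30, 20, 10, 5):
    segments = segment_ecg(ecg, fs, window_s)
    print(f"{window_s:>2} s windows -> array of shape {segments.shape}")
```

Each row of the returned array would then be transformed into its own CWT image before classification.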
Currently, our work uses only ECG to classify tetanus severity. Normally, tetanus classification depends on respiratory features, with or without added cardiovascular features.
The ultimate goal of our work is to develop a tetanus severity warning tool with the aim of improving clinical treatment outcomes and reducing the incidence of the disease [13]. By utilising ECG data collected from patients through wearable sensors, this tool will provide predictions on the severity of tetanus. It is designed to be applicable both in low-resource settings, where a scarcity of equipment and medical staff affects patient care, and in high-income settings [13], where inexperienced personnel may face challenges in managing tetanus due to limited exposure to such cases. The implementation of this tool has the potential to assist clinical decision-making by preventing unnecessary admissions of mild cases to the ICU and reducing treatment delays for severe cases. By accurately predicting the severity of tetanus, it can contribute to optimising resource allocation and improving patient outcomes.

Conclusions
We proposed a novel transformer-based method named 2D-WinSpatt-Net. This method has a dual attention mechanism, comprising local spatial window attention and global spatial attention. The experimental findings clearly indicate the superiority of our proposed 2D-WinSpatt-Net over other advanced DL approaches when classifying tetanus severity levels. This novel deep learning framework has the potential to greatly enhance clinical care decision-making processes and facilitate the optimal allocation of limited healthcare resources, particularly in LMIC. Furthermore, the success of our method opens up possibilities for its application to similar infectious diseases, such as sepsis and dengue. In future work, we will aim to predict the tetanus severity level on the 5th day using the patient's ECG information from the 1st day in the ICU. In addition, the 2D-WinSpatt-Net can be applied to classification tasks in different fields, including time series classification tasks. Overall, the proposed deep learning framework represents a significant advancement in the field and holds promise for addressing the challenges faced by healthcare systems in LMIC, ultimately contributing to better patient outcomes and resource utilisation.

• We propose a novel dual self-attention Vision Transformer model that contains both local spatial window attention and global spatial attention mechanisms at the image patch token level rather than the image pixel level. The local spatial window attention operates on the image patches to obtain fine-grained features and reduces the complexity to linear. The global spatial attention then operates on the output of the local spatial window attention, telling the proposed model where to look and focus.
• The resized and stitched time series imaging, continuous wavelet transform (CWT), is explored for the first time to represent the tetanus ECG information. We can obtain better accuracy of tetanus severity level classification using a shorter tetanus ECG (20-s), compared to the 60-s ECG used in previous work on tetanus infectious diseases.
• The proposed 2D-WinSpatt-Net surpasses the performance of the state-of-the-art methods in tetanus classification. It can assist clinical decision making in resource-limited settings.
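The linear-complexity claim for local window attention can be illustrated with a small NumPy sketch. It partitions a 56 × 56 grid of 96-dimensional patch tokens (224-pixel image, patch size 4, window size 7, as in the experimental setup) into windows and applies self-attention inside each one. This is a simplified illustration, not the authors' implementation: the learned query/key/value projections, multiple heads and the global spatial attention stage are omitted, and the function names are hypothetical.

```python
import numpy as np

def window_partition(tokens, window_size):
    """(H, W, C) token grid -> (num_windows, window_size**2, C)."""
    h, w, c = tokens.shape
    assert h % window_size == 0 and w % window_size == 0
    x = tokens.reshape(h // window_size, window_size, w // window_size, window_size, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size * window_size, c)

def window_self_attention(windows):
    """Scaled dot-product attention within each window.

    Assumption: identity Q/K/V projections, single head, for brevity.
    """
    scores = np.einsum("wic,wjc->wij", windows, windows) / np.sqrt(windows.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return np.einsum("wij,wjc->wic", weights, windows)

rng = np.random.default_rng(0)
grid = rng.standard_normal((56, 56, 96))   # random stand-in for patch embeddings
windows = window_partition(grid, 7)
attended = window_self_attention(windows)

n_tokens = 56 * 56
global_pairs = n_tokens ** 2                               # full attention: N^2
window_pairs = windows.shape[0] * windows.shape[1] ** 2    # windowed: N * 7^2
print(windows.shape, attended.shape, global_pairs, window_pairs)
```

For a fixed window size, the number of query-key pairs grows as N × 49 rather than N², so the cost scales linearly with the number of tokens N.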

Figure 1. Framework overview for tetanus severity classification. The ePatch wearable sensor is used to acquire raw ECG data. The proposed method, named 2D-WinSpatt-Net, takes the resized and stitched continuous wavelet transform (CWT) of the raw ECG data, with a window length of 20-s, as its input. The output of this method is a label classification, with label 0 representing mild tetanus and label 1 representing severe tetanus.

Figure 2. An example of tetanus ECG and continuous wavelet transform (CWT): (a) Tetanus ECG in 5-s; (b) The CWT related to (a).

Figure 3. The various approaches for time series imaging include the following: (a) Utilising a CWT with a window length of 20 s on images sized 224 pixels × 224 pixels. (b) Employing CWT after resizing and stitching the images obtained from (a), resulting in images of dimensions 224 pixels × 224 pixels.

Experimental Setup. Based on our experiments, the local spatial window attention with the following selected hyperparameters of the proposed 2D-WinSpatt-Net achieves optimal results:
• Image size: 224;
• Input channels: 3;
• Patch size: 4;
• Number of classes: 2;
• Embedding dimension: 96;
• Transformer blocks: 1;
• Number of heads: 2;
• Window size: 7;
• Query, keys and values bias: True;
• MLP ratio: 4.
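The hyperparameters listed above can be collected in a small configuration object, from which a few derived sizes follow. This is a sketch only: the class name `WinSpattConfig` and its field names are hypothetical, and the derived quantities assume the usual Vision Transformer conventions (non-overlapping square patches, embedding dimension split evenly across heads, MLP hidden width equal to embedding dimension × MLP ratio), which are not confirmed here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WinSpattConfig:  # hypothetical container, not the authors' code
    image_size: int = 224
    in_channels: int = 3
    patch_size: int = 4
    num_classes: int = 2
    embed_dim: int = 96
    num_blocks: int = 1
    num_heads: int = 2
    window_size: int = 7
    qkv_bias: bool = True
    mlp_ratio: int = 4

    def derived(self):
        """Sizes implied by the settings under standard ViT conventions."""
        assert self.image_size % self.patch_size == 0
        tokens_per_side = self.image_size // self.patch_size
        # Windows must tile the token grid exactly.
        assert tokens_per_side % self.window_size == 0
        assert self.embed_dim % self.num_heads == 0
        return {
            "tokens_per_side": tokens_per_side,
            "num_tokens": tokens_per_side ** 2,
            "head_dim": self.embed_dim // self.num_heads,
            "mlp_hidden_dim": self.embed_dim * self.mlp_ratio,
        }

config = WinSpattConfig()
sizes = config.derived()
print(sizes)
```

The divisibility checks make explicit that a 224-pixel image with patch size 4 yields a 56 × 56 token grid, which a window size of 7 tiles without padding.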

Figure 4. Ablation study using 20-s window length continuous wavelet transform (CWT) as input: (a) local spatial window attention module; (b) local spatial window attention module + global spatial attention module (the proposed 2D-WinSpatt-Net).

Table 4. A quantitative evaluation of the proposed 2D-WinSpatt-Net and the baseline methods using resized and stitched continuous wavelet transform (CWT) with a window duration of 20-s. The results are displayed as mean ± standard deviation, with the highest performance emphasised in bold. The local spatial window attention module is abbreviated to window attention. The global spatial attention module is abbreviated to window attention.

Figure 6. Comparing two time series imaging techniques: (a) Resized and stitched continuous wavelet transform (CWT) with a 20-s window length, resulting in 224 pixels × 224 pixels; (b) Resized and stitched log-spectrogram with a 60-s window length, resulting in 224 pixels × 224 pixels.

Table 1. Train-valid-test split definition for the Tetanus dataset.

Table 3. Analysing the effects of 2D-WinSpatt-Net: ablation studies with resized and stitched CWT input of 20-s window length. The results are displayed as mean ± standard deviation, with the highest performance emphasised in bold. The local spatial window attention module is abbreviated to window attention.