Classifying Dysphagic Swallowing Sounds with Support Vector Machines

Swallowing sounds from cervical auscultation include information related to the swallowing function. Several studies have been conducted on the screening tests of dysphagia. The literature shows a significant difference between the characteristics of swallowing sounds obtained from different subjects (e.g., healthy and dysphagic subjects; young and old adults). These studies demonstrate the usefulness of swallowing sounds during dysphagic screening. However, the degree of classification for dysphagia based on swallowing sounds has not been thoroughly studied. In this study, we investigate the use of machine learning for classifying swallowing sounds into various types, such as normal swallowing or mild, moderate, and severe dysphagia. In particular, swallowing sounds were recorded from patients with dysphagia. Support vector machines (SVMs) were trained using some features extracted from the obtained swallowing sounds. Moreover, the accuracy of the classification of swallowing sounds using the trained SVMs was evaluated via cross-validation techniques. In the two-class scenario, wherein the swallowing sounds were divided into two categories (viz. normal and dysphagic subjects), the maximum F-measure was 78.9%. In the four-class scenario, where the swallowing sounds were divided into four categories (viz. normal subject, and mild, moderate, and severe dysphagic subjects), the F-measure values for the classes were 65.6%, 53.1%, 51.1%, and 37.1%, respectively.


Introduction
Dysphagia assessment is a clinically paramount task for detecting dysphagia and presbyphagia in patients, as well as preventing or reducing some diseases caused by a swallowing disorder. For example, pulmonary aspiration due to dysphagia is one of the most common causes of pneumonia, which is the third leading cause of death following malignant neoplasms and heart disease in Japan. Previous research has demonstrated that the death rate caused by pneumonia increases with age [1]. Aspiration pneumonia is the primary type of pneumonia that occurs in older people. Based on clinical research, more than 70% of people with pneumonia appear to have aspiration pneumonia [2].
Currently, the video fluoroscopic (VF) swallowing test is the golden standard for assessing dysphagia [3]. However, the equipment for the VF test is expensive, and the medical attendants performing the test may be exposed to radiation. The video endoscopic (VE) swallowing test is a common test in clinics and hospitals [4]. Because the equipment for the VE test is small, VE can be performed at the patient's bedside. Nevertheless, the disadvantage of both tests is that they require trained clinicians to perform them. In particular, these tests cannot be performed by nurses, speech-language pathologists, or care helpers.
Alternatively, cervical auscultation is one of the most useful methods for dysphagic assessment [5]. The usefulness of cervical auscultation has been demonstrated throughout the literature. For example, Zenner et al. reported that the results obtained by performing a dysphagia examination with cervical auscultation had a high level of agreement with those obtained from the VF test in patients under long-term care [6]. However, to perform cervical auscultation, trained clinicians or speech-language pathologists are required. Nurses and care helpers in clinics or rehabilitation centers cannot perform cervical auscultation. For overcoming such a situation, an easily operable system for cervical auscultation is expected.
To design such a system, one of the key issues is developing a processing method for the signal obtained from cervical auscultation [7].
In the literature, several researchers have proposed different types of signal processing methods for cervical auscultation. Dudik et al. reviewed several past approaches of signal processing for auscultation [8]. Initially, Takahashi et al. demonstrated the possibility of dysphagic screening using a microphone and accelerometer [9]. In addition, automated classification of abnormal swallowing has been investigated in several studies [10][11][12][13][14][15]. The relation between swallowing sounds and the mechanism of swallowing [16], as well as the differences in swallowing sounds among various categories of subjects, such as the young, old, normal, or abnormal subjects [17], have been studied previously. Recently, Dudik et al. demonstrated the performance of a deep brief neural network for distinguishing normal swallowing from that in unhealthy patients [18]. The majority of the proposed methods have been applied to two-class problems. However, a conclusive method for automatically classifying swallowing signals based on multiple dysphagic levels has not been realized.
In this study, the support vector machine (SVM)-a machine learning framework-has been applied to swallowing sounds for classifying the subjects into different dysphagic categories. The classification results were compared with the categories based on the dysphagic assessment performed by clinicians. Subsequently, we evaluated the accuracy of the resulting SVM model in classifying the subjects into different dysphagia categories.

Method
Healthy subjects, 17 men and 10 women (age range: 21-47, mean: 22.4) with no dysphagic diseases were recruited. Between 2015 and 2017, 78 male patients and 65 female patients (age range: 25-102, mean: 83.3) hospitalized in Kusatsu General Hospital were recruited as dysphagic subjects. The following study protocol received the approval from the ethics committee at the University of Shiga Prefecture and Kusatsu General Hospital.
According to the modified water swallowing test, multiple 3 mL samples of water were given to the subjects. Acoustic sounds were subsequently recorded using a neck-mounted microphone connected to a laptop computer. We used an SH-12iK microphone (Nanzu Musen, Shizuoka, Japan) with a frequency range of 200-3000 Hz. The computer recording software used was Audacity; the sampling rate was set to 8000 Hz, while the recording gain was 0.7. The segment of the swallowing sound was extracted from the successively recorded sound. As the typical swallowing period is known to be about 700 ms in a healthy person [16], a sound segment of a duration of 800 ms was obtained to sufficiently cover most of the swallowing sound. The center of the sound segment corresponded to the position of the peak intensity of the swallowing sound. The total number of segments for the swallowing sound was 170.
Furthermore, clinicians assessed all the healthy subjects and patients according to the VE scoring method proposed by Hyodo et al. [19]. Accordingly, they were categorized into four groups, as listed in Table 1. Category A included healthy subjects. The swallowing sound segments for each subject were also categorized; these results are also listed in Table 1, along with the number of categorized sound segments.

Preprocessing
The preprocessing of the obtained sounds consisted of three steps: noise suppression [20], compensation, and sensitivity of the used microphone. In addition, a filter was applied for the frequency band limit.
In the first step, as the noise profile of the microphone was required for noise suppression, the output sounds obtained from the recording system with no input signals were recorded beforehand in an anechoic chamber. By applying fast Fourier transform (FFT) on the output sounds, the frequency characteristics of the noise was obtained. By subtracting the noise characteristic from the swallowing sound segments in the frequency domain, the silent swallowing sounds were generated.
In the second step, as the sensitivity of the used microphone is required, the amplitude of the recorded signal was measured using the Audacity software. At the beginning, each input had a sinusoidal wave with pressure of 0.1 Pa and a frequency range between 50 Hz-10 kHz in steps of 50 Hz. Fitting a polynomial curve to the variation of the measured amplitude produced the sensitivity level of the microphone, Figure 1. By multiplying the inverse of K[ f ] by the magnitude spectrum of the swallowing sounds, the compensated swallowing sounds were obtained.
Finally, the spectral range of the swallowing sounds was restricted from 200 Hz-3 kHz because of the microphone's acoustic restriction. Furthermore, applying the inverse FFT to the resulting spectrum yielded the final preprocessed swallowing signals. Figure 2 shows an example of the preprocessed swallowing signal.

Features
In this subsection, we describe the definitions of the various features that were used for the machine learning process. For extracting the features from the frequency and time-frequency domains of the preprocessed sound signal x[n] n = 0, 1, 2, . . . N − 1, we applied the discrete time Fourier series and a short time Fourier transform to x[n]. The definitions for both transforms are given as follows:

Frequency Domain Features
The following conventional features associated with X[k] were used.
• Maximum magnitude of spectra [13] A max ≡ max where F s denotes the sampling frequency.
• Frequency average [17] • Frequency median and quartile ratio By considering the magnitude of the spectra |X[k]| as part of a frequency distribution, the quartiles k 1 , k 2 , k 3 can be calculated, where k 2 corresponds to the median, and the median frequency is given by Q 2 = k 2 F s /N. Other indices can also be translated into the frequency Q i = k i F s /N. Consequently, the quartile ratio is defined as

Time-Frequency Domain Features
In this study, we propose some features derived from the spectrogram |X w [m, k]| based on which a binary image X th [m, k] is obtained via thresholding. These obtained features describe some characteristics of the time-frequency domain signal, including statistical variation, peak locations, and dispersion of peak values.
First, some features associated with the spectrogram in the time-frequency domain are restricted by the indices shown in Figure 4, which are defined as follows.
• Relative distances of the peak location from the center of the spectrogram Furthermore, to describe the time variation of the statistics with respect to the spectrogram, the spectrogram is divided into B-blocks along the m axis, as displayed in Figure 4; subsequently, the following features are defined in each block. The quartile k b1 , k b2 , k b3 for each block is derived from the histogram H b [k] of the b-th block given by The frequencies for the corresponding quartiles were calculated as Q bi = k bi F s /L. The quartile ratio can be similarly defined by Q rb1 ≡ Q b1 /Q b2 , Q rb3 ≡ Q b3 /Q b2 . Each block energy is derived from the square sum of the spectrogram, including each block, as In the later experiments, B = 15 is used. This means that each block length is approximately 0.05 s under the current sampling rate. The width of the envelope of each peak in the preprocessed signal shown in Figure 2 can be affordably dropped into the block length. Hence, each block holds sufficient information of the difference signal among the other blocks.
Finally, to describe the effects of multiple spectral peaks and the spread of each peak, we propose the following features with respect to the threshold spectrogram X th [m, k] defined as where ε t ≡ κA w is a threshold level related to the above peak value A w in the spectrogram and the scale κ. In the following experiments, we used κ = 0.5. From Figure 3a, there are some local peaks in the spectrogram, except a central main peak. Applying a higher threshold level to the spectrogram may remove the positions of the other spectral peaks. In the case of a lower threshold level, many peak locations, including noisy level spectral peaks, appear. Preliminary experiments determine that κ = 0.5 could give the appropriate peak location shown in Figure 3b.
• Area of the threshold spectrogram: • Ratio of the area of the threshold spectrogram to the complete area of the spectrogram: Average distance from the center of the spectrogram: Average distance from the peak location of the spectrogram:

Machine Learning
In this subsection, we describe the data set conditions and size adjustments for training the SVMs. Generally, the choice of machine learning method depends on the size of the data set. Deep learning requires vast training data for constructing proper models. For example, the CIFAR-10 data set consists of 60,000 images in 10 classes, with 6000 images per class [21]. It is difficult to collect such a large number of swallowing sounds of patients in a clinical setting. It is experimentally well-known that the SVMs have the advantage of operating with relatively small data sets [22,23]. Therefore, the SVMs were employed in this study. We used the LIBSVM library [24], which is commonly used for implementing SVMs. The efficiency of an SVM with a radial basis function (RBF) kernel depends on the cost and RBF kernel parameter; therefore, these parameters are optimized using the grid-search tool included in the LIBSVM library. The number of swallowing sounds belonging to each category are not always equal to each other. The number of samples from each category used for training should be adjusted to obtain consistent results from the SVM, such that they are almost equal to each other. For example, in the two-class problem, the swallowing sound segments belonging to category A are classified as normal swallowing, and those belonging to categories B, C, and D were classified as abnormal swallowing. Table 1 lists a higher number of sound segments for abnormal swallowing than those for normal swallowing. The number of sound segments that were used was restricted to 104 randomly selected segments. Overall, segments from both groups were divided into 84 training data samples and 20 test data samples. Five different sets of the training and testing data samples were obtained, and all the classified results were averaged to obtain the final classification accuracy. In the four-class problem, the smallest number of the available sound segments was 37 for category D; hence, 37 sound segments were randomly selected from categories A, B, and C. These 37 sound segments were divided into 27 training data samples and 10 test data samples. The remaining process was similar to the two-class problem.

Choice of Features
When parameter B is set to 15, a total of 83 features can be obtained. These features can create various combinations of features, which require large computational costs for machine learning. For reducing the number of feature combinations for the computational cost, the following restriction, which is a type of filtering method [25], was introduced. The correlation coefficients between the VE scoring and each feature value were calculated beforehand, and they were sorted into the descending order of the absolute value of the correlation coefficients. This variation of the sorted correlation coefficients is shown in Figure 5. Figure 5 shows that the magnitude of the correlation coefficients rapidly decreases, and the decrease is almost linear after the 10th feature. Therefore, the first nine features were used as the candidate features for the machine learning experiments. The classification accuracy for using the top k features of all the 83 features are plotted in Figure 6. The graph is saturated at approximately four features in the two-class problem, and six features in the four-class problem. Therefore, the feature combinations 9 C 4 and 9 C 6 were studied in these experiments.   Table 2 lists the classification accuracy in the two-class problem, where the swallowing sounds of group A are classified as normal, whereas those of groups B, C, and D are classified as abnormal. In this Table, the list has been sorted in descending order of accuracy. The highest value of the accuracy is 0.780, and the highest F-measure is 0.789. Table 3 summarizes the result of the classification accuracy for the four-class problem. The list is sorted in descending order of total accuracy. The highest total accuracy is 0.460 with featuresD v ,f , A w , Q 2 ,D h , and Q r1 . The F-measures for all feature combinations in each class are plotted in Figure 7. The x-axis and y-axis in those figures indicate the combination number and the F-measure value, respectively. Table 2. Accuracy, precision, recall, and F-measure for each feature combination in the two-class problem. The list is sorted in descending order of accuracy.  Table 3. Accuracy and F-measures of classes A, B, C, and D for each feature combination in the four-class problem. The list has been sorted in descending order of accuracy.

Discussion
The accuracy in the two-class problem shows similar performance compared to the previous results reported by Mérey et al. [26]. They applied the SVM with an RBF kernel to the dual accelerometry data and demonstrated that the accuracy is 0.806 by using simple feature combination. Based on the feature combination listed in Table 2, the average distance from the center of the spectrogram in the frequency direction,D v , as well as in the quartile-related features, such as Q r1 and Q 8,2 , are commonly used for obtaining higher accuracy, which suggests that combining these features is useful for increasing the precision of the classification performance. All of these features depend on the position of the frequency peaks or the spectrum bias. As Lee pointed out in [13], the presence of vallecular residue and pyriform sinus residue may change the pharyngeal air space volume, which may cause a change in the acoustic vibration characteristics of the airflow in the pharynx. The feature value of the above features reflects this change.
In the four-class problem, the featureD v and quartile-related features are also included in the combination of the features from Table 3. Although combination #42 yields the highest accuracy and holds a relatively high F-measure for classes A, B, and C, its F-measure in class D is less than 0.15 from Figure 7. Combination #70 indicates the highest F-measure of 0.371 within class D (see Figure 7d). Similarly, combinations #61 and #75 hold the second and third highest F-measures of 0.362 and 0.356, respectively. These feature combinations commonly includeD h rather thanD v , which is included in the feature combinations with high accuracy. These results suggest the following: 1.
The characteristic of swallowing sounds for class D is different from that of the other classes.

2.
The featureD h relates to the area position with a higher magnitude along the time direction in the spectrograms. Robbins et al. reported that the total duration of oropharyngeal swallowing was significantly longer in the oldest group than in any other younger counterparts [27]. Borr et al., also reported that the duration of the onset time, and deglutition apnea of auscultation for the dysphagic group, was significantly longer than that of the nondysphagic group [28]. These duration variances caused a temporal bias in the spectrogram.

3.
When classifying the swallowing sounds into four categories, multiple classifiers or a combination of these might be useful due to the different characteristics of each class.

Conclusions
The accuracy of classifying swallowing sounds into two classes and four classes by using SVMs was examined. Although the F-measure reached 78.9% in the two-class scenario, the overall classification accuracy of the four-class scenario was still insufficient when using the classifier as a stand-alone method for the diagnosis. Conversely, this study revealed that the variation of some feature combinations can serve as useful classifiers for individual categories.
Although these results demonstrate the potential of using a classifier constructed by the SVM, some problems and limitations remain in their use as a practical classifier. The results of the trained SVM relied on the category defined by the total score obtained by Hyodo's VE scoring. The scoring method consisted of four tests. The segmented swallowing sounds may not correspond to the results of the total score of the four tests. To solve this problem, an improved model for considering each test in the scoring should be created. For selecting the features for the SVMs, a filter-based method was used in the experiments and the restriction on the feature combination was introduced. The resulting feature combination may not be optimal. Other types of selection methods, such as wrapper methods or embedded methods, should also be applied and the resulting classifier performances should be compared with each other. It will also be useful to employ other types of machine learning for increasing the classification performance. Deep neural networks is one such option, provided many data sets can be obtained. We are currently working on constructing an improved classifier for considering each test based on Hyodo's VE scoring by using deep neural networks combined with data augmentation. We will report the results of this research in the future.