Effectiveness of MP3 Coding Depends on the Music Genre: Evaluation Using Semantic Differential Scales

Abstract: MPEG-1 Layer 3 (MP3) is one of the most popular compression formats used for sound and especially for music. However, during the coding process, the MP3 algorithm negatively affects the spectral and dynamic characteristics of the audio file being compressed. The aim of this study is to evaluate the effect of the MP3 coding format for different music genres and different bitrates via listening tests in which the original uncompressed files and the compressed files are compared. For this purpose, five different music genres were selected (rock, jazz, electronic, classical and solo instrument), and the files were compressed at three different bitrates (96 kbps, 160 kbps and 320 kbps). The semantic differential method was used, and ten bipolar scales were selected for the listening tests (e.g., better–worse, more distortion–less distortion). The following are the most important findings of this study: classical music was negatively affected the most among the genres by the MP3 compression (lowest ratings in 8 out of 10 bipolar scales), the solo instrument was least affected among the genres (highest rating in 7 out of 10 bipolar scales), and for higher bitrates, the differences in ratings were small for all music genres. The findings of this study could be used to optimize and adapt the standard, depending on the music genre and the musical piece that needs to be encoded.


Introduction
MPEG-1 Layer 3 (MP3) is one of the most widely used sound compression methods. Its small storage space requirement (which enables fast distribution over the Internet), along with its compatibility with every audio player, makes it ideal for audio sharing, streaming and online entertainment. Big streaming services, such as Amazon Music, Google Play Music and various others, offer the majority of their content in MP3 form [1]. Audiobook platforms (Audible, Amazon, Scribd, etc.) sell their books exclusively in MP3 form [2]. Even forensic analysis and investigation make use of the algorithm [3,4] for detecting forgeries, etc.
MP3's design takes advantage of our hearing's inability to perceive all the information contained in a complex audio signal (perceptual audio coding). All perceptual coding algorithms contain a filterbank, a perceptual model, quantization and coding elements. The encoder contains a hybrid filterbank (polyphase and modified discrete cosine transform (MDCT)) through which the audio signal passes and is divided into sub-bands [5]. The polyphase filterbank divides the signal into 32 equal sub-bands, and the MDCT splits each of those sub-bands into 18 smaller ones, creating a total of 576 frequency lines [6]. Simultaneously, the signal passes through the FFT block, which transforms the signal into the frequency domain. The output of the FFT block feeds the psychoacoustic model. The psychoacoustic model uses the critical bands of human hearing [7,8] to determine which part of the information contained can be discarded as inaudible.
[...] for two reasons: to make sure that all participants would rate the exact same audio excerpt and to keep the duration of the listening tests to a minimum and avoid participants' fatigue. The audio clips were carefully selected so that they would contain all the unique characteristics that represent each music genre. Rock music usually features distinct vocals, electric guitars, electric bass and an intense drum line. Jazz features complicated melodic lines (often without vocals), dynamic fluctuations, wind instruments (trumpet, saxophone, etc.), and distinct rhythmic patterns that come from string instruments (guitars, double bass, electric bass) and piano. Electronic music features exclusively electronic instruments (samplers, pads, etc.), simple, repeating melodic patterns and a fast tempo. Classical music features a wide variety of acoustic instruments (wind instruments, chordophones, percussion, piano) in large numbers, multiple complex melodic lines and significant dynamic fluctuations. The category of the "solo instrument" is not a music genre in itself. However, its characteristics are widely different from those of any music genre. The frequency range of the sound is limited to the instrument's capabilities, only a very small number of melodic lines can be present (one or two), the timbre of the sound is constant and the masking effect taking place in such an audio excerpt is different from that in any specific music genre.
From the selected WAV audio clips, three MP3 files were created for each genre, at three bitrates: 96 kbps, 160 kbps and 320 kbps. In total, for each music genre, four audio files were used: one WAV (44.1 kHz, 16-bit) and three MP3s (96 kbps, 160 kbps and 320 kbps). All files were encoded with the LAME MP3 encoder (sampling frequency 44.1 kHz).
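The encoding step could be reproduced along the following lines. This is only a minimal sketch: it assumes the LAME command-line tool is installed and that constant-bitrate encoding through its `-b` option was used, which the study does not state. The file names and the helper functions `lame_command` and `encode_all` are hypothetical.

```python
import subprocess

BITRATES_KBPS = (96, 160, 320)  # the three bitrates used in the study


def lame_command(wav_path, bitrate_kbps):
    """Build a LAME command line for one WAV file and one bitrate.

    Assumption: plain `-b <kbps>` (constant bitrate) is used; the exact
    encoder options of the study are not reported.
    """
    mp3_path = wav_path.rsplit(".", 1)[0] + f"_{bitrate_kbps}.mp3"
    return ["lame", "-b", str(bitrate_kbps), wav_path, mp3_path]


def encode_all(wav_path, run=False):
    """Return one command per bitrate and optionally execute them."""
    commands = [lame_command(wav_path, b) for b in BITRATES_KBPS]
    if run:  # requires the `lame` executable on PATH
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `encode_all("rock.wav")` only builds the three command lines; passing `run=True` would invoke the encoder.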
The bitrate 96 kbps is one step below the recommended 128 kbps bitrate of the standard and is very popular for streaming, while 320 kbps is the highest available bitrate and is thus the one that produces the highest quality. These two bitrates have been used in similar studies [25,34] and the 96 kbps bitrate in one study [26]. On the other hand, 160 kbps is not a very popular bitrate amongst studies but is one step above the recommended bitrate and has a good compression/quality ratio. Some suggest that 160 kbps should be the minimum bitrate used for the compression of all audio files [35]. It is also the lowest bitrate used in some download and streaming applications, such as audiobook platforms. Spectrographs of the original and compressed files, as well as their dynamic ranges, can be found in Figure 1 and in Appendix A, respectively.
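To put the three bitrates in perspective, the sketch below compares them with the data rate of the uncompressed source. The sampling rate and bit depth are those reported for the WAV files; the channel count of 2 (stereo) is an assumption, since the text does not state it, and the function names are illustrative.

```python
def pcm_bitrate_kbps(sample_rate_hz=44_100, bit_depth=16, channels=2):
    """Data rate of uncompressed PCM audio in kbps (channels=2 is assumed)."""
    return sample_rate_hz * bit_depth * channels / 1000


def compression_ratio(mp3_kbps, **pcm_params):
    """Ratio between the uncompressed data rate and the MP3 bitrate."""
    return pcm_bitrate_kbps(**pcm_params) / mp3_kbps


if __name__ == "__main__":
    for kbps in (96, 160, 320):
        print(f"{kbps} kbps: about {compression_ratio(kbps):.1f}:1")
```

Under the stereo assumption the uncompressed rate is 1411.2 kbps, so the 96 kbps files are reduced by a factor of roughly 15 and the 320 kbps files by a factor of roughly 4.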

Figure 1. Spectrographs of the original and compressed audio files for all the music genres. Scale is common for every audio file and is presented as power/frequency (dB/Hz).

Participants
In total, 30 persons participated in the listening tests (mean age = 23.8 years, SD = 2.5 years). They comprised 15 male participants (mean age = 24.0 years, SD = 3.0 years) and 15 female participants (mean age = 23.5 years, SD = 2.0 years). According to Heise [36], the minimum number of participants for conducting tests using the semantic differential method is 30. Participants' ages ranged from 18 to 29 years. This particular age range was chosen in order to avoid age-related hearing loss [37]. In order to achieve a homogeneous group, a large majority of participants were university students self-reporting normal hearing. The participants were recruited through an ad posted on a university campus. During recruitment, participants were told that the study involved listening tests on several perceptual attributes. All participants use MP3 audio files on a daily basis. No participants reported being professional musicians; five reported being amateur musicians. No participants reported possessing technical knowledge regarding the algorithm's function. Participants were not reimbursed for their time and took part voluntarily. Informed consent was obtained from all participants before the experiment.

Questionnaire
In order to form the questionnaire for this study, various other approaches and questionnaires from other audio-linked studies were taken into account [24,25,38–42]. For this study, the semantic differential method [27] was selected in order to evaluate various aspects of the MP3 compression effects. This method allows the measurement of objective sound characteristics (amplitude, frequency, etc.) by using easily perceived, bipolar subjective scales.
The three criteria for selecting the bipolar scales used in the semantic differential method were compliance with the EPA (Evaluation, Potency, Activity) structure, the nature of the subject examined [43] (in this case, the evaluation of the audio files' compression) and easy and common interpretation by the participants. Each scale belongs to one of three categories (Evaluation, Potency, Activity), called factors. The Evaluation and Activity factors contain three scales each, and the Potency factor contains four scales. All scales are linear and semantically stable. Linearity is a characteristic associated with the integers assigned to each interval of a scale. According to the method [44], all intervals are assumed to be equal, so consecutive integers are assigned to the intervals, ranging from −3 (negative pole) up to 3 (positive pole). Semantic stability is associated with the ability of the participants to interpret the meaning of the scale in the exact same way [45]. Table 2 shows the bipolar scales used for each factor.

Table 2. Correspondence between factors (Evaluation, Potency, Activity) and the ten bipolar scales used in this study for the application of the semantic differential method. (Table columns: Evaluation, Potency, Activity.)
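The integer coding of the scale intervals described above can be illustrated as follows. This is a sketch under assumptions: the seven boxes are taken to be numbered 1 to 7 from left to right, and the `positive_pole_on_right` flag is a hypothetical way of handling scales printed in the opposite orientation, which the study does not discuss.

```python
def mark_to_score(position, positive_pole_on_right=True):
    """Convert the position of the "X" mark (1..7, left to right) to -3..3.

    The middle box (position 4) maps to 0; equal intervals are assumed,
    so consecutive boxes map to consecutive integers.
    """
    if not 1 <= position <= 7:
        raise ValueError("position must be between 1 and 7")
    score = position - 4
    return score if positive_pole_on_right else -score
```

For example, a mark in the leftmost box of a scale whose negative pole is on the left is coded as −3, and a mark in the middle box as 0.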

Experimental Design and Procedure
The experiment was conducted in a soundproof test room in university facilities. The participants were seated in the middle of the test room behind a desk, and the questionnaires were presented. The listening sessions were carried out with Sennheiser HD-215 headphones (Sennheiser, Wedemark, Germany). The sound levels used in the listening tests were in accordance with the reference level typically recommended by the ITU-R [46,47], since listening to music on headphones can be annoying [48–50]. The headphones were calibrated with a dummy head, "Artificial head-John" (Soundman, Berlin, Germany).
Possible effects of the headphones due to irregularities in the frequency response were minimal due to the nature of the specific listening tests. The listening tests consisted of paired comparisons of the original and the compressed audio and the evaluation of the differences through bipolar scales. Therefore, any effect of the headphones' frequency response would be the same for both sound samples (original and compressed), thus minimizing the alteration of the result.
In the listening tests, written instructions were given to the participants which informed them that they would be presented with pairs of audio files for different music genres. Their task was to compare the audio files of each pair and rate them for ten characteristics, presented in ten scales. The participants were instructed to submit their opinion for each comparison by placing one "X" mark on each scale. The participants were advised to make sure they rated all ten scales for each audio pair before proceeding to the next one in the questionnaire. Participants were tested individually.
At the beginning of each session, a short training sequence was inserted consisting of 5 stimulus combinations, one for each music genre, at a random bitrate. The participants were asked to rate the training sequence's stimuli as if they were part of the listening test. After the practice sequence, the participants were presented with the 15 pair comparisons in random order. Each comparison was presented for approximately 8-11 s, and there was no time limit for giving a response. Each audio pair was evaluated via a 7-point bipolar scale, such as the one demonstrated in Table 3. Since the study was in Greek, the original scale is presented in Appendix B. This resulted in 15 ratings per subject per session. The whole experiment lasted approximately 50 min. After the experiment, the participants were informed about the nature of the study and were allowed to ask questions. The study was approved by the Committee on Ethics and Deontology of Research (C.E.D.R.), Technical University of Crete, Greece (project identification code: protocol number 38/5.5.2022).
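The presentation scheme described above (five training stimuli followed by the 15 genre and bitrate combinations in random order) could be generated as in the following sketch. The function name, the `seed` parameter and the choice to keep the training stimuli in a fixed genre order are assumptions; the study does not describe how the randomization was implemented.

```python
import itertools
import random

GENRES = ("rock", "jazz", "electronic", "classical", "solo instrument")
BITRATES_KBPS = (96, 160, 320)


def presentation_order(seed=None):
    """Return (training, test_pairs) for one listening session.

    training   : one (genre, bitrate) per genre, bitrate chosen at random
    test_pairs : all 15 (genre, bitrate) combinations in random order
    Each entry stands for a comparison of the WAV original with that MP3.
    """
    rng = random.Random(seed)
    training = [(genre, rng.choice(BITRATES_KBPS)) for genre in GENRES]
    test_pairs = list(itertools.product(GENRES, BITRATES_KBPS))
    rng.shuffle(test_pairs)
    return training, test_pairs
```

Passing a fixed `seed` makes a session order reproducible, which can help when documenting the procedure.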

Results
Comparisons of results between genres for different bitrates are presented in Figure 2. Diverging stacked bar charts [51] are used, which are suggested to be appropriate for the presentation of rating scales such as the semantic differential [52,53]. The advantages of diverging stacked bars over other graphical forms for rating scales are presented in [53]. The percentages of respondents who agree with the statement are shown to the right of the zero line; the percentages who disagree are shown to the left. The percentages for respondents who neither agree nor disagree are split down the middle and are shown in a neutral color. Results are presented for different bitrates (96 kbps, 160 kbps and 320 kbps) and for different genres (rock, jazz, electronic, classical and solo instrument).
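The layout rule of a diverging stacked bar (negative ratings to the left of the zero line, positive ratings to the right, neutral ratings split in half) can be sketched as follows. The function name is illustrative, ratings are assumed to be coded from −3 to 3, and any ratings passed to it in an example are hypothetical, not data from this study.

```python
from collections import Counter


def diverging_extents(ratings):
    """Compute how far one diverging stacked bar extends on each side.

    ratings: iterable of integer scores in -3..3 for one scale/stimulus.
    Returns the percentage per score and the left/right extents, where
    the neutral share (score 0) is split evenly across the zero line.
    """
    ratings = list(ratings)
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    pct = {s: 100 * counts.get(s, 0) / len(ratings) for s in range(-3, 4)}
    half_neutral = pct[0] / 2
    left = sum(pct[s] for s in (-3, -2, -1)) + half_neutral
    right = sum(pct[s] for s in (1, 2, 3)) + half_neutral
    return {"percent_by_score": pct, "left": left, "right": right}
```

The two extents always add up to 100, so the position of the zero line within the bar shows at a glance whether the ratings lean towards the negative or the positive pole.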
In addition, for convenience, average evaluation scores are also presented in Figures 3 and 4. The rating data of the semantic differential method are ordinal; however, it is common practice in similar studies to present averaged evaluation results (e.g., [54,55]). Average evaluation scores are presented in Figure 3 for different bitrates (96 kbps, 160 kbps and 320 kbps) and in Figure 4 for different genres (rock, jazz, electronic, classical and solo instrument). Additionally, average evaluation scores are presented in Appendix C in tables for different bitrates. The reason for the additional presentation is that the differences in the case of higher bitrates (e.g., 320 kbps) are very small and difficult to distinguish in Figures 3 and 4.
In order to evaluate the reliability of the ten bipolar scales, Cronbach's alpha was calculated (Table 4) [56]. In general, Cronbach's alpha produced internal consistencies of more than 0.7 for almost all of the ten scales. However, a value smaller than 0.7 was reported for the scale "More noise-Less noise". Regarding the acceptable values of alpha, there are different reports ranging from 0.70 to 0.95 [57]. However, according to Hair et al. [58], a good standardized loading factor of each measurement's latent variable, which is quantified from a manifest variable, should be above 0.5 and ideally 0.7 or higher. A low value of alpha could be due to a low number of questions [59], poor inter-relatedness between items or heterogeneous constructs [60]. A score below 0.7 possibly suggests that the items within the tool may not be measuring the same underlying construct. In the case of the bipolar scale "More noise-Less noise", there is a possibility that the participants estimated the underlying construct of the effect of the MP3 coding format differently, and there was an interaction due to another underlying concept (e.g., endogenous noise of music). However, since the value is not far from the recommended one and is above 0.5 [58], we decided to keep this result in this section and also as a reference for future studies. Further experimental studies are needed to estimate the effect and extent of this possible interaction.
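For reference, Cronbach's alpha can be computed from a table of ratings as in the sketch below. It uses the standard formula alpha = k/(k − 1) · (1 − sum of item variances / variance of the total score). How the items were grouped for each scale in this study is not stated, so the layout assumed here (one row per participant, one column per item) is an assumption, and the function name is illustrative.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a rectangular table of ratings.

    rows: list of equal-length lists, one per participant; each column
          is one item. Sample variances (n - 1 denominator) are used.
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two participants")
    if any(len(row) != k for row in rows):
        raise ValueError("all rows must have the same number of items")
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

When all items rise and fall together across participants, the variance of the total score dominates the sum of the item variances and alpha approaches 1; weakly related items pull it towards 0.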

Impact of MP3 Compression on Classical Music
Perhaps the most striking result of this study is the impact of MP3 compression on classical music. Among all genres, classical music was affected the most, especially at the lowest bitrate. It can be seen in Figures 2 and 3 that in 8 out of the 10 bipolar scales, classical music had the lowest ratings among all genres (this can also be seen in Appendix D (Table A3)). In other words, compared with the original file, the MP3-encoded classical music sample was the one most strongly assessed as being worse, having a poorer spectrum, having more noise, being more annoying, being less loud, having fewer dynamic changes and being less warm.
A possible explanation for this phenomenon lies in the very nature of the MP3 algorithm. As mentioned in detail in the introduction, the audio signal is divided into sub-bands [5], each of which is later split into 18 smaller ones, creating a total of 576 frequency lines [6]. Thereafter, the psychoacoustic model uses the critical bands of human hearing [7,8] to determine which part of the information contained can be discarded as inaudible. Therefore, a musical signal that has a large amount of musical information spanning the entire auditory spectrum is likely to suffer the greatest negative impact from the application of the MP3 algorithm. Classical (orchestral) music contains a great variety and number of musical instruments. Its long-term average spectra differ from those of other music genres [61].
Additionally, although the note range of the typical orchestra instruments does not exceed 3520 Hz, the harmonic overtones that are created can reach up to 20,000 Hz, especially in the violins [62]. When an MP3 file is encoded with a bitrate of 96 kbps, all frequencies above 15 kHz are discarded, so all the instrument harmonics above that limit are lost, resulting in the perception of poorer spectrum and loss of brightness.
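The effect of the 15 kHz limit on overtones can be illustrated with simple arithmetic. The sketch below assumes an ideal harmonic series (integer multiples of the fundamental), uses the 15 kHz cutoff and the 20 kHz upper limit quoted in the text as defaults, and ignores the fact that real instrument partials deviate slightly from exact multiples. The function name is illustrative.

```python
def split_harmonics(fundamental_hz, cutoff_hz=15_000, audible_limit_hz=20_000):
    """Split the harmonics of a note into those kept and those removed.

    Harmonics are taken as n * fundamental_hz (ideal series, assumed).
    Those above cutoff_hz but within audible_limit_hz count as lost.
    """
    kept, lost = [], []
    n = 1
    while n * fundamental_hz <= audible_limit_hz:
        frequency = n * fundamental_hz
        (kept if frequency <= cutoff_hz else lost).append(frequency)
        n += 1
    return kept, lost
```

For a note at 3520 Hz, the top of the range mentioned above, this gives four harmonics at or below 15 kHz and one (17,600 Hz) above it; for a 440 Hz note, 11 of the 45 harmonics up to 20 kHz lie above the cutoff.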
Dynamic changes in classical music are in general more prominent than in other genres such as rock or rap [63]. As opposed to some other genres examined, classical music is a genre that rarely becomes dynamically compressed during the mixing and mastering processes; therefore, the dynamic compression applied by the MP3 algorithm is easily perceived, degrading the quality of the audio file. Similar effects regarding the dynamic range of the orchestral excerpt were observed in another study [25].

Impact of MP3 Compression on Solo Instrument
Another important finding of this research, related to the previous case, is that the solo instrument was least affected by the encoding of the MP3 standard for the lowest bitrate. It can be seen in Figures 2 and 3 that in 7 out of the 10 bipolar scales, the solo instrument had the highest ratings among all the genres (this can also be seen in Appendix D (Table A3)).
Again, a possible explanation, presented previously in Section 4.1, lies in the very nature of the MP3 algorithm. Since the audio signal is divided into sub-bands and a large number of frequency lines, part of which is discarded as inaudible according to the psychoacoustic model, a signal which has a relatively small amount of sound information (relative to a signal with a large number of instruments and harmonics) is likely to experience the least quality reduction from the MP3 compression. Given that the excerpt of this genre (solo instrument) contains audio information that comes from just one instrument, with the spectrum limitations that accompany it, it is possible that MP3 compression affects it the least.
Additionally, the limit of 15 kHz, for the 96 kbps bitrate, takes away very little audible information. Although the piano is the acoustic instrument with the biggest note range (A0/27.5 Hz-C8/4186 Hz) and produces a lot of harmonic overtones, the majority of the frequencies that are loud enough to be easily perceived lie between 70 Hz and 4000 Hz. The algorithm's encoding process does not heavily affect this range, since it functions on a psychoacoustic basis and accounts for the sensitivity of human hearing in the range of 1000-5000 Hz. Moreover, lower frequencies need fewer bits to be encoded than higher frequencies, so the algorithm's bit allocation method does not affect them greatly.
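The fundamental frequencies at the ends of the piano range follow from the usual equal-temperament relation. The sketch below assumes twelve-tone equal temperament with A4 (key 49 of a standard 88-key piano) tuned to 440 Hz; the tuning of the recording used in the study is not stated, and the function name is illustrative.

```python
def piano_key_frequency(key_number, a4_hz=440.0):
    """Fundamental frequency of a key on an 88-key piano.

    key_number: 1 (A0) to 88 (C8); key 49 is A4.
    Assumes twelve-tone equal temperament: each semitone is a factor
    of 2 ** (1 / 12).
    """
    if not 1 <= key_number <= 88:
        raise ValueError("key_number must be between 1 and 88")
    return a4_hz * 2 ** ((key_number - 49) / 12)


if __name__ == "__main__":
    print(f"A0: {piano_key_frequency(1):.1f} Hz")
    print(f"C8: {piano_key_frequency(88):.1f} Hz")
```

Both ends of the fundamental range lie far below the 15 kHz limit, so only the higher overtones of the upper notes can be removed by the low-pass filtering.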

Impact of MP3 Compression on Electronic, Rock and Jazz Music
Regarding the impact of MP3 compression on electronic, rock and jazz music, it seems that the results for these genres are somewhere in between the results for classical music and the solo instrument. This can be seen in Figures 2 and 3 and Appendix C. As mentioned in Section 4.1, it seems that quality reduction from the MP3 compression depends on the amount of sound information that is spread across the audible spectrum for each music genre and music piece.
Additionally, the algorithm's impact on electronic music also appears to be closely related to its dynamics and loudness characteristics. The two worst scores in terms of specific music attributes for the lowest bitrate examined belong to the scales "Less changes-More changes (dynamics)" and "Less loud-Louder". The scores of these two scales, combined with the "Worse-Better" score (which is the lowest amongst all scales), point to the assumption that electronic music's likeability is heavily connected to its loudness characteristics.
The results for the rock music genre revealed a possible correlation between noise perception and overall satisfaction for this genre. Surprisingly, although at 96 kbps the MP3 audio clip was perceived as less noisy compared to the WAV clip, the scales "Worse-Better" and "More annoying-More enjoyable" had the worst ranking for the MP3 clip amongst all scales.
Interestingly, the algorithm seemed to have a similar effect on jazz regarding noise as the one stated for rock music. Again, for the 96 kbps bitrate, the MP3 clip was perceived as less noisy than the WAV file, but the scales "Worse-Better" and "More annoying-More enjoyable" received the worst scores.

General Remarks
The findings of this study confirm that MP3's lossy nature produces audible artifacts that people can perceive, thus affecting the quality of the compressed audio signal compared to its uncompressed form. The findings share many similarities with those of Pras et al. [25] and Ruzanski [26]. Although differences can be spotted among them (music genres, bitrates and population), it becomes apparent that one must take into consideration the music genre of the audio file that needs to be encoded, in order to achieve the best quality/size ratio. In both [25,26], the listeners' overall preference for the CD-quality audio is clear, just like in the present study. Regarding higher bitrates, participants of these studies could not discriminate between CD quality and MP3. Similarly, in this study, the differences between the MP3 files and the original at higher bitrates are very small for all the music genres.
The results of this study support the idea that MP3's lossy nature affects a wide range of audio characteristics. More specifically, bitrates below 160 kbps are not recommended for any of the music genres studied. Lower bitrates introduce artifacts that can be easily perceived and thus lower the audio quality drastically. However, this research has given rise to many questions and possibilities for future work in need of further investigation. Those are going to be discussed in the next section.

Future Work
Subjective evidence from this study supports the idea that the effectiveness of MP3 coding depends on the music genre. The authors find that the results reported here are encouraging but also should be validated by a larger sample size and for more music genres. Additionally, the results of this study could possibly have some implications for research into improving the efficiency of the algorithm.
One possible application could be the utilization of various music genre classification techniques that could be incorporated into the MP3 algorithm to improve its efficacy. For music genre classification, machine learning techniques could be used [64–66] that would automatically recognize the music genre of the audio file to be compressed. Classification through machine learning could be applied with the use of spectrograms [67] or even with the use of deep neural networks [68]. The outcome of this step could potentially be used in a number of ways. For example, suggestions could be given by the algorithm for a minimum recommended coding quality according to the music genre of the audio file.
The next and most important step, once the music genre of the audio file to be compressed has been identified, is to determine how the MP3 encoding could be improved according to that genre. Various directions could possibly be explored. For example, the results of this study show that there are differences in the evaluation scores for the bipolar scales for each music genre. Compression has a different effect between genres for scales such as "Poorer spectrum-Fuller spectrum", "Less changes-More changes (dynamics)" and "More distortion-Less distortion". Therefore, improvement of the algorithm for lower bitrates could focus on minimizing the alteration of the scales to which each music genre is most sensitive. Future work will concentrate on this problem. We hope that our study will serve as a base for future studies on this important issue.

Conclusions
The main aim of this research was to evaluate the effect of the MP3 coding format for different music genres and different bitrates via listening tests in which the original uncompressed files and the compressed files are compared. The semantic differential method was used, and ten bipolar scales were utilized for the listening tests. For this purpose, five different music genres were selected (rock, jazz, electronic, classical and solo instrument), and the files were compressed in three different bitrates (96 kbps, 160 kbps and 320 kbps).
The most important findings of this research are:

• Classical music had the greatest negative impact due to MP3 compression, among the genres (lowest ratings in 8 out of 10 bipolar scales).

• The solo instrument was least affected by the MP3 encoding, among the genres (highest rating in 7 out of 10 bipolar scales).

• For electronic, rock and jazz music, it seems that the results for these genres are somewhere in between the results for classical music and the solo instrument.

The conclusions and implications of this research are:

• A musical signal that has a large amount of musical information that spans the entire auditory spectrum is likely to be negatively affected the most from the application of the MP3 algorithm. The music genre is of great importance when choosing the compression bitrate.

• The findings of this study can be used to optimize and adapt the MP3 standard, depending on the music genre and the music piece that needs to be encoded.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.

Appendix B
Frequency response of the Sennheiser HD 215 headphones retrieved from [69].

Figure A1. Frequency response of the Sennheiser HD 215 headphones.

Appendix C
Since the study was in Greek, the original seven-point bipolar scale is presented in Table A2. More about an appropriate translation of scales and attributes can be found in [70].