1. Introduction
A soundscape is a contextual acoustic environment perceived by humans. Initially, this concept was used as a compositional approach in order to provide a sense of space via the recording and rearrangement of noise, including natural and environmental elements. This role of the soundscape is used to help an audience imagine its space [
1,
2]. Recently, the concept of the soundscape has been used in a variety of fields, such as architecture, art, and education [
3,
4,
5,
6]. In particular, some research [
7] has proposed art-based spatial design by presenting the concept of shopping galleries through a combination of shopping spaces and soundscape music. The concept of the soundscape in art education has also been explored [
8]. Recently, IT technology [
9,
10] has been used to improve user experience in map applications through data-based soundscape construction methods.
In this study, we applied the soundscape concept to an art exhibition environment to improve the artwork appreciation experience. At this point, music is an important component of soundscape-based exhibition environments because it can interfere with or help an audience appreciate the experience. In previous studies, soundscapes were designed by experts by selecting or composing music containing a message that the experts wanted to convey. However, this approach is expensive in terms of time and effort. In this study, we investigated the replacement of this traditional approach to soundscape construction with deep-neural-network-based methods. Our requirement was that the soundscape constructed using our method should provide as impactful of an experience as that of the expert’s choice. If our requirements are met, our approach could spread the abundance soundscape-based media art at a lower cost than that which was previously possible.
The purpose of this study was to improve users’ artwork appreciation experience through auditory cues, such as classical music, in order to provide an abundant media art exhibition environment. These multi-modal sense-based exhibits [
11,
12,
13,
14] not only provide the user with a more impressive, realistic, and immersive experience, but also have potential cognitive and emotional impacts on the appreciator [
15,
16,
17]. Thus, we propose a soundscape-based methodology that uses deep neural networks to identify music associated with a given visual artwork through multi-modal data processing based on weakly supervised learning. With a multi-faceted approach, we measure whether our system can recommend music that can have an impact on the user. This soundscape-based media art will improve users’ experience of appreciating artworks through metaphorical and psychological interactions as well as through the direct and material appreciation of media art. Our method of soundscape design using deep learning can help disseminate media art exhibitions by reinterpreting them as immersive media. Furthermore, we hope that our proposed method will help people with visual impairments to appreciate artworks through its application to a multi-modal media art platform. The contributions of this study are described below.
Contributions
We developed a system for matching classical music and paintings using the concept of the soundscape through an improved deep neural network based on weakly supervised learning.
We propose a multi-faceted approach to measuring ambiguous concepts, such as the subjective fitness, implicit senses, immersion, and availability; then, an interdisciplinary discussion is also provided.
We show the improvements in the appreciation experience, such as metaphorical and psychological transferability, time distortion, and cognitive absorption, with in-depth experiments involving 70 participants.
4. Experiments
We performed three main experiments. The first experiment was a quantitative evaluation in order to determine how effective the multi-time-scale transform was and if it improved the task of audio feature extraction. In the second experiment, we evaluated how reasonably music and paintings were matched by our system. In the third experiment, we evaluated whether the user experience improved while appreciating artworks and the soundscape music.
Hypothesis 1. “If the soundscape music and painting are well matched, user appreciation of the experience will increase.”
The above hypothesis is the central focus of this study. However, assessing the match between the music and painting and the improvement of the user experience is challenging. We therefore developed strategies for measuring these factors, as described in the following subsections.
4.1. Audio Feature Extraction via the Multi-Time-Scale Transform
We performed an ablation and benchmark study to determine the effectiveness of the multi-time-scale transform. This experiment was conducted using the ESC-50 as an environmental sound classification dataset. The environmental settings were as follows: Pytorch 1.1 with Python 3.6. The batch size was 32, and a random shuffle was used. The optimizer used was Adam, and the initial learning rate was ; the scheduler used the cosine annealing learning rate scheduler. The hyper-parameter T value was 50. The hyper-parameters of training were obtained via a greedy search. The ranges of the greedy searches were as follows: The batch sizes were [8, 16, 32], and the initial learning rates were [, , …, , , ].
Table 2 shows the results of the ablation study. First, we conducted experiments using various backbone network settings, such as Mobilenet v2, Efficientnet, VGG, Resnet, and Densenet, with different widths and depths. However, networks other than Resnet and Densenet were omitted from
Table 2 due to their poor results. We increased the depth of the network-provided setting and the width by one, two, and four times the depth. The width showed the best performance when it was doubled, and the depth tended to be better, but not in all cases. Resnet101 with double width showed the best performance. In
Table 2, the application of our multi-time-scale transform to resnet resulted in about 2.8% higher accuracy than that of the baseline network, WaveMsNet.
Table 3 shows the sound classification results for various methods based on the ESC-50 benchmark. Methods without an asterisk in
Table 3 are examples of supervised learning methods that use backbone and data argumentation methods, while methods with an asterisk were trained using various methods, such as weakly (or self) supervised learning, between-class learning, or some other method. Our network showed better performance than that of baseline, but lower performance than that of the state-of-the-art methods. AclNet uses raw feature extractors, such as Envnet v2. This is a weakness for mutual learning; therefore, we did not use these methods. Ensemble-fusing CNN is a state-of-the-art method, but is not a single-model method. Our model performed as well as the single-model methods, and had advantages for mutual learning because it is spectrogram-based. Therefore, in this work, we selected our network as the baseline. However, in our future research, various augmentations with ensemble-fusing CNNs should be explored for applications in our work. Our model’s performance was about 12.2% poorer than that of WEANET. Therefore, work is needed to enhance the performance of our network. Nevertheless, our network has the advantage of being spectrogram-based and showed better performance than human performance. The validation results for each fold of our proposed method can be found in detail in
Table 4.
4.2. Relevance of Music–Artwork Matching
Definition.“Well-Matched Soundscape Music–Painting”: In this work, we defined details factor of “Well-Matched Soundscape Music–Painting” for a measurement of the quality of a match. Blind touch is intended to provide direct and material interactions that go beyond metaphorical and psychological interactions between the art and the appreciator. However, this does not imply the absence of traditional art. Therefore, in this study, we wanted to measure the metaphorical and psychological relevance of matching between music and a painting. The metaphorical scale measured how much the same implicit multi-sensory experience was transferred because blind touch is a multi-sensory-based media art. The psychological scale was measured through subjective fitness, which is an individual’s appreciation, because it is the most representative single scalar value of individual appreciation.
Experimental Setting: Twenty-four men and 16 women, all of whom are Korean, were evaluated in this study. Each group consisted of 10 people divided into four groups with a balanced gender ratio. The criteria for dividing the groups were as follows. First, the artistic culture of the subjects was evaluated via a simple test with the music and paintings used in this experiment. Second, the subjects were allocated to each group based on their test scores, which we averaged. The average music score was 2.3, the average painting score was 3.4, and a perfect score was 5. The roles of each group was as follows: Group 1 participated in subjective fitness experiments. Groups 2 and 3 participated in multi-sensory concordance experiments. Group 4 participated in multi-sensory concordance inverse measurement experiments. The groups were designed to avoid duplication of the experiments and to ensure that prior knowledge did not affect the experimental results. Furthermore, we did not provide any information other than information related to the experimental progress. This meant that the subjects did not know that the paintings and music were recommended via a deep neural network.
Figure 6 shows the paintings used in our experiment, and the music allocated to the paintings by our system is indicated in
Table 5.
4.2.1. Measurements of Subjective Fitness
Experimental Methods: Subjective fitness was measured in group 1, which comprised five men and five women. First, we input the paintings into our system and extracted the top three music recommendations. Second, when both the paintings and music were experienced at the same time, we measured the subjective fitness using a five-point Likert scale. We changed the paintings in the painting–music pairs to ensure that previous and subsequent experiments did not affect the current experiments; for example, painting 1–music 1, painting 2–music 1, painting 3–music 1, painting 4–music 1, then painting 1–music 2, painting 2–music 2, etc.
Table 6 shows the results of the subjective fitness experiments. Columns (a), (b), (c), and (d) are associated with
Figure 6, Music pieces 1, 2, and 3 are associated with
Table 5, and Music 4 was used in the soundscape music exhibition [
12] “Bunker de Lumières Van Gogh” held in Jeju, Korea. Therefore, Music 4 can be considered to be a type of ground truth labeled by experts. F-score is the fitness score, and the P-score is the preference ratio. The fitness score was measured on a five-point Likert scale, and the P-score was the proportion of people who scored three or more F-score points. Music 4 used the same values as in previous studies [
70]. Therefore, the system matched the artwork and music well in terms of subjective fitness. However, the P-score was not stable for Music 2, and Music 3 had a low F-score and P-score. The average F-score for all music items was 3.24, and the average P-score for all music items was 75%. The average F-score of Music 1 was 3.68, and the average P-score of Music 1 was 97.5%. This result is similar to that reported in a previous study, which reported an average F-score of 3.16, an average F-score for Music 1 of 3.74, an average P-score of 87.6%, and an average P-score of Music 1 of 94%. Thus, our system matched music and artworks well, but did not perform better than previous systems, despite the improved feature representation. This is likely due to the small size of the music database. However, the reliability of the system was demonstrated by the attainment of results similar to those reported in previous studies. In future studies, we need to assess whether the stability and performance of the system increase in response to the expansion of the music database.
4.2.2. Measurements of Implicit Multi-Sensory Concordance
Experimental Methods: This experiment was conducted in groups 2 and 3, each consisting of five men and five women. The members of group 2 first wrote a review after viewing the four paintings. Group 3 then wrote a review after listening to the 12 music items. At this time, the subjects were not provided any information about the content of the experiment; the only guidelines that they received were to use all five senses when assessing the artworks or music. Fourth, we measured multi-sensory concordance by comparing the sensory language similarity of the reviews. We mapped words in the review using a Korean sensory word classification table (Table [
71]). Sensory word tables are classified as gustatory, tactile, and temperature sensations; examples of such words are bitter, sweet, salty, sour, nutty, astringent, spicy, plain, rough, smooth, soft (texture), soft (material), hard, moist, sharp, cold, cool, lukewarm, warm, hot, etc.
Table 7 shows the results of the implicit multi-sensory concordance experiments. The D-score refers to the Euclidean distance and the C-score to cosine similarity.
Table 7 shows that similar sensory trends between the two groups were measured through the C-score. However, it can be confirmed that the C-score is not proportional to
Table 6’s P-score or F-score because the cosine similarity is advantageous for distance measurements in high-dimensional positive space, but the size of each dimension is not meaningful. In other words, only the trends in each dimension can be evaluated, while the size of the difference is difficult to assess. We used the D-score to overcome this limitation. The C-score in
Table 7 indicates the sensory similarity between both the music matched by experts and the music matched by our system. The D-score was similarly measured. The results show that not only can media art provide similar multi-sensory appreciation, but our system can also provide similar results to those provided by experts. However, this result has the limitation that a criterion was not provided to determine the significance according to magnitude of the score. Future studies should develop a criterion to determine the significance of the score’s magnitude.
4.2.3. Inverse Measurements of Implicit Multi-Sensory Concordance for Validation
Experimental Methods: This experiment was conducted in group 4, which comprised nine men and one woman. The purpose of this experiment was to verify the experiments in
Section 4.2.2. A questionnaire was created based on a multi-sensory word table constructed in
Section 4.2.2. The experiment was conducted in the same order as that described in
Section 4.2.1, but the questionnaire was completed instead of a written review. The questionnaire evaluated the extent to which the appreciator agreed with the sensory table using a five-point Likert scale.
Table 8 presents the results of the inverse implicit multi-sensory concordance experiment. The A-score is the agreement scale, which was the average score obtained using the five-point Likert scale. The C-score is the cosine similarity of the A-scores calculated based on an active value (3 or higher). The purpose of this experiment was to measure how much the appreciator agreed with the implicit multi-sensory data. The A-scores and C-scores were low because of the failure of the experiment. Several problems were encountered during this experiment. The first problem was the mechanical marking phenomenon. When responding to the questionnaire, subjects gave low scores to senses that were mechanically opposed to the first high score. This was in contrast to the phenomenon where opposite senses were expressed together in the free reviews of appreciation. The second problem was the phenomenon of monotonous responses, which occurred when subjects became familiar with the experiment. This phenomenon demonstrated the tendency of subjects to exclude complex senses before appreciation. For example, the subjects gave low scores to combinations of options, such as the bitter taste of wine and the sour taste of candy, before even listening to music. Thus, in further studies, the validation of the experimental design should be improved.
4.3. Improvement of the Appreciation Experience with the Soundscape
Definition. “Improvement of the appreciation experience”: In this work, we defined the measurable factor of “Improvement of the appreciation experience” as the combination of appreciation and subjective satisfaction; however, this is an ambiguous concept. To address this, we proposed an evaluation method for the immersion based on the flow theory of Csikszentmihalyi. We differentiated flow and cognitive immersion in this study according to flow theory. Flow was indirectly measured via time distortion phenomena, and cognitive absorption was measured through a simple test of working memory and attention concentration. The subjective satisfaction score of the appreciation experience comprised experiences of the environment and appreciation. The subjective environmental satisfaction was not related to improvements in the appreciation experience; however, the environmental score was a useful indicator of if the environmental setting was appropriate. Appreciation scores were measured with a questionnaire based on the SSID [
72] and WHO-5 Well-Being indices. The SSID is an evaluation index for soundscapes, and the WHO-5 index is an evaluation index for quality of life. Our questionnaire was prepared based on the SSID and WHO-5 Well-Being indices.
Experimental Setting: The participants in this experiment included a total of 29 men and 1 woman. They were divided into three groups, with each group consisting of 10 participants. The roles of each group were as follows: Groups 5 and 6 participated in the immersion experiments. Group 7 participated in the subjective satisfaction experiments. The criteria for dividing the groups and other experimental conditions were similar to those described for the previous experiments.
4.3.1. Measurements of Immersion
Experimental Methods: This experiment was conducted in groups 5 and 6. The purpose of this experiment was to indirectly measure immersion. Time distortion was measured for the flow measurement experiments. Group 5 appreciated artworks without listening to music, and group 6 appreciated artworks while listening to music. Time distortion was measured as the subjective assessment of time between the two groups, and flow was evaluated indirectly from the time distortion measurements. Cognitive absorption was measured with a simple test of working memory and attention concentration. This test asked questions about the color, location, shape, and texture of an object or scene.
Table 9 presents the measurement results for the immersion experiment. Each row presents the subjective times that participants felt while appreciating the exhibition. The ground truth refers to the real length of the piece of music. The
p-value is the result of the
t-Test, which was performed under the assumption that the variances were different. The results of the time distortion experiment indicate that group 5 predicted a time closer to the ground truth than group 6. In particular, participants in group 6 felt that more time had elapsed than actually had. Two interesting phenomena were discovered during the analysis of the interviews. The first was the “sleepy” phenomenon. When music pieces 1 or 4 were played, participants in group 5 did not experience sleepiness, while more than 80% of the participants in group 6 experienced sleepiness. The subjective viewing time was 2 m 36 s on average, while the experimental time was longer than the appropriate viewing time, ranging from 3 m 37 s to 6 m 52 s. The second phenomenon was the phenomenon of ambiguous answers. Group 5 answered with specific times, such as 4 m 20 s and 5 m 30 s, while group 6 tended to answer with ambiguous numbers, such as “about 5 minutes” and “about 10 minutes”. In the experiments on cognitive absorption, the averages of the answers given by each group in our simple test were used. The perfect score for this test was 10. As shown in
Table 8, groups 5 and 6 had an average score difference of 2.2. This can be attributed to the sleepy phenomenon. Thus, with the results obtained from this experiment, it was not possible to accurately confirm whether cognitive absorption was affected or not. However, music is qualitatively conducive to cognitive absorption. Examples of answers from participants included the following:
“The bouncy rhythm in the song reminded me of stars”; “The music felt like a young man in the country was leaving the village, and it helped me remember because there was a real village in the artwork”; “it wasn’t hard to find the location of the lover because I watched the couple carefully because of the sentimental music being played.” 4.3.2. Measurements of Subjective Satisfaction
Experimental Methods: This experiment was conducted in group 7. The purpose of this experiment was to measure the subjective satisfaction score. The subjects appreciated the paintings with the top music and filled out a questionnaire that was prepared based on the SSID and WHO-5 Well-Being indices.
The subjective satisfaction scores, which were based on the environment and the appreciation experience, are shown in
Table 10. The top three questions were about environmental experience. The environmental satisfaction scores ranged from 3.3 to 3.6, which are fairly high scores. For the first question, discomfort was associated with 0 points and comfort with 5 points. Therefore, this experiment was appropriately constructed to assess the appreciation experience. The subjective satisfaction scores of appreciation ranged from 3.1 to 4.0. Therefore, soundscape music can aid in appreciation.
5. Discussion
In this section, we would like to conduct an interdisciplinary analysis of how this metaphorical and psychological transfer could have occurred. This interdisciplinary approach consists of five parts: The first is the consideration of the inter-sensory transition phenomenon through the concept of synesthesia; the second is an artistic approach based on the characteristics of the paintings that we used in our experiments; the third is the consideration of the characteristics of deep neural networks in connection with the concept of synesthesia and the characteristics of our paintings; the fourth includes the limitations of this study and the future directions of development; the fifth is the overall blueprint of our research.
First, we deal with synesthesia by focusing on the transition of senses. References [
15,
16,
17] show that media art can give potential cognitive and emotional impacts. We would like to discuss why this effect could occur. Synesthesia is a blending of the senses in which the stimulation of one modality simultaneously produces sensations in a different modality. Synesthesia is known to affect four percent of the population. However, via color–alphabet experiments with 400 people, studies [
73] have shown that people without synesthesia perceive synesthesia but do not consciously recognize it. In other words, non-synesthesia implies that synesthesia is being perceived unconsciously. In particular, the study by [
74] dealt with implicit associations between color and sound, particularly showing that various color properties can be mapped to acoustic properties, such as tone and volume. The study by [
75] showed that the properties of sounds can be associated with colors. The study by [
76] also demonstrated empirical investigations of the associations between auditory and visual perception. Based on these experimental results, transitions between implicit senses are possible; for example, value and hue can be mapped onto acoustic properties, such as pitch and loudness, pitch can be associated with color lightness, loudness can be mapped onto greater visual saliency, high loudness can be associated with orange/yellow rather than blue, and chroma can have a relationship with sound intensity. The following sections cover the features of our artworks.
Second, the artworks used in our experiments were Impressionistic paintings. Impressionism is a trend in art that focuses on colors, lighting, and textures, and it is characterized by the ability to describe nature in the changing colors of light and to accurately and objectively record the visible world using the momentary effects of colors and shades. In particular, three of our four paintings were by Van Gogh, and he is known as a post-impressionist. Van Gogh tried to thicken paint through the Impasto technique to express the maximum texture and create a three-dimensional effect. In addition, Van Gogh did not describe objects in the same way, but captured his emotion and feeling very strongly through the touch of his brush. This distorted form, intense color, and simplification of the form were sought to express emotions. The other painting was “Woman with a Parasol in a Garden”. Renoir was an early impressionist who understood how to express the effects of light, and was known to use black to express shadows. Therefore, the paintings we used are from the early to the late Impressionist period, and they are characterized by their good representation of colors, light, and texture. Therefore, we selected our paintings in consideration of these features. The characteristics of the paintings were analyzed and the possibilities of sensory transitions were addressed.
Third, we discuss how a deep neural network was able to recommend music and paintings in the above consideration. The study by [
77] dealt with the perspectives of beauty, sentiment, and remembrance of art in deep neural networks. The study by [
77] quantitatively and qualitatively analyzed three subjective aspects of human consciousness: image features in relation to aesthetics, sentiment, and memorability. This study indicated that the CNN considered features of various aspects, such as color, intensity, harmony of colors, object emphasis, pattern, art style, semantics, genre, and content. This study also addressed how deep neural networks judge and remember emotional and artistic values by exploring predicted aesthetic and emotional memory scores in the context of art history. In particular, these experiments also showed that aesthetic and emotional scores and color correlations are consistent with common assumptions. These experimental results show that deep neural networks consider factors that can cause sensory transference. Therefore, these features make our sensory transition possible.
Fourth, we address the limitations of our experiment and the directions of future study. There is a limitation in that no global solution was shown. The limitations can be seen from three perspectives: The first is that our research only dealt with Impressionist paintings from an artistic point of view. Therefore, further study is needed as to whether such studies can be applied to different artistic styles beyond Impressionism. The second is the problem of generalization; this study is not representative of the general population because the experiments were conducted on people in their 20s in Korea. In addition, research should be conducted on whether the experimental results are universal or if they are due to social learning. If the results obtained are due to social learning, extended studies should also be carried out from various perspectives, including culture, age, and local areas. Future research will therefore need to extend these local solutions into a global solution. The third is the effectiveness of media art, which can have potential cognitive and emotional impacts. However, our research focused on the potential cognitive impacts of the experiments. Therefore, we measured emotional impact with the F-score, the most representative and implicit single scalar value. These representative features of F-score can be influenced by various factors, such as emotions and environments, as well as by potential cognitive factors. Therefore, more detailed measurements should be studied in the future.
Fifth, our experiment was intended to enhance the experience of appreciation of media art through the effect of visual and auditory transitions. The transitions of these senses can be expected not only in vision and hearing, but also between various other senses, such as vision and smell or vision and touch. In this study, we aimed to create a platform that guides blind people through sensory transition they appreciate media art. Finally, we hope that this multi-sensory experience will be harmonized and become abundant in media art. Furthermore, we hope that our proposed method will be applied to multi-modal media art platforms to help visually impaired people appreciate artworks.
6. Conclusions
This study described the construction of a soundscape-based exhibition environment using deep neural networks. The baseline was improved by different perspectives, such as modeling methods, learning methods, and domain adaptation methods. In addition, the soundscape music selected by our system was output via hyper-orientated speakers to improve the appreciation experience. To measure the improvements in user experience, we devised a soundscape music evaluation method and an appreciation experience evaluation method and conducted extensive experiments with 70 subjects. However, our research suffered from three major limitations. First, this was not state-of-the-art performance from the perspective of the deep neural network. In particular, models with low accuracy from the feature extraction perspective were selected for mutual learning. Therefore, the development of models with improved audio feature representation is required. Second, our current database comprised only 2000 music items. In addition, these 2000 music pieces were limited from the genre perspective. A wide spectrum of databases containing music from various cultures and times should be used in future studies. Third, we conducted experiments using 70 individuals, but each experiment was conducted on a group of only 10 people. Therefore, our experimental results cannot be generalized. Large-scale experiments need to be conducted to generalize these experimental results. This study shows the results of a pilot test with sighted test participants. We hope that the results of this study help people with visual impairments to appreciate art and that they help promote the cultural enjoyment rights of people with visual impairments.