MuIm: Analyzing Music–Image Correlations from an Artistic Perspective
Abstract
1. Introduction
- Exploring Multilayered Music–Image Correlation: To the best of our knowledge, we are pioneers in exploring the correlation of music–image data from an artistic perspective, incorporating multilayered information that encompasses both semantic and emotional dimensions within these modalities.
- Introduction of the MuIm Dataset: We introduce a novel, comprehensive dataset named MuIm, which is divided into two segments: artistic Music–Art and realistic Music–Image datasets. This dataset enables the study of music–image correlations by integrating both semantic and emotional data, utilizing a hybrid emotion model to capture complex, layered emotions within artistic content.
- Evaluation and Validation of the Dataset: Our statistical analysis validated the dataset’s balanced emotional and semantic diversity, emphasizing its superiority over existing datasets. A user study further confirmed the strong emotional alignment, particularly with multi-dimensional emotion vectors, and high participant ratings for pairs with well-matched semantic and emotional content.
2. Related Work
2.1. State-of-the-Art Methods in Music–Image Correlation
2.1.1. Semantic Relation
2.1.2. Non-Semantic Relation
2.2. Emotional Models in Art and Music
3. Multi-Layered Music–Image Correlation
3.1. Analysis of Multilayered Information
3.2. Artistic Media: Emotional Complexity
- Categorical Component: Utilizes 28 distinct emotions, aligning with CES by defining clear emotion categories.
- Dimensional Component: Adds a 9-point Likert scale to capture emotion intensity, similarly to VA-DES, for nuanced representation.
- Hybrid Nature: Supports multiple emotions per input, blending categorical and intensity-based approaches to represent complex, overlapping emotional states (a minimal representation sketch follows this list).
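To make this hybrid representation concrete, the following minimal sketch shows one way such an annotation could be stored and vectorized. The category names beyond those mentioned in this paper, the function name, and the normalization are illustrative assumptions rather than the dataset's actual storage format.

```python
import numpy as np

# Illustrative subset of the 28 emotion categories; the full label set used
# for the MuIm artistic segment is defined in the paper.
EMOTION_CATEGORIES = [
    "Tender/Longing", "Romantic/Loving", "Scary/Fearful", "Goose bumps",
    "Sad/Depressing", "Calm/Relaxing", "Joyful/Cheerful", "Energizing",
    # ... remaining categories up to 28 ...
]

def to_emotion_vector(annotation: dict, categories=EMOTION_CATEGORIES) -> np.ndarray:
    """Map a hybrid annotation {category: intensity on the 9-point Likert scale}
    to a dense vector with one intensity slot per category (0 = not selected)."""
    vec = np.zeros(len(categories), dtype=np.float32)
    for emotion, intensity in annotation.items():
        if not 1 <= intensity <= 9:
            raise ValueError(f"intensity must lie on the 1-9 Likert scale, got {intensity}")
        vec[categories.index(emotion)] = intensity / 9.0  # normalize to (0, 1]
    return vec

# A single artwork can carry several overlapping emotions at different strengths.
print(to_emotion_vector({"Tender/Longing": 7, "Sad/Depressing": 4, "Calm/Relaxing": 2}))
```

Several non-zero entries in the resulting vector encode the overlapping emotional states that purely categorical or purely dimensional models cannot represent directly.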
Realistic Media: Layered Information
- Contextual/Historical Layer [44]: While providing insights into the origin, significance, or intention of the content, such as historical periods or social concepts, this layer was not included in this study to maintain simplicity.
3.3. Limitations of Current State-of-the-Art Methods
4. MuIm Dataset
4.1. Data Acquisition and Labeling
4.1.1. Artistic Segment
4.1.2. Realistic Segment
- Web Media Metadata Collection: Due to the large volume of web data and limitations imposed by the hardware and bandwidth, URLs and human-defined tags were gathered to provide initial metadata, ensuring that only relevant data for our task were downloaded and processed.
- Tag Pre-processing: Media URLs without tags were filtered out, and word tags were validated using transformer models, such as GPT-3, to ensure semantic accuracy and remove invalid tags.
- Assurance of Emotional and Semantic Tags: Music data included separate mood tags, which, when combined with user-defined tags, provided accurate emotional information, while semantic tagging relied primarily on user-defined tags. In contrast, image data depended solely on user-defined tags for both emotional and semantic labeling. Due to potential noise, contextual variability, or inaccuracies in user-defined tags, a processing pipeline was implemented to refine and extract accurate semantic and emotional information for both image and music data. This pipeline, illustrated in Figure 4, consists of the following main components:
- Multi-Layer Information: To obtain media with dual-layer information (emotional and semantic), sentiment analysis frameworks [56,57] were used to select media that contained at least one sentimental tag in its user-defined tag list. This was particularly crucial for the image data, as they relied solely on user-defined tags.
- Information Consistency: Finally, to ensure label consistency across the tags, a large-language model (LLM) [54] was used to determine the intersection between the human labels and AI-generated captions, ensuring consistency and accuracy. Examples of the image and music datasets for this process can be found in Table 4 and Table 5.
- Data Post-processing: Based on the metadata processing, only relevant media (music/images) were downloaded. As a final post-processing step, a vision transformer (ViT) was employed to filter out duplicate images, and an NSFW detection model [58] was used to remove NSFW content from the images. A minimal sketch of the sentiment-filtering and consistency steps described above follows this list.
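The following minimal sketch illustrates the sentiment-based tag filtering and the label-consistency idea from the pipeline above. It uses NLTK's VADER analyzer and a simple word-overlap intersection as a stand-in for the LLM-based consistency step; the threshold value and helper names are assumptions, not the exact implementation used for MuIm.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
_sia = SentimentIntensityAnalyzer()

def has_sentimental_tag(tags, threshold=0.3):
    """Keep media whose user-defined tag list contains at least one emotionally
    loaded tag (|compound| above a hand-picked threshold)."""
    return any(abs(_sia.polarity_scores(t)["compound"]) >= threshold for t in tags)

def consistent_tags(tags, ai_caption):
    """Simplified stand-in for the consistency step: keep human tags that also
    appear in the AI-generated caption (the paper uses an LLM for this intersection)."""
    caption_words = set(ai_caption.lower().split())
    return [t for t in tags if t.lower() in caption_words]

tags = ["tulips", "garden", "beauty", "bouquet", "spring"]   # user-defined tags
caption = "a bouquet of colorful tulips in a store"          # AI-generated caption
if has_sentimental_tag(tags):                                # 'beauty' carries sentiment
    print(consistent_tags(tags, caption))                    # -> ['tulips', 'bouquet']
```

In the actual pipeline, an LLM [54] performs the intersection between human labels and AI-generated captions, which handles synonyms and paraphrases that plain word overlap would miss.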
4.1.3. Data Diversity and Balance
4.2. Music–Visual Pairing Methodology
4.2.1. Artistic Segment Pairing
4.2.2. Realistic Segment Pairing
- Sentiment Analysis: Sentiment analysis was performed on the tags to identify words that express positive or negative sentiments using the VADER sentiment analysis model [60].
- Emotional Vector Creation: Sentiment scores of emotionally similar tag words were averaged to generate an 8-dimensional affective vector, capturing nuanced emotional undertones that may span multiple emotions represented in the media tags (a minimal sketch of this step follows).
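A minimal sketch of this step is given below, assuming a Plutchik-style set of eight affective dimensions and a hand-written tag-to-dimension lookup purely for illustration; in the actual pipeline the grouping of emotionally similar tags follows the methodology described above, not a literal dictionary.

```python
import nltk
import numpy as np
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

# Eight affective dimensions (a Plutchik-style grouping is assumed here).
DIMENSIONS = ["joy", "trust", "fear", "surprise", "sadness", "disgust", "anger", "anticipation"]

# Hypothetical lookup from tag words to dimensions; in practice this grouping
# would come from an emotion lexicon or a similarity model, not a literal dict.
TAG_TO_DIM = {"happiness": "joy", "smile": "joy", "playful": "joy",
              "charming": "trust", "dream": "anticipation"}

def affective_vector(tags):
    """Average VADER sentiment scores of emotionally similar tags into an
    8-dimensional affective vector (one slot per dimension)."""
    sia = SentimentIntensityAnalyzer()
    sums, counts = np.zeros(len(DIMENSIONS)), np.zeros(len(DIMENSIONS))
    for tag in tags:
        dim = TAG_TO_DIM.get(tag)
        if dim is None:          # tag carries no recognised emotional undertone
            continue
        i = DIMENSIONS.index(dim)
        sums[i] += sia.polarity_scores(tag)["compound"]
        counts[i] += 1
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

print(affective_vector(["happiness", "smile", "charming", "dream", "tree"]))
```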
5. Dataset Analysis
5.1. Statistical Analysis
5.1.1. Artistic Data
- Some emotions demonstrated positive correlation values, such as ‘Goose bumps’ and ‘Scary/Fearful’, indicating a relatively strong alignment between the emotions evoked by the music and images. This suggests that similar emotional responses may be consistently induced across both modalities.
- Certain emotions showed negative correlations, such as ‘Tender/Longing’ and ‘Romantic/Loving’, implying that music and images can evoke these emotions differently, potentially leading to distinct variations in their emotional impact.
- Emotions with correlations close to zero suggest little to no observable alignment between how music and images evoke these specific emotional responses (a minimal sketch of the per-emotion correlation computation appears after this list).
- For images, emotions exhibit a broad distribution of correlations, indicating that each emotion carries distinct and meaningful information. This suggests that a diverse range of emotions are necessary to fully capture the emotional responses elicited by images.
- For music, certain emotions display strong correlations, suggesting that a smaller number of broader emotional categories may be sufficient to capture music-induced emotional experiences. This reflects a more condensed emotional response structure.
- This difference in emotional correlation patterns highlights the modality-specific nature of emotional responses, where images evoke a wider range of distinct emotions, while music tends to elicit a more interconnected or overlapping set of emotional responses.
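To make the cross-modal comparison concrete, the sketch below computes a Pearson coefficient per emotion dimension from two aligned annotation matrices (music-evoked and image-evoked intensities of the paired samples). The array shapes, function name, and random demo data are illustrative assumptions, not the paper's analysis code.

```python
import numpy as np

def per_emotion_correlation(music_emotions: np.ndarray, image_emotions: np.ndarray) -> np.ndarray:
    """Pearson correlation per emotion dimension across paired music/image samples.
    Both inputs have shape (n_pairs, n_emotions)."""
    assert music_emotions.shape == image_emotions.shape
    corrs = []
    for k in range(music_emotions.shape[1]):
        x, y = music_emotions[:, k], image_emotions[:, k]
        if x.std() == 0 or y.std() == 0:        # constant column: correlation undefined
            corrs.append(np.nan)
        else:
            corrs.append(np.corrcoef(x, y)[0, 1])
    return np.array(corrs)

# Positive values (e.g., 'Goose bumps' and 'Scary/Fearful' in the paper's analysis)
# indicate aligned responses across modalities; values near 0 indicate little alignment.
rng = np.random.default_rng(0)
print(per_emotion_correlation(rng.random((100, 28)), rng.random((100, 28))).round(2))
```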
5.1.2. Realistic Data
5.2. User Study
5.2.1. Setup
5.2.2. Result Evaluation
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Correction Statement
References
- Gaut, B. Art, Emotion and Ethics; Oxford University Press: Oxford, NY, USA, 2007. [Google Scholar]
- Giachanou, A.; Crestani, F. Like It or Not: A Survey of Twitter Sentiment Analysis Methods. ACM Comput. Surv. 2016, 49, 1–41. [Google Scholar] [CrossRef]
- Zhao, S.; Ma, Y.; Gu, Y.; Yang, J.; Xing, T.; Xu, P.; Hu, R.; Chai, H.; Keutzer, K. An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 303–311. [Google Scholar] [CrossRef]
- Zhao, S.; Yao, H.; Gao, Y.; Ding, G.; Chua, T.S. Predicting Personalized Image Emotion Perceptions in Social Networks. IEEE Trans. Affect. Comput. 2018, 9, 526–540. [Google Scholar] [CrossRef]
- Hassan, S.Z.; Ahmad, K.; Al-Fuqaha, A.; Conci, N. Sentiment Analysis from Images of Natural Disasters. In Image Analysis and Processing—ICIAP 2019; Ricci, E., Rota Bulò, S., Snoek, C., Lanz, O., Messelodi, S., Sebe, N., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2019; pp. 104–113. [Google Scholar] [CrossRef]
- She, D.; Yang, J.; Cheng, M.M.; Lai, Y.K.; Rosin, P.L.; Wang, L. WSCNet: Weakly Supervised Coupled Networks for Visual Sentiment Classification and Detection. IEEE Trans. Multimed. 2020, 22, 1358–1371. [Google Scholar] [CrossRef]
- Truong, Q.T.; Lauw, H.W. Visual Sentiment Analysis for Review Images with Item-Oriented and User-Oriented CNN. In Proceedings of the 25th ACM International Conference on Multimedia, MM’17, New York, NY, USA, 21–25 October 2017; pp. 1274–1282. [Google Scholar] [CrossRef]
- Zhao, S.; Jia, Z.; Chen, H.; Li, L.; Ding, G.; Keutzer, K. PDANet: Polarity-consistent Deep Attention Network for Fine-grained Visual Emotion Regression. In Proceedings of the 27th ACM International Conference on Multimedia, MM’19, New York, NY, USA, 21–25 October 2019; pp. 192–201. [Google Scholar] [CrossRef]
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2556–2565. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar] [CrossRef]
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.J.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef]
- Doh, S.; Won, M.; Choi, K.; Nam, J. Toward Universal Text-To-Music Retrieval. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–4. [Google Scholar] [CrossRef]
- Doh, S.; Choi, K.; Lee, J.; Nam, J. LP-MusicCaps: LLM-Based Pseudo Music Captioning. arXiv 2023, arXiv:2307.16372. [Google Scholar] [CrossRef]
- Wu, Y.; Chen, K.; Zhang, T.; Hui, Y.; Berg-Kirkpatrick, T.; Dubnov, S. Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
- Xing, B.; Zhang, K.; Zhang, L.; Wu, X.; Dou, J.; Sun, S. Image–Music Synesthesia-Aware Learning Based on Emotional Similarity Recognition. IEEE Access 2019, 7, 136378–136390. [Google Scholar] [CrossRef]
- Achlioptas, P.; Ovsjanikov, M.; Haydarov, K.; Elhoseiny, M.; Guibas, L. ArtEmis: Affective Language for Visual Art. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11564–11574. [Google Scholar] [CrossRef]
- Mohamed, Y.; Khan, F.F.; Haydarov, K.; Elhoseiny, M. It is Okay to Not Be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 21231–21240. [Google Scholar] [CrossRef]
- Aljanaki, A.; Yang, Y.H.; Soleymani, M. Developing a benchmark for emotional analysis of music. PLoS ONE 2017, 12, e0173392. [Google Scholar] [CrossRef]
- Wang, S.Y.; Wang, J.C.; Yang, Y.H.; Wang, H.M. Towards time-varying music auto-tagging based on CAL500 expansion. In Proceedings of the 2014 IEEE International Conference on Multimedia and Expo (ICME), Chengdu, China, 14–18 July 2014; pp. 1–6. [Google Scholar] [CrossRef]
- Aljanaki, A.; Soleymani, M. A Data-driven Approach to Mid-level Perceptual Musical Feature Modeling. In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018. [Google Scholar]
- Soleymani, M.; Caro, M.N.; Schmidt, E.M.; Sha, C.Y.; Yang, Y.H. 1000 songs for emotional analysis of music. In Proceedings of the 2nd ACM International Workshop on Crowdsourcing for Multimedia, CrowdMM’13, New York, NY, USA, 21–25 October 2013; pp. 1–6. [Google Scholar] [CrossRef]
- Ullah, U.; Lee, J.S.; An, C.H.; Lee, H.; Park, S.Y.; Baek, R.H.; Choi, H.C. A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint. Sensors 2022, 22, 6816. [Google Scholar] [CrossRef]
- Zhao, S.; Yao, X.; Yang, J.; Jia, G.; Ding, G.; Chua, T.S.; Schuller, B.W.; Keutzer, K. Affective Image Content Analysis: Two Decades Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 6729–6751. [Google Scholar] [CrossRef]
- Li, T.; Liu, Y.; Owens, A.; Zhao, H. Learning Visual Styles from Audio-Visual Associations. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; pp. 235–252. [Google Scholar] [CrossRef]
- Owens, A.; Isola, P.; McDermott, J.; Torralba, A.; Adelson, E.H.; Freeman, W.T. Visually Indicated Sounds. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2405–2413. [Google Scholar] [CrossRef]
- Nakatsuka, T.; Hamasaki, M.; Goto, M. Content-Based Music-Image Retrieval Using Self- and Cross-Modal Feature Embedding Memory. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2173–2183. [Google Scholar] [CrossRef]
- Hong, S.; Im, W.; Yang, H.S. CBVMR: Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, ICMR’18, New York, NY, USA, 5 June 2018; pp. 353–361. [Google Scholar] [CrossRef]
- Yi, J.; Zhu, Y.; Xie, J.; Chen, Z. Cross-Modal Variational Auto-Encoder for Content-Based Micro-Video Background Music Recommendation. IEEE Trans. Multimed. 2023, 25, 515–528. [Google Scholar] [CrossRef]
- Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, A.P.; Toderici, G.; Varadarajan, B.; Vijayanarasimhan, S. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv 2016, arXiv:1609.08675. [Google Scholar]
- Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar] [CrossRef]
- Chen, L.; Srivastava, S.; Duan, Z.; Xu, C. Deep Cross-Modal Audio-Visual Generation. In Proceedings of the on Thematic Workshops of ACM Multimedia 2017, Workshops’17, New York, NY, USA, 23–27 October 2017; pp. 349–357. [Google Scholar] [CrossRef]
- Zhao, S.; Li, Y.; Yao, X.; Nie, W.; Xu, P.; Yang, J.; Keutzer, K. Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space. In Proceedings of the 28th ACM International Conference on Multimedia, MM’20, New York, NY, USA, 10–16 October 2020; pp. 2945–2954. [Google Scholar] [CrossRef]
- Verma, G.; Dhekane, E.G.; Guha, T. Learning Affective Correspondence between Music and Image. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 3975–3979. [Google Scholar] [CrossRef]
- Arandjelovic, R.; Zisserman, A. Look, Listen and Learn. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 609–617. [Google Scholar] [CrossRef]
- Surís, D.; Vondrick, C.; Russell, B.; Salamon, J. It’s Time for Artistic Correspondence in Music and Video. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10554–10564. [Google Scholar] [CrossRef]
- Arandjelovic, R.; Zisserman, A. Objects that Sound. In Computer Vision—ECCV 2018; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 435–451. [Google Scholar]
- Ekman, P. An argument for basic emotions. Cogn. Emot. 1992, 6, 169–200. [Google Scholar] [CrossRef]
- Eerola, T.; Vuoskoski, J.K. A comparison of the discrete and dimensional models of emotion in music. Psychol. Music. 2011, 39, 18–49. [Google Scholar] [CrossRef]
- Turnbull, D.; Barrington, L.; Torres, D.; Lanckriet, G. Towards musical query-by-semantic-description using the CAL500 data set. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’07, New York, NY, USA, 23–27 July 2007; pp. 439–446. [Google Scholar] [CrossRef]
- Cowen, A.S.; Fang, X.; Sauter, D.; Keltner, D. What music makes us feel: At least 13 dimensions organize subjective experiences associated with music across different cultures. Proc. Natl. Acad. Sci. USA 2020, 117, 1924–1934. [Google Scholar] [CrossRef] [PubMed]
- Cowen, A.S.; Keltner, D. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proc. Natl. Acad. Sci. USA 2017, 114, E7900–E7909. [Google Scholar] [CrossRef]
- Drossos, K.; Lipping, S.; Virtanen, T. Clotho: An Audio Captioning Dataset. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–9 May 2020; pp. 736–740. [Google Scholar] [CrossRef]
- Kosti, R.; Alvarez, J.M.; Recasens, A.; Lapedriza, A. Context Based Emotion Recognition Using EMOTIC Dataset. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2755–2766. [Google Scholar] [CrossRef]
- Lee, C.C.; Lin, W.Y.; Shih, Y.T.; Kuo, P.Y.P.; Su, L. Crossing You in Style: Cross-modal Style Transfer from Music to Visual Arts. In Proceedings of the 28th ACM International Conference on Multimedia, MM’20, New York, NY, USA, 12–16 October 2020; pp. 3219–3227. [Google Scholar] [CrossRef]
- Malheiro, R.; Panda, R.; Gomes, P.; Paiva, R.P. Bi-modal music emotion recognition: Novel lyrical features and dataset. In Proceedings of the 9th International Workshop on Machine Learning and Music—MML 2016, Riva del Garda, Italy, 23 September 2016. [Google Scholar]
- Aljanaki, A.; Yang, Y.H.; Soleymani, M. Emotion in Music Task at MediaEval 2015. In Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation, Barcelona, Spain, 16–17 October 2014. [Google Scholar]
- Agostinelli, A.; Denk, T.I.; Borsos, Z.; Engel, J.; Verzetti, M.; Caillon, A.; Huang, Q.; Jansen, A.; Roberts, A.; Tagliasacchi, M.; et al. MusicLM: Generating Music From Text. arXiv 2023, arXiv:2301.11325. [Google Scholar] [CrossRef]
- Law, E.; West, K.; Mandel, M.I.; Bay, M.; Downie, J.S. Evaluation of Algorithms Using Games: The Case of Music Tagging. In Proceedings of the International Society for Music Information Retrieval Conference, Kobe, Japan, 26–30 October 2009. [Google Scholar]
- Mei, X.; Meng, C.; Liu, H.; Kong, Q.; Ko, T.; Zhao, C.; Plumbley, M.D.; Zou, Y.; Wang, W. WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research. arXiv 2023, arXiv:2303.17395. [Google Scholar] [CrossRef]
- Huang, Q.; Park, D.S.; Wang, T.; Denk, T.I.; Ly, A.; Chen, N.; Zhang, Z.; Zhang, Z.; Yu, J.; Frank, C.; et al. Noise2Music: Text-conditioned Music Generation with Diffusion Models. arXiv 2023. [Google Scholar] [CrossRef]
- Kim, C.D.; Kim, B.; Lee, H.; Kim, G. AudioCaps: Generating Captions for Audios in The Wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 3–5 June 2019; pp. 119–132. [Google Scholar] [CrossRef]
- WikiArt.org—Visual Art Encyclopedia. Available online: https://www.wikiart.org/ (accessed on 10 January 2024).
- Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv 2023. [Google Scholar] [CrossRef]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
- Xiong, Z.; Lin, P.C.; Farjudian, A. Retaining Semantics in Image to Music Conversion. In Proceedings of the 2022 IEEE International Symposium on Multimedia (ISM), Naples, Italy, 5–7 December 2022; pp. 228–235. [Google Scholar] [CrossRef]
- explosion/spaCy. Available online: https://github.com/explosion/spacy-layout (accessed on 15 February 2024).
- Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009. [Google Scholar]
- Codd, A. ViT-Base NSFW Detector. 2023. Available online: https://huggingface.co/AdamCodd/vit-base-nsfw-detector (accessed on 16 February 2024).
- Hugging Face. Sentence-Transformers/All-MiniLM-L6-v2. 2023. Available online: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 (accessed on 20 February 2024).
- Hutto, C.; Gilbert, E. VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. In Proceedings of the International AAAI Conference on Web and Social Media, Ann Arbor, MI, USA, 1–4 June 2014; Volume 8, pp. 216–225. [Google Scholar] [CrossRef]
- Semeraro, A.; Vilella, S.; Ruffo, G. PyPlutchik: Visualising and comparing emotion-annotated corpora. PLoS ONE 2021, 16, e0256503. [Google Scholar] [CrossRef] [PubMed]
- Schlenker, P. Musical Meaning within Super Semantics. In Linguistics and Philosophy; Springer: Berlin/Heidelberg, Germany, 2021. [Google Scholar] [CrossRef]
Aspect | Categorical Model | Dimensional Model | Hybrid Model |
---|---|---|---|
Emotion Representation | Discrete, with a fixed set of basic emotions. | Continuous, using a 2D vector (e.g., valence-arousal). | Continuous, with a 28D vector for multiple emotions and intensities. |
Complexity of Emotions | Simplistic; one emotion at a time. | Flexible; captures a range but may miss overlaps. | Highly flexible; captures overlapping, complex emotions. |
Intensity of Emotions | Lacks intensity variation. | Includes intensity but limited in handling overlaps. | Uses intensity scales for continuous, overlapping emotions. |
Applicability | Suitable for single-dominant emotion labeling. | Captures broader spectra but limited for complex multi-emotions. | Designed for complex artistic/multimedia content. |
Interpretability | Easy to interpret; defined categories. | Moderate; requires understanding dimensions. | Allows detailed, nuanced emotional analysis. |
| Information Layer | Modality | Dataset | Domain | Annotations | Quantity | Pairing | ML Tasks |
|---|---|---|---|---|---|---|---|
| Affective | Visual (Image/Art) | [43] EMOTIC | Human Emotions | BBOX + CES (26) + DES (VAD) | 23,571 | | DIS |
| | | [16] ArtEmis | Affective Art Captioning | CES (9) + captions | 80,031 | | GEN |
| | | [17] ArtEmis 2.0 | Affective Art Captioning | CES (9) + captions + contrast | - | | GEN |
| | Music | [18] DEAM | Static Music Emotion | DES (VA) | 1802 | | DIS |
| | | [38] Soundtrack | Static Music Emotion | CES (4) | 360 | | DIS |
| | | [45] Bi-Modal | Static Music Emotion | CES (4) + Lyrics | 162 | | DIS |
| | | [39] CAL500 | Static Music Emotion | CES (18) | 502 | | DIS |
| | | [19] CAL500exp | Dynamic Music Emotion | CES (18) | 3223 | | DIS |
| | | [46] EMM | Static Music Emotion | DES (VA) | 489 | | DIS |
| | | [20] MPF | Static Music Emotion | Perceptual Features + DES (8) | 5000 | | DIS |
| | | [21] EAM | Dynamic Music Emotion | DES (VA) | 744 | | DIS |
| | Audio-Visual (Music-Image) | [32] IMEMNet | Music-Image | DES (VA) | 144,435 pairs | ED | DIS |
| | | [33] IMAC | Music-Image | CES (3) | - | 0/1 | DIS |
| | | [15] ASD | Music-Image | CES (8) | 250,000 pairs | ED, PCC | DIS |
| Semantic | Visual (Image/Art) | [9] Conceptual Captions | Image Captioning | Web Captions | 3.4M | | GEN |
| | | [10] COCO | Image Captioning | Human captions | 328K | | DIS |
| | | [11] VG | Image Captioning | Captions + QA + Scene Graphs | 108K | | DIS + GEN |
| | Audio (Music) | [12] ECALS | Text-Music Retrieval | Tag-Captions | 517,022 | | DIS |
| | | [47] MC | Text to Music | Captions | 5.5K | | GEN |
| | | [48] MTT | Music Retrieval | Tags (Semantic + Mood) | 26K | | DIS |
| | | [13] LP-MusicCaps | Audio Captioning | Tags + Pseudo captions | 514K | | GEN |
| | | [49] WavCaps | Audio Captioning | Captions | 400K | | DIS + GEN |
| | | [14] LAION-Audio-630K | Text-Audio Representation | Tags (K2C) + Captions | 633,526 | | DIS |
| | | [50] MuLaMCap | Text to Music | Pseudo Captions | 400K | | GEN |
| | | [51] AudioCaps | Audio Captioning | Captions | 52,904 | | GEN |
| | | [42] Clotho | Audio Captioning | Captions | 5929 | | GEN |
| | Audio-Visual (Music-Image) | [24] ITW | Audio-Visual | Egocentric Videos | 94 | Temporal | GEN |
| | | [25] GH | Audio-Visual | Videos | 977 | Temporal | DIS + GEN |
| | | [26] MCA | Music-Image | Music Cover Art | 78,325 | 1-to-1 | DIS |
| | | [27] HIMV-200K | Music-Video Retrieval | Micro Video | 200K | Temporal | DIS |
| | | [28] TT-150K | Music-Video Retrieval | Micro Video | 150K | Temporal | DIS |
| | | [29] YouTube-8M | Video Classification | Labels | 8M | Temporal | DIS |
| | | [30] AudioSet | Audio Event Recognition | Audio classes | 1.2M | Temporal | DIS |
| | | [31] Sub-URMP + INIS | Music-Visual | Video + Labels | 17,555 + 7200 | Temporal | GEN |
| Affective + Semantic | Audio-Visual (Music-Image) | MuIm | Music-Image | Artistic: 28D emotions + captions | 50K pairs | Several Distance Metrics | DIS + GEN |
| | | | | Web: emotional + semantic labels | 700K pairs | | |
Aspect | Artistic Example (Image) | Musical Example (Music) |
---|---|---|
Theme | Architecture | Baroque period architecture-inspired themes |
Style | Early Renaissance, Perspective | Baroque instrumental style |
Genre | Art, Architecture | Classical Music, Instrumental |
Atmosphere | Historical, Serene, Spiritual | Serene, Majestic, Historical undertones |
Instruments/Visual Elements | Perspective lines, Linear perspective, Church interior | Pipe organ, Strings, Spacious acoustic reverb |
| Images | Human-Defined Tags | Generated Caption | Caption-Based Semantic Tags | NLP-Based Sentiment Tags |
|---|---|---|---|---|
| (not shown) | tulips, flowers, march, garden, red, bouquet, flower, bloom, beauty, tulip, spring, flora, nature | a bouquet of colorful tulips in a store | bloom, beauty, tulips, spring, flora | tulips: 0.0, flowers: 0.0, march: 0.0, garden: 0.0, red: 0.0, bouquet: 0.0, flower: 0.0, bloom: 0.0, beauty: 0.5859, tulip: 0.0, spring: 0.0, flora: 0.0, nature: 0.0 |
| (not shown) | happiness, optimist, graffiti, smile, happy eyes, playful, youth, dream, green, young woman, charming, tree, smiling | a young girl smiles while posing in front of a flowering tree | young woman, smile, tree | happiness: 0.5574, optimist: 0.5267, graffiti: 0.0, smile: 0.3612, happy eyes: 0.5719, playful: 0.4404, youth: 0.0, dream: 0.25, green: 0.0, young woman: 0.0, charming: 0.5859, tree: 0.0, smiling: 0.4588 |
Web-Music | Human-Defined Tags | Tags-Based Generated Caption (OpenAssistant) | Music-Based Captions (LP-MusicCaps) | NLP-Based Sentiment Tags |
---|---|---|---|---|
Music-31 | 1. Mood-based tags: ‘relax’, ‘relaxing’, ‘relaxation’, ‘ambient’, ‘piano’, ‘chilling’, ‘nature’, ‘forest’, ‘woods’, ‘planet’, ‘sleeping’, ‘pray’, ‘relaxing mood’, ‘chill’, ‘peaceful’, ‘dreamy’, ‘bright’, ‘relaxing’, ‘hopeful’, ‘laid back’ 2. Genre ‘ambient’, ‘beats’, ‘electronic’, ‘beautiful plays’ 3. Movement ‘floating’, ‘smooth’, ‘medium’, ‘elegant’ 4. Theme: ‘music for videos’, ‘music for YouTube videos’, ‘background music’ | 1. Tags-based: ‘A peaceful and dreamy piano melody fills the air, surrounded by the calming sounds of nature and chirping birds. The atmosphere is filled with a sense of hope and relaxation, as the listener finds solace in the sound of the piano’. 2. Genre-Movement- Theme based: ‘A musical journey of floating elegance, smooth movement and mesmerizing music for Youtube videos’. 3. All-tags: ‘A beautiful piano plays in the background of a tranquil forest, with peaceful, dreamy synths and ambient pads. The music adds a relaxing and hopeful mood to the video, perfect for promoting positive vibes and encouraging viewers to stay grounded and optimistic’. | [10:00–20:00] This audio contains someone playing a melody on a theremin. A deep male voice is singing in the mid to higher range. This song may be playing in a church. . . . [180:00–190:00] The low quality recording features a shimmering cymbal in the left channel, alongside a suspenseful didgeridoo melody, followed by tinny wooden percussion and a violin melody in the right channel of the stereo image. The recording is noisy and in mono. | relax: 0.4404, relaxing: 0.4939, relaxation: 0.5267, ambient: 0.0, piano: 0.0, chilling: 0.0258, nature: 0.0, forest: 0.0, woods: 0.0, planet: 0.0, sleeping: 0.0, pray: 0.3182, chill: 0.0, peaceful: 0.4939, dreamy: 0.0, bright: 0.4404, hopeful: 0.5106, laid back: 0.0 |
MuIm segment examples: the original table presents sample image–music pairs for the artistic and realistic segments (the paired image thumbnails and audio clips are not reproduced here).
| MuIm Segment | Human Annotations | AI-Model Annotations | Images | Music | Pairs |
|---|---|---|---|---|---|
| Artistic | 28D emotions | 1 caption | 2319 | 1841 | 50K |
| Web-based | Affective tags, semantic tags | Tag-caption, media-caption, tag-media caption | 540K | 25K | 700K |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).