Context-Aware Recommender Systems in the Music Domain: A Systematic Literature Review

Abstract: The design of recommendation algorithms aware of the user’s context has been the subject of great interest in the scientific community, especially in the music domain where contextual factors have a significant impact on the recommendations. In this type of system, the user’s contextual information can come from different sources such as the specific time of day, the user’s physical activity, and geolocation, among many others. This context information is generally obtained by electronic devices used by the user to listen to music such as smartphones and other secondary devices such as wearables and Internet of Things (IoT) devices. The objective of this paper is to present a systematic literature review to analyze recent work to date in the field of context-aware recommender systems and specifically in the domain of music recommendation. This paper aims to analyze and classify the type of contextual information, the electronic devices used to collect it, the main outstanding challenges and the possible opportunities for future research directions.


Introduction
In the area of data mining and big data technologies, recommender systems (RS) [1] are a widely used collection of algorithms that aim to link the items offered by a service or business with its consumers through personalization techniques. These systems have attracted considerable interest from both the scientific literature and large technology companies over the last few years. For some companies, such as Amazon and Netflix, they were a central part of the business model from the early days.
The main objective of recommender systems is to offer users the most relevant items based on their profile and history, their explicit preferences (such as ratings or opinions of other items), their implicit preferences (derived from their interactions with other items in the system) and even their relationships with other customers.
Furthermore, the rise of mobile technologies and ubiquitous computing employing Internet of Things (IoT) devices has allowed these systems to include information about the user's context when making their personalized recommendations. Such systems are called context-aware recommender systems (CARS) [2]; they seek to improve the relevance of the recommendations provided to the user by taking advantage of contextual information such as the user's location, time of day, physical activity, emotional state and other aspects of their environment. These contextual characteristics may influence the rating that the user would give an item depending on the specific situation in which the user is expecting the recommendation. This effect has been extensively studied and confirmed by the scientific community [3] and by companies such as Netflix [4].
One of the application domains that has attracted the interest of researchers and large companies is the music domain. This interest has been motivated by the rise of online music shops and music streaming services that offer extensive music catalogues to their customers. Systems in this domain are known as music recommender systems, and companies such as Pandora, Last.fm and Spotify (among others), together with the research community, are looking for new ways to include context to improve the recommendations they provide.
The problem of music recommendation [16] has a special set of characteristics when it is compared to other domains such as, for instance, travel, books or films. There is a significant difference in the time needed to consume the item in each domain: days or weeks in the case of a book, hours or less in the case of a film, and a few minutes in the case of a single song (although this could increase if it is a music playlist). As a consequence, the time it takes for the user to shape an opinion about the item being consumed is much shorter than in the case of books, and the likelihood of discarding the item (skipping the song) is much higher than in other domains.
Music is also characterized by the fact that an item can be consumed repeatedly, even on the same day, which is less frequent in domains such as cinema or books. As a result, users not only tolerate recommendations of items they have already consumed but also appreciate them and see them as a positive factor, something that is highly unlikely in domains such as films or travel. Users are also more tolerant of a bad recommendation, owing to the short time it takes to consume the item and the possibility of skipping it if necessary. In addition, music can be recommended at different levels of abstraction: a single song, a playlist, an album or an artist.
From the point of view of the ratings that can be found, this is a domain in which explicit ratings are uncommon and, if they are available, they tend to be sparse due to the magnitude of the existing song catalogues. Therefore, it is a domain where implicit information is often used and where content-based approaches have often been used in the literature. The latter approaches are linked to another field called music information retrieval (MIR) and are still very popular today, extracting semantic information from songs based on audio features, artists, lyrics, cover art, etc.
From the point of view of a user who consumes music on a day-to-day basis, the content-based approach may be decisive. However, several studies in the field of music psychology [17] show how short-term preferences are influenced by the user's emotional, physical, or social context. This context can be a determining factor, as certain types of music can be linked to different emotional states while others can be tied to certain locations, user activities, moments of the day, etc. For this reason, incorporating contextual information into recommender systems and exploiting it effectively has become one of the challenges in this field.
However, before tackling how new contextual recommenders in the domain of music can be effectively designed, it is necessary to understand and explore existing research lines and solutions regarding this domain. The analysis and study of case studies from the literature and the identification of the most relevant contextual information is vital for the creation of new systems that build on the strengths and eliminate the weaknesses of existing work. In order to achieve this, a systematic literature review about contextual recommender systems in the domain of music has been carried out.
The paper is structured as follows. Section 2 describes the methodology chosen for the systematic literature review and details each of the steps and protocols established for conducting it. Section 3 shows the selection process of the research articles following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) approach. Section 4 presents the results obtained from the analysis of the selected publications and the answers to the research questions initially posed. Section 5 discusses these results and, finally, Section 6 presents the conclusions drawn from the study and future lines of research.
Electronics 2021, 10, 1555

Methodology and Review Planning
We have developed this systematic literature review following the guidelines of Kitchenham and Charters [18]. Prior to this review, a preliminary search was carried out to verify that no state-of-the-art review of context-aware recommender systems in the music domain had been published recently. This search was carried out in different scientific databases (IEEE Xplore, Springer, Web of Science (WoS) and Scopus) using search terms such as "slr", "state of the art" and "survey" together with terms related to the object of this review such as "context-aware", "music", "recommender systems", etc. The results of these searches confirmed that there was no review comparable to the one presented in this work, which justifies its development.
The following describes the review process and its foundations in order to achieve the maximum degree of reproducibility. The protocol followed and the methodological steps of the review are defined below.

Research Questions
We defined the following research questions (RQ) that would allow us to understand the current state of the art of CARS in the domain of music recommendation. These research questions are listed below:

Inclusion and Exclusion Criteria
Once the scope of the systematic review has been established, following the methodology of [18], it is necessary to indicate the inclusion (IC) and exclusion (EC) criteria in order to select only those papers that are relevant to answer the research questions posed above. The inclusion criteria established are as follows:
• IC 1: The paper covers a recommender system.
• IC 2: The recommender system is applied in the music domain.
• IC 3:
Most of the exclusion criteria are related to the inclusion criteria previously shown:
• EC 1: The paper is not concerned with a recommender system.
• EC 2: The recommender system is applied to a domain other than music.
• EC 3: The recommender system does not use context information from the user, even though it is a music recommender system.
• EC 4: The work is not written in English.
• EC 5: The work is not published in peer-reviewed conferences, books or articles.
• EC 6: The work was published prior to 2010.

Search Strategy
Upon establishing the inclusion and exclusion criteria, it is necessary to find the most relevant databases in which to search for research papers that meet the established criteria. At the same time, appropriate selection of the concepts is required in order to subsequently perform the queries in these databases.
The following databases have been selected for this work: IEEE Xplore, Web of Science, SpringerLink and Scopus. The main reasons for this decision are that these databases cover the research scope of this work and that they support search queries with a similar syntax. The search concepts used were as follows:

• The term "context" or "aware" to search for context-aware recommendation algorithms.
• The term "recommender", "recsys", to search for papers presenting recommendation algorithms.
• Terms such as "song", "music", "playlist", "album" and others related to the music recommendation domain.
• Other terms such as "social", "location", "time", "emotion", "activity" that refer to the context of the user.
In the final search, some terms such as "recsys" were omitted because wildcard searches such as "recommend*" are flexible enough and some search engines limit the number of wildcard terms allowed in a query.

Query Strings
The query strings for the search in each of the databases have been elaborated using the terms derived from the PICOC method [20] and with the help of the Parsifal software [21]. Parsifal is a tool designed to help researchers plan and conduct systematic reviews of the scientific literature. It allows the collaboration of several researchers in the review and filtering of articles in a distributed way.
The terms are connected using logical operators such as AND/OR/NEAR and special characters such as * (wildcard). The queries performed in each database are presented below (Table 1).
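As an illustration of how such query strings are composed, the following minimal Python sketch joins alternative terms with OR inside each concept group and then joins the groups with AND. The helper names `or_group` and `build_query` are hypothetical, the grouping of terms is only an example based on the search concepts listed in the Search Strategy section, and database-specific syntax (field tags, NEAR operators) is deliberately left out. It is not a reproduction of the queries actually submitted to each database.

```python
def or_group(terms):
    """Join alternative terms with OR and wrap the group in parentheses.

    Wildcard terms (ending in '*') are left unquoted; other terms are quoted.
    """
    rendered = [t if t.endswith("*") else f'"{t}"' for t in terms]
    return "(" + " OR ".join(rendered) + ")"


def build_query(groups):
    """Join several OR-groups with AND, one group per search concept."""
    return " AND ".join(or_group(group) for group in groups)


# Illustrative concept groups (an assumption, not the published query).
concept_groups = [
    ["context", "aware"],
    ["recommend*"],
    ["song", "music", "playlist", "album"],
]

query = build_query(concept_groups)
print(query)
```

Keeping one group per concept makes it straightforward to drop or extend a group when a search engine limits the number of wildcard terms.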

Quality Criteria
Although the inclusion and exclusion criteria are a good tool for the selection of papers, they do not take into account the overall quality of the papers obtained after querying the selected databases. For this reason, it is necessary to establish quality criteria for the articles before deciding whether to add them to the study. Using the Parsifal tool, three different ratings have been established to score each of the criteria:
• Score 1: The article meets the established quality criterion.
• Score 0.5: The article partially meets the quality criterion.
• Score 0: The article does not meet the quality criterion.
The quality criteria are as follows:
1. The purpose of the work is the development of a music recommender system.
2. The paper applies the recommendation algorithm to datasets and presents a case study.
3. The work includes the user's contextual information when making recommendations.
4. The paper describes how contextual user information is obtained and the technology used.
5. The paper presents the evaluation metrics used for the proposed algorithm.
6. The work uses publicly available datasets.
After applying the quality criteria to the previously filtered papers, the papers that may be taken into account for the systematic review are selected. This allows the selection of those papers that can best help answer the research questions initially formulated.
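The scoring scheme above can be sketched in a few lines of Python. This is only an illustration under two assumptions that the text does not state explicitly: that the overall score of a paper is the sum of its six per-criterion ratings, and that a paper is kept when this sum is greater than or equal to the threshold (the Review Process section reports a minimum threshold of 5.0). The function names are hypothetical.

```python
from typing import Sequence

ALLOWED_SCORES = (0.0, 0.5, 1.0)  # the three ratings defined in the text
NUM_CRITERIA = 6                  # the six quality criteria listed above


def quality_score(scores: Sequence[float]) -> float:
    """Return the total quality score (assumed to be the sum of the ratings)."""
    if len(scores) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} ratings, got {len(scores)}")
    for s in scores:
        if s not in ALLOWED_SCORES:
            raise ValueError(f"invalid rating: {s}")
    return float(sum(scores))


def passes(scores: Sequence[float], threshold: float) -> bool:
    """Assumed rule: keep the paper when its total reaches the threshold."""
    return quality_score(scores) >= threshold


# Hypothetical paper that only partially meets criteria 4 and 6.
example = [1, 1, 1, 0.5, 1, 0.5]
print(quality_score(example), passes(example, threshold=5.0))
```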

Review Process
This section describes each of the steps followed in the process of extracting and reviewing papers from the literature using the process described by the PRISMA flow diagram shown in Figure 1.

1. Identification phase: querying the selected databases returned 783 papers: 57 (7.28%) from Web of Science, 167 (21.33%) from Scopus, 315 (40.23%) from IEEE Xplore and 244 (31.16%) from SpringerLink. After removing 139 duplicates (17.75%), 644 unique records remained (82.25%). All these results are recorded in a Google Sheets document [49] and the BibTeX files are included in a git repository [50].
2. Screening phase: we imported the BibTeX files into the Parsifal tool in order to select the papers that meet the inclusion criteria after reading the title and abstract. In this step we accepted a total of 181 papers (23.12%) and rejected 463 (59.13%). After this step we continued the systematic review directly in the Google Sheets document mentioned previously.
3. Eligibility phase: the next step involves analyzing the articles according to the quality criteria. We have set the minimum threshold score for this point at 5.0. In this phase we included eight new papers discovered during the review. After reviewing the quality criteria, we retained a total of 100 papers out of 189.
4. Final inclusion phase: 100 papers were selected after reading the full text (12.77% of the total papers considered, 52.91% of the papers read). The analysis performed to reach this number of papers was based only on their content, without regard to bibliometric measurements (journal source, number of citations) or other aspects.
After reviewing and analyzing the articles finally selected, it is possible to begin to answer the research questions initially posed.
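The counts reported for the review process can be cross-checked with simple arithmetic. The short Python sketch below recomputes the totals and percentages from the figures given in the text; the variable names are ours, and percentages are taken relative to the 783 records initially identified unless stated otherwise.

```python
# Records returned by each database, as reported in the identification phase.
records = {
    "Web of Science": 57,
    "Scopus": 167,
    "IEEE Xplore": 315,
    "SpringerLink": 244,
}

total = sum(records.values())
duplicates = 139
unique = total - duplicates

accepted, rejected = 181, 463   # screening phase
added_during_review = 8         # papers added in the eligibility phase
assessed = accepted + added_during_review
included = 100                  # final inclusion


def pct(part: int, whole: int) -> float:
    """Percentage of `part` over `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


print(total, unique, assessed)
print({name: pct(n, total) for name, n in records.items()})
print(pct(included, total), pct(included, assessed))
```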

What Are the Kinds of Contextual Factors More Commonly Employed in the Music Recommendation Domain?
In previous reviews of the state of the art such as [16,51,52], or in articles such as [53] oriented to CARS in a non-specific domain, different categories of context are identified at a more general level: the individual context of the user, time, the activity carried out by the user and the relational context. In the chapter of the book [16] dedicated to music recommender systems, the context in the music domain is classified into two major groups: environment-related context and user-related context. In this paper we have taken these two main categories and added further sub-categories of context in this domain, as follows.
Environment-related context: this group covers the context of the users' physical environment and the types of devices they use. This category includes the following sub-types of context:

• Physical context: factors related to the user's physical environment at the time the recommendation is made or when the item is consumed. This category includes factors such as the following:
  - Location: certain places are associated with a particular type of music and it is possible to take advantage of this contextual information for recommendation.
  - Weather: the weather conditions of the day (cloudy, rainy, sunny, etc.) can influence the user's preference for different types of music, differing greatly from a sunny day in summer to a snowy day in winter.
  - Time: the time slot (morning, afternoon, evening), the time of day and the type of day of the week (work or holiday) are very influential factors in the domain of music recommendation because music preference may vary between leisure and work periods.
  - Other environmental factors: other external factors such as noise level, current traffic or ambient light may have an influence on the music preference for a specific situation.
• Interaction media context: these types of element can be treated as a type of context and are in many cases a proxy to identify a certain context of user activity. The following contexts of interaction can be identified:
  - Mobile devices: the use of a mobile device such as a smartphone may imply a specific interaction context that influences the type of music the user prefers, since listening on a mobile device or on a desktop device may be related to different user activity contexts.
  - Desktop devices: activity performed from a desktop device may involve a specific user context linked to a music preference.
  - Vehicles: it is possible to extract context information when the user interacts with an in-vehicle device while driving, such as in vehicles with Android Auto or other software.
  - Wearables: smartwatches, headphones and other wearable devices can indicate a specific interaction context by providing contextual information about the user, such as their physical activity.
  - Virtual assistants and conversational agents: the use of this type of interaction context is currently growing and attracting the attention of large companies for recommendations through conversational assistants that capture the user's context through dialogue. It is possible to extract emotions implicitly from the user's voice or explicitly through questions to the user.
  - Session context: in several works, the authors refer to session context as relating to songs that are often found together in the same playlist. In this case, the context is that of the item during playback.
User-related context: all the factors most closely linked to the user, such as their current activity, their background in terms of knowledge or their social context (reflected in social networks or other sources of information). We classify this context into the following types:
• Social context: the presence of or relationship with other people that may influence the user's musical tastes. This information is usually extracted from social networks or from systems where it is possible to identify the user together with a group of users and provide recommendations based on their environment (this could be related to another broad field such as recommendation systems for groups [54]). Within social networks the following factors can be taken into account:
  - People with whom the user meets at a given time.
  - People with whom the user has connected or interacted.
  - Roles of people in the user's environment and relationships of trust or reputation.
• Modal context: related to the state of mind of the user, the user's goals, mood, experience, and cognitive capabilities. We could highlight:
  - Emotional state of the user: musical tastes can change a lot depending on the emotional state of the user. Many studies focus on detecting this emotional context either directly or indirectly through the user's facial expressions or the comments they have recently posted on their social networks.
  - User experience or skills: different music recommendations based on the user's experience of a certain genre or type of music.
  - Cultural background: this type of information is about the context of the user's origin.
• User activity: this contextual factor refers to the action or task performed by the user while listening to music. We can distinguish between:
  - Task or physical activity performed among a possible set of actions: resting, swimming, running, cooking, studying, etc.
  - Physiological state of the user: this physical state can be defined by biometric variables obtained by sensors, such as heart rate, oxygen saturation, brain wave activity, etc. This may be more related to identifying a type of music suitable for a particular physiological state.
Furthermore, based on the knowledge that the recommender system has about the context [53], contextual factors can be classified as:
• Fully observable: contextual factors are explicitly known. For example, the time when a song was played, the day of the week and whether it was a bank holiday or not.
• Partially observable: only some context information is known to the recommender system, but it is not complete.
• Unobservable: there is no explicit contextual information available, but the recommender system could infer it implicitly using latent knowledge.
Moreover, given how contextual factors can change over time [53], it is possible to distinguish between two types of contextual factors:
• Static: contextual factors are stable over time (they do not change too often). In the case of music, for example, a context of study activity is very often related to a type of music and this type remains stable over time regarding this contextual factor.
• Dynamic: contextual factors change over time in some way. In the music domain this could be related to the user's changing preferences over time in the same context.
Regarding how this contextual information can be acquired by the recommender system, we can distinguish between different types:
• Explicit: when users explicitly state in the recommender system their contextual information of any kind: mood, location, etc. This can be done through different interfaces such as a text box or a question in a dialogue.
• Implicit: when the recommender system observes the user's activity and context directly, such as their geo-positioning via the Global Positioning System (GPS) or another sensor on their smartphone, or the time at which a certain playlist or song was played.
• Inferred: this approach is widely used as contextual information is often difficult to obtain. By using data mining or machine learning techniques, it is possible to use the user's interaction with the system or their song playing patterns to predict their contextual state.
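The classification dimensions described in this section (context group and sub-type, observability, stability over time and acquisition mode) can be captured in a small data structure, which is convenient when tagging reviewed works. The following Python sketch is one possible encoding; the class and field names are hypothetical, and the three example factors and their labels are illustrative assumptions rather than classifications taken from the reviewed papers.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Acquisition(Enum):
    EXPLICIT = "explicit"
    IMPLICIT = "implicit"
    INFERRED = "inferred"


class Dynamics(Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"


class Observability(Enum):
    FULL = "fully observable"
    PARTIAL = "partially observable"
    NONE = "unobservable"


@dataclass(frozen=True)
class ContextFactor:
    name: str
    group: str      # "environment" or "user"
    subtype: str    # e.g. "physical", "interaction media", "modal"
    acquisition: Acquisition
    dynamics: Dynamics
    observability: Observability


# Illustrative labels only (assumptions made for this example).
factors = [
    ContextFactor("location", "environment", "physical",
                  Acquisition.IMPLICIT, Dynamics.DYNAMIC, Observability.FULL),
    ContextFactor("mood", "user", "modal",
                  Acquisition.EXPLICIT, Dynamics.DYNAMIC, Observability.PARTIAL),
    ContextFactor("session", "environment", "interaction media",
                  Acquisition.INFERRED, Dynamics.STATIC, Observability.NONE),
]

# Count factors per acquisition mode, as one would when summarizing a review.
by_acquisition = Counter(f.acquisition for f in factors)
print({mode.value: n for mode, n in by_acquisition.items()})
```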
In the papers reviewed in this study, we have found a large number of works that acquire contextual information implicitly, as shown in Figure 2. In the category "inferred" we have included those works stating that contextual factors such as emotions were obtained from implicit contextual data, as well as those that used other contextual information to obtain, for example, session contexts. Table 2 shows the classification of the articles included in this work and the different contexts according to the above classification. Figure 3 shows several charts illustrating the number of CARS works in the music domain covered in this study and the different types of contexts they use. It is worth pointing out that several of the above contexts can be used at the same time in the same work.
Table 2 (excerpt).
Context type | References
Modal context | […, 64, 67, 69, 73-76, 78, 79, 83, 86-88, 92, 93, 98, 99, 104, 111, 116, 119, 120]
User activity | [39, 42, 57, 58, 61, 63-65, 69, 70, 73-75, 78, 84, 105]
Figure 3a shows the number of papers using the different physical contexts in recommender systems. The contexts of time and location are those most used in the works covered. Time is a type of context that can be acquired in a relatively simple way, since it represents the time at which the song was listened to, and from it other context attributes can be derived, such as the time of day, the day of the week, or whether it is a weekend or a public holiday. Geographical location has also been widely used in the reviewed works, highlighting the importance of the effect that this context has on users' music preferences. Similarly, weather and other environmental factors have been used for music recommendations, but to a lesser extent than the previous contexts.
Figure 3b shows the interaction media contexts that have been used in the reviewed papers. Note the large number of works in the session context category, which is linked to the temporal context, as it exploits this information. Furthermore, there is a large amount of research that uses mobile devices to obtain context, as many physical contexts are obtained through the sensors of these devices.
Likewise, mention should be made of the high number of works related to context in vehicles, which is due to the large number of works using the InCarMusic [64] dataset, which will be referred to in later sections. Finally, recent works mention new forms of interaction such as conversational agents (virtual voice assistants or conversational text bots), reflecting the attention that this type of context is currently attracting.
Figure 3c shows the works using the user-related context. The modal context, and within it especially the emotional state of the user, appears to be the factor most exploited in the works reviewed. Mood or emotional state is mostly obtained through sensors or photographs of facial expressions, or inferred from social media elements such as recently posted texts or photographs. Furthermore, there is a large body of work that exploits the user's social context and friendship relationships in order to find more relevant recommendations. Many of the studies analyzed exploit social networks not only to find relationships but also to obtain latent context by using social tagging information from songs, posts and photographs published by users on these networks.
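As noted in the discussion of Figure 3a, several temporal context attributes can be derived from the single timestamp at which a song was played. The following Python sketch illustrates this derivation. The function names, the slot boundaries (including the extra "night" slot) and the optional set of public holidays are assumptions made for the example; the reviewed works may define these attributes differently.

```python
from datetime import date, datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def time_slot(hour: int) -> str:
    """Map an hour (0-23) to a coarse time slot; boundaries are assumed."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"


def temporal_context(played_at: datetime,
                     holidays: frozenset = frozenset()) -> dict:
    """Derive temporal context attributes from a playback timestamp."""
    return {
        "time_slot": time_slot(played_at.hour),
        "day_of_week": DAY_NAMES[played_at.weekday()],
        "is_weekend": played_at.weekday() >= 5,
        "is_holiday": played_at.date() in holidays,
    }


# Hypothetical listening event on a Saturday morning.
ctx = temporal_context(datetime(2021, 6, 26, 9, 30))
print(ctx)
```

Because every listening event carries a timestamp, these attributes are available without any additional sensor, which is consistent with time being one of the most frequently used contexts.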

What Are the Devices or Technologies Employed for Extracting This Contextual Information?
Once the different types of context have been identified, it is necessary to know which technologies, devices or sensors have been used in the scientific literature to extract these contextual factors that are later incorporated into recommender systems.
To classify the different sensors that can be used for this task, we have used the taxonomy proposed in [121,122]. Sensors can be classified into the following categories:
• Physical or hardware sensors: sensors that provide a raw measurement of the environment. Within this category we can find sensors, devices and technologies such as:
  - GPS: employed to extract information about the geographic position of the user.
  - Accelerometer and gyroscope: used to detect the user's movement in order to later be able to infer the user's physical activity.
  - Wi-Fi, Bluetooth: used to extract information about the user's position or the presence of nearby devices.
  - Camera: used to infer the user's state of mind by recognizing the user's facial expressions or to detect the geographical position of the user based on images of the environment.
  - Microphone: used for the recognition of ambient noise or to obtain information from the user in a spoken conversation.
  - Biosensors: such as electroencephalography (EEG), an electrophysiological monitoring method that records electrical activity on the scalp and is used in some works to detect the state of the user in order to recommend a certain type of music. Other sensors such as heart rate monitors have been used for this purpose.
• Virtual or software sensors: these provide measurements of the user's context at a higher level of abstraction, combining measurements from different sensors. Examples include the use of external services and their APIs (for instance, geolocation based on an IP address) or information fusion techniques that combine different physical sensors to obtain a more precise measurement of a contextual factor.
• Social sensors: this category refers to all the information that can be extracted from the content posted on social networks and from users' interactions on these networks. For example, the metadata in a given photo, the detection of emotions based on the recognition of expressions on people's faces in photos or the sentiment analysis of the text published in recent posts.
• Human sensors: this category is related to how contextual information is explicitly elicited from the user, such as the textual description of a playlist or the explicit statement of a mood or of the task being performed at that moment.
In the music domain, many of these devices and technologies have been used to obtain valuable contextual information. Table 3 shows the selected works and the types of sensors they employ according to the above categorization. Figure 4a shows the number of papers in each of these categories. It is worth highlighting the large presence of works within the "Social Sensor" category, since much of the contextual information (of different types) is extracted from social networks. The category with the second-largest number of works is "Virtual Sensor", since much of the implicit information derives from this type of sensor. Likewise, the human sensor category, which refers to works in which information such as the emotional state is explicitly requested from the user, has a large presence. Finally, physical sensors is the category with the fewest works, but the number is still relevant.
Figure 4b shows the physical sensors in the latter category, including a general category that we have called "smartphone" for those works that refer only to the smartphone and not to the sensors it includes. Among the sensors shown, it is possible to observe works in which explicit reference is made to signals from the accelerometer, gyroscope, ambient light, Wi-Fi signal, magnetic field and others linked to biosignals such as ECG and EEG.

How Is Context Information Exploited along the Recommendation Process and Which Algorithms Are Employed?
According to Adomavicius et al. [126], the different paradigms through which contextual information can be incorporated into a CARS along the recommendation process are:
• 2D methods: these methods are used by the vast majority of recommender systems. They work on a two-dimensional space; therefore, the rating function corresponds to Equation (1):
$R: \mathit{User} \times \mathit{Item} \rightarrow \mathit{Rating}$ (1)
However, contextual attributes can also be taken into account at different points in the recommendation process, giving rise to the following paradigms:
• Contextual prefiltering: in this paradigm, contextual attributes are used to filter the data before applying the traditional algorithms used in recommender systems. The main advantage of this approach is that all classical recommendation algorithms can be applied; however, if the initial filtering reduces the available data too much, the model may not have enough information to generate relevant recommendations. To mitigate such scenarios, it is possible to use generalization techniques that produce less specific contexts and group contextual attributes into hierarchies, or to use latent factor models and dimensionality reduction models. This paradigm also includes item splitting, user splitting and user-item splitting, which split the profile of an item or a user when its ratings differ strongly depending on the context, creating new item or user entities linked to a given context.
• Contextual postfiltering: in this paradigm, contextual attributes are ignored in the initial part of the process and considered only in the last phase; the contextual filtering is therefore applied to recommendations obtained with classical recommendation methods. The recommendations obtained are contextualized in one of the following ways:
  • Filtering or selection: recommendations that are irrelevant to a given context are discarded.
  • Ranking adjustment: the ranking is modified according to a given context.
Furthermore, these techniques can be classified into different approaches:
  • Heuristic: seeks the common characteristics of the items of a specific user in the given context.
  • Model-based: predictive models are built to calculate the probability with which the user chooses a certain type of item in a given context.
As with the prefiltering approach, this type of approach allows the use of traditional recommendation techniques.
• Contextual modeling: in this approach, contextual information is added directly into the recommendation algorithm, so that the working space is multidimensional (Equation (2)):
$R: \mathit{User} \times \mathit{Item} \times \mathit{Context} \rightarrow \mathit{Rating}$ (2)
The context can be made up of multiple dimensions, which are incorporated into the recommendation model and act as predictors of the user's rating of the item. Two types of approach can be found:
  • Model-based: the contextual dimensions are added directly to the recommendation space, and a variety of machine learning techniques for classification and regression can be employed, such as decision trees, support vector machines (SVM), probabilistic models, etc. It is also possible to extend collaborative filtering based on matrix factorization with prominent approaches such as tensor factorization (TF), factorization machines (FM) and context-aware matrix factorization (CAMF).
  • Heuristic: such approaches employ an extension of k-nearest neighbor (kNN) techniques.
It is worth mentioning that in the literature it is possible to find works that combine these paradigms or carry out contextual modelling following their own type of approach. In this paper we have included many of these papers in the category of contextual modelling as many of them have characteristics of this paradigm.
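The difference between contextual prefiltering and postfiltering can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed works: the toy rating log, the function names and the mean-rating recommender that stands in for a classical 2D method are all assumptions made for illustration. The prefiltering function also falls back to the full data when the context leaves too few ratings, which is the crudest form of the generalization described above.

```python
from collections import defaultdict

# Hypothetical toy log of (user, item, context, rating); all values are invented.
RATINGS = [
    ("u1", "rock_song", "running", 5),
    ("u1", "jazz_song", "working", 4),
    ("u2", "rock_song", "running", 4),
    ("u2", "pop_song", "running", 5),
    ("u2", "jazz_song", "working", 5),
    ("u3", "pop_song", "working", 2),
    ("u3", "ambient_song", "working", 5),
]


def items_seen_by(ratings, user):
    """Items the user has already rated in any context."""
    return {item for u, item, _, _ in ratings if u == user}


def rank_unseen(ratings, seen, k):
    """Stand-in for a classical 2D recommender: rank unseen items by mean rating."""
    by_item = defaultdict(list)
    for _, item, _, rating in ratings:
        if item not in seen:
            by_item[item].append(rating)
    mean = {item: sum(rs) / len(rs) for item, rs in by_item.items()}
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(mean, key=lambda item: (-mean[item], item))[:k]


def prefilter_recommend(ratings, user, context, k=2, min_rows=3):
    """Prefiltering: keep only ratings given in `context`, then run the 2D method.

    If fewer than `min_rows` ratings survive, generalize to "any context".
    """
    subset = [row for row in ratings if row[2] == context]
    if len(subset) < min_rows:
        subset = ratings
    return rank_unseen(subset, items_seen_by(ratings, user), k)


def postfilter_recommend(ratings, user, context, k=2):
    """Postfiltering (filtering variant): run the 2D method on all data,
    then discard candidates never consumed in `context`."""
    candidates = rank_unseen(ratings, items_seen_by(ratings, user), k=len(ratings))
    allowed = {item for _, item, ctx, _ in ratings if ctx == context}
    return [item for item in candidates if item in allowed][:k]


print(prefilter_recommend(RATINGS, "u1", "running"))
print(postfilter_recommend(RATINGS, "u1", "working"))
```

In both functions the underlying recommender is unaware of the context, which is why either paradigm can reuse traditional algorithms. Contextual modeling, in contrast, would pass the context to the model itself as an additional input.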
Table 4 shows the papers analyzed in this study according to their approach. Figure 5a shows the percentage of the analyzed papers that follow each of the paradigms described above. The contextual modelling approach is the most widely used, owing to the multitude of works that directly include context as a predictor of the songs to be recommended. It is followed by the pre-filtering approach, which is widely used because it allows traditional recommendation methods to be applied and includes the context at an earlier stage. The post-filtering approach is the least used in the works reviewed.
Figure 5b shows the large number of different algorithms used in the reviewed works. As mentioned above, context-aware approaches based on matrix factorization, such as factorization machines and matrix factorization, stand out in this graph. Many works also employ user splitting and item splitting techniques together with classical collaborative filtering in the prefiltering approach, and several works introduce new approaches based on reinforcement learning. Overall, the graph shows a great variety of algorithms in this domain.

What Evaluation Metrics and Methods Have Been Employed to Validate the Effectiveness of These Music Recommender Systems (RS)?
When assessing the performance of a CARS, the evaluation criteria have previously been categorized in the literature [51,138] into three main groups:

• User studies: this type of study is conducted on a set of test subjects who are asked to interact with the recommender system. While the subjects perform the tasks, their actions are observed and quantitative data about their interaction are collected. In addition, satisfaction surveys are conducted before and after the experiment to qualitatively measure satisfaction with the system and to check whether the recommendations were relevant.
• Offline evaluation: this type of evaluation is carried out when a dataset is available to design the system. The system uses this dataset to predict user ratings on certain items, so its performance is measured in terms of its ability to correctly predict the ratings of users on the items in the dataset. Within this group, the performance evaluation metrics [139] are divided into the following categories:
  • Rating prediction metrics: these measure the accuracy of recommendations in terms of their error. This type of metric has been broadly used in traditional recommender systems. Among them we can highlight:
    • Mean Absolute Error (MAE): the mean absolute error between prediction and rating.
    • Root Mean Squared Error (RMSE): the square root of the mean squared error between prediction and rating; compared with MAE, it penalizes large errors more heavily.
    A lower value for these metrics indicates a higher predictive power of the models used.
  • Usage prediction metrics: this type of metric is based on ratios between recommended and consumed items. Given a dataset with the items that the user has consumed, a part is hidden and the system is asked to recommend a number of items. From the results it is possible to count the true positives (recommended and relevant), false positives (recommended and irrelevant), true negatives (not recommended and irrelevant) and false negatives (not recommended and relevant). The following metrics are derived from the relationships between these counts:
    • Precision: proportion of the recommended items that are relevant to the user (true positives) over the total number of items recommended (true positives and false positives). If only a certain number of items are recommended, the metric is reported as Precision at K or Precision@K.
    • Recall (sensitivity or true positive rate): proportion of the relevant (consumed) items that were recommended, that is, true positives over true positives and false negatives. This metric is usually reported together with Precision. As before, it is possible to find the Recall at K or Recall@K metric.
    • Specificity (true negative rate): proportion of the irrelevant items that were correctly not recommended.
    • F-measure (F1): combines Precision and Recall into a single metric so that different recommender systems can be compared.
    • Area under the curve (AUC): the receiver operating characteristic (ROC) curve plots the recall against the fallout (false positive rate), and the area under the curve is an indicator of the overall quality of the recommender system.
    • Hit rate: the number of hits is the number of items in the test set of each user that are also present in the system's recommendation list for that user. The hit rate is the number of hits over the total number of users in the system. Its version at K can be found as Hit@K.
  • Ranking metrics: these metrics measure the recommender system's ability to predict the correct order of items with respect to the user's preferences, known as rank correlation. Therefore, this type of metric is used when the user is presented with an ordered list of items, as it only takes into account the relative order of the items in the list and not their exact ratings.
• Online evaluation: this type of evaluation measures how real users interact with the recommendations of a running system. Among the metrics used are:
  • Click-through rate (CTR): proportion of the recommendations shown that are clicked by the user.
  • Bounce rate: percentage of users who have seen the system's recommendation lists but, instead of consulting those recommendations, choose to exit the recommender system.
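The offline metrics described above can be written down in a few lines. The following Python sketch is only illustrative: the function names are hypothetical, Precision@K divides by K even when fewer than K items are returned, and the hit rate assumes a leave-one-out protocol in which exactly one item is held out per user.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and true ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error; squaring makes large errors weigh more."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Share of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def hit_rate_at_k(recs_by_user, held_out_by_user, k):
    """Hits over users, assuming one held-out item per user (leave-one-out)."""
    hits = sum(
        1 for user, item in held_out_by_user.items()
        if item in recs_by_user.get(user, [])[:k]
    )
    return hits / len(held_out_by_user)


recommended = ["a", "b", "c", "d"]   # ranked list returned by the system
relevant = {"a", "c", "e"}           # hidden items the user actually consumed
p, r = precision_at_k(recommended, relevant, 4), recall_at_k(recommended, relevant, 4)
print(p, r, f1(p, r))
```

Reporting Precision and Recall together, as the text notes, matters because either one can be raised at the expense of the other simply by changing K.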
When evaluating the performance of a recommender system, other properties need to be taken into account [140] beyond the above metrics of the predictive power of the models [130], in order to know whether the recommendations were satisfactory, useful or effective for the users. Among them, the following deserve special mention:

• Coverage: defined as the proportion of items for which the system is able to generate recommendations. This metric is important in the music domain because much of the music catalogue is often not recommended due to cold start problems and item popularity bias.
• Novelty: measures the ability of a recommender system to recommend new items that the user was not previously aware of. This property is also relevant in the music domain, in order to discover new artists, although it becomes more important in other domains such as cinema, where consumed items are less frequently re-consumed.
• Serendipity: tries to measure recommendations that are both relevant and surprising. There is consensus on the need to increase this property, although the definition of a measure of serendipity is a matter of controversy in the literature.
• Diversity: measures how different the recommended items are from each other, taking into account several aspects such as musical style, artist, lyrics, instrumentation, etc. As with the previous property, different definitions can be found in the literature. The preferred one is to measure the distance between pairs of items according to an established distance measure and then to sum or average these distances. In the specific case of music, this aspect is commonly given importance, since the user's preference may be linked to the variety of styles or artists in the list of recommendations.
• Sequence-aware evaluation measures: when evaluating the recommendation of a music playlist, other properties are needed, such as the transition from one song to another between different styles. Although both songs may be rated very positively by the user, the transition from one style to another, such as from rock to classical, may affect the user's preference. In this case, given a song playing at a specific moment and the presence of other songs that are also good options to be played right after it, it is necessary to take into account the sequence of items [141,142] and a set of metrics (such as internal coherence and diversity) that consider the list as a whole.
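Two of these properties can be computed directly from the recommendation lists. The Python sketch below is a hedged illustration rather than a standard definition: coverage is approximated as the share of the catalogue that appears in at least one recommendation list, diversity follows the averaged pairwise distance described above, and the Jaccard distance over tag sets as well as the toy songs and tags are assumptions chosen for the example.

```python
from itertools import combinations


def catalog_coverage(recs_by_user, catalog):
    """Share of catalogue items that appear in at least one recommendation list."""
    recommended = set().union(*recs_by_user.values())
    return len(recommended & set(catalog)) / len(catalog)


def jaccard_distance(tags_a, tags_b):
    """1 minus the Jaccard similarity of two tag sets (assumed distance measure)."""
    union = tags_a | tags_b
    if not union:
        return 0.0
    return 1 - len(tags_a & tags_b) / len(union)


def intra_list_diversity(items, features, distance=jaccard_distance):
    """Average distance over all pairs of items in one recommendation list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(distance(features[a], features[b]) for a, b in pairs) / len(pairs)


# Hypothetical tags describing style and instrumentation.
FEATURES = {
    "s1": {"rock", "guitar"},
    "s2": {"rock", "piano"},
    "s3": {"classical", "piano"},
}

print(intra_list_diversity(["s1", "s2", "s3"], FEATURES))
print(catalog_coverage({"u1": ["s1", "s2"], "u2": ["s2"]}, ["s1", "s2", "s3", "s4"]))
```

Any other distance, for example one computed on audio embeddings, can be passed through the `distance` parameter without changing the rest of the computation.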
A summary of the evaluation metrics used in the papers analyzed in this study is presented in Table 5 below.
Table 5. Evaluation protocols and metrics employed in the included works.
Figure 6b highlights the wide use of metrics such as Recall and Precision, both in their classic versions and in their top-N versions. It also shows the scarce presence of metrics beyond offline evaluation: only serendipity and coverage are present, and metrics such as diversity are absent. This makes evident the need to work on the use of these alternative metrics, which are especially important in the music domain.

Which Publicly Available Datasets Are Mainly Used?
After reviewing the works included in this study, the main datasets used in CARS applied to the music domain have been identified. Some of the works analyzed use non-publicly available datasets that the authors collected themselves, either from users of applications in their own studies or through the APIs of websites such as Spotify and Last.fm, among others. Table 6 presents the publicly available datasets that are currently used in the music domain, showing their main information, where they can be found and the works in this review that use them.
Figure 7 below shows the datasets used in the works analyzed in this study. The InCarMusic dataset has a strong presence in CARS oriented to the music domain, as it was one of the first datasets to include multiple types of contextual information. Similarly, datasets with information from the company Last.fm [160], including social tags and timestamps, have become reference datasets in this type of context. Moreover, the social context has gained much relevance with the appearance of datasets such as #nowplaying-RS (among many others), built from information published on social networks not directly dedicated to music, from which it is possible to infer moods, social contexts and music consumption events, all of which are suitable elements to exploit in a CARS oriented to the music domain.
It is common among the works reviewed in this study that the authors make the most of the information available in the datasets published by large companies such as Spotify, in some cases by extracting contextual information simply from the name of the music playlists.
Using different sources of contextual information and creating datasets that gather all this contextual information and can serve as a benchmark is one of the challenges in this domain.

What Are the Main Open Lines of Research in This Domain?
The work of Schedl et al. [161] provides an extensive review of the current challenges in the domain of music recommender systems; however, it is also necessary to point out the existing challenges in CARS oriented to the music domain:
• Cold start problem: the cold start problem is one of the most common problems in recommender systems and could be aggravated in CARS by the difficulty of obtaining contextual information or because the latent representation of the context is biased [162]. This problem occurs when a new music item is added to the catalogue or when a user registers and no previous information is available. Linked to this problem is the level of sparsity [163]: the number of ratings is always much lower than the number of possible ratings, particularly on platforms with many users and items, as is the case in the music domain. These problems can be mitigated by using contextual information such as social tags or temporal information linked to the plays, by using information from the content of the songs, or by exploiting implicit information resulting from the user's interaction with the recommender system.
• Popularity bias: the popularity bias or long tail problem [164] is linked to the previous sparsity challenge. The recommender system tends to recommend popular items, while items with which users have no interactions are not recommended. Kowald et al. [165] study this problem extensively in one of the previously mentioned datasets (LFM-1b) [147]. In this work they form three groups of users according to their mainstreaminess score with respect to all other users of the dataset. They conclude that both the users of this dataset who have limited interest in popular items and the users who are interested in unpopular items receive worse recommendations. This type of problem is being addressed in the literature with works such as [110,166,167], but there is still a need to investigate how context can help [68] to mitigate the bias produced by the popularity of songs.
• Dimensionality: using all available contextual information is not the best option, as the number of new contextual variables included in these systems leads to the "curse of dimensionality". It is necessary to balance this dimensionality and to manage the level of importance of each contextual dimension.
• Evaluation of CARS applied to the music domain: there is a need for more contextual datasets that serve as a reference for the community and that aggregate data from different sources such as song content, social tags, friendship relations between network users, social media posts, etc. There is a lot of separate information that could be aggregated to create CARS datasets for the music domain that could become benchmarks, as the MovieLens datasets [168] are in the movie domain.
• Enhancing novelty and diversity: this means suggesting more novel, or less known, items and increasing the diversity within the recommendations. Recommender systems generally recommend popular items, so it is necessary to introduce mechanisms that improve metrics such as diversity or novelty. Few of the studies analyzed in this review include these metrics, and it is important to take them into account in the design of CARS.
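The sparsity and long tail effects mentioned in these challenges can be quantified on any interaction log. The following Python sketch rests on stated assumptions: the toy log is invented, sparsity is computed only over the users and items that appear in the log, and the "head" of the catalogue is arbitrarily taken as the most played 20% of items.

```python
from collections import Counter


def sparsity(interactions):
    """1 minus (observed user-item pairs / possible user-item pairs)."""
    users = {user for user, _ in interactions}
    items = {item for _, item in interactions}
    observed = len(set(interactions))
    return 1 - observed / (len(users) * len(items))


def head_share(interactions, head_fraction=0.2):
    """Share of all interactions that fall on the most popular items."""
    counts = Counter(item for _, item in interactions)
    n_head = max(1, round(len(counts) * head_fraction))
    head_total = sum(count for _, count in counts.most_common(n_head))
    return head_total / len(interactions)


# Hypothetical (user, item) listening events.
LOG = [
    ("u1", "a"), ("u2", "a"), ("u3", "a"), ("u4", "a"),
    ("u1", "b"), ("u2", "c"), ("u3", "d"), ("u4", "e"),
]

print(sparsity(LOG))    # share of user-item pairs with no interaction
print(head_share(LOG))  # concentration of plays on the head of the catalogue
```

A high head share on the training data is one sign that a recommender trained on it may mostly reproduce the popular items, which is the behavior that popularity bias describes.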
Furthermore, some of the open lines of research in this field are proposed below:
• Better emotion-aware recommender systems: the emotional state of the user has a great impact on their musical preferences, and it is essential to be able to correctly identify this state at the same time as the recommendations are made. In this sense, the recent social information of the user can be exploited, such as texts with the latest natural language processing (NLP) techniques, photos with artificial vision techniques or even the tone of voice in interfaces such as voice assistants.
• Situation and intention-aware: accurately inferring the activity that the user is performing by fusing the available information, together with identifying the intention with which the user listens to music, are contextual factors that can be determinant in the relevance of recommendations.
• Voice-driven interaction with recommender systems in virtual assistants: the rise of devices that include a virtual voice assistant has led to the development of conversational recommender systems that allow recommendations to be made through a dialogue with the user. These devices are bringing the use of streaming services to new users, who need to be engaged in dialogues from which contextual information can be extracted, thus providing more relevant recommendations.
• Explainability: another line of work is the generation of reasons, using NLP techniques, for the music recommendations provided [169]. While it is possible to find such explanations in terms of users who listened to a certain group, research is needed on how to find explanations at the context level, such as "users who do sports in the morning also listened to ..." [170].
• Biological data and wearables: the use of devices that monitor vital signs, such as smart bands or wearables, has not been extensively studied in this domain, or at least there are no large datasets with sufficient contextual information. Exploiting this information and creating a reference dataset could be a future line of research in this area.

Discussion
After answering the research questions initially posed in this study and presenting the results of the review, we are now able to discuss the results for each of the questions.
The first question asked what kind of contextual information is mainly exploited in the music domain. CARS in the music domain use a wide variety of types of contextual information. A large number of the reviewed works extensively exploit the following three contexts: (1) the temporal and session context of songs, mainly because this is implicit information that can be easily acquired and from which many other contextual factors can be derived; (2) the emotional context of the user: as moods directly impact the user's preferences, a large body of work seeks to incorporate these contextual factors when providing recommendations; and (3) the user's physical context and activity context, since music preferences are closely linked to the task being performed and music is an item that can be consumed in the background. The results of the review also show that mobile devices and music session contexts predominate among the interaction media contexts. This highlights that studies increasingly try to infer the user's context from the music session (the user's recent interaction with the system) or simply from the information they have available. This is mainly because contextual information is scarce, especially in the case of the user's mood. In this case, authors tend to infer this information from patterns in the user's playback behavior or from external information from social networks. There is a clear trend in the works towards contextual information inferred with machine learning techniques, especially embeddings that capture such contextual information from different sources.
Secondly, regarding the devices used to obtain contextual information, most of the works obtain it implicitly, so it is logical that the social sphere is exploited as a source, as well as the information derived from the user's interaction with the system (virtual sensors). However, few works use contextual information from physical devices, since this type of context is much more costly to obtain in real time. Moreover, few reference datasets with this type of information are publicly available, which highlights the need to create datasets with contextual information obtained from sensors so that different approaches can be evaluated.
As wearables, sports wristbands or simply the continuous use of smartphones proliferate, obtaining contextual information about the user's physical activity from music playback applications (such as Spotify, Google Music, etc.) on mobile devices (Android [171] or iOS [172]) has become trivial. However, there are currently no large public reference datasets in the literature that provide this contextual information. The release of datasets by these large companies could help research into new ways of obtaining context, such as using part of the context when it is available and using models pre-trained on large datasets to complete partial contextual information.
Thirdly, in terms of how to incorporate contextual information into recommender systems, there is a clear focus on models that use context as a predictor of song ratings, including this information in the model itself under the paradigm of contextual modelling. Pre-filtering techniques are highly relevant in the literature and post-filtering techniques have also been used, although with a much smaller representation in this study. The myriad of techniques in the contextual modeling approach, seeking to make the most of context as a predictor of musical preferences, is evident.
There is a clear emphasis in the literature on the use of deep learning and embeddings to condense all available information, whether contextual information, song content (lyrics, rhythm, valence, audio signal characteristics) or social user information (posts, images, etc.), with the aim of using it to predict the rating that a user is likely to provide. In most of the works on context-aware recommender systems in which deep learning techniques are applied, contextual states are obtained from topic modeling or image processing, although the literature on the application of these methods in the field of music is much more limited.
Fourthly, the evaluation protocols have been analyzed, and the offline protocol predominates over online evaluation and user studies. Very few studies present both an offline evaluation and a user study, and only three studies address other types of metric such as serendipity or coverage. In the music domain, users often search for new content, so it is essential to promote approaches that improve these latter metrics. Another aspect that is currently being explored in the literature is the bias introduced by the long tail effect, in which recommender systems are only able to recommend to their users the most popular items in the catalogue. On this point, several of the recent works analyzed focus on how to address this issue and on how context can help in identifying new music beyond the most popular songs. Furthermore, in the specific case of playlist recommendations, it is necessary to ensure not only that the items are suitable for the user but also that the diversity of the playlist can be controlled.
With these results, it becomes quite evident that there is a need to work on the evaluation of metrics such as diversity in this domain.
Fifthly, the datasets used by the works included in this study have been analyzed and the publicly available datasets have been detailed. Most of the works use datasets based on Last.fm, and in recent years new datasets have emerged that incorporate social information from social networks such as Twitter. However, the information available in these datasets is partial and often needs to be enriched by researchers, who infer certain contextual factors from the external information available, which in many cases is incomplete (location of the tweet if it is geolocated, language of the tweet, time of publication, etc.). The fusion of information from different sources and the increasing size of published datasets are becoming a challenge for researchers and increasingly require big data processing approaches.
At the same time, new datasets need to be published to serve as a reference for the reproducibility of the results of the different approaches proposed in the analyzed studies. As shown in Section 4.5, some of the reference datasets are publicly available but a large number of research papers compile their own datasets and do not release them to the rest of the scientific community, hindering reproducibility and the creation of new recommendation approaches with such data.
Finally, some of the current challenges in CARS applied to the music domain have been described, together with future lines of work in the field, both those proposed in the literature and those raised by this paper after reviewing the selected studies.

Conclusions and Future Lines of Work
A systematic literature review has been carried out to analyze previous CARS work in the specific domain of music recommendation. This review addresses several important topics, such as the main types of context used in recent works and those that have not yet been explored in sufficient depth.
In addition, the main devices and technologies used in the literature to extract contextual information have been identified. It has been shown in what proportion they have been used in this specific domain, revealing a clear trend towards the exploitation of social context. Subsequently, the different paradigms for incorporating this contextual information into the recommendation process were analyzed, together with the alternatives chosen by the authors of the analyzed works, showing a wide variety of approaches, mostly under the contextual modelling paradigm.
Afterwards, the evaluation protocols and metrics used in each of the papers in the music domain have been identified and analyzed. Although this study may be biased, since the quality criteria for the inclusion of papers valued the use of public datasets, we conclude that offline protocols are widely used and that Precision and Recall, together with their @K versions, are the most widely used metrics to evaluate the performance of CARS in the music domain.
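For readers less familiar with these metrics, the following minimal sketch shows one common way of computing Precision@K and Recall@K for a single user in an offline protocol. The function names and inputs are illustrative assumptions, and the sketch follows the convention of dividing precision by K even when fewer than K items are recommended; individual studies may define the details differently.

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommended items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Share of the relevant items that appear in the top-k recommendations."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        # No held-out relevant items: recall is reported as 0 in this sketch.
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)
```

Here `recommended` is a ranked list of track ids without duplicates and `relevant` is the set of held-out tracks the user actually listened to; system-level scores are usually obtained by averaging over users.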
Next, an extensive review of the datasets used in the works included in the study has been carried out, presenting a selection of the most widely used publicly available datasets. This is a useful resource for researchers new to this domain and encourages the use of reference datasets in the field, which supports the reproducibility of research and the comparison of different approaches. In addition, the datasets that have been widely used by the included works and other recent datasets with contextual information have been presented. The latter will allow new advances in this domain by exploiting social network information from different types of data such as text or images.
Finally, some of the challenges in this domain have been identified, such as the use of metrics that favor diversity and novelty or how to deal with the problem of excessive dimensionality introduced by contextual variables, among others. Likewise, potential new lines of research in this field have been presented, such as the development of systems capable of explaining recommendations or the use of recommender systems that interact with users through conversational agents in voice assistants.
To the best of our knowledge, there is no previous SLR in this specific domain. We consider that this work can support developers and researchers in music recommendation and context-aware recommender systems by helping them in their initial documentation, in the identification of the most used and referenced datasets in the literature, and in the exploration of new research approaches in the music domain.
Future lines of work following this systematic review of the literature include: the construction of a dataset that combines information from the previously reviewed datasets with social and content information of the songs, in order to perform automatic contextual labelling of new songs without previous contextual information; the use of new interaction methods to extract contextual information, as in the case of recommender systems in voice assistants and conversational systems; and the exploration of context information in recommender systems for groups in the music domain, a kind of recommender system not addressed in this review.