Dataset of Search Results Organized as Learning Paths Recommended by Experts to Support Search as Learning

Abstract: In this article, we introduce a dataset of curated learning paths (LPs) to support search as learning. LPs were obtained through an online survey delivered to experts in different domains. Data were then analyzed and described in terms of a set of variables. The resulting dataset comprises 83 LPs, each containing three web pages, for an overall collection of 249 documents. The dataset is intended to provide information scientists, education researchers, and industry professionals who provide information services in educational contexts with a valuable resource to (i) investigate patterns in the order of LPs, (ii) improve ranking models and/or re-ranking methods, (iii) explain the structure of the recommended LPs, and (iv) investigate alternative approaches to display search results based on the features of LPs. Dataset license: CC-BY 4.0.


Summary
This article describes a novel dataset comprising 83 learning paths (LPs) curated by experts. Each LP in the dataset includes a list of three sequential web pages, experts' demographic information, experts' judgment and reasoning for including the selected sources and their order, LP extension, and content type. Uses of the dataset include, but are not limited to, research and development in areas such as information science, education, information retrieval, linguistics, and industry. The dataset is available through the Mendeley Data repository.
The remaining sections of this article are structured as follows. First, Section 2 provides background information and rationale. Section 3 offers a detailed description of the files included in the dataset, definitions of variables, and data distribution. Then, Section 4 describes the methodology and instruments used to collect the data. Finally, Section 5 comprises conclusions and applications for future research. Additionally, the Spanish version of the questionnaire is provided as Supplementary Materials, and the Python script used to extract general features in the dataset is presented in Appendix A.

Background and Rationale
The concept of LPs, defined as a finite and organized sequence of learning objects (LOs), can be linked to Vannevar Bush's notion of trails [1]. When trails are situated in learning contexts, a fundamental problem arises, known as curriculum sequencing: finding the optimal sequence to maximize learning outcomes. This problem is considered NP-hard [2,3] (i.e., at least as hard as any problem in NP, so no efficient exact algorithm is known).
More than 50 years after Bush's influential piece, the impact of technology on education has been prolific. Different tools and resources have been developed, including learning management systems (LMS) (e.g., Moodle) and massive open online courses (MOOCs), among others. Although different in nature, a common characteristic of both is that the content they provide is the same for all users, regardless of their prior knowledge or learning style. To address this problem, some have focused on personalization within these types of platforms [4,5]; however, in spite of the efforts to pursue this goal, statistics indicate that 83% of students use search engines to meet their immediate information needs [6,7].
Current search technology is based on years of research and development on information retrieval and information science. Popular alternatives such as Google and Bing provide rapid access to a vast amount of content on the Web. Although there have been numerous advances in search systems, these still face major challenges when it comes to understanding searchers' complex information needs [8]. Search technologies exploit a wide range of features in the retrieval and ranking process, the latter being critical in the phase of organizing results in terms of their relevancy to users' needs. Unfortunately, most approaches mainly rely on topical relevance [9,10] with limited coverage of other manifestations of relevance such as cognitive, affective, and situational [11,12], which may be critical in learning scenarios.
Despite the active role of search technologies in educational settings, these were not designed to support complex learning processes [13,14]. This is evidenced by: (i) a mismatch between search engine results' presentation style and how people learn under high levels of uncertainty [15,16]; (ii) students' attitudes and behaviors toward information search [17,18]; and (iii) students' low levels of information literacy [19,20]. Thus, it is fundamental to investigate alternative solutions to enhance learning outcomes as part of online search (i.e., search as learning, SAL). Furthermore, this need becomes even more urgent as the current pandemic (COVID-19) evolves, since a large number of people worldwide are turning to online resources found through search engines to support their learning processes. In this case, it is fundamental to find suitable approaches (such as LPs) to better support learning in the context of searching for information on the Web.
Despite numerous efforts to build LPs, to the best of our knowledge, no research has been conducted to study the effects on learning outcomes of search results presented as LPs based on expert knowledge. Therefore, we have seen the need to build a dataset of LPs which could be used by information scientists, education researchers, and industry professionals who offer services related to information seeking and retrieval for educational purposes. The dataset will allow them to conduct studies with numerous applications in textual data extraction. For example, in research contexts, the dataset could be used to extract features to: (i) identify patterns in the sorted LOs recommended by experts, (ii) improve ranking models and/or re-ranking algorithms to fulfill immediate learning needs, (iii) explain the order of documents within LPs, and (iv) investigate alternative approaches to display search results based on the features of LPs. Alternatively, in educational settings, the dataset could be used as a learning resource or to further investigate teaching and learning strategies for complex topics. To do so, we asked experts on specific topics and domains to provide a sequence of three web pages (mostly based on text) that can be used to guide the learning process of incoming college students who know little to nothing about a selected topic. We considered two criteria to classify an individual as an expert in a specific area: knowledge and length of experience. For this particular case, we considered any person with a bachelor's degree or higher and at least one year of experience in a specific topic to be an expert. The survey was directed especially, but not exclusively, at professors and researchers. Due to the geographical location of the research group carrying out the present study and the native language of the environment, the dataset was built containing mainly LOs in Spanish.
It is also worth noting that Spanish is the third most used language on the Internet [21].

Data Description
We invited several experts from Hispanic countries to participate in an online survey. We obtained answers from seven different countries in six domains (i.e., computer science, physics, finances, laws, biology, and industrial engineering), as shown in Figure 1.

Files
We provided two files in the repository:
• A comma-separated values (CSV) file with all the data in Spanish. The file name was LP_dataset_spanish_version.csv.
• A copy of the previous CSV file (LP_dataset_english_version.csv) with categorical data and variable names translated into English in order to facilitate analyses for English-speaking researchers.
In this article, we addressed the English version.

Features
Table 1 describes the features available in the dataset. The features were organized as follows:
• The first fourteen features corresponded to demographic information provided by survey respondents.
• The following twelve variables described the LP, considering three sorted LOs and the description of the selection criteria provided by the experts.
• The last three characteristics were general features that we extracted from the recommended LPs to facilitate the classification process; these are described in the following section.
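The grouping of features described above can be sketched in a few lines of Python. This is only an illustration: it assumes that the CSV header contains exactly the 29 features in the order given in the text (14 demographic, 12 LP-related, 3 extracted) with no additional identifier columns, and the column names used here are placeholders rather than the names listed in Table 1.

```python
from typing import Dict, List

# Group sizes as described in the text: 14 demographic features,
# 12 LP-description variables, and 3 extracted general features.
GROUP_SIZES = (("demographic", 14), ("learning_path", 12), ("extracted", 3))


def split_feature_groups(header: List[str]) -> Dict[str, List[str]]:
    """Split an ordered CSV header into the three feature groups."""
    expected = sum(size for _, size in GROUP_SIZES)
    if len(header) != expected:
        raise ValueError(f"expected {expected} columns, got {len(header)}")
    groups: Dict[str, List[str]] = {}
    start = 0
    for name, size in GROUP_SIZES:
        groups[name] = header[start:start + size]
        start += size
    return groups


# Placeholder column names (hypothetical); the real names are in Table 1.
demo_header = [f"col_{i:02d}" for i in range(1, 30)]
groups = split_feature_groups(demo_header)
print({name: len(cols) for name, cols in groups.items()})
```

With the actual file, the header could be read with the standard `csv` module (for instance, the first row returned by `csv.reader` on LP_dataset_english_version.csv) and passed to `split_feature_groups`; if the file carries extra columns, the group sizes would need to be adjusted.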

Data Distribution
The dataset consisted of 249 LOs organized in 83 LPs recommended by experts from Argentina, Chile, Colombia, Ecuador, Mexico, Spain, and Venezuela. Table 2 summarizes demographic data by domain, level of education, and sex. As shown in Figure 2, 81.93% of the experts belonged to higher education institutions and 18.07% belonged to research centers. In addition, 81.93% of the respondents were men and the remaining 18.07% were women, with their age distribution shown in Figure 3. Respondents were experts belonging to six different domains: biology, computer science, finances, industrial engineering, laws, and physics. Of these, 63.85% had a doctoral degree, 19.28% had a master's degree, and the remaining 16.87% had a bachelor's degree. Figure 4 shows this distribution identifying two groups: students and faculty. Finally, Figure 5 shows a brief distribution of the collected data in relation to the last three extracted characteristics: 91.57% of the documents were in Spanish (the remainder were in English), 89.16% were in text, and 67.47% were short.
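As a quick consistency check on the percentages above, the following sketch recovers the whole-number counts they imply. The counts are not reported in the text; they are inferred here under the assumption that each percentage is a share of the 83 LPs (using the 249 documents as the base would give the same percentages, since 249 = 3 × 83).

```python
N_LPS = 83  # number of learning paths reported in the text

# Percentages exactly as reported in the text.
reported = {
    "higher education institution": 81.93,
    "research center": 18.07,
    "doctoral degree": 63.85,
    "master's degree": 19.28,
    "bachelor's degree": 16.87,
    "Spanish": 91.57,
    "text": 89.16,
    "short": 67.47,
}


def implied_count(percent: float, total: int = N_LPS) -> int:
    """Nearest whole count that a reported percentage corresponds to."""
    return round(percent * total / 100)


# Inferred counts (an assumption of this sketch, not values from the dataset).
implied = {label: implied_count(p) for label, p in reported.items()}

for label, count in implied.items():
    recomputed = 100 * count / N_LPS
    # Each reported figure should agree with the recomputed one to within 0.01.
    assert abs(recomputed - reported[label]) < 0.01, label
    print(f"{label}: {count}/{N_LPS} = {recomputed:.2f}%")
```

Under this assumption every reported percentage maps to a whole number of LPs, and the three degree levels add up to 83. The doctoral share recomputes to 63.86% when rounded, which differs from the reported 63.85% only in the last digit.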

Methods
In order to study the various aspects introduced in Section 1, we had to create a dataset to consider the aspects detailed in Table 3. These guidelines were based on the literature on searching as learning and information seeking introduced in Section 2. To carry out the creation of the dataset, we followed a method based on expert judgment, which is widely used in fields such as education (e.g., [22,23]), research (e.g., [24,25]), and industry (e.g., [26-29]), among others.

Table 3. Guidelines considered for the creation of the dataset.

| Current Scenario | Criteria |
|---|---|
| Lack of validation for search results. | Consider experts' knowledge and criteria to select and organize web documents as LOs. |
| Endless search results and random reading order. | Organize search results as LPs, defined as a finite and organized sequence of documents (LOs), considering that the order in which study material is presented can lead to different learning outcomes [30]. |
| Observed common attitudes and behaviors among students toward web search contexts as little time and effort were invested in finding information [18]. | Short LPs intended to satisfy an immediate learning need, since students spend 14:21 min on average in a search session to read text documents [18]. |
| Most web content is in text format. | LPs mostly based on text. |
| Most IR (Information Retrieval) research is based on information presented in English language. | Spanish is the third most used language on the Internet [21], so it is necessary to pay attention to these users. |

Based on the guidelines shown in Table 3, we designed a semi-structured questionnaire, which was implemented using Google Forms. The questionnaire was targeted at experts in six different domains. To select domains, we first identified the most searched domains on the Internet [31]. After that, we selected six domains: computer science, finances, laws, biology, industrial engineering, and physics.
In order to define specific subjects in each domain, we first identified two experts per domain. More specifically, we located 12 faculty members from different universities and countries. Once the experts were identified, an appointment was made with each one. Each of them was asked to suggest a topic of interest for society and formulate a general question related to it. Several interviews were scheduled with each expert until the structure of the questions and the language used were fine-tuned to be appropriate for students who have no prior knowledge of the subject. Once the questions were defined, we requested the assistance of an expert in formulating questions in educational contexts to validate whether they were properly posed.
Once the validation process was completed, an online survey was designed and the study was presented to the Institutional Ethics Committee of the Universidad de Santiago de Chile. The research protocol for this project was approved on 16 April 2019 (Ethical Report No. 160.2019).
The overall questionnaire consisted of 23 items including 2 agreement questions, 11 closed-ended demographic questions, and 10 open-ended questions. The online survey was tested in a pilot study with 32 members of the InTeracTion (http://www.interaction-lab.info) research group. Data collection was carried out in three stages:
• In the first stage, prestigious universities, research centers, and industries of Spanish-speaking countries in each of the six domains of interest were identified.
• In the second stage, we created a list including faculty members, researchers, and professionals whose institutional email was available.
• In the third stage, invitations to participate in the online survey were sent to the experts via email. In addition, the experts were asked to share the survey with senior students (with at least a bachelor's degree) who are proficient in the subject.
In the online survey, each expert was first required to fill in a demographic questionnaire. Second, each expert selected a topic according to his or her field of expertise (Table 4). Third, we asked experts to provide a sorted sequence of three web pages (mostly based on text) that can be used to guide the learning process of students who know little or nothing about the selected topic. A restriction indicated for this task was that all three selected documents should be readable within a time span of 20 min (maximum). Finally, the experts were asked to describe their selection criteria. We invited faculty members, researchers, and senior students (with at least a bachelor's degree) from prestigious universities and research groups of Spanish-speaking countries to complete the online survey. The survey was available from 25 May 2019 to 31 January 2020.
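The structure of a response to the task described above (a sorted sequence of three web pages for a selected topic) can be illustrated with a small data structure and a structural check. This is a minimal sketch and not the authors' instrument: the class name, field names, and example URLs are hypothetical, and the check covers only the number, uniqueness, and syntactic form of the URLs; it does not verify that links are reachable or that the documents fit within the 20 min reading limit.

```python
from dataclasses import dataclass
from typing import List, Tuple
from urllib.parse import urlparse


@dataclass(frozen=True)
class LearningPath:
    """Hypothetical container for one expert response."""
    topic: str
    urls: Tuple[str, ...]  # ordered: first, middle, and last document

    def problems(self) -> List[str]:
        """Return a list of structural issues (empty if none are found)."""
        issues: List[str] = []
        if len(self.urls) != 3:
            issues.append("expected exactly three web pages")
        if len(set(self.urls)) != len(self.urls):
            issues.append("duplicated URL within the path")
        for url in self.urls:
            parts = urlparse(url)
            if parts.scheme not in ("http", "https") or not parts.netloc:
                issues.append(f"malformed URL: {url}")
        return issues


# Illustrative example with placeholder URLs.
lp = LearningPath(
    topic="example topic",
    urls=(
        "https://example.org/intro",
        "https://example.org/core",
        "https://example.org/advanced",
    ),
)
print(lp.problems())
```

Keeping the URLs in an ordered tuple preserves the sequence recommended by the expert, which is the property that distinguishes an LP from an unordered set of search results.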
Across the 10 Hispanic countries invited to participate in the online survey, 3717 experts affiliated with a university, research group, and/or industry were identified; 109 experts completed the survey, for a response rate of 2.93%.
Collected raw data were filtered to eliminate observations containing broken URLs, duplicated URLs within a single record, or inconsistent data. Twenty-six observations were discarded during this process, leaving the 83 LPs in the final dataset. In order to guarantee the selection criteria of the dataset, three variables were created using the Python script shown in Appendix A; these are the three extracted characteristics summarized in Section 3 (document language, content type, and extension).
To make it easier to identify the LOs, these were linked to unique identifiers (ID_LO) according to the template DNNO, composed of four digits, where D identifies the domain and O the position of the document within the LP. For example, the document with ID 4032 belongs to the physics domain (4) and is in the middle of the LP (2).
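The four-digit DNNO template can be decoded with a short helper. The sketch below is illustrative only: the names `LoId` and `parse_lo_id` are hypothetical, the meaning of the two middle digits (NN) is not stated in this section and they are therefore kept as an opaque string, and the assumption that the order digit takes the values 1 to 3 follows from each LP containing three LOs.

```python
from typing import NamedTuple, Union


class LoId(NamedTuple):
    domain: int  # D: domain digit (4 is physics in the text's example)
    middle: str  # NN: two middle digits, meaning not stated in this section
    order: int   # O: position within the LP (2 is the middle document)


def parse_lo_id(id_lo: Union[int, str]) -> LoId:
    """Decode an ID_LO value that follows the four-digit DNNO template."""
    text = str(id_lo)
    if len(text) != 4 or not (text.isascii() and text.isdigit()):
        raise ValueError(f"ID_LO must have exactly four digits: {id_lo!r}")
    order = int(text[3])
    # Assumption: positions are numbered 1 to 3, since each LP has three LOs.
    if order not in (1, 2, 3):
        raise ValueError(f"unexpected position digit in ID_LO: {id_lo!r}")
    return LoId(domain=int(text[0]), middle=text[1:3], order=order)


parsed = parse_lo_id(4032)
print(parsed.domain, parsed.middle, parsed.order)
```

For the example in the text, `parse_lo_id(4032)` yields domain 4 and order 2, matching the physics domain and the middle position of the LP.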
Actual web documents were not included due to potential copyright infringement. Access to actual documents will be provided upon request in case these are no longer available through the URL in the dataset.

Conclusions
In this article, we introduced a dataset of curated LPs. We provided detailed descriptions of the dataset structure, definitions of variables, data distribution, and methodological approach. Our dataset constitutes a valuable resource for researchers and educators dealing with problems related to information search and learning.
The dataset responds to current issues identified in the literature: first, the lack of curated search results linked to learning goals; second, the current presentation style of search results implemented by popular search engines; third, students' common attitudes toward learning complex topics using Internet resources; fourth, the fact that most content on the Web is text-based; and finally, the lack of datasets in Spanish.
The proposed dataset has different types of applications. First, researchers in information science and education could investigate the effects of LPs on students' learning of a given topic. Second, researchers could use the dataset to find patterns that could be applied to improve ranking algorithms, explain the order of documents, and investigate novel approaches to display search results in learning contexts. The dataset could also be used to investigate teaching-learning strategies for complex topics.
Finally, to the best of our knowledge, this is the first open dataset containing curated learning paths in Spanish. While relatively small compared to datasets in other domains, the methodological approach provided in this article can be followed by other researchers to further extend the current dataset with other topics and languages.