A Global Book Reading Dataset

Abstract: The choice of what to read is both influenced by and indicative of such factors as a person’s beliefs, culture, gender, and socioeconomic status. However, obtaining data that include such personal attributes, as well as the detailed reading habits and activities of individuals, is difficult and would usually require either (i) data from e-readers, such as the Amazon Kindle, or from library checkouts, both of which are hard to obtain, or (ii) distributing questionnaires and conducting interviews, which can be expensive and suffer from recall bias. In this study, we present a dataset of over 40 million reading instances of 1,872,677 unique individuals collected from Goodreads. Goodreads is a book-cataloging social media platform with millions of users, where users share comments on the books they have read while creating and maintaining social connections. We enrich the dataset with gender and location information. The dataset presented in this study can be used to perform cross-national and cross-gender analyses of reading behavior among book enthusiasts.


Introduction
Reading is a globally popular pastime that has been shown to be a beneficial nonmedical strategy for improving mental health and well-being [1]. Reports from 2017 [2] and 2018 [3] have shown India to be the most active country in the world with regard to reading, with an average of more than ten hours a week spent on reading. According to both reports, 70% of Americans indicated having read at least one book during the past year, and the median number of books read in the U.S. per person per year is four, with an average of 12 [4]. The Pew Research Center explores the demographic traits that characterize the approximate quarter of the American population that does not read books in a given year, a percentage that has grown compared to a decade ago [5].
As reading is usually an activity performed alone and in the comfort of one's home, information on the reading habits and behaviors of nations and individuals is rarely recorded. While the growth of e-readers could result in the documentation of this data, the information would not be openly accessible to members of the research community. Additionally, print books are still more popular than digital books [6,7], meaning that such data would only account for a small proportion of reading instances.
Given the known positive effects of reading on mental and psychological well-being, there is continued interest in understanding the factors that influence reading habits. Amid the COVID-19 pandemic, for instance, a web survey of the reading habits of Spanish and Italian readers was conducted [8], collecting a dataset of these habits during confinement. While the data can tell us a lot about these habits, the authors acknowledge the low response rates: a large proportion of those who opened the questionnaire abandoned it before answering all the questions. Moreover, the data are limited to two countries, preventing a large-scale cross-country comparison.
In this paper, we present data from Goodreads (http://goodreads.com, accessed on 20 July 2021) with the goal of enabling large-scale studies of reading behaviors. Goodreads is a book-centered social media platform, launched in 2007. Based on their own claims (https://www.goodreads.com/advertisers, accessed on 20 July 2021), they have "45 million unique visitors a month". Other statistics estimate that, as of 2019, the site had 90 million registered users [9]. Based on our explorations, the website has acquired well over 120 million users over its 13 years of activity, though many of these might no longer be active. We present a dataset of 41,253,535 book "reviews" (while Goodreads uses the terminology "review", these reviews do not require a numerical rating or textual evaluation, and so the term "posting action related to a book" might be more appropriate), left by 1,872,677 Goodreads users with public profiles. Upon collection, the data are enriched with country information and inferred gender. The collection process and the details of the dataset are reported in Section 2.
Over the past couple of years, data from the Goodreads platform have been used for several academic studies. Thelwall and Kousha [10] explore the user base of the website, comparing the behavior and activity on the platform with respect to gender by, for instance, showing that females register more books and rate them less positively. As the study is conducted through the analysis of 50,000 random users, the dataset we share as part of this study could also be used to answer similar questions. More broadly speaking, we see this dataset as useful for supporting user-centric studies on reading behavior. Other researchers have focused solely on the reviews and ratings left on Goodreads. For example, [11] studies the sentiment, emotion and language expressed in reviews. Others have looked at what aspects of a book, e.g., characters or storyline, are being discussed [12]. Kousha et al. [13] investigate the feasibility of using Goodreads' book metrics and reviews as a means to assess the impact of books. Alghamdi and Ihshaish [14] address related questions of potential influence, looking only at Arabic book reviews. Maity et al. [15] study how much user behavior on Goodreads could be indicative of sales on other book retail platforms, such as Amazon. As we have chosen not to share the review text, or the book identities due to potential risk of abuse, our dataset does not directly support these review-centric studies.
Rather, we hope that these data will allow studies of reading habits and behaviors, and how these habits are impacted by various events and social movements. In particular, we believe that certain cross-cultural, cross-gender, and cross-country studies of reading are enabled through the use of these data.

Data Collection and Exploration
To collect the data in this study, we used the official Goodreads API [16], using a Python program to connect to and collect data from the API. As of 8 December 2020, the website has declared that it no longer provides new API keys (https://help.goodreads.com/s/article/Does-Goodreads-support-the-use-of-APIs, accessed on 20 July 2021), and it has since started to retire previously issued API keys (https://help.goodreads.com/s/article/Why-did-my-API-key-stop-working, accessed on 20 July 2021).
For our data collection, we chose a user-centric approach, i.e., collecting a "complete" snapshot of data for a sample of users, rather than, for example, collecting all readers of a sample of books. To select the sample of users, we first observed that the internal Goodreads user ID seems to be assigned consecutively, with user ID 1 belonging to the Goodreads founder Otis Chandler. The largest user ID, as of September 2020, was 121,761,242. We then sampled from this user ID space as follows.
We began by selecting a few user IDs at random and then continued the collection by repeatedly adding a constant number to these values and collecting the resulting accounts. The precise details of this changed during the collection process as we gradually refined our data collection objectives. For example, initially, the clustering of many near-adjacent IDs, corresponding to users who registered around the same time, was not a concern. In fact, we were interested in investigating accounts that had joined during the COVID-19 pandemic, and a denser collection of IDs (using smaller increments) was conducted for IDs in that range (causing the peak that is visible in Figure 1). Later, however, we changed to uniformly sampling the entire ID space to avoid oversampling users who registered on particular dates.
To obtain information about a given user through the API, we used the user.show method (https://www.goodreads.com/api/index#user.show, accessed on 20 July 2021). Note that only information for users who set their profile to be viewable by "anyone (including search engines)" was used, and the API does not support the collection of data for private accounts. Next, after making sure that the account was public, we collected all the books that the user had added through the reviews.list method (https://www.goodreads.com/api/index#reviews.list, accessed on 20 July 2021), paginating through long result lists where necessary. While we follow the Goodreads terminology and use the term "review", these reviews are not required to have any textual review, nor any type of rating. Instead, they can merely indicate that a user posted a book to one of their shelves, for example, the "read" shelf.
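The two ID-selection strategies described above can be contrasted with a minimal sketch. This is an illustration rather than the collection program itself: the function names, the step size, and the sample sizes are assumptions, and only the maximum user ID is taken from the text. No API calls are made, since new Goodreads API keys are no longer issued.

```python
import random

# Largest user ID observed in September 2020 (value taken from the text).
MAX_USER_ID = 121_761_242


def strided_ids(start, step, count, max_id=MAX_USER_ID):
    """Early strategy: start from one ID and repeatedly add a constant step.

    `start`, `step`, and `count` are hypothetical parameters for illustration.
    """
    ids = []
    current = start
    while len(ids) < count and current <= max_id:
        ids.append(current)
        current += step
    return ids


def uniform_ids(count, max_id=MAX_USER_ID, seed=None):
    """Later strategy: draw distinct IDs uniformly from the whole ID space."""
    rng = random.Random(seed)
    # Sampling from a range object avoids materialising every possible ID.
    return rng.sample(range(1, max_id + 1), count)
```

A small step in `strided_ids` produces many near-adjacent IDs, i.e., users who registered around the same time, whereas `uniform_ids` spreads the sample over all sign-up periods.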
The data were collected in 2020, encompassing any books that users had read since first joining the website until August 2020. Tables 1 and 2 display the fields available for each individual and review respectively (this dataset is publicly available at https://figshare.com/projects/A_Global_Book_Reading_Dataset/118854 (accessed on 20 July 2021)).

Table 1. Explanation of fields available for each user.

| Field | Description | Included in Public Dataset |
|---|---|---|
| User ID | A unique, numerical identifier for the user on the website. | A hashed version of the ID is made available. |
| Name | Name of this user. In contrast to many other pseudonymous social networks, Goodreads users tend to use real names and even full names, as the input form has separate first, middle, and last name fields. | No |
| Username | The username that the user has selected. This field is optional; name is the field each user must fill in to create an account. | No |
| Profile Image | URL of the user's profile picture. | No |
| Friend Count | The number of friends that the user has. Being friends on Goodreads is a bidirectional property, independent of uni-directional following; 62% of users do not have any friends. | Yes |
| Review Count | The total number of books added to any of the user's shelves, in other words, the total number of books in the user's automatically generated "all" shelf. Only 4.5% of users have more than 100 books in their shelves. | Yes |
| Groups Count | The number of groups the user is part of. Some groups can be freely joined, for others the user needs to be admitted. | Yes |
| Location | An optional self-reported location of the user. By default, Goodreads seems to infer a user's country, presumably based on IP address. This selection can be changed later and a drop-down list of countries is available. Only 3.7% of users have left the field empty. | Self-reported locations are not reported but detected country-level values are. |
| Age | Self-reported age of the user; 97% of users have not completed this value. | No |
| Gender | Self-reported gender of the user. Only 7735 have filled in this value. Options, from a drop-down, include male, female and custom, which supports free-text. | Inferred gender values are included, but not the self-reported ones. |
| About | An optional self-description of the user. | The numerical length of this section is included, but not the textual content. |
| Favorite Authors | Favorite authors of the user. | Yes, but author IDs are replaced by hashed values. |
| Website | An optional field, allowing users to share their website or any other link. | No |
| Joined | The month and year in which the user joined the platform. | Yes |
| Last Active | The month and year that the user was last active on this website (since our collection was conducted in 2020, dates within this year do not necessarily indicate that the user has abandoned the website). | Yes |

The data obtained through the Goodreads API were then enriched, in particular by attempting to infer a user's gender. To infer gender from a user's self-declared name and/or username, we used the Name2GAN tool [17] to detect the most probable female or male gender of the name. Unlike name dictionaries from the U.S.A. Social Security Administration (https://www.ssa.gov/oact/babynames/limits.html, accessed on 20 July 2021), Name2GAN is trained on multi-lingual Wikipedia and social media data and recognizes names from many cultures. While the tool only supports a binary male-or-female classification, as well as an "unknown" option for unrecognized names, we are not implying that gender is binary, and we acknowledge that many people self-identify as non-binary. Goodreads also supports free-text, non-binary gender in the user profile. However, for the small set of users where a self-reported gender was available, including self-reported non-binary gender, we decided not to include this information in the shared dataset so as not to provide an easy way to identify users based on their gender identity. We believe that the inferred binary gender still provides a meaningful signal for studying gender differences between women and men in reading behavior, without exposing vulnerable minorities to the risk of identification.
Using this approach, we inferred a gender of either male or female for 87% (1,634,103 out of 1,872,677) of users. To estimate the accuracy of the gender inference, we compared the detected gender against the self-declared gender for the set of users who added their genders manually. We found that, among those with self-reported binary gender values, 86.4% of the instances are labeled correctly. Upon manual inspection of the misclassified cases, we found that these names are often either abbreviated versions of the person's name (e.g., E. M.), truly ambiguous names (e.g., Mallia Chris), or not people's names at all (e.g., DR, International School). The distribution of gender values is displayed in Figure 2. We can observe that there are disproportionately more female users in our dataset than male users. This is, however, consistent with other statistics showing that the user base of the website is predominantly female [18].
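The coverage and accuracy figures above are simple ratios, and the following sketch shows how they could be computed. The label strings ("male", "female", "unknown") and the function names are assumptions for illustration. The sketch also assumes that an "unknown" prediction for a user with a self-reported binary gender counts as incorrect, which the text does not specify.

```python
BINARY = ("male", "female")


def coverage(inferred):
    """Fraction of users for whom a male or female gender was inferred."""
    inferred = list(inferred)
    known = sum(1 for label in inferred if label in BINARY)
    return known / len(inferred)


def accuracy(pairs):
    """Share of correct labels among users with a self-reported binary gender.

    `pairs` holds (self_reported, inferred) tuples; users whose self-reported
    value is not binary (e.g., a custom free-text entry) are left out.
    Assumption: an inferred "unknown" is counted as a wrong label.
    """
    evaluated = [(s, i) for s, i in pairs if s in BINARY]
    correct = sum(1 for s, i in evaluated if s == i)
    return correct / len(evaluated)
```

With the counts given in the text, the coverage is 1,634,103 / 1,872,677 ≈ 0.873, which is reported as 87%.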
Data 2021, 1, 0
Next, we analyzed the location values of users, aiming to detect country of origin based on the unstructured texts users have shared on the platform. By default, Goodreads appears to automatically infer a user's country, most likely based on the user's IP address. (This is based on the authors' own observation when creating a test account.) After sign-up, users can then choose to edit this location information, which includes selecting a country from a drop-down menu, including an "-" (empty) option. They can also provide free-text city and state information. Given the enforced country-level scheme, almost all users have a clearly identifiable country. To extract country information, both in the majority of easy cases and in the smaller number of harder cases, we used a rule-based approach followed by GeoPy (https://github.com/geopy/geopy, accessed on 3 August 2021). GeoPy is a Python client for several popular geocoding web services; more specifically, it makes use of the Google Maps Platform, OpenStreetMap Nominatim, and Bing Maps, among others. We performed the following steps, one after the other, stopping as soon as a country was found:
• Split the string on commas and check only the last part against a list of country and state names, labeling the country if the value is on that list. This is because most people follow the convention of putting their country last in an address. A total of 96% of locations are detected in this manner.
• Split the string on commas and check only the first part against a list of country names, labeling the country if the value is on that list. This mirrors the previous step and covers those who start their address with the country name. A total of 0.07% of locations are detected in this manner.
• Input the entire string to GeoPy. A total of 0.06% of locations are detected in this manner.
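The three steps above can be sketched as a short cascade. This is a minimal sketch, not the authors' code: the country set and the state-to-country mapping are tiny illustrative stand-ins for the real lists, the function names are hypothetical, and the choice of Nominatim as the GeoPy backend is an assumption, since the text does not say which geocoding service was used. The geocoder is passed in as a callable so that the rule-based part runs without network access.

```python
# Illustrative subsets only; the actual lists are not given in the text.
COUNTRIES = {"united states", "india", "canada", "germany"}
STATE_TO_COUNTRY = {"california": "united states", "kerala": "india"}


def detect_country(location, geocode=None):
    """Return a lower-case country name, or None if no step succeeds."""
    parts = [p.strip().lower() for p in location.split(",") if p.strip()]
    if not parts:
        return None
    # Step 1: check the last comma-separated part against countries and states.
    last = parts[-1]
    if last in COUNTRIES:
        return last
    if last in STATE_TO_COUNTRY:
        return STATE_TO_COUNTRY[last]
    # Step 2: check the first part against country names only.
    if parts[0] in COUNTRIES:
        return parts[0]
    # Step 3: hand the entire string to a geocoder, if one was supplied.
    if geocode is not None:
        return geocode(location)
    return None


def nominatim_country(location):
    """Possible step-3 fallback using GeoPy's Nominatim client (needs network)."""
    from geopy.geocoders import Nominatim  # third-party package: geopy

    # The user agent string is a hypothetical placeholder.
    geolocator = Nominatim(user_agent="reading-dataset-example")
    result = geolocator.geocode(location, addressdetails=True, language="en")
    if result is None:
        return None
    country = result.raw.get("address", {}).get("country")
    return country.lower() if country else None
```

For example, `detect_country("Kochi, Kerala, India")` is resolved by the first step, while `detect_country(text, geocode=nominatim_country)` would only query the web service when both rule-based steps fail.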
Eventually, a total of 96.2% of user locations were detected. As 3.7% of users had an empty location field, only a tiny fraction of users did not have a usable location that could be mapped to a country. Figure 3 shows the distribution of these locations across the world. U.S.-based users (711,889 users, or 38% of the users in our dataset) and India-based users (163,521 users, or 8.7%) make up a large proportion of our dataset.
As shown in Figure 4, most users who join Goodreads are not active after the first month they join. However, there are users who have been active on the website for 13 years.
Table 3 displays the statistics of our dataset. Among the book additions, Harry Potter and the Sorcerer's Stone by J.K. Rowling is the most added book, followed by The Hunger Games by Suzanne Collins and To Kill a Mockingbird by Harper Lee. (While we have decided not to include any book titles in the public dataset, we include these three titles here to provide a sense of what is popular on Goodreads.) The most used tags for female and male users of the platform are shown in Table 4.

Anonymity
As previously mentioned, all data were collected through the official Goodreads API, only including information about public accounts. Despite the public nature of the data, we believe that data about individuals must be published only in an anonymized form to minimize the risk of harm to users whose information is included in the data release. Correspondingly, the released data do not include a user's name, their username, their precise location, or, in fact, any text input at all, as any free-text field might leak personally identifiable information. The user ID in the data release is a hash of the original ID, where the hash function includes a random "salt" to guard against lookup attacks.
A particular type of risk that we have tried to mitigate relates to identifying incidents of users reading "forbidden books", in particular books that are banned for their political, religious, or sexual content. Even though no personally identifiable information was included in the data release, we have chosen not to include the identity of any book or author in the released dataset to minimize this risk. For the same reason, we have decided to remove information about shelf names used by fewer than 200 distinct users, which included such shelf names as "LGBTQ" or "Erotica" (shelf names used by more than 200 distinct users, as well as the number of unique users who have used them, are shown in Table 5). However, we acknowledge the risk that globally popular books might be reidentifiable through their popularity level and their global distribution pattern.
Furthermore, we accept the possibility of reidentification through another Goodreads data collection. In other words, if an attacker were to collect a dataset similar to the one that we are sharing, they would likely be able to link users based on features such as their activity patterns. However, in that scenario, the attacker would not gain any additional benefit from having access to our particular data.
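The two anonymization measures described above, salted hashing of user IDs and the removal of rarely used shelf names, can be illustrated as follows. The text does not state which hash function was used, so HMAC-SHA256 with a secret random key is an assumption chosen here as one common way to realize a salted hash. The function names are hypothetical, and the threshold is read as "keep shelves with at least 200 distinct users", which is also an interpretation.

```python
import hashlib
import hmac
import secrets
from collections import defaultdict

# Random salt generated once; it must stay secret and is never released.
SALT = secrets.token_bytes(32)


def pseudonymize(user_id, salt=SALT):
    """Map an original user ID to a salted hash (HMAC-SHA256 is an assumption)."""
    message = str(user_id).encode("utf-8")
    return hmac.new(salt, message, hashlib.sha256).hexdigest()


def frequent_shelves(user_shelves, min_users=200):
    """Keep shelf names used by at least `min_users` distinct users.

    `user_shelves` is an iterable of (user_id, shelf_name) pairs; the result
    maps each retained shelf name to its number of distinct users.
    """
    users_per_shelf = defaultdict(set)
    for user_id, shelf in user_shelves:
        users_per_shelf[shelf].add(user_id)
    return {
        shelf: len(users)
        for shelf, users in users_per_shelf.items()
        if len(users) >= min_users
    }
```

Without the salt, an attacker could hash every integer in the known ID range and look up the original ID of each user; a secret salt makes this precomputation infeasible.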

Potential Use Cases
In its current form, we see the greatest value of the dataset in user-centric studies that make use of its international nature, as well as of the inferred gender. While fine-grained book information was withheld, knowing when, where, and what type of popular genre (i.e., shelf name) is being read and posted about could still serve as the basis of important studies on the interplay between country, gender, book genre, and activity patterns over time. For example, this dataset makes it possible to study how the reading behavior of book enthusiasts changes over time. In fact, the initial motivation for the creation of this dataset was to observe the gender-specific impact of the COVID-19 pandemic on the reading of women. Surprisingly, we did not find a clear pattern here, possibly due to the dominance of book enthusiasts, who might continue reading even when faced with calamity. Future work could examine the differences between the most popular book genres (see, for example, Table 4), analyzing temporal changes in what each gender uses most, and investigating if and how they are affected by real-world events.
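As a starting point for such analyses, reading instances could be counted per country, gender, and month. The sketch below is only an illustration: the record keys `country`, `gender`, and `date_read` are hypothetical names and would need to be mapped to the actual field names of the released files.

```python
from collections import Counter


def monthly_counts(records):
    """Count reading instances per (country, gender, year-month).

    `records` is an iterable of dicts with the hypothetical keys 'country',
    'gender', and 'date_read' (an ISO date string such as '2020-04-17').
    """
    counts = Counter()
    for record in records:
        date_read = record.get("date_read")
        if not date_read:
            continue  # skip instances without a recorded read date
        month = date_read[:7]  # 'YYYY-MM'
        counts[(record["country"], record["gender"], month)] += 1
    return counts
```

Comparing these counts across months before and during a real-world event, separately for each gender and country, is one simple way to look for the temporal effects discussed above.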
We acknowledge that many interesting use cases would require knowing additional information, such as a book's identity, or the text of a review, both of which were withheld from this data release as explained in the previous section. For well-specified use cases with a mission of social good, and where external, ethical review can be demonstrated, we invite researchers to contact the authors to discuss additional data access options.

Data Limitations
The collected dataset has certain limitations, including the following. Firstly, at the beginning of the collection, the API was queried using evenly spaced user identifiers rather than randomly sampled ones. This method of collection was later changed to sampling IDs uniformly from the ID space. However, due to the initial approach, some ID ranges, and hence some sign-up periods, were oversampled compared to others (please see Figure 1).
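One possible way for users of the dataset to compensate for this oversampling is to reweight by ID range. This is not part of the released dataset or of the collection procedure; it is a sketch under the assumption that original (unhashed) IDs or an equivalent sign-up period are available to form bins, and that a uniform sample over the ID space is the desired target. The function name and the binning scheme are hypothetical.

```python
from collections import Counter


def bin_weights(sampled_ids, max_id, n_bins):
    """Per-bin weight = share expected under uniform sampling / observed share.

    IDs 1..max_id are split into `n_bins` equally wide ranges. Bins without
    any sampled ID receive no weight.
    """
    width = -(-max_id // n_bins)  # ceiling division so every ID falls in a bin
    counts = Counter((uid - 1) // width for uid in sampled_ids)
    total = len(sampled_ids)
    weights = {}
    for bin_index, count in counts.items():
        low = bin_index * width + 1
        high = min((bin_index + 1) * width, max_id)
        expected_share = (high - low + 1) / max_id
        weights[bin_index] = expected_share / (count / total)
    return weights
```

Users in an oversampled range receive a weight below one, and users in sparsely sampled ranges a weight above one, so that each ID range contributes in proportion to its size.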
A technical limitation is that the API does not allow us to capture the dates of re-reads of the same book. In other words, if a user reads a book more than once, that "review" instance is updated to now reflect the new dates and status of the review. No new instance regarding that book is created; instead, each user only has one review instance for each book, which is updated whenever a change to the status of the book is made. Consequently, while we can detect the number of times they have read the book (using the "read-count" field in the dataset), we are not able to find when each round of reading took place, and only have access to the dates for the last time the book was marked as read. Information regarding the date at which the book was re-read is available on each review page on the website. However, to the best of our knowledge, this information cannot be collected through the API.
Finally, it is important to remember that the data represent the reading habits of avid readers, as joining a social network for books is not something done by the majority of readers. Any findings derived from these data will correspondingly need to be interpreted with the underlying user selection bias in mind.