Analysis of Concentrations of Loans by Using Book Circulation Data in Korea University Library

In this paper, data on almost 8 million book loans recorded over 15 years by the Korea University Library are analyzed by using big data analytic techniques. During this period, book circulation decreased at an average annual rate of 4.4%. The use factor of books in each Dewey decimal classification (DDC) class was evaluated to measure how efficiently books were used by library users. Loan frequencies of books were analyzed, and meaningful results regarding loan concentrations and the half-lives of books were obtained. It was observed that 50% of the total loans in each year were for 20% of all books borrowed in that year. This phenomenon will be called the 20/50 loan rule, and the set of the top 20% most borrowed books, whose cumulative loan frequencies reach 50% of total loans, will be called a core collection. The 20/50 loan rule shows the loan concentration of library books. The extent of loan concentration gets stronger if loans for two or more consecutive years are considered together. It was found that, with high probability, books in a core collection in a specific year are also categorized as a core collection in the following years. Moreover, books categorized as a core collection in consecutive years have longer half-lives than other circulating books.


Introduction
The library collection is the most important of the various elements that make up a library. Most libraries have difficulties in providing users with optimal collections due to their limited budgets and space problems caused by increases in the volume of books. As a result, collection development has become one of the most difficult tasks for librarians. To overcome these issues, libraries are trying to organize user-centered collections by analyzing book use patterns and grasping users' preferences for collections. Libraries have also invested a lot of effort and time in the development of collections so that users can effectively use the library materials needed for research and learning activities, since the collections of academic libraries are an important basis for research and education.
In addition, recent changes in the library information environment have diversified resource types and information delivery methods, and have changed the information-seeking behaviors of users affected by information technology [1][2][3]. Due to these effects, the use of e-books or electronic resources has increased every year, while the use of printed books has decreased every year. A number of studies have been conducted on comparing the utilization rates of printed books and e-books as well as on the decreasing purchasing of physical materials such as books [4]. It was found that developing a collection with printed books and e-books has different effects on users' book usage [5]. Rose-Wiles showed that the circulation of printed books in university libraries has been reduced by 23% in five years [6]. Thus, a more detailed analysis of circulation data is needed to develop print collections effectively as the use of electronic materials increases. For this, circulation data are often used to analyze the characteristics of a library's collection or to identify the current status of books' use. A library has a vast collection of materials, and at the same time, users produce vast amounts of circulation data every day using library materials. These circulation data in libraries are both big data and structured data that can be analyzed immediately. Because of these points, it has become increasingly important to identify the books borrowed intensively by library users and utilize them efficiently for the purpose of collection development and management in libraries.
Collection evaluation through the analysis of circulation statistics is also very important for accurately grasping the current status of collections in the library and predicting users' collection usage patterns, so that collections can be kept up to date and their quality improved. It is the basis for solving space problems, which are a serious concern for academic libraries, and for increasing the usability of collections by weeding out low-use materials [6,7]. Similarly, circulation statistics are an invaluable source of information for figuring out users' preferences for a specific subject in a collection, in that books are still an important communication channel between library users in all subject areas. Both librarians and users consider circulation statistics to be the primary criteria for assessing a library's value. That is, the analysis of circulation data gives direct insight into which books users actually borrow, which can be used as basic data for organizing collections and improving library services. Therefore, it is necessary to develop a method of analyzing data accumulated in a library by utilizing circulation data in various ways.
This study aims to find circulation patterns of library users based on the circulation data of print books with a big data analysis method. This paper is organized as follows: Section 2 reviews relevant studies on circulation patterns and analyses of library books and users. Section 3 introduces the data collection and methods. Section 4 describes the results of our research with analytical techniques. Then, Section 5 discusses our findings and directions for future research.

Related Work
Many studies have been conducted to analyze the circulation patterns of users based on a library's book collection statistics and circulation statistics. Walters et al. studied the differences among librarians, faculty, and students in the selection of library books [8]. Dinkins evaluated the collection development policy by using circulation statistics gathered in five departments for five years in order to check whether the books purchased through the library acquisition met the needs of users. The utilization rates of books selected by librarians were generally higher than those of books selected by professors in the departments. Therefore, it could be more desirable to select library materials through collaboration of librarians and professors in each department [9].
Using the one-year circulation data of university library users, Kim analyzed the use pattern of library collections of 16 subject areas and five user groups [10]. The preference for new materials was investigated for each subject and user group. Yoo et al. classified one year of circulation records of university libraries into 16 subjects and five user groups, and analyzed them in terms of use factors and publication years by using statistical analysis methods, such as correlation coefficients [11].
Galvin et al. analyzed the usage of books and academic journals to contribute to establishing the library development policy of the Pittsburgh University Library [12]. In research-oriented university libraries, interrelationships were analyzed in the aspect of budgets, usage of books and academic journals, and the aging rate of literature. Of the total collections, 40% of books were never checked out and 73% of the loaned books were borrowed within two years of purchase. This is clear evidence of the preference for the recently published books. Moreover, such a phenomenon had shown almost the same pattern for six years. Based on this analysis, books checked out for the first time within 6-7 years of purchase and the aging patterns of book usage could be predicted.
Adams et al. used one-year circulation statistics obtained from the Indiana University Science Library to analyze circulation patterns for collections depending on publishers, publication years, and subjects [13]. Books published earlier tended to be checked out more than recently published books. It was also found that books of a specific publisher were rarely checked out, and thus, the publisher is a factor influencing the circulation pattern. Using five-year book loan statistics from the Lied Library at Nevada University, Tucker conducted a comparative analysis of collection usage patterns for books purchased by liaison librarians and books purchased through the library acquisition [2]. The lending of the entire collection had been decreasing every year for five years, and the usage rate of books purchased through the liaison librarian was also low. So, there was a need to redistribute the book budget because of the decrease in book usage.
Cheung et al. analyzed the loan patterns of books obtained from the university library's 15-year checkout statistics in various categories defined by year, language, and Library of Congress classification to determine the utilization rate of books [14]. It was found that the utilization rate for the first 5-7 years after the book was purchased tended to increase, and then decreased slightly. In addition, books that had not been loaned within the first 1-2 years after they were purchased had never been loaned for 15 years, and the usage rate of books purchased by the library was higher than that of the donated books. In addition, one-third of all books purchased at the library were never checked out. So, books that were not circulated within the first 5-7 years could be stored in a separate storage library to solve the problem of insufficient space.
Some studies focused on developing a core collection using citation analysis for the purpose of collection development. Knievel et al. used 9131 citations included in 2002 journals to analyze citation patterns [15]. The literature cited in the science and technology field was predominantly from academic journals, but most of the citations in the humanities field were to books. It was found that such citation analysis could be immediately applied to library collection development policies. Black derived core journals based on the form of citations in specific subject areas and the distribution of cited academic journals by publication year [16]. Scholarly journals and books were typical forms of citations, with 13% of scholarly journals accounting for 80% of citations. In addition, the period for which articles in academic journals were mainly cited was observed to be at least eight years. Keeping track of important journals and books through citation analysis helped researchers save time in finding the literature.
Trueswell conducted a number of studies to determine a core collection of books [17,18]. He analyzed circulation transaction data to form the optimal core collection out of total library holdings. He introduced a ratio of the last circulation date and the loan period, and obtained a cumulative distribution of the resultant ratio. It became possible to predict the core collection satisfying 99% of users' requirements and to reduce the volume of library holdings by 60% to 70% at the same time. He also refined the pattern of determining the core collection by using circulation transaction data obtained from university libraries [19]. He asserted the necessity of selecting a core collection as an objective criterion for distinguishing between books that were highly circulated and books to be stored in stacks in the library. In addition, by examining the patterns of library users' behaviors, Trueswell demonstrated the 80/20 rule, meaning that 80% of usage can be attributed to 20% of the collection [20].
Oh et al. analyzed the usability of collections for each subject in terms of Bonn's use factor, turnover rate of the collection, and Trueswell's 80/20 rule by using circulation data for 10 years of the Gwangjin District Public Library [21]. They found core collections from the borrowed books according to use patterns and presented the basis for effective collection development policy. This analysis of circulation data could be used as a collection evaluation index to contribute to collection development and collection evaluation.
Through the previous studies introduced above, it was found that the circulation data can be used for establishing collection evaluation and collection management policies by analyzing the usage patterns of books based on various criteria, such as subject, language, publication year, etc. This study, however, is not limited to describing the technical analysis of the loan data, but rather investigates core collections in which loans intensively occur and users' loan patterns from specific perspectives. Furthermore, the variations in loan concentration depending on all subject areas are explored and the half-lives of books are also analyzed.

Data Collection and Methods
In this paper, circulation data of books recorded for 15 years by the library of Korea University, one of the top-ranked research-oriented universities in Korea, are analyzed. The Korea University Library held 2,616,202 books as of 2019, which is at a very high level among universities in Korea. Collection statistics and circulation data were gathered for each fiscal year from 2005 to 2019. The circulation data were extracted through Tulip, a library integration system, and written in a text format. Each entry of data contains various fields, including title, Dewey decimal classification (DDC), author, publisher, publication year, loan and return date for each transaction, number of renewals, number of multiple copies, etc. The original circulation data of all fields acquired were pre-processed and refined to exclude entries containing erroneous records or missing values. In-house use of books could not be quantified because there is no system for monitoring in-house uses of books in the library. Thus, this study analyzed only books that were checked out and did not include in-house use.
Library users were grouped according to their affiliations with humanities/social sciences, science, engineering, and others. Circulation data were analyzed by subject categories based on the DDC, which is a system for organizing all subjects of books from general to specific, indicated by numbers. The DDC has ten main classes, and the first digit in each three-digit number represents the main class: 000 stands for computer science, information, and general works, 100 for philosophy and psychology, 200 for religion, 300 for social sciences, 400 for language, 500 for science, 600 for technology, 700 for art and recreation, 800 for literature, and 900 for history and geography. To analyze the 15-year circulation data collected for this study, the open-source language R, which is widely used for big data analysis and data mining, was utilized.
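The mapping from a DDC number to its main class can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function name `ddc_main_class` and the assumption that DDC numbers arrive as strings such as "813.54" are hypothetical.

```python
# Main DDC classes as listed in the text, keyed by the class number.
DDC_MAIN_CLASSES = {
    0: "computer science, information, and general works",
    100: "philosophy and psychology",
    200: "religion",
    300: "social sciences",
    400: "language",
    500: "science",
    600: "technology",
    700: "art and recreation",
    800: "literature",
    900: "history and geography",
}

def ddc_main_class(ddc_number: str) -> int:
    """Return the main class (0, 100, ..., 900) of a DDC number given as text."""
    integer_part = ddc_number.strip().split(".")[0]
    # The integer part of a DDC number has three digits; pad it in case
    # leading zeros were lost during export (assumption about the data).
    integer_part = integer_part.zfill(3)
    return int(integer_part[0]) * 100

print(DDC_MAIN_CLASSES[ddc_main_class("813.54")])  # literature
```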

General Statistics of Circulation Data
The number of books registered in the Korea University Library increased gradually every year. It increased from 1,527,355 in 2005 to 2,616,202 in 2019, and the average number of books per year was approximately 2.2 million during this period. On the other hand, the number of loans extracted for each year decreased gradually. It decreased from 631,346 in 2005 to 330,584 in 2019, and the total number of loans for 15 years reached 7,983,869. As for the collection status by subject, the largest collections were in the fields of 300 social sciences and 800 literature, while the smallest were in the fields of 400 language and 200 religion. Considering the large variations in the number of books in each subject area, it is not appropriate to simply analyze the number of loans to measure the efficiency of book use. To resolve this problem, this paper uses the use factor proposed by Bonn to measure how books in a specific category are utilized by library users [22]. The use factor of books in a specific category m is defined by Equation (1). A high ratio of the number of loans to the number of books in a specific subject category indicates that books in that category are actively used in terms of loans. The higher the use factor of a specific category is, the more actively books in this category are used by library users. If the use factor is greater than 1, the category is evaluated as an actively used collection. The number of books in the library and the number of loans recorded from 2005 to 2019 are summarized in Table 1.
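As a hedged illustration, Bonn's use factor is commonly written as the ratio of a category's share of loans to its share of holdings. The symbols below are notation assumed here for illustration and are not necessarily those of Equation (1): $L_m$ is the number of loans in category $m$, $B_m$ is the number of books in category $m$, and the sums run over all categories.

```latex
U_m = \frac{L_m \big/ \sum_{k} L_k}{B_m \big/ \sum_{k} B_k}
```

Under this form, $U_m$ is proportional to the loans-per-book ratio $L_m / B_m$, and $U_m > 1$ means that category $m$ accounts for a larger share of loans than of holdings, which agrees with the threshold of 1 used in the text.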
It was observed that books in 300 social sciences and 800 literature were the most frequently borrowed, while books in 200 religion were the least borrowed among all subjects. Considering the number of books in each subject, however, books in 500 science seemed to be borrowed relatively less than others. The books in 500 science showed the lowest use factor among all DDC classes. On the other hand, 800 literature, 700 art and recreation, and 100 philosophy and psychology showed relatively high use factors. The use factor can be effectively used by libraries to establish their policies for purchase and acquisition of new books. For example, a library needs to purchase books in DDC classes with high use factors each year because they are relatively frequently borrowed compared with the number of books.

Concentration of Loans and the 20/50 Loan Rule
The number of borrowed books and the loan frequency of each borrowed book were investigated for 15 years from 2005 to 2019. Books with multiple copies were considered a single book, and the loan frequency was obtained by adding the loan frequencies of all copies. After each book's loan frequency was computed in this manner, books were ranked in descending order of loan frequency, and the ranked loan frequencies were plotted as in Figure 1, where the x-axis denotes the percentage ranking of books in terms of loan frequency. Note that Figure 1 was obtained by using book loan data of the year 2010. It was observed that the loan frequencies arranged in descending order followed a power law distribution. This implies that only a small proportion of books were intensively borrowed, and they accounted for a high proportion of total loans. In short, loans were highly concentrated on a small proportion of books. To see the extent of loan concentration quantitatively, the cumulative distribution of loan frequency with respect to the percentage ranking of books was investigated. In Figure 2, cumulative distributions of one-year loan frequencies arranged in descending order are plotted, where for visual clarity, cumulative distributions for only four sample years, 2007, 2010, 2013, and 2018, are plotted. It was observed that every year, the cumulative distribution of loan frequency had an almost identical shape, and 50% of total loans were concentrated on the top 20% most borrowed books. This observation will be called the 20/50 loan rule [23]. The top-ranked most borrowed books accounting for half of the total loans for each year will be called the core collection [23]. The proportions of the core collection out of all borrowed books for each year from 2005 to 2019 are listed in Table 2. The proportion of the core collection among all borrowed books varies year by year, but it is around 20%. Thus, the 20/50 loan rule is observed to be satisfied every year.
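The procedure described above (merge copies, rank by loan frequency, accumulate until 50% of loans is reached) can be sketched as follows. This is a minimal Python illustration, not the R code used in the study; the function names and the assumption that all copies of a title share one book id are hypothetical.

```python
from collections import Counter

def loan_frequencies(loan_records):
    """Count loans per book; copies of one title are assumed to share a book id."""
    return Counter(loan_records)

def core_collection(frequencies, share=0.5):
    """Return the top-ranked books whose loans reach `share` of all loans,
    together with their proportion among all borrowed books."""
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return [], 0.0
    target = share * sum(count for _, count in ranked)
    core, cumulative = [], 0
    for book_id, count in ranked:
        if cumulative >= target:
            break
        core.append(book_id)
        cumulative += count
    return core, len(core) / len(ranked)

# Toy data: one loan record per checkout, identified by book id.
records = ["A"] * 30 + ["B"] * 10 + ["C"] * 5 + ["D"] * 3 + ["E"] * 2
core, proportion = core_collection(loan_frequencies(records))
print(core, proportion)  # ['A'] 0.2
```

In this toy example, one book out of five accounts for at least half of the 50 loans, so the core collection proportion is 20%.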
For a more detailed analysis of the loan concentration, loans in each DDC class were investigated. Loan frequencies of individual borrowed books belonging to each DDC class were counted. Then, the cumulative distribution of loan frequency arranged in descending order was obtained, and the percentage ranking of books corresponding to the cumulative distribution of 50% was determined. The set of books whose rankings were higher than or equal to the determined ranking was defined as the core collection in the DDC class under investigation. Table 3 lists percentage rankings of books corresponding to the 50% point of cumulative distribution in each DDC class. It was observed that 50% of loans of one year in each DDC class were concentrated on approximately 20% of borrowed books, and thus, the 20/50 loan rule was generally satisfied in each DDC class. It is notable that an equivalent level of loan concentration was observed for all DDC classes. The circulation pattern of books thus appears to be nearly identical across DDC classes, although each DDC class shows different numbers of books, different numbers of loans, and different use factors. It was also observed that proportions of core collections in 500 science and 800 literature were lower than those in other DDC classes. Thus, 500 science and 800 literature experienced higher loan concentration than others. Note that a low proportion of core collections indicates high loan concentration. On the other hand, 200 religion and 700 art and recreation showed lower loan concentrations because their core collections were defined by higher proportions of borrowed books than others.
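The per-class computation can be sketched in the same way. This is a minimal Python illustration under assumed inputs: a hypothetical mapping from book id to a pair of DDC main class and one-year loan count.

```python
from collections import defaultdict

def core_proportion(counts, share=0.5):
    """Proportion of top-ranked books needed to reach `share` of all loans."""
    ranked = sorted(counts, reverse=True)
    target = share * sum(ranked)
    cumulative = 0
    for n_books, count in enumerate(ranked, start=1):
        cumulative += count
        if cumulative >= target:
            return n_books / len(ranked)
    return 0.0  # no borrowed books in this group

def core_proportion_by_class(loans):
    """loans: dict mapping book id -> (ddc_main_class, loan_count)."""
    by_class = defaultdict(list)
    for ddc_class, count in loans.values():
        by_class[ddc_class].append(count)
    return {ddc_class: core_proportion(counts)
            for ddc_class, counts in by_class.items()}

# Toy data for two DDC classes.
loans = {
    "a": (800, 6), "b": (800, 2), "c": (800, 1), "d": (800, 1),
    "e": (500, 3), "f": (500, 3),
}
print(core_proportion_by_class(loans))  # {800: 0.25, 500: 0.5}
```

A lower value for a class indicates a stronger loan concentration in that class, as noted in the text.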

Variation of Loan Concentrations by Adding or Merging Categories
When books are grouped by a certain classification, it is possible that a single book can belong to multiple groups. If this happens in the core collection, the extent of loan concentration is expected to get weaker. As introduced in Section 4.2, classifying books by DDC class does not significantly change the extent of loan concentration of all books. Other kinds of classification, however, can influence the extent of loan concentration.
To investigate the variation of loan concentration as a result of adding categories, grouping books according to users' affiliations is considered. Table 4 lists the proportions of core collections according to the affiliations of users for each year, where it was observed that the proportion of the core collection exceeded 20%. This shows that weaker concentrations of loans resulted from dividing books into groups. Nevertheless, the proportions of core collections did not deviate seriously from 20%, which shows that the 20/50 loan rule was still satisfied in general. Next, we considered merging categories, which makes the resultant loan concentration stronger. Merging categories was achieved by merging loan data of consecutive years when counting loan frequencies of borrowed books. The number of consecutive years merged together will be called the window size. The relationship between loan concentration and the window size was investigated by varying the window size from one year to 15 years. By applying a sliding window technique, the average value of loan frequency for a given window size was computed. Figure 3 shows the cumulative distributions of average loan frequencies arranged in descending order with respect to the percentage ranking of books. As the window size grew, 50% of loans were accounted for by lower proportions of borrowed books. In other words, the core collection had higher dominance as the window size grew. This tendency is also shown in Table 5, which lists the average proportion of books accounting for half of the total loans for a given number of consecutive years, i.e., window size. The above observation is explained by the fact that books in a core collection in a certain year belong to a core collection in the next year with high probability. Table 6 lists the average proportion of books in a core collection in a certain year that are also categorized as a core collection in the following years.
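One possible reading of the window-merging step is sketched below in Python. It is an assumption-laden illustration, not the study's R code: loan counts of consecutive years are summed per book, the core collection proportion is computed for each window position, and the proportions are averaged over all positions. The function names and the input layout (one dict of book id to loan count per year) are hypothetical.

```python
from collections import Counter

def core_proportion(counts, share=0.5):
    """Proportion of top-ranked books needed to reach `share` of all loans."""
    ranked = sorted(counts, reverse=True)
    target = share * sum(ranked)
    cumulative = 0
    for n_books, count in enumerate(ranked, start=1):
        cumulative += count
        if cumulative >= target:
            return n_books / len(ranked)
    return 0.0

def average_core_proportion(yearly_loans, window):
    """yearly_loans: chronological list of dicts, book id -> loans in that year."""
    if not 1 <= window <= len(yearly_loans):
        raise ValueError("window must be between 1 and the number of years")
    proportions = []
    for start in range(len(yearly_loans) - window + 1):
        merged = Counter()
        for year_counts in yearly_loans[start:start + window]:
            merged.update(year_counts)  # adds loan counts book by book
        proportions.append(core_proportion(merged.values()))
    return sum(proportions) / len(proportions)

# Toy data: book "A" stays popular in both years, the others change.
yearly = [
    {"A": 5, "B": 1, "C": 1, "D": 1},
    {"A": 5, "B": 1, "E": 1, "F": 1},
]
print(average_core_proportion(yearly, window=1))  # 0.25
print(average_core_proportion(yearly, window=2))
```

In the toy data, merging the two years lowers the proportion from 1/4 to 1/6, because the same book leads in both years while the set of borrowed books grows.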
For obtaining average proportions, a sliding window technique was applied. At a certain year, a core collection was determined and stored. Then, the proportions of books in the core collection that satisfied each of the following three properties were obtained:
• They were also categorized as a core collection in the next year,
• They were also categorized as a core collection in at least one year in the next two years,
• They were also categorized as a core collection in the next two consecutive years.
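The three proportions listed above can be sketched with set operations. This Python illustration assumes the yearly core collections are already available as sets of book ids; the function name is hypothetical, and averaging only over base years that have two following years is an assumption made here for simplicity.

```python
def core_persistence(core_by_year):
    """core_by_year: chronological list of sets of book ids, one core collection per year.

    Returns the average proportions of core books that are again in a core collection
    (1) in the next year, (2) in at least one of the next two years,
    (3) in both of the next two years.
    """
    next_year, at_least_one, both = [], [], []
    for i in range(len(core_by_year) - 2):
        base = core_by_year[i]
        if not base:
            continue  # skip years without a core collection
        following1, following2 = core_by_year[i + 1], core_by_year[i + 2]
        next_year.append(len(base & following1) / len(base))
        at_least_one.append(len(base & (following1 | following2)) / len(base))
        both.append(len(base & following1 & following2) / len(base))

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(next_year), mean(at_least_one), mean(both)

cores = [{"A", "B", "C", "D"}, {"A", "B", "E"}, {"A", "C", "F"}]
print(core_persistence(cores))  # (0.5, 0.75, 0.25)
```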
It was observed that, with high probability, books in a core collection in a certain year were also in a core collection in the next year. Thus, doubling the window size does not guarantee doubling the volume of core collection. The growing speed of the volume of core collection was slower than the growing speed of the window size of consecutive years. It follows that the loan concentration gets stronger as the window size of consecutive years grows. The 500 science and 800 literature classes showed a higher extent of loan concentration with respect to the window size than other DDC classes.

Half-Lives of Books
In general, books are intensively borrowed right after they are published and shelved in the library [10][11][12][14]. Thus, the loan frequency of books is highly correlated with their book ages, where a book age is defined by the number of years elapsed after the book is published. The typical loan frequency of books exponentially decays with respect to the book age, as shown in Figure 4. A slowly decaying pattern of loan frequency implies that library users utilize the book for a long time.
In physics, the half-life or half-life duration was introduced as the period of time required for one half of the atomic nuclei of a radioactive material to decay [24]. It can also be regarded as the time interval required for a certain entity to have the cumulative distribution of 50%. In addition, many previous studies have defined the half-life as a measure to indicate the life span of academic journals [25,26]. People want to evaluate books in terms of the time duration in which they are intensively utilized by library users [23]. As a result, the half-life was adopted in the field of library and information sciences as a useful parameter to measure characteristics of library books. Figure 5 shows a typical cumulative distribution of the loan frequency of books, which was obtained by using loan data in 800 literature during the years from 2005 to 2019. The half-life is defined as the book age corresponding to the 50% point of cumulative distribution. After plotting the cumulative distribution of loan frequency as in Figure 5, a curve-fitting technique was applied to obtain an equation describing the cumulative distribution with respect to the book age [23]. Then, by solving the equation, the half-life was determined as the book age corresponding to 50% of the cumulative distribution. Table 7 lists the values of half-lives of books belonging to each DDC class for the years from 2005 to 2019. It was observed that books in 200 religion and 400 language had longer half-lives than books in other DDC classes. It is notable that books belonging to 300 social sciences have short half-lives, though they have high loan frequency, as listed in Table 1. The values for the half-life of a core collection in consecutive years are listed in Table 8. As shown in Table 8, core collections in consecutive years had longer half-lives than other books.
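The half-life definition can be illustrated with a short Python sketch. Note the substitution: the fitted equation used in the study is not given here, so this sketch replaces curve fitting with linear interpolation between the two book ages that bracket the 50% point of the cumulative distribution. The function name and the input layout (book age in years mapped to number of loans) are hypothetical.

```python
def half_life(loans_by_age):
    """loans_by_age: dict mapping book age (years) -> number of loans at that age.

    Returns the book age at which the cumulative share of loans reaches 50%,
    using linear interpolation (an assumption, in place of curve fitting).
    """
    total = sum(loans_by_age.values())
    if total <= 0:
        raise ValueError("no loans to analyze")
    cumulative = 0
    prev_age, prev_share = None, 0.0
    for age in sorted(loans_by_age):
        cumulative += loans_by_age[age]
        share = cumulative / total
        if share >= 0.5:
            if prev_age is None:
                return float(age)  # half of all loans occur at the youngest age
            # Interpolate between the previous point and the current point.
            return prev_age + (0.5 - prev_share) * (age - prev_age) / (share - prev_share)
        prev_age, prev_share = age, share

# Toy data with decaying loan counts by book age.
print(half_life({0: 40, 1: 30, 2: 20, 3: 10}))
```

For the toy data, the cumulative share is 40% at age 0 and 70% at age 1, so the interpolated half-life is about one third of a year. A flatter decay of loans over book age would give a longer half-life.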
This observation follows the observation provided in Table 6 that books in a core collection at a certain year were classified as a core collection in the next years with high probability. Thus, the analysis of half-lives can also explain why loan concentration gets stronger as the time window size grows.

Discussion
In this paper, the loan concentrations of print books in a library were intensively analyzed. It was found that a majority of loans were attributed to a small portion of the collection, called a core collection. Library users steadily utilized this core collection, causing the core collection to have a longer half-life than other books.
Circulation data regarding loan activities recorded for 15 years in the Korea University Library were thoroughly investigated and analyzed by using big data analytic techniques. The Korea University Library is one of the top-ranked research libraries in Korea, and it held 2,616,202 books as of 2019. The use factor of books in each DDC class for each year was evaluated to quantify how efficiently books were used by library users in terms of loans. A high use factor of books in a specific DDC class means a high number of loans per book in that class. Thus, a library needs to purchase books in such a class to improve the efficiency of book usage.
The loan frequencies of books were the main ingredients of this study, and they were analyzed from various perspectives to obtain useful results. The loan concentrations and half-lives of books are examples of practical observations gained from the analysis of loan frequency. Every year, 50% of total loans were accomplished by 20% of all borrowed books. This observation is named the 20/50 loan rule. It was observed that the 20/50 loan rule was valid for the 15 years from 2005 to 2019. This means that usage patterns of library users were highly consistent and very predictable for the core collection. This rule also held for each DDC class every year. A core collection is also defined as books whose cumulative loan frequencies correspond to 50% of total loans in each year. The existence of a core collection may provide libraries with a guideline for plans of service, acquisition, and collection management. As such, libraries may operate a core collection section to offer library users quick and easy access to popular books.
It was observed that the extent of loan concentration gets stronger if multiple categories of books are merged as one. Lengthening the window size for looking into loan frequencies is an example of merging categories. If loan frequencies of books are counted for two or more consecutive years, the proportion of borrowed books accomplishing 50% of loans is lower than 20%. Thus, the loan concentration gets stronger in this case. On the other hand, if books are further classified by extra categories, the loan concentration gets weakened. If books are further categorized according to affiliations of users, the proportion of books regarded as a core collection in each category is higher than 20%. This indicates weaker concentration of loans.
This study also explains why a longer window size in loan frequency analysis yields stronger loan concentration. With high probability, books in a core collection in a specific year are categorized as a core collection in the following years. Books included in core collections of consecutive years have longer half-lives than ordinary books. These observations explain why loan concentration becomes stronger as the window size used for loan frequency analysis increases. In this paper, loan frequency is thoroughly analyzed and many meaningful results are obtained. Loan duration is regarded as another meaningful factor to judge how actively books are utilized in terms of loans. Thus, an analysis of the loan durations of borrowed books will be performed as future research. Since library users have distinct maximum loan periods according to their status, e.g., undergraduate student, graduate student, faculty, and staff, more sophisticated analysis techniques may be required in future studies.