Upscaling Statistical Patterns from Reduced Storage in Social and Life Science Big Datasets

Recent technological and computational advances have enabled the collection of data at an unprecedented rate. On the one hand, the large amount of data suddenly available has opened up new opportunities for data-driven research; on the other hand, it has brought to light new obstacles and challenges related to storage and analysis limits. Here, we strengthen an upscaling approach borrowed from theoretical ecology that allows us to infer, with small errors, relevant patterns of a dataset in its entirety, although only a limited fraction of it has been analysed. In particular, we show that, after reducing the input amount of information on the system under study, by applying our framework it is still possible to recover two statistical patterns of interest of the entire dataset. Tested against big ecological, human-activity and genomics data, our framework was successful in reconstructing global statistics related to both the number of types and their abundances, while starting from limited presence/absence information on small random samples of the datasets. These results pave the way for future applications of our procedure in different life science contexts, from social activities to natural ecosystems.


Introduction
In the last few centuries, the advancement of human knowledge in the natural and life sciences was hampered by the limited possibilities of collecting data and making them available to the entire research community. Thanks to technological developments in the modern era, science is facing the opposite issue. Nowadays, it is possible to collect large amounts of data at very high speed and at ever lower cost [1][2][3]. Hence, the problem has shifted to the storage of an extensive volume of information in datasets of feasible dimensions for fruitful analysis [4,5]. Moreover, in some cases, this abundance has paradoxically slowed down scientific research, since it is becoming harder to analyse such a huge quantity of data with standard techniques and to extract meaningful information from them [1,2,4,6,7]. This is particularly true in the novel research field of genomics: at the dawn of these studies, the collection of data from DNA sequencing was indeed a limiting factor for making new discoveries and observations. In contrast, thanks to next-generation technologies, today the main obstacle is the storage of new fast-coming data that require massive technological and computational resources to be processed and analysed [5,8–13].
We remark that other procedures exist in the literature to deal with the storage problem for big data, such as data compression techniques [32]. These usually store the entire data stream but reduce the number of bits needed to encode and register the information. The implementation of such algorithms requires advanced computational knowledge and large technological efforts. In contrast, our starting point is the reduction of the data storage requirements. Indeed, we propose an upscaling procedure that, starting from local binary information on the presence/absence of types in small random samples of the whole incoming data, allows us to reconstruct basic statistical properties of the entire dataset. This step of storing only partial information on a dataset usually leads to a loss of quality in the predicted quantities. Nevertheless, since the framework is based on basic statistical notions, the advantage of doing so is an easier implementation of the method.

Materials and Methods
In this section, in a completely general setting, we derive a new upscaling framework for the estimation of key quantities and statistical properties of a system, be it natural or artificial. In particular, depending on the considered dataset, we will use the term item to refer to the forest trees, the sent emails, the Twitter posts, the Wikipedia/Gutenberg words and the human polymorphisms. We then call type the label with which each item is associated: the species to which the tree belongs, the user who sent the email, the hashtag contained in the Twitter post, the word occurring in the Wikipedia pages or in the Gutenberg books and the nucleotide variant observed along a human genome.

Theoretical Framework
Let us now assume we have access to a database containing a huge number of items, N, among which T types are observed. Thanks to this information, one may gain insights into relevant statistical patterns characterising the global system, such as the type frequency of frequencies distribution (TFFD), which describes how the total number N of items characterising the system is distributed among the T types. Our goal is to reduce the amount of information we need to store without preventing an accurate statistical description of the whole database, in order not to miss key properties of the system under study. Let us thus extract from our original dataset S non-overlapping and equally-sized samples, each consisting of a random fraction q of the N system items. Moreover, let us suppose that, for each sample, we only save the list of distinct types observed among its qN items. Let us denote with T^(s) the total number of different types found in the s-th sample. We remark that, with this operation, we passed from the storage of N elements to the storage of T^(1) + T^(2) + · · · + T^(S) ≪ N elements, apparently losing information on the total number of types T originally present in the database and on their frequencies. We wish to show that such fundamental statistics can be successfully recovered from the small amount of data we chose to store.
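As an illustration of this storage-reduction step, here is a minimal Python sketch (with hypothetical function and variable names, not the code used for the paper) that extracts S non-overlapping random samples of fraction q and keeps only the set of distinct type labels seen in each:

```python
import random

def reduce_storage(items, S, q, seed=0):
    """Keep only the sets of distinct type labels observed in S
    non-overlapping random samples, each a fraction q of the data.

    `items` is the full list of type labels (one label per item);
    the returned S sets are all that needs to be stored.
    """
    rng = random.Random(seed)
    idx = list(range(len(items)))
    rng.shuffle(idx)                      # random, non-overlapping blocks
    k = int(q * len(items))               # number of items per sample
    samples = []
    for s in range(S):
        block = idx[s * k:(s + 1) * k]
        samples.append({items[i] for i in block})  # presence/absence only
    return samples
```

Storing these sets requires at most T labels per sample, so the total footprint is bounded by S·T rather than N.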
First, since we have preserved information on the type labels observed in each sample, it is possible to compute, at least locally, how the number of types grows with the fraction of collected system items. More precisely, at each scale mq, m = 1, 2, . . . , S, one can compute the average number of types observed when aggregating m of the S samples. Denoting this latter quantity by T_mq, this procedure leads to the curve of points (mq, T_mq) describing how the number of observed types increases with the portion of collected items. This curve is a well-studied pattern in theoretical ecology, going under the name of species-accumulation curve or species-discovery curve [25,26,33,34]. In our more general setting, we will call it the type-discovery curve (TDC). Our first goal was to find an estimator for T, the total number of types characterising the global system.
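The construction of the empirical TDC from the stored presence/absence sets can be sketched as follows (again a minimal illustration with hypothetical names; for large S one would average over random subsets of size m rather than over all C(S, m) combinations, which quickly become too many):

```python
from itertools import combinations
from statistics import mean

def empirical_tdc(samples, q):
    """Average number of distinct types seen when aggregating
    m = 1, ..., S of the stored presence/absence samples.

    Returns the list of points (m*q, T_mq) forming the empirical
    type-discovery curve up to the scale p* = S*q.
    """
    S = len(samples)
    curve = []
    for m in range(1, S + 1):
        # union of the label sets of each m-subset of samples
        t_vals = [len(set().union(*combo))
                  for combo in combinations(samples, m)]
        curve.append((m * q, mean(t_vals)))
    return curve
```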
Following the statistical framework developed in [25,26,35], we started by considering the global TFFD, which is modelled according to a truncated negative binomial of parameters r and ξ:

P(n | r, ξ) = c(r, ξ) binom(n + r − 1, n) ξ^n (1 − ξ)^r,   n ≥ 1,   (1)

where c(r, ξ) = (1 − (1 − ξ)^r)^{−1} is the normalisation constant. We recall that the r parameter, usually referred to as the clustering coefficient in ecology due to its relation with individuals' aggregation, lies in (−1, +∞), whereas the ξ parameter lies in (0, 1). The TFFD represents another widely-known pattern in ecology, the relative species abundance distribution, describing how the total number of trees populating an ecological system is distributed among the present species [25,26,36,37]. Distribution (1) has been proven to provide a very good description of the empirical data both in an ecological context and for human-activity data [25,26,35]. Indeed, by tuning the value of the clustering parameter r, it allows us to capture both exponential (positive r) and heavy-tailed (r ∈ (−1, 0)) behaviours. Moreover, it has the advantage of satisfying the so-called form-invariance property, which guarantees that it maintains the same functional form at different scales. Indeed, denoting with p a fraction of items randomly collected with respect to the total, the local TFFD is also a negative binomial with the same clustering parameter r as at the global scale and with a re-scaled parameter ξ_p, which is a function of the global ξ and of p [25,26,35]:

ξ_p = p ξ / (1 − (1 − p) ξ).   (2)

This form-invariance property has been empirically observed in data from both forest ecosystems [25,35] and human activities [35], and it makes distribution (1) suitable for inference problems, since it allows us to find an easy estimator for the number of system types on the basis of a rigorous mathematical framework. Indeed, let us now assume that one has collected a fraction p* of the N items, together with the associated types.
Then, one could perform a maximum-likelihood estimation to recover the r and ξ parameters via a fit of the empirical TFFD, as done in [25,35]. Of course, this step can only be performed when local information on type abundances is available. In contrast, when only information on the presence/absence of types in samples is stored, an alternative approach is needed, since the TFFD cannot be extracted from the data. In [26], a different method based on the fit of the local empirical TDC was proposed to overcome this lack of information. Here we wished to combine the two approaches and to develop a novel estimator for T.
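As a quick numerical sanity check of the form-invariance property, the rescaling relation of [25,26,35] (written here, under our reading of those references, as ξ_p = pξ/(1 − (1 − p)ξ)) must be self-consistent under repeated subsampling: drawing a fraction p₂ of a fraction p₁ must be equivalent to a single draw of a fraction p₁·p₂:

```python
def xi_rescaled(xi, p):
    # form-invariance rescaling: a random fraction p of the items
    # again follows a negative binomial with the same r and with
    # this re-scaled xi parameter (relation quoted from [25,26,35])
    return p * xi / (1.0 - (1.0 - p) * xi)

# two-step subsampling (p1 = 0.5, then p2 = 0.2) must match a
# one-step subsampling at p1 * p2 = 0.1
xi = 0.999
two_step = xi_rescaled(xi_rescaled(xi, 0.5), 0.2)
one_step = xi_rescaled(xi, 0.1)
```

The two values agree to machine precision, confirming that the rescaling composes consistently across scales.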
We started by computing the log-likelihood function associated to the TFFD (1), given the vector of abundances n = (n_1, n_2, . . . , n_T) of the T types occurring in the whole system. Denoting with T_n the number of types having abundance n at the global scale, the log-likelihood function associated to (1) can be computed as follows:

L_{r,ξ}(n) = Σ_{n≥1} T_n ln[ c(r, ξ) binom(n + r − 1, n) ξ^n (1 − ξ)^r ].   (3)

The maximum-likelihood estimators for the TFFD parameters ξ and r are those maximising the log-likelihood function:

(r̂, ξ̂) = argmax_{r,ξ} L_{r,ξ}(n).   (4)

In particular, since both r and ξ are continuous parameters, the maxima of L_{r,ξ}(n) are critical points and can thus be found by setting the partial derivatives of L_{r,ξ}(n) equal to zero. Focusing on ∂L_{r,ξ}(n)/∂ξ, one ends up with the following equation:

−Tr [ 1/(1 − ξ) + (1 − ξ)^{r−1}/(1 − (1 − ξ)^r) ] + N/ξ = 0.   (5)

Under the assumption N ≫ 1, Equation (5) can be understood in a deterministic sense. Indeed, since the TFFD expressed in (1) is uni-modal, the relative errors on the maximum-likelihood estimators of the parameters are small [38]. Equation (5) can thus be inverted to heuristically obtain an estimator for T, hereby denoted with T̂:

T̂ = N (1 − ξ)(1 − (1 − ξ)^r) / (r ξ).   (6)

Let us see how to exploit the above formula. We have already noticed that having stored information on the presence/absence of types in S scattered samples of a system, each containing a fraction q of the global number of system items, is sufficient to construct the empirical TDC from the local scale q up to the scale p* := Sq. Moreover, with arguments similar to those used to obtain Equation (6), we have that, at any given sub-scale p = mq, m = 1, . . . , S, the number of types T_p at that scale is given by

T̂_p = N_p (1 − ξ_p)(1 − (1 − ξ_p)^r) / (r ξ_p),   (7)

where N_p is the total number of items observed at the scale p. Such a quantity can also be expressed as a function of the total number of items counted by aggregating all the S samples. In fact, let us call N* such a quantity, which corresponds to the maximum available TDC scale, p*. Then we have the relation

N_p = (p / p*) N*,   (8)

which, inserted into (7), leads to

T̂_p = (p N* / p*) (1 − ξ_p)(1 − (1 − ξ_p)^r) / (r ξ_p).   (9)

Finally, plugging Equation (2) into Equation (9), we get

T̂_p = (N* / p*) (1 − ξ)/(r ξ) [ 1 − ( (1 − ξ)/(1 − (1 − p) ξ) )^r ].   (10)

Thus, we can fit the empirical TDC up to the scale p* through Equation (10) above to obtain estimates r̂ and ξ̂ of the TFFD parameters at the global scale. At this point, we have all the ingredients to upscale the empirical TDC up to the global scale. In particular, to recover the total number of types T of the system, it suffices to insert the fitted parameters into Equation (6). Moreover, let us observe that, although we only saved information on the presence/absence of types in samples, since r̂ and ξ̂ are the negative binomial TFFD parameters, we can also retrieve the distribution of items among the estimated T̂ types. In the following section, we will show how estimator (6) performed against the two datasets from tropical rainforests and the four from human activities.
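To make the fitting step concrete, here is a minimal Python sketch (not the authors' code) of the procedure under our reading of the negative binomial framework of [25,26,35]: it generates a noiseless synthetic TDC from Equation (10), refits r and ξ with scipy.optimize.curve_fit, and upscales via estimator (6). For simplicity the bounds restrict r to positive values; for the heavy-tailed (human-activity) regime one would instead bound r in (−1, 0):

```python
import numpy as np
from scipy.optimize import curve_fit

def tdc_model(p, r, xi, N_star, p_star):
    # Equation (10): expected number of types at sub-scale p, given
    # the N* items counted at the maximum stored scale p*
    xi_p = p * xi / (1.0 - (1.0 - p) * xi)          # Equation (2)
    return (N_star / p_star) * (1.0 - xi) / (r * xi) * (1.0 - (1.0 - xi_p) ** r)

def upscale_T(N, r, xi):
    # Estimator (6) for the global number of types
    return N * (1.0 - xi) * (1.0 - (1.0 - xi) ** r) / (r * xi)

# synthetic example: generate a noiseless local TDC with known
# parameters, then recover them by fitting Equation (10)
p_star, N_star = 0.10, 1.0e5
p = np.linspace(0.001, p_star, 100)
T_obs = tdc_model(p, 0.5, 0.999, N_star, p_star)
(r_hat, xi_hat), _ = curve_fit(
    lambda p, r, xi: tdc_model(p, r, xi, N_star, p_star),
    p, T_obs, p0=(1.0, 0.99),
    bounds=([1e-6, 1e-6], [10.0, 1.0 - 1e-9]))
T_hat = upscale_T(N_star / p_star, r_hat, xi_hat)   # global-scale estimate
```

On real data the observed (p, T_p) points replace the synthetic ones, and T_hat is compared against the known total number of types.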

Tests on Rainforest and Human-Activity Data
In order to test the performance of estimator (6), we first applied it to six datasets, ranging from forest ecosystems to human activities (see Table 1).
We considered a random fraction p* = 10% for the ecological data and a random fraction p* = 5% for the social data, mimicking the data reduction process. We then sub-sampled each reduced dataset at 100 equally-spaced scales x ranging from 0.01 to 1, each corresponding to sampling a fraction p = xp* of the global number of items. We thus constructed the empirical TDC of each system up to the p* = 10% or p* = 5% scale, counting how many different types are found at each sub-scale p, and we fitted it via Equation (10). We thus obtained the values of r̂ and ξ̂ that best described the observed pattern. We finally exploited our estimator (6) to recover the total number of types catalogued within each system and we compared the results with the actual known values. We repeated the procedure ten times for each of the six datasets. In Table 1, we report the average values among the ten replicates of the fitted parameters and the estimates of the total number of types, together with the relative percentage errors with respect to the true values. As we can see, the new estimator performs quite well in retrieving the total number of types within each considered system from the storage of a limited portion of it. Table 1. Upscaling results for the rainforest and human-activity datasets. For each dataset, the number of different types and items, T and N, respectively, are reported together with the averages of the TFFD parameters, r̂ and ξ̂, at the global scale. These latter were obtained by fitting the local TDCs constructed from the ten random p* samples of each dataset through Equation (10). Note that the r̂ parameter was positive for the two forests and negative for the human-activity datasets, consistently with the results presented in [35]. Moreover, in all of the six datasets ξ̂ is very close to 1. Finally, we inserted the average predictions T̂ of the number of different types at the global scale, obtained using estimator (6), and their relative percentage errors with respect to the real values, T.
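The sub-sampling loop just described can be sketched as follows for a single replicate (hypothetical names; in the actual tests the type counts at each scale are averaged over ten independent replicates):

```python
import random

def tdc_from_sample(labels, n_scales=100, seed=0):
    """Build the empirical TDC from a single stored sample by
    counting distinct types in nested random subsamples at
    equally-spaced scales x = 1/n_scales, ..., 1 of the sample.

    `labels` is the list of type labels of the stored p* fraction;
    each point (x, T) corresponds to the global fraction p = x * p*.
    """
    rng = random.Random(seed)
    shuffled = labels[:]
    rng.shuffle(shuffled)                 # one random ordering = one replicate
    points = []
    for k in range(1, n_scales + 1):
        x = k / n_scales
        cut = int(x * len(shuffled))
        points.append((x, len(set(shuffled[:cut]))))  # distinct types seen
    return points
```

Because the subsamples are nested prefixes of one shuffle, the curve is non-decreasing by construction, as a discovery curve must be.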
We compared our results with those obtained in [26] for the forest ecosystems and in [35] for the human-activity data, respectively (see Table 2 and the Discussion), where other upscaling approaches based on the negative binomial model were proposed. We found that, although the information used as input was further reduced with respect to the previous methods, the novel method gave comparable results for all datasets, except for BCI, against which it performed worse than the others. Figure 1 shows the empirical TDC and TFFD patterns for the entire datasets against those we reconstructed by inserting the fitted parameters into (10) and (1), respectively. At each scale for the TDC and at each abundance for the TFFD, the predicted points displayed represent the averages among the ten replicates. We can see from Figure 1 that, although the functional form of the TDC given by the estimator is independent of the origin of the data we aim to describe, the empirical patterns display a different trend between the ecological and human-generated datasets. This is reflected by different values of the clustering parameter r̂ in the TFFD pattern our estimator retrieves. In fact, it acquires positive values in the rainforest cases and negative values in the four human-activity datasets considered (see Table 1).

We also wish to stress that the theoretical predictions for both patterns (corresponding to the solid lines in Figure 1) are not the result of a fit of the global empirical TDCs: they are computed by fitting the local ones only up to the small sample scales p* (equal to 10% for the forests and to 5% for the human datasets). Nevertheless, the theoretical predictions and empirical observations are in good agreement, both for the TDC and TFFD patterns. We finally remark that, as we can see from the inset plots in Figure 1, our procedure is not only able to accurately reproduce the TDC behaviour, but it also faithfully retrieves the TFFD pattern, thus allowing us to recover abundant information from the storage of binary presence/absence data. Table 2. Comparative results for rainforest and human-activity datasets. Here we show the predictions for the negative binomial parameters, r̂ and ξ̂, and the relative error of the estimates T̂ with respect to the true value T. In particular, we compared the results obtained by using estimator (6) with the benchmark results provided by Refs. [26,35] for the rainforest and human-activity datasets, respectively.

Application to Human Single Nucleotide Polymorphism Data
Human single nucleotide polymorphisms (SNPs) are substitutions of a single nucleotide along the genome sequence that may occur in both non-coding and coding DNA regions, potentially leading, in the latter case, to erroneous protein production. A given SNP is characterised by the position at which it occurs as well as by the specific nucleotide mutation it carries with respect to the reference genome [39]. With an average occurrence frequency of one every 1000 nucleotides, SNPs constitute the most abundant source of genetic variation among people, and they have become particularly relevant in genome association studies, which aim at identifying high-resolution markers related to complex diseases via causal gene mapping [40,41]. Therefore, predicting the existence of unseen SNPs and estimating their abundance within a genetic region may be informative in these kinds of analyses and may help to highlight phenotypic variation and to detect genetic patterns for treatment responses [42,43]. Due to their importance in medical studies, genomic data have recently been collected at an extraordinarily high rate, also thanks to the modern sequencing techniques nowadays available to the scientific community. This has resulted in huge databases needing non-trivial resources to be investigated. Here, inspired by the results obtained in the ecological and social contexts, we wish to apply our upscaling procedure in this context of genomic variants, both to address the problem of data storage and to open the possibility of new insights into this kind of database. In particular, we analysed an SNP dataset from the 1000 Genomes Project [29][30][31]. The project, carried out from 2008 to 2015, aimed at creating the largest public catalogue of genetic variation in human populations.
Indeed, the revolution accomplished in genomic analyses by the development of modern high-throughput sequencing techniques has made it possible to sequence genomes from a huge number of individuals, thus paving the way for the creation of a comprehensive resource on human genetic variation.
The dataset analysed here contained 58,681 SNPs observed between positions 2,655,180 and 28,770,931 of the Y chromosome of 1233 sequenced males. Given that this chromosome is present in a single copy in male individuals, in each subject any SNP may be either observed once or not at all. In order to test estimator (6), we proceeded as follows. We first extracted a random fraction p* = 5% of the total number N = 839,459 of nucleotide variations observed within the 1233 subjects. We then constructed the local TDC at 100 equally-spaced scales from 0.01p* to p*, similarly to what we did for the natural ecosystems and the human-activity databases. We then fitted the resulting pattern via Equation (10) in order to obtain the values of the TFFD parameters r̂ and ξ̂. Finally, we applied (6) to recover the total number of types T (the number of different SNPs present in the database) at the global scale, i.e., of the original big dataset. As for the other analysed systems, we repeated the procedure ten times. In Table 3, we report the average values among the ten replicates of the fitted parameters and the estimates for T, together with the relative percentage errors with respect to the true values. We conducted the test also for other values of the local scale p*. In particular, in Table 3, we inserted the results for a stored fraction p* = 20%, whereas in Table 4 we display the recovered values for T and the corresponding relative errors obtained at different sub-scales ranging from p* = 5% to 50%. We can notice that, as is desirable for an upscaling estimator, which explicitly depends on the stored data fraction, the error on the estimate of the number of types decreases with the sub-scale p* considered: the more information we store, the better the recovery procedure should perform.
In Figure 2, we plotted the empirical TDC and TFFD patterns for the entire dataset against those retrieved from the scales p* = 5% and p* = 20%. As for the other datasets, the points displayed represent the average values of T̂_p among the ten replicates. We remark that, up to now, we considered as a sample at scale p* the stored portion of the total number of variants observed among the sequenced population. Therefore, the TDC patterns shown in Figure 2 represent how T_p increases with the fraction p of N, the total number of variations observed. However, in real applications, it would be more intuitive and useful to recover information on (and eventually predict at even larger scales) how the number of different genomic variants grows as a function of the number of sequenced genomes, i.e., with the number of participants in the study, I. Thus, let us define p̄ as the number of sequenced individuals corresponding to a fraction p of I: p̄ = pI, with p̄ = 1, . . . , I = 1233. Let N_p̄ be the corresponding total number of variants observed among their genomes. The empirical values of N_p̄ at different p̄ were computed as follows. Given p̄, we sampled 100 different random combinations of p̄ subjects among the total ones and computed, for each subset, the total number of observed mutations among its p̄ individuals. Finally, we set N_p̄ as the average of this quantity among the 100 samples. We found that N_p̄ correlates strongly with p̄ via a linear relation. In particular, performing a linear fit between the two variables and constraining the intercept to equal zero, we found a slope of 680.8 (p-value < 10^−16) with a corresponding R-squared value R² ≈ 1 (see Figure 3). Note that this result allows us to estimate how the number of different mutations (here denoted with T_p̄) grows with the number of sequenced individual genomes, p̄.
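The through-origin linear fit has the closed form slope = Σ p̄ N_p̄ / Σ p̄², which can be illustrated on synthetic data generated around the reported slope (the numbers below are simulated for illustration, not the actual 1000 Genomes values):

```python
import numpy as np

# synthetic N_pbar values scattered around the reported slope 680.8
pbar = np.arange(1, 1234)                       # number of individuals
rng = np.random.default_rng(0)
N_pbar = 680.8 * pbar + rng.normal(0.0, 50.0, pbar.size)

# through-origin least squares: slope = sum(x*y) / sum(x*x)
slope = float(pbar @ N_pbar / (pbar @ pbar))

# coefficient of determination under the zero-intercept convention
ss_res = float(((N_pbar - slope * pbar) ** 2).sum())
ss_tot = float((N_pbar ** 2).sum())
r2 = 1.0 - ss_res / ss_tot
```

Note that R² for a zero-intercept model is conventionally computed against the uncentred total sum of squares, as done here.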
Indeed, thanks to the linear relation between N_p̄ and p̄, we can easily connect the total number of mutations N with the total number of study subjects I via N = 680.8 I. Thus, the fraction of mutations p can be rewritten as p = p̄/I = 680.8 p̄/N. This relation allows us to write the TDC in (10) as a function of p̄ by simply re-scaling the independent variable. In Figure 4, we compare the pattern recovered with this procedure against the empirical points, computed as the number of different mutations observed among a sub-sample of dimension p̄, averaged over 100 independent sampling procedures of the p̄ subjects. As we can see, our prediction faithfully reproduces the empirical data. We remark that here we only tested the performance of the framework in recovering statistical patterns of interest of big databases from the storage of a limited portion of them. Nevertheless, one could, of course, apply the very same approach with the aim of upscaling such patterns from the data scale up to any desired scale, thus extending the range of applicability of the method so as to infer the statistical properties of an entire population, beyond the scale at which current technologies allow us to collect data.

As for the forests, we note that for the BCI dataset the fitted clustering parameter was one order of magnitude lower than the one fitted in [26] (r = 0.20, ξ = 0.9991 in [26] against r = 0.073 and ξ = 0.9998 here). As for the human-activity data, we benchmarked our estimator against the one adopted in [35]. Remarkably, although our framework reduces the needed amount of input information to binary presence/absence data, for all datasets both the values of r and ξ and the corresponding errors between predicted and true values of T are comparable between the two approaches referenced in [35]. Only for the emails dataset is the new error, although small, ten times the previous one (1.5% against 0.11%). We then applied our procedure within the new context of genomic DNA.
More precisely, we considered data from the 1000 Genomes Project containing the single nucleotide polymorphisms along the Y chromosome of more than a thousand subjects. We found that, also in this case, the negative binomial model allows for the recovery of both the number of types and the related patterns (type-discovery curve and type frequency of frequencies distribution) after the data reduction process. We remark that the application of our method in genomics to recover SNPs from smaller databases, as well as to possibly predict the unseen SNPs beyond the data scale collected up to now, follows a current trend in clinical research. Indeed, nowadays much effort is put into gaining an accurate assessment and quantification of DNA mutation variability and, more specifically, of intra-tumour heterogeneity, due to their clinical implications [44]. For instance, the level of genetic diversity within an individual tumour could be informative of aspects of its internal dynamics, bearing on future evolution and treatment responses, that should be taken into account in ad hoc therapeutic design [43,45]. However, the latter is usually based on results collected from single biopsy samples, which are statistically unlikely to accurately capture the complete mutational profile of the cancer. Thus, there is an increasing need for technologies capable of obtaining a more reliable and extended genetic description of the disease, which may help in taking crucial medical decisions without additional costs, wasted time, or ethical implications as drawbacks. Our method, specifically developed for recovering global measurements from local ones, could meet this real need and could prove to be a useful tool within such a clinical context. Finally, we also explored how our procedure performed at different data storing fractions p*. As expected, we observed that, by increasing the portion of the dataset used as input, the framework performs better.
Therefore, depending on the level of precision one would like to reach in the final estimates and on the amount of storage space and computational resources available, one must tune the quantity of data to collect and save. In other words, the accuracy of the global description of the system under study and the reduction of the corresponding data are in a trade-off. As an example, for the single nucleotide polymorphism dataset considered here, we found a relative error of only 0.3% when storing 50% of the original database, against an error of 6.4% when decreasing the input data fraction by ten times.

Conclusions
The big data revolution has had a major impact on all scientific research fields. One of the most important issues that has emerged is that huge amounts of data require a large storage capacity and advanced techniques to be analysed. Sometimes it is necessary to reduce the quantity of available data in order to be able to manage them. This process, of course, may lead to the loss of details. Therefore, it is essential to design novel methods allowing for the recovery, with small errors, of the information contained within the original big dataset while only having access to a smaller sample of it.
Here we exploited the ecological idea of upscaling to deal with the big data storage and analysis problems in an effective way. More precisely, we proposed a novel procedure based on a maximum-likelihood-inspired estimator for the number of types in a system which, starting from reduced binary information on the presence/absence of types across small scales, is able to recover global information of interest about the original dataset.
We tested our method on several big datasets from different research fields. In particular, we mimicked a reduction in the storage of the data stream by considering only small random fractions of the originally collected data and then tried to retrieve global information on the entire dataset. In all the cases we studied, we found that our procedure allows for the successful recovery, with acceptable errors, of some major statistical patterns of the original system, such as the type frequency of frequencies distribution and the type-discovery curve.
Thanks to the generality of the setting within which our estimator is derived, our approach could be of help whenever the massive quantity of data nowadays available comes at too high a cost in terms of the storage and energy needed to manage and analyse them. Indeed, our method allows for the reduction of big databases without the loss of essential information, which could be crucial to have a clearer view of the systems they represent. These results pave the way for the application of our approach in contexts beyond those of ecology and human behaviour, for which it was originally conceived.
Supplementary Materials: All data used in this study are available online or upon request. In particular, the Pasoh and Barro Colorado Island datasets were provided by the Center for Tropical Forest Science of the Smithsonian Tropical Research Institute (https://stri.si.edu/). The email dataset can be found in [27], while the other human activity data (Twitter, Wikipedia and Gutenberg) are publicly available [28]. Finally, data on human polymorphisms can be freely downloaded from the 1000 Genomes Project website (see ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502).

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

TFFD   Type Frequency of Frequencies Distribution
TDC    Type-Discovery Curve
SNP    Single Nucleotide Polymorphism