Empirical Application of Generalized Rayleigh Distribution for Mineral Resource Estimation of Seabed Polymetallic Nod- ules

An effective empirical statistical method is developed to improve the process of mineral resource estimation of seabed polymetallic nodules and is applied to analyse the abundance of seabed polymetallic nodules in the Clarion Clipperton Zone (CCZ). The newly proposed method is based on three hypotheses as the foundation for a model of “Idealized Nodules”, which was validated by analysing nodule samples collected from the seabed within the Tonga Offshore Mining Limited (TOML) exploration contract. Once validated, the “Idealized Nodule” model was used to deduce a set of empirical formulae for predicting the nodule resources, in terms of Percentage Coverage and Abundance. The formulae were then applied to analysing a total of 188 sets of nodule samples collected across the TOML areas, comprising box-core samples and towed camera images collected by one of the authors and detailed in [4]. The analysis also relies upon detailed box-core sample measurements from other areas reported by [7]. Numerical results for resource prediction were compared with field measurements, and reasonable agreement has been achieved. The new method has the potential to achieve more accurate mineral resource estimation with reduced sample numbers and sizes. They may also have application in improving the efficiency of design and configuration of mining equipment.


Introduction
Polymetallic nodules are mineral particles found in many of the world's oceans [1]. A major deposit lies within the Clarion Clipperton Zone (CCZ) of the tropical North Pacific [2]. Nodules grow via precipitation in an organized manner in and on clay-ooze at the seabed [2] and they are often found with others of similar size and form [3][4][5]. Nodule 'abundance' is the kg (usually wet) weight of nodules per square metre of seabed and is used to estimate tonnage of nodules in a mineral resource estimation (as the surrounding clay-ooze should be able to be disregarded at the first step of mining [4]). Interest in the deposit, from the perspectives of development, marine environment and regulation, has increased over the last ten years [5,6].
In Section 2, three hypotheses are proposed as the basis for an idealized model of seabed polymetallic nodules. The hypotheses made for the "idealized nodule" model are based on analyses of nodule samples collected from the seabed at CCZ. One of the key hypotheses is based on numerical evidence that long axes of the seabed nodules follow the Generalized Rayleigh Distribution (GRD). Section 3 presents the mathematical characteristics of the GRD pertaining to the analysis of nodules samples. The traditional statistical methods for estimating the parameters of the sample distribution, and for performing the Goodness-of-fit test for GRD are discussed. While they are found useful for analysing nodule samples, the traditional numerical procedure is too complex for practical applications.
In Section 4, based on the "idealized nodule" model discussed in Section 2, a simplified practical approach is developed to replace the complex numerical methods in Section 3. As a result, empirical formulae are derived to directly predict the parentage coverage and abundance of seabed nodules.
Section 5 shows the numerical results of testing the three hypotheses as the basis for "idealized nodule" model. Strong numerical evidence is found, providing validation to hypotheses and consequently the "idealized-nodule" model.
In Section 6, a total of 188 sets of samples of long axes of seabed polymetallic nodules from the CCZ are analysed using the empirical method developed in Section 4. The resource estimations in terms of percentage coverage and abundance are compared with the field measurements.

Three hypotheses for an idealized model of seabed polymetallic nodules
Polymetallic nodules from the CCZ are found in a wide range of forms [4], but within parts of the central and eastern CCZ covered by an exploration contract held by Tonga Offshore Mining Limited (TOML) they often form irregular slightly prolate spheroid-like forms ( [4]; Figure 1). Growth around the horizontal axes (X and Y) is believed to be a function of horizontal space and mineral supply, and growth along the vertical axis is also a function of permissive layer of chemical conditions term the geochemically active layer [4]. Nodules have a very consistent density [4] and a relationship between the major horizontal axis and nodule weight (i.e. volume) has been long recognized [4,8,9]. Based on the above observation, to allow mathematical modelling of seabed polymetallic nodules, we have made the following three somewhat severe fundamental hypotheses: 1. Each nodules piece is of ellipsoidal shape (e.g., (a), (b) in Figure 2), which is defined by its three axes , and , where = 1, 2, 3. . . , with being the number of nodules. Here is the long or major axis, which is usually in the horizontal plane while and are the two typically shorter minor axes in the horizontal and vertical planes.
2. Within a certain boundary (domain) on the seabed, the ellipsoidal nodules are similar in shape, i.e., the ratio between two minor axes and the major axis, = and = are constant.
3. Within a certain boundary on the seabed, the long axis of nodule follows a Generalized Rayleigh Distribution (GRD), which is defined by a pair of parameters α and β (See Section 3). The above idealization is supported by analysis of nodule data and they were found accurate to certain degree. Specifically, the hypothesis 1 and 2 above will be justified using regression analysis of nodule dimensions and weights of seabed nodules samples in Section 5.1, which the hypothesis 3 will be validated by "Goodness-of-Fit" tests in Section 5.2 using nodule samples collected from TOML areas.

Generalized Rayleigh Distribution (GRD) and the traditional method
The Rayleigh distribution has been widely used to model phenomena in various technical fields. For instance, in the field of oceanography, Longuet-Higgins [10] showed the heights of narrow-banded random ocean waves follows the Rayleigh distribution. The Generalized Rayleigh Distributions (GRD), a family of two-parameter variations, have also been proposed although its practical application is limited. For a random variable X following the GRD, its probability density function (PDF) ( ) is in the form: where α > 0 and β > 0 are shape and scale parameters, respectively. The cumulative distribution function (CDF) ( ) is given by  Figure 3 shows the PDF of Generalized Rayleigh Distribution for various values of parameters α and β. For the statistical analysis of seabed polymetallic nodules, the parameter range is ≥ 1 .

Mean and standard derivation of the Generalized Rayleigh Distribution
As derived in Appendix A, the mean of and the variance of the Generalized Rayleigh Distribution can be written as where Formally, Eq.(3) can be used to estimate and when and are known. However, due to the complexity of functions ( ) and ( ) in Eq.(4), the solution process is rather tedious. Due to the complexity in evaluating ( ), ( ), an empirical method is developed below in Section 4 to simplify the calculation for practical applications.

Parameter estimation by Maximum Likelihood Estimation (MLE)
For a random sample , , . . . , of size n , to determine the two parameters and , which determine the Generalized Rayleigh Distribution (GRD), the Maximum Likelihood Estimation (MLE), which maximizes the log likelihood function, gives the following a pair of equations (Abd-Elfattah [11]): Eq.(5) is first solved iteratively by the Newton-Raphson Method to yield , and Eq.(6) is then used to calculate . It is obvious that the solution process for Eq.(6) is quite tedious. An empirical alternative is, therefore, devised in Section 4 below to simplify the process for practical application.

The Anderson-Darling Test Statistics
Once the parameters and are estimated as above, it is important to test whether they will yield a Generalized Rayleigh Distribution which give a "good-fit" for the sample. For computational purpose, the Anderson-Darling (AD) test statistics and can be written as: Here = ( ), where ( ) is the empirical Probability Density Function (PDF), calculated using Eq.(2) above, and arranged into ascending order.

The Test Criteria for Hypothesis
The value of and calculate above are then compared with their corresponding critical values and respectively. If the null hypothesis that sample data follow generalized Rayleigh distribution, is accepted at the particular significance level γ (or at 1 − confidence level). The critical values and are simulated by the Monte Carlo Method, as discussed in Appendix B.

A new empirical method and its application to nodule resources
As shown above in Section 3, the traditional formulation can be used for (1) analysing whether a numerical sample follows a Generalized Rayleigh Distribution, and (2) if it does, for estimating the parameters which yield the best fit. However, the complexity of the numerical process makes it difficult to be applied in practice. In this Section, we introduce a simplified numerical procedure for predicting the mineral resource of seabed nodules.

Empirical estimation of parameters of the Generalized Rayleigh Distribution
As shown in Section 3.2.1, Eq.(3) and Eq.(4) can formally be used to estimate and when and are known. To simplify the calculation of ( ) and ( ), using the technic of nonlinear regression, it can be shown, for the interested range of 1 ≤ ≤ 9, the following empirical formulae: , for 1 ≤ ≤ 9 (9) are accurate to an accuracy of 10 (See Figure 4).  Also included in Figure 4 is ( ), which will be discussed further in Section 4.2 below. The empirical form of ( ) is Combining Eq.(3) and Eq.(9), it is straightforward to derive: ) . .
For a random sample , , . . . , of size n, the mean and standard deviation are first estimated from the sample, and then the two formulae in Eq. (11) can be used in sequence to estimate parameters and . It is evident from the first equation of Eq.(11) that shape parameter α is determined by the ratio of and (in effect the reciprocal of the coefficient of variation or signal-tonoise-ratio). The scale parameter β, on the other hand, is related to the mean . This is consistent with the observation in Figure 3. It can be posited that geologically these two parameters may in turn relate to the stability and thickness of the geochemically active layer in which the nodules grow, as described in [4].

Resource Estimation for Seabed Polymetallic Nodules Using Coverage and Abundance
To estimate the percentage coverage and abundance of seabed polymetallic nodules using the measurements of their long axes, the first two hypothesis, as in Section 2 above, are applied: 1. The nodule is assumed to be in an idealized ellipsoid shape. 2. Nodules within a certain boundary are assumed to be "similar" in shape, the ratios between the lengths of the two minor axes and the major axis (denoted by and ) are constant.
Assuming the projections of nodules in the photo are elliptical with = 1 being the short axes (constant ≤ 1), Eq. (12) can be re-arranged as: Using (35) in Appendix A and taking m = 2, Eq.(13) becomes: Assuming is the area of the photo, the nodule percentage coverage becomes:

Prediction Nodule Abundance I: Based on "Idealized Nodule" Model
Similarly, the total Weight of nodules in the photo can be calculated by adding the weights of nodules in the photo: Assuming the two minor axes and are related to the major axis by = 1 and = 2 , Eq.(16) can be re-arranged as: Assuming is the area of the photo, the nodule abundance becomes:

Prediction Nodule Abundance II: With Empirical Long-Axis-Weight Relationship
According to [9], [4] and Eq. (3) (also see Section 5.1 below), the weight of the nodule is correlated to its long axis by: where and are constants and is usually smaller and close to 3.0. Eq.(20) can be re-arranged as Then, the total weight of nodules in the photo can be calculated by adding the weights of nodules in the photo: Using (35) in Appendix A and taking m = k, Eq.(20) becomes: ( ), with 2 < < 3, can be approximated by interpolating ( ) and ( ): Assuming is the area of the photo, the nodule abundance becomes: Using the using the technique of nonlinear regression, Eq.(21) can be rewritten into the form: For the range of 1 < < 4, of interest to nodules, Eq.(28) can be further approximated, within 0.2% error, by The above formulation, based on the "Idealized-Nodule" model, provides two efficient ways to estimate the nodule resources: 1. Eq.(15) and Eq.(19) can be used independently to calculate the nodule percentage coverage and the abundance , respectively. 2. If an estimation of the is already computed using digitization technique from seabed imagery, then Eq.(28) or Eq.(29) can be used to compute .

Test of hypotheses of the Idealized Nodule Model
The three fundamental hypotheses that define the "idealized nodule" in Section 2 find considerable support from numerical measurements of samples of seabed nodule.

Test of hypothese 1 and 2: linear regression analyses on nodule dimensions and weights
In order to examine the validity of the first two hypotheses for the "idealized nodules", linear regression analyses were carried out on published nodule major and minor axes, volume and weight data from two sites in the central and eastern CCZ [7] (Figure 9). Specifically, the following cases of linear regressions were carried between: 1. Case 1: Nodule long axis and its horizontal minor axis; 2. Case 2: Nodule long axis and its vertical minor axis; 3. Case 3: Nodule weight and its volume; and 4. Case 4: Nodule long axis and its weight.
The numerical results are shown in Figure 6 and are summarized in Table 1. The linear regression was performed for 1-3 with intercept forced be zero to match the fact that the two variables vanish at the origin.  [7]. For location refer also Figure 9. Note that intercepts of Felix [9] (3) and TOML [4];1,2) based long-axis formulae are converted from cm to mm (in logarithmic scale).
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 25 March 2021 doi:10.20944/preprints202103.0639.v1 Figure 6: Linear Regression between Nodule Dimensions and Weights for GSR Central Region. Data source [7]. For location refer also Figure 9. Note that intercepts of Felix [9] (3) and TOML [4];1,2) based long-axis formulae are converted from cm to mm (in logarithmic scale). The results from Cases 1 and 2 show the two minor axes are correlated to the long axis in statistically significant ways. With R 2 > 95%, a great majority (>95%) of the data points supports the hypothesis that, with a certain boundary, the ratios between the length of the major axes X and the lengths of the horizontal and the vertical minor axes Y and Z can be considered as constants. It is also noted the ratios vary between regions. The result from the 3 rd Case indicates a nodule density of 1.93 g/cm 3 although it is only supported by a small sample. The Case 4 indicates the nodule weight is strongly correlated with the 2.5056 and 2.7210 power of the long axes (versus 3 for the original "idealized nodule" model), supported by about 88% and 93% of the data points from the two regions, respectively.
While the above results give broad, yet strong support to the first two hypotheses for "idealized nodules" in Section 2 (these being on ellipsoid form and similarity of form within a domain), we use additional analysis in Section 2 to validate the 3 rd hypothesis regarding a GRD distribution.

Test of hypothses 3: Goodness-of-Fit test of Generalized Rayleigh Distribution for Nodule Long Axes
In this Section, the traditional "Goodness-of-Fit" tests, as outlined in Section 3, are carried out to check the validity of the 3 rd hypothesis for the "idealized nodules". It is to check whether the long axes of seabed nodules follow the Generalized Rayleigh Distribution (GRD) in a statistically significant way. A total of 9 samples of nodule long axes were analysed. While Table 2 shows the key statistical properties of the 9 data sets, Table 3 presents the results of "Goodness-of-Fit" tests.  For each set of data in Table 3, and are solved iteratively by Eqs. (5) and (6), using the MLE method described in Section 3.2.1. The Anderson-Darling (AD) test statistics and are then calculate by Eq.(7) and they are in turn compared with their critical values and , which are computed by the Monte Carlo Simulation in Appendix B. According to the test criteria in Eq.(8), among the 9 sets of samples, 7 of them have passed the AD tests at 95% confidence level. It indicates a strong probability that the samples of long axes of polymetallic nodules do follow Generalized Rayleigh Distribution. This conclusion may be conditional (e.g. by geological domain), and more research may be needed to identify the conditions. Figure 7 and Figure 8 show the visual comparison of distribution of nodule long axes from raw data, computed by the traditional method, and by the new empirical method. Probability density functions (PDF) and cumulative distribution functions (CDF) are plotted in Figs. 7 and 8, respectively, using Sample ID: 2015_08_29_131349 as an example. For the raw data, a bin size of 0.25 was selected to create a histogram of the sample of long axis, and the PDF and the CDF are computed. The dark blue line shows the raw counts of the original data set, with the light blue line showing the smoothed data using the Savitzky-Golay filter [12] for the PDF in Figure 7. The green line shows PDF and CDF based on parameters and calculated iteratively by MLE method described in Section 3.2.1 above. The red line shows PDF and CFD based on parameters and calculated by the empirical formulas in Eq. (11). The empirical formulas, while much more straightforward to use, do give reasonably accurate results for the statistical distributions in practical applications.

Comments on the level of support
Significantly, the Linear Regression Analysis in Section 5.1 and the Goodness-of-Fit Test in Section 5.2 do support the three seemly drastic fundamental hypotheses made in Section 2 for "idealized nodules". However, more analyses are needed to check the generality. Nonetheless, as the hypotheses have been validated, the empirical method developed in Section 4 can be used for nodule resource prediction in the next Section.
Possible reasons for achieving the statistically significant validations, include that the samples are located within a particular growth domain (the CCZ), and that the conditions of growth within this domain are remarkably consistent (nodules grow in effect from the ocean's epibenthos, and slowly enough to "average out" short term (millennia scale) variances in growth conditions).

Numerical results of nodule resource prediction using the new empirical method
Once newly proposed "idealized nodules" model has been validate in the previous Section, the empirical formulation as developed in Section 3, particularly the Eq.(19) (or Eq.(25)) for nodule abundance are applied to a total of 188 samples of seabed polymetallic nodule collected in 2015 as part of the TOML CCZ15 marine expedition in this Section.

Sample datasets
The 188 samples of seabed polymetallic nodules used for the empirical analyses can be grouped in three datasets: 1. Dataset 1: regional scale box-core sample dataset (physical weights). This involves 4 TOML exploration contract areas (TOML B, C, D, F; Figure 9) spanning some 2,000 km of longitude and 700 km of latitude. The dataset thus allows for examination of a general relationship. 2. Dataset 2: local scale box-core sample dataset (physical weight) of two distinct facies types but only within the TOML F area (~200 x 200 km). Type 1 nodules are smaller and often densely packed, type 2 nodules are significantly larger and more variable (cf [5]). The dataset thus allows for differences in nodule types from an area where the distinction between type is simple and straightforward. that actual nodule weights cannot be compared, but it allows for larger datasets from two distinctly different areas to be compared. Coverage was also measured for datasets 2 and 3 from seabed photographs (boxcore mounted and towed respectively) using Image J software. Dataset 1 was not able to be measured due to a lack of images (the box-core camera had frequently malfunctioned).
A summary of the datasets used for analysis is shown in Table.4 and Figure 9.  1.6 to 7.8 1.5 to 6.1 0.24 to 0.96 0.25 to 0.83 * For datasets 1 and 2 long axes measured from grid photos of the nodules after collection, separation from the host clay-ooze and washing. For dataset 3 long axes measured from photos of the seabed as detailed in [4].

Prediction of Abundance of Seabed Nodules
To make a thorough assessment of the accuracy of abundance prediction made by the new empirical method, particular Eq.19 (or Eq.25) , Figure 10, 11 and 12 show the ratios between the abundance prediction and those from the actual box-cores measurements (for datasets 1 and 2) or the best available estimate from long-axis measurements (for dataset 3). In the Figures, ratios are  2. Chart (b) showing the ratios in Chart (a) corrected by a "linear adjustment". As the results in Chart (a) display biases of slightly over-estimating abundance for larger nodules. The result of linear regression of the ratios in Chart (a) are used to adjust the ratios, and the corrected results are shown in Chart (b); and 3. Chart (c) showing ratios based on abundance calculated by the empirical formula Eq.
(19), which has incorporated the long-axis-weight relationship observed by several researchers (e.g., Felix [9] (3)). Figure 10 shows the results for Dataset 1, which is collected across the TOML areas. It is observed that empirical formula Eq. (19), which is strictly based on the three hypotheses for Idealized Nodule model in Section 2, gives a slight bias of over-predicting the abundance for larger nodules. Unbiased prediction can be achieved once either the "linear adjustment" or the long-axis-weight relationships are applied. Figure 11 depicting Dataset 2 shows that nodule types can have an influence in the prediction. While empirical formula Eq. (19) results in a similar bias to that seen for Dataset 1, facies specific "linear adjustment" results in unbiased estimates with slightly higher levels of scatter. Use of empirical formula Eq. (25) based on the coefficients of [9] works better for Type 1 nodules than for Type 2, suggesting that different coefficients may work better. Figure 12 with Dataset 3 shows that the accuracy of prediction made by empirical formula Eq. (19) may vary for nodules between different areas. In effect the nodules from TOML B1 appear to be "less ideal" per Section 2 than the nodules from TOML C1. This may be due to the fact that the TOML B1 nodules are likely older and more often formed from multiple generations of growth (i.e. fragments of nodules with younger concentric growth phases). This could predispose them to be more equant in shape. Again, the linear adjustment addresses the size-bias seen in the direct application of Eq. (19). Application of Eq.(25) gives broadly similarly agreeable results, but it is worth noting that the compared estimate for this dataset is from long-axis weight measurements.
The slight biases of over-estimation of abundance for larger nodules reveals a limitation of the empirical formula Eq. (19), which is directly based on the three fundamental hypotheses for the "idealized nodules". While the first two hypothesis state that the nodules are in ellipsoidal shape and they are "similar" in shape, in realty, it is obvious that the nodule shapes are complex and for nodules of various sizes, the ratios between the minor axes and the long one may vary with nodule size. However, this bias seems much less severe while empirical formula Eq. (25) is applied, which is based on an empirical relationship between the nodule long axis and its weight (e.g., Felix [9] (3)). Estimates of coverage using the new empirical method show mixed but encouraging results when compared with field measurements from three areas. Dataset 2 from TOML F shows a systematic bias independent of nodule facies types. In contrast, Dataset 3 from TOML B and C do not display any appreciable bias. This is likely related to the degree of sediment cover between the areas (Figure 14).

Conclusions
It is concluded that: 1. An idealized module for the dimensions of seabed polymetallic nodules is proposed for mineral resource prediction. This "idealized nodule" model is based on three hypotheses. One of the key hypotheses is, with in certain boundary, the nodule long axes follow a two-parameter Generalized Rayleigh Distribution (GRD). These three hypotheses were tested using field measurements from available nodule samples collected from CCZ. Strong numerical evidence was found, providing validation to the three hypotheses, possibly due to the relatively stable seabed environment and the long growth period of the nodules, removing short-term transient effects. 2. Once the "idealized nodule" model has been validated, a new empirical method has been developed based on the three hypotheses. Specifically, explicit empirical formulae have been derived for direct calculation of GRD parameter α and β (Eq.(11)), for Percentage Coverage CN (Eq.(15)), and for Abundance AN (Eq.(19) or Eq. (25)) . These formulas are found accurate for mineral resource estimation and are much easier to use than the traditional analytical methods for GRD. 3. To check the effectiveness of the new empirical method for predicting nodule mineral resource, the predicted nodule Abundances are compared with field measurements for a total of 188 nodule samples collected across the TOML areas. While the general agreement is acceptable, the direct application of the "idealized nodule" module does display a slight bias of over-estimating Abundance for larger nodules. However, unbiased accurate prediction of nodule Abundance can be achieved by applying either a "linear adjustment" or a long-axis-weight relationship (Section 6). 4. Despite agreeable numerical results for nodule samples from parts of the TOML exploration contract area, analyses of samples from other regions will be needed to better understand the generality of the empirical model and its derived formulae. Such analysis is needed in any event to calibrate the model in other areas. 5. Estimates of coverage using the new empirical method show mixed but encouraging results when compared with field measurements from three areas. For two areas the new empirical method provides close agreement but from the third area there is a consistent offset. This may be related to the degree of sediment cover in that area. 6. The new empirical method with derived explicit formulae have shown the potential of achieving more accurate mineral resource projection with reduced sample numbers and sizes. The new understanding of the nodule size distribution can also improve the efficiency of design and configuration of mining equipment. The sample images used are property of Tonga Offshore Mining Limited and their permission to use the images for this research is also gratefully acknowledged.
Appendix A: Functions ( ) , ( ) and ( ) For a random sample , , . . . , of size N, following the Generalized Rayleigh Distribution (GRD) with its probability density function (PDF) ( ) in the form of ( ) = 2 ( ) 1 − ( ) , > 0, () its mean value or can be expressed in the integral form below Similarly, the means of the sums of square and cubic and be written respectively as: Eqs. a), (30) and (31) can then be combined formally as below where m = 1, 2, 3 . Further defining the Eq. (33) can be written as For Eq. (35), when m = 1, the mean of the sample is The variance of the Generalized Rayleigh Distribution can be calculated as  It is of practical importance to estimate the minimal sample size necessary to achieve desired accuracy of estimation. To test whether a sample size of 100 is sufficient to yield an accurate estimation of statistical distribution, 500 sets of samples with size = 100 were randomly selected from the whole sample with size = 440, and for each set of selected samples, parameters and are computed empirically using Eq. (11), and a corresponding GRD is generated.
For a typical simulation, the mean values of and , calculated by averaging results from 500 simulations, are within 4%, compared with values of and calculated based on the whole sample (size=440). The maximum errors for and between a small-sample (size=100) and the whole sample (size=400) is about 10% and 5%, respectively. Figure  17 and Figure 18 show CDFs and PDFs from 10 simulations (randomly selected from a total of 500 simulations), compared with the CDF and PDF, calculated from the whole sample (size=440). The reasonably good agreement between the results from samples with size=100 and those from the whole sample (size=440) indicates that a smaller sample size of 100 can generally produce a good estimation of the statistical distribution.