Application of General Linear Models (GLM) to Assess Nodule Abundance Based on a Photographic Survey (Case Study from IOM Area, Pacific Ocean)

Abstract: The success of the future exploitation of the Pacific polymetallic nodule deposits depends on an accurate estimation of their resources, especially in small batches scheduled for extraction in the short term. An estimation based only on the results of direct seafloor sampling using box corers is burdened with a large error due to the long sampling interval and the high variability of the nodule abundance. Therefore, estimations should take into account the results of bottom photograph analyses performed systematically and in large numbers along the course of a research vessel. For photographs taken at the direct sampling sites, the relationship linking the nodule abundance with the independent variables (the percentage of seafloor nodule coverage, the genetic types of nodules in the context of their fraction distribution, and the degree of sediment coverage of nodules) was determined using the general linear model (GLM). Compared to the estimates obtained with a simple linear model linking this parameter only with the seafloor nodule coverage, a significant decrease in the standard prediction error, from 4.2 to 2.5 kg/m², was found. The use of the GLM for the assessment of nodule abundance at individual sites covered by bottom photographs, outside of direct sampling sites, should contribute to a significant increase in the accuracy of the estimation of nodule resources.


Introduction
Deposits of polymetallic (manganese-bearing) nodules occurring at the bottom of all oceans are an attractive alternative to onshore deposits as a source of metals such as Ni, Co, Cu, Mn, Li, REE (the rare earth elements), and others. The Clarion-Clipperton Fracture Zone (CCZ) in the tropical NE Pacific is the area of greatest economic interest for nodules [1-3] (Figure 1A). It is expected that the demand for some metals will soon exceed their supply due to the depletion of onshore ore deposits resulting from the intensive development of modern branches of the economy (high technology, green technology, emerging industries, and military applications). In the future, the shortage of some metals can be covered by the exploitation of offshore deposits. This issue has been widely discussed in many publications, e.g., [1,2,4-14]. The exploitation of these deposits, apart from their proper recognition and the estimation of their resources [15], requires solving a number of problems related to the technique of mineral extraction and ore processing [7,16-18].
Figure 1. ... [19] (A) and H22 exploration block (B); the location of the box corer sampling sites and seafloor photographs in the H22 exploration block (three variants of training and test subsets) (C).
The development of an appropriate scenario and schedule for the future short and medium-term exploitation of nodules, after meeting the environmental protection requirements and solving technical problems of mining, depends on a detailed recognition of the distribution of nodule abundance and resources and an analysis of the ocean floor topography based on reliable contour maps.
An accurate assessment of the abundance of polymetallic nodules at seafloor sites located far from direct sampling stations poses many problems. They result mainly from the large distances between sampling stations (which vary depending on the stage of recognition of the nodule-bearing areas), the high variability of nodule abundance, and, to a lesser extent, the inevitable errors during the sampling process [20]. Therefore, attempts to combine the estimation of nodule abundance based on classical direct sampling (e.g., using box corers) [15,21] with indirect methods, such as photographic surveys [22-25] or broadly understood hydro-acoustic methods [26,27], seem natural and rational.
The routine and continuous video and photo-profiling (photographic survey) of the ocean floor along the course of a research vessel from which direct sampling is carried out provides a huge number of photographs. Their analysis provides indirect, approximate information on the percentage of seafloor nodule coverage (i.e., the percentage of the seafloor covered by nodules, hereinafter abbreviated NC-S), the degree of sediment coverage of nodules (SC), and the dominant genetic type of nodules (GT) between sampling stations [28]. The data obtained from the photographs at sampling sites correlate with varying strength with the nodule abundance determined from box core samples. Previous attempts to develop a simple linear regression between the nodule abundance and the percentage of seafloor nodule coverage did not yield unequivocal results. In some parts of the studied area of the Interoceanmetal Joint Organization (IOM), a statistically significant and strong linear correlation between these parameters was found (with linear correlation coefficients of 0.6-0.7), while in other parts there was no statistically significant correlation and the coefficients of linear correlation were close to zero [20,28,29]. The main reasons for the weaker correlation of the two parameters are the varying degree of sediment coverage of nodules [28-30], the diversity of nodule genotypes [31,32], the small-scale variability of nodule abundance, the different geometrical basis of the measurements (the area of the bottom covered by a photograph is several times larger than the area of the horizontal section of the box corer), and the variation in the quality of seafloor photographs. For these reasons, some researchers introduced various coefficients correcting the relationship between the nodule abundance and the percentage of seafloor nodule coverage determined on the basis of photographs [30,33].
Improvements in the accuracy of the estimates of nodule abundance can be expected when additional independent qualitative (ordinal) variables determined from the photographs, such as the distribution of nodule fractions associated with the genetic type of nodules and the degree of sediment coverage of nodules, are introduced into the relationship model. For the data from the H22 exploration block in the IOM area (Figure 1B), a statistically significant and relatively strong correlation was found between the nodule abundance and the genetic type of nodules, while the correlation between the nodule abundance and the degree of sediment coverage of nodules was weak [28].
The coexistence of independent variables of different types (continuous and ordinal) requires the use of an appropriate mathematical model linking them with the nodule abundance used as a dependent variable. This can be achieved using general linear models (GLM). The results of their application to a dataset from a part of the area administered by the IOM [34] in the Clarion-Clipperton Zone in the Pacific are the subject of this article.

Research Objective
The main aim of the research was to assess the accuracy of predicting the nodule abundance (APN) at ocean floor points outside the sampling stations using the multivariate regression technique called General Linear Models (GLM) (Figure 2). In the present case study, GLM was used to determine the form of the relationship linking the nodule abundance (continuous dependent variable) with the percentage of seafloor nodule coverage NC-S (independent continuous variable) and two qualitative variables (independent ordinal variables): the genetic type of nodules in the context of their fraction distribution (hereinafter referred to as the genetic type of nodules and marked as GT) and the degree of sediment coverage of nodules (SC). The values of all independent variables were determined from photographs of the ocean floor at box corer stations. The values of the NC-S variable were determined automatically with the use of computer software, while the values of the GT and SC variables were determined visually based on expert evaluation. To evaluate the effectiveness of GLM as a method for predicting nodule abundance, the obtained results were compared with the results of the prediction for the simple linear model (SLM) linking the nodule abundance APN and the percentage of seafloor nodule coverage NC-S. For comparative purposes, analogous analyses were also performed for nodules from the box corer samples after they had been removed, washed, and arranged on a laboratory grid, i.e., for the percentage of grid coverage with nodules NC-T (Figure 2).

The fraction distribution of the genetic type of nodules (GT, ordinal variable) acts as an identifier and expresses differences in the probability distributions of nodule sizes characteristic of the distinguished genetic types of nodules. The exact characteristics of both ordinal variables and of the other continuous variables describing the nodule-bearing areas in the H22 exploration block were presented by Wasilewska-Błaszczyk and Mucha [28]. The analysis of variables included in the cited article was the basis for the selection of independent variables in GLM.

Materials
The usefulness of GLM was assessed based on 68 measurements of the nodule abundance in samples collected from the ocean floor using box corers and the results of the analysis of 68 photographs taken at the direct sampling sites. The datasets come from the H22 exploration block (4151 km²), located in the central-eastern part of the B2 sector and the best recognized in the area administered by the IOM (Figure 1).
The data were obtained during two cruises of a research vessel in 2014 and 2019. Both cruises used the same methods for sampling and photographing the ocean floor and for the computer determination of the seafloor nodule coverage from bottom photographs. Therefore, from the point of view of the accuracy of determining the values of the variables, the homogeneity of the initial dataset can be assumed. The box corer sampling covered a 0.25 m² (0.5 m × 0.5 m) square section of the seafloor, while 62 photographs covered a bottom section of approximately 1.6 m². At 6 of the 68 sampling sites, no bottom photographs were taken before the box corer sample was collected; therefore, seafloor photographs obtained from photo-profiling (the Neptun C-M1 device, Russia [35]), covering an area of about 5 m² and taken approximately 5-50 m from the box corer sampling sites, were used instead. The statistics of the continuous variables, i.e., the nodule abundance based on wet nodule weight (APN, dependent variable) and the percentage of seafloor nodule coverage (NC-S, independent variable), are presented in Table 1, and their empirical distributions are presented in graphical form in Figure 3.
The values of all statistical measures of central tendency (average, median, 20% trimmed mean) differ only slightly within both the APN and NC-S sets (Table 1). The variability of both parameters (APN and NC-S), measured by the coefficient of variation, is similar (32-35%) and can be described as moderate. Standardized skewness and kurtosis are within the range expected for data from a normal distribution (from −2 to 2) [36]. The more precise normality test (Shapiro-Wilk test) did not provide grounds for rejecting the hypothesis of the normality of the distribution of the nodule abundance at the significance level of 0.05. An examination of the empirical distributions of APN and NC-S using box-and-whisker plots did not show the presence of outliers, which allowed us to consider both datasets as homogeneous (Figure 3).

Table 1. Statistics of the nodule abundance (APN) in the box core and the percentage of seafloor nodule coverage (NC-S) determined from the photographs. Column headings: Statistics; APN (kg/m²); NC-S (%).
The other two independent variables used in GLM were categorical (ordinal) and defined as factors with a certain number of assigned levels. In the case of the sediment coverage of nodules (SC), four levels of this factor were distinguished based on a visual assessment of the seafloor photographs (Figure 4), with numbers from 1 to 4 assigned as identifiers in the order of increasing degree of coverage: low coverage (1), medium coverage (2), high coverage (3), and very high coverage (4).
The nodules occurring in the CCZ are most commonly classified into three genetic types [37,38]:
• H (hydrogenetic): small nodules up to 3 cm [37] or 4 cm [38] in diameter, most frequently spheroidal and with smooth surfaces;
• HD (hydrogenetic-diagenetic): nodules intermediate in size (by convention, from 3 to 6 cm in diameter) with a smooth upper surface and a rough lower surface, predominantly ellipsoidal, flattened, and plate-shaped;
• D (diagenetic): large nodules, 6-12 cm in diameter, predominantly discoidal and ellipsoidal in shape and with rough surfaces.
Based on the scaled seafloor photographs, it is possible to determine the dominant fractions of nodules, and thus, with high probability, the genetic type of the nodules [28]. In the IOM area, the correctness of the determination of the genetic type of nodules dominant in the photograph or in its part (in the case of only locally increased sediment coverage of nodules) is usually not in doubt. With regard to genotypes (GT), three levels of the factor were distinguished based on the nodule fractions dominating in the bottom photographs (Figure 5A): 1-H (hydrogenetic), 2-HD (hydrogenetic-diagenetic), and 3-D (diagenetic). Figure 5B presents the nodule fraction distributions for the different dominant genetic types of nodules using three box core samples as examples. To illustrate the specificity of the fraction distributions of a given genetic type, mean fraction distributions averaged over 8 (H), 17 (HD), and 43 (D) of all 68 box core samples in the H22 exploration block were used (Figure 5B). The fraction distributions of nodules for different dominant genetic types usually have characteristic shapes (Figure 5B): strongly skewed to the right (H), moderately skewed to the right (HD), and close to symmetric (D). This factor (GT), related to the differentiation in nodule sizes, directly translates into the value of nodule abundance (APN), which is confirmed by a strong positive nonlinear correlation between the weight (mass) of the nodules and their surface area [28,29].
Owing to the specificity of both ordinal variables (SC and GT), their statistical description was limited to the number of observations at the individual factor levels (Table 2, Figure 6).
The SC factor levels are dominated by moderate sediment coverage (level 2), which constitutes 50% of all observations, while the GT factor levels are dominated by the diagenetic type D (level 3), slightly exceeding 63% (Figure 6, Table 2).

Methods
The general linear models (GLM) procedure is used to construct a statistical model describing the impact of any set of explanatory variables (X), quantitative (continuous) or qualitative (categorical), on one or more dependent variables (Y). An independent categorical variable (nominal or ordinal) is called a factor, and its categories are called the levels of the factor [39].
In its simplest form, GLM determines the (linear) relationship between one dependent (response) variable Y and a set of predictors (explanatory variables) $X_i$ and is expressed by the general formula:
$$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k \tag{1}$$
where $b_0$ is the intercept and $b_1, \dots, b_k$ are the partial regression coefficients. This particular case of a general linear model, restricted to one dependent variable, was used in the research. In terms of the obtained results, the method is equivalent to the multiple regression model.
For use in regression, a categorical variable with k categories (levels) must be transformed (coded) into a set of (k-1) indicator variables, also known as dummy variables [40]. An example of such a transformation is given in the caption of Table 3. Each of the determined regression models was tested for its statistical significance by calculating the so-called p-value. When the p-value is ≤ 0.05, the model can be considered statistically significant with a risk of error not greater than 5%.
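The coding of a factor with k levels into (k-1) indicator variables can be illustrated with a minimal sketch. The helper name `dummy_code`, the choice of level 1 as the reference level, and the sample GT values are assumptions made for illustration only; the coding actually used in the study is the one described in the caption of Table 3.

```python
import numpy as np

def dummy_code(levels, n_levels):
    """Code a factor with levels 1..n_levels into (n_levels - 1) indicator columns.

    Level 1 is treated as the reference level (all zeros); this is an
    assumption of the sketch, not necessarily the coding used in the study.
    """
    levels = np.asarray(levels, dtype=int)
    if levels.min() < 1 or levels.max() > n_levels:
        raise ValueError("levels must lie in 1..n_levels")
    # Column j (0-based) equals 1 where the observation has level j + 2.
    return (levels[:, None] == np.arange(2, n_levels + 1)[None, :]).astype(float)

# Hypothetical GT observations: 1 = H, 2 = HD, 3 = D
gt = [1, 2, 3, 3]
gt_dummies = dummy_code(gt, n_levels=3)
print(gt_dummies)
```

With three GT levels the factor contributes two columns to the design matrix, and an observation of the reference level is represented by zeros in both.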
In addition, five measures were determined to indicate the model's goodness of fit (the strength of the relationship):
• The adjusted coefficient of determination $R^2_{adj}$, which expresses the percentage of the variability in the dependent variable explained by the fitted model, ranging from 0% (no dependency) to 100% (ideal, full relationship), adjusted for the number of coefficients in the model:
$$R^2_{adj} = 100\left[1 - \frac{n-1}{n-p}\cdot\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}\right]\% \tag{2}$$
where $n$ is the number of observations, $p$ is the number of estimated model coefficients, $\hat{y}_i$ is the theoretical value of the dependent variable Y determined from the model equation for observation $i$, $y_i$ is the empirical value of the dependent variable Y for observation $i$, and $\bar{y}$ is the arithmetic mean of the empirical values of the dependent variable Y.
• The standard (prediction) error of estimation (SEE), characterizing the average scatter of the measured values of the dependent variable around the regression model:
$$SEE = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{n-p}} \tag{3}$$
• The mean absolute error (MAE), characterizing the mean absolute deviation of the measured Y values from the values indicated by the model:
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right| \tag{4}$$
• The mean percentage error (MPE):
$$MPE = \frac{100\%}{n}\sum_{i=1}^{n}\frac{y_i-\hat{y}_i}{y_i} \tag{5}$$
• The mean absolute percentage error (MAPE):
$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right| \tag{6}$$
For comparison purposes, simple linear models (SLM) of the general form:
$$Y = b_0 + b_1 X \tag{7}$$
linking the nodule abundance (Y) to the percentage of nodule coverage of the seafloor (NC-S) and of the box corer sample (NC-T), and to their natural logarithms, were also analyzed.
Logarithmically transforming variables in a regression model is a very common way of handling situations where a non-linear relationship exists between the independent and dependent variables [41]. The use of the natural logarithm in the analyzed cases resulted from preliminary studies, which showed a slightly stronger correlation of APN with ln(NC-S) than with NC-S (Figure 7).
The determination of simple linear models allowed a preliminary assessment of the improvement in the accuracy of APN prediction using GLM. The quality of all models was verified using the cross-validation method for randomly selected training and test data subsets.
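The five goodness-of-fit measures listed above can be computed as in the following minimal sketch. It assumes the conventional definitions (residuals taken as observed minus fitted values, and n − p degrees of freedom, where p counts all estimated coefficients including the intercept); the function name `fit_measures` and the numbers in the example are hypothetical and are not data from the study.

```python
import numpy as np

def fit_measures(y, y_hat, p):
    """Goodness-of-fit measures for a fitted regression model.

    y     : observed values of the dependent variable
    y_hat : values predicted by the model
    p     : number of estimated coefficients (intercept included; assumption)
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    resid = y - y_hat                      # observed minus fitted
    sse = np.sum(resid ** 2)               # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return {
        "R2_adj": 100.0 * (1.0 - (n - 1) / (n - p) * sse / sst),  # percent
        "SEE": np.sqrt(sse / (n - p)),
        "MAE": np.mean(np.abs(resid)),
        "MPE": 100.0 * np.mean(resid / y),                        # percent
        "MAPE": 100.0 * np.mean(np.abs(resid / y)),               # percent
    }

# Hypothetical observed and fitted abundances (kg/m^2), two coefficients
y_obs = [10.0, 12.0, 14.0, 16.0]
y_fit = [11.0, 11.0, 15.0, 15.0]
m = fit_measures(y_obs, y_fit, p=2)
print(m)
```

Note that MPE keeps the sign of the residuals, so positive and negative deviations cancel, whereas MAPE and MAE do not; this is why the two percentage errors can differ greatly for the same model.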

Results and Discussion
In accordance with the adopted methodology for APN prediction, the first step was to develop GLM equations, which, for the variables selected for the study, have the following form:
$$\text{APN} = b_0 + b_1\cdot(\text{NC-S}) + b_2\cdot(\text{SC}) + b_3\cdot(\text{GT}) \tag{8}$$
and
$$\text{APN} = b_0 + b_1\cdot\ln(\text{NC-S}) + b_2\cdot(\text{SC}) + b_3\cdot(\text{GT}) \tag{9}$$
where $b_0$ is the intercept, $b_1$ to $b_3$ are the partial regression coefficients, NC-S is the seafloor nodule coverage (continuous variable) (%), SC is the level of sediment coverage of nodules (ordinal variable), and GT is the genetic type of nodules (ordinal variable); the NC-S, SC, and GT values were determined from photographs of the bottom taken at (or near) the box corer sampling sites. The continuous and ordinal variables used in the GLM analysis, along with the adopted levels, are shown schematically in Figure 2.
For the analyzed variables, the equations of the simple linear models (SLM) are as follows:
$$\text{APN} = b_0 + b_1\cdot(\text{NC-S}) \tag{10}$$
and
$$\text{APN} = b_0 + b_1\cdot\ln(\text{NC-S}) \tag{11}$$
where $b_0$ is the intercept, $b_1$ is the slope, and NC-S is the seafloor nodule coverage (continuous variable) (%). For comparative purposes, identical variants of regression models were determined for the data obtained from the grid photographs (e.g., Figure 5):
• GLM:
$$\text{APN} = b_0 + b_1\cdot(\text{NC-T}) + b_2\cdot(\text{GT}) \tag{12}$$
and
$$\text{APN} = b_0 + b_1\cdot\ln(\text{NC-T}) + b_2\cdot(\text{GT}) \tag{13}$$
• SLM:
$$\text{APN} = b_0 + b_1\cdot(\text{NC-T}) \tag{14}$$
and
$$\text{APN} = b_0 + b_1\cdot\ln(\text{NC-T}) \tag{15}$$
where NC-T is the nodule coverage of the grid.
The analysis of the results contained in Table 3 allows us to make a number of observations about the H22 IOM exploration block, which are presented below.
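How an SLM and a GLM of the kinds described above can be fitted and compared may be sketched as follows. Everything in this example is an assumption made for illustration: the data are synthetic (generated with invented coefficients, not the H22 measurements), the ordinal factors are dummy-coded with level 1 as the reference, and ordinary least squares via `numpy.linalg.lstsq` stands in for whatever statistical package was used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 68                                      # same count as the study; values are synthetic
nc_s = rng.uniform(10.0, 60.0, n)           # synthetic seafloor nodule coverage, %
sc = rng.integers(1, 5, n)                  # synthetic SC levels 1..4
gt = rng.integers(1, 4, n)                  # synthetic GT levels 1..3
# Synthetic abundance built from invented coefficients plus noise
apn = 2.0 + 2.5 * np.log(nc_s) + 2.0 * (gt - 1) + rng.normal(0.0, 1.5, n)

def indicators(levels, n_levels):
    """(n_levels - 1) dummy columns; level 1 is the reference (assumption)."""
    return (levels[:, None] == np.arange(2, n_levels + 1)).astype(float)

def fit(X, y):
    """Ordinary least squares; returns coefficients and residual sum of squares."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return b, float(resid @ resid)

ones = np.ones((n, 1))
X_slm = np.hstack([ones, np.log(nc_s)[:, None]])                  # APN ~ ln(NC-S)
X_glm = np.hstack([X_slm, indicators(sc, 4), indicators(gt, 3)])  # + SC + GT

b_slm, sse_slm = fit(X_slm, apn)
b_glm, sse_glm = fit(X_glm, apn)

# Standard error of estimation with n - p degrees of freedom
see_slm = np.sqrt(sse_slm / (n - X_slm.shape[1]))
see_glm = np.sqrt(sse_glm / (n - X_glm.shape[1]))
print(f"SEE (SLM): {see_slm:.2f}  SEE (GLM): {see_glm:.2f}")
```

Because the SLM is nested in the GLM, the residual sum of squares of the GLM can never exceed that of the SLM; whether SEE and the adjusted coefficient of determination also improve depends on how much the added factors explain relative to the degrees of freedom they consume. With dummy coding, each factor contributes (k-1) coefficients rather than the single coefficient suggested by the compact notation of the model equations.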
The use of GLM in place of the simple linear regression model (SLM), for both grid and seafloor photographic data, leads in all cases to a significant increase in the accuracy of the APN prediction, as evidenced by the increase in $R^2_{adj}$ for the grid and seafloor photographs by over 20% and approx. 50%, respectively, and by reduced SEE, MAE, MPE, and MAPE values (e.g., SEE by about 1.0 kg/m² for the grid photographs and about 1.7 kg/m² for the seafloor photographs). The improvement in modeling quality is also visually confirmed by the plots of empirical and theoretical relationships shown in Figure 8.
The linear relationship between APN and NC-T (or ln(NC-T)) is much stronger (with $R^2_{adj}$ of 50-60%) than between APN and NC-S (or ln(NC-S)) (with $R^2_{adj}$ of 15-25%). This can be explained, first of all, by the lack of sediment covering the nodules in the box corer after drainage and by the identical surface area for which the abundance (APN) and the nodule coverage (NC-T) are determined (Figure 5).
The prediction of nodule abundance (APN) based on GLM with a high value of $R^2_{adj}$ = 61% (and 70% for ln(NC-S)) is associated with SEE values of 2.7 and 2.5 kg/m², respectively, and MAE values of 2.1 and 1.9 kg/m². These values, related to the average nodule abundance (13.5 kg/m²), represent 20.0% (and 18.5% for ln(NC-S)) for SEE and 15.6% (and 14% for ln(NC-S)) for MAE.
Replacing NC-S with its natural logarithm clearly improves the quality of the approximation of the empirical relationship by the regression models, as shown by an increase in $R^2_{adj}$ of about 8-10% accompanied by a corresponding decrease in SEE and MAE. The opposite effect is observed for NC-T, since the use of the natural logarithm of NC-T in the simple linear model results in a decrease in $R^2_{adj}$ of about 7% and a corresponding increase in SEE and MAE.
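The relative errors quoted above for the seafloor GLM follow from dividing SEE and MAE by the mean nodule abundance. The short check below uses only the rounded figures given in the text (13.5 kg/m² and the four error values); the dictionary keys are labels chosen for this sketch.

```python
mean_apn = 13.5  # mean nodule abundance, kg/m^2 (from the text)

# SEE and MAE of the GLM with NC-S and with ln(NC-S), kg/m^2 (from the text)
errors = {
    "SEE, NC-S": 2.7,
    "SEE, ln(NC-S)": 2.5,
    "MAE, NC-S": 2.1,
    "MAE, ln(NC-S)": 1.9,
}

# Express each error as a percentage of the mean abundance
relative = {name: round(100.0 * value / mean_apn, 1) for name, value in errors.items()}
for name, pct in relative.items():
    print(f"{name}: {pct}%")
```

The last value comes out at 14.1% from the rounded inputs, which agrees with the 14% given in the text to the nearest percent.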
Surprisingly, the addition of the ordinal variable SC to the GLM does not result in a significant increase in the accuracy of the APN prediction, which is confirmed by the overlapping confidence intervals for APN determined for particular levels of this factor (Figure 9). This indicates that the factor has little influence on the accuracy of determining the APN from the model. This is presumably due to a strong, statistically significant negative correlation between the sediment coverage of nodules (SC) and the seafloor nodule coverage (NC-S), with Kendall's rank correlation coefficient of −0.57 (Figure 10).
Theoretical measures of the accuracy with which the applied models (GLM, SLM) approximate the empirical dependence are not fully sufficient to definitively confirm or reject their usefulness for APN prediction.
To strengthen the conclusions about the effectiveness of the obtained regression models, the cross-validation procedure was applied to independent data subsets. Although the dataset (of 68 observations) is large enough to determine a reliable form of the model of APN dependence on the analyzed factors, it is not large enough to reliably verify the quality of models for independent subsets of data distinguished within it. Nevertheless, such verification was performed by randomly selecting three subsets of 48 observations from the basic set, which were treated as training sets and for which the forms of three regression models (SLM, GLM, and GLM with ln(NC-S)) were determined. These models were used to estimate the nodule abundance in the remaining data subsets, consisting of 20 observations each, treated as test sets (Figure 1C).
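The splitting scheme described above can be sketched as follows. This is only an illustration of drawing three random 48/20 partitions of 68 observations; the function name and the seeds are hypothetical, and the actual subsets used in the study are those shown in Figure 1C.

```python
import random

def split_train_test(n_obs, n_train, seed):
    """Randomly partition observation indices into training and test subsets."""
    rng = random.Random(seed)        # seeded so that the split is reproducible
    idx = list(range(n_obs))
    rng.shuffle(idx)
    return sorted(idx[:n_train]), sorted(idx[n_train:])

# Three variants of the split (seeds are arbitrary, hypothetical choices)
variants = [split_train_test(68, 48, seed) for seed in (1, 2, 3)]

for number, (train, test) in enumerate(variants, start=1):
    print("variant", number, "training:", len(train), "test:", len(test))
```

Because each variant is drawn independently from the same 68 observations, the training and test subsets of different variants partially overlap.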
The verification of the quality of the models for the training sets and their usefulness for predicting the abundance of nodules in the test sets consisted of the following:

• Determining the statistical significance (using the p-value) of the linear relationship between the nodule abundance predicted from the models fitted to the training data and the real nodule abundance in the test sets, and the strength of this relationship (using the adjusted coefficient of determination R 2 adj);
• Determining the mean difference (MD) and the mean absolute difference (MAD) between the nodule abundance predicted from the model and that found in the test sets.
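The validation measures listed above can be sketched as follows. The sign convention for MD (predicted minus observed) and the expression of MAD as a percentage of the mean observed abundance are assumptions made for illustration; the function name and the sample values are hypothetical, and the p-value of the relationship is omitted because it requires the t distribution, which is not available in the Python standard library.

```python
def validation_measures(predicted, observed):
    """MD, MAD and adjusted R^2 of the linear relation between
    predicted and observed nodule abundance (kg/m^2)."""
    n = len(observed)
    diffs = [p - o for p, o in zip(predicted, observed)]
    md = sum(diffs) / n                      # systematic error
    mad = sum(abs(d) for d in diffs) / n     # random error
    mp = sum(predicted) / n
    mo = sum(observed) / n
    spp = sum((p - mp) ** 2 for p in predicted)
    soo = sum((o - mo) ** 2 for o in observed)
    spo = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    r2 = spo ** 2 / (spp * soo)              # squared Pearson correlation
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - 2)
    return {
        "MD": md,
        "MAD": mad,
        "MAD_percent": 100 * mad / mo,       # relative to mean observed abundance
        "R2_adj": r2_adj,
    }

# Made-up illustrative values, NOT survey data
measures = validation_measures([5.0, 7.0, 9.0, 11.0], [6.0, 6.0, 10.0, 10.0])
print(measures)
```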
The results of the validation of the training models presented in Table 4 fully confirm the previous observations made for the complete initial data set (Table 3).
Table 4. Estimation errors of the nodule abundance in three test subsets (of 20 observations each) based on three regression models determined from the training data subsets (of 48 observations each); the location of training and test subsets is shown in Figure 1C. Explanations: APN-arithmetic mean of the nodule abundance in the test subset, R 2 adj -the adjusted coefficient of determination, MD-mean difference, MAD-mean absolute difference, NC-S-seafloor nodule coverage, GT-genetic type of nodule, SC-sediment coverage.

The linear relationships between the nodule abundance estimated from the training models and that found in the test data subsets are highly statistically significant for GLM (p-value < 0.001). For SLM, they are statistically significant in only two cases, and only at a significance level of α = 0.05 (test subsets 1 and 3); in one case (test subset 2), there is no basis to reject the hypothesis that no such relationship exists.
The adjusted coefficients of determination R 2 adj of the linear relationships between the found nodule abundance and that estimated using SLM are many times lower than those obtained using GLM.
The use of a more advanced general linear model (GLM) instead of a simple linear model (SLM) to predict the nodule abundance in the test subsets leads to a significant reduction in the random prediction error represented by the MAD value. These values, expressed as a percentage of the average nodule abundance in the test subsets, are 25% to 28% for SLM, 15% to 19% for GLM, and 13% to 17% for GLM with ln(NC-S). With one exception, the MD values, used as a measure of systematic prediction error, are also lower for GLM than for SLM.
Despite the limitations of the validation method used, related to the small number of test sets and the partial overlap of data in the three variants of both training and test sets (Figure 1C), the obtained results are unequivocal and confirm the usefulness of GLM for predicting nodule abundance with the use of ordinal variables, in particular GT, which indirectly characterizes the nodule fraction distribution.

Conclusions
The results of using general linear models (GLM) to predict the nodule abundance based on seafloor photographs of the H22 exploration block (IOM) can be considered promising in terms of increasing the accuracy of prediction. The advantage of GLM is the possibility of including in the regression model both quantitative continuous variables (seafloor nodule coverage) and ordinal variables (the dominant size of nodules related to their genetic type, the degree of sediment coverage of nodules). All these variables can be quantified, albeit with different accuracy, from seafloor photographs. The nodule coverage of the ocean floor (visible in photographs with areas ranging from 1.5 m 2 to 5 m 2 , depending on the technique used and the conditions of photographic recording), estimated with the use of computer programs, is subject to error resulting mainly from the at least partial coverage of nodules with sediments, with which it is negatively and strongly correlated.
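The way a general linear model combines a continuous predictor with ordinal factors can be illustrated with a minimal sketch. It assumes that the ordinal variables are entered as categorical factors coded by 0/1 indicator variables, with the first level as the reference, and that the coefficients are obtained by ordinary least squares; this coding, the level labels ("A", "B", "C" for GT and 0, 1, 2 for SC), the function names, and the data values are all hypothetical and do not reproduce the model fitted in the study.

```python
GT_LEVELS = ["A", "B", "C"]   # hypothetical labels for genetic types
SC_LEVELS = [0, 1, 2]         # hypothetical sediment coverage levels

def design_row(ncs, gt, sc):
    """Intercept, continuous NC-S, then 0/1 indicators for every
    factor level except the first (reference) level."""
    row = [1.0, float(ncs)]
    row += [1.0 if gt == lv else 0.0 for lv in GT_LEVELS[1:]]
    row += [1.0 if sc == lv else 0.0 for lv in SC_LEVELS[1:]]
    return row

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Made-up observations (NC-S in %, GT, SC, abundance in kg/m^2), NOT survey data
obs = [(20, "A", 0, 4.0), (35, "A", 1, 5.5), (50, "A", 2, 6.0),
       (25, "B", 0, 9.0), (40, "B", 1, 10.5), (15, "B", 2, 7.0),
       (30, "C", 0, 14.0), (10, "C", 1, 10.0), (45, "C", 2, 13.5)]

X = [design_row(ncs, gt, sc) for ncs, gt, sc, _ in obs]
y = [apn for _, _, _, apn in obs]
beta = fit_ols(X, y)

# Predicted abundance for a photograph with NC-S = 30 %, GT = "B", SC = 1
prediction = sum(b * v for b, v in zip(beta, design_row(30, "B", 1)))
print(beta, prediction)
```

The indicator coding lets each level of GT and SC shift the predicted abundance by its own amount, without assuming equal spacing between ordinal levels.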
Determining the values of the two qualitative variables on an ordinal scale (GT and SC) requires some experience with the visual assessment and the analysis of photographs. This approximate visual assessment, however, seems to be sufficient to significantly increase the reliability of the prediction of the nodule abundance. Compared to the simple linear model linking the nodule abundance found in the box corer with the seafloor nodule coverage estimated based on the photographs, the use of GLM leads to a significant increase in the accuracy of the nodule abundance estimates. For the analyzed data set, it is expressed by a significant increase in the adjusted coefficient of determination (from 15.4% to 70.4%) and by a significant reduction in the dispersion measures around both models: for SEE from 4.2 kg/m 2 to 2.5 kg/m 2 and for MAE from 3.6 kg/m 2 to 1.9 kg/m 2 .
The accuracies obtained using GLM may seem not entirely satisfactory, but one should take into account the factors that affect the quality of modeling: the difference between the horizontal area sampled by the box corer and the area of the ocean floor covered by the photograph, the local variability of nodule abundance and of sediment coverage, and the uneven quality of photographs resulting from the variable distance from the seafloor and its uneven illumination. Because the analyzed dataset is limited and comes from only a fragment of the nodule-bearing area in the Pacific administered by the IOM, the obtained results require verification on a larger dataset.
The use of a different number of levels of sediment coverage of nodules and photographic patterns facilitating the appropriate categorization of this factor should also be considered.
The final confirmation of the usefulness of GLM for the prediction of nodule abundance on the basis of data obtained from the photographic survey of the seafloor will enable the geostatistical estimation of nodule resources, e.g., with the use of kriging with the variance of measurement errors [42] integrating the measurements made with different accuracies: higher in the case of the box corer data and lower in the case of the data from photographs.