Quality Assessment of 3D Printed Surfaces Using Combined Metrics Based on Mutual Structural Similarity Approach Correlated with Subjective Aesthetic Evaluation

: Quality assessment of the 3D printed surfaces is one of the crucial issues related to fast prototyping and manufacturing of individual parts and objects using the fused deposition modeling, especially in small series production. As some corrections of minor defects may be conducted during the printing process or just after the manufacturing, an automatic quality assessment of object’s surfaces is highly demanded, preferably well correlated with subjective quality perception, considering aesthetic aspects. On the other hand, the presence of some greater and more dense distortions may indicate a reduced mechanical strength. In such cases, the manufacturing process should be interrupted to save time, energy, and the ﬁlament. This paper focuses on the possibility of using some general-purpose full-reference image quality assessment methods for the quality assessment of the 3D printed surfaces. As the direct application of an individual (elementary) metric does not provide high correlation with the subjective perception of surface quality, some modiﬁcations of similarity-based methods have been proposed utilizing the calculation of the average mutual similarity, making it possible to use full-reference metrics without the perfect quality reference images, as well as the combination of individual metrics, leading to a signiﬁcant increase of correlation with subjective scores calculated for a specially prepared dataset.


Introduction
Additive manufacturing, also referred to as the 3D printing, is one of the key technologies which revolutionizes the small series production in the Industry 4.0 era. The use of the 3D printers makes it possible not only to create some original 3D objects for entertaining purposes but also to launch an individual production of some unique parts of machines and other devices used to replace some damaged older elements. Some other areas of applications of additive manufacturing technology, utilizing plastic filaments usually based on polyactic acid (PLA) or acrylonitrile butadiene styrene (ABS), may be related to biomedical engineering (e.g., individual prosthesis), aerospace and automotive solutions, civil engineering and architecture (e.g., concrete 3D printing), reverse engineering in industry or even the protection of cultural heritage. An integration with 3D scanners enables making custom 3D CAD models and copies of various elements quite easily. Despite a growing popularity of 3D printing for home use, some important limitations should be considered, such as particle emissions [1,2], especially using the ABS filaments.
Regardless of the great progress in the use of machine vision for the monitoring and quality assurance of the 3D printed objects, an automatic visual surface assessment is still a challenging task, particularly correlated with subjective aesthetic perception of the 3D prints. Hence, in this paper, the adaptation of some similarity-based general-purpose full-reference image quality assessment (FR IQA) methods has been proposed together with their combination and optimization towards high correlation with subjective opinions. The experiments have been conducted using a specially prepared database containing 107 images of the 3D printed flat surfaces affected by the various amounts of distortions together with subjective quality scores gathered from 92 human observers.
Although in practical applications the correlation between the proposed objective quality scores and some mechanical properties would be much more desired, the measurement of some of them, e.g., strength of the materials, would require additional destructive experiments. Therefore, in this paper, we focus mainly on the aesthetic quality assessment, similarly as in general-purpose image quality assessment, based on the accordance with subjective opinions. Nevertheless, the developed methods may be further extended, particularly using the data fusion with some other non-destructive measurements, to find some dependencies between them and some physical properties of the manufactured objects.
The rest of the paper is organized as follows: Section 2 contains an overview of similarity-based IQA methods, whereas the description of the developed dataset is provided in Section 3. The idea of the proposed modification of the FR IQA metrics for the quality assessment of the 3D printed surfaces and their combination (Section 4) is followed by the experimental results in Section 5. The paper ends with a discussion (Section 6) and conclusions (Section 7).

Overview of Similarity-Based Full-Reference Image Quality Assessment Methods
General-purpose image quality assessment methods may be divided into two major groups, namely subjective and objective methods. The main differences between these two generic approaches are related to the assumed application and required assessment time. The usefulness of the subjective methods in many practical applications is limited by the necessary long time spent on gathering the opinions of individual observers. Hence, in practical applications, objective metrics that may be automatically calculated are more desired. Nevertheless, during the development of objective methods, subjective metrics may be used for verification of their correspondence to human perception of various kinds of image distortions, hence they may be considered mainly in view of aesthetic purposes.
The development of subjective methods is a troublesome and time-consuming task as it requires the engagement of many human observers who fill the questionnaires assessing the quality of the presented images containing various kinds and levels of image distortions. After the statistical analysis of their responses, together with outlier rejection and normalization of scores, it is possible to develop a database of images containing a number of images together with appropriate subjective quality evaluations expressed as Mean Opinion Scores (MOS) or Differential MOS values. During the last years, several such databases have been developed for natural images, as well as for stereoscopic images, video sequences, or even multiply distorted images, which are useful for the verification and performance evaluation of the objective image quality metrics. A comprehensive overview of many IQA databases, as well as various types of quality metrics, may be found, i.e., in the recent survey paper [27].
As the practical applicability of subjective methods is strongly limited by the time necessary to acquire the scores, for many tasks, including optimization of image filtering and lossy compression algorithms, the only possibility is the use of objective metrics which may be calculated automatically without the involvement of the human observers. However, such metrics may be further divided into three families: no-reference metrics (NR IQA), also referred to as "blind" metrics, which do not require the knowledge of the original image, reduced-reference metrics using partial reference information and the most popular full-reference metrics (FR IQA) based on the comparison of the distorted images with the "pristine" reference images without any distortions. Such perfect quality reference images are also included in the IQA databases, helpful mainly in the development of some new FR metrics. Despite the potentially wide areas of applications of "blind" metrics, their universality is still much lower in comparison to FR IQA solutions, causing a higher popularity of the latter approaches.
Most of the general-purpose FR IQA methods are based on the assumption that both compared images represent the same scene but one of the images is corrupted by one or more types of distortions, such as, e.g., the presence of noise, blur, lossy compression, transmission errors, etc. Hence, during the comparison of images, any image registering or spatial adjustment operations, including shifting or rotations are not necessary. Assuming the availability of the reference image, many FR metrics have been proposed during recent years, which utilize the similarity calculation between the original and distorted images. A milestone in the development of such metrics has been the idea of Universal Image Quality Index (UIQI) [28], being the first region-based metric without the disadvantages of the conventional pixel-based metrics, such as Mean Squared Error (MSE) or Peak Signal-to-Noise Ratio (PSNR), known as poorly correlated with subjective quality scores. Due to its potential instability, particularly for black or "flat" (with constant luminance) areas of images, the original version of the UIQI has been shortly replaced by the Structural Similarity (SSIM) [29], where the original formula has been extended as where the default values of the stabilizing constants (added in comparison to UIQI) for 8-bit grayscale images are: C 1 = (0.01 × 255) 2 , C 2 = (0.03 × 255) 2 and C 3 = C 2 /2. Their role is to prevent the potential division by zero without affecting significantly the overall quality metric's value, calculated as the average similarity value for all positions of the sliding window. The local structural similarity values are calculated as the product of three factors representing luminance distortions l(x, y), contrast loss c(x, y), and structural distortions s(x, y) between the fragments x and y of images A and B, according to the current position of the sliding window. These similarities are based on the local mean values (x andȳ) of the original and distorted images, the respective variances (σ 2 x and σ 2 y ), and the covariance σ xy . The default window is 11 × 11 pixels Gaussian window, considered as appropriate for typical standard definition images. Since the SSIM metric has been widely accepted in the community, its implementation has appeared in OpenCV and MATLAB R , for example, and many modifications based on similar formulas have been proposed by various researchers.
Considering the postulated independence of results of the image size without the necessity of changing the sliding window's size, the multi-scale version of this method has been developed by the same authors known as MS-SSIM [30], where the image details at different resolutions are assessed applying iteratively a low-pass filter and downsampling by a factor of 2. The contract and structure comparisons are conducted for each scale, whereas the luminance comparison is computed only for the highest scale. The default normalized weighting exponents have been provided after cross-scale calibration experiments.
During the next several years many other SSIM-based metrics have been proposed, which have also been used in the conducted experiments related to this paper. In 2006 the Quality Index based on Local Variance (QILV) has been proposed [31], where a comparison of the local variance distribution is considered as the extension of the use of mean-variance values in the original SSIM. Finally, the following formula has been proposed, assuming that A and B are the input images: where σ V A V B is the covariance between the variances of two images (V A and V B respectively), σ V A and σ V B denote the global standard deviations of the local variance with µ V A and µ V B being the mean values of the local variance.
Another modification proposed by Sampat et al. [32] is based on the use of the wavelet domain. The metric known as Complex Wavelet Structural Similarity (CW-SSIM) utilizes the assumption that the presence of some distortions causes consistent phase changes in the local wavelet coefficients. The metric is also more robust to small rotations and translations in comparison to the spatial domain SSIM. On the other hand, the application of information content weighting for the multi-scale SSIM (referred to as IW-SSIM), proposed in the paper [33], incorporates the Laplacian pyramid transform to calculate the appropriate weights based on the statistical Gaussian model of natural images. A similar information weighting may also be applied for MSE and PSNR metrics, significantly enhancing their correlation with subjective quality scores.
One of the most successful similarity-based metrics, known as Feature Similarity (FSIM), has been proposed in two versions (for grayscale and color images-denoted as FSIMc) [34] and verified as one of the best metrics for most IQA databases. It utilizes a similar formula as the SSIM but is applied for low-level features, namely phase congruency (PC-as a significance measure of a local structure) and gradient magnitude (GM), a complementary feature extracted using the Scharr filter. For simplicity, the equal importance of these two factors (α = β = 1) has been assumed in the similarity formula between the images A and B: where T 1 and T 2 are the stability constants preventing the division by zero and x is the sliding window position. The final formula obtained as the averaged S L for all locations is additionally weighted by the phase congruency, assuming that the Human Visual System (HVS) is highly sensitive to phase congruent structures and may be expressed as where PC m (x) = max(PC A (x), PC B (x)) and x denotes each position of the local window on the image plane A (or B). Its extension for color IQA utilizes the YIQ color space with two chrominance components (I and Q), used as the two other features, weighted using an additional exponential parameter γ used for tuning the importance of the chromatic data. A similar approach based on the comparison of gradient images, known as Gradient Similarity (GSM) has been proposed in the paper [35], where the gradient similarity term, calculated in the same way as the luminance or contrast factor in the SSIM, is combined with the luminance comparison (calculated as the l(x, y) term the Formula (1)). As the authors have assumed that the typical Prewitt, Sobel, or Scharr kernels are too small the gradient extraction has been conducted using the following 5 × 5 pixels mask: rotated by 90 • to obtain four directional masks for gradient calculations. Further metrics inspired by the idea of SSIM include a fast and effective spectral residual similarity (SR-SIM), based on specific visual saliency model [36] correlated with perceived image quality, as well as the Edge Strength Similarity (ESSIM) [37], assuming the edge detection based on Scharr kernel and further calculation of edge strength maps determined from the directional derivatives calculated for each pixel. Obtained edge strength maps for the reference and distorted images are compared using the SSIM-like formula, similarly as for the other metrics. The similarity analysis in the frequency domain has led to the idea of the DCT subband similarity (DSS) proposed by Balanov et al. [38]. This approach considers various importance of structural distortions in different DCT subbands from a perceptual point of view. Hence, the similarity values, obtained for individual subbands according to the formula similar to the other mentioned metrics based on the local variances (i.e., their product divided by the sum of squares with stability constants), are weighted providing the final DSS result.
Alternatively, good performance may also be obtained using the SSIM-like formula for measurement of contrast deviation as proposed in the paper [39], where the multiscale contrast similarity deviation (MCSD) has been defined as another relatively fast IQA metric with high accordance with MOS values for the most relevant IQA databases. An interesting fact is that although the contrast has been used in many other metrics, including the original SSIM, in this case, all the calculations are based only on contrast values combined with multi-scale image representation.
An interesting attempt to pooling following the measurement of local distortions has been presented in the paper [40], where the analysis of distortion distribution has been used to improve the performance of the Structural Similarity and Gradient Similarity (the modified methods are referred to as ADD-SSIM and ADD-GSIM, respectively). The authors of this approach have utilized the ranking-based weighting, frequency variation based on structural degradation measurement, and additional entropy gain multiplier.
Recently, some other metrics being the extensions of the original SSIM have been proposed in the paper [41], which utilizes additional predictability of image blocks. The use of the fourth multiplicative component (block predictability) in the metric referred to as SSIM4 is based on the minimal value of mean squared error between the current block and the neighboring ones. Another modification is related to the use of color components (metrics denoted respectively as CSSIM and CSSIM4) and assumes simultaneous computations of metrics for several image scales in the YCbCr color space. Such obtained partial metrics are then averaged with weights obtained by maximizing the rank-order correlations for three datasets.
Two other recent methods utilize contrast combined with visual saliency and Riesz transform with visual contrast. The first one, known as CVSSI (Contrast and Visual Saliency Similarity-Induced Index), has been proposed [42] by the authors of the above mentioned MCSD metric, as the result of the application of the weighted standard deviation of two quality maps obtained for the local contrast and the global visual saliency. The latter approach [43] called Riesz transform and Visual contrast sensitivity-based feature similarity (RVSIM) utilized a Log-Gabor filter for initial image decomposition followed by Riesz transform. Then, the similarity of local amplitude, phase, and direction characteristics is determined forming the similarity matrices subjected to weighting using the visual Contrast Sensitivity Function (CSF), forming a single similarity matrix. It is further combined with gradient magnitude similarity, calculated for the reference and distorted images in the same way as in the FSIM, and subjected to final pooling leading to the overall RVSIM index.
Many of these metrics, as well as some others, have been recently combined using the neural networks for the assessment of remote sensing images [44] leading to promising results. Nevertheless, the direct application of such an approach for the quality assessment of the 3D prints would be troublesome due to the time necessary for the calculation of several metrics as well as the necessity of network training. Considering the variety of the proposed similarity-based approaches to image quality assessment, as well as their relatively short computation time (analyzed also in [44]), the above-described metrics have been examined in this paper to verify their usefulness for an automatic quality evaluation of the 3D printed surfaces, as well as the possibility of their combination, additionally converting them into no-reference methods due to potential applications in real-time systems.

The Database of the 3D Printed Surfaces
Verification of the usefulness of various objective IQA metrics for the assessment of the 3D printed surfaces requires the development of a special database containing the images of various distorted and high quality manufactured surfaces together with subjective MOS values. To face this challenge some flat samples have been produced from 9 types of thermoplastic ABS filaments with different colors using three available FDM devices, namely RepRap Ormerod 3, Prusa i3, and XYZprinting da Vinci 1.0 Pro 3-in-1. The choice of the ABS filaments has been motivated by their good mechanical properties and lightness, as well as higher abrasion resistance in comparison with PLA polymers. Nevertheless, relatively high melting temperatures-over 220 • C-together with potentially toxic fumes [1,2] have slightly limited the development of the dataset.
Although the database contains the 107 photographs (together with depth maps obtained by a 3D scanner not used in this paper) of the flat surfaces, the metrics investigated in this paper may be successfully applied regardless of the surface flatness or roundness, hence in this sense, the proposed approach may be considered as universal and may be further verified also for some other surfaces. The images have been acquired in the controlled lighting environment (distributed illumination from three lamps to prevent strong reflections) using a Sony DSC-HX100V camera with the exposure time 1/125 s, 5 mm focal length, and an automatic white balance without flash, ensuring a fixed distance to the surface. The size of such obtained images is 1600 × 1600 pixels, being equivalent to the physical size of the samples equal to 35 mm × 35 mm. Their thickness is about 4 mm and the height of layers varies from 0.3 to 0.35 mm depending on the individual printer and the size of the nozzle [4]. The additional depth maps, however not utilized in this paper, have been acquired using the ATOS 3D scanner manufactured by the GOM company (more information may be found in the paper [3]).
Regardless of the influence of some independent factors, such as quality of filament and construction materials, the presence of distortions in some samples have been forced by changing the temperature, configuration parameters of the stepper motors, filament's delivery speed, as well as using the software model changes. Some of the samples contain cracks as well as the results of under-and over-filling. All the photographs have been independently assessed by 92 human observers using the five-element quality scale from 1 (very poor) to 5 (very good). Obtained results have been averaged to provide the MOS values, being additionally verified by comparisons with previously acquired experts' opinions (used for classification purposes in some earlier works). Some sample images together with MOS values are presented in Figure 1, where the MOS values approaching 5 correspond to perfect quality.
As the primary goal of the paper and the developed database is the proposal of the objective method for the quality assessment of the 3D printed surfaces highly correlated with the subjective perception of surface aesthetics, expressed as the MOS values, the mechanical properties of the samples have not been investigated. Nevertheless, the development of an extended database, useful for the analysis of the 3D structure of the manufactured object, supplemented with the results of terahertz or radiographic detection and identification of defects is planned as a part of the future work, considering also some non-planar objects.

Proposed Approach
The idea of the combination of various features is partially utilized in many metrics discussed in Section 2, starting from the Structural Similarity, being in fact a combination of three components, representing the luminance, contrast, and structure. Nevertheless, such combination may also be applied for individual (elementary) metrics instead of features, leading to a significant increase in the correlation with subjective quality evaluations. The most advantageous results may be obtained for the combination of metrics utilizing various kinds of features or different domains, which are somewhat complementary to each other.
The first such attempt for the general-purpose IQA, based on the nonlinear combination of three metrics, similarly as in the construction of the SSIM or FSIM formulas, has been proposed in the paper [45], where the weighted product of MS-SSIM [30], Visual Information Fidelity (VIF) [46] and R-SVD metric [47] has been used. A modified metric referred to as Combined Image Similarity Index (CISI) [48] has been presented assuming the replacement of the R-SVD metric with FSIMc. According to one of the recent papers [49], presenting a comprehensive overview and comparison of many IQA methods, this metric (actually based on two similarity-based formulas) seems to be still competitive with many recently proposed more sophisticated FR IQA approaches, also for multiply distorted image databases, confirming the validity of the assumptions made in this work as well. Hence, the idea of combination of similarity-based metrics is considered as worth investigating also for the 3D printed surfaces.
Some other approaches to multi-method fusion have also been proposed by some other researchers, e.g., based on the regression approach [50], also with the use of machine learning [51], application of neural networks [52], also for remote sensing images [44], or genetic algorithms [53]. Nevertheless, considering the assumed application for the quality assessment of the 3D printed surfaces during the manufacturing process, the use of a limited number of metrics combined using their weighted product is assumed according to the general formula: where N is the number of the weighted elementary metrics.
As the original similarity-based metrics, starting from UIQI and SSIM, belong to the group of full-reference IQA methods, their direct application for the considered tasks would require the knowledge of the reference image. In practical applications, it is usually impossible and even the use of the rendered model of the 3D printed object would require a precise image registering and phase adjustment to compare two images in the same way as for the general-purpose IQA. An additional problem might be related fo different colors and luminance of the acquired images and those rendered from the 3D models. Hence, the idea of the calculation of the average mutual similarity is proposed for the images of the 3D printed surfaces assuming the side location of cameras, initially investigated in the papers [54,55] exclusively for classification purposes. Such side-view mounting of a camera makes it possible to acquire images with visible individual layers of the filament, as illustrated in Figure 1.
The idea of the calculation of the mutual similarity requires the division of the image representing the manufactured sample into regions and the application of the IQA formula for each pair of such obtained blocks. In the conducted experiments the division into 4, 9, and 16 square blocks has been assumed, although the best results have been obtained for 16 blocks (grid of 4 × 4 regions) requiring 120 mutual comparisons. Since the division of the surface into regions is fixed, we may expect the same or very similar orientations of the patters in each region. Hence, due to the use of the mutual similarity based on the comparison of regions, a potential influence of the position and the orientation of the cropping regions would be significantly reduced. Due to the use of the sliding window approach in the SSIM-based quality metrics, the structural changes introduced by such relatively small rotations or translations influence the final results of individual (elementary) metrics not as much as the presence of physical distortions of the 3D printed surface. Additionally, a phase-shifting by a few pixels to adjust the phase of the compared patterns may also be applied as presented in one of our earlier papers [56]. The influence of the color of individual samples has been reduced using the color to grayscale conversion according to ITU Recommendation ITU-R BT.601-7 [57] (using the rgb2gray function in MATLAB R ). The illustration of the idea of the mutual similarity is presented in Figure 2, where two examples for the division into 4 and 9 regions are presented with the necessary 6 and 36 mutual similarity calculations. The number of similarities may be expressed as where M is the number of regions.

Average of 6 mutual similarities for 4 regions
Average of 36 mutual similarities for 9 regions Therefore, the average mutual similarity values obtained for the elementary metrics may be considered as the no-reference equivalents of the FR IQA metrics applied for the quality assessment of the 3D printed surfaces. Since we do not use the "pristine" reference images, we cannot apply FR IQA methods directly. Therefore, we have divided the images into parts (regions) and compare these regions using the FR IQA methods to make a workaround of the lack of the reference images. Hence, formally we can classify our approach as an NR ("blind") IQA approach, although all metrics used internally belong to the group of FR methods.
Such obtained quality scores have been used as the input metrics for the combinations according to Formula (6). The optimized weights have been obtained by the maximization of the Pearson's Linear Correlation Coefficient (PLCC) between the objective and subjective scores (expressed as MOS values in the developed database), representing the prediction accuracy. Since it has been assumed that the nonlinear combination of elementary metrics should compensate the potentially nonlinear perception of distortions, no fitting functions have been used, differently than for the general-purpose IQA being typically applied according to recommendations of the Video Quality Experts Group (VQEG). Additionally, two rank order correlations have been calculated, representing the prediction monotonicity, namely Spearman Rank Order Correlation Coefficient (SROCC) marker as ρ and Kendall Rank Order Correlation Coefficient (KROCC) denoted as τ. They are defined as and τ = n c − n d 0.5 · n · (n − 1) (9) respectively, where n is the number of images, n c and n d are the numbers of concordant and discordant, and d i is the difference between the position of the i-th image in two sequences ordered according to subjective and objective scores. Both rank-order correlations consider only the positions of the images in these sorted vectors ignoring the differences between the perceived and measured quality.

Experimental Results
The procedure of experiments consists of four main steps: production of test samples with various surface quality, subjective perceptual experiments and data analysis, calculation of individual IQA metrics using the assumed division into regions, and combination of metrics and optimization of weights, as illustrated in Figure 3. ...

Metric 3 Weight 3
Combined quality metric Subjective quality assessment Figure 3. Illustration of the experimental procedure.
The first experiments have been conducted assuming the use of individual elementary metrics for all 107 images converted to grayscale and divided into 4, 9, and 16 regions. For each of them, three correlation coefficients have been calculated to identify potentially the most useful metrics for further combinations. The obtained results are presented in Table 1. As may be noticed, in most cases similar values of PLCC and SROCC have been obtained for individual metrics, hence in further experiments, only Pearson's correlation has been used as the optimization criterion. For many metrics the choice of the number of regions does not change the correlation values significantly, hence further experiments have been conducted assuming the division into 16 regions. The best PLCC results obtained for various combinations of two metrics are presented in Table 2 (to save space only the values higher than 0.67 have been presented). As may be seen, in most combination the FSIM has been used, hence it has been used as the "fixed" metric for the combinations of three metrics (only the two others have been changed during the optimization of weights, although the FSIM exponent has been optimized as well). The best PLCC results achieved for such combinations of three metrics are presented in Table 3.  Table 3. The PLCC values obtained for 107 images using 24 "best" combinations of three elementary metrics assuming the division into 16 regions. In view of results presented in Table 3, the best results may be obtained using the combination of FSIM and CW-SSIM with a third metric (preferably MCSD), however good results may also be achieved using FSIM and MCSD as two basic metrics used in the combinations. Considering these results, the application of more metrics has led to an even slightly better correlation with MOS values as presented in Table 4, where the obtained rank order correlations are additionally presented as well for comparison with Table 1. As may be seen, the best PLCC values are not always equivalent to the highest SROCC and KROCC, although the differences may be considered as negligible. The additional illustration of the obtained increase of the correlation with subjective scores is presented in the scatter plots in Figure 4. Although the verification of correlations for the combinations of two or three metrics calculated assuming the division into 4 or 9 blocks has confirmed the validity of the choice of 16 regions (leading to better performance), the combination of four and five metrics for a smaller number of blocks gives an opportunity to obtain better results, particularly considering the prediction monotonicity represented by both rank order correlations, significantly decreasing the computation time as shown in Table 4. Table 4. The correlation coefficients obtained for 107 images using the "best" combinations of four and five elementary metrics assuming the division into 4, 9, and 16 regions.

Discussion
Analyzing the results, obtained by the unconstrained nonlinear optimization od exponent weights using the MATLAB R fminsearch function, which utilizes the simplex search method, additionally verified using gradient-based methods, it may be noticed that the application of an individual elementary metric leads to PLCC and SROCC values not exceeding 0.7 for the best metric (FSIM) regardless of the chosen division into regions. Nevertheless, the application of the combination of even two metrics increases the Pearson's correlation significantly reaching the values over 0.8 for three "best" combinations. However, removing the FSIM metric from the combination decreases this value to only 0.7081, hence Feature Similarity should be considered as the basic metric used in all combinations.
Further increase of the PLCC for the combination of three metrics allows one to specify the other metrics leading to even better performance (CW-SSIM and MCSD, used also in further combinations of four and five metrics) reaching 0.8472 with further increase to 0.8537 for five metrics, also leading to a substantial improvement of the rank order correlations. The increase of the linearity of the relationship between the combined metrics and MOS values may also be observed in Figure 4, particularly for combinations of 3 and 5 elementary metrics. As presented in Table 4, even better results may be achieved limiting the number of blocks for the combinations of four and five individual metrics, obviously with additional advantageous shortening of the overall computation time.
In comparison to previous studies, the obtained results are superior to those presented in the recent conference paper [58], where PLCC = 0.8353 has been obtained for the combination of various approaches, previously applied only for sample classification purposes. However, the methods applied in the aforementioned paper, apart from the use of FSIMc metric, are based on various additional calculations including preprocessing with the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm, entropy calculations [3,4], use of the Histogram of Oriented Gradients (HOG) and-more importantly-entropy of the depth maps obtained using the additional 3D scanning.
Considering the practical applicability of the proposed approach, the use of even a monochrome camera is effortless, being enough to acquire the necessary data without the need of the time-consuming 3D scanning of the samples, which makes the real-time applications almost impossible, or at least expensive and troublesome, especially in low-cost solutions.

Conclusions and Further Work
Presented results confirm the usefulness of the combined IQA metrics for the quality evaluation of the 3D printed surfaces, leading to a high correlation with subjective opinions of the surface aesthetics. The proposed approach to automatic objective quality evaluation of the 3D printed surfaces is based only on the full-reference image quality metrics that rely on the mutual similarity, hence its application is relatively easy since the values of the elementary metrics necessary for the combination may be calculated in parallel. The utilized individual metrics may be effectively calculated [44,49] and therefore the presented approach may be successfully implemented for the real-time quality monitoring during the manufacturing, understood as not introducing noticeable delays in comparison to the relatively slow printing process. An important remark in view of the computation time is the possibility of an increase in the correlation with MOS values using more metrics but decreasing the number of regions.
Application of the proposed combined metrics makes it possible to increase the correlation with the subjective quality evaluation of the 3D printed surfaces by over 0.17 in comparison to the use of the best single metric, considering the prediction accuracy. A similar increase may be observed for SROCC with the increase of KROCC by more than 0.15, providing significantly higher prediction monotonicity.
As the proposed method is color independent, the use of a monochrome camera is enough, although the application of some other color to grayscale conversion methods as well as various color spaces is planned in further research. Another direction of future work is the extension of the proposed method towards the use of some other image quality assessment methods and integration with previously developed methods used so far for classification purposes. Nevertheless, the development of even better-correlated metrics requires additional experiments, as well as the possible extension of the developed dataset using also some additional samples with non-planar surfaces (with the time-consuming acquisition of subjective quality scores for the new samples). Such an extension of the proposed method for the evaluation of non-planar surfaces is possible similarly as for previously investigated entropy-based approaches [4]. Obviously, reasonable results may be expected only for the division into relatively small regions, additionally avoiding the mutual similarity calculations for distant image fragments due to expected different orientation of both compared to local patterns.
Considering some open challenges of the quality control in 3D printing manufacturing analyzed in the papers [59,60], another direction of further planned experiments is related to the integration of the proposed vision-based methods with electromagnetic and radiographic non-destructive testing (NDT) methods, e.g., using terahertz methods, towards the acquisition of the full 3D structure of the manufactured objects in view of their mechanical properties.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

ABS
Acrylonitrile butadiene styrene ADD-GSIM Analysis of distortion distribution-based Gradient SIMilarity ADD-SSIM Analysis of distortion distribution-