SREVAS: Shading Based Surface Refinement under Varying Albedo and Specularity

Abstract: Shape-from-shading and stereo vision are two complementary methods for reconstructing 3D surfaces from images. Stereo vision can reconstruct the overall shape well but is vulnerable in texture-less and non-Lambertian areas, where shape-from-shading can recover fine details. This paper presents a novel, generic shading based method to refine the surface generated by multi-view stereo. Different from most shading based surface refinement methods, the new development does not assume ideal Lambertian reflectance, known illumination, or uniform surface albedo. Instead, specular reflectance is taken into account, while the illumination can be arbitrary and the albedo can be non-uniform. Surface refinement is achieved by solving an objective function in which the imaging process is modeled with spherical harmonics illumination and specular reflectance. Our experiments are carried out using images of indoor scenes with obvious specular reflection and of outdoor scenes with a mixture of Lambertian and specular reflections. Compared to surfaces created by current multi-view stereo and shape-from-shading methods, the developed method can recover more fine details with lower omission rates (6.11% vs. 24.25%) in the scenes evaluated. The benefit is more apparent when the images are taken with low-cost, off-the-shelf cameras. It is therefore recommended that a general shading model consisting of varying albedo and specularity be used in routine surface reconstruction practice.


Introduction
Reconstruction of 3D surfaces from multi-view images has been of great interest in recent decades. The combination of structure from motion (SfM) [1][2][3] and multi-view stereo (MVS) [4][5][6][7] can reconstruct the 3D shape of a scene from multiple images. SfM estimates the camera parameters, including interior orientation parameters (focal length, principal point position and lens distortion parameters) and exterior orientation parameters (camera locations and orientations), while MVS attempts to reconstruct the 3D shape by searching for corresponding pixels or other features across images. According to [8], the major challenges for current MVS algorithms are texture-poor objects, thin structures, and non-Lambertian surfaces. MVS reconstruction also becomes harder when the images are captured under varying illumination conditions. In these cases, reliable and accurate correspondences are difficult to establish and fine details cannot be well restored.
Unlike MVS, shape-from-shading (SfS) [9] can recover detailed shape from a single image under proper assumptions, for example, known illumination, constant surface albedo and an ideal reflectance model. By modeling the imaging process as an interaction of illumination, surface albedo and surface normal, SfS recovers the detailed surface shape by determining the surface normal (vector). Since the effects of illumination, surface albedo and surface normal are multiplied, the illumination and the surface albedo are often considered known or constant to make the model solvable. Photometric stereo, a variation of the SfS method, makes use of images taken at a fixed location but under varying illumination conditions [10,11].
Considering the above reasons, SfS and MVS are complementary to each other. Several methods using shading to refine the surface generated by MVS have been proposed [12][13][14][15][16]. Generally, MVS is used to create the initial surface and then shading is applied to refine the initial surface. Besides keeping the illumination and surface albedo constant [17,18], the surface reflection model is often assumed to be Lambertian, which is an ideal diffuse reflection model that reflects equally in all directions. However, in reality, a lot of objects violate the Lambertian assumption and the imaging conditions might be very different such that the SfS methods mentioned above would underperform. There are several methods considering the specular reflectance in recent years [19][20][21][22]. Nehab et al. [19] considered the dense reconstruction of specular objects with controllable light sources. Or-El et al. [22] proposed an early depth refinement framework that explicitly accounts for specular reflectance when refining the depth of specular objects with an infrared image. Liu et al. [20] considered the specular reflectance when using SfS and improved the quality of 3D reconstruction of dynamic objects captured by a single camera.
Based on the above observation, we intend to achieve surface refinement under varying albedo and specularity (SREVAS). To this end, we explicitly model the specular reflectance by adding a specular component to the Lambertian reflectance model. The method will be implemented to refine surfaces with diverse multi-view images, including Internet images. SREVAS allows using images of a scene with non-uniform albedo under arbitrary illumination, which means an image may have one or more light sources and the illumination conditions of different images can be different. Furthermore, surfaces with considerable specular reflectance can be recovered due to the use of a comprehensive and realistic reflectance model. Based on the physics and geometry of the imaging process, we introduce an objective function that considers the surface normal, surface albedo and illumination. The method is tested with four benchmark datasets: the DTU Robot Image dataset [23] with multiple light sources (indoor), a synthetic dataset of Joyful Yell [24] with only pure Lambertian reflectance, a multi-view stereo dataset with ground truth [25] (outdoor), and an Internet photo dataset [26] with very different illumination conditions (outdoor). Experiments show that the proposed method can significantly recover more surface details in all these cases than the recently reported shading based surface refinement and reconstruction methods, namely, MVIR [24] and SMVS [27].
The rest of the paper is organized as follows: Section 2 briefly reviews the related work. Section 3 formulates the proposed SREVAS method, while Section 4 presents the experiments on all the datasets. Section 5 concludes our work.

Related Works
Considering the underlying surface models, MVS algorithms can be roughly divided into four types [5]: voxel-based [28,29], deformable polygonal mesh-based [30,31], depth map-based [32][33][34], and patch-based [35][36][37] methods. The objective of MVS is to find the relevant pixels of the same object in multiple images and to reconstruct the surface. As stated above, high-frequency surface details may not be well recovered by MVS algorithms in texture-less or non-Lambertian surfaces since the image similarity is difficult to determine in those areas without using a prior. The use of particle swarm optimization can achieve better accuracy and robustness in texture-poor or specular surfaces compared to other MVS algorithms [38]. However, the detailed surfaces are still hard to be well recovered with MVS, especially when the images are captured under very different illumination conditions.
In contrast to MVS, SfS recovers the surface normal through modeling the imaging process with one image or multiple images captured in a fixed position under varying illumination conditions. By recovering the surface based on the (surface) normal field, SfS can achieve better surface details since they are intrinsically embedded in the surface normal. However, SfS is an ill-posed problem. As such, assumptions about the illumination conditions and surface albedo are usually imposed to make the problem solvable. For example, uniform albedo and known illumination were assumed [39,40]. In recent years, the requirements for illumination conditions have relaxed, but the surface albedo still needs to be uniform [41,42]. The photometric stereo algorithm is another way to relax the restriction in traditional SfS by capturing multiple images at a fixed position under different lighting conditions [10]. Due to recent efforts [43][44][45][46], photometric stereo methods can handle un-calibrated natural lighting and non-Lambertian reflectance. For example, the introduction of spherical harmonics [47,48] allows un-calibrated natural lighting. The use of non-Lambertian reflectance models, for example, microfacet-based reflectance model can help photometric stereo to deal with highly specular surfaces [49].
MVS requires a well-textured surface, whereas SfS generally deals better with texture-less surfaces. It is therefore of great interest to take advantage of these two complementary methods to best reconstruct the surface. Wu et al. refine the initial MVS surface based on shading under un-calibrated illumination represented by spherical harmonics [47]. To achieve the refinement, their method assumes that the albedo is constant and that the illumination is fixed and distant. Other researchers have tried to decompose the reflectance from shading [50] and to reconstruct the surfaces of texture-less objects by combining photo-consistency and shading [51,52]. However, the reflectance model is assumed to be Lambertian. There are also methods based on photometric stereo [44,53,54,55]. Nehab et al. [56] proposed a method that can effectively recover the detailed surface with the normal determined by photometric stereo. However, the images used in photometric stereo need to be captured at a fixed position under varying illumination conditions, which is hard to achieve, especially under natural illumination. To relax this limitation, instead of using the original images, Shi et al. [57] used images created from a fixed position under varying illumination conditions by 3D warping the depth map generated from SfM and MVS. With the development of RGB-D sensors, SfS is also used to refine the depth map [58][59][60][61][62][63]. To reconstruct the surface in high quality, the visual hull is used to constrain partial vertices [64]. Although structured light can be used to reconstruct detailed surfaces well with proper consideration of the surface normal [65], it needs specific equipment, which limits its applications.
Kim et al. [24] proposed a method that refines the initial surface from MVS under arbitrary illumination and albedo by solving an imaging model represented by spherical harmonics. However, it can only recover the surface detail under the Lambertian assumption. To better recover the surface, there is a need to explicitly model both specular and diffuse reflectance, especially when the surface materials exhibit a mixture of both. In recent years, several methods that directly add shading to the image matching procedure have been proposed [27,51,66]. Similarly to Kim et al. [24], they all assume Lambertian reflectance and are prone to fail when specular reflectance exists.
To sum up, many methods that use SfS to refine the initial surface from MVS or RGBD images have been proposed [16,17,53,60]. However, most of the previous methods can only refine the surface under limited conditions such as uniform albedo [16], known illumination [67], constant illumination [18] or Lambertian reflectance [24]. Different from the previous methods, the proposed method can recover more detailed surfaces under specularity and varying albedo by extending the existing imaging models.

The SREVAS Method
As shown in Figure 1, given the multiple input images, the camera parameters are first estimated using SfM and an initial surface is reconstructed with MVS. The initial surface is then refined with the proposed SREVAS method. In our experiments, we use VisualSFM [68] to estimate the camera parameters and CMPMVS [6] to recover the initial surface as a mesh model. To ensure sufficient point density in texture-less areas, the initial surface is densified by recursively subdividing the triangles in the mesh until every triangle is no larger than a preset maximum size. By modeling the imaging process, the rendered image intensity can be calculated from the illumination and the albedo and normal of the surface. Since a wrong shape increases the inconsistencies between the observed and rendered image intensities, the surface can be refined by solving an objective function consisting of data terms and regularization terms on the illumination and the albedo and normal of the surface. This section describes the proposed SREVAS method in detail, including the imaging model, the data term, the geometry term, the diffuse reflectance smoothness term and the specular reflectance smoothness term.
When modeling the imaging process, the reflectance model has to be considered. Most previous works [18,24,69] assume that the reflectance is perfectly diffuse, in other words, the Lambertian assumption. However, this assumption is often violated. Instead, we assume that the reflectance is a mixture of diffuse and specular reflection, and add a specular component to the Lambertian reflectance model to account for the properties of non-Lambertian surfaces. For the Lambertian part of the reflectance model, we follow Basri and Jacobs' work [47] and approximate the illumination with second-order spherical harmonic basis functions. This representation is particularly suitable for complex illumination and has been commonly used [18,24,70].
Based on the considerations described above, the imaging process is modeled as:
$$I_i = R_i \sum_{k=1}^{N} L_k \, h_k(n_x, n_y, n_z) + S_i \qquad (1)$$
where $I_i$ is the pixel value corresponding to the i-th vertex, $R_i$ is the per-vertex albedo (also often denoted $\rho_i$), $(n_x, n_y, n_z)$ is the per-vertex unit normal (vector) at the i-th vertex, $h_1$ to $h_9$ are the spherical harmonic bases, $L_1$ to $L_9$ are the per-image coefficients of the spherical harmonic bases, or simply lighting coefficients, and $N$ is the number of lighting coefficients. In our experiments $N$ is 9 since second-order spherical harmonic bases are used. $S_i$ is the specular component, which can vary across vertices and images. To allow varying illumination conditions, the lighting coefficients vary from image to image and the specular components vary from pixel to pixel. At the same time, allowing different surface locations to have different albedos makes the proposed method more general for various complex scenes. Based on the imaging model above, we build our objective function with the surface albedo, surface normal, lighting coefficients and specular component.
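The imaging model can be illustrated with a minimal Python sketch. It assumes that the rendered intensity is the albedo times a nine-term spherical harmonic shading plus an additive specular term, as the definitions above suggest. The ordering of the basis functions and the omission of their constant normalization factors are assumptions of this sketch (the constants can be absorbed into the lighting coefficients), and the function names `sh_basis` and `render_intensity` are hypothetical.

```python
def sh_basis(normal):
    """Nine second-order spherical harmonic basis values for a unit normal.

    Assumption: unnormalized basis in one common ordering; constant factors
    are left out because they can be absorbed into the lighting coefficients.
    """
    x, y, z = normal
    return [
        1.0,                # order 0 (ambient)
        x, y, z,            # order 1
        x * y, x * z, y * z,
        x * x - y * y,
        3.0 * z * z - 1.0,  # order 2
    ]


def render_intensity(albedo, normal, lighting, specular=0.0):
    """Rendered value: albedo * sum_k L_k * h_k(normal) + specular."""
    basis = sh_basis(normal)
    if len(lighting) != len(basis):
        raise ValueError("expected nine lighting coefficients")
    shading = sum(l * h for l, h in zip(lighting, basis))
    return albedo * shading + specular
```

For example, with purely ambient lighting (`[1.0] + [0.0] * 8`), an albedo of 0.5 and a specular component of 0.1, the rendered value is 0.5 · 1 + 0.1.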
$$\min_{G, R, L, S} \; E_{data} + \alpha E_{gsm} + \beta E_{rsm} + \gamma E_{ssm} \qquad (2)$$
where $G$ is the geometry (position) displacement of each vertex along its normal; $R$ is the albedo of the vertex, which varies from band to band (red, green and blue in our experiments); $L$ is the set of lighting coefficients in Equation (1); and $S$ is the specular component. The weights α, β and γ balance the data term $E_{data}$ against the geometry smoothness term $E_{gsm}$, the diffuse reflectance smoothness term $E_{rsm}$ and the specular reflectance smoothness term $E_{ssm}$. The objective function essentially determines the best set of $G$, $R$, $L$ and $S$ under the constraints of rendering difference, geometry smoothness, diffuse reflectance smoothness and specular reflectance smoothness. The data term $E_{data}$ measures the difference between the observed and rendered pixel values (intensities).
where $m$ is the number of vertices, $V_i$ is the set of cameras in which the i-th vertex is visible, $I^{o}_{i,c}$ is the observed pixel value of the i-th vertex in image $c$, $O_i$ is the initial position of the vertex, $I^{r}_{i,c}$ is the rendered pixel value and $N_{V_i}$ is the number of images in $V_i$. The visibility of every vertex is computed with ray-triangle intersections between the ray from a camera to the vertex and all the triangles in the mesh. If the ray from a camera to the vertex is not occluded by any other triangle in the mesh, the vertex is regarded as visible in that camera. $I^{o}_{i,c}$ is re-computed whenever the vertex displacement $G_i$ changes. $I^{r}_{i,c}$ is defined in Equation (1) from the surface albedo, surface normal, illumination conditions and specular component. The normal $(n_x, n_y, n_z)$ of a vertex is computed from the vertex position $O_i$ and its displacement $G_i$ by averaging the normals of the faces adjacent to the vertex, and the spherical harmonic bases $h_1$ to $h_9$ are recomputed accordingly.
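The visibility test described above can be sketched as follows. The text only states that ray-triangle intersections are used; the choice of the Möller-Trumbore intersection test, the brute-force loop over all triangles, and the names `segment_hits_triangle` and `is_visible` are assumptions made for this illustration.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def segment_hits_triangle(origin, target, triangle, eps=1e-9):
    """Moller-Trumbore test restricted to the open segment origin -> target."""
    v0, v1, v2 = triangle
    direction = _sub(target, origin)      # not normalized: t = 1 is the target
    e1, e2 = _sub(v1, v0), _sub(v2, v0)
    p = _cross(direction, e2)
    det = _dot(e1, p)
    if abs(det) < eps:                    # segment parallel to the triangle
        return False
    inv = 1.0 / det
    s = _sub(origin, v0)
    u = _dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return False
    q = _cross(s, e1)
    v = _dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return False
    t = _dot(e2, q) * inv
    # Hits at the target itself (faces incident to the vertex) do not occlude.
    return eps < t < 1.0 - eps


def is_visible(camera_center, vertex, triangles):
    """A vertex is visible if no triangle blocks the camera-to-vertex segment."""
    return not any(segment_hits_triangle(camera_center, vertex, tri)
                   for tri in triangles)
```

A production implementation would normally replace the linear scan with a spatial acceleration structure such as a bounding volume hierarchy; the linear scan is kept here only for clarity.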
The geometry term E gsm encourages the surface to be smooth. To this end, we calculate the weighted mean distance between the vertex and its neighbor vertices.
where $A_i$ is the set of vertices adjacent to the i-th vertex, $(O_i + G_i)$ are the new coordinates of the i-th vertex, $(O_{i,j} + G_{i,j})$ are the new coordinates of the j-th vertex adjacent to the i-th vertex, $N_{A_i}$ is the number of vertices in the set $A_i$ and $l_i$ is the average edge length between the adjacent vertices and their centroid. The weight $w^{gsm}_{i,j}$ is a bilateral filter [71] weight computed from the pixel value difference and the vertex coordinate difference. This smoothness term with the bilateral filter weight is set to encourage the surface to be smooth while preserving sharp edges in images. Different from the usage of the bilateral filter in image filtering, the filter kernel is defined over the neighbors of the vertex instead of a regular window in image space.
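A bilateral weight defined over mesh neighbors rather than an image window can be sketched as below. The product-of-Gaussians form, the bandwidths `sigma_color` and `sigma_space`, and the function names are assumptions of this sketch; the exact weight used in the method is not reproduced here.

```python
import math


def bilateral_weight(color_i, color_j, pos_i, pos_j,
                     sigma_color=0.1, sigma_space=1.0):
    """Weight that decays with both color difference and spatial distance."""
    dc2 = sum((a - b) ** 2 for a, b in zip(color_i, color_j))
    dp2 = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    return math.exp(-dc2 / (2.0 * sigma_color ** 2)
                    - dp2 / (2.0 * sigma_space ** 2))


def neighbor_weights(i, adjacency, colors, positions, **sigmas):
    """Weights between vertex i and its mesh neighbors (the filter kernel)."""
    return {
        j: bilateral_weight(colors[i], colors[j],
                            positions[i], positions[j], **sigmas)
        for j in adjacency[i]
    }
```

Neighbors with a similar color receive a weight close to the purely spatial one, so the surface is smoothed across them, while a large color difference (an image edge) drives the weight toward zero and lets a geometric discontinuity survive.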
Since the albedo of the surface is allowed to be varying in our objective function, we set a diffuse reflectance smoothness term to better separate it from the lighting coefficients. To decompose the ambiguity, in other words, separating the albedo from shading, the diffuse reflectance smoothness term E rsm is calculated based on the assumption often used in intrinsic image decomposition [72] that vertices having similar albedo should have similar color values in each input image.
,c is the mean color value in all visible images of the i-th vertex and k is a constant value.
Inspired by Liu et al. [20], the specular reflectance smoothness term E ssm is set to prevent the rendered value from being considered only as specular and to encourage the specular component to be spatially smooth.
where $w_{s1}$ and $w_{s2}$ are constant values. Additionally, to resolve the illumination scale ambiguity, we select a dominant camera that has the largest view frustum and constrain the squared sum of its lighting coefficients to be one, similarly to [24].
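The scale ambiguity arises because multiplying all lighting coefficients by a factor and dividing the albedo by the same factor leaves the diffuse rendering unchanged. The sketch below illustrates this with a simple rescaling to unit squared sum. Note that this is an after-the-fact rescaling chosen for illustration, whereas the method above imposes the unit constraint inside the optimization; the function name `normalize_dominant_lighting` is hypothetical.

```python
import math


def normalize_dominant_lighting(lighting):
    """Rescale lighting coefficients so that their squared sum equals one.

    Returns the rescaled coefficients and the scale that was removed, which
    has to be multiplied into the albedo to keep the rendering unchanged.
    """
    scale = math.sqrt(sum(l * l for l in lighting))
    if scale == 0.0:
        raise ValueError("lighting coefficients are all zero")
    return [l / scale for l in lighting], scale
```

For instance, coefficients `[3.0, 4.0, 0, ...]` with albedo 0.2 and coefficients `[0.6, 0.8, 0, ...]` with albedo 1.0 produce the same diffuse intensities, and only the second pair satisfies the unit constraint.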
In the objective function, four types of variables: the lighting coefficients, the surface albedo, the specular components and the vertex displacements are meant to model the imaging process. Through optimizing the objective function described above by the Levenberg-Marquardt implementation from the Ceres Solver [73], the best sets of the four types of variables are determined and the surface can be refined since the positions of vertices will be updated by their displacements. The lighting coefficients are the same for one image while the albedo and the vertex displacement of one surface point are the same in different images. For the specular components, they can be different for different surface points and images. In addition, the constraints on the rendering difference, the geometry smoothness, diffuse reflectance smoothness and specular reflectance smoothness make the objective function solvable and robust.
Similarly to many shading based surface refinement methods [16,24,64], we achieved surface refinement through optimization of an objective function of data terms and regularization terms. The data term is measured with the rendering difference while the geometry smoothness and reflectance smoothness are added as constraints to keep the surface smooth while preserving sharp edges in images. The main difference is that the proposed method considers the specular, a common phenomenon in real-world situations. Due to the introduction of the specular component into the proposed method, a new specular reflectance smoothness constraint is designed to robustly solve the objective function. In [24], the geometry smoothness constraint is an image intensity weighted local surface curvature, while we use the edge preserving bilateral filter kernel as the weight.

Experiments and Discussion
Two groups of experiments were designed. The first was meant to evaluate the effectiveness of the specular component and the robustness of its solution. For this purpose, we excluded all the specular related terms from the proposed method, yielding a weaker SREVAS without specularity, named SREVA, and used it to refine the initial surface. This was tested with the DTU Robot Image dataset [23] and the synthetic dataset of Joyful Yell [24]. The DTU dataset was collected in the laboratory (indoor) with several controllable light sources and has obvious specular reflections in some images. As such, it is suitable for evaluating the effectiveness of the specular component. The experiment on the Joyful Yell dataset was designed to evaluate the robustness of SREVAS, since the dataset is synthetic (computer-generated) with perfectly Lambertian reflectance and no specular reflection.
The second group of tests was meant to understand the performance of the SREVAS method compared with the initial surface reconstruction method CMPMVS, and two representative shading based surface refinement and reconstruction methods, MVIR [24] and SMVS [27]. MVIR can recover detailed surfaces with the SfS technique under arbitrary illumination and albedo, while SMVS combines stereo and SfS in a single optimization scheme. As for the experiment data, we use the Herz-Jesu-P8 [25] and an Internet dataset [26]. All the images of the Herz-Jesu-P8 and Internet datasets are captured under real-world (outdoor) conditions different from the datasets in the first group of experiments. The Herz-Jesu-P8 dataset has calibrated camera parameters and a ground truth model. The images were captured under nearly the same illumination condition. In contrast, there are no calibrated camera parameters and ground truth models for the Internet dataset, where the images were captured under very different illumination conditions.
The DTU dataset, synthetic dataset and Herz-Jesu-P8 dataset have camera parameters. CMPMVS can, therefore, be directly applied to recover the initial surface. For the Internet dataset, SfM is used to estimate the camera parameters first. After the initial surface is recovered with CMPMVS, MVIR, our SREVAS and SREVA are applied to refine it, respectively. For MVIR, an executable program provided by Kim et al. [24] is used. Since the optimal parameters for the synthetic and Internet datasets in MVIR were provided in Kim et al.'s work [24], MVIR is only applied to the synthetic and Internet datasets in our study. For SMVS [27], the source code is provided and it is applied to the Herz-Jesu-P8 and Internet datasets with the default parameters.

Specular Component in SREVAS
Firstly, the Buddha model in the DTU dataset is used to evaluate the effectiveness of the specular component in the proposed method. According to [23], the dataset was generated under seven different lighting conditions from 49 or 64 positions in the laboratory. Calibrated camera parameters and ground truth points generated by structured light scanning were provided. The scene we choose contains 64 images of 1600 × 1200 pixels. To evaluate the effectiveness of our specular component, we choose 7 images with some specular areas shown in Figure 2. As shown in Figure 3, the initial surface (second column) generated by CMPMVS lacks the fine and sharp structures and is over-smoothed. In the area with specular reflection, SREVA (third column) creates many artifact details, whereas SREVAS (fourth column) can keep the surface smooth as shown in the second row. However, both SREVA and SREVAS can recover fine details in many areas such as the one shown in the third row. This result demonstrates that the proposed method can not only recover fine details but also keep smooth in specular reflectance areas. To quantitatively evaluate the surfaces from CMPMVS, SREVA and SREVAS, we use the algorithm provided by Jensen et al. [23]. The results are evaluated based on the accuracy and completeness [23], where the accuracy is measured as the distance from the results to the structured light reference and the completeness is measured from the reference to the results. The mean value, median value and root mean square value of the distances are computed. Table 1 shows that the proposed SREVAS method performs the best compared to its specular-free SREVA version and CMPMVS, yielding an improvement of 1.1-12.8% in position accuracy compared to the initial surface. The improvement is calculated through dividing the accuracy difference between SREVAS and CMPMVS by the accuracy of CMPMVS. 
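The accuracy and completeness statistics and the relative improvement described above can be sketched as follows. This is a simplified brute-force illustration on raw point lists, not the evaluation code of Jensen et al. [23], which includes further steps that are not modeled here; all function names are hypothetical.

```python
import math
import statistics


def nearest_distances(source, target):
    """Distance from every source point to its nearest target point."""
    return [min(math.dist(p, q) for q in target) for p in source]


def summarize(distances):
    """Mean, median and root mean square of a list of distances."""
    return {
        "mean": statistics.fmean(distances),
        "median": statistics.median(distances),
        "rms": math.sqrt(statistics.fmean([d * d for d in distances])),
    }


def accuracy(result, reference):
    """Distances measured from the result to the reference."""
    return summarize(nearest_distances(result, reference))


def completeness(result, reference):
    """Distances measured from the reference to the result."""
    return summarize(nearest_distances(reference, result))


def relative_improvement(baseline_error, refined_error):
    """(baseline - refined) / baseline, e.g. CMPMVS versus SREVAS."""
    return (baseline_error - refined_error) / baseline_error
```

With a baseline error of 2.0 and a refined error of 1.5, `relative_improvement` returns 0.25, that is, a 25% improvement.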
Dropping the specular component from the SREVAS model causes quality deterioration in the resultant surfaces. This demonstrates the necessity of considering the specular component for surface refinement.
To evaluate the influence of the specular component in our approach when the actual reflectance is perfectly Lambertian (i.e., no specular reflectance exists), the following test is designed. We use the dataset created by [24] with the well-known synthetic surface model "Joyful Yell". According to [24], a total of 37 input images of 2048 × 1536 pixels were generated with CG rendering software (https://www.blender.org/). As shown in Figure 4, each image was rendered under a single-color but randomly generated light source, while the albedo of the model was colored to be non-constant with the same CG software. For the reflectance model, the object was assumed to have a perfect Lambertian surface. As shown in Figure 5, the results from the multi-view stereo method CMPMVS lack fine details in the face, ear, hair and clothes of the model, while MVIR and the proposed method, with or without the specular component, can recover fine and sharp details in those places. In addition, the CMPMVS surface is very rough, while the results of the proposed method and MVIR are smooth. As mentioned above, it is hard for MVS to find good correspondences in texture-less areas. Moreover, since the images were generated under very different illumination conditions, the poor performance of CMPMVS is not unexpected. For the proposed method, with or without the specular component, shading is used to recover the fine details and the geometry constraint is applied to keep the surface smooth. It should be noted that our model is general; in other words, the specular component is included to demonstrate the capability and generality of our approach.
The results demonstrate that our solution technique is quite stable even when the model has certain redundancy, such as the specular reflection parameters. The potential effect of over-parameterization caused by introducing the specular component is minimal and can be ignored in practice. To quantitatively evaluate the results of the different methods, the depth and surface normal of all the results and the ground truth are computed. Examples of the relative depth errors (depth errors divided by the mean depth of the ground truth) and surface normal errors are shown in Figures 6 and 7. As shown in Figure 6, in both the smooth area (the left small square) and the area with many detailed shapes (the right small square), MVIR, SREVA and SREVAS significantly reduce the depth errors, with the result of SREVA being slightly better. For the surface normal, Figure 7 shows that MVIR, SREVA and SREVAS greatly improve the accuracy of the surface normal from CMPMVS in regions with either many detailed shapes (the left small square) or smooth shapes (the right small square). The result demonstrates that our method can refine the surface normal (i.e., surface shape) and improve the accuracy of the surface at the same time.
The overall root mean square (RMS) of the relative depth errors and normal errors is computed over all images and shown in Table 2. The table shows that MVIR and both versions of the proposed method clearly improve the accuracy in both depth and surface normal compared to the input surface generated by CMPMVS. SREVAS improves the depth accuracy by 10.1% and the normal accuracy by 17.6%. Since the reflectance model of the surface is assumed to be pure Lambertian, SREVA achieves the highest accuracy in depth and performs slightly better than SREVAS, whose result is still acceptable and slightly better than that of MVIR. Figure 8 further shows the cumulative distribution of depth errors of the surface reconstructed by CMPMVS and refined by MVIR, SREVA and SREVAS. It can be observed that all methods increase the number of accurate pixels, while SREVA and SREVAS are better than MVIR, especially when the relative depth error is less than 0.4%.
The experiments above have shown that the proposed SREVAS method can recover fine details that CMPMVS missed and improve the accuracy of the surface, whether the surface reflection model is a mixture of Lambertian and specular reflection or perfectly Lambertian. SREVA can perform slightly better when the actual surface reflection model is set to be pure Lambertian, but performs poorly when there is a mixture of Lambertian and specular reflections. Therefore, SREVAS is the better and more reliable choice unless we are certain that the surface reflection model is pure Lambertian.
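The two error measures used in this comparison can be sketched in a few lines. The relative depth error follows the definition given above (depth error divided by the mean ground truth depth). Measuring the normal error as the angle between the two normals is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def rms_relative_depth_error(depth, depth_gt):
    """RMS of (depth - ground truth) / mean ground truth depth."""
    mean_gt = sum(depth_gt) / len(depth_gt)
    squared = [((d - g) / mean_gt) ** 2 for d, g in zip(depth, depth_gt)]
    return math.sqrt(sum(squared) / len(squared))


def normal_angle_deg(normal, normal_gt):
    """Angle in degrees between an estimated and a ground truth normal."""
    dot = sum(a * b for a, b in zip(normal, normal_gt))
    norms = (math.sqrt(sum(a * a for a in normal))
             * math.sqrt(sum(b * b for b in normal_gt)))
    cosine = max(-1.0, min(1.0, dot / norms))  # guard against rounding
    return math.degrees(math.acos(cosine))
```

As a usage example, depths of 10.1 and 9.9 against a ground truth of 10.0 at both pixels give an RMS relative depth error of about 0.01, that is, 1%.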

Performance of SREVAS
To further evaluate the performance of the proposed method, experiments on the Herz-Jesu-P8 dataset are first conducted. The images of the Herz-Jesu-P8 dataset [25] were taken under natural illumination with accurate camera parameters. The dataset contains 8 images of 3072 × 2048 pixels. The ground truth 3D model was obtained using accurate light detection and ranging (laser scanning). Figure 9 shows, from the top row, the ground truth, the initial surface generated by CMPMVS, the surface refined by SREVAS and the surface reconstructed by SMVS. The illumination conditions of all the images in the Herz-Jesu-P8 dataset are nearly the same; as such, the overall shape is reconstructed well by CMPMVS, whereas some sharp details are over-smoothed or missed. As for SMVS, the surface is over-smoothed and there are many void areas. In contrast, Figure 9 shows that the surface recovered by SREVAS has more fine details in the left and right squares, and the shapes recovered by SREVAS represent the ground truth significantly better. As shown in Table 3, SREVAS yields a slight improvement (0.4%) in the accuracy of depth compared to the initial surface from CMPMVS. As for the surface normal, the proposed method also shows an improvement (2.5%) compared to the initial surface. SMVS combines shading with MVS in a single optimization scheme and achieves the best accuracy of depth and normal among the three methods. However, as shown in Figure 9, there are many void areas in the surface reconstructed by SMVS because it discards many poor points after reconstruction [27]. Therefore, we calculate the omission rate, which is the ratio of the number of pixels missing from the reconstructed surface to the number of pixels in the ground truth. As shown in Table 3, the omission rate of SMVS is the highest (24.25%). SREVAS achieves the best balanced performance in terms of details, position accuracy, normal accuracy and omission rate.
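The omission rate can be sketched as follows, assuming that both surfaces have been rendered into per-pixel depth lists for the same view and that `None` marks a pixel without surface. This representation and the function name `omission_rate` are assumptions made for the illustration.

```python
def omission_rate(depth_gt, depth_rec):
    """Share of ground truth pixels that have no reconstructed surface.

    Pixels without ground truth are ignored, so extra surface outside the
    ground truth neither helps nor hurts this measure.
    """
    pairs = [(g, r) for g, r in zip(depth_gt, depth_rec) if g is not None]
    if not pairs:
        raise ValueError("ground truth contains no valid pixels")
    missing = sum(1 for _, r in pairs if r is None)
    return missing / len(pairs)
```

A lower value means fewer void areas; the measure complements the depth and normal accuracies, which are computed only where a reconstructed surface exists.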
These results show that SREVAS depends strongly on the initial surface. Although a more detailed shape can be recovered by SREVAS, as shown in Figure 9, the accuracy in depth and normal has not improved enough to be comparable with SMVS. Below we present a qualitative evaluation using the Fountain-P11 dataset. Similar to the Herz-Jesu-P8 dataset, the images of the Fountain-P11 dataset [25] were also taken under natural illumination with accurate camera parameters. The dataset contains eleven images of 3072 × 2048 pixels. Figure 10 shows the reconstructed surfaces, including the ground truth (from laser scanning), the initial surface from CMPMVS, the surface refined by SREVAS and the surface reconstructed by SMVS, respectively. The overall shapes are well reconstructed by all three methods. However, there are apparent void areas in the surface reconstructed by SMVS, which also produces an over-smoothed surface without many details, whereas the surface from CMPMVS is noisy. In contrast, the results from SREVAS are sharper and more similar to the ground truth than those of CMPMVS and SMVS.
After experimenting on the datasets with camera parameters, we also evaluate the proposed SREVAS on an Internet dataset without camera parameters. As shown in Figure 1a, the Internet dataset used is the Yorkminster in the 1DSfM dataset [26]. Similarly to the work of [24], the same 9 images are used for our experiment. As we can see from Figure 10, the illumination conditions among the 9 images are very different. Since there are no camera parameters for the images, VisualSFM is first used to estimate the camera parameters and then CMPMVS is applied to recover the initial surface for further refinement. Figure 11 shows the initial surface from CMPMVS, the surfaces refined by MVIR and SREVAS and the surface reconstructed by SMVS.
Given the large differences in illumination conditions and the absence of camera parameters, it is hard to achieve a fine surface with CMPMVS, which leaves room to refine its result. Both MVIR and the proposed SREVAS clearly improve the details of the surface compared to the initial surface. However, SREVAS clearly outperforms MVIR, for example in the places marked with red rectangles. Compared to MVIR, the shapes refined by SREVAS are much closer to those seen in the images. In the last red rectangle, MVIR recovers some details but also generates many spurious details, whereas the shapes recovered by SREVAS are similar to the shapes in the original images. The unsatisfactory performance of MVIR can be explained by the fact that the Lambertian assumption is often violated and there are many shadows in the images. For the proposed SREVAS, the modeling of specular reflectance is more practical and can better recover the detailed shape even when shadows are visible. As for SMVS, some detailed shapes are reconstructed compared to the surface from CMPMVS; however, there are still many void areas, similar to the result on the Herz-Jesu-P8 dataset. In contrast, SREVAS recovers much more detailed shapes than SMVS, as shown in the zoom-in views of the red rectangles.

Susceptibility of the Parameters
In our experiments, the parameters are set following the basic rule that the terms in the objective function should be of similar magnitude, and in most cases only α and β are tuned. Therefore, the susceptibility of the results to α and β is evaluated here with the DTU dataset. As shown in Table 4, the accuracy and completeness [23] of the recovered surface do not change much with different α and β, which means that the method is not sensitive to these parameters.

Runtime
The proposed method is implemented in C++ with two external dependencies: Ceres Solver [73] and OpenCV [74]. We experiment on a standard Windows 10 computer with an Intel Xeon CPU and 64 GB of memory, without GPU optimization. As shown in Table 5, SREVAS has the highest computational cost (1.3×-3.0×) for the Herz-Jesu-P8 dataset, since our framework has the largest number of variables and no acceleration is applied. Nevertheless, SREVAS produces the most points (1.2-3.1× more) among all three methods, which is necessary to achieve fine details in surface reconstruction. It should also be noted that, in terms of runtime per point, the computational efficiency is about the same across the three methods.
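The runtime-per-point normalization mentioned above is simple arithmetic, sketched below. The method names and numbers are placeholders chosen for illustration only; they are not the measurements reported in Table 5, and the helper `runtime_per_point` is hypothetical.

```python
def runtime_per_point(runtime_seconds, num_points):
    """Normalize total runtime by the number of reconstructed points."""
    if num_points <= 0:
        raise ValueError("num_points must be positive")
    return runtime_seconds / num_points

# Placeholder values only (NOT taken from Table 5): method_b takes twice
# as long as method_a but also produces twice as many points.
measurements = {
    "method_a": (600.0, 1_000_000),   # (total runtime in seconds, number of points)
    "method_b": (1200.0, 2_000_000),
}

per_point = {
    name: runtime_per_point(seconds, points)
    for name, (seconds, points) in measurements.items()
}
```

With such inputs the two hypothetical methods have the same cost per point, which is the sense in which a slower method that outputs a denser surface can still be considered equally efficient.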

Limitations
Several limitations are observed for SREVAS. First, it has a model bias, since spherical harmonics assume distant lighting and convex objects. Nevertheless, our experience shows that SREVAS can still refine the results from multi-view stereo in such cases, although without achieving its best performance. Furthermore, the low omission rate of surface reconstruction with SREVAS can sometimes come at the price of relatively lower geometric accuracy compared to SMVS, where shape-from-shading and multi-view stereo are combined in a single optimization scheme. Finally, the performance of SREVAS depends on the quality of the initial surface. As shown in Figure 12, some details on the surface cannot be well recovered due to occlusions and shadows in the input images. Similarly, when the input images are taken under nearly the same illumination and can be well reconstructed by image matching, SREVAS can only slightly improve the position and normal accuracy, despite being able to yield more fine details in texture-weak regions.
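The model bias noted above can be made concrete with a small sketch of spherical harmonics shading. It assumes a second-order (nine-coefficient) basis written in the unnormalized polynomial form that is common in shading-based refinement, with the normalization constants absorbed into the lighting coefficients; the exact order and parameterization used by SREVAS are those given in the method description, not this sketch. The function names are hypothetical.

```python
import numpy as np

def sh_basis_order2(normals):
    """Second-order spherical harmonics basis evaluated at surface normals.

    Assumption: unnormalized polynomial form; constant factors are
    absorbed into the lighting coefficients.
    """
    n = np.asarray(normals, dtype=float)
    n = n / np.linalg.norm(n, axis=-1, keepdims=True)  # enforce unit length
    x, y, z = n[..., 0], n[..., 1], n[..., 2]
    return np.stack(
        [np.ones_like(x), x, y, z,
         x * y, x * z, y * z,
         x * x - y * y, 3.0 * z * z - 1.0],
        axis=-1,
    )

def sh_shading(normals, coeffs):
    """Shading as a linear combination of the nine basis functions."""
    return sh_basis_order2(normals) @ np.asarray(coeffs, dtype=float)
```

Because the shading depends only on the normal and not on the 3D position, two points with the same normal receive the same value wherever they lie. This is what the distant-lighting assumption amounts to, and it is also why cast shadows and inter-reflections on non-convex objects are not captured.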

Conclusions
We have proposed SREVAS, a shading-based surface refinement method that can be used for reconstructing surfaces with varying albedo and specular reflection. Starting from an imaging model that combines spherical harmonics illumination with specular reflectance, we use an objective function to refine the initial surface generated by MVS. All our experiments demonstrate that this method can refine the surface initially created from multi-view image matching. It is shown that, under varying illumination conditions, SREVAS can improve the accuracy by up to 10.1% in surface position and 17.6% in surface normal, with a lower omission rate, compared to the initial surface. More specifically, our investigations lead to the following concluding remarks.
For ideal Lambertian surfaces, i.e., surfaces with no specular reflection (e.g., the synthetic dataset Joyful Yell), SREVAS can still recover fine details with high accuracy, though its final results are slightly worse than those from SREVA, which best fits the data under these circumstances. Since some specular reflection is always present in real scenes, it is recommended that SREVAS be used as common practice.
For scenes with obvious specular reflections (e.g., the DTU dataset) or scenes with a mixture of Lambertian and specular reflections (e.g., the Herz-Jesu-P8 dataset), SREVAS can recover realistic surface details while preserving the smoothness of the reconstructed surface. In contrast, ignoring the specular component leads to many artifacts in the reconstructed surface. The study demonstrates the necessity and effectiveness of the specular component for shape-from-shading.
With an appropriate illumination model and an effective solution technique, shading is able to improve the surface resulting from multi-view image matching, especially in the presence of specular reflection and weak texture.
When there are no accurate camera parameters for the input images (e.g., the Yorkminster dataset), the proposed method generates surfaces that are significantly better than those of some existing methods, such as CMPMVS and SMVS. This finding suggests that the shape-from-shading technique, in general, can contribute more to surface reconstruction from images taken with low-cost, off-the-shelf cameras.
It should be noted that the proposed method assumes the same overall lighting for all pixels in one image. Future work may extend the current model to consider patch-wise lighting conditions within an image. In addition, further investigation into how to handle shadows and occlusions in the images is necessary.