A Novel Learning Based Non-Lambertian Photometric Stereo Method for Pixel-Level Normal Reconstruction of Polished Surfaces

Abstract: High-quality reconstruction of polished surfaces is a promising yet challenging task in the industrial field. Due to the extreme reflective properties of such surfaces, state-of-the-art methods have not achieved a satisfying trade-off between retaining texture and removing the effects of specular outliers. In this paper, we propose a learning-based pixel-level photometric stereo method to estimate the surface normal. A feature fusion convolutional neural network extracts features from the normal map solved by the least-squares method and from the original images respectively, and combines them to regress the normal map. The proposed network outperforms state-of-the-art methods on the DiLiGenT benchmark dataset. Meanwhile, we use polished rail welding surfaces to verify the generalization of our method. To fit the complex geometry of the rails, we design flexible photometric stereo information collection hardware with multi-angle lights and multi-view cameras, which can collect the light and shade information of the rail surface for photometric stereo. The experimental results indicate that the proposed method is able to reconstruct the normal of the polished surface at the pixel level with abundant texture information.


Introduction
Surface quality inspection of products is an essential part of industrial manufacturing [1]. In recent years, 2D and 3D machine vision have become the mainstream methods to obtain the surface information of objects, enabling automatic surface defect detection and size measurement of industrial products [2][3][4][5]. However, obtaining information of surface with specular reflection is still a challenging task, especially in the industrial field [6]. Because the reflected light from the specular reflection area is more significant than the corresponding threshold of the camera sensors, the specular reflection area in the image obtained by the traditional 2D vision is overly highlighted and barely contains any useful information. In contrast, 3D vision can obtain the depth of the object surface and reflect the surface characteristics more comprehensively and concisely, thus becoming the commonly used method to replace 2D vision in the industrial field [7]. The mainstream 3D vision in the industrial field includes binocular stereo vision, structured light, time-of-flight (TOF), and so forth, which have been applied to 3D-dimension measurement and surface defect detection [8,9]. However, these methods are costly and can hardly deal with small surface defects.
In recent years, photometric stereo has become a very promising technology in 3D vision due to various outstanding advantages [10], among which three deserve our attention. The first is pixel-level resolution. Photometric stereo can reconstruct the normal of each pixel with rich texture information. The second is that it can handle specular reflections. Photometric stereo is a technology based on the reflection of the object surface. The normal of the specular reflection area can be regarded as the bisector of the viewing and light directions. Furthermore, the use of multiple images to reconstruct the normal map can greatly reduce the adverse effects of the highlighted area. The third is low cost. In terms of hardware, photometric stereo only requires one camera and multiple LED lights. Therefore, a large amount of research on photometric stereo has been carried out, and preliminary applications have been obtained in many fields, such as the detection of defects on the surface of strip steel [11].
The main problem that limits the development and application of photometric stereo is the estimation of non-Lambertian surface normals. The classic photometric stereo technology was developed based on the Lambertian reflectance model [12], but a pure Lambertian surface rarely exists in reality. As a result, numerous research papers have extended photometric stereo to the non-Lambertian domain. Non-Lambertian photometric stereo methods can be roughly divided into four categories. The first category is based on removing outliers. The earliest non-Lambertian photometric stereo methods assume that specular reflections are rare in photometric stereo observations and treat them as abnormal values. Methods such as rank minimization [13], RANSAC [14], taking median values [15], expectation maximization [16], and sparse Bayesian regression [17] have been proposed to eliminate outliers in observations and estimate relatively accurate normal maps. However, such methods are only suitable for cases in which a small number of non-Lambertian regions exist. For materials with a large amount of specular reflection, such as metal surfaces, these methods are no longer applicable. The second category is based on sophisticated reflectance models. These methods estimate the surface normal by establishing a reflection model of the object surface, including many classic analytic models, such as the Blinn-Phong model [18], the Torrance-Sparrow model [19], the Ward model [20], and the Cook-Torrance model [21]. Some methods based on improved BRDFs have also been proposed, such as bivariate BRDF representations [22,23] and symmetry-based approaches [24], which characterize the reflection model of the object surface. However, these methods typically require a suitable optimization model and suit only a limited range of materials; a universally applicable reflection model remains difficult to formulate. The third category is the example-based method.
If a ball of the same material is placed in the same scene, its surface normals are already known, which turns the non-Lambertian problem into a matching problem [25]. More recently, rendered balls of different materials have been used to remove the need for physical reference balls [26]. However, these methods are cumbersome and time-consuming. The fourth category is learning-based methods, which have emerged in recent years. In 2017, Santo et al. first introduced a deep fully connected network to photometric stereo, learning the mapping between the corresponding pixels in multiple observation images and the normal of each point in a pixel-wise manner [27]. However, this method requires a predefined light direction: when applied in an industrial scene, the light directions need to be kept consistent with those used during training. Ikehata et al. first introduced the convolutional neural network (CNN) to photometric stereo, merging all the input data of a single pixel into an intermediate representation termed the observation map, and using a CNN to regress the surface normal [28]. Chen et al. first introduced a fully convolutional network into photometric stereo, using the information of the entire image to directly estimate the full normal map [29]. Meanwhile, they proposed two widely used synthetic photometric stereo datasets. Cao et al. applied a three-dimensional convolutional neural network to photometric stereo, constructing inter- and intra-frame representations that yield more accurate normal estimates for non-Lambertian objects with significantly fewer parameters, and performing robustly under both dense and sparse image capturing configurations [30]. Nevertheless, these methods still have difficulty dealing with polished surfaces with specular reflection.
In this paper, a novel learning-based photometric stereo method is proposed to solve the problem of specular reflection in non-Lambertian photometric stereo. We design a feature fusion convolutional neural network called FFCNN for estimating the normal map of objects. FFCNN feeds the initial normal map estimated by the L2 method [12] into a dedicated feature extraction branch of the convolutional neural network. While CNN and max-pooling operations can extract salient features for estimating normal maps, the L2 method handles the low-frequency information close to Lambertian reflectance well [22] and reduces the influence of specular reflection areas across multiple images. The complementary advantages of the CNN and L2 methods are thus integrated to estimate more accurate normal maps. As a consequence, FFCNN can handle non-Lambertian surfaces very well, including objects that contain specular reflections. Compared with the state-of-the-art methods [30,31], our proposed FFCNN method can estimate the normal map more accurately.
Additionally, we apply our method to the surface normal estimation of rail welds after polishing, and we build a novel photometric stereo information collection system, which can adapt to the complex geometric surface of the rail. Experiments show that the estimated normal map contains abundant texture information in the presence of large specular reflection areas, validating the effectiveness and generalization of our method.

Principle of Photometric Stereo
Photometric stereo was proposed by Woodham in 1980 [12], and uses the light and shade information of the surface to analyze the light reflection model. Given images of the surface under different lighting directions with a fixed camera, the normal of the surface is calculated from the image brightness. Figure 1 shows a typical photometric stereo system, which consists of two parts: the photometric stereo information capture system and the photometric stereo normal estimation algorithm. The photometric stereo information acquisition system mainly includes a fixed camera and multiple LED lights. The camera captures an image each time a light is sequentially turned on. The surface normal map can then be solved from the acquired images through the photometric stereo algorithm. Early photometric stereo algorithms usually assumed an ideal Lambertian reflectance model [12], under which the image acquisition process can be formulated as:

I = ρ max(n · l, 0), (1)

where I represents the intensity of the captured image, ρ represents the surface albedo, which is constant, n represents the surface normal, which is a unit vector, l represents the lighting direction, and max(n · l, 0) accounts for attached shadows. The normal map can be solved from Equation (1) with three or more images [12]. Unfortunately, most real-world objects are non-Lambertian. For a non-Lambertian surface, the image formation for camera channel c can be written as [32]:

I_c = ∫ Q_c(λ) E(λ) S(λ) dλ,

where λ represents the wavelength, Q_c(λ) represents the spectral sensitivity of the camera, E(λ) represents the spectral distribution of the light, and S(λ) represents the spectral reflectance of the object surface.
Normally, it is difficult to directly solve for the normal n from this formula. As mentioned in Section 1, many methods have been proposed to handle non-Lambertian surfaces. Our method belongs to the fourth category: we let the network directly learn the mapping between the captured images and the normal map. Our method is discussed in detail in Section 2.2.
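As a concrete illustration of the classic L2 (least-squares) solve referenced above, the following minimal numpy sketch recovers per-pixel normals and albedo under the Lambertian model; the function names and the synthetic example are ours, not the authors' implementation, and shadowed observations are assumed absent.

```python
import numpy as np

def l2_normals(I, L):
    """Least-squares (L2) photometric stereo under the Lambertian model.

    I: (m, p) stack of m observed intensities for p pixels.
    L: (m, 3) unit lighting directions, one per image.
    Returns unit surface normals (p, 3) and per-pixel albedo; the albedo is
    the norm of the unnormalized solution rho * n of I = rho * (n . l).
    """
    # Solve L @ (rho * n) = I for every pixel in one batched lstsq call.
    G, *_ = np.linalg.lstsq(L, I, rcond=None)   # (3, p)
    rho = np.linalg.norm(G, axis=0)             # per-pixel albedo
    n = (G / np.maximum(rho, 1e-8)).T           # (p, 3) unit normals
    return n, rho

# Synthetic check: render a known normal under 4 lights and recover it.
n_true = np.array([0.0, 0.6, 0.8])
L = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 1], [-1, 0, 1]], float)
L /= np.linalg.norm(L, axis=1, keepdims=True)
I = 0.7 * np.maximum(L @ n_true, 0)[:, None]    # albedo 0.7, one pixel
n_est, rho_est = l2_normals(I, L)
```

With three or more unshadowed observations per pixel the system is (over-)determined and the Lambertian normal is recovered exactly, which is why the L2 result serves as a useful initial estimate later in the paper.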

Learning Based Photometric Stereo: FFCNN
We adopt two commonly used assumptions in photometric stereo, i.e., an orthographic camera and directional lights. According to the semiparametric BRDF model [23,33], for most isotropic reflectances, the image formation equation of photometric stereo can be written as:

I = f_BRDF(n, l, v) max(n · l, 0),

where f_BRDF represents the BRDF function, n represents the surface normal, l is the lighting direction, and v is the viewing direction, which is [0, 0, 1]. For a Lambertian surface, the BRDF is a constant, and the L2 method (least-squares method) [12] can solve the surface normal well from three or more observations. However, purely Lambertian surfaces barely exist. For a non-Lambertian surface, predicting the normal n from the light source direction l and the image brightness I is significantly more complicated because the BRDF function is unknown.
We design a learning-based approach to solve the non-Lambertian problem. Instead of solving for n directly, we implicitly learn the mapping between the input [I, l] and the output n through a feature fusion convolutional neural network named FFCNN.
Firstly, we use a normalization strategy to process the images following [31]. The values at the same position in each image are processed as follows:

i'_j = i_j / sqrt(i_1^2 + i_2^2 + ... + i_m^2), j = 1, ..., m,

where i_j and i'_j represent the original and processed values of the point in the jth image, and i_1 to i_m represent the values at the same position from the first to the mth image. This removes the effect of reflectivity in the low-frequency information, which is very close to Lambertian reflectance [22]. It should be noted that when the number of input images at test time differs from the number used during training, the scale of the normalized data differs as well. For example, when the pixel values of the images are all 1, with q input images during training and t input images during testing, the normalized values are 1/√q during training and 1/√t during testing. We therefore multiply the normalized images by a factor of √(t/q) during the test.
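The normalization above can be sketched as follows, assuming (following the formulation in [31]) that each pixel's vector of m observations is divided by its L2 norm across the images; the function name is ours.

```python
import numpy as np

def normalize_observations(images):
    """Per-pixel observation normalization following the strategy of [31].

    images: (m, H, W) grayscale observations under m lights. Each pixel's
    m-vector of intensities is divided by its L2 norm across the images,
    cancelling the spatially varying albedo in near-Lambertian regions.
    """
    norm = np.sqrt((images ** 2).sum(axis=0, keepdims=True))
    return images / np.maximum(norm, 1e-8)

# For a Lambertian pixel i_j = rho * max(n . l_j, 0), the albedo cancels:
shading = np.array([0.2, 0.5, 0.9, 0.4]).reshape(4, 1, 1)
bright = normalize_observations(0.9 * shading)   # albedo 0.9
dark = normalize_observations(0.1 * shading)     # albedo 0.1
```

Both the bright and the dark pixel produce the identical normalized observation vector, which is exactly the reflectivity-removal effect described in the text.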
Meanwhile, the initial normal map is calculated by the L2 method [12] as the input of a subnetwork, which provides useful prior information for the FFCNN network. We then use the preprocessed images and the initial normal map as the input of FFCNN to estimate an accurate surface normal map. This can be written as:

N = f([I_1, . . . , I_m], [l_1, . . . , l_m], N_initial),

where f(·) represents the FFCNN model, N represents the estimated normal map, [I_1, . . . , I_m] represents the images from the first to the mth, [l_1, . . . , l_m] represents the corresponding lighting directions, and N_initial represents the normal map solved by the L2 method. As illustrated in Figure 2, our proposed FFCNN model consists of four major components: photometric stereo image feature extraction, L2 normal feature extraction, feature fusion, and normal map estimation.
Given m images and m corresponding lighting directions, we expand each light direction into a lighting map with the same size as the image, and then concatenate each image with its lighting map to obtain a 6 × H × W input matrix IL. Thus, we have m input matrices IL.
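The construction of the m input matrices IL can be sketched as follows; the function name is ours, and a numpy broadcast stands in for the "expansion" of each 3-vector light direction into a lighting map.

```python
import numpy as np

def build_input(images, lights):
    """Assemble the per-image network inputs IL described above.

    images: (m, 3, H, W) RGB observations; lights: (m, 3) unit directions.
    Each light direction is broadcast to a 3 x H x W lighting map and
    concatenated with its image along the channel axis, giving m matrices
    of shape 6 x H x W.
    """
    m, _, H, W = images.shape
    light_maps = np.broadcast_to(lights[:, :, None, None], (m, 3, H, W))
    return np.concatenate([images, light_maps], axis=1)  # (m, 6, H, W)

# Five dummy images lit from directly above ([0, 0, 1]).
IL = build_input(np.zeros((5, 3, 32, 32)), np.tile([0.0, 0.0, 1.0], (5, 1)))
```

Packing the lighting direction into image-shaped channels is what lets a fully convolutional extractor consume image and light jointly at every pixel.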
Then, the preprocessed data IL are fed into the FFCNN network. We first deploy a photometric stereo feature extraction module to extract the feature maps F_IL. All m input matrices IL are fed into the photometric stereo feature extraction network sequentially, which means they share the same weights. The shared-weight photometric stereo feature extraction module is composed of seven convolution layers with 256 channels. Downsampling is carried out at the second and fourth convolution layers, and upsampling at the sixth, which greatly increases the receptive field of the network and reduces memory usage. Given m image-lighting matrices IL, m feature maps F_IL are obtained by this module. Then, we deploy an L2 normal feature extraction module to extract the feature map F_L2 from the normal map solved by the L2 method [12]. Unlike the photometric stereo feature extraction module, only one downsampling is performed in this module, at the second convolution layer; the remaining six convolution layers all have 256 channels.
In the feature fusion module, the m feature maps F_IL and the feature map F_L2 are concatenated. Next, we apply max-pooling [29] to aggregate the features across the feature maps, which means that only the maximum feature value is retained at each point. Max-pooling is widely used in photometric stereo because discarding non-activated features eliminates the influence of cast shadows.
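A minimal numpy sketch of this fusion step follows, under the assumption that the max-pooling runs across the m + 1 concatenated feature maps; the names and the toy example are ours.

```python
import numpy as np

def fuse_features(per_image_feats, l2_feats):
    """Order-agnostic fusion: concatenate the m per-image feature maps
    F_IL with the L2-normal feature map F_L2, then max-pool across maps
    so only the strongest activation at each location survives.

    per_image_feats: (m, C, H, W); l2_feats: (C, H, W).
    """
    stacked = np.concatenate([per_image_feats, l2_feats[None]], axis=0)
    return stacked.max(axis=0)                   # (C, H, W)

# Two per-image maps plus the L2 map; the pointwise maximum survives.
F_IL = np.zeros((2, 1, 2, 2)); F_IL[1] = 2.0
F_L2 = np.ones((1, 2, 2))
fused = fuse_features(F_IL, F_L2)
```

Because the maximum is taken over the map axis, the fused representation is independent of the order in which the m images are presented, and points deactivated by cast shadows in some images simply never win the maximum.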
Finally, a normal regression module is used to estimate the surface normal. Four convolution layers, including one up-sampling convolution layer, restore the same spatial dimensions as the input image, and an L2-normalization layer then normalizes the estimated normal map.
In the FFCNN network, all convolutional layers are followed by a Leaky ReLU activation layer. Except for the up-sampling layers, all convolution layers adopt 3 × 3 kernels.
In the training stage, we use the cosine similarity error between the estimated normal map and the ground truth as the loss function:

L = (1 / (H · W)) Σ_{i,j} (1 − Ñ_{i,j} · N_{i,j}),

where Ñ_{i,j} and N_{i,j} represent the estimated and ground-truth normals, respectively, and H and W are the height and width of the normal map. L is minimized using Adam with the suggested default settings. The more similar Ñ_{i,j} and N_{i,j} are, the closer L is to 0.
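The cosine-similarity loss described above can be sketched as follows; this is a numpy sketch with names of our choosing, not the training code itself.

```python
import numpy as np

def cosine_loss(n_est, n_gt):
    """Cosine-similarity loss: mean over all pixels of 1 - <N_est, N_gt>.

    n_est, n_gt: (H, W, 3) unit-normal maps. Identical maps give 0;
    everywhere-opposite normals give the maximum value of 2.
    """
    return np.mean(1.0 - np.sum(n_est * n_gt, axis=-1))

# A 4 x 4 map of identical upward normals [0, 0, 1].
n = np.zeros((4, 4, 3)); n[..., 2] = 1.0
```

Since both maps hold unit vectors, the inner product is the cosine of the per-pixel angular error, so driving the loss to 0 drives every pixel's angular error to 0.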

Dataset
For training and testing, ground truth is necessary for calculating the loss during training and for evaluation during testing, and a large amount of data is also needed. However, it is very difficult to obtain enough photometric stereo data with ground truth. Hence, we use two publicly available synthetic datasets, the PS Blobby dataset and the PS Sculpture dataset, for training, and one publicly available synthetic dataset, the PS Sphere Bunny dataset, for testing [29] (see Figure 3). The MERL dataset, which contains 100 measured BRDFs of real-world materials [34], is used to render the 3D objects of the synthetic datasets under different lighting conditions.
The synthetic training datasets are rendered from two 3D shape datasets, namely the blobby shape dataset and the sculpture shape dataset, which contain 3D models of multiple objects [35,36]. The synthetic testing dataset uses 3D models of a sphere and a bunny to render photometric stereo data with 100 different materials.
In addition, a publicly available real-world photometric stereo dataset called DiLiGenT [37] is used to verify the ability of the model to process real-world data; it contains real-world data of 10 objects under 96 different lighting conditions.

Training Details
All experiments are performed under the Ubuntu operating system on a computer with a GeForce RTX 3090 graphics card and 256 GB RAM. Our FFCNN model has 9.21 million learnable parameters. During training, the initial learning rate is set to 0.001 and is halved every 5 epochs. It takes about 8 h to train the FFCNN model with a batch size of 32 for 30 epochs. Following [29], the height and width of the images are randomly rescaled to between 32 and 128 pixels so that the model can cope with inputs of different sizes; the images are then randomly cropped to 32 × 32 and noise is randomly added.

Testing Details
The PS Sphere Bunny dataset is used to illustrate the ability of the FFCNN model to estimate normal maps on synthetic data. The DiLiGenT benchmark dataset is used to verify the generalization ability of the FFCNN model on real-world photometric stereo data. The accuracy of the normal estimation is quantitatively evaluated by the mean angular error (MAE) between the ground-truth normal map and the estimated normal map:

MAE = (1 / K) Σ_{k=1}^{K} arccos(n_k · ñ_k),

where n_k and ñ_k represent the ground-truth and predicted normals, respectively, and K represents the number of pixels in the image excluding the background area. A lower MAE means a more accurate normal map estimated by the model. The photometric stereo data of the polished rail surface are used to verify the generalization of the model and to qualitatively analyze its effectiveness in the industrial field.
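The MAE metric can be computed as in the following numpy sketch, with an optional object mask to exclude background pixels as described above; the function name is ours.

```python
import numpy as np

def mean_angular_error(n_gt, n_est, mask=None):
    """Mean angular error (MAE) in degrees between two unit-normal maps.

    n_gt, n_est: (H, W, 3) unit normals; mask: optional (H, W) bool array
    that is True on the K evaluated object pixels (background excluded).
    """
    cos = np.clip(np.sum(n_gt * n_est, axis=-1), -1.0, 1.0)
    ang = np.degrees(np.arccos(cos))
    return ang[mask].mean() if mask is not None else ang.mean()

# Normals [0, 0, 1] versus normals tilted 45 degrees towards the y-axis.
a = np.zeros((2, 2, 3)); a[..., 2] = 1.0
b = np.zeros((2, 2, 3)); b[..., 1:] = 1.0 / np.sqrt(2)
```

The clip guards against floating-point dot products marginally outside [-1, 1], which would otherwise make arccos return NaN.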

Effects of Kernel Size
The kernel size affects the receptive field of FFCNN, and thus the performance of the network. For each point, the larger the kernel size, the more information from neighboring points can be used to estimate the normal of that point. However, points too far away from the estimated point carry little useful information and may even introduce interference. Moreover, a larger kernel increases the computational complexity and time cost. We use the PS Sphere Bunny dataset to compare the performance of our model with different kernel sizes. The number of input images is 32 during training and 100 during testing. Both Sphere and Bunny contain data for 100 materials, and the result for each material is the average of 100 random trials. We take the average of the results over all materials. Figure 4 presents the relationship between the kernel size and the MAE of the normal estimation. As illustrated in Figure 4, the MAE does not decrease continuously as the kernel size increases. On both the bunny and sphere datasets, the MAE decreases as the kernel size increases up to 3; however, when the kernel size increases to 5 and 7, the MAE keeps increasing. This verifies our conjecture that when the kernel size is too large, points too far from the estimated point may contain redundant information that interferes with the normal estimation. To achieve the best performance, we empirically set the kernel size to 3.

Effects of Input Number
The number of input images during training also affects the performance of the model. Table 1 lists the performance of the FFCNN model with different numbers of input images during training and testing. For a fixed number of training inputs, the performance of FFCNN generally improves with the number of inputs during testing. When the number of input images during testing is fixed and fewer than 32, FFCNN performs better when the number of input images during training is twice that during testing. When the number of test images is fixed and not less than 32, FFCNN performs slightly better with 32 training images. Considering both the performance of FFCNN in estimating the normal and the memory of the computer, we use 32 images during training.

Effects of Feature Fusion
The L2 method performs poorly on non-Lambertian surfaces, especially on specular reflections. However, the low-frequency information in the images is very close to Lambertian reflectance, and the L2 method can estimate the normals of this part well. The initial normal map obtained by the L2 method thus provides useful prior information for FFCNN when extracting normal features from the observation images. For specular reflection, the normal vector can be regarded as the bisector of the viewing and light directions, and the max-pooling operation can extract these significant features while ignoring the non-activated features caused by cast shadows. Table 2 compares the performance of FFCNN and FFCNN-Without-L2, where FFCNN-Without-L2 is identical to FFCNN but without the L2 normal feature extraction module. As shown in Table 2, FFCNN performs better than FFCNN-Without-L2 on both bunny and sphere. On the real-world DiLiGenT benchmark dataset, FFCNN performs significantly better than FFCNN-Without-L2 in terms of the average MAE over the ten objects. FFCNN is clearly superior on most objects and is inferior to FFCNN-Without-L2 on only three objects, with very small differences. This means that our proposed feature fusion strategy is effective, especially on objects with complex structures, such as the reading object. Figure 5 compares the performance of FFCNN and FFCNN-Without-L2 on two objects, ball and reading. The ball object shows that FFCNN outperforms FFCNN-Without-L2 in the non-specular reflection areas, which verifies our conjecture that the L2 method estimates the normals of near-Lambertian regions well and provides useful prior information for FFCNN. Our feature fusion method also performs well on the reading object, which has a more complex surface and more specular reflection areas.

Results on Different Materials
Specular reflection is a challenging problem in photometric stereo. The surfaces of many materials exhibit specular reflection characteristics, especially in the industrial field. Figure 6 compares FFCNN with the L2 baseline [12], PS-FCN +N [31] and FFCNN-Without-L2 on samples of the sphere object rendered with 100 different BRDFs, many of which have specular reflective properties. In Figure 6, the performance of FFCNN, represented by the blue line, is significantly better than that of the other three methods. The materials toward the right of the horizontal axis exhibit more intense specular reflection, and our method performs better on these materials, indicating that our method can handle specular reflective surfaces well.

Quantitative Comparison
We compare our FFCNN method with several other state-of-the-art photometric stereo solutions, including IA-14 [23], ST-14 [22], HI-17 [27], TM-18 [38], CH-18 [29], SI-18 [28], CA-21 [30], and CH-20 [31]. The source code and evaluation results for these methods are publicly available. All 96 images are used to estimate the normal map of each object. The MAE of the normal maps estimated by these methods on the DiLiGenT benchmark dataset is shown in Table 3. Table 3. Quantitative comparison of our proposed FFCNN model and state-of-the-art photometric stereo methods on the DiLiGenT benchmark dataset. * indicates that we use all 96 images to estimate the normal map of Bear, whereas the result reported in SI-18 [28] (Bear 4.1) was achieved by discarding the first 20 input images. Red values indicate the best performance; a smaller MAE means better performance.

Our proposed FFCNN model achieved the best results on the 10 real-world objects, with a mean angular error of 6.83°. For objects with strong specular reflection or complex geometric surfaces, such as the reading and cow objects, our model performs significantly better than the other methods. This illustrates that the proposed network can effectively deal with specular reflection and complex geometric surfaces.
Note that the first 20 images of the bear object were considered corrupted in SI-18 [28]. After discarding the first 20 images of the bear object, we compare our FFCNN method with the method termed CNN-PS in SI-18 [28], as shown in Table 4. Although our proposed method is slightly inferior to CNN-PS on the bear object, the overall performance of our method on the 10 objects is still significantly better. Table 5 compares the proposed FFCNN and CNN-PS [28] in terms of running time. We repeat the estimation process 10 times and compute the average running time (the forward time of the network). Following the setting in SI-18 [28], we set the number of different rotations for the rotational pseudo-invariance to 10. Since CNN-PS estimates the normal map in a pixel-wise manner, while our method estimates it in a frame-wise manner, the running time of CNN-PS is much longer than that of FFCNN. Therefore, our proposed FFCNN is more suitable for industrial applications where high efficiency is required. Figure 7 shows some qualitative results on three real-world objects in the DiLiGenT benchmark dataset. For the ball object, we use 96 images to estimate the normal map, which means there are 96 highlight areas. The estimated normal map and the detailed view show that our method can handle the specular reflection problem well and estimate a more accurate normal map. For the cow and reading objects, the detail views and the error maps show that our proposed FFCNN model is capable of estimating normal maps with richer details.

The Setup of Photometric Stereo System
To validate the effectiveness of our FFCNN model in industry, we apply FFCNN to normal map estimation of the polished rail welding surface, which exhibits severe specular reflections and complex reflection characteristics. Obtaining information on the product surface is the first step in applying photometric stereo to industrial applications. However, the geometric surface of the rail is complicated: the underside of the rail head and the upper surface of the rail foot are difficult to illuminate with multiple light sources and to image with a camera. Figure 8 illustrates the problem; because of the special geometry of the rail, gathering information from the red and green marked areas is challenging. Therefore, we design a novel photometric stereo image data capture system to solve this problem, as shown in Figure 9. The system consists of four parts: two identical surface information acquisition systems, namely IASa and IASb (IAS* denotes image acquisition system), obtain information on the top and bottom surfaces of the rail, and two identical, symmetrically placed acquisition modules, namely IASc and IASd, obtain information on the rail waist. Modules IASa and IASb are each composed of a fixed-orientation camera and a number of LED lights around the camera. Modules IASc and IASd each consist of three cameras with LED lights around them. The upper and lower cameras in modules IASc and IASd are oriented towards the underside of the rail head and the upper surface of the rail foot, respectively, to capture information on these concave surfaces, while the middle camera is perpendicular to the rail waist. In addition, a two-degree-of-freedom guide rail on each side allows modules IASc and IASd to adjust their height and their distance to the rail. A structure made of profiles supports the operation of the whole equipment.
We designed a circuit to control the collaboration between the LED lights and the camera. A Raspberry Pi sends signals that control relays to switch each LED light on and off individually, and the camera captures an image while an LED is turned on. A 5 V constant-voltage power supply powers the Raspberry Pi, and a 700 mA constant-current power supply keeps the brightness of each LED light constant, as the surface normal can only be solved if the brightness of the LED lights remains constant.
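The acquisition loop described above can be sketched as follows. This is a hardware-agnostic sketch: `set_relay` and `grab_frame` are injected stand-ins for the Raspberry Pi relay control and the camera trigger, not the authors' actual driver code, and the settle delay is an assumed parameter.

```python
import time

def capture_sequence(set_relay, grab_frame, num_leds, settle_s=0.05):
    """One acquisition pass: switch each LED on in turn, let the current
    settle, grab a frame, then switch the LED off again.

    set_relay(i, on): stand-in for the relay control of LED i.
    grab_frame(): stand-in for the camera trigger; returns one image.
    Returns the list of captured frames, one per LED.
    """
    frames = []
    for i in range(num_leds):
        set_relay(i, True)        # light LED i only
        time.sleep(settle_s)      # let brightness stabilize
        frames.append(grab_frame())
        set_relay(i, False)       # switch it off before the next LED
    return frames

# Dry run with stand-in callables (no hardware attached).
events = []
frames = capture_sequence(
    set_relay=lambda i, on: events.append((i, on)),
    grab_frame=lambda: "frame",
    num_leds=3,
    settle_s=0.0,
)
```

Keeping exactly one LED lit per exposure is what gives photometric stereo its one-light-per-image observations; the injected callables make the sequencing logic testable without the relays or the camera.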
With this equipment, we can collect information of the welded rail surface after grinding from multiple perspectives.

Result on Polished Rail Welding Surface
After obtaining the surface information of the polished rail, we use our proposed FFCNN to estimate the normal maps. Some example results on the top surface of the rail, the rail waist and the upper surface of the rail foot are shown in Figures 10, 11 and 12, respectively. As shown in Figure 10, the normal map estimated by our method contains rich detail, although the acquired images contain large areas of specular reflection. Figure 11 compares the performance of FFCNN and PS-FCN +N [31] on two samples of the rail waist. FFCNN performs significantly better than PS-FCN +N , especially on detailed information, as shown in the areas marked by red boxes. The normal map estimated by FFCNN contains more detail, while important details in the normal map estimated by PS-FCN +N are smoothed away. Figure 12 compares the performance of FFCNN and PS-FCN +N on two samples of the upper surface of the rail foot. Similarly, FFCNN performs significantly better, and the normal map estimated by PS-FCN +N appears to be wrong. The normal map estimated by FFCNN performs well in terms of details and textures, which can provide rich information for the subsequent detection and evaluation of polishing quality, as shown in the red box in Figure 11.

Conclusions
In this paper, we propose a complete photometric stereo processing framework to estimate the normal map of non-Lambertian surfaces, especially polished surfaces. We propose a feature fusion neural network to regress the surface normal map, which uses the initial normal map obtained by the L2 method as prior information and fuses the features extracted from the original images with the features extracted from the initial normal map. The proposed method makes full use of both the low-frequency information close to Lambertian reflectance and the information of the specular reflection areas to estimate a more accurate normal map. We have experimentally investigated and verified the performance of our model. The proposed method performs better than the state-of-the-art methods on both synthetic datasets and the real-world DiLiGenT benchmark dataset. Additionally, the proposed method is used to estimate the normal map of the polished rail welding surface, verifying its effectiveness in the industrial field. We design a photometric stereo information capture system with multi-view cameras and multi-angle lights to obtain the surface information of polished rail welding surfaces with complex geometry. The normal map of the polished rail welding surface estimated by our FFCNN model contains rich texture and detail information, providing a solid basis for surface quality evaluation. This demonstrates the effectiveness of our method for industrial non-Lambertian surfaces, including specular reflective surfaces.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to the privacy policy of the organization.