Quality control is an essential part of the manufacturing industry, as it helps standardize both production and the response to quality issues. Most current inspection work is still performed manually by visual inspection, and the results are heavily influenced by subjective human factors, leading to unstable production quality control. The development of computer vision has greatly alleviated this problem. Automated inspection based on computer vision is a popular way to achieve efficient and high-precision defect detection by processing and analysing images. Traditional computer vision methods for defect detection mainly rely on techniques such as edge detection [1,2], grey-scale thresholding [3,4], and image segmentation [5,6]. However, these traditional methods are limited by their reliance on manually defined features. In recent years, numerous learning-based techniques have been proposed for defect detection; they transform the data into complex, abstract representations so that the relevant features are learned during training, removing the need for predefined features. A necessary condition for defect detection with computer vision algorithms is that the characteristics of the defects can be extracted from the image. However, the highly reflective nature of metal surfaces causes overexposure in some areas of the captured image and obscures the characteristics of the defect, as shown in Figure 1.
When the angular bisector of the incident light direction and the viewing direction coincides with the surface normal, the observed point appears glossy. When the surface prevents the light from reaching itself, an attached shadow is produced in the observed area; when the surface prevents the light from reaching another surface, the observed area appears with a cast shadow [7]. Since the shape of the object is unknown, the only way to avoid defects falling into highlight or shadow areas is to adjust the angle between the direction of light incidence and the viewing direction. However, adjusting the light direction and the viewpoint for the wide variety of metal part shapes is too tedious. It is therefore more effective and convenient to reduce the probability of defects remaining in highlight or shadow areas by illuminating the object from different directions. Accordingly, we introduce photometric stereo into the metal defect detection task, which removes the interference of highlights by exploiting the information from multiple illuminations.
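Loosely formalized (the symbols below are introduced here only for illustration), for a surface point with unit normal $\mathbf{n}$, unit incident light direction $\mathbf{l}$, and unit viewing direction $\mathbf{v}$, a specular highlight is observed when $\mathbf{n}$ aligns with the half-vector of $\mathbf{l}$ and $\mathbf{v}$, whereas an attached shadow occurs when the light lies behind the local tangent plane:
$$\mathbf{h}=\frac{\mathbf{l}+\mathbf{v}}{\lVert \mathbf{l}+\mathbf{v}\rVert},\qquad \text{highlight:}\ \mathbf{n}\approx\mathbf{h},\qquad \text{attached shadow:}\ \mathbf{n}^{\top}\mathbf{l}\le 0.$$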
In recent years, photometric stereo, which estimates an accurate and highly detailed surface normal map of a target object from a set of images captured under different light directions with a fixed-viewpoint camera, has received significant attention in the field of advanced manufacturing [8,9,10,11,12]. The basic theory of photometric stereo was first proposed by Woodham et al. [13] in the 1980s, based on the assumption of ideal Lambertian reflectance.
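For reference, the classical calibrated formulation can be sketched as follows (the notation is introduced here only for illustration). For a pixel observed under $m$ known unit light directions $\mathbf{l}_1,\dots,\mathbf{l}_m$ with intensities $I_1,\dots,I_m$, the Lambertian model assumes
$$I_j=\rho\,\mathbf{n}^{\top}\mathbf{l}_j,\qquad j=1,\dots,m,$$
where $\rho$ is the albedo and $\mathbf{n}$ the unit surface normal. With $m\ge 3$ non-coplanar lights, stacking the light directions into $L=[\mathbf{l}_1,\dots,\mathbf{l}_m]^{\top}$ and the intensities into $\mathbf{i}$ gives the least-squares (L2) solution
$$\mathbf{b}=(L^{\top}L)^{-1}L^{\top}\mathbf{i},\qquad \rho=\lVert\mathbf{b}\rVert,\qquad \mathbf{n}=\mathbf{b}/\lVert\mathbf{b}\rVert,$$
which is the L2 photometric stereo referred to later in this section.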
However, few objects in reality fit the ideal Lambertian assumption. Therefore, over the last forty years, numerous methods aimed at non-Lambertian objects have been proposed to improve the applicability of photometric stereo. Before the development of deep learning, most traditional photometric stereo methods treated non-Lambertian pixels as outliers [14,15,16,17,18]. However, these methods only worked for objects with sparse non-Lambertian regions. To handle dense non-Lambertian reflections, some methods used analytical or empirical models to simulate the reflectance of the surface [19,20,21,22,23,24,25]. However, this type of method is only applicable to a limited range of materials. In recent years, numerous deep-learning-based photometric stereo methods have been proposed to handle the non-Lambertian objects encountered in reality. According to how they process the input images, learning-based methods can be divided into per-pixel [26,27,28,29] and all-pixel methods [30,31,32,33]. Per-pixel methods take the observed intensities of a single pixel as input and output the surface normal of that pixel.
Santo et al. [34] first introduced deep learning into photometric stereo with the deep photometric stereo network (DPSN), which assumed that the light directions stayed the same during training and testing. To generalize learning-based per-pixel methods, Ikehata projected the observed intensities onto a 2D space and proposed the observation map, which rearranges the observed intensities according to their light directions. The observation map enables per-pixel methods to handle an arbitrary number of light directions in an order-agnostic manner.
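As a rough sketch of this preprocessing step (a minimal illustration with assumed array shapes and map size w, not the exact implementation used in [26]), the normalized intensities observed at one pixel can be scattered onto a 2D grid indexed by the x and y components of the corresponding light directions:

```python
import numpy as np

def observation_map(intensities, light_dirs, w=32):
    """Build a w x w observation map for a single pixel.

    intensities: (m,) observed intensities of the pixel under m lights
    light_dirs:  (m, 3) unit light directions (lx, ly, lz)
    """
    obs = np.zeros((w, w), dtype=np.float32)
    scale = intensities.max() + 1e-8                 # normalize away albedo/light-strength scale
    for I, (lx, ly, _) in zip(intensities, light_dirs):
        u = int(round((lx + 1.0) / 2.0 * (w - 1)))   # map lx in [-1, 1] to a column index
        v = int(round((ly + 1.0) / 2.0 * (w - 1)))   # map ly in [-1, 1] to a row index
        obs[v, u] = I / scale
    return obs
```

Because each observation is placed according to its light direction rather than its position in the input sequence, the resulting map is independent of the number and order of the input images.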
Numerous methods, such as CNN-PS [26], LMPS [28], and SPLINE-Net [29], used this preprocessing in their networks and achieved great performance on the DiLiGenT benchmark dataset. All-pixel methods take whole images or patches, together with their corresponding light directions, as input and output a surface normal map with the same resolution as the input. To handle order-agnostic lighting, Chen et al. [30] proposed an all-pixel photometric stereo network named PS-FCN, which applies a shared-weight feature extractor to each input image and uses max pooling to aggregate the extracted features from multiple images.
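The order-agnostic aggregation at the core of this strategy can be sketched as follows (a minimal toy example assuming a two-layer convolutional extractor, not the actual PS-FCN architecture): each image is concatenated with its broadcast light direction, passed through the same extractor, and the resulting feature maps are fused by an element-wise maximum over the image axis.

```python
import torch
import torch.nn as nn

class MaxPoolFusion(nn.Module):
    """Toy shared-weight extractor with order-agnostic max-pooling fusion."""

    def __init__(self, feat=64):
        super().__init__()
        # the same weights are applied to every (image, light) pair: 3 RGB + 3 light channels
        self.extractor = nn.Sequential(
            nn.Conv2d(6, feat, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, images, lights):
        # images: (m, 3, H, W) observations; lights: (m, 3) unit light directions
        m, _, h, w = images.shape
        light_maps = lights.view(m, 3, 1, 1).expand(m, 3, h, w)  # broadcast lights to pixel maps
        feats = self.extractor(torch.cat([images, light_maps], dim=1))
        fused, _ = feats.max(dim=0)   # element-wise max over the m inputs
        return fused                  # (feat, H, W), invariant to image count and order
```

Because the maximum is taken per feature channel and per pixel, the fused representation does not depend on how many images are provided or in which order, which is what allows such networks to handle a varying number of input lights.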
Based on this strategy, SDPS-Net [32] and GC-Net [35] were proposed to handle the uncalibrated photometric stereo task. According to the summary by Zheng et al. [36], per-pixel methods are robust to nonuniform distributions of surface materials but cannot handle regions with global illumination effects (shadows or inter-reflections) well, since shape information is not explicitly considered. Because all-pixel methods are generally trained on various shapes with a uniform material per shape, they perform well in regions with global illumination effects but are ineffective for objects with nonuniform materials. Several methods improve performance by combining the per-pixel and all-pixel strategies. SPS-Net used an attention mechanism to extract photometric information and then applied convolution layers to extract spatial information. Yao et al. [37] proposed a graph-based photometric stereo network (GPS-Net) that first introduced structure-aware graph convolution filters to explore per-pixel information and classical convolutions to explore spatial information. Ikehata [38] applied the Transformer [39] to extract per-pixel and spatial information and proposed PS-Transformer, which aims to handle the sparse photometric stereo task.

Both defect detection and photometric stereo methods achieve outstanding performance on their respective benchmarks. There are also methods [8,9,40,41] that combine photometric stereo with defect detection. Ren et al. [42] used a data-driven photometric stereo method to extract the surface normal and separated defects from the background through filtering. The method could only locate defects but not classify them, and could not be applied when the defects and the background had similar characteristics in the frequency domain. Feyza et al. [43] used the L2 photometric stereo method [13] to estimate the albedo map and normal map of objects; they then took the combination of the albedo map and the normal map as input to a CNN to detect defects. However, the L2 method is based on the Lambertian assumption and does not hold for most real objects. With the development of deep learning, defect detection and photometric stereo methods have achieved outstanding performance on their respective benchmarks. However, few methods combine learning-based photometric stereo with learning-based defect detection. This is because most current defect detection data are captured under a single illumination and cannot be applied to photometric stereo, while existing photometric stereo datasets lack defect samples.