GPU Accelerated Image Processing in CCD-Based Neutron Imaging

Image processing is an important step in every imaging path in the scientific community. Especially in neutron imaging, image processing is very important to correct for image artefacts that arise from low light and high noise statistics. Due to the low global availability of neutron sources suitable for imaging, the development of these algorithms is not in the main scope of research work and once established, algorithms are not revisited for a long time and therefore not optimized for high throughput. This work shows the possible speed gain that arises from the usage of heterogeneous computing platforms in image processing along the example of an established adaptive noise reduction algorithm.


Introduction
Non-destructive testing is a term comprising several important techniques to probe specimen for certain characteristics without the need to physically alter them and has a vast field of applications. One area inside these non-destructive testing methods is radiographic as well as tomographic imaging of samples with e.g., X-rays or neutrons. In those imaging techniques, a sample is placed inside an incident beam. The characteristics of the beam are changed by the introduction of the sample and this change can be recorded by suitable detectors such as scintillation counters or cameras to gain information about the sample. In neutron imaging, the intensity of the incident beam is decreased inside the sample by several different interactions such as absorption, coherent and incoherent nuclear scattering and magnetic dipole interaction. As the interaction cross-sections depend on the composition of the atomic nucleus, they vary wildly over the periodic table and can even have significant differences for different isotopes of the same element as can be seen in Table 1. Table 1. Interaction cross-sections of neutrons for some selected elements. All cross-sections are given in barn with 1 b = 10 × 10 −24 cm 2 . The scattering cross section σ sc and the absorption cross section σ ab are tabulated for v = 2200 m/s (E = 25.30 meV) and taken from [1] and given without the error. Therefore, the attenuation of a sample is dependent on its material composition and is pictured in a 2D image in projections or in a 3D array of voxels in a tomographic acquisition. Neutrons are mostly produced in research reactors by nuclear fission or in spallation sources and are characterised by their temperature which is linked to their energy via (1) Mostly thermal and cold neutrons with temperatures around 25 K to 300 K are used for imaging [2]. Compared to X-ray imaging, a typical neutron imaging setup has a lower total flux and the global availability of neutron beam-time is magnitudes lower [3,4]. As the cross-sections of neutron interactions are strongly energy dependent, in general the spectrum of the incident beam has to be well known in order to being able to quantify the results. Therefore, not all neutron sources are equally well suited for imaging applications. This is why it is essential to maximise usable beam-time and to get the most of the acquired images with suitable image processing. Furthermore, a fast image-processing work-flow allows scientist to perform dynamic measuring rows and to directly assure the quality of the acquired images on-site. Especially in medical imaging in the X-ray domain, GPU based image processing is well established and has shown potential for speed increase in several orders of magnitudes [5]. Noise reduction is an important first step of all image pipelines and therefore chosen as an example in this work. GPU-based video denoising has also been demonstrated with speed gains [6]. However, the application to neutron imaging has not been shown. Furthermore, the chosen implementation shows some optimisations that are transferable to a wide range of image-processing tasks not only in neutron imaging but in general to all pixel based imaging techniques.

Setup
The measurements for this work were carried out at the Advanced Neutron Tomography and Radiography Experimental System (ANTARES) inside the research reactor neutron source facility Forschungs-Neutronenquelle Heinz-Maier-Leibnitz (FRM II). The cold neutron source used in the experiment has a temperature of about 25 K [7]. The shielding hutch of the ANTARES has three separated compartments, of which the first is used for beam-and spectrum shaping devices, amongst them being a mechanical velocity selector, a double crystal monochromator or interference gratings that can be used for grating based phase contrast imaging [8]. The detection system consists of a shielded detector box holding an Andor Ikon L CCD camera by Oxford Instruments, Abingdon, UK, with 2048 pixels in each dimension. A two-mirror beam-path allows the camera to be placed outside the direct neutron beam-path. A schematic drawing of the setup and the detector is shown in Figure 1

Noise Removal
The images taken with the setup above suffer from noise stemming from various sources. Besides electronic and thermal noise as well as readout noise from the camera, the main noise source in neutron imaging experiments that deteriorate image quality are high energy γ-particles that are produced in nuclear reactions of neutrons inside material in and around the experimental setup. Those high energy particles lead to bright spots when they hit the detector, which are referred to in this work as gamma spots. An example of the resulting artefact of such gamma spots is shown in Figure 2. Two bright spots with a bright surrounding can be seen in this image. These spots results from an excess charge production on the CCD sensor, that can no longer be contained in the potential well of one single pixel. Furthermore a streak in the readout direction of the camera can be seen emerging from the primary interaction position. This is due to the fact, that charge transfers from pixel to pixel in the readout direction are facilitated due to the inner workings of a CCD camera. This streak therefore is more prominent at higher readout frequencies and lower potential well depths of the pixels. The problem with these gamma spots is, that a global correction such as Gaussian blurring or taking the median of the affected area decreases the resolution of the image due to the averaging nature of these filters. Adaptive filters, such as described by [9] circumvent this problem by adapting the size of the filter kernel to the strength of noise in the image. Their algorithm will be shortly summarized in the following. A Laplacian of Gaussian (LoG) filter is a well-known edge finding algorithm and is a convolution with the LoG kernel, that can be written in the discrete form with the abbreviation h g (n 1 , n 2 ) = e −(n 2 where n 1 and n 2 are denoting the position of the kernel element in each dimension and σ is the standard deviation. The resulting edge-image shows high intensities at pixels, where there is a fast change in image intensity in the raw image. As higher noise levels lead to sharper edges, the edge-image can be thresholded in several levels, to adjust the size of a subsequent median kernel. In that way, the median filter only affects areas, in which noise above a certain threshold is present and can be tuned to account for a larger neighbourhood when the gamma spots affect a larger area. To account for very bright and big spots, the threshold image for the median with a kernel size of 7 can additionally be binary-dilated.

Sample
To allow for reproducible results, three dimensional test arrays with a discrete uniform random gray value distribution in the open interval [0, 1000) were created with a size of 1024 pixels in each dimension to simulate an image stack. These images where used to profile both the running time of the benchmark implementation as well as our optimised parallel implementation. The demonstration of the noise-removal abilities of the presented implementation is shown on images of polymer electrolyte membrane fuel cells (PEMFCs) acquired at the ANTARES and explained in more detail in [10].

Implementation
A reference implementation of the described algorithm was done in Python 2.7 to allow for detailed profiling and identification of the most timely image processing steps needed which have been identified to be the convolution with the precomputed LoG kernel and the median filtering. In general, a convolution in the image domain can be expressed as a multiplication in the Fourier domain and the needed algorithms to do a Fast Fourier Transform (FFT) are well known and constantly improved [11][12][13]. However, for small kernels convolution is faster and has less memory overhead than FFT [14]. Furthermore, there is no convolution kernel that defines the median, as the median is based on order statistics. Therefore a pixel-wise implementation is chosen both for the reference and the optimized version. The optimised implementation is done in OpenCL with the Python API PyOpenCL [15]. OpenCL was chosen because of it being an open royalty-free standard that is portable across a lot of different computing architectures such as CPUs and GPUs as well as other programmable hardware such as FPGAs and DSPs [16]. OpenCL implements an image object, which has a dedicated underlying memory model that allows the use of samplers for reading. As these samplers allow for automatic bounds-checking, the algorithm gets a lot simpler as manual bounds-checking has to be done no longer. Further more, the image model allows caching of pixel values, so that subsequent access to pixels in the neighbourhood of a lately read-out pixel are accelerated. Memory transfer from host to device is done with PyOpenCL and each transfer as well as each kernel enqueueing is enriched with an OpenCL event that allows to extract exact timing information for memory transfers as well as kernel executions on the device. The source code of the OpenCL kernels is shown in Appendix A.

Profiling
The profiling of the implementation in python shows that the running time of the convolution and the median scales linearly with the pixel count of the input array as can be seen in Figure 3. Furthermore, the median additionally scales with the footprint or region size.   [17].
From this measurements, the time needed for an 2048 × 2048 pixel image can be extrapolated. To get an estimate, for the running time in a real application, the running time of the median kernels has to be multiplied with the noise density in each threshold bin, as the median only has to be executed on these values, which reduces running time. This is not the case for the convolution. Li et al. mention noise densities of 13.35%, 1.68% and 0.33% for the corresponding median sizes of 3, 5 and 7 [9]. With these noise densities, the running time of the complete algorithm is dominated by the time taken in the convolution step and is around 1 s for one image compared to 5 s mentioned in the same source.

Optimisation
In the OpenCL implementation, the input data, namely the array holding the LoG coefficients and the raw input images, has to be transferred from host to device. Subsequently the convolution, thresholding, dilation and the adaptive median is run on the device. The timing of these steps on the same random images used before is done on one Tesla K10 G2 by Nvidia, Santa Clara, CA, USA, and shown in  This timing suggests a possible speed increase of 160 times compared to the original implementation. Transfer time back to the host is not included in the calculation as it strongly depends on the speed of the desired output storage but in general should be in the order of the time needed for the input. The time needed in this implementation is not dependent on the noise density or the threshold levels as the kernel is run on every input pixel in parallel as long as the input can be mapped to the available number of work items on the GPU device. If the size of input data is increased, the running time increases discretely with each new memory block. As an example, an image stack with a z-size between 65 and 128 would take double the time as the shown timing. The running time can further be decreased by interleaving the memory operations with the calculation. On modern graphics hardware, the input copy of a subsequent stack can happen while the calculations are done for the previous input. Conversely, the output of the previous result can be done during the calculation of the subsequent array. If implemented that way, the running time for a stack of 640 images is calculated to be around 4 s. As this implementation is highly parallelised, if higher speed is needed the algorithm can be run on several OpenCL devices at parallel. The application of the filter to an in-situ radiography of a fuel-cell is shown in Figure 5. This work shows, that efficient parallel programming on GPUs is a suitable technique in image processing for CCD-based neutron imaging as a large range of image processing algorithms can be implemented in that way. If the processing-chain gets longer, the overhead of memory operations from host-device transfers can be more and more neglected against the speed gained from parallel execution. We showed, that even pixel-based algorithms and algorithms which have no formulation as a convolution kernel can be efficiently implemented on GPU to allow beam-line users a real-time feedback regarding image quality. Further work in this direction could be done in the GPU implementation of the complete imaging toolchain from background subtraction and flat-field correction up to implementation of dynamic image statistics such as pixel-wise standard deviations to allow e.g., for real-time corrections of intensity fluctuations of neutron sources.

Funding:
We acknowledge financial support through the DFG (Gottfried Wilhelm Leibniz program) and the BMBF (project 05K16WO8, NeuRoFast).

Acknowledgments:
The authors would like to thank Simon Thiele, Riko Moroni and Matthias Breitwieser from IMTEK, Freiburg for the collaboration with the fuel-cell measurements.

Conflicts of Interest:
The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.