Implementation of Thermal Event Image Processing Algorithms on NVIDIA Tegra Jetson TX2 Embedded System-on-a-Chip

Advances in Infrared (IR) cameras, as well as hardware computational capabilities, contributed towards qualifying vision systems as reliable plasma diagnostics for nuclear fusion experiments. Robust autonomous machine protection and plasma control during operation require real-time processing that might be facilitated by Graphics Processing Units (GPUs). One of the current aims of image plasma diagnostics involves thermal events detection and analysis with thermal imaging. The paper investigates the suitability of the NVIDIA Jetson TX2 Tegra-based embedded platform for real-time thermal events detection. Development of real-time processing algorithms on an embedded System-on-a-Chip (SoC) requires additional effort due to the constrained resources, yet low-power consumption enables embedded GPUs to be applied in MicroTCA.4 computing architecture that is prevalent in nuclear fusion projects. For this purpose, the authors have proposed, developed and optimised GPU-accelerated algorithms with the use of available software tools for NVIDIA Tegra systems. Furthermore, the implemented algorithms are evaluated and benchmarked on Wendelstein 7-X (W7-X) stellarator experimental data against the corresponding alternative Central Processing Unit (CPU) implementations. Considerable improvement is observed for the accelerated algorithms that enable real-time detection on the embedded SoC platform, yet some encountered limitations when developing parallel image processing routines are described and signified.


Introduction
Infrared (IR) thermal imaging nowadays is a non-destructive monitoring and measurement technique widely applied in both the industry and research. Thermography enables one to observe the radiating surface of an object with a temperature greater than 0 K [1]. Under real conditions, the image captured by an IR detector has to be pre-processed to take into account the observed body emissivity as well as device nonuniformities. With the advances in thermographic cameras and computational capabilities, it has become feasible to capture and process high-resolution images in real-time. Currently, dedicated vision systems are one of the major diagnostics in large-scale physics experiments, especially in nuclear fusion projects [2][3][4][5].

Image Plasma Diagnostics
Current objectives for image diagnostics systems mainly concern machine protection, plasma control and physics exploitation. The first exemplary use case, utilising images in nuclear fusion, covers algorithms for detecting and tracking thermal events occurring throughout a plasma discharge (see Figure 1). In the Wendelstein 7-X (W7-X), the algorithms identify events such as hotspots, leading edges, strike-lines, surface layers that exceed the operational limits of the stellarator [6][7][8][9][10]. Based on the detected thermal events, the risk is estimated, and if it exceeds the threshold, the interlock system is triggered [2]. Similar challenges are faced in the W Environment in Steady-state Tokamak (WEST) with persistent and transient hotspots detection and categorisation [11,12]. Moreover, strike-line images together with toroidal current are inputs for a Convolutional Neural Network (CNN) that outputs lower and upper divertor coils controls. The instrumentation is realised in order to modify a strike-line position and shape to protect Plasma Facing Components (PFCs) [13]. Physical quantities such as heat flux are computed using surface temperature and THermal Energy Onto DivertOR (THEODOR) code [14,15]. The estimated heat flux together with Gaussian spatial filtering and moving average filtering allows one to characterise strike-lines in the W7-X [6].
Visible light images processing is applied in areas where IR systems have issues with emissivity, absorption and dynamic range. Visible spectrum images are utilised to determine plasma discharge location, shape and timing in the W7-X [16,17] as well as to reconstruct a plasma boundary in the Experimental Advanced Superconducting Tokamak (EAST) [4]. Hotspots are also analysed by detecting clusters of bright pixels and tracking them across consecutive frames in the Joint European Torus (JET) [3]. Furthermore, images aid in the analysis and detection of anomalies such as falling debris, Multifaceted Asymmetric Radiation From the Edges (MARFEs) and Edge Localised Modes (ELMs) [11].
Besides, archived data is further utilised for offline analysis by physicists and diagnosticians. Isolation of thermal events and determining their ontology allow one to automate the plasma control strategies [6]. Reliable, automated and real-time systems are the basis for steady-state operation in future fusion experiments.

General-Purpose Computing on Graphics Processing Unit
According to the requirements for the future viewing systems in the ITER project, data volume for processing from a single camera was estimated to [18]: • 600 MB/s for a 16-bit IR image with resolution 2000 × 1500 at 100 Hz • 12 GB/s for an 8-bit visible image with resolution 4000 × 3000 at 1 kHz As a consequence, for robust autonomous machine protection and plasma control during operation, significant computational power is necessary. Acceleration of compu-tations might be provided by the general-purpose computing on Graphics Processing Units (GPUs) [19]. GPUs are flexible hardware accelerators for intensive computations that offer the best price-performance ratio compared to Field-Programmable Gate Arrays (FPGAs) [20] and Digital Signal Processors (DSPs) [21]. The highest performance increase is observed in applications with massive parallelism and minimal data transfers between a device and a host, i.e., Central Processing Unit (CPU). Although FPGAs might provide better power consumption-performance ratio, the increased implementation time and complexity is inconvenient in the rapidly changing thermonuclear fusion systems. Moreover, GPUs are common in plasma diagnostics setups [4,7,10], i.e., current configurations typically include GPUs as image processing hardware accelerators. As a consequence, it is significantly easier to integrate described algorithms and techniques. However, it is advisable to incorporate FPGAs for simpler and established pre-processing algorithms, e.g., correction and calibration steps.
In this paper, two thermal event detection algorithms for nuclear fusion were implemented to investigate the suitability of low-power embedded NVIDIA Tegra System-on-a-Chip (SoC) platforms for real-time processing. Conclusions are derived based on the performed benchmarks and the comparison between embedded GPU and CPU implementations.

NVIDIA Tegra System-on-a-Chip
NVIDIA Tegra SoC integrates an embedded CPU and GPU into a single integrated circuit. NVIDIA Tegra devices are low-cost, small and energy-efficient, yet powerful embedded platforms. NVIDIA Jetson TX2 [22] is an embedded system equipped with an NVIDIA Tegra destined for edge computing (see Table 1). It runs Linux for Tegra (L4T) Operating System (OS) based on Ubuntu 18.04 that is a part of dedicated JetPack SDK (https://developer.nvidia.com/embedded/jetpack/, accessed on 21 July 2021) 4.5.1 incorporating Compute Unified Device Architecture (CUDA) 10.2 toolkit for building accelerated solutions.  [23] shown in Figure 2.
MicroTCA.4 is a strongly adapted architecture in large-scale scientific experiments, including nuclear fusion projects such as ITER and W7-X. The MicroTCA.4 form factor enables a modular design that supports acquisition from several fast high-resolution cameras and frame grabbers equipped with a Camera Link interface in a single chassis. In addition, MicroTCA offers embedded system management and health monitoring capabilities [5,24,25]. Due to the low-power consumption, NVIDIA Tegra embedded GPUs are interfaced through additional modules [26,27] with MicroTCA.4 architecture in order to deliver a complete processing chain. MicroTCA.4 slots are limited to 80 W, and the average power consumption for a Tesla series Peripheral Component Interconnect Express (PCIe) GPU is 250 W. As a consequence, many discrete GPUs are inapplicable.

Computer Vision Software
OpenCV (https://opencv.org/platforms/cuda/, accessed on 21 July 2021) 4.4.0 was mainly used to implement image processing algorithms for both CPU and GPU. It is a leading open-source, optimised and cross-platform library for Computer Vision (CV) tasks. The library was compiled with CUDA support to enable Application Programming Interface (API) for graphics accelerators. Moreover, other NVIDIA libraries such as VisionWorks (https://developer.nvidia.com/embedded/visionworks/, accessed on 21 July 2021) 1.6, Nvidia Performance Primitives (NPP) (https://docs.nvidia.com/cuda/npp/, accessed on 21 July 2021) 10.2.89 and Thrust (https://developer.nvidia.com/thrust/, accessed on 21 July 2021) 1.9.7 were evaluated to be feasible to perform additional GPU operations that complement missing or suboptimal OpenCV functionalities.
The programming language used for development was C++17 compiled with GNU Compiler Collection (GCC) 7.5.0 as it supports the abovementioned libraries; besides, it allows one to provide custom CUDA kernels as well as manually manage the GPU's memory through the CUDA runtime API.

Wendelstein 7-X Experimental Data
The experimental data used for implementing and evaluating algorithms comes from W7-X stellarator experiments (Max Planck Institute for Plasma Physics, Greifswald, Germany). The program identifier of the archived calibrated pulse is 20171114.053, and the camera port is AEF20. Each image, captured by IRCam Caleo 768kL IR camera, has a 14-bit depth with a resolution of 1024 × 768. In addition, frames have assigned epoch timestamps as well as dynamic attributes such as Region of Interest (RoI), sensor temperature and frame rate. The data is stored in Hierarchical Data Format version 5 (HDF5) and contains supplementary pixel-wise scene model information such as Field of View (FoV) of the camera, Computer-Aided Design (CAD) model of the stellarator, hierarchical PFC identifiers. The HDF5 supports n-dimensional datasets, bundles together metadata and features high I/O performance; therefore, it is widely applied throughout scientific fields.
Data flow in the evaluated NVIDIA Jetson TX2 configuration is visualised in Figure 3. Since the GPU and CPU share the same physical memory, a pinned buffer is accessible from both devices. However, page-locked memory is uncached on NVIDIA Jetson TX2; therefore, a frame is efficiently copied with Direct Memory Access (DMA) to a cacheable device memory buffer for further computations [28].

Algorithms Implementation
This section describes the main ideas behind the detection of thermal events and GPU-accelerated algorithm implementation details. The source code of the implemented algorithms and benchmarks is available on GitLab (https://gitlab.dmcs.pl/jablonskiba/ mdpi-energies-2021-thermal-events-detection-tegra, accessed on 20 May 2021).

Scene Model Pre-Processing
The scene model data is transformed to the format utilised in the detection algorithms prior to any computations. Hierarchical PFC identifiers are encoded in 32-bits in the scene model. Thus, they are decoded to extract top-level components (see Figure 4) according to the predefined components coding [29]. In order to obtain a continuous mask, without any undefined pixels, the morphological closing with a 3 × 1 kernel and median blur with a 3 × 3 kernel operations are applied (see Figure 4). The undefined pixels inside PFCs might be observed in Figure 5 in the form of empty spaces between tiles. The components segmentation allows one to identify components affected by thermal events and utilise the masks to execute certain operations on individual parts. Furthermore, 2D pixel coordinates are transformed to 3D Cartesian coordinates based on the Lookup Table (LUT) generated from the scene model's x, y, z mappings.

Overload Hotspots Detection Procedure
The first algorithm detects and tracks overload hotspots, which exceed the operational limits of the PFCs [7] shown in Figure 5. The processing pipeline for the hotspot detection algorithm is indicated in Figure 6. Topological structural analysis extracts contours of overheating blobs [30]. The initial frame upload to a GPU is performed concurrently with the execution of steps one and two. A frame is divided row-wise into four chunks, and each chunk is processed asynchronously on a different CUDA stream. Copy and execution overlap is possible in this case as the host memory is pinned and executed computations are embarrassingly parallel. The transfer between steps four and five utilises unified memory with prefetching in order to reduce the migration overhead by enabling cached access to the same buffer for both the CPU and GPU. The entire host buffer is prefetched with cudaStreamAttachMemAsync (. . . , cudaMemAttachHost) before its use. As it is the asynchronous call to CUDA API, synchronisation is necessary if the host immediately processes the buffer. For GPU prefetching, cudaStreamAttachMemAsync (. . . , cudaMemAttachGlobal) is invoked. It is noteworthy that the cudaStreamAttachMemAsync function shall be used as the cudaMemPrefetchAsync, advised for discrete GPUs, is unsupported on Tegra devices [28]. Hotspot average temperature, maximum temperature and area attributes are computed and updated for each tracked hotspot. Moreover, pixels that contribute to the hotspot are translated to world coordinates using the LUT described in Section 3.1.
The result (see Figure 7) for the available dataset corresponds with the hotspot labelled on the baffle with a maximum temperature of 456°C in [7]. The hotspot originated at timestamp 1,510,677,589,505,503,488 ns epoch time and persisted for around 190 ms.

Tracking with Cluster Correspondence
Finally, tracking of blobs between consecutive frames is performed with correspondence criteria proposed in [3] that investigates overlapping and absolute size difference between pairs of blobs within consecutive frames.
The correspondence criteria for blobs A and B is defined with Equations (1)-(3) as: where min_area and max_oversize are adjustable parameters that constrain minimum overlap and maximum absolute size difference between blobs, respectively. An example presenting correspondence criteria between two blobs is shown in Figure 8. For the purpose of blob matching, the required minimum overlap is equal to 80% and the maximum two-fold size difference is allowed. The areas of blobs A, B, A ∩ B are computed concurrently on different CUDA streams to reduce the overall runtime as these operations are independent and require the same calculations.

Surface Layers Detection
The second implemented and benchmarked algorithm detects surface layers. Exposure of components to heat load (strike-line) leads to carbon erosion that is redeposited in the form of thin layers. The measured temperature is higher due to the lower thermal connection to the component surface. As a consequence, a false hotspot alarm might be triggered [9,10]. The detection scheme is based on the idea of a normalised derivative of the temperature evolution proposed by [14]. The authors state that the heating and cooling of regions affected by surface layers is more rapid compared to other regions.
This allows one to identify those regions by thresholding temperature evolution (see Figure 9). The normalised derivative is computed pixel-wise with Formula (4) as: where T(x t ) is the pixel temperature at time t. T(x t − 1) and T(x t − 2) correspond to the pixel temperatures in two previous frames in respect to the frame at time t. In the available experimental data, the detected surface layers are visible on the horizontal and vertical divertors (see Figure 10). Red colour designates surface layers that originated due to heating, blue due to cooling and green due to both factors. As expected, the surface layers were observed in regions that are affected by the heat load caused by the strike-lines. The analogous procedure might be applied to detect delaminations where thermal equilibrium is reached slower and at a higher surface temperature [31]. The surface temperature in areas covered by the surface layers (pixel with coordinates x: 599, y: 403) and the normal surface (x: 748, y: 464) under the strike-line were analysed. In Figure 11, rapid heating and cooling are observed, and the reached surface temperature is higher for the area covered by the surface layers. In Figure 12, the derivative exceeds the predefined threshold, and derivative spikes are notably larger for the surface layer area in general.    The used derivative threshold value that indicates if the area contains surface layers is the same as in [14].

Results
In this section, the suitability of the embedded NVIDIA Tegra for real-time processing is investigated. The most computationally intensive pulse sections for the particular algorithms were benchmarked. The followed benchmarking method assumes that the algorithms were compiled with the highest optimisation level and are repeated in a loop. The highest performance mode for both CPU and GPU on Jetson TX2 is activated with commands sudo nvpmodel -m 0 and sudo jetson_clocks. Running the algorithms on the first several frames without including them in the measurements allow one to avoid the initialisation overhead such as initial buffers allocation and CUDA context creation. Supplementary calculations for visualisation purposes, e.g., colouring surface layers types, the imposition of the CAD model, are not included in the measurements.
Benchmarks were made with analogical implementations that only utilise the CPU. Individual functions are mainly implemented in OpenCV compiled with support for the parallel framework-Threading Building Blocks (TBB) (https://software.intel.com/content/ www/us/en/develop/tools/oneapi/components/onetbb.html, accessed on 21 July 2021) 2017.0 in order to benefit from multiple cores. Average statistics were calculated from the unit timings obtained for each frame across the selected pulse section. The final statistics are computed from the result set, such as average, standard deviation, minimum and maximum.
The selected sections correspond to certain discharge stages (see Figure 13). The overload hotspots detection algorithm is benchmarked throughout the detected hotspot persistence, from timestamp C to D (191 ms) that corresponds to 20 frames. The hotspot ceases around the time when the heat injection is finished. The surface layers detection algorithm is benchmarked throughout the initial temperature rise, from timestamp A to B (251 ms), which equals 26 frames. The initial transfer cost to a GPU is included in the measurements. Gathered measurements on the setup presented in Table 1 are depicted in Table 2. The time-weighted average metrics of achieved occupancy and multiprocessor activity for all kernels, including both custom and invoked internally by the libraries, are collected with the nvprof profiling tool and summarised in Table 3. In both algorithms, the utilisation of a GPU resulted in overall performance improvement.

Discussion
The implemented algorithms are a subset of the whole processing pipeline, and additional algorithms must be applied for each frame to provide exhaustive image plasma diagnostics. According to [7], the thermal events must be detected within 100 ms to trigger the interlock system. Nonetheless, the available time is reduced by other activities such as acquisition, correction, analysis and alarm generation. Therefore, the shorter the processing algorithm runtime is, the more time might be devoted to other processes. In the case of incorporating the embedded GPU for both algorithms running sequentially, 13.89 ms is spared on average for each frame. Moreover, there is enough time to simultaneously run both algorithms in order to assess the risk, e.g., if a detected hotspot is a result of the surface layers. As a consequence, proposed algorithms might be employed for real-time thermal events detection. A significant performance improvement is observed for the surface layers detection on the embedded GPU as all operations are embarrassingly parallel; therefore, there is no need to migrate data to the host for other computations. For the overload hotspot detection on the embedded GPU, the acceleration is lower due to the bottleneck of a topological structural analysis performed on the embedded CPU.
It is noteworthy that to obtain at least the performance comparable to the embedded CPU overload hotspots detection implementation, it was necessary to apply several GPU optimisation techniques such as concurrent copy and execution, page-locked memory, unified memory and custom kernels in addition to incorporating optimised libraries. The crucial part was realising the discrepancies between the memory architecture of embedded NVIDIA Tegra devices and discrete GPUs. Characteristics of memory types in terms of caching are dependent on the exact compute capability of the hardware. For instance, unified memory allows one to avoid duplicate allocations and transfers between embedded GPU and CPU since the physical SoC Dynamic Random-Access Memory (RAM) is shared. Additionally, if prefetching hints [28] are applied, the coherency management is optimised by the driver. However, multiple accesses to pinned memory on both the embedded CPU and GPU result in performance degradation as the I/O coherency [28] is unsupported on the NVIDIA Jetson TX2. As a consequence, in the case of repetitive accesses, it might be advisable to copy a buffer to the device memory for optimal caching behaviour.
Algorithms with pixel-level parallelism are insensitive to image contents. Task-level parallelism requires analysis of the contents; therefore, the measured performance varies. This type of parallelism is often introduced by operations executed on a CPU. For instance, more surface layers would not affect the detection performance as all operations are pixel-wise. On the contrary, more overload hotspots would deteriorate the performance since additional blobs are analysed. Nevertheless, for artificial input frames where all pixels exceed operational limits, and if the analysed blobs are constrained to four, then the algorithm still completes in around 20 ms on the embedded GPU and in above 30 ms on the embedded CPU. Although the performance of some algorithms might vary for different input sequences, it is reasonable to assume that benchmarks performed on the real dataset correctly characterise and approximate the required computation time for future fusion experiments.
Although the image processing libraries were incorporated in the implementation of the algorithms, some available operations in the libraries were insufficient in terms of supported data types or performance. For instance, OpenCV implementation of a median filter cv::cuda::Filter created via cv::cuda::createMedianFilter is only limited to 8-bit images as well as it has low performance (56.14 ± 0.40 ms) compared to an NPP implementation nppiFilterMedian_8u_C1R (3.22 ± 0.04 ms) that also supports 16-bit images nppiFilterMe-dian_16u_C1R. Nonetheless, for a small median filter size, i.e., 3 × 3, a custom CUDA kernel was implemented and integrated with OpenCV image wrappers to enhance performance by fully utilising coalesced access to shared memory and efficient vector sorting by exchange proposed in the literature [32,33] (0.73 ± 0.02 ms). The average performance of OpenCV CPU implementation optimised for small kernels cv::medianBlur is (1.28 ± 0.03 ms). The runtime was measured on an 8-bit image with a resolution of 1024 × 768. The OpenCV GPU implementation of the median filter was the major bottleneck. The custom CUDA kernel implementation reduced the runtime by 77 times while performing nearly twice as fast as the OpenCV CPU implementation. Optimisations incorporated in the CUDA 3 × 3 median filter kernel are: • Shared memory minimises global memory accesses for threads in the same block. An image is chunked into 2D thread blocks that have assigned shared memory of the block size with additional zero-padded borders. Each thread loads the corresponding pixel value from the global memory to the shared memory. Edge threads fill the zero-padding, and if the padded value belongs to the image, it is replaced with the underlying pixel value (see Figure 14) [33]. • Forgetful sorting determines the median value out of nine neighbouring values (see Figure 14). It reduces constant memory and local memory usage (register spilling) in comparison to other sorting algorithms [33]. • Branchless exchange sorts two values with a single instruction [32].
The achieved occupancy and multiprocessor activity for this CUDA kernel are 0.961 and 99.8%, respectively.
The potential solution to reduce latency in setups equipped with a fast PCIe bus and interfacing with a third-party image provider, e.g., a frame grabber, is to utilise efficient GPUDirect transfers [4] in order to directly access data on third-party devices connected to the same PCIe root complex. NVIDIA Jetson Xavier AGX supports GPUDirect technology and is incorporated into the MicroTCA.4 compliant platform [26].
In future works, the authors plan to seek further parallelisation of other thermal image processing algorithms, e.g., reflection detection based on Normalised Cross Correlations (NCC) on a Sliding Time Window (SWNCC) [12] and strike-line detection with max-tree attributes filtering [29]. Moreover, the succeeding implementations and performance benchmarks will be performed on discrete GPU series such as Quadro or Tesla with x86 processors and other NVIDIA Jetson platforms. Finally, the development of tools for managing and monitoring the thermal image processing pipeline for plasma diagnostics is already underway.

Conclusions
Current and future fusion devices will increasingly need computational power. GPUs might increase the system performance and provide sufficient acceleration for time-critical applications also on low-power embedded devices compatible with MicroTCA.4 architectures. The performance improvements for overload hotspots detection and surface layers detection algorithms equal 78% and 283%, respectively. The optimisation process includes the implementation of the custom CUDA median filter that outperforms the OpenCV GPU implementation by 7605%. The presented work applies to other scientific disciplines, where thermal images are processed. Described techniques and algorithms especially pertain to other fusion experiments, e.g., hotspots detecting and tracking issues occur as well in tokamaks such as ITER [18], WEST [12], JET [3]. The proposed algorithms are suitable for real-time plasma diagnostics, and the detected events align with observations described in the literature [7]. In addition, the paper proposes exact image processing pipelines for thermal event detection as well as reports expected accelerated performance in the real nuclear fusion experiment with optimisation techniques specific to the NVIDIA Tegra embedded GPU. It might serve as a baseline for others to decide if an embedded SoC platform is sufficient for their application or to integrate the proposed solutions in systems with either discrete or embedded GPUs.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: