Best Practices for the Deployment of Edge Inference: The Conclusions to Start Designing

Abstract: The number of Artificial Intelligence (AI) and Machine Learning (ML) designs is rapidly increasing, which raises questions on how to start an AI design for edge systems, what steps to follow and which pieces are critical for optimal performance. The complete development flow undergoes two distinct phases: training and inference. During training, all the weights of the network are calculated through optimization and back propagation. The training phase is executed with 32-bit floating-point arithmetic, as this is the convenient format for GPU platforms. The inference phase, on the other hand, applies the trained network to new data. The sensitive optimization and back propagation phases are removed and only forward propagation is used. A much lower bit width and fixed-point arithmetic are used, aiming at a good result with reduced footprint and power consumption. This study follows a survey-based process and aims to clarify all AI edge hardware design aspects, from the concept to the final implementation and evaluation. The relevant technologies, frameworks and procedures are presented in their order of execution, for a complete and successful design cycle.


Introduction
One of the main challenges that Artificial Intelligence (AI) and Machine Learning (ML) designs face is computational cost. This forces improvements in hardware acceleration in order to offer the high computational power required by the application fields that must be supported [1][2][3][4][5]. Recently, hardware accelerators have been developed to provide the needed computational power for AI and ML designs [6]. In the literature, hardware accelerators are built using Field Programmable Gate Arrays (FPGA), Graphic Processor Units (GPU) and Application Specific Integrated Circuits (ASIC) to accelerate the computationally intensive tasks [7]. These accelerators provide high-performance hardware while aiming to preserve the required accuracy [8].
GPUs are largely used at the training stage of an AI/ML design because they provide floating-point accuracy and parallel computation [9,10]. They also have a well-established ecosystem as a software/hardware collaborative scheme. At the inference stage of an AI/ML design, however, a GPU consumes more energy than can be tolerated in edge devices, where other platforms can be more efficient in both energy and chip area.
Reconfigurability, customizable dataflow, adjustable data width, low power and real-time operation make the FPGA an appealing platform for the inference phase of AI/ML designs [11,12]. The performance of an AI/ML accelerator can be limited by the computation and memory resources on the FPGA. Various techniques have been developed to improve bandwidth and resource utilization, as will be presented in the paragraphs that follow.
Another way to accelerate an AI/ML design is customized, dedicated hardware: the well-known Application-Specific Integrated Circuit (ASIC) [13,14]. ASICs are specifically tailored to accelerate one algorithm or a certain type of algorithm [15][16][17][18][19]. The acceleration effect is usually good and the power consumption is low, but at the same time the reconfigurability is poor and the development cost is high, as custom hardware design entails a longer development cycle.
Among the plurality of surveys that have been recently published, and which are individually mentioned in the main body of this work, the target of this study is the collection of the complete end-to-end AI/ML design considerations. They are presented in the order of execution, but without reducing them to a single task sequence. This structure is presented in Figure 1. The left-most block of Figure 1 contains the basic steps of the AI/ML design, modeled and evaluated so as to resolve a specific problem. It is shown in detail in Figure 2, as an iterative process that depends on the selected dataset and produces a floating-point model. The model of choice is trained with a known dataset and validated with a smaller portion of it. Testing on new data that the model has never seen before completes this phase after a number of iterations. The details of this process are described in Section 2. In the middle is the definition of the hardware onto which the model will be transferred. The deployment to hardware is strongly connected with the architecture and the type of hardware that has been selected; thus, this decision should be made at an early stage. The available options are presented in Section 3 of this work. The methods to optimize the AI/ML design for efficient deployment to the hardware are covered in the right-most block of Figure 1. At the scale of edge AI, on which this work is focused, hyperparameter tuning is necessary to compensate for the limitations of the hardware capabilities. The overall process may be iterated several times for optimal performance. The rest of this study is structured as follows:

• First, the relationship to the dataload that the AI/ML design will be processing is defined. Data are almost everything for deep learning, and the best choices of the required resources are made when the data have good quality. The selected dataload should represent the data that the AI/ML design will confront when it is implemented, otherwise the design will be flawed.

• Second comes the selection of the AI/ML design architecture, as modeled at a high level, together with the definition of the design constraints. There are reasons to select between a very flexible FPGA implementation and an ASIC deployment for the execution of the AI/ML design.

• Last but not least, the optimization and verification of the hardware design are discussed, with quantization, power consumption and area being the most important parts.

Dataload for Problem Solving
There is a large number of software libraries and frameworks developed for high-performance execution of the training stage in AI/ML designs, with the target of achieving the best accuracy and improving efficiency. Toolkits such as PyTorch [20,21] and TensorFlow [22][23][24] are popular, aiming to increase productivity in this scope by providing high-level APIs. They also support software-level optimizations that are performed iteratively, where, in each iteration, a small number of parameters are removed and the resulting network is fine-tuned by retraining for a limited number of epochs. The accuracy of the network is analyzed on the basis of the accuracy loss, and from there the decision whether to continue is taken. Despite the power cost of GPUs, high-performance execution of training can be efficiently achieved, due to their multi-core and fast memory architecture [25][26][27].
Proper allocation of the effort during the iterations is very important. Collecting more data, preprocessing the existing data with different optimization algorithms [28] or fine-tuning the algorithm's operation is necessary at this stage.
The dataset should include the useful data required to resolve the particular problem for which the AI/ML system will be designed. In most cases, it will contain a large variety of data that are not all useful. Examining the distribution of the data is the best method to assess this condition; when the distribution matches the target of the AI/ML system, the design can be reduced to the necessary size. The best representation of the data can be achieved by clipping the far outlier values within an equal min/max range. Exploring the data in a visual context can expose patterns, trends and correlations that might not otherwise be detected. The features selected to train the AI/ML design should always be natural and meaningful to the context and purpose of the design, and the input data should be scaled to a common range. In addition, unnatural values or those with conventional meanings should be avoided.
Normalizing the values is a good approach to optimizing the input data so that they are suitable for training [29,30].
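As a minimal sketch of the clipping and scaling steps above, the following Python snippet clips far outliers to an equal min/max range and normalizes the feature to [0, 1]; the percentile bounds and the synthetic data are illustrative assumptions, not values prescribed in this study.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(loc=50.0, scale=10.0, size=1000)
raw[:5] = [500.0, -400.0, 320.0, -250.0, 610.0]   # inject far outliers

# Clip far outliers to an equal min/max range around the bulk of the data
# (1st/99th percentiles chosen here purely for illustration).
lo, hi = np.percentile(raw, [1, 99])
clipped = np.clip(raw, lo, hi)

# Scale the feature to a common [0, 1] range so features stay comparable.
normalized = (clipped - clipped.min()) / (clipped.max() - clipped.min())

print(normalized.min(), normalized.max())  # 0.0 1.0
```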
Cleaning up a dataset from duplicated entries is an important action, but the exact method to do so depends on the actual purpose of the AI/ML design. In particular, the data preparation functions that have been found useful for an AI/ML design focus on statistical methods to obtain the most optimal structure, distribution and quality of the dataset prior to starting the training of the model [31].
Pandas is the Data Analysis Library in Python [32]. It is a tool for analyzing and cleaning up large datasets, as it can print out a summary of the most basic statistical information for each feature in the dataset, such as the minimum, maximum, mean and quartile values [33]. A sanity check of the dataset is an important prerequisite, as it will ensure that the distribution of the data is the expected one.
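The snippet below sketches such a sanity check with Pandas; the inline dataset is a hypothetical example standing in for a real file that would normally be loaded with read_csv.

```python
import pandas as pd

# Hypothetical sensor dataset; in practice this would come from disk,
# e.g. df = pd.read_csv("sensor_readings.csv").
df = pd.DataFrame({
    "temperature": [21.4, 21.4, 22.0, 85.0, 21.9],   # 85.0 is a suspect outlier
    "humidity":    [40.1, 40.1, 41.5, 39.8, 40.6],
})

print(df.describe())        # min, max, mean and quartile values per feature
df = df.drop_duplicates()   # remove duplicated entries (rows 0 and 1 here)
print(df.isna().sum())      # verify that no feature contains missing values
```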
TensorFlow Extended (TFX) is a good example of a front-to-back flow that performs the data preparation, followed by a formalized validation and deployment of the AI/ML design [34].
The quality of the AI/ML design can be significantly improved with error analysis [35]. A single scalar metric that measures end-to-end quality is necessary, as it can be used to plot the learning curves that most toolflows provide. Identifying whether the AI/ML design is under-fitting or over-fitting the training data can be done before running a long training session. Improvements can be applied by spotting patterns in the analysis [36,37].
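As an illustration of reading these learning curves, the sketch below plots hypothetical training and validation losses; the values are placeholders, and in practice they would come from the history object of the chosen toolflow.

```python
import matplotlib.pyplot as plt

train_loss = [1.9, 1.2, 0.8, 0.55, 0.40, 0.30, 0.24, 0.20]
val_loss   = [2.0, 1.3, 0.9, 0.70, 0.65, 0.68, 0.75, 0.85]  # diverges upward

plt.plot(train_loss, label="training loss")
plt.plot(val_loss, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss (single scalar metric)")
plt.legend()
plt.show()
# A validation curve that stops decreasing while the training curve keeps
# falling is the over-fitting pattern; both curves staying high suggests
# under-fitting.
```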
Sometimes it is impractical or impossible to collect new raw datasets for use as training data, and a number of methods have been developed to synthesize artificial data. Data augmentation is the most popular method that can be applied as a data-space solution to the problem of limited data [38,39]. In AI/ML designs, the validation error must continue to decrease along with the training error, and data augmentation is a very powerful method of achieving this. The augmented data represent a more comprehensive set of possible data points, thus minimizing the distance between the training and validation sets, as well as any future testing sets. Further methods to improve the generalization performance of an AI/ML design focus on the architecture of the model. These are functional solutions known as regularization techniques, which aim at making the model generalize better. In particular, these include Dropout [40], Batch normalization [41,42], Transfer learning [43], L2-regularization [44], Hierarchical Guidance and Regularization (HGR) learning [45], translation [46], horizontal flipping [47], noise disturbance [48], and a plethora of others available in the bibliography. Recently, the method of dynamic feature selection has been introduced, with minimum-redundancy information for linear data [49]. It is an effective feature selection method that can select a feature subset from higher-order features; the selected subset exhibits high correlation and low redundancy in various real-life datasets. Similarly, Zheng et al. propose a full-stage data augmentation framework to improve the accuracy of deep CNNs, which can also play the role of model ensembling without introducing additional model training costs [50].
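A minimal augmentation pipeline along these lines, assuming PyTorch/torchvision as the toolkit and using the horizontal-flip, translation and noise-disturbance methods cited above, could look as follows; the composition and parameters are illustrative choices, not a prescription.

```python
import torch
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                     # horizontal flipping
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # translation
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # noise disturbance
])

# Each pass over a training image yields a new artificial sample, enlarging
# the effective dataset without collecting new raw data.
img = Image.new("RGB", (32, 32))   # placeholder image
sample = augment(img)
print(sample.shape)                # torch.Size([3, 32, 32])
```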

Definitions on Hardware Computing Platforms
AI/ML designs are in principle computationally expensive. Thus, numerous architecture designs have been laid down by researchers and also contributed commercially to promote hardware acceleration. The state of the art in accelerating deep learning algorithms supports a plethora of hardware platforms, such as the field programmable gate array (FPGA), the application specific integrated circuit (ASIC) and the graphic processing unit (GPU). The features most attractive to hardware designers are the benefits gained from increased performance, high energy efficiency, a fast development cycle and the capability of reconfiguration. This section targets the differentiation among, and the benefits of, FPGA-, GPU- and ASIC-based processors for AI/ML applications, focusing specifically on inference for edge applications with Deep Neural Networks (DNN) and Convolutional Neural Networks (CNN).

FPGA-Based Accelerators
An FPGA architecture consists of the programmable input/output (I/O) and the programmable logic core. Inside the FPGA chip, the I/O blocks are physically located at the edge of the die to facilitate the connection with external modules. The programmable core fabric includes programmable logic blocks and programmable routing interconnections. In the basic FPGA architecture, multiple DSP blocks and RAM blocks are also embedded, constituting the Processing Elements (PE) of the device.
FPGAs are generally accepted for computation-intensive applications [51][52][53], and many research papers have demonstrated that FPGAs are very suitable for latency-sensitive, real-time inference jobs [54][55][56]. FPGAs also provide high throughput at a reasonable price, with low power consumption and reconfigurability. The highest energy consumption results from accessing the off-chip DRAM memory for data movement; therefore, the use of Block RAM (BRAM) is preferred, where low latency can also be achieved. There are also low-power techniques applied at RTL, such as clock-gating and memory-splitting, for efficient code generation of CNN algorithms. In [57], it is shown that FPGAs can maximize the parallelism of feature maps and convolutional kernels with high bandwidth and dynamic configurations to improve performance. Other observations [58] have concluded that DSP blocks and BRAM elements are more crucial for increasing parallelism than flip-flops (FFs) and look-up tables (LUTs), which can be further exploited and must be taken into consideration when designing future FPGAs.
When designing CNNs to fit inside an FPGA, several considerations should be taken into account, including resource optimizations such as compression (weight quantization and network pruning) [59] and, similarly, implementing Binary Neural Networks (BNN), a type of CNN in which the MAC operations are transformed, using weight values of +1 and −1, into less expensive computations such as XNOR and bit counting. The aforementioned techniques can deliver efficient resource utilization without a significant accuracy drop [60] in FPGA implementations. Other quantization methods map 32-bit floating-point (FP32) vector representations to 8-bit fixed-point (INT8), although further network quantization is limited because binarized activation functions, as well as binarized weights, show a significant accuracy drop (12.4%) [61,62]. Figure 3 shows the high-level workflow of the quantization and pruning methods applied to the weights and network of a CNN in order to fit it inside an FPGA; more details are presented in the third section of this study. Minakova et al. in [63] present a novel method to reduce the memory required to store intermediate data exchanged between CNN operations. The method is orthogonal to pruning and quantization, and hence they can be combined. In particular, it is proposed to convert each CNN layer into a functionally equivalent Cyclo-Static Dataflow (CSDF) model [64], which reduces the memory footprint by 17% to 64%. The method is attached to the PEs that are related to the execution of the specific layer. The model is shown in Figure 4a, as adapted from the relative study.
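To make the BNN arithmetic concrete, the following sketch shows how a dot product over +1/−1 values reduces to an XNOR and a bit count; the encoding of −1/+1 as 0/1 bits is a common convention, assumed here purely for illustration.

```python
import numpy as np

def binary_mac(a_bits: np.ndarray, w_bits: np.ndarray) -> int:
    """Dot product of +/-1 vectors encoded as 0/1 bits via XNOR + popcount."""
    xnor = ~(a_bits ^ w_bits) & 1          # 1 wherever the signs agree
    matches = int(xnor.sum())              # popcount (bit counting)
    n = a_bits.size
    return 2 * matches - n                 # equals sum(a_i * w_i) over {-1,+1}

a = np.array([1, 0, 1, 1, 0])  # encodes [+1, -1, +1, +1, -1]
w = np.array([1, 1, 0, 1, 0])  # encodes [+1, +1, -1, +1, -1]
print(binary_mac(a, w))        # 1, same as the dot product of the +/-1 values
```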
The massive amount of storage space provided by external memories comes at the cost of high, and usually non-constant, access time. To address this concern, Benes et al. in [65] propose an architecture that optimizes memory accesses in such a way that Read-Modify-Write operations ideally do not require pipeline stalls, even when memory accesses are randomly distributed over the whole external memory. The core of the proposal is the setup of a Transaction table that contains multiple entries with the same address. The Transaction table format is shown in Figure 4b.
The strong research interest in addressing the memory requirements of AI/ML designs has driven the investigation of memory-aware designs, such as µ-Genie, proposed by Stramondo et al. in [66]. The effective use of the available resources is achieved with a co-designed memory system and processing elements. The template of the functional unit for this purpose is adapted from the relative study in Figure 4c. Further challenges include streamlining the software architectures to facilitate the development of efficient mappings of AI models, using a hardware description language (HDL), onto FPGAs. Towards this process, several frameworks have been introduced. HADDOC2 [67] models the target CNN as a dataflow graph; the optimized low-level architecture mainly uses LEs instead of DSPs. Furthermore, multiplications by zero are totally removed from the computation (replaced by simple signal connections), while multiplications by two are implemented via shift registers. hls4ml [68] translates ML algorithms for FPGA implementation using Python. It is based on an open-source code design workflow supporting network pruning and quantization, and with the latest work [69], its capabilities are extended towards low-power inference implementations. In the fpgaConvNet [70] architecture, one processing stage per layer is employed and a set of transformations is proposed to facilitate the CNN mapping onto FPGAs. In CNN2Gate [71], the computation flow of the network layers, along with the weights and biases, is extracted and a fixed-point quantization is applied. The necessary data are exported using vendor-specific OpenCL to synthesize the build; ultimately, the dependency on the developer's low-level hardware knowledge is abstracted away. FINN [72] also incorporates a data-driven approach, generating a custom streaming architecture based on the binary neural network (BNN) structure. Each convolution layer has its own computation engine and all engines are interconnected to create a pipeline. Development kits (like Vitis AI [73] and Dnnweaver2 [74]) transform the network model into RTL code. Recently, researchers have also addressed cost-effective implementations of FPGA clusters in AI and ML applications [75]. In Table 1, the most important FPGA frameworks, with their basic transformation techniques, are given.

Table 1. The most important FPGA frameworks and their basic transformation techniques.

Framework: Transformation Technique
HADDOC2: Models the CNN as a dataflow graph
hls4ml: Translates the NN into FPGA firmware using Python and HLS
fpgaConvNet: Assigns one processing stage per network layer
Dnnweaver2: Uses a dataflow graph
CNN2Gate: Automatic synthesis using OpenCL
FINN: Generates data streams based on a Binary NN
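As an example of how such frameworks are driven from Python, the sketch below follows the documented hls4ml workflow for a Keras model; the model file and FPGA part number are placeholder assumptions, and argument details may differ between hls4ml versions.

```python
import hls4ml
from tensorflow import keras

# Hypothetical trained floating-point model produced during the modeling phase.
model = keras.models.load_model("trained_model.h5")

# Derive a quantization/precision configuration from the model structure.
config = hls4ml.utils.config_from_keras_model(model, granularity="model")

# Convert the Keras model into an HLS project targeting an FPGA.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="hls_project",
    part="xc7z020clg400-1",   # placeholder Zynq-7020 part number
)

hls_model.compile()           # C simulation of the generated design
# hls_model.build()           # full HLS synthesis (requires vendor tools)
```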

GPU-Based Accelerators
The GPU architecture contains a programmable engine with massively parallel processors (ALUs) for computer graphics rendering, which includes multiple buffers with high-precision (64-bit double) support and fast DRAM modules. Recently, GPUs have also been used for general-purpose computing applications (GP-GPU). Therefore, GPUs are widely used as hardware accelerators, providing high processing speed and reduced execution time. GPUs process data in a single-instruction, multiple-thread (SIMT) fashion and have a large number of parallel ALUs, which fetch data directly from memory and do not communicate directly with each other. These programmable cores are supported by high memory bandwidth and thousands of threads [76]. Typical applications at which GPUs excel include signal and image processing, which take advantage of their multiple native floating-point cores together with their high parallelism. In Figure 5, a layout diagram of the main components inside a GPU is illustrated. In order to compute in parallel, concurrent threads must be created on the GPU. Each thread is executed on a GPU core (ALU), and multiple GPU cores are grouped together into blocks. Threads operating in the same block can exchange data through shared memory. At runtime, a list of blocks that need to be executed must be maintained by the system; however, each block executes independently of the others. A CNN implementation on a GPU is materialized using a high-level language such as C/C++. Thus, GPUs can be programmed more easily by developers, who are abstracted from the hardware layer. NVIDIA also created CUDA [77], a parallel programming platform to help developers accelerate general computing applications. In addition, TensorRT, a deep learning inference optimizer based on CUDA, supports INT8 and FP16 representations to reduce precision, as well as latency [78]. Typically, the process to fit a CNN into a GPU comprises two phases: (i) training and (ii) inference. In most cases, high-end GPUs are utilized for model training [79][80][81][82], where voltage reduction and frequency scaling can achieve better performance under power capping.
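As a small illustration of GPU inference with reduced precision, the following PyTorch sketch moves a toy model to the GPU and casts it to FP16, in the spirit of the FP16 support mentioned for TensorRT; the model structure and batch size are illustrative assumptions, not a TensorRT workflow.

```python
import torch

# Toy CNN; a real design would load a trained model instead.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
x = torch.randn(32, 3, 224, 224, device=device)   # a batch of 32 images

with torch.no_grad():
    if device == "cuda":
        model, x = model.half(), x.half()   # FP16 halves memory traffic
    y = model(x)   # the batch is processed by thousands of parallel threads
print(y.shape)     # torch.Size([32, 10])
```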
It has also been observed that, among different CNN implementations using GPUs, performance depends on the degree of parallelism in the execution threads, as well as on the efficient allocation of resources (registers, shared memory) [83]. For example, if Error Correction Code (ECC) protection in RAM is disabled, as bit-level accuracy is not mandatory for CNN applications, this can lead to a 6.2% decrease in memory utilization [84]. In Table 2, a list of CNN transformation frameworks with GPU support is shown.

ASIC-Based Accelerators
In contrast to the way an AI/ML design is mapped onto FPGA or GPU hardware structures for the purpose of hardware acceleration, the main way to meet the design target using an ASIC is to customize dedicated hardware acceleration circuits [91]. An optimized and specialized hardware implementation can reduce system cost by optimizing the needed resources and decreasing power requirements while improving performance [92].
Similar to reconfigurable, FPGA or FPGA-like architectures, dedicated AI/ML designs have also adopted fixed-point representations to reduce the size of the exchanged data [93,94]. In most cases, some flexibility is achieved with specialized MAC units that can be parallelized.
Dedicated architectures cannot be changed after fabrication; they include a separate processing unit with a generic role for all functions other than the specialized MAC operations. The fabrication process is time consuming and resource demanding, making it difficult to follow the rapid evolution of AI/ML designs.
Performance variation is a strong selection criterion, as it depends on the technology, data size and area efficiency of the AI/ML design. These different features justify the differences in peak performance among the architectures [95]. Several vendors involved in the design of AI/ML ASIC-based accelerators have selected their own hardware architectures. The most indicative cases are listed in Table 3.

Table 3. Available AI/ML specialized accelerators as ASIC.

Vendor: Hardware Architecture
Greenwaves [96]: Composed of one Fabric Controller (FC) core and eight cores in a cluster, implementing an extended version of the RISC-V instruction set, with a separate data cache for each.
BrainChip [97]: Based on spiking-neural-network (SNN) technology that connects 80 event-based neural processor units; optimized for low-power edge AI.
Maxim Integrated Products [98]: A hardware-based CNN accelerator that enables battery-powered applications to execute AI inferences; combined with a large on-chip system memory.
DSP Group [99]: A standalone hardware engine designed to accelerate the execution of NN inferences; optimized for efficiency to ensure ultra-low power consumption for small to medium size NNs.
Synaptics [100]: Proprietary power- and energy-optimized neural network and domain-specific processing cores, with significant on-chip memory.
Synsense [101]: A scalable, fully configurable, digital event-driven neuromorphic processor with 1 M ReLU spiking neurons per chip for implementing Spiking Convolutional Neural Networks.
Renesas [102]: Dynamically Reconfigurable Processor (DRP) technology, special-purpose hardware that accelerates image processing algorithms by as much as 10× or more.

Recently, resistive-memory-based Bayesian neural networks have been found applicable to edge inference hardware. In [103], Dalgaty et al. propose to use trained Bayesian models with probabilistic programming methods to overcome the inherently random process. Hence, the intrinsic cycle-to-cycle and device-to-device conductance variations are limited.

Deployment of Optimized h/w Resources
Dedicated hardware solutions are the most performance-efficient, as they aim to achieve the highest performance at the lowest cost. However, accelerator-based devices can be very limited in terms of configurability, and thus unable to adapt the processing dataload to each particular network and its model optimizations. Reconfigurable computing is an alternative solution providing the required configurability together with high performance. The lower cost and high energy efficiency are the result of dedicated hardware optimization for inference.
The performance efficiency also depends on the network [104]; hence, several methods have been developed to optimize the deployment and, accordingly, the hardware resources [105].
The targets for the deployment of a network at the edge, in embedded systems, are configurability, small size and the lowest power. Moreover, it is critical to control the size of the memory that the AI/ML design consumes. The two techniques to achieve this are pruning [106] and quantization [107]. The first sets all near-zero weights to zero, which effectively reduces the size of the weights table and hence allows for compression. The second replaces the 32-bit floating-point arithmetic used at training with lower-precision representations that execute on the hardware, such as 8-bit fixed point. These are also referred to as hardware-aware optimizations.
The main idea of network pruning is to delete unimportant parameters, since not all parameters are important in highly precise deep neural networks. Consequently, connections with small weights are removed, which converts a dense network into a sparse one [108]. In its simplest form, the weights of a trained network that are lower than a threshold are removed and the network is then retrained. Selecting the appropriate threshold to prune the network is an iterative process involving training, which may consume significant time due to the repeated steps. Manessi et al. in [109] propose differentiability-based pruning with respect to learning thresholds, thus allowing pruning during the backpropagation phase. Han et al. in [110] propose pruning that is lossless in accuracy for state-of-the-art CNN models, as the method retrains the sparse network. Yu et al. in [111] propose Scalpel, which consists of SIMD-aware weight pruning and node pruning and hence customizes pruning for different hardware platforms based on their parallelism. Further compression is achieved when pruning is combined with quantization, which is analyzed next.
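The threshold method described above can be sketched in a few lines of NumPy: weights below a magnitude threshold are zeroed, converting the dense matrix into a sparse one. The threshold and layer size are illustrative assumptions, and the retraining step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=(64, 64)).astype(np.float32)

threshold = 0.25
mask = np.abs(weights) >= threshold   # keep only the important connections
pruned = weights * mask               # near-zero weights become exactly zero

sparsity = 1.0 - mask.mean()
print(f"sparsity after pruning: {sparsity:.1%}")
# Retraining with this mask held fixed recovers most of the lost accuracy.
```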
Quantization takes an existing neural network and compresses its parameters by changing them from floating-point numbers to low-bit-width numbers, thus avoiding costly floating-point multiplications [112,113]. Analysis has shown that data represented in fixed point preserve an accuracy close to that obtained with data represented in floating point [114]. A deeper optimization allows different fixed-point scales and bit widths for different layers [115,116]. Data quantization can be taken to the limit, with some or all data represented in binary. Although there is a trade-off between network size and precision, quantization gives a significant improvement, to the point that results can be almost identical to those of the full-precision network. A quantization of 8 bits seems to be a good choice, as it is a practical compromise between memory size and retraining accuracy. More aggressive quantization using 4-, 2- or even 1-bit resolution is possible, but it requires more specialized distribution methods.
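A minimal sketch of one common scheme, symmetric per-tensor post-training quantization from FP32 to INT8, is given below; the scale derivation shown is one of several possibilities and is not tied to a specific reference above.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric quantization: map the largest magnitude to +/-127."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(0, 0.1, size=1000).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"max round-trip error: {err:.5f}")   # small relative to the weights
```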
Quantization and pruning can be considered individually, as well as jointly, by pruning the unimportant connections and then quantizing the network. After pruning, the dense computation becomes sparse, and after quantization the arithmetic alignment must be preserved. Sparsity can reduce the efficiency of the hardware; hence, the precision and efficiency trade-off remains until the end of the AI/ML design.
Beyond pruning and quantization, model compression to accelerate the hardware compute capabilities at the lowest possible cost has been explored with significant results. Albericio et al. in [117] propose a method for zero-skipping, based on the observation that, in practice, many of the neuron values turn out to be zero; thus, the corresponding multiplications and additions do not contribute to the final result and can be avoided. Aimar et al. in [118] have developed NullHop, a CNN accelerator that exploits activation sparsity through its ability to skip zeroes, and achieves further compression through a sparsity map and a nonzero value list. Dynamic sparsity of activations is exploited by Han et al. in [119] with the EIE accelerator design. Zhang et al. propose Cambricon-X in [120], a sparse accelerator that exploits the sparsity and irregularity of NN models for increased efficiency, with an architecture consisting of multiple Processing Elements (PEs). Last but not least, Yao et al. in [121] have proposed DeepIoT, which compresses commonly used deep neural network structures for sensing applications by deciding the minimum number of elements in each layer.
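The zero-skipping idea can be sketched as follows, with a sparsity map and a nonzero value list loosely mirroring, in simplified one-dimensional form, the compression scheme described for NullHop; the data values are illustrative.

```python
import numpy as np

activations = np.array([0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.5, 0.0])
weights     = np.array([0.3, 0.1, 0.7, 0.2, 0.4, 0.9, 0.6, 0.8])

# Compressed representation: a binary sparsity map plus the nonzero values.
sparsity_map = activations != 0.0
nonzero_vals = activations[sparsity_map]

# Zero-skipping MAC: only positions flagged in the map are visited, so the
# multiplications by zero never execute.
acc = 0.0
for val, idx in zip(nonzero_vals, np.flatnonzero(sparsity_map)):
    acc += val * weights[idx]

print(acc, float(activations @ weights))   # identical results, fewer MACs
```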
More recently, the inference of State-Space Models (SSM) has appeared as an optimal approach to the deployment of models that are dynamic, large, complex or uncertain due to the lossy data acquisition from sensors or microphones that is common in edge AI systems [122]. To handle the increased computational cost and large parameter space, Imani in [123] proposes a two-stage Bayesian optimization framework that consists of dimensionality reduction and sample selection processes. An eigenvalue-decomposition-based technique is developed for mapping the original large parameter space to a reduced space in which the inference function has the highest variability.
Given this variety of optimizations for performance, the next requirement is to have adequate metrics for the evaluation and selection of the most optimal solution. In [124,125], the Eyexam method is proposed as a systematic way to capture the performance limitations of AI/ML designs. It is based on the roofline model and extends it with seven sequential steps that conclude in a fine-grained profile of the AI/ML design. At step 1, the upper bound of the performance is defined by the finite size of the workload. The order of operations determines step 2 and is strongly related to the dataflow. The minimum number of processing elements at which the performance restrictions appear is evaluated at step 3. The physical dimensions and the storage capacity are examined in steps 4 and 5. The data bandwidth required for each computation to exchange data with the memory, and the effects of data variation, are visited in steps 6 and 7, respectively.
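Underlying Eyexam is the roofline bound, which can be sketched as follows: the attainable throughput is the minimum of the compute peak and the memory bandwidth multiplied by the arithmetic intensity. The numbers below are illustrative assumptions, not measurements from [124,125].

```python
def roofline_gops(peak_gops: float, bandwidth_gbs: float,
                  ops_per_byte: float) -> float:
    """Attainable performance under the roofline model."""
    return min(peak_gops, bandwidth_gbs * ops_per_byte)

peak = 512.0   # accelerator compute peak, GOP/s (illustrative)
bw = 25.6      # off-chip memory bandwidth, GB/s (illustrative)

for intensity in (2.0, 10.0, 40.0):   # operations per byte moved
    bound = roofline_gops(peak, bw, intensity)
    regime = "memory-bound" if bound < peak else "compute-bound"
    print(f"intensity {intensity:5.1f} -> {bound:6.1f} GOP/s ({regime})")
```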
In a comprehensive view of the complete flow for efficient inference, the accuracy of the AI/ML design is the metric that must be met, interpreted in the context of the difficulty of resolving the given problem or task with the provided dataload. Throughput and latency give useful input, indicating the amount of data that can be processed and the amount of time consumed to finish a task. Energy efficiency, power consumption and cost are the metrics that guarantee the viability of the design, while flexibility and scalability matter for designs that must adapt to a range of workloads.

Conclusions
Due to the complexity of AI/ML designs and the limitations of the hardware in achieving lower power, lower latency and better accuracy in edge inference, the complete end-to-end AI/ML design flow has been presented in this study. It follows a software/hardware synergy for training and inference, with the aim of highlighting the design challenges and providing guidance through the available solutions. The role of the dataload as the starting point for resolving a problem is a large area that can affect the size and the performance of the hardware execution. The various choices for the hardware implementation have been explored and their individual advantages highlighted, thus allowing a proper and optimal choice. Last but not least, the major optimization techniques have been presented, as energy efficiency accompanies the design cost without compromising quality in terms of accuracy. It is clear that edge inference brings challenges and new design requirements for network models, network optimizations and new computing devices. The practices towards deployment of edge inference should be defined at the beginning and iterated inside the large circle of Figure 1 until the target is met.
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.