Revisiting the High-Performance Reconfigurable Computing for Future Datacenters

Modern datacenters are reinforcing their computational power and energy efficiency by integrating field programmable gate arrays (FPGAs). The sustainability of this large-scale integration depends on enabling multi-tenant FPGAs. This requirement amplifies the importance of a communication architecture and a virtualization method with the features needed to meet this high-end objective. Consequently, in the last decade, academia and industry have proposed several virtualization techniques and hardware architectures for addressing resource management, scheduling, adoptability, programmability, time-to-market, security, and, mainly, multi-tenancy. This paper provides an extensive survey covering three important aspects: a discussion of non-standard terms used in the existing literature, network-on-chip evaluation choices as a means to explore the communication architecture, and virtualization methods under the latest classification. The purpose is to emphasize the importance of choosing an appropriate communication architecture, virtualization technique, and standard language to evolve multi-tenant FPGAs in datacenters. None of the previous surveys covered these aspects in one place. Open problems are also indicated for the scientific community.


Introduction
Today, datacenters are equipped with heterogeneous computing resources that range from Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Networks on Chip (NoCs) to Field Programmable Gate Arrays (FPGAs), each suited to a certain type of operation, as concluded by Escobar et al. in [1]. They all provide scalability and parallelism and hence open new fronts for the existing body of knowledge in algorithmic optimization, computer architecture, microarchitecture, and platform-based design methods [2]. FPGAs are considered a competitive computational resource for two reasons: added performance and lower power consumption. The cost of electrical power in datacenters is far-reaching, as it contributes roughly half of the lifetime cost, as concluded in [3]. This factor alone motivates companies to deploy FPGAs in datacenters and hence urges the scientific community to exploit High-Performance Reconfigurable Computing (HRC).
Industrial and academic works have both incorporated FPGAs to accelerate large-scale datacenter services; Microsoft's Catapult is one such example [4]. Putnam et al. chose FPGAs over GPUs on the question of power demand. The flagship project accelerated the Bing search engine by 95% compared to a software-only solution, at the cost of 10% additional power.
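As a rough illustration of the trade-off reported above, the two figures can be combined into a single efficiency ratio. The sketch below assumes that the 95% figure is a throughput gain and that the 10% figure applies to total power; the function name is ours and is not taken from [4].

```python
def perf_per_watt_gain(speedup_fraction: float, extra_power_fraction: float) -> float:
    """Throughput per watt relative to the baseline (1.0 means no change).

    Both arguments are fractions, e.g. 0.95 for "+95%".
    """
    return (1.0 + speedup_fraction) / (1.0 + extra_power_fraction)


# Assumed reading of the Catapult figures: +95% throughput, +10% power.
gain = perf_per_watt_gain(0.95, 0.10)
print(f"throughput per watt: {gain:.2f}x the software-only baseline")
```

Under these assumptions, throughput per watt improves by a factor of about 1.77 over the software-only baseline.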
The deployment of FPGAs in datacenters will be neither sustainable nor economical without realizing the multi-tenancy feature of virtualization across multiple FPGAs. To achieve this ambitious goal, the scientific community needs to master two crafts: an interconnect solution, preferably a Network on Chip (NoC), as the communication architecture, and an improved virtualization method with all the features of an operating system. Accumulating the state of the art in a survey can foster development in this area and direct researchers toward more focused and challenging problems. Two excellent surveys exist, [5] from 2004 and [6] from 2018. The former categorized FPGA virtualization as temporal partitioning, virtualized execution, and virtual machines, while the latter, fourteen years later, classified it by abstraction level to accommodate future changes. Neither, however, fully explores the communication architecture or interconnect possibilities. To address this gap, an improved survey on FPGA virtualization is presented that also covers network-on-chip evaluation choices as a means to explore the communication architecture, together with a commentary on the nomenclature of the existing body of knowledge. We revisited the network-on-chip evaluation platforms in order to highlight their importance compared to bus-based architectures. We stretched our review from the acceleration of a standalone FPGA to FPGAs connected as a computational resource in a heterogeneous environment. We attempted to create a synergy by combining three domains to help designers choose the right communication architecture for the right virtualization technique and, finally, share the work in the right language; only then can multi-tenant FPGAs in datacenters be realized.
The remainder of this review paper is organized as follows. Section 2 includes the commentary on nomenclature and recommendations for the scientific community. Section 3 discusses the available NoC evaluation tools for finding a precise communication architecture in relatively less time. Section 4 puts the virtualization works in the limelight, with a focus on architectures that are scalable and support multiple applications or multiple FPGAs. Section 5 indicates the trends and open problems and presents a closing discussion about the area.

Revisiting the Nomenclature
The applications of FPGAs as a computing resource are diverse and include data analytics, financial computing, and cloud computing. This broad range of applications in different areas requires efficient application and resource management, which lays the foundation for the need to virtualize the FPGA as a potential resource. The nomenclature varies widely due to the different backgrounds of the researchers contributing to this area. There are many examples in the literature where similar concepts or architectures are described using different names or terms. There is also an abundance of jargon and acronyms, which confuse researchers rather than enhance their understanding. Table 1 identifies and lists non-standard terms in the literature from the last decade. The area has stagnated for lack of a standard nomenclature. We recommend that the scientific community use a unified nomenclature to improve the clarity and precision of communication and so advance the knowledge base. We also recommend that this area be referred to as High-Performance Reconfigurable Computing (HRC) in the literature. Moreover, it has been observed that the language of computer science conveys more, as virtualization in FPGAs is comparable to an operating system in CPUs.
We urge the scientific community to come together to develop nomenclature, as it will improve the communication among researchers. It will ease the classification of works for entry-level researchers and help them to focus on complex research problems.
We acknowledge some quality examples: the suitability of FPGAs has been discussed in depth in the context of high-performance computing and heterogeneous computing resources in [1], a new classification of FPGA virtualization has been presented in [5], and the state of the art has been explored in the context of cloud computing, as defined by the National Institute of Standards and Technology, in [18]. These authors have used the standard language of computer science and written in a way that adds to the understanding of readers.

Revisiting the Network on Chip Evaluation Tools
Data transfers in most high-performance architectures are limited by the memory hierarchy and the communication architecture, as summarized in [19,20]. Exploiting the communication architecture suggests the use of a NoC, an effective replacement for buses or dedicated links in a system with a large number of processing cores [21,22]. A NoC has several tunable parameters, such as network architecture, algorithm, network topology, and flow control. Today, no System on Chip (SoC) is considered complete without a NoC, owing to its promise of high communication bandwidth with low latency compared to alternative communication architectures.
Given the complexity of NoCs, researchers rely heavily on automated evaluation tools, in which performance and power can be evaluated early in the design. Figure 1 describes a typical cycle of NoC evaluation, with the FPGA connected to a Central Processing Unit (CPU). Traffic scenarios are generated by a traffic generator and sent to the NoC that resides in the FPGA, and the evaluation results are received through traffic receptors. Tools for FPGA-based NoC prototyping are architecturally diverse. De Lima et al. [23] identified an architectural model comprising three layers: network, traffic, and management. There are four types of network: direct mapping on a single FPGA, direct mapping on multiple FPGAs, fast prototyping, and virtualization. The choice of network affects accuracy and resource utilization. Traffic on the network can be generated in two ways: synthetic and application-specific. Synthetic traffic is a kind of load testing to evaluate overall performance, but it fails to forecast performance under real traffic flow. Application-specific traffic, on the other hand, is based on the behavior of real traffic flow, which is difficult to acquire but gives more accurate results.
These patterns can be acquired through traces, statistical methods, or executing application cores. As traces comprise millions of packets, their size becomes a limiting factor. Running application cores to generate traffic is also resource-expensive. Table 2 lists some FPGA-based NoC evaluation tools, describing every architecture by network type, traffic type, number of routers, target board, and execution frequency, while hiding the complexity of the NoC designs. The number of routers in a NoC depends on the network type; architectures with relatively more routers are based on the second group of network types, fast prototyping and virtualization. We used the direct mapping network type in our previous works due to its relatively high execution frequency [24,25]. These evaluation platforms help designers reach a design-specific communication architecture that meets most of the requirement specifications for a certain application. They take comparatively more time to synthesize a change, whereas a simulator can accommodate the same change in much less time. Designers offer dynamic reconfiguration as a remedy for this limitation, but simulators are still the first choice of many entry-level researchers. However, the choice of NoC to realize future datacenters with multi-tenant multi-FPGAs is yet to be explored. The linking of several computational nodes becomes complicated and affects the performance of the overall system. Although a NoC is not the only choice for communication within an FPGA or among multiple FPGAs, it offers a competitive and promising solution. Other solutions include a traditional bus, a bus combined with a soft shell, and different types of soft and hard NoCs. Many comparative studies have evaluated these choices based on parameters such as usable bandwidth, area consumption, latency, wire requirement, and routing congestion. The way a NoC is generated also affects performance, so designers must be careful when choosing a NoC or an alternative for their design.
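To make the notion of synthetic traffic concrete, the following sketch generates two common synthetic patterns, uniform random and transpose, on a square 2D mesh and reports the average hop count under dimension-ordered (XY) routing. The mesh topology, the routing choice, and all function names are illustrative assumptions; they do not correspond to any tool listed in Table 2.

```python
import random


def xy_hops(src, dst):
    """Hop count between two mesh nodes under dimension-ordered (XY) routing."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def uniform_random(src, size, rng):
    """Every node other than the source is an equally likely destination."""
    while True:
        dst = (rng.randrange(size), rng.randrange(size))
        if dst != src:
            return dst


def transpose(src, size, rng):
    """Node (x, y) always sends to node (y, x)."""
    return (src[1], src[0])


def average_hops(pattern, size, packets_per_node, seed=0):
    """Inject packets from every node of a size x size mesh and average the hops."""
    if size < 2:
        raise ValueError("the mesh needs at least 2x2 nodes")
    rng = random.Random(seed)
    total = count = 0
    for x in range(size):
        for y in range(size):
            for _ in range(packets_per_node):
                dst = pattern((x, y), size, rng)
                if dst == (x, y):
                    continue  # transpose maps diagonal nodes to themselves
                total += xy_hops((x, y), dst)
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    for name, pattern in (("uniform random", uniform_random), ("transpose", transpose)):
        print(name, round(average_hops(pattern, 4, 100), 2))
```

Such closed-form patterns are cheap to generate, which is why they suit load testing, but they carry none of the burstiness or locality of an application trace.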

Revisiting the FPGA Virtualization
In a cloud service provider's datacenter, resources are time-multiplexed, a model referred to as Infrastructure as a Service (IaaS). The sharing of resources is achieved through virtualization, an abstraction layer that hides the physical resources from users. Virtualization raises issues such as ease of use, privacy, and performance, yet IaaS provides individual users and small organizations with the economical choice of renting over spending on infrastructure. Besides academic examples, such as the SAVI testbed [39], industry offers plenty of solutions that are equally popular among designers. Amazon Web Services EC2 [40], IBM Zurich [41], and Intel are important competitors. Alveo on the Nimbix Cloud [42] is suitable for designers working with Xilinx tools. Maxeler Technologies, however, offers specific solutions, such as an algorithmic contribution for memory mapping [43] and an area optimization technique [44].
Virtualization plays a role comparable to that of an operating system in a computer, but the term is used with different meanings in this area, due to the non-uniform nomenclature discussed earlier. Yet the universal concept of an abstraction layer remains unchanged: a layer that hides the underlying complexity of the computing machine from the user, where the computing machine is not a traditional one but an FPGA. Many virtualization architectures have been proposed according to the requirements of diverse applications. In 2004, a survey categorized the virtualization architectures into three broad categories: temporal partitioning, virtualized execution, and virtual machines [5]. Since then, no serious effort at classifying virtualization was recorded until Vaishnav et al. [6] in 2018 classified the virtualization architectures based on abstraction levels. This much-needed classification has been adopted as is to discuss the works in this survey. We restate it with some representative works in Table 3, and the works are discussed under the same classification.
Table 3. Classification of FPGA Virtualization adopted from [6].
Although virtualization has many features, such as management, scheduling, adoptability, segregation, scalability, performance overhead, availability, programmability, time-to-market, and security, the most important feature within the scope of this research is multi-tenancy, because it is essential for a sustainable and economically viable deployment in datacenters. An FPGA has two types of fabric: reconfigurable and non-reconfigurable. Virtualization of the non-reconfigurable fabric is the same as for a CPU, but there are several variations when it comes to virtualization of the reconfigurable fabric.

Overlays
Overlay architectures are diverse, depending on the application and its requirements. Overlays provide another, higher abstraction layer on top of the lower-level fabric of the FPGA, as depicted in Figure 2. The primary objective is to enhance the ease of programming for the software programmer. The reduced compilation time is an added advantage, given that the computer-aided design step that generates an accelerator is left out of the compilation process.
With respect to the ability of functional units, overlays are categorized into spatially-configured and time-multiplexed architectures. Li et al. compiled a comprehensive account of time-multiplexed overlays in a recent survey [45]. However, most of the literature discusses overlays with respect to their implementation architectures, which divides them into processor-based architectures and coarse-grained reconfigurable architectures (CGRAs). A complete review of CGRAs can be found in Jain's doctoral thesis [46]. Processor-based overlays come as a variety of soft processors, either single-issue, multi-issue, or multithreaded. They all add value to programmability, but their limited throughput is not suitable for very high-speed applications. Processor-based overlays also come as parallel processors, in the form of a multithreaded, VLIW, soft vector processor, or soft GPU design. One soft-core processor is [47], and similar solutions [48,49] are available from industry. Other forms include soft vector processors [50][51][52][53][54]. CGRAs offer higher performance and scalability with lower power consumption, the very characteristics FPGAs are used for. CGRAs exist in the form of processing arrays or coarse- to medium-grained processing elements, where operations are performed at the processing element level. Examples of connected arrays of processing elements with programmable interconnects are [55][56][57][58][59]. Some CGRAs are kept dynamic by programming the processing elements and interconnect logic [60][61][62][63], while other architectures are kept static in a spatial configuration, as in [56,57]. Frequently appearing interconnect topologies in CGRAs are the nearest neighbor [55,56] and the island model [57][58][59]. NoCs are also found abundantly in CGRAs; some examples are [64][65][66][67].
NoC-based architectures offer flexibility at a higher implementation cost, but some works, such as the Hoplite soft NoC [67] and the hard NoC [22], offer resource-efficient fast interconnects. An effort to reduce the cost by mapping the overlay look-up tables and multiplexers to the FPGA fabric has also been made in [68]. Table 4 summarizes time-multiplexed CGRA overlays. Overlays are chosen only to meet the requirement of rapid functionality change, where partial reconfiguration fails to cope with the speed of change, due to their sizable cost of implementation. Many solution providers have also commercialized this idea, like VectorBlox [50].
Figure 3 depicts the generic architecture, where the many virtual channels, represented with dashed lines, do not equate to the available physical channels. The middle layer in I/O virtualization plays multiple roles, such as enforcing security mechanisms, monitoring resource utilization, ensuring Quality of Service (QoS) in datacenters, improving access time, and installing memory buffers. There are two possibilities for the design of the control logic: software or hardware. The software approach offers high flexibility and space efficiency [75][76][77]. The hardware module, on the other hand, offers improved performance at the cost of consuming some reconfigurable resources [78,79].
Designers, as in [80], employed I/O virtualization to accelerate storage by up to 6x, which is beneficial for data-intensive applications. Microsoft used the same approach to reduce network traffic by handing requests directly to the FPGA [4]. These are examples of the diverse uses of the I/O virtualization middle layer.
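The relation between many virtual channels and fewer physical channels can be sketched as a small software control layer. The example below is a minimal illustration under our own assumptions: requests are buffered per virtual channel and served in round-robin order, at most one request per physical channel per cycle. The class and method names are hypothetical and are not taken from any of the cited works.

```python
from collections import deque


class IOVirtualizationLayer:
    """Multiplexes several virtual channels onto fewer physical channels."""

    def __init__(self, num_virtual, num_physical):
        # One buffer per virtual channel, as installed by the middle layer.
        self.queues = [deque() for _ in range(num_virtual)]
        self.num_physical = num_physical
        self.next_vc = 0  # round-robin pointer, gives every tenant a turn

    def submit(self, vc, request):
        """Queue a request on virtual channel `vc`."""
        self.queues[vc].append(request)

    def step(self):
        """Serve up to `num_physical` requests in one cycle.

        Returns a list of (virtual_channel, request) pairs.
        """
        served = []
        n = len(self.queues)
        start = self.next_vc
        for offset in range(n):
            if len(served) == self.num_physical:
                break
            vc = (start + offset) % n
            if self.queues[vc]:
                served.append((vc, self.queues[vc].popleft()))
                self.next_vc = (vc + 1) % n
        return served


if __name__ == "__main__":
    layer = IOVirtualizationLayer(num_virtual=4, num_physical=2)
    for vc in range(4):
        layer.submit(vc, f"req-{vc}")
    print(layer.step())  # two requests leave per cycle
    print(layer.step())
```

The same arbitration point is where monitoring, access control, or QoS weighting would be added; a hardware implementation would trade this flexibility for speed, as noted above.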

Virtual Machine Monitors (VMMs)
VMMs are a relatively trouble-free method, as they take many challenges away from the FPGA. Treating the FPGA as a peripheral attached to the CPU provides software programmers with multiple benefits, such as a familiar interface, libraries, and programming. Integrating an accelerator with a Virtual Machine Monitor (VMM) has almost zero performance overhead, as the experimental results in [81] showed. However, there are other approaches that further enhance the capability of the VMM to control many partially reconfigurable regions, such as using a micro-kernel [82], using a micro-kernel to make a portable accelerator [83], and using OpenStack [84]. The idea of disassociating the static and dynamic fragments pays off at many levels.
Through resource allocation, VMMs contribute to many virtualization objectives, such as multi-tenancy, management and scheduling, segregation, security, and availability. As the FPGA is connected to the CPU using standard frameworks, multiple FPGAs can be added in the same arrangement.

Shells
Shells are referred to as the static part of the system, which fundamentally provides the functionality of an operating system (OS); hence, various other names exist in the literature, such as FPGA operating system or hypervisor. A shell manages resources, the I/O mechanism, required drivers, and other essentials to configure or reconfigure the desired application. Figure 4 lays out the important infrastructures that have been proposed, developed, and tested. These architectures exhibit a certain level of one or more virtualization characteristics and significant performance.
Multiple partial reconfiguration regions, symmetric and asymmetric, are used to run multiple applications on a single FPGA. Symmetric or tiled regions are uniform in size, as in [85]. In this way, resource allocation becomes flexible, as a module can reside in one or more neighboring regions, which further minimizes internal fragmentation, as in [78]. Asymmetric regions, on the other hand, support modules of different sizes and avoid reconfiguring the whole FPGA [86]. Connectivity is crucial for every execution model; it can be host connectivity, independent connectivity, or a hybrid of the two. In architectures based on host connectivity only, the CPU controls the resources of the FPGA, and most of the reconfigurable resources and regions are reserved for the applications [76]. Multi-Processor System on Chip (MPSoC) products from the FPGA vendors make the implementation easier; however, employing such products wastes resources. Solutions like [41,87] offer sharing among multiple CPUs, whereas [4,41,75,88] offer sharing among standalone FPGAs to avoid underutilization of the FPGA resources. The required support for the network layer, however, consumes a non-trivial share of FPGA resources. Asiatici et al. [89] developed a lightweight version featuring a high-end application program interface (API), with a simpler execution model and shared memory. They demonstrated their concept by measuring a marginal performance overhead. The hybrid approach offers more control-intensive connectivity by offloading to the CPU, but additional hardware is required for I/O acceleration, as in [4]. Another type of shell, called a container [90], is described as one without a VMM: a process-level virtualization of the application. This design provides features such as segregation, management and scheduling, and resolution of driver dependencies.
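The effect of tiled regions on internal fragmentation can be illustrated with a small allocator. The sketch below assumes that a module's size is expressed in look-up tables (LUTs), that all tiles have the same LUT capacity, and that a module must occupy neighboring tiles in a one-dimensional row; these simplifications and the function names are ours and do not describe the allocators of [78] or [85].

```python
import math


def tiles_needed(module_luts, tile_luts):
    """Number of uniform tiles a module occupies."""
    return math.ceil(module_luts / tile_luts)


def allocate(free, count):
    """First-fit search for `count` neighboring free tiles.

    `free` is a list of booleans, one per tile, updated in place.
    Returns the index of the first allocated tile, or None if nothing fits.
    """
    run = 0
    for i, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == count:
            start = i - count + 1
            for j in range(start, i + 1):
                free[j] = False
            return start
    return None


def internal_fragmentation(module_luts, tile_luts):
    """Fraction of the allocated area that the module leaves unused."""
    allocated = tiles_needed(module_luts, tile_luts) * tile_luts
    return (allocated - module_luts) / allocated


if __name__ == "__main__":
    # A 2500-LUT module in one large 4000-LUT region versus 1000-LUT tiles.
    print(internal_fragmentation(2500, 4000))  # 0.375
    print(internal_fragmentation(2500, 1000))  # one sixth, about 0.167
```

Smaller tiles reduce the unused area per module, at the price of more region boundaries for the shell to manage.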
A considerable number of architectures have been tested in the last decade. The works that influenced research in this area are summarized in Table 5, along with their hardware and virtualization characteristics. Partial reconfiguration (PR) is used to reconfigure a part of the FPGA dynamically; many architectures run more than one application using this function provided by the FPGA vendors. Multi-tenancy is defined as the capacity to serve multiple users using the same FPGA. Scalability is a qualitative measure of the potential to scale up to multiple FPGAs or multiple users with low overhead and congestion. Adoptability is the acceptance of a wide range of workloads and applications, also referred to as flexibility in previous works. Time-to-market is the development time, a direct function of the complexity of deployment on the FPGA from design specifications. All these features of the shells are summarized in tabular form.
Industry also offers an API-based solution: Intel's Open Programmable Acceleration Engine (OPAE) [96] is a collection of drivers, libraries, and user and programmer tools to enumerate, access, manipulate, and reconfigure programmable accelerators. Figure 5 provides detailed insight. The designer must recognize that using shells can cause performance overheads due to layout limitations that are enforced by the partially reconfigurable slots. Furthermore, limiting the logic placement to a specific region on the chip can lead to longer wires that can result in slower modules [97]. Finding the optimal number of partially reconfigurable regions is a compelling open problem to explore, given the complexity of the shell and its impact on the overall performance.

Scheduling
Scheduling is the key to multi-tenancy, but the conventional techniques (preemptive, non-preemptive, and cooperative) cannot be used unchanged for FPGA accelerators, as saving and restoring the state of the system is not trivial. The state data may be distributed across many different resources on the FPGA fabric, and a single operation to save or restore the state can add microseconds to milliseconds to the latency [98]. However, the requirement of a mandatory dedicated hardware module can be avoided, as in [99], where such jobs are either blocked or sent back to the CPU to perform.
The concept of a scan chain that provides the state data through an external interface, in order to make preemptive scheduling cost-effective and fast, has been implemented in [100] using a High-Level Synthesis (HLS) extension, but only for a subset of registers. Non-preemptive scheduling has a low-cost implementation and a simpler design. Cooperative scheduling, on the other hand, only offers context switching at certain checkpoints at run time, with the least overhead [82].
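The cost of state save and restore can be seen in a toy comparison of non-preemptive and preemptive scheduling. The sketch below is a simplified model under our own assumptions: all jobs arrive at time zero, times are in milliseconds, every preemption pays one fixed switch cost, and a job is not preempted when no other job is waiting. It does not reproduce the scheduler of any cited work.

```python
from collections import deque


def run_to_completion(jobs):
    """Non-preemptive: jobs run back to back; returns each job's finish time."""
    t, done = 0.0, []
    for length in jobs:
        t += length
        done.append(t)
    return done


def round_robin(jobs, quantum, switch_cost):
    """Preemptive time slicing; each preemption adds `switch_cost`
    (state save plus restore). Returns each job's finish time."""
    remaining = list(jobs)
    done = [0.0] * len(jobs)
    queue = deque(range(len(jobs)))
    t = 0.0
    while queue:
        i = queue.popleft()
        # With nobody else waiting there is no reason to preempt.
        run = remaining[i] if not queue else min(quantum, remaining[i])
        t += run
        remaining[i] -= run
        if remaining[i] > 0:
            t += switch_cost  # accelerator state is saved and later restored
            queue.append(i)
        else:
            done[i] = t
    return done


if __name__ == "__main__":
    jobs = [10, 2]  # a long job submitted just before a short one
    print(run_to_completion(jobs))                       # [10.0, 12.0]
    print(round_robin(jobs, quantum=2, switch_cost=1))   # [13.0, 5.0]
```

In this example preemption lets the short job finish at 5 ms instead of 12 ms, while the long job is delayed from 10 ms to 13 ms; the larger the switch cost, the less attractive the trade becomes.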
With the mature HLS methodology and the availability of MPSoC platforms, hardware threads have been proposed, as in ReconOS [101] and Hthreads [102], with the precondition of a tightly coupled CPU-FPGA, to bring scheduling closer to standard hardware description languages (HDLs).
Largely, scheduling techniques fall into the non-preemptive category, which is fundamentally a time-domain optimization. However, a dynamic approach has recently been introduced in [89], which takes advantage of empty slots and keeps the utilization balanced at run time. This dynamic scheduling technique enables multi-tenancy like no other, as it allows resource usage to be increased or decreased according to the workload requirement.
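The idea of growing and shrinking a tenant's share of reconfigurable slots with its workload can be sketched as follows. This is a hypothetical greedy policy written for illustration only, not the algorithm of [89]: each slot goes to the tenant with the most queued tasks per slot it would then hold, and slots stay empty once every queued task has one.

```python
def rebalance(num_slots, demand):
    """Distribute reconfigurable slots among tenants according to pending work.

    demand: dict mapping a tenant name to its number of queued tasks.
    Returns a dict mapping each tenant to its number of slots.
    """
    alloc = {tenant: 0 for tenant in demand}
    for _ in range(num_slots):
        # Only tenants with more queued tasks than slots can use another slot.
        candidates = [t for t in demand if demand[t] > alloc[t]]
        if not candidates:
            break  # every queued task already has a slot
        best = max(candidates, key=lambda t: demand[t] / (alloc[t] + 1))
        alloc[best] += 1
    return alloc


if __name__ == "__main__":
    print(rebalance(4, {"a": 6, "b": 2, "c": 0}))  # {'a': 3, 'b': 1, 'c': 0}
    # The workload shifts and the same four slots are redistributed.
    print(rebalance(4, {"a": 1, "b": 2, "c": 5}))  # {'a': 0, 'b': 1, 'c': 3}
```

Calling the function again whenever the queues change is what makes the allocation dynamic; a real shell would additionally weigh the reconfiguration time of each slot.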
Some scheduling approaches are good for certain scenarios; those in [103,104], for example, serve multiple users at the same time without going through tedious partial reconfiguration, given that the accelerator needs of the users are the same. Another work, VineTalk [105], enables FPGA sharing for a server or virtual machine in a datacenter, where the user is free to choose through an API [106] between a GPU and an FPGA accelerator, according to the needs of the algorithm. In the heterogeneous computing environment, OpenCL [107] is popular in practice, and proposals like SparkCL [108] have solidified it further by bridging OpenCL and Java. OpenCPI [109] is an open-source alternative to OpenCL. Important methodologies to mention are the Intel HLS Compiler [110], Vivado High-Level Synthesis [111], and OpenSPL [112], as the programmability wall of FPGAs remains a significant problem to this day. With these platforms and practices combined, the idea of future datacenters can be realized, as pictured in the introduction. However, automating the selection of the appropriate accelerator in a heterogeneous computing environment is yet to be explored by the community.

Multi-Node Level Virtualization
The primary job is to distribute an acceleration job among multiple FPGAs while abstracting the complex details from the user. The architecture of virtualization largely depends on how the multiple FPGAs are connected; three ways are depicted in Figure 6. This is not a standard, but the works so far have exhibited these formations. In the direct model, FPGAs communicate directly with other FPGAs, where a link represents a physical connection or a virtualized I/O interface. In the slave model, FPGAs are connected to CPUs through PCIe or other links and the CPUs are connected to the network, so if an FPGA wants to send data to another FPGA, the data goes through the CPUs and the network. In the standalone model, FPGAs and CPUs are accessible through the network as standalone nodes. The designer can also combine them to form a hybrid model to meet certain objectives. Before discussing the subclasses, we discuss the salient features of some representative multi-FPGA works. Byma et al. [75] focused on minimum virtualization overhead in a medium-scale datacenter providing commercial cloud services. They achieved significant performance compared to regular virtual machines, along with reduced design iteration time. Kondel et al. [78] focused on maximizing the utilization of high-end FPGAs through paravirtualization and provided homogeneous virtualized FPGA regions for the clients. This flexible multi-tenancy approach enables the individual resources to adapt to the user requirements. Zhang et al. [79] developed an operating system to share a single FPGA chip among different users at run time with an improved resource manager. However, these works have not discussed FPGA-to-FPGA or FPGA-to-CPU connectivity in detail, and the interfaces are not clearly described, except for an indication of PCIe. Weerasinghe et al. [41] presented a different approach: the FPGA as a standalone resource connected to the datacenter network. This decoupled approach can utilize the FPGA as an equal processing resource, especially in hyperscale datacenters. They laid out a detailed system architecture with an outlook analysis of resource estimation and scaling perspectives.
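The three connection models described in this section differ in the path that data takes from one FPGA to another. The sketch below simply enumerates that path for each model; the node names are placeholders, the network is modeled as a single intermediate node, and the function names are our own.

```python
def transfer_path(model, src="FPGA0", dst="FPGA1"):
    """Sequence of nodes that data visits when `src` sends to `dst`."""
    if model == "direct":
        # FPGAs are linked to each other physically or via virtualized I/O.
        return [src, dst]
    if model == "slave":
        # Each FPGA hangs off a CPU (e.g. over PCIe); the CPUs own the network.
        return [src, "CPU0", "network", "CPU1", dst]
    if model == "standalone":
        # FPGAs are network nodes in their own right.
        return [src, "network", dst]
    raise ValueError(f"unknown connection model: {model}")


def link_count(model):
    """Number of links traversed for one FPGA-to-FPGA transfer."""
    return len(transfer_path(model)) - 1


if __name__ == "__main__":
    for model in ("direct", "slave", "standalone"):
        print(model, link_count(model), " -> ".join(transfer_path(model)))
```

The link count is only a first-order proxy for latency, since a PCIe link, a network hop, and a direct serial link have very different costs.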
A relatively recent trend is the emergence of tightly coupled CPU-FPGA platforms. Examples include the Heterogeneous Architecture Research Platform (HARP) by Intel and the POWER chip combined with the Coherent Accelerator Processor Interface (CAPI) by IBM. Academics responded to the call for proposals by Intel, and several works have been published in the last four years; some recent examples are [113,114].

Custom Clusters
Custom clusters are based on the systolic array model of parallel computing architecture, where every node acts as a data processing unit and processed data move from one node to another through first-in first-out (FIFO) buffers or network semantics. Some of these architectures [115][116][117][118] use the MaxRing Peer-to-Peer (P2P) connection, fast serial transceivers with FIFO buffers, and Peripheral Component Interconnect Express (PCIe) links for transmitting data across multiple nodes. Tailored designs allow direct communication among the nodes through explicit network connections. A cluster of 512 FPGAs [119] exploits the systolic array model to perform computations on multiple FPGAs.
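The movement of data from node to node through FIFO buffers can be illustrated with a linear pipeline. The sketch below assumes unbounded FIFOs, one item processed per node per cycle, and a one-dimensional chain of nodes; each Python function stands in for the computation of one FPGA, and the names are ours.

```python
from collections import deque


def run_pipeline(stages, inputs):
    """Push `inputs` through a chain of processing nodes joined by FIFOs.

    fifos[i] feeds stage i; the last FIFO collects the results.
    Returns (results, number_of_cycles).
    """
    fifos = [deque() for _ in range(len(stages) + 1)]
    fifos[0].extend(inputs)
    cycles = 0
    while any(fifos[:-1]):
        # Visit the last node first so an item advances one node per cycle.
        for i in reversed(range(len(stages))):
            if fifos[i]:
                fifos[i + 1].append(stages[i](fifos[i].popleft()))
        cycles += 1
    return list(fifos[-1]), cycles


if __name__ == "__main__":
    stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    results, cycles = run_pipeline(stages, [1, 2, 3])
    print(results, cycles)  # [1, 3, 5] 5
```

Three items pass through three nodes in five cycles rather than nine, because all nodes work on different items at the same time once the pipeline is full.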

Frameworks
Frameworks exploit the conventional client-server architecture, where only the computational part is assigned to one or more FPGAs, while the CPU server manages the rest, including configuration, application-related data, and scheduling. The central piece in this architecture is the data management model, and models for CPUs are equally extendible to FPGAs. For example, the idea of the MapReduce framework has been extended to FPGAs, where mapping and reduction operations are performed by FPGA accelerators [103,118,119] in a similar way as in a CPU client-server architecture. These frameworks have the added advantage of bridging the heterogeneity of datacenters; [120] is one such cluster, composed of FPGAs and GPUs and using MapReduce. Furthermore, Chen and colleagues [104] extended the Java Virtual Machine (JVM) framework using Apache Spark to accommodate FPGAs; this, however, comes with a communication overhead and requires precision.
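The division of labor in such a framework can be sketched in a few lines. In the example below the host code owns partitioning, grouping, and scheduling, while the two kernels stand in for the map and reduce operations that would be offloaded to FPGA accelerators; here they are ordinary Python functions, the loop is sequential, and all names are illustrative assumptions.

```python
from collections import defaultdict


def map_reduce(chunks, map_kernel, reduce_kernel):
    """Host-side driver: dispatch chunks to a map kernel, group, then reduce."""
    grouped = defaultdict(list)
    for chunk in chunks:  # each chunk would be dispatched to one accelerator
        for key, value in map_kernel(chunk):
            grouped[key].append(value)
    # The reduction per key is the second offloaded operation.
    return {key: reduce_kernel(key, values) for key, values in grouped.items()}


def count_words(chunk):
    """Stand-in map kernel: emit (word, 1) for every word in the chunk."""
    return [(word, 1) for word in chunk.split()]


def add_up(key, values):
    """Stand-in reduce kernel: sum the counts of one word."""
    return sum(values)


if __name__ == "__main__":
    print(map_reduce(["a b a", "b c"], count_words, add_up))
    # {'a': 2, 'b': 2, 'c': 1}
```

Because the driver only sees the two kernel interfaces, the same host code can target CPUs, GPUs, or FPGAs, which is the bridging property mentioned above.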
Tarafdar and his colleagues [77] utilized OpenCL via the Xilinx SDAccel framework, using an abstraction layer to assign data to multiple FPGAs and maintaining a transparent directory to virtualize the FPGAs at the lower abstraction level. The FPGA groups approach [121] suggests that multiple FPGAs can be shared by one group but configured with a matching accelerator. However, this comes with the limitation of occupying a complete FPGA, which results in under-utilization, but this can be addressed by automating the scaling algorithm. A similar concept has been proposed in [122] using Hadoop YARN, with the added advantage of ease of programming.
In the heterogeneous computing environment, the performance is also a function of execution strategy. For the exploration of alternative execution strategies on disaggregated environments, the evaluation platform presented in [123] is useful.

Cloud Services
Cloud services architecture guarantees QoS and promises computational correctness while abstracting the underlying architecture. Therefore, as the user has no concern about the choice of computational node, the job can be computed on an employed FPGA. Amazon's offering of the FPGA as a resource in [40] does not fall into this category, but the landmark work of Microsoft [4] on search ranking, which achieved a substantial speed-up with relatively higher power consumption, does. It is also a good example of a hybrid architecture, as Catapult can assign acceleration jobs to both the host CPU core and the standalone FPGA. Baidu [124] achieved the same performance for deep neural networks. The use of FPGAs as co-processors in compute-intensive problems has been implemented in [125], exploiting multiple data streams.
The architectures with network support widen the choice of connectivity, which allows the CPU provisioning either as a soft-core or embedded on-chip. OpenStack is the most common method for directly allowing the user to program the FPGA [75,77,87] through physical or virtual address. It provides the flexibility to the expert user for exploiting either socket or remote routine approach to establish connectivity to an FPGA. Bashir et al. addressed the issue of poor utility and high computation complexity on high-dimensional data in [126] and proposed many network architectures for datacenters in [127,130,131].

Execution Model based Distribution
The execution model is used as a decision parameter during system partitioning, that is, the process of deciding which modules are placed in the shell. Inspired by Flynn's taxonomy [132], the execution models can be grouped into four categories, as described in Table 6. The works of the last decade have each been assigned to one of the four boxes, according to relevance, for quick navigation.
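For reference, Flynn's original four classes are determined by whether a system has one or several instruction streams and one or several data streams. The Python sketch below encodes that rule. The names `ExecutionModel` and `classify` are hypothetical, and the categories of Table 6 are only inspired by this taxonomy, so they may not map one-to-one onto these classes.

```python
from enum import Enum


class ExecutionModel(Enum):
    """Flynn's four classes, keyed by instruction and data stream counts."""
    SISD = "single instruction, single data"
    SIMD = "single instruction, multiple data"
    MISD = "multiple instruction, single data"
    MIMD = "multiple instruction, multiple data"


def classify(instruction_streams, data_streams):
    """Map stream counts to one of Flynn's classes."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    multi_instr = instruction_streams > 1
    multi_data = data_streams > 1
    if multi_instr:
        return ExecutionModel.MIMD if multi_data else ExecutionModel.MISD
    return ExecutionModel.SIMD if multi_data else ExecutionModel.SISD


if __name__ == "__main__":
    # One kernel applied to eight data lanes.
    print(classify(instruction_streams=1, data_streams=8).name)
```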

Open Problems and Discussion
There is plenty to do in this area, but we would like to mention a few open problems. The foremost goal is to enable multi-tenant, multi-FPGA systems for medium to large-scale datacenters; only then can we unleash the real potential of FPGAs as a heterogeneous computing resource. This can be achieved either by developing an FPGA operating system or by improving existing virtualization methods. Intensive investigation is required into how to compute over multiple FPGAs in a scalable manner. A design is required that can exploit multiple FPGAs through either Catapult-style streaming or MapReduce-style batching, as an alternative to OpenStack.
A serious effort is required to make the shell and the development stack modular. Currently, everything must be compiled against a shell, and any change in the shell requires recompiling the accelerators. This is comparable to a change in the Linux kernel forcing the recompilation of all user software, which illustrates how immature the FPGA ecosystem still is. However, FPGAs provide a great deal of customization, without which it would be meaningless to use FPGAs in the first place. Overlays solve this issue to an extent for a small class of applications, but the solution does not scale to general computation with FPGAs. Therefore, we need a set of APIs and standards in the software stack to manage this heterogeneity in a sensible manner. Dynamic resource allocation addresses this issue to some extent, but it largely remains an area ignored by the community.
Multi-tenant support requires efficient management and scheduling of resources; an advanced resource manager that can fit the same workload onto fewer FPGA resources should be a key focus of future development.
Security is another aspect that needs the close attention of the community and offers considerable potential for development. FPGAs in datacenters are vulnerable to many kinds of attacks; reported attacks include malicious bitstreams and side channels, which can severely damage availability. Security mechanisms also assist in segregating multiple accelerators on the same FPGA or network.
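Fitting a workload onto fewer FPGAs can be viewed as a bin-packing problem. The Python sketch below uses the classic first-fit decreasing heuristic as an illustration; it is a swapped-in textbook technique, not a resource manager from the surveyed literature. The function name `pack_first_fit_decreasing` is hypothetical, and each tenant's demand is assumed to be a single number, such as a percentage of one FPGA's logic resources, which ignores routing, memory, and I/O constraints.

```python
def pack_first_fit_decreasing(demands, capacity):
    """Place tenant demands onto as few FPGAs as the heuristic finds.

    demands:  resource demand of each accelerator (assumed single-valued).
    capacity: resources available on one FPGA, in the same unit.
    Returns one list of placed demands per FPGA used.
    """
    if any(d <= 0 or d > capacity for d in demands):
        raise ValueError("each demand must be in the range (0, capacity]")
    fpgas = []  # each entry is [remaining_capacity, placed_demands]
    for demand in sorted(demands, reverse=True):
        for slot in fpgas:
            if slot[0] >= demand:  # first FPGA with enough room
                slot[0] -= demand
                slot[1].append(demand)
                break
        else:
            # No existing FPGA fits this demand, so allocate a new one.
            fpgas.append([capacity - demand, [demand]])
    return [placed for _, placed in fpgas]


if __name__ == "__main__":
    placement = pack_first_fit_decreasing([50, 30, 40, 20, 60], capacity=100)
    print(len(placement), "FPGAs:", placement)
```

With one accelerator per device, the five demands in the usage example would occupy five FPGAs; the heuristic places them on two. First-fit decreasing is not guaranteed to be optimal, and a production manager would also have to respect partial reconfiguration region boundaries.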

Conclusion
The integration of FPGAs in datacenters may have different motivations, from acceleration to energy efficiency, but the ultimate objective of better performance has remained unshaken. FPGAs are being used in a variety of ways today, both tightly coupled with heterogeneous computing resources and as a standalone network of homogeneous resources. Open-source software stacks, proprietary tool chains, and programming languages with advanced methodologies are steadily breaking down the programmability wall of the FPGA. Therefore, it was important to view this area as high-performance reconfigurable computing.
In this paper, we presented a survey of high-performance reconfigurable computing. We pointed out the use of non-standard nomenclature in published research as an obstacle to the growth of the body of knowledge. We further identified contributors from different backgrounds, working on a wide range of applications, as the reason for this phenomenon. We indicated some examples of the use of standard language and nomenclature. We revisited network-on-chip evaluation platforms to highlight the importance of the network-on-chip compared with bus-based architectures. The limitations of virtualization shells, such as frequency drop, high wire demand, increased design latency, and routing congestion leading to routing failure, can be addressed using a suitable network-on-chip. We highlighted the need for network-on-chip evaluation platforms to analyze performance quickly and arrive at the required communication architecture. We updated the scientific community on classical and recent virtualization techniques from the last decade. We extended our review from the acceleration of a standalone FPGA to FPGAs connected as a computational resource in a heterogeneous environment. The purpose of this research was to create synergy by combining three domains, to assist designers in choosing the right communication architecture for the right virtualization technique, and to emphasize the importance of using standard language, so that multi-tenant FPGAs in datacenters can evolve.
We have outlined open problems in this area. Our future research will focus on finding an optimal communication architecture for multi-FPGA systems. Beyond the interconnection between processing elements within one FPGA, communication among multiple FPGAs poses a bigger challenge for our future work and an opportunity for the community as well.