Efficient ROS-Compliant CPU-iGPU Communication on Embedded Platforms

Abstract: Many modern programmable embedded devices contain CPUs and a GPU that share the same system memory on a single die. Such a unified memory architecture (UMA) allows programmers to implement different communication models between the CPU and the integrated GPU (iGPU). Although the simpler model guarantees implicit synchronization at the cost of performance, the more advanced model allows, through the zero-copy paradigm, the explicit data copying between CPU and iGPU to be eliminated, with significant benefits in performance and energy savings. On the other hand, the robot operating system (ROS) has become a de-facto reference standard for developing robotic applications. It allows for application re-use and the easy integration of software blocks in complex cyber-physical systems. Although ROS compliance is strongly required for SW portability and reuse, it can lead to performance loss and negate the benefits of zero-copy communication. In this article we present efficient techniques to implement CPU-iGPU communication while guaranteeing compliance with the ROS standard. We show how the key features of each communication model are maintained and the corresponding overhead introduced by ROS compliance.

required smart re-configuration. It has been developed on top of ROS Jade (ROS 1) to deploy software on autonomous transport vehicles. In [18], the authors present the Autoware software, which is designed to enable autonomous vehicles on embedded boards, specifically NVIDIA Drive PX2 boards communicating through ROS Kinetic (ROS 1), showing high levels of efficiency thanks to the iGPU and ARM CPU installed on the embedded board, superior to high-end x86 laptop counterparts.
Unlike these prior works, which target cyber-physical systems from the point of view of functionality or performance, the analysis and methodology proposed in this work target the overhead that the ROS protocols introduce on CPU-iGPU zero-copy communication models, which, to the best of our knowledge, has not been addressed in prior work.

CPU-iGPU Communication and ROS Protocols
This section presents an overview of the CPU-iGPU communication models and the standard ROS-based communication between tasks. For the sake of clarity and without loss of generality, we consider CUDA as the iGPU architecture.

CPU-iGPU Communication in Shared-Memory Architectures
The most traditional and simple communication model between CPU and iGPU is the Standard Copy (CUDA-SC in the following), which is based on explicit data copy (see Figure 1a). With CUDA-SC, the physically shared memory space is partitioned into different logical spaces and the CPU copies the data from its own partitions to the iGPU partitions. Since both the CPU and GPU caches are enabled, caching can hide part of the data copy overhead. Cache coherence is guaranteed implicitly by the operating system, which flushes the caches before and after each GPU kernel invocation. Such a model also guarantees implicit synchronization for data access between CPU and iGPU. Programmers can take advantage of different solutions that allow this explicit copy to be removed, with performance and power consumption benefits.
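As a reference, the following minimal sketch illustrates the CUDA-SC model; the kernel, buffer size, and names are only illustrative and are not taken from the benchmarks used in this article.

#include <cuda_runtime.h>

__global__ void double_elements(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_data = new float[n];              // CPU logical space
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));    // iGPU logical space
    // explicit copy CPU -> iGPU
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    double_elements<<<(n + 255) / 256, 256>>>(d_data, n);
    // explicit copy iGPU -> CPU (blocking, so it also synchronizes with the kernel)
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    delete[] h_data;
    return 0;
}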

The unified memory model (CUDA-UM in the following) allows the programmer to implement CPU-iGPU communication through data pointers, thus avoiding explicit data transfer invocations (see Figure 1b). In this communication model, the physically shared memory is still partitioned into CPU and iGPU logical spaces, which are however abstracted and used by the programmer as a virtually unified logical space. The runtime system implements synchronization between the CPU and iGPU logical spaces through an on-demand page migration heuristic.
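A minimal CUDA-UM sketch (again with an illustrative kernel and size) shows how a single pointer is used by both processing elements, with the runtime migrating pages on demand:

#include <cuda_runtime.h>

__global__ void double_elements(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));  // single virtually unified allocation
    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // CPU writes through the same pointer
    double_elements<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                      // required before the CPU touches the data again
    float first = data[0];                        // no explicit cudaMemcpy was issued
    cudaFree(data);
    return (first == 2.0f) ? 0 : 1;
}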
The zero-copy model (CUDA-ZC in the following) implements CPU-iGPU communication by passing data through pointers to the pinned shared address space. When the CPU and the iGPU share the same physical memory space, their communication does not rely on the PCIe bus, and communication through CUDA-ZC provides the best efficiency in different application contexts [3,4]. CUDA-ZC through a shared address space requires the system to guarantee cache coherency across the memory hierarchy of the two processing elements. Since the overhead introduced by SW coherency protocols applied to CPU-iGPU devices may negate the benefit of zero-copy communication, many systems address the problem by disabling the last-level caches (see Figure 2a).
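The following sketch illustrates the CUDA-ZC model with a pinned, mapped allocation (kernel and size are illustrative):

#include <cuda_runtime.h>

__global__ void double_elements(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_data = nullptr, *d_alias = nullptr;
    // pinned allocation mapped into the iGPU address space
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_alias, h_data, 0);  // device alias of the same memory
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;
    double_elements<<<(n + 255) / 256, 256>>>(d_alias, n);
    cudaDeviceSynchronize();  // h_data now holds the result: no copy was performed
    cudaFreeHost(h_data);
    return 0;
}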
CUDA-ZC best applies when the shared address space between CPU and iGPU points to the same physical memory space; if the memory data pointer changes during CPU execution due to new data allocations, CUDA-ZC requires a new allocation. However, the CUDA-ZC model provides the cudaHostRegister API to handle this situation without allocating additional memory. It applies page-locking to a CPU memory address and transforms it into a CUDA-ZC address. This operation requires hardware I/O coherency to avoid memory consistency problems caused by CPU caching.
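Under the assumption of an I/O coherent device, an already-allocated CPU buffer can be turned into a CUDA-ZC address as in the following sketch (kernel and size are again illustrative):

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void double_elements(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    // a pre-existing CPU allocation, e.g., the buffer a ROS message already owns
    float *cpu_buf = static_cast<float *>(malloc(n * sizeof(float)));
    // page-lock it and map it into the iGPU address space (no new allocation)
    cudaHostRegister(cpu_buf, n * sizeof(float), cudaHostRegisterMapped);
    float *dev_alias = nullptr;
    cudaHostGetDevicePointer(&dev_alias, cpu_buf, 0);
    double_elements<<<(n + 255) / 256, 256>>>(dev_alias, n);
    cudaDeviceSynchronize();
    cudaHostUnregister(cpu_buf);
    free(cpu_buf);
    return 0;
}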
Although additional memory has to be allocated for coherency with the cudaHostAlloc API, the data copy phase is more efficient than CUDA-SC, as it avoids copy stream buffering. On the other hand, since a copy is still performed, performance decreases compared to the zero-copy model operating on the same memory location. Although CUDA-ZC best applies to many SW applications (e.g., applications in which the CPU is the data producer and the GPU is the consumer), it often leads to strong performance degradation when the applications make intensive use of the GPU cache (i.e., cache-dependent applications). To reduce such a limitation, more recent embedded devices include hardware-implemented I/O coherence, which allows the iGPU to snoop the CPU cache. In this case, the CPU cache is always active and any update on it is visible to the GPU, while the GPU cache is disabled (see Figure 2b).

ROS-Based Communication between Nodes
In ROS, the system functionality is implemented through communicating and interacting nodes. The nodes exchange data using two mechanisms: the publisher-subscriber approach and the service approach (i.e., remote procedure call, RPC) [19].
The publisher-subscriber approach relies on topics, which are communication buses identified by a name [20]. One node (publisher) publishes data asynchronously and another node (subscriber) receives the data as it is published. Given a publisher URI, a subscribing node negotiates a connection through the master node via XML RPC. The result of the negotiation is that the two nodes are connected, with messages streaming from publisher to subscriber using the appropriate transport.
Each transport has its own protocol for how the message data is exchanged. For example, using TCP, the negotiation would involve the publisher giving the subscriber the IP address and port on which to connect. The subscriber then creates a TCP/IP socket to the specified address and port. The nodes exchange a connection header that includes information like the MD5 sum of the message type and the name of the topic, and then the publisher begins sending serialized message data directly over the socket.
The topic-based approach relies on socket-based communication, which allows for collective communication. The communication channel is instantiated at system start-up and never closed. In this type of communication there can be many publishers and many subscribers on the same topic, as in the example of Figure 3.

With the service approach, a node provides a service on request (see Figure 4). Any client can query the service synchronously or asynchronously. In the case of synchronous communication, the querying node waits for a response from the service. There can be multiple clients, but only one service server is allowed for a given service.
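For reference, a minimal ROS 2 service server can be sketched as follows; we use the standard example_interfaces AddTwoInts service only as an illustration, not as the service used in this article:

#include <memory>
#include "rclcpp/rclcpp.hpp"
#include "example_interfaces/srv/add_two_ints.hpp"

int main(int argc, char **argv) {
    rclcpp::init(argc, argv);
    auto node = rclcpp::Node::make_shared("adder_server");
    // one server per service; many clients may query it
    auto service = node->create_service<example_interfaces::srv::AddTwoInts>(
        "add_two_ints",
        [](const std::shared_ptr<example_interfaces::srv::AddTwoInts::Request> req,
           std::shared_ptr<example_interfaces::srv::AddTwoInts::Response> res) {
            res->sum = req->a + req->b;  // each request is served on demand
        });
    rclcpp::spin(node);  // wait for client requests
    rclcpp::shutdown();
    return 0;
}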
In general, the service approach relies on a point-to-point communication (i.e., socket-based) between client and server and, for each client request, the server creates a dedicated new thread to serve the response. After the response, the communication channel is closed. Publisher-subscriber communication is implemented through a physical copy of the data. In the case of nodes running on the same device and sharing the same resources, the copy of large-size messages may slow down the entire computation. ROS 2 introduced ROS zero-copy, a new method to perform efficient intra-process communication [21]. Figure 5 shows the overview of such a communication method. With ROS zero-copy (ROS-ZC in the following), two nodes exchange the pointer to the data through topics, while the data are shared (not copied) in the common physical space. Even though such a zero-copy model guarantees high communication bandwidth between nodes instantiated in the same process, it has several limitations that prevent its applicability. First, it does not apply to service-based communication and it does not apply to inter-process communication. In addition, it does not support multiple concurrent subscribers and it does not allow for computation-communication overlapping.
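As an illustration, the following ROS 2 sketch enables intra-process communication and publishes a std::unique_ptr message, so that ownership of the buffer, rather than a copy, is transferred to the subscriber; the message type and topic name are only illustrative:

#include <memory>
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/int32.hpp"

int main(int argc, char **argv) {
    rclcpp::init(argc, argv);
    // both nodes must live in the same process and enable intra-process comms
    auto opts = rclcpp::NodeOptions().use_intra_process_comms(true);
    auto producer = std::make_shared<rclcpp::Node>("producer", opts);
    auto consumer = std::make_shared<rclcpp::Node>("consumer", opts);

    auto sub = consumer->create_subscription<std_msgs::msg::Int32>(
        "chatter", 10,
        [](std_msgs::msg::Int32::UniquePtr msg) {
            // the unique_ptr carries the buffer the producer allocated:
            // no serialization or copy happens inside the process
            RCLCPP_INFO(rclcpp::get_logger("consumer"), "got %d", msg->data);
        });

    auto pub = producer->create_publisher<std_msgs::msg::Int32>("chatter", 10);
    auto msg = std::make_unique<std_msgs::msg::Int32>();
    msg->data = 42;
    pub->publish(std::move(msg));  // ownership is transferred, not copied

    rclcpp::executors::SingleThreadedExecutor exec;
    exec.add_node(producer);
    exec.add_node(consumer);
    exec.spin_some();
    rclcpp::shutdown();
    return 0;
}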

Efficient ROS-Compliant CPU-iGPU Communication
GPU-accelerated code (i.e., GPU kernels) cannot be directly invoked by a ROS-compliant node. To guarantee ROS standard compliance of any task accelerated on GPU, the GPU kernel has to be wrapped to become a ROS node.

Making CPU-iGPU Communication Compliant to ROS
We consider, as a starting point, communication between CPU and iGPU implemented through CUDA-SC or CUDA-UM (see Figure 1a,b). For the sake of space and without loss of generality, we consider the performance of UM similar to that of SC. As confirmed by our experimental results, the maximum difference between the performance of the two models is within ±8% on all the considered devices, and it is strictly related to the driver implementation of the on-demand page migration. Compared to the difference between CUDA-SC (or CUDA-UM) and CUDA-ZC, we therefore consider the performance difference between CUDA-SC and CUDA-UM negligible in this article. Figure 6 shows the most intuitive solution to make such a communication model compliant with ROS. The main CPU task and the wrapped GPU task are implemented by two different ROS nodes (i.e., processes P1 and P2). Communication between the two nodes relies on the publisher-subscriber model. The CPU node (i.e., process P1) executes the CPU tasks and manages the synchronization points between the CPU and GPU nodes. It publishes input data into a send topic and subscribes to a receive topic to get the data elaborated by the GPU. The GPU node (i.e., process P2) is implemented by a GPU wrapper and the GPU kernel. The GPU wrapper is a CPU task that subscribes to the send topic to receive the input data from P1, exchanges the data with the GPU through the standard CPU-iGPU communication model, and publishes the result on the receive topic.
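A possible implementation of the GPU wrapper node (process P2) is sketched below; the message type, topic names, and kernel are placeholders and not the ones used in our benchmarks:

#include <cuda_runtime.h>
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/float32_multi_array.hpp"

__global__ void gpu_task(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

class GpuWrapper : public rclcpp::Node {
public:
    GpuWrapper() : Node("gpu_wrapper") {
        pub_ = create_publisher<std_msgs::msg::Float32MultiArray>("receive_topic", 10);
        sub_ = create_subscription<std_msgs::msg::Float32MultiArray>(
            "send_topic", 10,
            [this](std_msgs::msg::Float32MultiArray::SharedPtr msg) {
                const int n = static_cast<int>(msg->data.size());
                float *d = nullptr;
                cudaMalloc(&d, n * sizeof(float));                // CUDA-SC: iGPU copy of the message
                cudaMemcpy(d, msg->data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
                gpu_task<<<(n + 255) / 256, 256>>>(d, n);
                cudaMemcpy(msg->data.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
                cudaFree(d);
                pub_->publish(*msg);                              // result back to the CPU node
            });
    }
private:
    rclcpp::Publisher<std_msgs::msg::Float32MultiArray>::SharedPtr pub_;
    rclcpp::Subscription<std_msgs::msg::Float32MultiArray>::SharedPtr sub_;
};

int main(int argc, char **argv) {
    rclcpp::init(argc, argv);
    rclcpp::spin(std::make_shared<GpuWrapper>());
    rclcpp::shutdown();
    return 0;
}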
This solution is simple and easy to implement. On the other hand, the system keeps three synchronized copies of the I/O data messages (i.e., for the CPU task, the GPU wrapper, and the GPU task). For each new input data received by the wrapper through the send topic, the memory slot allocated for the message has to be updated both in the CPU logical space (i.e., for the GPU wrapper) and in the GPU logical space (i.e., for the GPU task), as in the CUDA-SC standard model. On the other hand, CPU and GPU tasks can overlap, as the publisher-subscriber protocol allows for asynchronous and non-blocking communication. The following snippet of pseudo-code shows the steps performed by the CPU node to perform the CPU operations and to communicate with the iGPU node through the publish-subscribe paradigm and topics:

callback(msg::SharedPtr msg) {
    copy_gpu_result(this->data, msg->data);
    // synchronize operations and perform next step
    sync_and_step(msg, this->data);
}

// init
this->publisher  = create_pub<msg>(TOPIC_SEND);
this->subscriber = create_sub<msg>(TOPIC_RECEIVE, &callback);

// perform first step
msg::SharedPtr msg = new msg();
msg->data = this->data;
this->publisher->publish(msg);  //< send to iGPU
cpu_operations(this->data);     //< perform CPU ops in overlapping

Figure 7 shows a ROS-compliant implementation of the CUDA-SC model through the ROS service approach. Similarly to the first publisher-subscriber solution, the memory for the data message is replicated in each ROS node and in the GPU memory. The message has to be copied during every data communication and, since the service request can be performed asynchronously, CPU-iGPU operations can be performed in parallel.
The following snippet of pseudo-code shows the steps performed by the CPU node to perform the CPU operations, send a request, and wait for a response through the RPC model:

// init
auto gpu_client = create_client<srv>(SERVICE_GPU_NAME);
while (!gpu_client->wait_for_service()); //< wait for GPU service available

// send a new request asynchronously
auto request = new request();
request->data = this->data;
auto gpu_result = gpu_client->async_send_request(request);
cpu_operations(this->data);  //< perform CPU ops in overlapping

// wait for result
spin_until_future_complete(gpu_result);
copy_gpu_result(this->data, gpu_result.get()->data);

Figure 8 shows a more optimized version of such a standardization, which relies on ROS-ZC and intra-process communication. The ROS nodes are implemented as threads of the same process. As a consequence, each node shares the same virtual memory space and communication can rely on the more efficient protocols based on shared memory. Nevertheless, the usage of ROS-ZC has several limitations:
• ROS-ZC can be implemented only for communication between threads of the same process. When a zero-copy message is sent to a ROS node of a different process (i.e., inter-node communication), the communication mechanism automatically switches to the ROS standard copy;
• ROS-ZC does not allow for multiple nodes subscribed on the same data resource. If several nodes have to access a ROS-ZC message concurrently, ROS-ZC applies to only one of these nodes. The communication mechanism automatically switches to the ROS standard copy for the others. This condition holds for both intra-process and inter-process communication;
• ROS-ZC only allows for synchronous ownership of the memory address. A node that publishes a zero-copy message over a topic is no longer allowed to access the message memory address. For this reason, CPU and iGPU operations cannot be performed in parallel, as the CPU node cannot execute operations after sending the message. The CPU node is therefore forced to compute its operations either before or after the GPU operations (i.e., no overlapping computation is allowed over the shared memory address).

Figure 8. CPU-iGPU standard copy with ROS native zero-copy and topic paradigm.
These limitations guarantee the absence of race conditions in shared memory locations and provide better performance with respect to ROS standard copy in the case of intra-process communication. On the other hand, ROS-ZC does not apply to inter-process communication, which is fundamental for the portability of robotic applications.
Due to the several limitations of the previous simple solutions, we propose a new approach that maintains the standard ROS interface and the related modularity advantages while taking advantage of the shared memory between processes when nodes are deployed on the same unified memory architecture (see Figure 9). The idea is to implement a ROS interface that shares the reference to the inter-process shared memory with the other ROS nodes and exchanges only synchronization messages between nodes. The proposed solution has the following characteristics:
• Not only intra-process: this solution also applies to inter-process communication by means of an IPC shared memory managed by the operating system;
• Unique data allocation: the only memory allocated for data exchange is the shared memory;
• Efficient CUDA-ZC: the reference to the shared memory does not change during the whole communication process between ROS nodes. As a consequence, it also applies to the CUDA-ZC communication between the wrapper and the iGPU task (see Figure 10);
• Easy concurrency: the shared memory can be accessed concurrently by different nodes, allowing parallel execution between nodes over the same memory space when the application is free from race conditions.

The two drawbacks of this implementation are the risk of race conditions, which have to be managed by the programmer, and the need to fall back to the standard ROS communication protocol when one of the nodes moves outside of the unified memory architecture. The most intuitive ROS integration of a concurrent CPU-iGPU application is the partition of the CPU and iGPU tasks into two different nodes. In the case of multiple nodes (CPUs and iGPU), we propose a different architecture in which a dedicated node exclusively implements the multi-node synchronization and scheduling (see Figure 11). In particular, the CPU nodes wait for any new data message from the send topic, perform the defined tasks, and then return the results on the corresponding topic. The iGPU node(s) behave similarly for the CUDA kernel tasks. The scheduler node acts as a synchronizer for the CPU and iGPU nodes: it provides the data to elaborate on the send topic, waits for the CPU and iGPU responses and, when both are received, merges the responses and forwards the merged data. Figure 11 shows the multi-node architecture implemented as an extension of the two node architecture based on the ROS topic paradigm. It can analogously be implemented with the ROS service paradigm.

Comparing the two solutions with the multi-node architecture, the topic paradigm is more efficient than the service paradigm in terms of communication, as the send topic is shared among the subscriber nodes, whereas in the service paradigm a new request has to be performed for each required service. This aspect is particularly important, as it underlines the scalability advantages of the multi-node architecture compared to the two node architecture. Assume an application with N CPU tasks and M GPU tasks, all sharing the same resource. The system can be implemented in a ROS-compliant way with one scheduler node, N CPU nodes, and M GPU nodes. All nodes are synchronized by the scheduler node with the ROS topic or ROS service paradigm. Considering the topic paradigm, the system requires N + M receive topics and one send topic. With the service paradigm, each of the N + M CPU and GPU nodes has to create a service, which will be used by the scheduler node.
The topic paradigm relies on a single send topic, which can be shared by different subscribers. With the service paradigm, the scheduler node has to perform a new request for each service. In terms of performance, the topic paradigm for multiple tasks is more efficient in the case of data-flow applications.
In order to implement the multi-node architecture, it is necessary to create N callbacks for CPU data reception and M callbacks for GPU data reception. In this case, the only purpose of the scheduling node is to manage the communication between the other nodes. The scheduling implementation skeleton for the topic paradigm is the following:

callback_cpu_i(msg::SharedPtr msg) {
    copy_cpu_i_result(this->data, msg->data);
    // synchronize operations and perform next step
    sync_and_step(msg, this->data);
}

callback_gpu_j(msg::SharedPtr msg) {
    copy_gpu_j_result(this->data, msg->data);
    // synchronize operations and perform next step
    sync_and_step(msg, this->data);
}

// init
this->publisher        = create_pub<msg>(TOPIC_SEND);
this->subscriber_cpu_i = create_sub<msg>(TOPIC_CPU_I_RECEIVE, &callback_cpu_i);
this->subscriber_gpu_j = create_sub<msg>(TOPIC_GPU_J_RECEIVE, &callback_gpu_j);

// perform first step
msg::SharedPtr msg = new msg();
msg->data = this->data;
this->publisher->publish(msg);  //< send to subscriber nodes

Figure 12 shows the multi-node architecture implemented through ROS-ZC for node communication. The CPU node, the iGPU node, and the scheduler node have the same roles as before, but they are executed as threads of the same process and the exchanged messages rely on the zero-copy paradigm. For this reason, two communicating nodes (i.e., scheduler-CPU or scheduler-iGPU) can exchange only the reference address of the data. The communication involving the remaining competing node automatically switches to the ROS standard copy modality. For example, scheduler-CPU communication will be switched to ROS standard copy if the scheduler-iGPU transfer is ROS-ZC, and scheduler-iGPU communication will be ROS-SC if scheduler-CPU is ROS-ZC. The zero-copy exchange is granted to the first node ready to receive the data, and the choice of such a node is not predictable. In general, this architecture with N CPU tasks and M iGPU tasks requires N + M − 1 data instances in the virtual address memory space (e.g., for one CPU task and one iGPU task the system requires two data instances, as shown in Figure 12), as only one instance can be shared in zero-copy between the scheduler node and another node. Therefore, due to the ROS-ZC limitations, this architecture with ROS-ZC can save only one data instance compared to the ROS-SC solution. The advantage of the multi-node architecture with ROS-ZC is that it overcomes the limitation of the two node architecture with ROS-ZC, which does not allow for concurrent CPU and iGPU execution. Thanks to the scheduler node, which implements CPU-iGPU synchronization, all CPU and iGPU tasks can be executed in overlapping. However, since data copies are still needed, it is not possible to combine ROS-ZC with CUDA-ZC in the case of multiple nodes while fully taking advantage of the zero-copy semantics. This is due to the fact that ROS does not guarantee that the same virtual address is maintained for the GPU node. As a consequence, as confirmed by our experimental results, this solution cannot guarantee the best performance in multi-node communication compared to standard copy solutions in terms of both memory usage and GPU management.

To overcome such limitations, we propose the solution represented in Figure 13. Differently from the previous solution, the system manages the IPC shared memory and, unlike ROS-ZC, the nodes can be instantiated as processes. The scheduler node creates the IPC shared memory by using the standard Linux OS syscalls. Then, it performs the memory attach to bind the created shared memory to its own virtual memory space and shares synchronization messages containing the shared memory information with the other nodes. The CPU and iGPU nodes wait for the synchronization messages: they obtain the IPC shared memory with the OS syscalls, perform the memory attach, and finally perform the requested actions. This approach implements the ROS zero-copy mechanism through the shared memory and applies to both intra-process and inter-process communication. On the other hand, since it aims at avoiding multiple copies of the data message, it involves more overhead to manage race conditions among the multiple nodes sharing the same logical space.
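The following sketch outlines the scheduler side of this mechanism under our assumptions (System V shared memory, an agreed-upon key, and cudaHostRegister for the iGPU mapping); the synchronization messages exchanged over ROS topics are omitted, and the key, size, and layout are hypothetical:

#include <sys/ipc.h>
#include <sys/shm.h>
#include <cuda_runtime.h>

int main() {
    const key_t key = 0x1234;     // key agreed upon by all nodes
    const size_t bytes = 1 << 20;

    // scheduler node: create the segment and attach it to its virtual space
    int shmid = shmget(key, bytes, IPC_CREAT | 0666);
    float *data = static_cast<float *>(shmat(shmid, nullptr, 0));

    // make the shared buffer visible to the iGPU without any extra allocation
    cudaHostRegister(data, bytes, cudaHostRegisterMapped);
    float *dev_alias = nullptr;
    cudaHostGetDevicePointer(&dev_alias, data, 0);

    // ... publish a small ROS message carrying `key` (and offsets) so that the
    // CPU and iGPU nodes can shmget/shmat the same segment and work in place ...

    cudaHostUnregister(data);
    shmdt(data);
    shmctl(shmid, IPC_RMID, nullptr);  // remove the segment at shutdown
    return 0;
}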

Figure 13. CPU-GPU zero-copy with our ROS zero-copy solution, topic paradigm and three node architecture.

Experimental Results
To verify the performance of the proposed ROS-compliant communication models, we carefully tailored two different benchmarks: a cache-dependent benchmark and a concurrent benchmark.
The cache benchmark implements the elaboration of a matrix data structure independently performed by both the CPU and the iGPU. In particular, the CPU performs a series of floating point square roots, divisions, and multiplications with data read from and written to a single memory address. The GPU performs a 2D reduction multiple times through linear memory accesses. This is achieved through iterative loading of the operands (ld.global instructions), a sum (add.s32), and the result store (st.global). As a consequence, this benchmark makes intensive use of the caches (both CPU and GPU), with no overlapping between CPU and iGPU execution. CUDA-ZC makes use of the concurrent execution of the routines and the concurrent access to the shared data structure. CUDA-SC explicitly exchanges the data structure before the routine computation.
The concurrent benchmark performs a balanced CPU+GPU computation through a routine with highly reduced use of the GPU cache. The GPU kernel implements a single read access (ld.global) and a single write access (st.global) per iteration in order to reduce the cache usage as much as possible. It implements a concurrent access pattern and a perfect overlap of the CPU and GPU computations to extract the maximum communication performance the given embedded platform can provide with CUDA-ZC. It should greatly favor the communication patterns that allow concurrent access to the shared data (i.e., only CUDA-ZC).
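The access pattern described above can be illustrated by a kernel of the following form (our own sketch, not the exact benchmark code):

__global__ void concurrent_step(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] + 1.0f;  // one global load and one global store per element
}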
The routines of both benchmarks are optimized on I/O coherent hardware through the use of the cudaHostRegister API.
For all tests, we used an NVIDIA Jetson TX2 and a Jetson Xavier as embedded computing architectures; these are prevalent low-power heterogeneous devices used in industrial robots, machine vision cameras, and portable medical equipment (see Figure 14).
Both of these boards are equipped with an iGPU and UMA, which allow us to best apply the proposed methodology. The TX2 consists of a quad-core ARM Cortex-A57 MPCore, a dual-core NVIDIA Denver 2 64-bit CPU, 8 GB of unified memory, and an NVIDIA Pascal iGPU with 256 CUDA cores. The Xavier consists of four dual-core NVIDIA Carmel ARMv8.2 CPUs, 32 GB of unified memory, and an NVIDIA Volta iGPU with 512 CUDA cores and I/O coherency. These two synthetic tests aim at maximizing the communication bottleneck and are representative of a worst-case scenario communication-wise. Real-world applications, especially in the field of machine learning, will experience a smaller bottleneck because of more coarse-grained communication. In contrast, a high number of nodes with limited communication will still benefit from the proposed methodology, which reduces the overall communication overhead. Table 1 shows the performance results obtained by running the two benchmarks on the two devices with the different communication models (i.e., CUDA-SC, CUDA-ZC) without ROS. The reported times are the averaged results of 30 runs. The standard deviation is reported in the fifth column (i.e., column "SD"). As expected, the TX2 device provides lower performance than the Xavier. On both devices, the cache benchmark with the CUDA-ZC model, which disables the last-level caches of CPU and GPU, provides the worst performance. The concurrent benchmark, even with a light cache workload and concurrent execution, still leads to performance loss. On the other hand, the I/O coherency implemented in hardware on the Xavier reduces this performance loss, whereas the loss is extremely evident on the TX2.
The concurrent benchmark shows how a lighter cache usage, combined with I/O coherency and the concurrent execution enabled by CUDA-ZC, allows for significant performance improvements compared to CUDA-SC.
The results of Table 1 are used as reference times to calculate the overhead of all the tested ROS-based configurations. We present the results obtained with the proposed ROS-compliant models in four tables: Tables 2 and 3 report the cache and concurrent benchmarks on the Jetson TX2, while Tables 4 and 5 report the corresponding results on the Xavier. In Table 2, the first two rows show how a standard implementation of the ROS protocol (i.e., ROS-SC) decreases the overall performance to unacceptable levels, for both the service and topic ROS mechanisms. The standard ROS-ZC shows noticeable improvements compared to ROS-SC, reducing the overhead by a factor of ≈100. The proposed ROS-SHM-ZC improves even further, reducing the overhead down to 2% and 11% when compared to CUDA-SC and CUDA-ZC, respectively.
Moving from two to three nodes, we found a performance improvement when combining CUDA-SC with ROS-SC, with both topics and services. Although there is a slight overhead caused by the addition of the third node, these models lead to better overall performance due to the easier synchronization between nodes. The standard deviation shows high variance across runs, suggesting a communication bottleneck that can be exacerbated by the system network conditions, outside of the programmer's control, even in localhost. This communication bottleneck is greatly reduced in the ROS-ZC and ROS-SHM-ZC configurations, thanks to the reduced size of the messages. The third node overhead still reduces the performance when compared to two nodes, but overall the difference between two and three nodes in the zero-copy configurations is negligible.
The same considerations hold for the five node architecture, which also proves to be very costly due to the additional nodes. Nonetheless, it is still better, overhead-wise, than the two node ROS-SC configuration. In this instance, for the sake of clarity, the only reported solutions are the proposed ROS-SHM-ZC ones. The obtained overhead, while not negligible, is still limited, especially in the CUDA-ZC configuration. Considering the concurrent benchmark on the Jetson TX2 (Table 3), we found performance results similar to the cache benchmark, but with lower overall overheads in the two node ROS-SC configurations. In this benchmark, the ROS service mechanism is consistently slower than the topic mechanism. Worthy of interest are the speed-ups obtained with ROS-SHM-ZC: with CUDA-SC they are within the margin of error, at −1% with two nodes and −5% with three nodes; with CUDA-ZC they are more significant, at −9% and −11% with two and three nodes, respectively. This is caused by the caches not being disabled on the CPU node. As the hardware is not I/O coherent, we had to manually handle the consistency of the data, and because the cache utilization is low but still present, the CPU computations achieve better performance. This allows for an improvement when compared to the original CUDA-ZC solution of Table 1. Nonetheless, this also means that, while there should be no copies in these two configurations, the combination of the ROS mechanisms with the CUDA communication model (CUDA-ZC) actually forces the creation of explicit data copies: one from CPU to iGPU and one in the opposite direction. These copies are responsible for the loss in performance when comparing ROS-SHM-ZC + CUDA-ZC to ROS-SHM-ZC + CUDA-SC.
Considering the Xavier device (Tables 4 and 5), the ROS-SC model is much slower than the reference performance. Services are consistently slower than topics. ROS-ZC is faster than ROS-SC by a wide margin. It is also important to note that, for ROS-SHM-ZC, there are no negative overheads on the Xavier. This is due to the hardware I/O coherency, which already extracts the maximum performance from CUDA-ZC, so the manual handling of coherency does not lead to performance improvements. In both benchmarks, the three node configurations for ROS-SHM-ZC show higher overhead compared to the two node variants, highlighting the higher cost of the third node and, thus, leading to a significant performance loss when compared to the optimal performance of the native configuration.

Conclusions
In this article we presented different techniques to efficiently implement CPU-iGPU communication that complies with the ROS standard. We showed that a direct porting of the most widespread communication protocols to the ROS standard can lead to up to 5000% overhead. We presented an analysis showing that each technique provides extremely different performance depending on both the application and the hardware device characteristics. As a consequence, the analysis allows programmers to select the best technique to implement ROS-compliant CPU-iGPU communication protocols by fully taking advantage of the CPU-iGPU communication architecture provided by the embedded device, by applying the different mechanisms included in the ROS standard, and by considering different communication scenarios, from two to many ROS nodes.