Article

Adaptive Multi-Hop P2P Video Communication: A Super Node-Based Architecture for Conversation-Aware Streaming

Department of Information Science, Graduate School of Advanced Science and Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi-Hiroshima 739-8527, Japan
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Information 2025, 16(8), 643; https://doi.org/10.3390/info16080643
Submission received: 6 June 2025 / Revised: 16 July 2025 / Accepted: 18 July 2025 / Published: 28 July 2025
(This article belongs to the Special Issue Second Edition of Advances in Wireless Communications Systems)

Abstract

This paper proposes a multi-hop peer-to-peer (P2P) video streaming architecture designed to support dynamic, conversation-aware communication. The primary contribution is a decentralized system built on WebRTC that eliminates reliance on a central media server by employing super node aggregation. In this architecture, video streams from multiple peer nodes are dynamically routed through a group of super nodes, enabling real-time reconfiguration of the network topology in response to conversational changes. To support this dynamic behavior, the system leverages WebRTC data channels for control signaling and overlay restructuring, allowing efficient dissemination of topology updates and coordination messages among peers. A key focus of this study is the rapid and efficient reallocation of network resources immediately following conversational events, ensuring that the streaming overlay remains aligned with ongoing interaction patterns. While the automatic detection of such events is beyond the scope of this work, we assume that external triggers are available to initiate topology updates. To validate the effectiveness of the proposed system, we construct a simulation environment using Docker containers and evaluate its streaming performance under dynamic network conditions. The results demonstrate the system’s applicability to adaptive, naturalistic communication scenarios. Finally, we discuss future directions, including the seamless integration of external trigger sources and enhanced support for flexible, context-sensitive interaction frameworks.

1. Introduction

Aristotle, in his Politics [1], characterized humans as “political animals”, suggesting that we fulfill our potential and cultivate virtue through collective engagement. From this classical perspective, conversation has long been regarded as essential to social cooperation and ethical development. Accordingly, conversation functions not merely as a channel for information exchange but as the foundational mechanism for mutual understanding, conflict resolution, and the preservation of social cohesion [2,3,4].
In recent years, an extensive body of research has examined the characteristics of natural conversation, with particular attention paid to the mechanisms that support smooth turn-taking, shared attention, and the regulation of participation [5,6,7]. These studies highlight that effective dialogue depends not only on the transmission of information but also on subtle multimodal cues—such as gaze, body orientation, and prosody—that help coordinate social interaction.
However, conventional online communication systems often fail to fully support these subtle mechanisms. For much of human history, conversation occurred face-to-face, with individuals convened in the same physical space. Such co-located interaction naturally affords spontaneous entry and exit, overlapping speech, and the rapid negotiation of speaker roles.
With the advancement of information and communication technologies, this norm has shifted. Web-based video communication tools (e.g., Skype, Zoom, Microsoft Teams) have made remote interaction widely accessible, and recent work has sought to enhance co-presence through immersive systems such as telepresence robots, virtual avatars, and metaverse platforms [8,9]. Despite these developments, replicating the fluidity and naturalness of face-to-face communication remains an open challenge.
Prior studies have pointed out key limitations in current systems, including the lack of support for dynamic role-switching between speaker and listener, difficulties in managing floor control in group settings, and insufficient responsiveness to conversational context [10,11,12]. These limitations motivate the development of systems that can better approximate the nuances of physical conversation.
In this context, our study aims to bridge the experiential gap between digital and in-person communication by focusing on real-time responsiveness to conversational structure. Specifically, we target the enhancement of dynamic participation and conversational fluidity, which are often constrained in existing online platforms.
Many existing systems adopt a server-based streaming architecture, in which streams are collected centrally before redistribution. While this model simplifies coordination, it introduces several challenges. The reliance on a central media server creates a single point of failure (SPOF), making systems vulnerable to outages or attacks. Additionally, funneling all communication through a central entity raises concerns about surveillance and censorship.
Although technical countermeasures such as redundancy and encryption provide partial relief, they do not eliminate the structural risks associated with centralization. To address these challenges, recent research has explored decentralized and peer-to-peer architectures for online communication [13,14,15], aiming to improve resilience, scalability, and user autonomy.
This study builds upon these efforts by investigating a multi-polar, decentralized architecture that enables dynamic topological reconfiguration based on conversational structure. Our approach is designed to support fluid conversational dynamics while minimizing reliance on centralized infrastructure, thereby contributing to the broader goal of secure and democratic online communication.

Our Contribution

To address the limitations of centralized architectures and enable more flexible, resilient online conversations, we propose a decentralized peer-to-peer (P2P) communication system based on Web Real-Time Communication (WebRTC). WebRTC is an open framework supported by major browsers that enables low-latency, high-quality media exchange without the need for a central media server, and is widely used in applications such as video conferencing.
While most existing systems utilize WebRTC in one-hop or two-hop configurations, our approach extends it to a multi-hop topology—specifically involving three or more nodes. This extension is nontrivial, as WebRTC was not originally designed for such configurations, and achieving low-latency media delivery across multiple hops requires careful routing and connection management. Compared to overlay multicast systems or distributed streaming protocols like BitTorrent Live, our method offers a fully browser-native, standards-compliant solution that avoids custom plug-ins or heavyweight overlays. This makes it readily deployable and lightweight while preserving the interactive responsiveness required for natural conversation.
WebRTC’s media channels internally rely on RTP (Real-time Transport Protocol) for packet delivery and RTCP (RTP Control Protocol) for quality feedback and synchronization. These protocols are fundamental to maintaining smooth audio and video playback, enabling our system to aggregate media streams and adaptively manage quality across multiple hops. Their use ensures compatibility with existing QoS-aware video systems and provides a standards-based mechanism for monitoring and optimizing multi-hop media flow.
Psycholinguistic research shows that conversational turn-taking feels natural when response latency is under 300 ms, while delays over one second significantly degrade mutual understanding. Our routing scheme ensures that even with per-hop delays of around 10 ms, the overall end-to-end latency across three or more hops remains within this conversational threshold.
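As an illustrative sketch (not part of the implementation itself), the latency budget described above can be expressed as a simple check. The 300 ms threshold and the roughly 10 ms per-hop delay are the figures quoted in the text; one-time setup costs such as ICE negotiation are excluded because they occur only at session initiation.

```typescript
// Conversational latency budget check (illustrative values from the text).
const CONVERSATIONAL_THRESHOLD_MS = 300;

/** End-to-end one-way latency for a path consisting of `hops` links. */
function pathLatencyMs(hops: number, perHopMs: number = 10): number {
  return hops * perHopMs;
}

/** True when a path stays within the natural turn-taking threshold. */
function withinConversationalBudget(hops: number, perHopMs: number = 10): boolean {
  return pathLatencyMs(hops, perHopMs) <= CONVERSATIONAL_THRESHOLD_MS;
}

// Even a three-hop path (speaker -> super node -> super node -> listener)
// remains far below the threshold at ~10 ms per hop.
console.log(pathLatencyMs(3));              // 30
console.log(withinConversationalBudget(3)); // true
```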
Although initial WebRTC connection setup may take up to a few seconds due to factors like ICE negotiation and NAT traversal, this occurs only once at session initiation. ICE (Interactive Connectivity Establishment) is a protocol that coordinates the discovery of possible communication paths between peers, selecting the most efficient route. NAT (Network Address Translation) complicates direct P2P communication by masking private IP addresses, often requiring additional steps for address discovery. Once established, media transmission via host or reflexive ICE candidates typically achieves round-trip times of 5–10 ms. When no direct route is available, TURN (Traversal Using Relays around NAT) servers act as intermediaries to relay traffic between peers. Although TURN relays introduce higher latency, their use can be minimized in well-connected network environments. Terms related to WebRTC are described in detail in Section 2.
In summary, our work demonstrates that multi-hop WebRTC—augmented by standards-compliant RTP/RTCP mechanisms—can provide a low-latency, scalable, and robust alternative to traditional server-based systems, thereby offering a novel architecture for natural, conversation-aware communication.
The remainder of this paper is organized as follows. Section 2 provides a technical overview of WebRTC. Section 3 reviews related work on systems supporting online conversation. Section 4 introduces the proposed super-node-based P2P framework. Section 5 details the reconfiguration process that adapts to conversational dynamics. Section 6 presents a performance evaluation of the proposed method under dynamic conditions. Finally, Section 7 concludes the paper and outlines future directions.

2. WebRTC Overview

WebRTC is an open-source framework that enables real-time communication in web and mobile applications through standardized APIs [16]. It facilitates P2P communication, allowing direct transmission of audio, video, and data between devices without the need for intermediary media relay servers. This capability makes WebRTC particularly effective for applications demanding low-latency and high-quality communication.
To support communication under diverse network environments, WebRTC employs a combination of ICE, the JavaScript Session Establishment Protocol (JSEP), and the Session Description Protocol (SDP). ICE enables connection establishment across NAT and firewall boundaries by coordinating multiple candidate connection paths. STUN (Session Traversal Utilities for NAT) servers assist this process by allowing clients to discover their public-facing IP addresses and ports, enabling direct peer-to-peer communication when possible; TURN relays serve as a fallback when direct connections are not feasible.
In practice, WebRTC-based systems often adopt one of several architectural models depending on scalability and performance requirements. The most common is the Selective Forwarding Unit (SFU) architecture, in which a central media server forwards media streams without decoding them. (In WebRTC-based systems, communication typically involves not only media servers, which handle the forwarding of media streams, but also signaling servers, which exchange metadata such as IP addresses and session descriptions between peers during connection setup. While the primary objective of decentralized media streaming is to eliminate the need for centralized media servers, this does not necessarily imply the elimination of signaling servers. The proposed approach decentralizes media transport but still assumes the presence of signaling infrastructure for connection establishment.) The SFU model contrasts with the Multipoint Control Unit (MCU), which decodes, mixes, and re-encodes streams, resulting in higher server-side processing. Compared to mesh-based systems, which require full-mesh P2P connections between all clients, SFU-based systems scale better and reduce bandwidth usage on client devices.
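The arithmetic behind the mesh-versus-SFU scalability claim can be sketched as follows (an illustrative aside, not part of the proposed system):

```typescript
// Client-side stream counts in full-mesh vs. SFU topologies for n clients.

/** In a full mesh, every client sends its stream to every other client. */
function meshUploadsPerClient(n: number): number {
  return n - 1;
}

/** Total pairwise connections in a full mesh of n clients: n(n-1)/2. */
function meshTotalConnections(n: number): number {
  return (n * (n - 1)) / 2;
}

/** With an SFU, each client uploads exactly one copy of its stream. */
function sfuUploadsPerClient(_n: number): number {
  return 1;
}

console.log(meshUploadsPerClient(10));  // 9
console.log(meshTotalConnections(10));  // 45
console.log(sfuUploadsPerClient(10));   // 1
```

With ten clients, a full mesh requires each client to upload nine copies of its stream, while an SFU reduces this to one, which is why client bandwidth usage drops so sharply.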
Mediasoup is a representative open-source SFU library widely used in scalable WebRTC applications. It supports dynamic forwarding of audio and video streams and provides fine-grained control over transport and routing behavior. Mediasoup uses internal components such as PipeTransport, which allows two Mediasoup routers to interconnect—enabling multi-server, multi-region deployments.
Media transmission in WebRTC is carried over RTP, with control messages exchanged via RTCP. RTP handles timestamped packet delivery for audio and video, while RTCP provides feedback on packet loss, jitter, and round-trip time, supporting adaptive bitrate control and stream quality monitoring.
In addition to media channels, WebRTC includes the RTCDataChannel API, which enables low-latency, reliable or unreliable transmission of arbitrary data. In our proposed system, the DataChannel is employed to coordinate dynamic overlay reconfiguration among super nodes, such as exchanging control messages for speaker transitions and network updates. This use of the DataChannel complements the media path and supports real-time responsiveness to conversational dynamics.
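A minimal sketch of the kind of control messages that might travel over the DataChannel between super nodes is shown below. The message names and fields are illustrative assumptions, not the system's exact wire format; DataChannel payloads are strings (or binary), so messages are JSON-encoded before `channel.send(...)` and decoded on receipt.

```typescript
// Hypothetical control-message vocabulary for super node coordination.
type ControlMessage =
  | { kind: "speaker-transition"; newSpeakerId: string }
  | { kind: "topology-update"; addedLinks: [string, string][] }
  | { kind: "load-report"; nodeId: string; cpuLoad: number };

/** Serialize a control message for transmission over the DataChannel. */
function encode(msg: ControlMessage): string {
  return JSON.stringify(msg);
}

/** Decode a received DataChannel payload back into a control message. */
function decode(raw: string): ControlMessage {
  return JSON.parse(raw) as ControlMessage;
}

const msg = decode(encode({ kind: "speaker-transition", newSpeakerId: "peer-42" }));
console.log(msg.kind); // "speaker-transition"
```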
Taken together, these components form the technical backbone of our proposed multi-hop, decentralized video communication system. By combining an SFU-based architecture with real-time signaling via the WebRTC DataChannel and leveraging the underlying RTP/RTCP protocols for media delivery and quality control, the system achieves scalable, low-latency performance suitable for dynamic, conversation-aware scenarios. The architectural choices are guided by lessons from both centralized and distributed systems, offering a hybrid approach that balances robustness, flexibility, and user autonomy.
Before turning to related work on WebRTC-based remote conferencing systems in the next section, we first highlight the key differences between our approach and existing multi-hop video streaming systems that do not rely on WebRTC. In contrast to prior application-layer multicast methods, our system offers several notable advantages: (1) it is built on WebRTC, a widely adopted and standardized technology that ensures broad compatibility across platforms; (2) unlike conventional approaches that typically assume a single, fixed source node, our system supports dynamically changing source nodes—such as speakers in multi-party conversations; and (3) it incorporates built-in support for backup nodes, greatly enhancing fault tolerance and overall system resilience. These features collectively contribute to a more robust and adaptable solution for real-time communication in dynamic environments.

3. Related Work

WebRTC was initially introduced by Google as an open-source project in June 2011. It soon became the subject of standardization efforts by the W3C and IETF, culminating in the publication of WebRTC 1.0 as a W3C Candidate Recommendation in January 2018, and its finalization as a W3C Recommendation in January 2021. Originally intended for browser-based video streaming, WebRTC has since found widespread adoption across diverse domains such as telemedicine (e.g., Doxy.me (https://doxy.me/en/, accessed on 5 June 2025), MDLIVE (https://www.mdlive.com/, accessed on 5 June 2025)) and peer-to-peer file sharing (e.g., ShareDrop (https://sharedrop.io/, accessed on 5 June 2025), FilePizza (https://file.pizza/, accessed on 5 June 2025)).
This section reviews the evolution of WebRTC-based conferencing systems (Section 3.1) and discusses recent studies that evaluate WebRTC performance and techniques for enhancing media stream quality (Section 3.2).

3.1. WebRTC-Based Conferencing Systems

Early attempts to support multi-user video communication using WebRTC 1.0 include the work of Elleuch et al. [17], who proposed two conferencing models for browser-to-browser communication. For small-scale meetings, they employed a full-mesh topology using a simple signaling mechanism. For larger meetings, they proposed a topology control method based on application-layer multicast, with a central server used for coordination. This approach organizes participants into a tree structure to overcome the scalability limitations of full-mesh networks. While conceptually related to our method, their model does not distinguish between regular peers and super nodes, limiting its adaptability to dynamic group membership changes.
Suciu et al. [18] developed a browser-based P2P real-time communication system leveraging WebRTC, HTML5, and a Node.js signaling server. Their system integrates ScaleDrone to support high-speed message delivery and demonstrates the feasibility of using WebRTC for scalable video conferencing directly within web browsers, with Mozilla Firefox as their primary test platform.
Jawad [19] examined the role of 5G networks in enhancing WebRTC performance. The study reported measurable improvements in connection setup times, video resolution, and throughput. These results underscore the importance of network infrastructure in delivering seamless real-time communication with minimal latency and buffering.
In a more immersive context, Hu et al. [20] designed a mesh-based metaverse platform for inclusive education. Their system supports high-quality audio, video, and spatial interactions within virtual rooms, regardless of user count. To address scalability, they employed Sora—an SFU developed by Shiguredo—which is capable of supporting up to 1000 participants per room, ensuring robust communication quality even under heavy loads.

3.2. Performance Evaluation and Quality Enhancement of WebRTC

Extensive research has been conducted to evaluate the performance of WebRTC and improve the quality of transmitted media streams, particularly under dynamic and heterogeneous network conditions. Key areas of focus include congestion control, error resilience, and transmission efficiency.
One of the foundational mechanisms for congestion control in WebRTC is Receive-side Real-Time Congestion Control (RRTCC). Singh et al. [21] evaluated the performance of RRTCC across various communication topologies, including full-mesh and centralized models. Their experiments tested WebRTC endpoints under a range of network impairments, such as fluctuating bandwidth, packet loss, latency, and competing TCP traffic. While RRTCC performed adequately in controlled settings, its effectiveness significantly degraded under TCP competition, exposing its limitations in maintaining transmission stability in mixed-protocol environments.
Building on this, Jansen et al. [22] examined WebRTC performance using emulated network conditions. Their results revealed that WebRTC traffic tends to be prioritized slightly above TCP flows. However, in wireless networks, the performance of WebRTC exhibited notable variability, underscoring the need for adaptive strategies to ensure consistent quality of experience (QoE) across diverse deployment scenarios.
Error correction is another critical component in WebRTC. Forward Error Correction (FEC) is widely adopted to mitigate packet loss, but higher FEC rates increase bandwidth usage and processing overhead, potentially reducing video quality. To address this trade-off, Lee et al. [23] proposed R-FEC, a reinforcement-learning-based framework that dynamically adjusts video and FEC bitrates based on real-time network conditions. By learning from historical data, R-FEC balances error resilience and bandwidth consumption, achieving improved QoE in video conferencing applications.
To further enhance reliability, multipath transmission has emerged as a promising extension to traditional single-path WebRTC communication. Sathyanarayana et al. [24] introduced Converge, a WebRTC-compliant platform that leverages multiple transmission paths to deliver video content. Converge features a video-aware scheduler that intelligently distributes packets based on video structure, supported by receiver-side QoE feedback for real-time optimization. Moreover, it introduces a path-specific FEC mechanism that enhances resilience without incurring excessive overhead. Experimental evaluations demonstrated that Converge outperforms standard WebRTC by increasing media throughput by a factor of 1.2, reducing end-to-end latency by 20%, and improving video quality by 55%.
Collectively, these studies highlight both the limitations of standard WebRTC implementations and the effectiveness of novel approaches to enhance real-time media streaming. The method proposed in this paper is complementary to these efforts: while existing solutions improve stream-level transmission quality, our approach focuses on the adaptive reconfiguration of the network overlay to support dynamic conversational flows. Both directions can be jointly employed to deliver a robust and scalable online communication experience.

4. Multi-Hop Distributed Streaming Model

4.1. Overview

This paper proposes a multi-hop video streaming architecture based on WebRTC and Mediasoup (https://mediasoup.org/, accessed on 5 June 2025) to enhance the scalability and efficiency of conventional SFU-based architectures. The architecture employs a pool of Mediasoup SFUs and leverages dynamic routing and load balancing to ensure low latency, high concurrency, and system stability.
Extending SFU-based architectures from single-hop to multi-hop fundamentally alters the trade-off between participant capacity and end-to-end latency. Consider an SFU with sufficient upload bandwidth to stream simultaneously to up to k users. In a single-hop architecture, the maximum number of participants is limited to k + 1 (including the sender). By introducing an additional relay hop, the maximum number of participants increases from k + 1 to k^2 + 1, as each of the k nodes in the second layer can further distribute the stream to an additional k recipients. This improvement in scalability, however, comes at the cost of increased end-to-end latency.
The latency increase in the above extension can be understood as follows. In a single-server setup, the communication path typically consists of two hops: one from the sender to the SFU server and another from the SFU server to the receivers. When the stream is relayed through x super nodes arranged in sequence, the total number of hops becomes x + 1, and the resulting latency increases by a factor of (x + 1)/2 relative to the single-server case. For example, when there are two relay super nodes (x = 2), the stream passes through three hops in total, and the latency increases by a factor of 3/2 = 1.5. This trade-off is particularly beneficial in large-scale communication scenarios, where optimizing bandwidth efficiency while maintaining acceptable latency is critical.
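The capacity/latency trade-off above can be worked through numerically (an illustrative sketch using the formulas just derived):

```typescript
// Capacity and latency for the multi-hop extension with SFU fan-out k.

/** Max participants (including the sender) with fan-out k and `layers` relay layers. */
function maxParticipants(k: number, layers: number): number {
  return Math.pow(k, layers) + 1;
}

/** Latency factor relative to the two-hop single-server case with x chained super nodes. */
function latencyFactor(x: number): number {
  return (x + 1) / 2;
}

console.log(maxParticipants(10, 1)); // 11  (single-hop: k + 1)
console.log(maxParticipants(10, 2)); // 101 (one extra relay layer: k^2 + 1)
console.log(latencyFactor(2));       // 1.5 (three hops vs. two)
```

With k = 10, one additional relay layer raises capacity from 11 to 101 participants at a 1.5x latency cost, which is the trade-off the text describes.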
In the proposed architecture, WebRTC serves as the core transmission protocol, enabling direct P2P data exchange with support for adaptive bitrate control and network traversal mechanisms. Concurrently, the Mediasoup SFU functions as a super node, optimizing bandwidth efficiency through selective media forwarding and traffic management. Beyond media forwarding, super nodes perform essential computational tasks, including traffic monitoring, fault detection, and recovery. These capabilities enhance system resilience and ensure high availability, allowing seamless adaptation to network fluctuations and failures while maintaining consistent, high-quality real-time communication.
For clarity, we focus on the case where the number of hops is limited to three. These three hops consist of the following:
  • The first hop, where the speaker transmits media to the nearest super node.
  • The intermediate hop, where super nodes relay streams between each other.
  • The final hop, where listeners receive media from their respective super nodes.
Figure 1 shows a video streaming model with three hops.
More specifically:
  • In the first hop, the speaker establishes a WebRTC connection with the nearest super node (Mediasoup SFU). Once the connection is established, the super node receives the incoming audio and video streams and performs initial bandwidth optimizations, such as bitrate adjustments.
  • In the intermediate hop, super nodes exchange signaling information via WebRTC DataChannel or WebSockets and optimize media forwarding. The system dynamically selects the optimal relay path to minimize transmission latency.
  • In the final hop, the end users (receivers) retrieve media streams from their respective super nodes, ensuring efficient and scalable content delivery.
It is important to note that a conventional SFU does not include the intermediate hop, as media streams are directly forwarded from the speaker’s client to the listeners’ clients via a single SFU in a two-hop configuration. The introduction of the intermediate relay layer significantly enhances scalability while maintaining a structured and efficient routing mechanism.
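The intermediate-hop path choice described above can be sketched as a minimum-delay selection over candidate relays. The data shapes and measurement fields here are illustrative assumptions, not the system's actual interfaces:

```typescript
// Pick the relay super node minimizing total speaker-to-listener delay.
interface RelayCandidate {
  nodeId: string;
  fromSpeakerMs: number; // delay: speaker's super node -> this relay
  toListenerMs: number;  // delay: this relay -> listener's super node
}

/** Return the candidate with the smallest end-to-end relay delay. */
function pickRelay(candidates: RelayCandidate[]): RelayCandidate {
  return candidates.reduce((best, c) =>
    c.fromSpeakerMs + c.toListenerMs < best.fromSpeakerMs + best.toListenerMs ? c : best
  );
}

const chosen = pickRelay([
  { nodeId: "sn-a", fromSpeakerMs: 10, toListenerMs: 20 },
  { nodeId: "sn-b", fromSpeakerMs: 5, toListenerMs: 8 },
]);
console.log(chosen.nodeId); // "sn-b"
```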

4.2. Optimized Media Transmission via RTP and RTCP Integration

To ensure efficient and reliable media transmission in real-time communication scenarios, the proposed system integrates RTP and its companion protocol, RTCP. These protocols serve as the foundation for the system’s media delivery and quality control mechanisms, enabling seamless audio and video streaming even under dynamic network conditions.
The super-node-based architecture in the proposed system is specifically designed to maximize the advantages of RTP and RTCP for optimized media transmission. Each super node operates as an SFU, selectively forwarding RTP media streams based on real-time network conditions and participant requirements. This approach minimizes bandwidth consumption while maintaining low latency. Furthermore, RTCP plays a crucial role in real-time performance monitoring and adaptive adjustment of transmission parameters, ensuring sustained media quality and network efficiency.
The system implements the following key mechanisms to enhance media transmission:
  • Dynamic Stream Rerouting: When high packet loss, jitter, or latency is detected, a super node dynamically reroutes the affected stream through an alternative super node. This proactive mechanism enhances transmission stability, mitigates network disruptions, and ensures continuous media playback.
  • Adaptive Bitrate Adjustment: In response to network congestion or fluctuating network conditions, super nodes dynamically regulate media stream bitrates. This adaptive control maintains smooth playback while preventing excessive network load, optimizing both performance and resource utilization.
  • Aggregated RTCP Feedback for Network Optimization: Super nodes collect RTCP feedback from multiple participants, aggregating data to generate a holistic view of network conditions, including packet loss, delay, and jitter. By analyzing this aggregated feedback, the system enables intelligent, centralized decision-making to refine media distribution strategies, optimize transmission routes, dynamically adjust codec settings, and effectively balance network load.
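The aggregated-RTCP decision logic above can be sketched as follows. The thresholds and report fields are illustrative assumptions; a real super node would derive them from RTCP receiver reports:

```typescript
// Decide whether a stream should be rerouted based on pooled RTCP feedback.
interface RtcpReport {
  packetLossPct: number;
  jitterMs: number;
}

/** Flag a stream for rerouting when average loss or jitter exceeds a threshold. */
function shouldReroute(
  reports: RtcpReport[],
  maxLossPct: number = 5,
  maxJitterMs: number = 30
): boolean {
  if (reports.length === 0) return false;
  const avg = (sel: (r: RtcpReport) => number) =>
    reports.reduce((sum, r) => sum + sel(r), 0) / reports.length;
  return avg(r => r.packetLossPct) > maxLossPct || avg(r => r.jitterMs) > maxJitterMs;
}

console.log(shouldReroute([{ packetLossPct: 1, jitterMs: 5 }]));  // false
console.log(shouldReroute([{ packetLossPct: 8, jitterMs: 10 }])); // true
```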
By tightly integrating RTP and RTCP within the super node architecture, the proposed system achieves high scalability, low latency, and robust adaptability. This integration ensures consistent media quality and resilience, even in dynamic real-time communication environments characterized by unpredictable network fluctuations.

4.3. Interconnecting Super Nodes via Mediasoup PipeTransport

Mediasoup’s PipeTransport feature provides a built-in mechanism for SFU-to-SFU interconnection, enabling direct links between SFUs to facilitate efficient media stream forwarding in large-scale WebRTC deployments. PipeTransport enables SFUs to transmit Producer data (i.e., media streams) and establish Consumer links (i.e., stream subscriptions) across interconnected super nodes, ensuring scalable and low-latency media distribution.
The proposed system leverages this capability to interconnect super nodes, each functioning as a Mediasoup SFU. Every super node creates and configures a PipeTransport instance to establish connections with other super nodes, forming a distributed media forwarding network that supports dynamic stream routing and load balancing across SFUs.
To maintain optimal performance, the system dynamically reassigns user media streams between SFUs based on real-time load conditions and latency requirements. When an SFU is overloaded or when a lower-latency alternative is available, the current SFU transfers the user’s Producer stream and associated transport information to the target SFU. The new SFU then retrieves the corresponding media stream via PipeTransport, enabling a seamless handover without noticeable disruption.
Additionally, when a new user joins, a dynamic load balancing algorithm evaluates each super node’s CPU load, bandwidth usage, and active connections, assigning the user to the most suitable super node. This approach effectively distributes system load, minimizes latency, and enhances scalability.
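The join-time load balancing step can be sketched as a weighted scoring over the metrics listed above. The weights and normalization are illustrative assumptions, not the system's tuned values:

```typescript
// Assign a joining user to the least-loaded super node.
interface SuperNodeStats {
  nodeId: string;
  cpuLoad: number;         // normalized 0..1
  bandwidthUsage: number;  // normalized 0..1 of capacity
  activeConnections: number;
}

/** Score each node (lower is better) and return the best node's id. */
function assignUser(nodes: SuperNodeStats[], maxConnections: number = 100): string {
  const score = (n: SuperNodeStats) =>
    0.4 * n.cpuLoad +
    0.4 * n.bandwidthUsage +
    0.2 * (n.activeConnections / maxConnections);
  return nodes.reduce((best, n) => (score(n) < score(best) ? n : best)).nodeId;
}

console.log(assignUser([
  { nodeId: "sn-a", cpuLoad: 0.9, bandwidthUsage: 0.9, activeConnections: 90 },
  { nodeId: "sn-b", cpuLoad: 0.2, bandwidthUsage: 0.3, activeConnections: 10 },
])); // "sn-b"
```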
In the event of an SFU failure or unresponsiveness, the system dynamically reroutes traffic by redistributing affected media streams to available super nodes via PipeTransport. If an SFU becomes unresponsive, active connections are reassigned to alternative super nodes, ensuring service continuity. This failover mechanism improves system reliability and resilience, preventing service disruptions.
The details of the reconstruction procedure are explained in Section 5.

4.4. Controlling the Super Node Network Configuration

While PipeTransport in Mediasoup enables media transmission between SFUs, Mediasoup itself does not provide mechanisms for managing session states, tracking user presence, or handling stream handovers. These aspects must be implemented at the application layer.
To address this limitation, the proposed system leverages the WebRTC DataChannel as a control channel for real-time coordination among SFUs. This channel is used to exchange session-related metadata, including ICE candidate updates, user join/leave events, and stream status notifications.
Through the DataChannel, SFUs continuously share operational information such as stream availability, packet loss reports, user reassignment requirements, CPU load metrics, bandwidth usage, and network congestion levels. This collaborative data exchange enables dynamic and efficient resource allocation across the network.
In particular, when an SFU failure is detected, the remaining SFUs coordinate through the DataChannel to redistribute active media streams and reassign affected users to alternative SFUs with sufficient capacity. This fault recovery mechanism ensures service continuity and significantly improves system resilience, especially in large-scale, real-time communication environments.
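The redistribution step of this fault recovery mechanism can be sketched as follows. The capacity bookkeeping is an illustrative assumption; in practice the surviving SFUs would negotiate reassignments over the DataChannel:

```typescript
// Redistribute a failed SFU's users across surviving SFUs with spare capacity.
interface Sfu {
  nodeId: string;
  users: string[];
  capacity: number; // maximum users this SFU can serve
}

/** Reassign users of `failedId` to survivors; returns the updated survivor list. */
function failover(sfus: Sfu[], failedId: string): Sfu[] {
  const failed = sfus.find(s => s.nodeId === failedId);
  if (!failed) return sfus;
  const survivors = sfus
    .filter(s => s.nodeId !== failedId)
    .map(s => ({ ...s, users: [...s.users] }));
  for (const user of failed.users) {
    const target = survivors.find(s => s.users.length < s.capacity);
    if (target) target.users.push(user); // else: no spare capacity remains
  }
  return survivors;
}
```

For example, if SFU "a" (serving two users) fails while SFU "b" has spare capacity, both users are moved to "b" and service continues without a central coordinator.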

5. Efficient Support for Changes in Chat Groups and Speaker Rotation

5.1. Role Distribution of Super Nodes

The proposed system is built upon the multi-hop SFU model introduced in the previous section. This system is specifically designed to support large-scale, real-time WebRTC-based communication by enabling efficient media stream distribution, scalability, and fault tolerance through hierarchical super node organization (see Figure 2 for illustration). The system is composed of two types of nodes: participating nodes, which represent end-user clients, and super nodes, which serve as media relay points. Among the super nodes, a subset is maintained in an inactive, standby state as backup super nodes, ready to take over in case of failure or load spikes.
The main components of the architecture and their respective roles are described below:
  • Participating Nodes: These are individual end-user devices connected to the WebRTC network. Each participating node establishes a direct media connection with its designated super node, which acts as its gateway to the larger system.
  • Super Nodes: Functioning as the primary SFUs in the multi-hop hierarchy, super nodes are responsible for routing, forwarding, and aggregating media streams from multiple participating nodes. They also handle key system operations such as congestion control and load balancing. In the proposed system, super nodes can relay media not only to participating nodes but also to other super nodes positioned at different levels of the hierarchy, forming a multi-hop transmission path.
  • Backup Super Nodes: These are pre-initialized super nodes that remain in standby mode under normal conditions. They continuously synchronize with active super nodes to maintain consistent session states. When an active super node becomes overloaded or fails, a backup super node can be dynamically activated to assume its responsibilities, thereby supporting seamless failover and enhancing system robustness.
Figure 2 illustrates the system configuration after a user has completed the joining procedure. Once connected to an assigned super node, the user’s media streams are transmitted to that super node, which then forwards the aggregated streams to downstream super nodes or directly to other participants as needed. The arrows in the figure represent these transmission paths. Media flows follow a hierarchical, multi-hop structure, where higher-level super nodes relay data to lower-level nodes, reducing redundant connections and optimizing bandwidth.
In case of failure or overload, backup super nodes are activated to maintain service continuity. Thanks to synchronized session states, failover occurs seamlessly. To handle scenarios where new failures are detected during the activation of backup nodes, the system proactively activates additional backup nodes in parallel. This parallel activation strategy enhances fault tolerance by maintaining recovery progress even in the face of cascading failures. The degree of redundancy—namely, the number of backup nodes activated in advance—is determined by a trade-off between resource overhead and the required level of resilience. In practice, the system adaptively adjusts this redundancy level based on the observed failure frequency in the deployment environment. In addition, it is important to note that when the total number of participating nodes—including backup nodes—is fixed, the system’s fault tolerance inevitably degrades as more super nodes become unavailable over time. Once all backup nodes are exhausted, the system can no longer handle additional failures, potentially resulting in adverse effects such as the rejection of new participant requests or interruptions to ongoing services.
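As an illustration of the redundancy trade-off, the following sketch chooses how many backup nodes to keep pre-activated. The concrete policy (one baseline backup, plus one per failure observed in the recent monitoring window, capped by the standby pool size) is a hypothetical example; the paper only states that the redundancy level adapts to the observed failure frequency.

```go
package main

import "fmt"

// backupCount chooses how many backup super nodes to keep pre-activated,
// trading resource overhead against resilience. The policy (1 baseline
// backup plus 1 per recently observed failure, capped by the standby
// pool) is an illustrative assumption, not the paper's exact heuristic.
func backupCount(recentFailures, poolSize int) int {
	n := 1 + recentFailures
	if n > poolSize {
		n = poolSize // cannot pre-activate more nodes than the pool holds
	}
	return n
}

func main() {
	fmt.Println(backupCount(0, 5)) // quiet environment: 1 backup suffices
	fmt.Println(backupCount(3, 5)) // failure-prone environment: 4 backups
	fmt.Println(backupCount(9, 5)) // capped by the standby pool: 5
}
```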
In the following subsections, we sequentially describe the procedures executed when a participating node joins a conversation group (Section 5.2), leaves a conversation group (Section 5.3), and switches speakers within a conversation group (Section 5.4).

5.2. Procedure for Joining a Conversation Group

When a new user attempts to join an ongoing conversation, the system follows a hierarchical node assignment policy designed to ensure optimal resource utilization, maintain load balance, and guarantee seamless media stream distribution. Although the system is described as the acting agent in the following explanation, it should be noted that the actual join request is sent to a specific super node identified via an appropriate bootstrap process. Therefore, it is equally valid to interpret the super node that receives the join request as the principal actor. While the specific implementation may vary, the bootstrap mechanism can be designed to recommend a single, definitive super node, thereby preventing potential resource contention among multiple super nodes. The admission procedure is executed through the following steps:
  • Evaluation of Capacity in Active Super Nodes:
    The system first examines the currently active super nodes to determine whether any have sufficient capacity to accommodate the new participant.
    Action:
    • If a suitable super node is found, the new user establishes a connection and immediately begins receiving media streams through it.
    Benefit: Leveraging already-operational resources minimizes system overhead and reduces connection latency.
  • Activation of Unused Super Nodes:
    If all active super nodes are operating at full capacity, the system searches for unused but pre-initialized super nodes.
    Action:
    • An unused super node is activated and assigned to handle the new participant’s connection and media delivery.
    Benefit: Efficient load balancing is achieved without impacting the performance of existing active nodes.
  • Promotion of Backup Super Nodes:
    If no unused super nodes are available, the system promotes a backup super node from the standby pool. Backup super nodes continuously maintain synchronized session states, enabling rapid activation when needed.
    Action:
    • A backup super node is activated and designated to manage the new participant’s session.
    Benefit: This ensures fault tolerance and continuous service availability, even during periods of high load or unexpected node failures.
  • Denial of Participation Request:
    If no suitable super node is available at any level, the system denies the user’s request to join the conversation.
    Action:
    • A rejection message is sent to the user, informing them that participation is currently unavailable.
    Benefit: Protects system stability and preserves the quality of service for existing participants by preventing resource exhaustion.
By following this structured and hierarchical admission policy, the system dynamically adapts to fluctuating participant loads while maintaining operational stability and responsiveness. Through proactive resource management and automated failover mechanisms, the system remains highly scalable, reliable, and capable of supporting large-scale real-time communication environments.
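The four-step admission policy above can be condensed into a short sketch. The Go code below is illustrative only: the `SuperNode` structure and capacity fields are assumptions, and a real deployment would perform these steps via signaling rather than in-process calls.

```go
package main

import (
	"errors"
	"fmt"
)

type NodeState int

const (
	Active NodeState = iota
	Unused // pre-initialized but not yet serving users
	Backup // standby with synchronized session state
)

type SuperNode struct {
	ID       string
	State    NodeState
	Users    int
	Capacity int
}

// admit implements the hierarchical admission policy of Section 5.2:
// (1) an active super node with spare capacity, (2) otherwise an unused
// pre-initialized node, (3) otherwise a promoted backup node,
// (4) otherwise the join request is denied.
func admit(nodes []*SuperNode) (*SuperNode, error) {
	for _, pref := range []NodeState{Active, Unused, Backup} {
		for _, n := range nodes {
			if n.State == pref && n.Users < n.Capacity {
				n.State = Active // unused/backup nodes are activated on assignment
				n.Users++
				return n, nil
			}
		}
	}
	return nil, errors.New("participation denied: no capacity at any level")
}

func main() {
	nodes := []*SuperNode{
		{ID: "A", State: Active, Users: 4, Capacity: 4}, // full
		{ID: "B", State: Backup, Users: 0, Capacity: 4},
	}
	n, err := admit(nodes)
	fmt.Println(n.ID, err) // prints "B <nil>": backup promoted since A is full
}
```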

5.3. Procedure for Leaving a Conversation Group

When a user leaves an ongoing conversation, the system initiates a structured procedure to reclaim resources, maintain network integrity, and ensure uninterrupted communication among the remaining participants. This departure process is executed through the following coordinated steps:
  • Termination of Participant Connection:
    The system first identifies the super node associated with the departing participant and terminates the participant’s active media stream connection.
    Action:
    • Cease the participant’s media sessions and release the resources allocated to the corresponding super node.
    Benefit: Immediate resource recovery reduces bandwidth consumption and computational load, thereby enhancing overall system efficiency.
  • Assessment of Super Node Occupancy:
    Following the termination, the system evaluates whether the departure has left the super node vacant. Occupancy information for pre-initialized but unused super nodes is refreshed at the same time to maintain operational flexibility.
    Action:
    • If other participants remain connected, the super node updates its internal connection registry without further intervention.
    • If no active participants remain, the system proceeds to optimize resource usage in the next step.
    Benefit: This selective response minimizes unnecessary processing overhead and maintains efficient operation.
  • Super Node Role Adjustment and Resource Management:
    If a super node is found to be vacant, the system determines the appropriate course of action based on the node’s operational role.
    Action:
    • For active super nodes: Downgrade the node to a backup role, preserving synchronized state information for rapid reactivation if needed.
    • For backup super nodes: Fully release associated resources and transition the node into a shutdown state.
    Benefit: Role-based management preserves critical standby nodes while optimizing resource allocation and improving energy efficiency.
  • Reconfiguration of Media Stream Routing:
    To maintain communication integrity and adaptability, the system autonomously reconfigures media stream forwarding and signaling paths in response to the participant’s departure.
    Action:
    • Recalculate the media transmission topology among the remaining participants.
    • Reconstruct media stream routing paths and broadcast updated connection states to all affected nodes to ensure network consistency and synchronization.
    Benefit: This mechanism enables continuous and stable media delivery even under dynamic network conditions, preventing communication interruptions and quality degradation, thereby ensuring a seamless user experience.
By following this structured departure procedure, the system achieves robust resource management and autonomous network self-maintenance, significantly enhancing its scalability and reliability in large-scale, real-time multimedia communication scenarios.
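The occupancy check and role adjustment of the departure procedure can be sketched as follows; the types and the in-process state transitions are illustrative assumptions rather than the authors' implementation.

```go
package main

import "fmt"

type Role int

const (
	ActiveRole Role = iota
	BackupRole
	ShutdownRole
)

type SuperNode struct {
	ID    string
	Role  Role
	Users int
}

// onLeave applies the role-adjustment step of Section 5.3: after the
// departing participant's session is released, a vacated active node is
// downgraded to a backup role, while a vacated backup node is shut down.
func onLeave(n *SuperNode) {
	if n.Users > 0 {
		n.Users-- // release the participant's media session and resources
	}
	if n.Users > 0 {
		return // other participants remain; only the registry is updated
	}
	switch n.Role {
	case ActiveRole:
		n.Role = BackupRole // keep synchronized state for rapid reactivation
	case BackupRole:
		n.Role = ShutdownRole // fully release associated resources
	}
}

func main() {
	n := &SuperNode{ID: "A", Role: ActiveRole, Users: 1}
	onLeave(n)
	fmt.Println(n.Role == BackupRole) // true: the vacated active node is downgraded
}
```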

5.4. Procedure for Switching Speakers Within the Same Conversation Group

In modern WebRTC-based real-time multi-party communication systems, an increase in the number of participants often introduces technical challenges, including communication delays, media stream instability, and server or network load concentration. To address these issues, this study proposes a dynamic routing and load balancing mechanism utilizing super nodes and backup nodes, enabling efficient and seamless speaker transitions.
When a speaker switch occurs within the same conversation group, the system executes the following structured procedure:
  • Speaker Identification and Node State Update:
    The system first identifies the participant designated as the new active speaker and broadcasts signaling messages to all associated super nodes and backup nodes to coordinate the transition.
    Action:
    • The signaling control module broadcasts a speaker-change notification, specifying both the new and previous speakers.
    • Upon receiving the notification, each super node updates its internal state to prioritize media streams originating from the new speaker node.
    Benefit: Rapid synchronization of node states minimizes coordination delays, enabling swift and smooth speaker transitions across the system.
  • Real-time Dynamic Resource Allocation:
    Following the speaker update, the system dynamically reallocates network and computational resources to meet the new speaker’s media transmission requirements.
    Action:
    • Bandwidth and computational resources are automatically reallocated among super nodes and backup nodes to prioritize the new speaker’s media stream.
    • Based on current load conditions, backup nodes are selectively activated to maintain balanced resource utilization across the network.
    Benefit: Precise resource management prevents node overload, improves system performance, and ensures continuous high-quality media delivery.
  • Transition Confirmation and Logging:
    After completing the reconfiguration of media streams and the reallocation of resources, the system verifies the success of the transition.
    Action:
    • Each super node and backup node reports the successful completion of the speaker-switching process to the central signaling server.
    • The system logs detailed records of the transition events and any corresponding changes in node load conditions.
    Benefit: Enhanced system observability and maintainability facilitate ongoing optimization and troubleshooting, improving overall system robustness.
By employing this structured speaker-switching mechanism, the system can adapt swiftly to changes in participant activity and varying network conditions, thereby maintaining high communication quality and significantly enhancing overall system stability and user experience.
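The state-update step of the speaker switch can be illustrated with a short sketch in which each node holds a per-speaker forwarding priority. The priority values and the `SpeakerChange` message fields are hypothetical; the paper does not specify the notification format.

```go
package main

import "fmt"

// SpeakerChange models the signaling message broadcast on a speaker
// switch (Section 5.4). Field names are illustrative assumptions.
type SpeakerChange struct {
	NewSpeaker  string
	PrevSpeaker string
}

// Node keeps a per-stream forwarding priority; streams with higher
// values are forwarded first when bandwidth is constrained.
type Node struct {
	ID       string
	Priority map[string]int
}

// apply updates a node's internal state so that streams from the new
// speaker are prioritized, as each super node does on receiving the
// speaker-change notification.
func (n *Node) apply(msg SpeakerChange) {
	n.Priority[msg.PrevSpeaker] = 1 // demote the previous speaker
	n.Priority[msg.NewSpeaker] = 10 // prioritize the new speaker's media
}

func main() {
	nodes := []*Node{
		{ID: "sfu-A", Priority: map[string]int{"alice": 10, "bob": 1}},
		{ID: "sfu-B", Priority: map[string]int{"alice": 10, "bob": 1}},
	}
	msg := SpeakerChange{NewSpeaker: "bob", PrevSpeaker: "alice"}
	for _, n := range nodes { // the signaling module broadcasts to all nodes
		n.apply(msg)
	}
	fmt.Println(nodes[0].Priority["bob"]) // 10
}
```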

6. Experiments

This section describes the details of the experiments conducted to evaluate the performance of the proposed system.

6.1. Setup

To ensure a comprehensive evaluation of the proposed system, we designed a structured experimental environment consisting of virtual client simulations, testing tools, monitoring utilities, and well-defined test scenarios.
To simulate both super node and backup node behavior, multiple mediasoup instances were deployed using Docker containers. Each node was assigned a dedicated Docker image with isolated network configurations, allowing each mediasoup instance to operate in an independent container. Signaling channels and media ports were uniquely allocated per container, and environment variables were used to explicitly distinguish the node roles (either super or backup). The orchestration of multiple nodes was managed using docker-compose, and inter-node communication was enabled via Docker’s virtual networking capabilities. Each container image included built-in mechanisms for automatic registration and service discovery, which enabled real-time state synchronization between nodes.
By default, the super node is set to be online and is responsible for handling all incoming media requests. In contrast, the backup node remains in a passive listening state where it only synchronizes states but does not process any media traffic under normal conditions. In the event of node failure or load balancing requirements, container-level health checks or external monitoring mechanisms automatically activate or escalate the processing capabilities of the backup node, thereby enabling failover functionality.
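As a rough illustration, the container layout described above might be expressed in docker-compose as follows. Service names, the image name, port ranges, the `SFU_ROLE` environment variable, and the health-check endpoint are all hypothetical, not taken from the actual configuration.

```yaml
version: "3.8"
services:
  super-node-a:
    image: mediasoup-node            # hypothetical image name
    environment:
      - SFU_ROLE=super               # role flag read at startup
    ports:
      - "3001:3001"                  # signaling channel
      - "40000-40099:40000-40099/udp" # media ports, unique per container
  backup-node-d:
    image: mediasoup-node
    environment:
      - SFU_ROLE=backup              # passive: synchronizes state only
    ports:
      - "3004:3004"
      - "40300-40399:40300-40399/udp"
    healthcheck:                     # container-level health check used for failover
      test: ["CMD", "curl", "-f", "http://localhost:3004/health"]
      interval: 5s
```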

6.1.1. Test Tool Preparation

To realistically simulate user interactions and media streaming behavior, we employed Apache JMeter combined with custom WebRTC test scripts as the primary performance evaluation tool. JMeter is an open-source load testing tool that natively supports only HTTP/HTTPS requests, but its functionality can be extended to WebRTC-related operations through plug-ins such as the WebSocket Sampler and customized scripts.
In this study, we tailored JMeter scripts to support the automated execution of the following core operations:
  • Simulated interactions for signaling processes typical of user behavior, including room creation, room joining, and room leaving, as well as SDP exchange.
  • Establishment and termination of media sessions.
  • Parallel simulation of multiple concurrent users to emulate realistic load conditions.
These scripts allowed for batch generation of user join/leave events under controlled conditions. Simultaneously, performance metrics were collected via performance listeners, enabling the assessment of system responsiveness and stability under varying user scales and network load levels. The operations related to conversation group maintenance are detailed below.
  • Create Room: This operation represents the complete process initiated when a user creates a new session. It includes WebRTC signaling exchanges, server-side resource allocation, and session state initialization. The corresponding response time reflects the system’s efficiency in initializing new communication sessions.
  • Join Call: This refers to the sequence of events triggered when a user attempts to join an active session. It involves hierarchical node allocation strategies, super node selection based on real-time conditions, relay path construction, and synchronization of session state between nodes.
  • Leave Call: This operation captures the process executed when a user leaves an ongoing session, including the deallocation of routing resources, updates to the overlay network topology, and activation of backup nodes if necessary to maintain session continuity.
To comprehensively evaluate the dynamic behavior, performance, and elasticity of the proposed WebRTC-based conferencing system under varying user load conditions, we designed two types of test scenarios: (i) baseline load scenarios, and (ii) dynamic load mutation scenarios. These experiments were conducted using Apache JMeter 5.5, which enables fine-grained control over virtual user behavior, including thread lifecycle management, load intensity, and abrupt departure events. Details of the baseline scenario are presented in Section 6.2, while the dynamic load mutation scenario is described in Section 6.4.

6.1.2. Evaluation Metrics

The primary monitoring indicators were as follows:
  • Response Time: The latency between a user’s operation and the system’s corresponding response.
  • Throughput: The number of requests processed per unit time, used to measure the system’s processing capacity.
  • Packet Loss Rate: The proportion of audio and video packets lost during media streaming sessions.
  • System Resource Utilization: Node-level load indicators such as CPU usage, memory consumption, and network bandwidth utilization.
This experimental setup enabled us to perform a detailed evaluation of system behavior under both nominal and stress conditions, thereby providing valuable insights into its real-world feasibility and robustness.

6.2. Basic Test Scenarios

To evaluate the system’s performance under increasing concurrency levels, we designed a structured set of baseline test scenarios. These scenarios simulate varying user loads to assess the system’s stability, responsiveness, and scalability. Specifically, we conducted experiments under four configurations, corresponding to concurrent virtual user counts of 10, 20, 50, and 100 (The upper bound of 100 participants is motivated by findings from foundational studies in conversation analysis [25,26], which indicate that smooth turn-taking in face-to-face communication is most common in groups of four to six people. When the number of participants exceeds this range, conversations tend to fragment into multiple subgroups—a phenomenon known as schisming. This behavioral pattern is also reflected in the interface and interaction design of widely used video conferencing platforms such as Zoom and Microsoft Teams.).
For each configuration, the system must support three core operations: room creation, user joining, and user leaving. These operations are simulated independently using separate JMeter thread groups. This modular design enables fine-grained measurement and analysis by isolating the performance impact of each operation, thereby facilitating accurate identification of potential bottlenecks or inefficiencies in the system.
Key configuration parameters for the baseline tests are as follows:
  • Thread Count (Concurrent Virtual Users): 10, 20, 50, and 100. Each configuration represents a distinct concurrency level used to evaluate the system’s resource usage, responsiveness, and bandwidth efficiency.
  • Test Structure: Each test run simulates a single session unit representing a complete interaction lifecycle: Create Room → Users Join One by One → Stable Interaction → Users Leave One by One.
  • Session Duration: Each session unit spans a total of 20 s, divided into four distinct behavioral phases to reflect realistic interaction dynamics. Although the total number of users per test remains fixed, the number of active participants changes over time to simulate various stages of the session lifecycle.
  • Ramp-Up Period: 1 s. All threads (i.e., virtual users) are launched linearly during this period. For example, in the 10-user case, threads start at 0.1-s intervals; for 20 users, at 0.05-s intervals, and so on.
  • Loop Count: Infinite. Each thread repeatedly performs its designated operations throughout the 20-s session duration to maintain continuous system load.
  • Thread Lifetime Control: Enabled via the “Specify Thread Duration” option. This ensures that each thread remains active for a defined time span, rather than a fixed number of iterations, thereby simulating more realistic user behavior.
The temporal behavior of threads is structured as follows:
  • 0–1 s: Threads start linearly and execute the room creation operation once. Users then begin joining the room gradually.
  • 2–19 s: Thread count remains constant, simulating a stable period of interaction among users.
  • 19–20 s: Threads terminate linearly, representing users leaving the call one by one.

Results

Table 1 summarizes the results of the baseline testing scenario. For each user load configuration (10, 20, 50, and 100 concurrent users), we conducted 10 complete session units. For instance, under the 10-user setting, each session unit generated 1 sample for “Create Room”, and 10 samples each for “Join Call” and “Leave Call”, yielding the sample counts reported in the table.
Across all concurrency levels, the error rate remained consistently at 0% for all three operations—room creation, joining, and leaving calls. This demonstrates that the system can reliably process requests as long as sufficient resources are available.
In terms of bandwidth usage, both inbound and outbound traffic stayed within stable and predictable ranges:
  • Received bandwidth: approximately 186–190 KB/s.
  • Sent bandwidth: approximately 56–59 KB/s.
These results indicate efficient resource utilization, with no evidence of congestion or transmission bottlenecks, even under the highest load conditions.
Response time metrics are presented as averages computed from all collected samples. While average response time increases with higher user concurrency, the increase follows a roughly linear pattern rather than an exponential one, demonstrating graceful degradation—the system continues to function reliably under pressure, with no crashes, freezes, or timeouts.
Specifically, the average response times for the operations are as follows:
  • Create Room: Remains nearly constant across all user loads. This is likely because room creation is only triggered during the ramp-up phase and involves minimal contention.
  • Join Call: Increases from 87 ms (10 users) to 739 ms (100 users).
  • Leave Call: Increases from 85 ms to 647 ms over the same range.
These trends can be attributed to the differing complexity of each operation:
  • Join Call involves multi-step coordination, including layered node selection and session state replication, resulting in the highest latency and variability.
  • Leave Call entails resource cleanup and topology updates, leading to moderate latency.
  • Create Room is a lightweight initialization process with the lowest latency and least variability.

6.3. CPU Usage Under Varying User Loads

Building upon the baseline tests for functionality and scalability, we conducted a system-level performance evaluation with a specific focus on the CPU utilization of super nodes under varying user load conditions. This evaluation aimed to assess the efficiency and elasticity of the super node architecture in handling concurrent operations.

6.3.1. Setup

Prior to the start of each experiment, all super node components—including both primary and backup nodes—were fully initialized and remained active throughout the testing process. This setup ensured that the system was in a ready state to dynamically manage load distribution and failover if necessary.
The user load conditions in this test reused the same thread configurations as the baseline scenarios, representing 10, 20, 50, and 100 concurrent virtual users. Each test was conducted over a 20-s duration during which the system continuously processed user-driven operations such as dynamic room creation, session joining, and real-time message forwarding. These operations were designed to simulate typical conferencing workloads under varying levels of concurrency.
To monitor system resource consumption, CPU usage data was sampled at 1-s intervals throughout the duration of each test. The primary metric analyzed was the average CPU utilization of super nodes, which serves as an indicator of workload distribution and the effectiveness of the system’s load-balancing mechanisms. Special attention was given to the behavior of backup nodes: whether they became actively engaged when the primary node approached saturation, thereby demonstrating the system’s ability to support elastic scaling and failover.
CPU data collection was implemented using a lightweight custom monitoring script developed in Go. This script leveraged the runtime package along with Linux’s /proc/stat interface to parse per-core CPU statistics and compute dynamically updated average CPU utilization. The monitoring approach introduced minimal overhead while maintaining high responsiveness, making it well suited for capturing transient spikes and sustained load patterns during intensive session control and message processing.
Through this experiment, we were able to identify potential performance bottlenecks in the super node design and assess the system’s capacity for fault tolerance and dynamic resource reallocation. The results offer valuable insights for optimizing future task scheduling policies, adaptive node activation strategies, and the design of scalable, cooperative multi-node architectures.

6.3.2. Results

Table 2 presents the load distribution across three super nodes A, B, and C as the number of concurrent users increases from 10 to 100. As expected, the load rates rise progressively with user concurrency, indicating a direct correlation between user demand and system resource utilization.
Under low user loads (10 concurrent users), each super node operates well below 20% load, demonstrating substantial spare capacity. This confirms that the system is highly responsive and elastic in handling small-scale conferencing scenarios, with ample headroom for load fluctuations. As the user load scales to 50 and 100 concurrent users, the load rates of the super nodes increase significantly, ranging from 70% to 90%. Importantly, the load remains within the operational threshold (i.e., load rate < 1), indicating that the system can stably accommodate medium- to large-scale conferencing sessions without approaching overload conditions.
The sharp increase in load observed as the number of users grows can be attributed to the system’s fundamental communication model. In our architecture, each user in a conversation group receives video streams from every other participant. This results in a total number of video streams that scales proportionally to the square of the number of users. Given that the number of super nodes is fixed in our experimental setup, the processing load per super node also increases approximately quadratically with respect to the number of participants.
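This quadratic relationship is easy to make concrete: with N participants each receiving the streams of the other N - 1, the group carries N(N - 1) streams in total, which the fixed set of super nodes must relay. The small sketch below assumes an even split across the three super nodes, which is a simplification of the actual routing.

```go
package main

import "fmt"

// totalStreams returns the number of video streams in a group where each
// of n participants receives a stream from every other participant.
func totalStreams(n int) int { return n * (n - 1) }

func main() {
	superNodes := 3 // fixed in the experimental setup
	for _, n := range []int{10, 50, 100} {
		s := totalStreams(n)
		// Approximate per-node load under an assumed even split.
		fmt.Printf("%3d users: %5d streams (~%d per super node)\n",
			n, s, s/superNodes)
	}
}
```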
To further enhance reliability, the system employs a dynamic load-balancing mechanism through backup node D. At the 100-user level, backup node D is automatically activated when the primary super nodes near their critical load levels. It offloads approximately 20% of the overall traffic, effectively distributing the processing burden and preventing system saturation.
This dynamic backup activation illustrates the effectiveness of the system’s high-availability strategy. By seamlessly integrating the backup node into the operational flow, the system ensures continuous service quality even under stress. Moreover, the multi-super node architecture significantly boosts scalability by distributing workloads across multiple nodes and avoiding single points of failure.
In summary, this result demonstrates strong architectural resilience, robust load-balancing behavior, and adaptive elasticity—key properties for maintaining service continuity and quality under diverse and growing user demands.

6.4. Dynamic Load Mutation Scenario

To further assess the adaptability and resilience of the system under extreme fluctuations in user load, we designed a dynamic load mutation scenario in which the number of concurrent users abruptly drops from 100 to 50. This setup emulates real-world events such as large-scale user disconnections, network outages, or abrupt session terminations.

6.4.1. Setup

In this scenario, the system is subjected to continuous user activity involving room creation, user joining, and leaving operations, while the load dynamically shifts mid-execution. Importantly, the join operations after the load drop do not represent users rejoining the same original room. Instead, they simulate new users joining entirely separate sessions. To achieve this, the threads responsible for these join operations are independently defined and isolated from the original set of 100 users, ensuring that the effective participant count in the original session remains unaffected.
This design maintains the validity of our core assumption—that the number of active concurrent users sharply decreases from 100 to 50 during the test. The primary objective is to evaluate the system’s capacity to (i) promptly release unused resources, (ii) downgrade overloaded super nodes, (iii) activate and manage backup nodes appropriately, and (iv) reconfigure media routing paths on the fly, all while preserving communication quality and service continuity.
  • Test Configuration Parameters:
    • Number of Threads (Concurrent Virtual Users): 100 dropping to 50.
      All 100 threads are initiated simultaneously at time 0. At exactly 10 s, 50 threads are terminated instantly to simulate an abrupt decrease in system load.
    • Ramp-Up Period: 0 s.
      All threads are launched simultaneously without any ramp-up delay.
    • Loop Count: Infinite.
      Each thread continues to execute the defined operations until the total test duration elapses. The full duration of the test is 20 s.
    • Thread Lifetime Control: Enabled (“Specify thread duration”).
      Each thread is configured to run for a set duration rather than a fixed number of iterations. This method offers higher fidelity in simulating real-time load transitions and usage patterns.
  • Thread Count Variation Over Time:
    • 0–10 s: 100 threads are maintained.
    • At 10 s: Thread count drops abruptly from 100 to 50.
    • 11–20 s: 50 threads are maintained.
Unlike scenarios that employ a gradual or linear load reduction, this test deliberately simulates a sudden, large-scale drop in user activity to evaluate the system’s responsiveness to abrupt load changes. By applying such a stress pattern, we aim to observe how efficiently the system adapts in terms of resource reallocation, fault tolerance, and load redistribution across super nodes.
By conducting experiments under various dynamic and static load scenarios, we are able to quantitatively evaluate the system’s responsiveness, performance variability, and stability across diverse conditions. These evaluations contribute to a holistic understanding of the scalability, resilience, and operational efficiency of the WebRTC-based conferencing system.

6.4.2. Results

Table 2 shows the system behavior under different user loads. When simulating 50 concurrent users, no backup node was activated. In contrast, the 100-user scenario triggered the activation of one backup node. This observation suggests that the threshold for backup node activation lies somewhere between 50 and 100 users. Identifying this activation boundary is crucial not only for validating the system’s adaptability under sudden load surges, but also for evaluating the performance impact of the backup mechanism itself.
To further investigate system behavior under dynamic load changes, we designed an experiment simulating a sharp drop in user load—from 100 to 50 users—within a short time frame. We employed Docker containers to precisely control the activation and deactivation of nodes during this test, thereby allowing us to observe how the system adjusts its resources in real time.
In the first scenario, where backup nodes were disabled, the system processed a total of 3506 samples. These included 1123 Create Room, 1189 Join Call, and 1194 Leave Call requests. To visualize the distribution of these data points, a scatter plot was generated, as shown in Figure 3. Additionally, we plotted the per-second distribution of response times for each function to assess temporal performance trends, as depicted in Figure 4.
In the second scenario, where all backup nodes were enabled, the system handled a total of 3541 samples—comprising 1194 Create Room, 1163 Join Call, and 1184 Leave Call operations. Figure 5 presents the corresponding scatter plot for this case. As in the previous scenario, per-second response time distributions were computed and are visualized in Figure 6 for direct comparison.

6.5. Observations

Through analysis of the experimental data, we derived several important insights regarding system behavior under fluctuating load conditions and the role of backup nodes in maintaining performance stability.

6.5.1. Overall Response Time Trends

High-Concurrency Phase (0–10 s): During the initial phase with 100 concurrent users, the overall response times across all three operations—Create Room, Join Call, and Leave Call—were noticeably elevated. Among them, Join Call consistently recorded median response times exceeding 800 ms, with certain intervals spiking above 1000 ms. Leave Call also exhibited significant fluctuation and a dense concentration of outliers, indicating instability under intense system load. While Create Room showed relatively more stable behavior, it, too, reflected the impact of concurrent pressure. The wide interquartile ranges (IQRs) and frequent outliers observed in all three functions clearly point to an overloaded state during this high-concurrency phase.
Post-Drop Phase (10–20 s): Following a sharp reduction in user count from 100 to 50 at the 10-s mark, system response times declined dramatically starting from the 11th second. Both Create Room and Leave Call quickly stabilized, with median response times falling into the 200–400 ms range. Join Call, though still slower in comparison, also showed improved consistency. Notably, the response time distributions became more compact, variability decreased, and the frequency of outliers dropped significantly. These results demonstrate the system’s rapid recovery capability and confirm its inherent elasticity in adapting to abrupt load changes.

6.5.2. Impact of Backup Nodes on Response Time

A comparative examination of the results from systems with and without backup nodes (Figure 3 and Figure 5, respectively) reveals several critical differences:
Without Backup Nodes: The system under high-concurrency conditions without backup support exhibited a high volume of outliers, particularly in Join Call and Leave Call. Response time distributions were broad and erratic, reflecting poor stability and the emergence of processing bottlenecks. These suggest that the primary super nodes alone were insufficient to handle peak loads effectively, resulting in degraded service quality.
With Backup Nodes: In contrast, enabling backup nodes yielded significant improvements. The number of outliers decreased substantially across all operations, and the response time distributions narrowed. Even during the 100-user high-load phase, Create Room and Leave Call maintained moderate and consistent response times. Join Call, although still the slowest, demonstrated reduced variability and better latency control. These findings indicate that the backup nodes effectively offloaded traffic from overloaded primary super nodes, thereby enhancing system resilience and maintaining performance.
Furthermore, a comparison of Figure 4 and Figure 6 reinforces this conclusion. The temporal patterns of response times in the backup-enabled configuration showed much smoother transitions, especially during the initial 10-s high-load period. This further validates the backup mechanism’s key role in ensuring robust system performance under large-scale concurrent request scenarios.

6.6. Comparison with Existing Methods

The proposed approach relays media streams through multiple super nodes to achieve both resilience and scalability. By avoiding reliance on a single SFU server, it eliminates the single point of failure and mitigates resource limitations such as upload bandwidth by distributing the load across multiple servers. To the best of our knowledge, no existing WebRTC-based video streaming system adopts a multi-hop architecture such as ours. While this makes direct comparison with prior work difficult, we demonstrate the advantages of our method through comparisons with a single SFU server setup.

6.6.1. Performance Evaluation with a Single SFU Server

The thread count variation over time follows the setup described in Section 6.4.1. Below, we present the performance evaluation results for the single SFU server configuration, organized chronologically:
0–10 s: During this period, the system maintained 100 concurrent threads. The CPU on the single-node setup was fully saturated, operating at 100% utilization. Under this high-load condition, the response times for each operation were as follows:
  • Create Room: Max 1903 ms, Min 273 ms, Avg 699.33 ms, Std Dev 214.48 ms.
  • Join Call: Max 2871 ms, Min 364 ms, Avg 1425 ms, Std Dev 400.75 ms.
  • Leave Call: Max 2655 ms, Min 281 ms, Avg 1247 ms, Std Dev 374.20 ms.
10 s: The thread count was abruptly reduced from 100 to 50.
11–20 s: With 50 threads, CPU utilization dropped slightly and stabilized around 90%, indicating a moderate load. Under these conditions, response times improved significantly:
  • Create Room: Max 1588 ms, Min 132 ms, Avg 393.58 ms, Std Dev 146.42 ms.
  • Join Call: Max 2171 ms, Min 259 ms, Avg 847 ms, Std Dev 295.44 ms.
  • Leave Call: Max 1850 ms, Min 253 ms, Avg 682 ms, Std Dev 269.75 ms.

6.6.2. Comparison with Multi-Node Setup (No Backup Nodes)

As described in Section 6.4.2, the performance of a multi-node configuration without backup nodes was evaluated under the same load conditions. During the 0–10 s interval, with 100 threads and high CPU utilization, the response times for each operation were as follows:
  • Create Room: Max 1783 ms, Min 72 ms, Avg 547 ms, Std Dev 166.30 ms.
  • Join Call: Max 1943 ms, Min 271 ms, Avg 987 ms, Std Dev 327.53 ms.
  • Leave Call: Max 1802 ms, Min 255 ms, Avg 804 ms, Std Dev 296.40 ms.
Compared to the single-node setup under the same high-load condition, the multi-node configuration yielded lower average response times and smaller standard deviations, indicating better overall stability and less performance variability.
During the 11–20 s interval, when the number of threads was reduced to 50, the observed metrics were as follows:
  • Create Room: Max 709 ms, Min 114 ms, Avg 379 ms, Std Dev 133.20 ms.
  • Join Call: Max 1439 ms, Min 183 ms, Avg 726 ms, Std Dev 287.51 ms.
  • Leave Call: Max 1452 ms, Min 89 ms, Avg 517 ms, Std Dev 271.44 ms.
Under this moderate load, the single-node system exhibited performance comparable to the multi-node configuration. While the multi-node setup still showed slightly better average response times, the standard deviations were nearly identical, suggesting that both configurations maintain similar levels of performance stability when not fully saturated.
In summary, the multi-node architecture provides significantly improved stability and lower response time variability under high-load conditions, while offering comparable performance to the single-node setup under moderate load.

6.7. Evaluation of the Efficiency of Turn-Taking

This subsection evaluates the performance of the proposed system in managing speaker turn-taking during real-time group conversations. The focus is on assessing signaling latency and server-side responsiveness under varying levels of user concurrency.

6.7.1. Setup

To this end, we designed an experiment to measure the efficiency and scalability of speaker-switch handling within fixed WebRTC session groups. Apache JMeter was employed to simulate virtual users, grouped into a single session room to eliminate inter-room interference and ensure uniform signaling behavior. Four concurrency levels were tested: 10, 20, 50, and 100 users. In each scenario, all users simultaneously issued speaker-switch requests within a narrow time window, replicating high-concurrency interaction patterns observed in real-time conversations.
To support the real-time, bidirectional nature of WebRTC signaling, the WebSocket Sampler plugin is integrated into JMeter, alongside custom scripts that emulate speaker-switch commands. Each virtual user, after joining the session, sends a speaker-switch request via a WebSocket channel to the WebRTC signaling server. Upon receipt, the server executes the appropriate switching logic and returns a confirmation message through the same channel.
Client-side signaling performance is quantified using the Round-Trip Time (RTT), defined as the time elapsed between the issuance of a speaker-switch command and the reception of the corresponding confirmation message. This serves as a direct measure of end-to-end signaling latency from the user’s perspective when the processing time of the signaling server is negligible. In parallel, server-side processing latency is evaluated by recording two internal timestamps:
  • T1—the time at which the server receives the speaker-switch request.
  • T2—the time at which the speaker-switch operation completes and the updated speaker state is broadcast to all session participants.
The server-side switching latency is then calculated as follows:
T = T2 − T1
This metric captures the internal processing delay associated with applying speaker-switch logic. It is worth noting that when a user requests to leave a conversation group, the perceived latency experienced by the user generally aligns with the previously defined RTT. However, in cases where the signaling server defers its response until after the corresponding overlay processing has been fully completed, the client’s perceived response time effectively becomes the sum of the server-side switching latency T and the RTT. This distinction is important for accurately interpreting end-to-end responsiveness under different types of user operations. To ensure the reliability and statistical significance of the results, each scenario is repeated 10 times. Furthermore, JMeter’s synchronized thread startup mode is enabled to guarantee that all virtual users initiate their requests simultaneously, thereby imposing consistent and controlled concurrency stress on the server.

6.7.2. Results

This experiment utilizes Apache JMeter to simulate speaker-switching operations with high fidelity under varying levels of user concurrency, enabling a comprehensive evaluation of system responsiveness across different load conditions. For each concurrency level, we conducted multiple rounds of sampling and applied statistical analysis to ensure the reliability of the results. Figure 7 shows the duration of ten speaker turn-taking operations measured under each concurrency setting, with each dashed line representing the mean of 10 independent trials. Two key performance metrics are used in this evaluation: RTT and T. RTT denotes the client-side round-trip time, capturing the end-to-end signaling delay experienced by the client. In contrast, T represents the server-side switching latency, reflecting the time required by the signaling server to internally process and complete the speaker transition.
The results indicate that the system performs reliably under medium-scale concurrency (≤50 users): both the RTT and server-side switching latency T exhibit low variance and a linear growth trend as the load increases. This suggests that the system maintains good real-time responsiveness and scalability in typical conferencing scenarios. Even under high-concurrency conditions (100 users), while the server-side latency increases significantly, the overall delay remains within the generally acceptable threshold for real-time video conferencing (<500 ms), validating the practical feasibility of the proposed system.

7. Concluding Remarks

This paper proposes a multi-hop P2P video streaming method that dynamically adapts to the evolving structure of online conversations. The proposed system is built on WebRTC with Mediasoup SFU and adopts a decentralized, super node-based architecture capable of real-time reconfiguration of the network topology in response to conversational changes. By aggregating video streams received from multiple peer nodes into several super nodes, the system eliminates reliance on a central media server, thereby providing a flexible and resilient communication infrastructure. To evaluate the system’s performance under varying loads, we conducted experiments using Docker containers and Apache JMeter. The results demonstrate that, due to the system’s inherent load-balancing capabilities, performance degradation with increasing group size remains graceful, while the total load increases in proportion to the square of the group size. Specifically, the average time required for a node to join a conversation group was approximately 104 ms for a group of 10 participants and 1101 ms for a group of 100. Moreover, throughput degradation remained below 10% even as the group size increased. In terms of turn-taking performance, the average response time for speaker transitions was 150 ms for a group of 10 and 550 ms for a group of 100, indicating that the system maintains real-time responsiveness across a wide range of concurrency levels.
One promising direction we are currently exploring is the prediction of speech initiation based on nonverbal cues such as gaze shifts and preparatory movements. Prior studies have demonstrated that speaker transitions can be predicted with approximately 70% accuracy between 0.3 to 1.0 s before speech onset [27,28,29]. Integrating such predictive techniques into our system could potentially compensate for the observed 550 ms delay under high user concurrency. Complementary to this, another important avenue for future work lies in addressing the architectural challenges introduced by delegating processing to backup nodes. While this strategy offers clear advantages, it also raises concerns regarding coordination and resource allocation, especially as the system scales. Tackling these issues—possibly through the use of software-defined networking (SDN)—will require further investigation beyond the current scope.

Author Contributions

Conceptualization, S.F.; methodology, S.F. and J.C.; software, J.C.; validation, S.F. and J.C.; data curation, S.F. and J.C.; writing—original draft preparation, S.F. and J.C.; writing—review and editing, S.F. and J.C.; visualization, J.C.; supervision, S.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code for the prototype system described in this paper, the code for the experiments conducted on it, and the raw data of the experimental results can be obtained from the following URL: https://github.com/cjj19990216/code accessed on 22 July 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Aristotle, U. The politics. In Democracy: A Reader; Columbia University Press: New York, NY, USA, 2016; pp. 200–202. [Google Scholar]
  2. Hook, D.; Franks, B.; Bauer, M. (Eds.) The Social Psychology of Communication; Palgrave Macmillan: London, UK, 2016. [Google Scholar]
  3. Sidnell, J.; Stivers, T. (Eds.) The Handbook of Conversation Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2012. [Google Scholar]
  4. Vangelisti, A.L. Communication in personal relationships. In APA Handbook of Personality and Social Psychology, Vol. 3. Interpersonal Relations; Mikulincer, M., Shaver, P.R., Simpson, J.A., Dovidio, J.F., Eds.; American Psychological Association: Washington, DC, USA, 2015; pp. 371–392. [Google Scholar]
  5. Lin, Y.; Zheng, Y.; Zeng, M.; Shi, W. Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic, Acoustic, and Visual Signals. arXiv 2025, arXiv:2505.12654. [Google Scholar]
  6. Skantze, G. Turn-taking in conversational systems and human-robot interaction: A review. Comput. Speech Lang. 2021, 67, 101178. [Google Scholar] [CrossRef]
  7. Ter Bekke, M.; Drijvers, L.; Holler, J. Hand gestures have predictive potential during conversation: An investigation of the timing of gestures in relation to speech. Cogn. Sci. 2024, 48, e13407. [Google Scholar] [CrossRef] [PubMed]
  8. Arora, N.; Suomalainen, M.; Pouke, M.; Center, E.G.; Mimnaugh, K.J.; Chambers, A.P.; Pouke, S.; LaValle, S.M. Augmenting immersive telepresence experience with a virtual body. IEEE Trans. Vis. Comput. Graph. 2022, 28, 2135–2145. [Google Scholar] [CrossRef] [PubMed]
  9. Irlitti, A.; Latifoglu, M.; Hoang, T.; Syiem, B.V.; Vetere, F. Volumetric hybrid workspaces: Interactions with objects in remote and co-located telepresence. In Proceedings of the CHI ’24: CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–16. [Google Scholar]
  10. Hu, E.; Grønbæk, J.E.S.; Houck, A.; Heo, S. Openmic: Utilizing proxemic metaphors for conversational floor transitions in multiparty video meetings. In Proceedings of the CHI ’23: CHI Conference on Human Factors in Computing Systems, Hamburg, Germany, 23–28 April 2023; pp. 1–17. [Google Scholar]
  11. Tosato, L.; Fortier, V.; Bloch, I.; Pelachaud, C. Exploiting temporal information to detect conversational groups in videos and predict the next speaker. Pattern Recognit. Lett. 2024, 177, 164–168. [Google Scholar] [CrossRef]
  12. Wang, P.; Han, E.; Queiroz, A.; DeVeaux, C.; Bailenson, J.N. Predicting and Understanding Turn-Taking Behavior in Open-Ended Group Activities in Virtual Reality. arXiv 2024, arXiv:2407.02896. [Google Scholar]
  13. Kushwaha, R.; Bhattacharyya, R.; Singh, Y.N. ReputeStream: Mitigating Free-Riding through Reputation-Based Multi-Layer P2P Live Streaming. arXiv 2024, arXiv:2411.18971. [Google Scholar]
  14. Ma, Y.; Fujita, S. Decentralized incentive scheme for peer-to-peer video streaming using solana blockchain. IEICE Trans. Inf. Syst. 2023, E106.D, 1686–1693. [Google Scholar] [CrossRef]
  15. Wei, D.; Zhang, J.; Li, H.; Xue, Z.; Peng, Y.; Pang, X.; Han, R.; Ma, Y.; Li, J. Swarm: Cost-Efficient Video Content Distribution with a Peer-to-Peer System. arXiv 2024, arXiv:2401.15839. [Google Scholar]
  16. Edan, N.M.; Al-Sherbaz, A.; Turner, S. WebNSM: A novel WebRTC signalling mechanism for one-to-many bi-directional video conferencing. In Proceedings of the 2018 SAI Computing Conference, London, UK, 10–12 July 2018; pp. 1–6. [Google Scholar]
  17. Elleuch, W. Models for multimedia conference between browsers based on WebRTC. In Proceedings of the 2013 IEEE 9th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob), Lyon, France, 7–9 October 2013; pp. 279–284. [Google Scholar]
  18. Suciu, G.; Stefanescu, S.; Beceanu, C.; Ceaparu, M. WebRTC role in real-time communication and video conferencing. In Proceedings of the 2020 Global Internet of Things Summit (GIoTS), Dublin, Ireland, 3–5 June 2020; pp. 1–6. [Google Scholar]
  19. Jawad, A.M. Using JavaScript on 5G networks to improve real-time communication through WebRTC. Al-Rafidain J. Eng. Sci. 2023, 1, 9–23. [Google Scholar] [CrossRef]
  20. Hu, Y.-H.; Ito, K.; Igarashi, A. Improving real-time communication for educational metaverse by alternative WebRTC SFU and delegating transmission of avatar transform. In Proceedings of the 2023 International Conference on Consumer Electronics—Taiwan (ICCE-Taiwan), PingTung, Taiwan, 17–19 July 2023; pp. 201–202. [Google Scholar]
  21. Singh, V.; Lozano, A.A.; Ott, J. Performance analysis of receive-side real-time congestion control for WebRTC. In Proceedings of the 2013 20th International Packet Video Workshop (PV), San Jose, CA, USA, 12–13 December 2013; pp. 1–8. [Google Scholar]
  22. Jansen, B.; Goodwin, T.; Gupta, V.; Kuipers, F.; Zussman, G. Performance evaluation of WebRTC-based video conferencing. ACM Sigmetrics Perform. Eval. Rev. 2018, 45, 56–68. [Google Scholar] [CrossRef]
  23. Lee, I.; Kim, S.; Sathyanarayana, S.; Bin, K.; Chong, S.; Lee, K.; Grunwald, D.; Ha, S. R-fec: Rl-based fec adjustment for better qoe in webrtc. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 2948–2956. [Google Scholar]
  24. Sathyanarayana, S.D.; Lee, K.; Grunwald, D.; Ha, S. Converge: Qoe-driven multipath video conferencing over WebRTC. In Proceedings of the ACM SIGCOMM ’23: ACM SIGCOMM 2023 Conference, New York, NY, USA, 10–14 September 2023; pp. 637–653. [Google Scholar]
  25. Schneider, L.; Goffman, E. Behavior in public places. Am. Sociol. Rev. 2008, 29, 427. [Google Scholar] [CrossRef]
  26. Sacks, H.; Schegloff, E.A.; Jefferson, G. A simplest systematics for the organization of turn-taking for conversation. Language 1974, 50, 696–735. [Google Scholar] [CrossRef]
  27. Ishii, R.; Otsuka, K.; Kumano, S.; Yamato, J. Prediction of who will be the next speaker and when using gaze behavior in multiparty meetings. ACM Trans. Interact. Intell. Syst. 2016, 6, 4. [Google Scholar] [CrossRef]
  28. Kawahara, T.; Iwatate, T.; Takanashi, K. Prediction of Turn-Taking by Combining Prosodic and Eye-Gaze Information in Poster Conversations. In Proceedings of the Interspeech 2012, Portland, OR, USA, 9–13 September 2012; pp. 727–730. [Google Scholar]
  29. Russell, S.O.C.; Harte, N. Visual cues enhance predictive turn-taking for two-party human interaction. arXiv 2025, arXiv:2505.21043. [Google Scholar]
Figure 1. Video streaming with three hops. The media stream originating from the speaker is first sent to the nearest super node, then relayed through additional super nodes before reaching the listeners. In this figure, each path from the speaker to a listener includes two intermediate super nodes, resulting in a total of three hops.
Figure 2. Overview of the proposed system immediately after a client node joins a conversation. Media streams are transmitted using Secure Real-time Transport Protocol (SRTP), an extension of RTP that provides encryption, message authentication, and integrity protection. SRTP is widely used in WebRTC-based applications to ensure secure audio and video communication.
Figure 3. Response time for each operation in a configuration without backup nodes.
Figure 4. Box plot of response time dispersion for each operation without using backup nodes.
Figure 5. Response time for each operation in a configuration with backup nodes.
Figure 6. Box plot of response time dispersion for each operation using backup nodes.
Figure 7. Box plots of turn-taking completion times across different user group sizes (10, 20, 50, 100). For each group size, ten trials were conducted, each requiring ten turn-taking interactions. Each box represents the distribution of completion times over the resulting 100 trials.
Table 1. Performance under different user loads.
# of Users  Label        Samples  Avg [ms]  Min [ms]  Max [ms]  Recv [KB/s]  Sent [KB/s]
10          Create room  10       46.0      42        51        190.24       59.44
10          Join call    100      87.4      74        113       190.29       58.71
10          Leave call   100      84.9      66        107       189.80       58.44
20          Create room  10       45.4      41        52        187.59       58.70
20          Join call    200      148.5     72        183       188.45       59.52
20          Leave call   200      169.5     69        157       188.57       59.91
50          Create room  10       48.4      42        52        187.42       58.28
50          Join call    500      337.5     81        474       188.47       57.96
50          Leave call   500      301.8     74        429       186.53       59.44
100         Create room  10       46.4      40        51        187.33       56.37
100         Join call    1000     739.4     76        1124      188.49       58.86
100         Leave call   1000     648.2     73        947       186.74       56.12
Table 2. Super node load levels corresponding to varying numbers of concurrent users. Each value indicates the CPU load ratio of super node A, B, C, or D, where 1.0 represents full CPU utilization (i.e., 100% load).
# of Users  A       B       C       D (Backup)
10          0.1633  0.1589  0.1598  0.0000
20          0.2479  0.2253  0.2392  0.0000
50          0.6933  0.6049  0.5831  0.0000
100         0.8824  0.8279  0.7942  0.6461