1. Introduction
Virtual Reality (VR) has gained great significance owing to advances in computing and display technologies. Filmmakers have already started to think creatively about VR technologies, since VR is more than a gaming trend. Applications in healthcare, immersive telepresence, telehealth, sports, and education are being rapidly commercialized to meet market demand and consumer expectations. The VR market was expected to reach a revenue of 108 billion USD by 2021 [1].
As one of the essential VR applications, 360-degree video offers the user an interactive experience that was not possible before. Many commercial broadcasters and video-sharing platforms are showing considerable interest in this domain. Microsoft has released its Windows Mixed Reality platform [2]. Another platform, ARTE360 VR by ARTE [3], enables the sharing and accessing of various omnidirectional videos through mobile applications.
Different 360-degree contents, including Natural Image (NI) and Computer Generated (CG) videos, are well suited to visualization on Head-Mounted Displays (HMDs) such as Oculus Rift [4], HTC Vive [5], Samsung Gear VR [6], and Google Cardboard [7]. These HMDs, equipped with multiple sensors, are much more commonly used than conventional display devices to view 360-degree videos. A 360-degree video lets the user control the viewport via head movements within a spherical video coverage of 360 × 180 degrees [8]. 360-degree videos can also be experienced within the HTML5 environment. In this context, WebVR [9] is a JavaScript API that uses the WebGL API to provide web-based support for a more relaxed VR experience. Within this framework, 360-degree videos need more optimization regarding camera settings, encryption, delivery, and rendering of immersive media content. Web-based 360-degree video rendering on high-resolution desktop monitors provides little or no sense of immersion. The latest HMDs, on the other hand, are the most demanding, fully immersive VR systems and offer a compelling experience of realism.
Figure 1 shows an equirectangular mapped 2K image where the yaw angle (−180 to +180 degrees) and pitch angle (−90 to +90 degrees) are mapped to the x-axis (0 to 1920 pixels) and y-axis (0 to 1080 pixels), respectively. Problems associated with 360-degree video streaming, particularly for high-resolution representations, include huge storage requirements, the limited Field of View (FoV) of the human visual system and display devices, interactivity, smooth user navigation, and resource-intensive coding [10]. A high-resolution 360-degree video is usually four to five times larger than a regular video [11]. Popular HMDs have a limited FoV, i.e., 100 degrees for Google Cardboard and Samsung Gear VR and 110 degrees for Oculus Rift and HTC Vive, as shown in Figure 2.
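The equirectangular mapping described above can be illustrated with a minimal Python sketch. The function name `erp_pixel`, the choice of placing pitch +90 degrees at the top row, and the clamping of the last pixel are assumptions made for illustration; they are not taken from the cited works.

```python
def erp_pixel(yaw_deg, pitch_deg, width=1920, height=1080):
    """Map a viewing direction to (x, y) on an equirectangular frame.

    Assumption: yaw -180..+180 spans the x-axis from left to right and
    pitch +90 (looking up) corresponds to the top row of the image.
    """
    if not (-180.0 <= yaw_deg <= 180.0 and -90.0 <= pitch_deg <= 90.0):
        raise ValueError("yaw must be in [-180, 180] and pitch in [-90, 90]")
    # Linear mapping of yaw to the horizontal pixel coordinate.
    x = (yaw_deg + 180.0) / 360.0 * width
    # Linear mapping of pitch to the vertical coordinate (y grows downward).
    y = (90.0 - pitch_deg) / 180.0 * height
    # Clamp so that yaw = +180 and pitch = -90 stay inside the frame.
    return min(int(x), width - 1), min(int(y), height - 1)


if __name__ == "__main__":
    print(erp_pixel(0.0, 0.0))       # center of the frame
    print(erp_pixel(-180.0, 90.0))   # top-left corner
```

Because the mapping is linear in both angles, rows near the poles represent far less of the sphere than rows near the equator, which is one reason why equirectangular frames are large relative to what the user actually sees.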
The transmission of 360-degree video is rather challenging, especially over current-generation cellular networks, because of their limited capacity and dynamic nature. Various 360-degree streaming solutions exist; one common solution is to project an equirectangular frame and split it into several rectangular regions known as tiles to overcome the bandwidth issue [12,13]. Among such solutions, some stream only the subset of tiles that covers the user’s current viewing region [14,15]. Such schemes restrict the user to viewing only a limited part of the video in the highest possible quality. Alternatively, one may transmit all the tiles of a 360-degree frame in variable qualities [16,17] to compensate for viewport prediction errors. The streaming of 360-degree video requires higher network bandwidth, as pixels from every direction are transmitted to users. Whether the views are delivered by unicast or multicast depends on the type of application. For VR/AR applications, the system integrates the user information into an enhanced view (e.g., virtual reality classrooms [18]).
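As a rough illustration of viewport-driven tile selection, the following Python sketch lists the tiles of an equirectangular frame that a viewport touches. The 8 × 4 tile grid, the function name `tiles_in_viewport`, and the approximation of the viewport as a rectangle in yaw/pitch space are assumptions; real systems account for the spherical distortion of the projection.

```python
import math


def tiles_in_viewport(yaw_deg, pitch_deg, fov_h=100.0, fov_v=100.0,
                      rows=4, cols=8):
    """Return sorted (row, col) indices of tiles touched by the viewport.

    Assumption: the viewport is treated as a yaw/pitch rectangle centered
    on the viewing direction; rows are counted from the top (pitch +90).
    """
    tile_w = 360.0 / cols
    tile_h = 180.0 / rows
    # Vertical extent, clipped at the poles.
    top = min(90.0, pitch_deg + fov_v / 2.0)
    bottom = max(-90.0, pitch_deg - fov_v / 2.0)
    first_row = int((90.0 - top) // tile_h)
    last_row = min(rows - 1, int((90.0 - bottom) // tile_h))
    # Horizontal extent; columns may run past the frame edge and wrap.
    left = yaw_deg - fov_h / 2.0
    first_col = math.floor((left + 180.0) / tile_w)
    last_col = math.floor((left + fov_h + 180.0) / tile_w)
    selected = set()
    for r in range(first_row, last_row + 1):
        for c in range(first_col, last_col + 1):
            selected.add((r, c % cols))  # modulo handles the yaw wrap-around
    return sorted(selected)


if __name__ == "__main__":
    front = tiles_in_viewport(0.0, 0.0)
    print(len(front), "of", 4 * 8, "tiles needed for the front view")
```

Under these assumptions a 100-degree viewport at the frame center touches half of the tiles, which hints at the bandwidth that can be saved by not streaming the remaining tiles at full quality.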
Multicast has a unique potential for reducing the bandwidth consumed by 360-degree videos. In contrast, unicast streaming of immersive content consumes substantial network resources and cannot meet the demand when many users wear HMDs to watch the same content. Multicast is therefore considered a highly feasible solution, although it introduces multiple challenges such as interactivity, fairness, and smooth quality. Multicast of 360-degree video has gained importance in the literature. A multicast of virtual reality (MVR) scheme [20] has been proposed for LTE networks by considering the adaptive streaming of VR content. This algorithm divides the users according to the weights of tiles and finds the bitrate for each tile. Similarly, VRCast [21] was designed for cellular networks that support multicast (e.g., LTE). It addresses the complex live streaming problem, divides the 360-degree video into small tiles, maximizes the viewport quality, and ensures fairness between users. Most current streaming works transmit panoramic pictures in a unicast manner; as a result, viewers watch only a small portion of the video, and the extra bits transmitted are wasted. Partial 360-degree video frames are transmitted to a single viewer in [22]. An approach was proposed in [23] to optimize the network bandwidth through multicast that transmits the 360-degree video efficiently. The feasibility of partial multicast frames was demonstrated by reducing the prediction errors, which preserves the user’s quality of experience (QoE) [24]. QoE is also essential for a seamless experience for end-users. 360-degree videos are complex, requiring fast decoding and sophisticated projection schemes that may incur high overhead. This paper presents and discusses the key technologies that support 360-degree video streaming to enable interactive and immersive experiences. Specifically, the general video streaming system, different streaming approaches, the immersive standardization and project efforts, the latest tools, open software, and the possible challenges and implications are discussed. The main contributions are summarized as follows:
- 1. This paper addresses the architecture of 360-degree video streaming. The purpose is to stay as close as possible to 360-degree video principles by considering both low-level and high-level perspectives. The content pre-processing stages, e.g., content acquisition, stitching, projection, and encoding, are considered. Then the transmission and consumption of the 360-degree video are reviewed.
- 2. The sophisticated streaming technologies for 360-degree video, including viewport-based, tile-based, and QoE-enabled solutions, are presented and discussed in detail. The paper also describes how high-resolution content is transmitted to single or multiple users.
- 3. The audio-related technologies that support the immersive experience are illustrated.
- 4. The technological efforts to enable an extended degree of freedom in immersive multimedia consumption are explained.
- 5. Different technical and design-related challenges and implications are presented for the sake of an interactive, immersive, and engaging experience with 360-degree video.
- 6. The potential of 360-degree video and guidelines for readers approaching research on 360-degree video streaming are presented.
The paper’s structure is as follows: Section 2 provides an overview of a 360-degree video streaming system. Section 3 outlines the major streaming approaches for 360-degree video. Section 4 briefly covers some technical issues of spatial audio for 360-degree media. Section 5 explains various technological efforts that aim to bring the virtual world close to the real world. Section 6 highlights the potential growth of 360-degree video based on applications. Section 7 presents some technical challenges and implications for creating an immersive, interactive, and engaging experience with 360-degree video. Lastly, Section 8 presents the discussion and conclusion. The schematic map of the paper is shown in Figure 3.
3. Overview of Video Streaming
Academic and industrial researchers have devoted a great deal of effort to solutions for streaming multimedia. Real-time requirements are the important difference between multimedia and other data network traffic, and they need special attention. Many standardization organizations have developed protocols to enable multimedia streaming and Quality-of-Service (QoS)-based streaming. Some early protocols built on top of the Internet Protocol (IP) were Integrated Services (IntServ) [66] and Differentiated Services (DiffServ) [67], which aim to ensure QoS-aware multimedia streaming. Resources are explicitly reserved by executing a signaling protocol [68] called Resource Reservation Protocol (RSVP) to satisfy the application specifications. This protocol is used by IntServ to define the QoS requirements for an application’s traffic. Real-time Transport Protocol (RTP) [69] enables end-to-end multimedia streaming in real time by proposing a standardized packet format. Application-layer framing and integrated layer processing are the two main concepts used in the design of RTP. Real-time Streaming Protocol (RTSP) [70] establishes a client-server connection to control the multimedia servers before the required video content is downloaded. Some researchers in [71] have shown that Transmission Control Protocol (TCP) is beneficial for transmitting delay-tolerant videos. TCP provides reliable data transmission but suffers from unpredictable delays. HTTP over TCP enables progressive downloading, in which a video file of constant quality is downloaded as quickly as TCP allows. A major downside is that clients receive the same video content under different network conditions, which can lead to unnecessary stalls or rebuffering. This situation has led researchers to develop HTTP adaptive streaming. 360-degree video streaming is different from traditional video streaming. This section presents a detailed overview of existing 360-degree video streaming solutions, followed by a summary of these solutions.
3.1. Adaptive 360-Degree Video Streaming
360-degree video streaming has gained vast importance in the multimedia world over the years. Implementing adaptive streaming of 360-degree video content in a VR environment is difficult because it needs smart streaming and encoding techniques to deal with present and future services and applications. Video compression standards exploit source coding from information theory and the characteristics of the human visual system to minimize spatio-temporal redundancies. Research on the perceptual compression of 360-degree video has focused on three essential aspects: video coding, visual perception, and quality assessment. Furthermore, a user can switch to neighboring views at any time during 360-degree video playback. The actual challenge is to facilitate smooth viewport switching by providing a certain level of error resilience that eliminates error propagation between differently encoded frames. Thus, viewport-based streaming strategies are mostly used to save resources while transmitting video streams.
3.2. Viewport-Based Streaming
Viewport-based adaptive streaming has gained attention in both industrial and academic communities. In viewport-dependent streaming, the end-user’s current viewport is identified from the user’s head movement. Such solutions are adaptive during the streaming of 360-degree videos, as they dynamically select regions and adjust the quality to minimize the transmitted bitrate. Several adaptation sets are provided at the server side, which is a viable option for keeping the viewport smooth during abrupt head movements. Each adaptation set contains the video area associated with a given viewing orientation. In [57], the authors proposed a differentiated quality approach where the front viewport is transmitted at a relatively high resolution compared to the other parts. They compared ERP and CMP multi-resolution variants with the current pyramid projection variants. Similarly, viewport-adaptive 360-degree video streaming is suggested in [1] to reduce the bandwidth. The concept of quality-emphasized regions (QERs) was introduced, in which a particular region of the video has a higher quality than the rest. However, these streaming approaches do not involve quality adjustment based on head movement prediction errors. The authors in [72] evaluated the impact of response delay on viewport-based adaptive streaming. The system provides server-based quality adjustment and view transmission to reduce the response delay by estimating the throughput and viewport signals. Based on the client’s feedback, the proposed framework estimates the network throughput and the future viewing position. The server then streams the necessary tiles to satisfy the delay constraints. Real-world experiments indicate that viewport-dependent adaptive streaming depends primarily on small adaptation and buffering delays. The initial results illustrate that this type of streaming is effective when the response delay is short. The nth interval of estimated viewport
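$V^{e}(n)$ and the $m$th adaptation interval of estimated throughput $T^{e}(m)$ are given below:
$$V^{e}(n) = V_{last}, \qquad T^{e}(m) = T_{last}$$
where $V_{last}$ and $T_{last}$ are the last reported viewport position and throughput, respectively. The bitrate computation $R_{bits}(m)$ based on $T^{e}(m)$ is given as:
$$R_{bits}(m) = (1 - \alpha)\, T^{e}(m)$$
where $\alpha$ is the safety margin. In [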
73], a joint adaptation based on network and buffer delay was studied. The proposed framework dynamically adjusts the viewport area so that the scene is visible with high likelihood at the time of rendering. It was shown that the proposed design provides flexible adaptation support to consume the available bandwidth efficiently. A navigation-aware optimization problem is studied in [74] to reduce both view switching delay and navigation distortions. An optimal solution with polynomial time complexity is obtained through a dynamic algorithm. The
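quality objective of the $k$th frame is computed from per-tile weights (Equation (4)), where the weight $w_{t}^{k}$ indicates how much tile $t$ overlaps the viewport at the $k$th frame. The weight $w_{t}^{k}$ is computed as:
$$w_{t}^{k} = \frac{A_{t}^{k}}{A_{vp}}$$
where $A_{t}^{k}$ is the overlapped area of tile $t$ and the viewport at the $k$th frame, and $A_{vp}$ is the total area of the viewport. The quality objective is then computed from these weighted tile contributions.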
Oculus’ viewport-based streaming implementation is analyzed in [59], indicating that this implementation is inefficient: 20% of the bandwidth is wasted downloading video segments that are never used. The asymmetric viewport-based technique streams 360-degree content with different spatial resolutions to save bandwidth. During video playback, the client requests a version of the video based on the user’s orientation. The advantage of this approach is that even if the client incorrectly anticipates the user’s orientation, low-quality content can still be rendered in the user’s viewport. However, such a scheme involves huge storage and processing overheads in most cases. Viewport-independent streaming is the most straightforward way to stream 360-degree video, since the entire frame is transmitted in the same quality, as with conventional videos. Its simple implementation makes viewport-independent streaming appealing. However, its coding efficiency is 30% lower than that of viewport-dependent streaming [75]. Additionally, the invisible areas consume a lot of bandwidth and decoding resources. This form of streaming [76,77] mainly applies to content streaming.
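The two quantities discussed in this subsection, a throughput-based target bitrate with a safety margin and a tile weight given by the overlap with the viewport, can be sketched in a few lines of Python. The function names, the rectangular pixel-space representation of tiles and viewport, and the form "(1 − margin) × last reported throughput" are assumptions made for illustration and should not be read as the exact formulation of [72] or [74].

```python
def target_bitrate(last_throughput_kbps, safety_margin=0.2):
    """Target bitrate from the last reported throughput.

    Assumption: the estimated throughput equals the last reported one and
    a fraction `safety_margin` of it is left unused as headroom.
    """
    if not 0.0 <= safety_margin < 1.0:
        raise ValueError("safety_margin must be in [0, 1)")
    return (1.0 - safety_margin) * last_throughput_kbps


def overlap_weight(tile, viewport):
    """Share of the viewport area covered by a tile.

    Both arguments are (left, top, right, bottom) rectangles in pixels.
    """
    width = min(tile[2], viewport[2]) - max(tile[0], viewport[0])
    height = min(tile[3], viewport[3]) - max(tile[1], viewport[1])
    if width <= 0 or height <= 0:
        return 0.0  # no overlap
    viewport_area = (viewport[2] - viewport[0]) * (viewport[3] - viewport[1])
    return (width * height) / viewport_area


if __name__ == "__main__":
    viewport = (50, 50, 150, 150)
    tiles = [(0, 0, 100, 100), (100, 0, 200, 100),
             (0, 100, 100, 200), (100, 100, 200, 200)]
    print([overlap_weight(t, viewport) for t in tiles])
    print(target_bitrate(10000, safety_margin=0.25))
```

When the tiles fully cover the viewport, the weights sum to one, so they can be used directly as the coefficients of a weighted viewport quality.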
3.3. Tile-Based Streaming
In tile-based streaming, a video is divided temporally into segments as in traditional HTTP-based adaptive streaming. Moreover, these video segments are spatially divided into tiles, so that several spatial tiles compose each temporal segment. Because the client needs to buffer some amount of video to ensure continuous playback, it pre-fetches video segments based on viewport prediction. As an earlier work, [78] performed tile-based coding that adjusts the resolution based on the user’s viewport. The video tiles are encoded at two different levels of resolution. The frame reconstruction process integrates high-resolution tiles within the viewport and low-resolution tiles outside of it. A study [79] explores various tiling features by investigating 360-degree video streaming where each tile can be projected for quality adaptation based on different viewing regions. Moreover, the full and partial delivery strategies can save about 65% of the bitrate compared to full delivery with basic streaming. An equirectangular video in [80] is partitioned into many tiles, where a sampling weight is assigned to each horizontal tile based on its content. The bitrate allocation is optimized based on the sampling weight and the bandwidth budget. An overlapping margin with two neighboring tiles is added to reduce the probability of missing the viewport, and alpha blending is applied on the overlapping tile margins. In recent approaches [81,82,83], each tile has multiple hierarchical representations to choose from, based on the user’s viewport. As a result, smoother quality degradation can be obtained. By using SVC, they cope with the randomness of both the network channels and head movements. The authors in [81] introduce an adaptive streaming framework that uses a visual attention metric to calculate the tiling patterns. Based on this metric, tiles are generated in different sizes to retain the high coding efficiency of larger tiles and the finer streaming decisions of smaller tiles. For each selected pattern, a bitrate allocation strategy assigns bitrates to the tiles belonging to different regions for optimal streaming. The authors in [84] presented an optimization framework that tries to minimize the error of pre-fetched tiles. It also ensures continuous playback within a small buffer and builds a probabilistic model that predicts the viewport. The SRD extension of DASH achieves a higher bitrate, and thus the videos can be streamed to users with the highest quality. In addition, the motion-constrained HEVC tiles in [85] minimize the complexity and synchronization problems between tiles such that a single decoder can be used. Three types of heuristic strategies are also presented for 360-degree video streaming. The experimental results indicate that better coding efficiency is achieved by streaming the viewport tiles at the original captured resolution. The authors in [65] designed an end-to-end VR video streaming system to transmit 8K 360-degree videos. The proposed methodology assigns higher bitrates to the viewport tiles and gradually lower quality to the tiles outside of the viewport. The bitrate assigned to the
$i$th tile in the viewport is determined by a client-defined constant, the currently available bandwidth, and the weight of the $i$th tile, where $V$ and $\bar{V}$ represent the viewport and the set of tiles outside of the viewport, respectively. The bitrate for the $j$th tile in the set $\bar{V}$ of tiles outside of the viewport is estimated from the same quantities. Finally, for each tile representation, the client requests a bitrate identified by the DASH representation ID $m$, where $V$ is the set of tiles inside the viewport. The researchers in [
58] considered fetching the unviewed part of the video at the lowest quality based on user head movement prediction, and adaptively deciding the playback quality of the viewed part based on bandwidth prediction. A two-tier system for 360-degree video streaming has been proposed in [86], where the base tier delivers the entire video content at a lower quality with a long buffer. In contrast, the enhancement tier delivers the predicted viewport at a higher quality with a short buffer. Consequently, a tradeoff between reliability and efficiency is achieved for 360-degree video streaming.
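The general idea of giving viewport tiles a larger share of the available bandwidth can be sketched as follows. This is a simplified, hypothetical allocation: the `viewport_share` parameter, the equal split among the outside tiles, and the function names are assumptions, and the sketch does not reproduce the allocation rule of [65].

```python
def allocate_tile_bitrates(bandwidth_kbps, viewport_weights, outside_tiles,
                           viewport_share=0.8):
    """Split the available bandwidth between viewport and outside tiles.

    viewport_weights maps tile id -> weight (e.g., overlap with the viewport).
    Assumption: `viewport_share` of the bandwidth goes to viewport tiles in
    proportion to their weights; the rest is split equally outside.
    """
    total_weight = sum(viewport_weights.values())
    if total_weight <= 0:
        raise ValueError("viewport weights must sum to a positive value")
    budget_in = viewport_share * bandwidth_kbps
    budget_out = bandwidth_kbps - budget_in
    rates = {t: budget_in * w / total_weight
             for t, w in viewport_weights.items()}
    if outside_tiles:
        for t in outside_tiles:
            rates[t] = budget_out / len(outside_tiles)
    return rates


def pick_representation(budget_kbps, ladder_kbps):
    """Index m of the highest representation not exceeding the budget.

    Falls back to the lowest representation (index 0) if none fits.
    """
    best = 0
    for m, rate in enumerate(sorted(ladder_kbps)):
        if rate <= budget_kbps:
            best = m
    return best


if __name__ == "__main__":
    rates = allocate_tile_bitrates(10000, {"a": 0.5, "b": 0.5}, ["c", "d"],
                                   viewport_share=0.75)
    ladder = [500, 1000, 2000, 4000]
    print({t: pick_representation(r, ladder) for t, r in rates.items()})
```

A graded allocation, in which tiles farther from the viewport receive progressively less, could be obtained by replacing the equal split with distance-dependent weights.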
In [86], the authors predicted the head movement (HM) of the viewer from his/her previous HM data, considering both angular velocity and angular acceleration. According to the predicted HM, a different quantization parameter (QP) is allocated to each tile. The experimental evaluation showed that HM prediction based on angular velocity and angular acceleration significantly reduces the prediction error, the delay, and the associated loss in visual fidelity compared to baseline approaches. A very similar solution, but using HTTP/2 features, is presented in [87] to overcome the bandwidth and request overheads. HTTP/2’s priority and stream termination features enable prioritized transmission and enhance the user experience. In contrast to the well-developed viewport-based coding discussed above, saliency-aware compression is still a challenge because the existing 2D saliency approaches are difficult to employ for 360-degree video. A work in [88] proposes saliency-based sampling for a 360-degree video system, where low-resolution CMP is combined with unsampled salient regions. The Spatial Relationship Description (SRD) feature extends the Media Presentation Description (MPD) and enables the DASH client to retrieve only certain user-relevant video streams at high resolution. The authors in [83,89] employed the MPEG-DASH SRD [90] extensions to support tiled streaming and described a video as an exclusive collection of synchronized videos. They also present several SRD use cases (e.g., zoomable video) where the users are provided with a seamless experience. SRD describes the spatial positions of content, and thus DASH clients can determine which tiles to request. The users always download low-resolution tiles to avoid rebuffering, while the current view region is presented in high resolution to support a high-quality zooming feature.
Table 6 represents a summary of different streaming schemes for 360-degree video streaming.
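Head movement prediction from angular velocity and angular acceleration can be illustrated with a constant-acceleration extrapolation. The sketch below is a minimal example under stated assumptions: the finite-difference estimates, the function names, and the restriction to yaw samples that do not cross the ±180 degree boundary are illustrative choices, not the predictor used in the cited work.

```python
def estimate_motion(samples):
    """Estimate angular velocity and acceleration from the last 3 samples.

    samples is a list of (time_s, yaw_deg) pairs in chronological order.
    Assumption: the samples do not cross the +/-180 degree wrap-around.
    """
    (t0, y0), (t1, y1), (t2, y2) = samples[-3:]
    v1 = (y1 - y0) / (t1 - t0)          # velocity over the first interval
    v2 = (y2 - y1) / (t2 - t1)          # velocity over the second interval
    accel = (v2 - v1) / ((t2 - t0) / 2.0)
    return v2, accel


def predict_angle(theta_deg, omega_deg_s, alpha_deg_s2, horizon_s):
    """Extrapolate the yaw angle `horizon_s` seconds ahead.

    Uses theta + omega * t + 0.5 * alpha * t^2 and wraps to [-180, 180).
    """
    predicted = (theta_deg + omega_deg_s * horizon_s
                 + 0.5 * alpha_deg_s2 * horizon_s ** 2)
    return (predicted + 180.0) % 360.0 - 180.0


if __name__ == "__main__":
    history = [(0.0, 0.0), (0.5, 10.0), (1.0, 30.0)]
    omega, alpha = estimate_motion(history)
    print(predict_angle(history[-1][1], omega, alpha, horizon_s=0.5))
```

The predicted angle can then drive per-tile decisions, for example assigning a lower QP to tiles near the predicted viewing direction.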
3.4. Quality of Experience Enabled Streaming
Multimedia streaming has gained considerable popularity among users everywhere, yet many performance problems arise when delivering multimedia over differently loaded networks. The processing and transmission of the 360-degree format bring further challenges (e.g., bandwidth and distortions). To lower the bandwidth requirements, video material must be compressed to lower qualities, causing compression artifacts that may negatively affect the user’s quality of experience (QoE). QoE is a measure of customer satisfaction with and experience of a service such as TV broadcasts, phone calls, and web browsing. As with traditional 2D videos, the quality of 360-degree videos can be assessed through both subjective and objective tests, each with its own advantages and disadvantages.
3.4.1. Subjective Quality Assessment
Many subjective video quality assessment methods have been developed for 2D videos over the past two decades. Many subjective methodologies have been proposed by the International Telecommunication Union (ITU). Two metrics are widely used in subjective quality assessment: the mean opinion score (MOS) [91] and the differential mean opinion score (DMOS) [92]. Currently, different types of subjective assessment methods are being identified for omnidirectional videos. The authors in [20] presented a testbed for omnidirectional video and images that uses an HMD as the display device. Unfortunately, this study does not consider how to measure the subjective quality of 360-degree videos. Based on the testbed proposed in [20], a dual stimulus method has been used in [93] to measure the quality of High Dynamic Range (HDR) omnidirectional images. In contrast, the authors in [94] present a single stimulus ACR-based study for omnidirectional images. It has been found that the ideal viewing duration for 360-degree images is 20 seconds, which lets the user explore the content entirely. Moreover, different people might explore the content differently, looking at different parts, which results in different experiences. Therefore, visual attention and saliency are important aspects to consider in the subjective assessment of 360-degree video content [95]. The authors in [96,97] extended the subjective study by considering several parameters (i.e., resolution, bitrate, quantization parameter QP, content characteristics) and their effect on the perceptual quality of 360-degree video. A study by [98] was also conducted on the QoE of 360-degree video streaming that mainly focuses on the impact of stalling. The authors performed subjective research in their lab, where they compared different stalling frequencies and durations and additionally compared the results for 360-degree video with traditional TV. Another study [99] on the QoE of streaming 360-degree videos incorporated delay, quality variations, and interruptions into its QoE model, indicating that these factors do influence the quality perception. There is still a lack of standardized methodologies and metrics for subjective studies of 360-degree video. The debate on how to develop these is ongoing; consensus has not yet been reached within the research community. Nevertheless, some studies on the subjective experience of omnidirectional content have been performed by adapting methodologies from classical video quality assessment. However, this adaptation is not trivial, as viewing through an HMD is substantially different from viewing on a regular display and yields a different experience. The viewer is more immersed in the content, and challenges regarding strain and cybersickness arise. Cybersickness is a potential barrier to achieving higher QoE levels and can cause discomfort. In [100], two subjective experiments were conducted to evaluate the video perception level and cybersickness in viewport-adaptive 360-degree video streaming with limited bandwidth and resolution. Also, a modified absolute category rating (M-ACR) method was proposed by using different devices [101,102] to analyze the cybersickness of 360-degree videos at varying bitrates and resolutions.
Table 7 depicts the comparison of different subjective quality assessment approaches.
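The two subjective metrics mentioned above can be computed from raw ratings as sketched below. The five-point rating scale, the normal-approximation confidence interval, and the definition of DMOS as the mean of per-subject differences between reference and test ratings are assumptions for illustration; individual studies and recommendations may define the scores differently.

```python
import math
from statistics import mean, stdev


def mos(ratings):
    """Mean opinion score: the average of the subjects' ratings."""
    return mean(ratings)


def mos_ci95(ratings):
    """Half-width of an approximate 95% confidence interval for the MOS.

    Assumption: normal approximation with the sample standard deviation.
    """
    return 1.96 * stdev(ratings) / math.sqrt(len(ratings))


def dmos(reference_ratings, test_ratings):
    """Differential MOS from paired ratings of the same subjects.

    Assumption: DMOS is the mean of (reference rating - test rating).
    """
    diffs = [r - t for r, t in zip(reference_ratings, test_ratings)]
    return mean(diffs)


if __name__ == "__main__":
    scores = [5, 4, 4, 3, 4]
    print(mos(scores), mos_ci95(scores))
    print(dmos([5, 5, 4], [3, 4, 2]))
```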
3.4.2. Objective Quality Assessment
Currently, objective quality in 360-degree video is measured in the planar projection through structural similarity (SSIM) and standard peak signal-to-noise ratio (PSNR). However, these metrics give similar importance to all parts of the spherical image, even though different parts have different viewing probabilities and thus different importance. Additionally, they still do not give a good representation of subjective quality. Viewport-based PSNR or SSIM metrics could be a solution closer to what the users perceive. However, all objective metrics still fail to consider perceptual artifacts such as visible seams [95]. In a study by [105], three metrics especially designed for omnidirectional content were compared to conventional 2D metrics. The authors evaluated the spherical PSNR (S-PSNR), the weighted-to-spherically-uniform PSNR (WS-PSNR), and the Craster parabolic projection PSNR (CPP-PSNR). The results showed only moderate correlations with the subjective scores. Compared to traditional methods, the metrics developed for 360-degree video content did not work better. This was confirmed once more by a study of various quality metrics in [106], which considered 10 quality metrics. The results show a better correlation with the subjective metrics. Moreover, they showed that traditional PSNR outperformed the other metrics due to its simplicity. The data from subjective methods are used as ground truth, and the goal is to predict the quality scores (MOS) as closely as possible from objective data about the video [107]. Several objective video quality assessment approaches build on the PSNR metric. However, PSNR alone cannot represent subjective visual quality, since human experience is not taken into account. For example, subjective quality is more likely to be affected by the PSNR within the region-of-interest (RoI). The study [108] considered multi-level quality features and a fusion model, where the quality features are compared with RoI maps. These multiple quality features are then combined by the fusion model to obtain the overall quality score. Another study [109] introduced S-PSNR, where PSNR is calculated based on points uniformly sampled on the sphere. S-PSNR can generate the objective quality assessment of 360-degree videos by applying interpolation algorithms under various projections. In contrast, a weighted PSNR (W-PSNR) [110] is proposed by using gamma-corrected pixel values. The Craster parabolic projection PSNR (CPP-PSNR) compares different projection approaches by remapping pixels to the CPP projection. SSIM is another quality evaluation metric that captures multi-factor image distortion. The author in [111] analyzed the SSIM results and introduced a spherical SSIM (S-SSIM) metric to compare the similarity of impaired and original 360-degree videos. The overview of different approaches to quality evaluation is given in Table 8.
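The difference between planar PSNR and a sphere-aware variant can be made concrete with a small pure-Python sketch. It compares standard PSNR with WS-PSNR for equirectangular frames, where each row is weighted by the cosine of its latitude. The representation of frames as nested lists of luma samples and the function names are assumptions for illustration; production tools operate on full video frames.

```python
import math


def psnr(ref, dist, peak=255.0):
    """Standard PSNR for two equally sized 2D lists of samples."""
    count = sum(len(row) for row in ref)
    mse = sum((a - b) ** 2
              for r, d in zip(ref, dist) for a, b in zip(r, d)) / count
    return math.inf if mse == 0 else 10.0 * math.log10(peak * peak / mse)


def ws_psnr(ref, dist, peak=255.0):
    """WS-PSNR for equirectangular frames.

    Row j of an H-row frame is weighted by cos((j + 0.5 - H/2) * pi / H),
    so errors near the poles count less than errors near the equator.
    """
    height = len(ref)
    weighted_error = 0.0
    weight_sum = 0.0
    for j, (r, d) in enumerate(zip(ref, dist)):
        w = math.cos((j + 0.5 - height / 2.0) * math.pi / height)
        for a, b in zip(r, d):
            weighted_error += w * (a - b) ** 2
            weight_sum += w
    wmse = weighted_error / weight_sum
    return math.inf if wmse == 0 else 10.0 * math.log10(peak * peak / wmse)


if __name__ == "__main__":
    reference = [[100] * 8 for _ in range(4)]
    polar_error = [row[:] for row in reference]
    polar_error[0] = [110] * 8  # distortion only in the top (polar) row
    print(psnr(reference, polar_error), ws_psnr(reference, polar_error))
```

In this example the distortion sits in the polar row, so WS-PSNR reports a higher value than PSNR, reflecting that the planar metric overstates the visual impact of errors in oversampled polar regions.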
Machine Learning (ML) can bridge the gap between objective and subjective QoE assessments in streaming approaches. Reinforcement learning (RL) methods are used to select video streaming bitrates that improve the QoE.
Table 9 provides a summary of different works in video streaming applications that apply RL to improve QoE. In [113], a method was investigated to adapt variable video streaming. A two-stage model [109] was proposed for QoE assessment. The research in [114] aims to address the issue of quality variation that affects the QoE. A DRL model [115] considered both eye and head movement data for the quality assessment of 360-degree video. The author in [116] proposed a Q-learning algorithm for adaptive streaming services to improve the QoE in variable environments.
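To indicate how Q-learning can drive bitrate selection, the following sketch shows a tabular Q-value update together with a hypothetical QoE reward. The reward terms and their penalty weights, the state labels, and the function names are assumptions made for illustration and do not reproduce the algorithm of [116].

```python
def qoe_reward(bitrate_mbps, rebuffer_s, prev_bitrate_mbps,
               rebuffer_penalty=4.0, switch_penalty=1.0):
    """Hypothetical QoE reward: bitrate minus stall and switching penalties."""
    return (bitrate_mbps
            - rebuffer_penalty * rebuffer_s
            - switch_penalty * abs(bitrate_mbps - prev_bitrate_mbps))


def q_update(q_table, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """One tabular Q-learning step.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    Unseen (state, action) pairs are treated as having value 0.
    """
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q_table[(state, action)]


def greedy_action(q_table, state, actions):
    """Action (e.g., representation index) with the highest Q-value."""
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))


if __name__ == "__main__":
    table = {}
    actions = [0, 1, 2]  # indices of the available bitrate levels
    reward = qoe_reward(bitrate_mbps=4.0, rebuffer_s=0.0, prev_bitrate_mbps=4.0)
    q_update(table, "high_bw", 2, reward, "high_bw", actions)
    print(greedy_action(table, "high_bw", actions))
```

In practice the state would encode observations such as buffer level and recent throughput, and exploration (e.g., epsilon-greedy) would be added during training.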
In summary, QoE research is important for the development of video streaming technology to most efficiently handle the tradeoff between providing good quality and limiting network burden. Additionally, the 360-degree video offers a substantially different experience compared to regular 2D. Therefore, it would be prudent to do more research on the QoE in the 360-degree video specifically.
4. Audio Technologies for 360-Degree Video
Audio can make or break 360-degree and panoramic videos. Spatial audio [117,118] is known as a full-sphere surround sound approach that employs multiple audio channels to mimic the audio representation that we have in real life. Spatial audio makes the 360-degree video more convincing because of the channeling properties of sound that enable it to pass through time and space. The Google VR Software Development Kit (SDK) [119] optimizes the audio rendering engine for mobile VR. The significance of the 360-degree video display system in producing the spatial audio soundtrack cannot be overstated. The Facebook Spatial Workstation [120] has templates and numerous plugins that support synchronized audio playback for 360-degree video with the help of HMDs (e.g., Oculus Rift), albeit solely for OSX. Other audio production environments are expected to integrate this type of video monitoring.
The reproduction techniques of spatial audio fall into two categories: physical reconstruction and perceptual reconstruction. The physical reconstruction technique is used to synthesize the whole sound field as closely as possible to the desired signal. In contrast, perceptual reconstruction uses psychoacoustic techniques to produce a perception of the spatial sound characteristics [121]. The stereo configuration, which uses two speakers, is the most popular sound reproduction method and conveys some spatial information (including distance, sense of direction, ambiance, and sound stage ensemble). Multi-channel reproduction methods [122] are used in the acoustic environment and have become popular in consumer devices.
The study in [123] presents multi-channel reproduction techniques. Other physical reconstruction techniques, namely Ambisonics and Wave Field Synthesis (WFS), aim to reproduce the same acoustic pressure field that exists in the surroundings. A microphone array is needed to capture a richer spatial sound field. The microphone recordings require post-processing, because they cannot be used directly to analyze the sound field characteristics. Microphone arrays are used in speech enhancement, source separation, echo cancellation, and sound reproduction.
Ambisonics [124], also known as 3D audio, is used to record, mix, and play back 360-degree audio around a center point. Although it was investigated in the 1970s, it saw little use until its recent adoption by the VR industry and 360-degree applications. Ambisonics is unlike traditional surround technologies. Two-channel stereo and traditional surround technologies share the same principle: they create a sound image by sending audio signals to specific speakers. Ambisonics, in contrast, is not tied to any specific speaker layout, and it produces a smooth sound sphere even when the sound field rotates, whereas traditional surround formats provide excellent imaging only when the audio scene is static. This is why Ambisonics has become the standard in the VR industry and for 360-degree video. Moreover, Ambisonics covers the full sphere and spreads the sound evenly throughout it.
There are six Ambisonics formats, named A, B, C, D, E, and G. First-order Ambisonics, or B-format, microphones are used for linear VR and rely on a tetrahedral array. The recordings are processed into four channels: “W” provides a non-directional pressure level, while “X”, “Y”, and “Z” carry the front-to-back, side-to-side, and up-to-down directional information, respectively. First-order Ambisonics is only useful for a comparatively small sweet spot, because its limited spatial fidelity can affect sound localization. Higher-Order Ambisonics improves on first-order Ambisonics by adding more microphones; it is also used in linear VR and requires more loudspeakers.
Perceptual reconstruction techniques replicate the natural listening experience of spatial audio to represent the physical sound. Binaural recording [125], an extended form of stereo recording, provides a 3D sound experience. Binaural recordings replicate human hearing as closely as possible by using two 360-degree microphones, in the same way as regular stereo recordings capture sound with directional microphones. The 360-degree microphones are mounted on a dummy head [125], which serves as a proxy for the human head because it reproduces the precise geometry of the ears. The dummy head also shapes the sound waves in the way the contours of a human head would. With the help of 360-degree microphones, a spatial stereo image is captured more precisely than with any other recording method.
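To make the four first-order B-format channels described above more concrete, the following minimal sketch encodes a single mono sample into W, X, Y, and Z and then rotates the resulting sound field about the vertical axis. It is an illustration only and is not taken from the cited works: the function names are hypothetical, and the angle convention (azimuth counter-clockwise from the front) and the 1/sqrt(2) scaling of W are assumptions based on the traditional B-format convention.

```python
import math


def encode_b_format(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample into first-order B-format channels (W, X, Y, Z).

    Assumed convention: azimuth is measured counter-clockwise from the front,
    elevation upwards from the horizontal plane, and W is scaled by 1/sqrt(2).
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                 # non-directional pressure
    x = sample * math.cos(az) * math.cos(el)    # front-to-back component
    y = sample * math.sin(az) * math.cos(el)    # side-to-side component
    z = sample * math.sin(el)                   # up-to-down component
    return (w, x, y, z)


def rotate_yaw(b_format, yaw_deg):
    """Rotate the whole sound field about the vertical axis by yaw_deg."""
    w, x, y, z = b_format
    a = math.radians(yaw_deg)
    # W and Z are unaffected by a rotation about the vertical axis.
    return (w,
            x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)


# A source straight ahead, then the same field rotated by 90 degrees.
front = encode_b_format(1.0, azimuth_deg=0.0, elevation_deg=0.0)
turned = rotate_yaw(front, yaw_deg=90.0)
```

Because the rotation only mixes the X and Y channels, the sound field can follow the listener's head orientation without re-encoding the individual sources, which is consistent with the smooth behavior under rotation noted for Ambisonics. To compensate for a head turn, a player would typically rotate the field by the opposite angle.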
Head-Related Transfer Functions (HRTFs) [126] are used in real-time binaural audio techniques: by filtering an audio signal, they reproduce the complex cues that help us localize sounds. Multiple factors (such as the ears, the head, and the listening environment) can affect these cues because, in reality, we reorient ourselves to localize sounds. Hence, it is essential for soundscape researchers to choose the proper sound recording and reproduction technique so that the played-back sound matches natural listening scenarios.
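The filtering step performed with HRTFs can be sketched as a pair of convolutions, one per ear, with the time-domain counterparts of the HRTFs (head-related impulse responses, HRIRs). The example below is a simplified illustration rather than a production renderer: the function names are hypothetical, and the two impulse responses are toy values invented for this sketch, not measured data.

```python
def convolve(signal, kernel):
    """Direct-form convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter a mono signal with one impulse response per ear."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy HRIRs (hypothetical values): a source on the listener's left reaches
# the left ear directly, and the right ear two samples later and attenuated.
HRIR_LEFT = [1.0, 0.0, 0.0]
HRIR_RIGHT = [0.0, 0.0, 0.6]

left, right = render_binaural([1.0, 0.5], HRIR_LEFT, HRIR_RIGHT)
```

The delay and attenuation between the two output channels stand in for the interaural time and level differences that real HRTFs encode. A real-time system would additionally switch or interpolate the impulse responses as the listener's head orientation changes.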
Table 10 provides a comprehensive overview of the audio techniques mentioned above.
5. Standards
Immersive media currently attracts considerable attention because of the technological and scientific challenges it poses. Academic and research institutions are undertaking significant activities to facilitate immersive media standardization, and a multi-phase scheme is being pursued to complete this set of standards. MPEG is currently working on ISO/IEC 23090 (MPEG-I) to support immersive media coding. MPEG-I consists of the following parts: (1) Technical Report on Immersive Media, (2) Omnidirectional Media Format (OMAF), (3) Versatile Video Coding (VVC), (4) Immersive Audio Coding, (5) Point Cloud Compression, (6) Metrics, (7) Metadata, (8) Network-Based Media Processing, (9) Geometry-based Point Cloud Compression, (10) Carriage of Point Cloud Data, (11) Implementation Guidelines for Network-based Media Processing, and (12) Immersive Video.
The OMAF standard defines the storage and delivery formats for omnidirectional media applications, concentrating on the images, audio, and synchronized text of 360-degree videos. Its first edition [44] supports storage and delivery based on the ISO Base Media File Format (ISOBMFF) and MPEG Media Transport (MMT). OMAF includes several additions, such as interactivity, temporal navigation, and a natural viewing experience through support for head motion parallax. MPEG has divided the standardization associated with VR into the following categories: monoscopic 360-degree video, binocular 3D 360-degree video, stereoscopic 360-degree video, and free-viewpoint video (FVV) [127]. A rig of 4 to 6 cameras captures the 360-degree video, and the images of those cameras are then stitched into a single spherical view. In monoscopic 360-degree video, the data is represented as 2D images whose pixel coordinates are interpreted as angular positions on the sphere. While viewing a 360-degree video on an HMD, the user’s movement can be described by three rotations (i.e., yaw, pitch, and roll). Therefore, 360-degree video is also referred to as 3DoF (three degrees of freedom) video. In the monoscopic case, both of the user’s eyes see the same panorama, so there is no depth impression. At the end of 2017, part 1a of the first phase of OMAF enabled the streaming of 3DoF 360-degree video with existing compression technologies. In 3DoF, the user is static, but the head can change orientation to look around the 360-degree video.
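The relation between head orientation and the 2D representation of a monoscopic 360-degree frame can be illustrated with a short sketch that maps a viewing direction to a pixel of an equirectangular image. This is an assumption-laden illustration and not part of OMAF: the function name is hypothetical, an equirectangular layout is assumed, yaw 0 is placed at the image center with positive yaw to the right, and positive pitch points upwards.

```python
def orientation_to_equirect_pixel(yaw_deg, pitch_deg, width, height):
    """Map a viewing direction to a pixel of an equirectangular frame.

    Assumed convention: yaw 0 is the image center and increases to the
    right; pitch is in [-90, 90] with +90 at the top row. Roll rotates the
    viewport around the viewing direction and does not move its center.
    """
    u = ((yaw_deg + 180.0) % 360.0) / 360.0   # horizontal position in [0, 1)
    v = (90.0 - pitch_deg) / 180.0            # vertical position in [0, 1]
    col = min(int(u * width), width - 1)
    row = min(int(v * height), height - 1)
    return col, row


# Looking straight ahead in a 3840 x 1920 frame.
center = orientation_to_equirect_pixel(0.0, 0.0, 3840, 1920)
```

Under these assumptions, the full 360 × 180 degree coverage maps linearly onto the image width and height, which is why yaw and pitch alone determine the viewport center.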
Part 1b of the first phase of OMAF aims to enhance 360-degree video with depth information, an approach named enhanced 3DoF or 3DoF+, because 3DoF cannot represent the scene behind objects. 3DoF+ provides accurate parallax for a limited range of motion and leverages much of the existing 360-degree video infrastructure. 3DoF+ uses additional sensor data to produce a depth map, which allows a player to re-project the video frames to depict virtual movement in space. However, 3DoF+ has the following disadvantages: (1) it introduces visible artifacts, although these may be minimized, if not eliminated, by machine learning (ML) methods, (2) there are no current standards for depth layer representations, and (3) the user’s movement is confined to a limited range. The second phase of OMAF aims to develop full support for 6DoF [128,129] by including point cloud coding, natural 6DoF representation (i.e., light fields), and rendering-centric interactive 6DoF.
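The depth-based re-projection used by 3DoF+ can be sketched geometrically: a sample seen in a given direction at a given depth is placed in 3D space, the viewer is displaced, and the direction towards the sample is recomputed. The code below is a simplified illustration under stated assumptions, not the method of any particular player: the function names are hypothetical, the coordinate frame (x forward, y right, z up, positive yaw to the right) is assumed, and occlusion handling and hole filling are ignored.

```python
import math


def direction(yaw_deg, pitch_deg):
    """Unit vector for a viewing direction (x forward, y right, z up)."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    return (math.cos(pitch) * math.cos(yaw),
            math.cos(pitch) * math.sin(yaw),
            math.sin(pitch))


def reproject(yaw_deg, pitch_deg, depth_m, head_offset_m):
    """Return (yaw, pitch, depth) of a sample after the viewer has moved.

    head_offset_m is the viewer displacement (x, y, z) in meters relative
    to the position from which the frame and its depth map were captured.
    """
    dx, dy, dz = direction(yaw_deg, pitch_deg)
    ox, oy, oz = head_offset_m
    # Position of the sample relative to the displaced viewer.
    qx, qy, qz = depth_m * dx - ox, depth_m * dy - oy, depth_m * dz - oz
    new_depth = math.sqrt(qx * qx + qy * qy + qz * qz)
    if new_depth == 0.0:
        raise ValueError("viewer coincides with the sample position")
    new_yaw = math.degrees(math.atan2(qy, qx))
    new_pitch = math.degrees(math.asin(qz / new_depth))
    return new_yaw, new_pitch, new_depth


# The viewer moves 1 m to the right; compare a near and a far sample.
near = reproject(0.0, 0.0, 2.0, (0.0, 1.0, 0.0))
far = reproject(0.0, 0.0, 200.0, (0.0, 1.0, 0.0))
```

In this sketch the near sample shifts by a much larger angle than the far one, which is the motion parallax that plain 3DoF video cannot reproduce. It also indicates why the supported range of motion is limited: large displacements expose regions that were never captured.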
In March 2019, MPEG issued a Call for Proposals (CfP) on 3DoF+ video to establish a coding solution based on HEVC and the standardization of 3DoF+ metadata. Common test conditions (CTC) for immersive video (MPEG-I TM2) are desirable so that coding experiments can be conducted in a well-defined environment. In this context, the Test Model of Immersive Video (TMIV) [130] specifies the standard test conditions, i.e., coding efficiency, subjective quality, pixel rate, user experience, and assessment of immersive video applications. The technical approach follows these steps: (1) compressing test content, (2) synthesizing intermediate views from decoded views and metadata (when available), (3) rendering viewports of real/virtual pose traces with a limited or a wider movement, and (4) evaluating coding efficiency and parallax effect considering both decoded and synthesized views. The bit-stream should be viewer independent, meaning that neither the position nor the orientation of the viewer should be considered when compressing the test content. The range of supported viewer positions is constrained and known. Three different anchors are used. The first is the MIV (Metadata for Immersive Video) anchor, which is based on HEVC + TMIV. The second, the MIV view anchor, is also based on HEVC + TMIV but directly encodes a subset of the source views. The third, the MV-HEVC anchor, is based on MV-HEVC and VVS. MPEG has also carried out explorations on technologies that enable 6DoF, which allow the user to change not only the viewing orientation but also the viewing position. Views of the same scene from a range of locations promise incredibly realistic immersive imagery with correct specular effects.
Stereoscopic 360-degree video is a 3D extension of 360-degree video in which two panoramas of a scene are used and represented with a circular projection. In each time frame, each panorama provides an image captured by a rotating camera with a narrow horizontal FoV. Presenting different views to the left and right eyes produces the depth sensation in a scene. However, in this type of visualization the user’s movement is limited, because fast head movements can produce an unnatural 3D impression [131].
6. Applications of 360-Degree Video
The possibilities for new immersive experiences with 360-degree video are endless. Consumer adoption of the technology is still in its early stages, but it has already proved very popular in the gaming industry. The applications of 360-degree video are not confined to gaming, however: they range from academic research to engineering, design, business, arts, and entertainment [132,133]. Users will be able to virtually attend live sports from their favorite seat, listen to a live singer, or watch movies. Several VR simulators have been designed for training and education in different fields, e.g., power plants, submarines, cranes, surgery, aircraft operation, and air traffic control [134].
Figure 6 illustrates the growth potential of the 360-degree video market by application, such as professional sports, travel, live events, movies, news, and TV shows. Next, the applicability of 360-degree video to various fields is briefly described.
6.1. Architectural Design
The architecture industry has grown immensely thanks to advances in immersive media technology. 360-degree video can present a model to millions of viewers in just a few minutes with little or no loss of information. It can also preserve lifetime descriptions of engineering drawings or static components in the form of 3D models. This enables researchers to demonstrate how components are gathered, synthesized, tested, and examined with potentially low time and cost [134,135].
6.2. Construction Progress Monitoring
At present, image-based visualization techniques enable the reporting of construction progress [136]. 360-degree interactive and immersive media can contribute to the success of a construction project. It may also be used to take exact measurements and to perform advance control, along with other suitable procedures that must be completed within a specific time [137]. Researchers argue that this application can serve as an e-learning tool and that it must be interoperable, robust, and reusable [138].
6.3. Medicine
The most apparent and practical application of 360-degree video is in the medical field. It is proving popular in molecular modeling, ultrasound echography, computational neuroscience, and the treatment of phobias, among others. These advancements have significantly saved time and practical costs in training and education. Another medical use of 360-degree video aims to develop surgical skills without harming human beings or animals [139].
6.4. Data Visualization
Data visualization is the graphical representation of information that makes certain characteristics or values more apparent. This type of application is implemented for 3D data sets resulting from Computational Fluid Dynamics (CFD) [140]. The data is visualized by mapping geometric objects, e.g., particle clouds or arrows, to data values. For instance, arrows can be mapped to data values to visualize airflow, where the width shows the volumetric flow rate, the direction indicates the airflow direction, and the color represents the temperature.
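The arrow mapping described above can be sketched as a small function that turns one CFD sample into the visual attributes of an arrow glyph. The sketch is illustrative only: the class and function names, the blue-to-red color ramp, and the default value ranges are hypothetical choices made for this example and do not come from the cited work.

```python
import math
from dataclasses import dataclass


@dataclass
class ArrowGlyph:
    direction: tuple   # unit vector giving the airflow direction
    width: float       # glyph width, proportional to volumetric flow rate
    color: tuple       # (R, G, B) in 0..255, blue = cold, red = hot


def clamp01(value):
    """Limit a value to the range [0, 1]."""
    return min(max(value, 0.0), 1.0)


def airflow_to_arrow(velocity, flow_rate, temperature_c,
                     max_flow_rate=1.0, temp_range=(15.0, 35.0),
                     max_width=10.0):
    """Map one airflow sample to arrow attributes (assumed value ranges)."""
    speed = math.sqrt(sum(c * c for c in velocity))
    if speed > 0.0:
        direction = tuple(c / speed for c in velocity)
    else:
        direction = (0.0, 0.0, 0.0)   # no flow, no preferred direction
    width = max_width * clamp01(flow_rate / max_flow_rate)
    low, high = temp_range
    t = clamp01((temperature_c - low) / (high - low))
    color = (round(255 * t), 0, round(255 * (1.0 - t)))
    return ArrowGlyph(direction, width, color)


# One sample: flow along (3, 0, 4), half the maximum flow rate, 35 degrees C.
glyph = airflow_to_arrow((3.0, 0.0, 4.0), flow_rate=0.5, temperature_c=35.0)
```

Keeping direction, width, and color as independent channels lets a viewer read three quantities from a single glyph, which is the point of the mapping described in the text.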
6.5. News Broadcasting
News is always exciting and informative for a viewer. Several news broadcasters have set up 360-degree sections on their web portals, as shown in Figure 7.
6.6. Sports and Entertainment
360-degree video is also applicable to sports; for example, a round of golf can be played on a large projection screen. TV cartoons are also making use of 360-degree video applications (e.g., in the BBC’s Ratz, the cat is animated in real time during the live broadcast using a tracking system on the puppeteer) [132,139]. Similarly, “Trump World” has explored new ground from a technical perspective. It was the first-ever effort to develop a system that can deliver 360-degree videos synchronized with a television broadcast.
The media industry has deployed several technologies that enable synchronizing transmission with video delivery, including specific software and hybrid cast-capable televisions. Live 360-degree video may introduce a delay compared with recorded programs, for which chunk files are prepared in advance. Moreover, the rapid creation of chunk files enables fast replays from 360-degree video perspectives even during the live broadcast of an event. Expectations for 2020 call for further development of this technology and more investment in delivering live 360-degree videos.
6.7. Education
360-degree video is used in education to show complex scenes that are difficult to explain with conventional video, images, or even words. In the biological sciences, 360-degree video cameras are used to record field trips, and in forensic science they are used to record crime scenes so that students can examine them. 360-degree recording can also be a more authentic way to record classrooms, as it is a powerful tool for pre-service teachers to explain all the activities performed by students.
The advantages and disadvantages of every system depend on the characteristics of the application environment. Some applications are highly beneficial when implemented in a fully immersive environment but not useful when implemented in a non-immersive one. Table 11 lists the applications with all possible systems, giving the type of system and whether it is suitable for the application.
8. Discussion and Conclusions
The emerging 360-degree video has attracted the attention of many researchers. It has become popular in multimedia applications such as gaming, education, entertainment, tourism, and sports, among others. Over the years, a vast number of works have focused on improving 360-degree video streaming. However, streaming has remained challenging because such videos need a higher bitrate than traditional videos owing to their high resolution (6K and beyond).
This paper explained the streaming architecture of 360-degree video, which is compatible with MPEG-DASH and traditional CDNs. Several distortions associated with capturing, stitching, projection, encoding, transmission, and display were presented. Projection approaches play a critical role in determining the overall quality of the frames. With current 4K encoding techniques, the cubemap projection (CMP) is more efficient than the equirectangular one. CMP transmits more information to the user than the un-oriented projections [36].
Modern streaming approaches, such as viewport-based and tile-based streaming, which aim to reduce the bandwidth and latency requirements of high-resolution content, were presented and explained. Viewport-based streaming relies on differential quality streaming and needs several adaptation sets to be prepared at the server side. This type of adaptation involves huge storage and processing overheads. Tile-based streaming has low storage overhead and provides efficient caching and computation support [15,16]. The bitrate allocation decisions for both streaming technologies should try to balance several factors, such as viewport prediction errors, rebuffering, response delay, viewport quality, and resource use. The audio- and video-related technologies and standardization efforts were explained in detail to enable a higher degree-of-freedom immersive environment. This paper described the salient features, technical challenges, and implications for the viable implementation of 360-degree video.
Despite the popularity of the topic and abundant research efforts, several research challenges (mainly concerning projection, encoding, tiling selection, bitrate adaptation, and viewport prediction) still exist. Standardization efforts are already showing much interest in providing important insights for 360-degree video streaming. Such issues should be addressed before real implementation to ensure the best user experience.