Multi-view video and MIV (the MPEG Immersive Video standard)

We consider the problem of efficiently compressing immersive video content in the Multiview Video plus Depth (MVD) format, in which scene geometry is provided through a depth map associated with each camera. Compared to traditional 2D video, immersive video requires substantially more data to give viewers an accurate perception of scene depth. With the growing demand for immersive video consumption, efficient compression and transmission of immersive media have become crucial tasks for standardization bodies.

The Moving Picture Experts Group (MPEG) standard dedicated to MVD transmission is ISO/IEC 23090 Part 12, known as MPEG Immersive Video (MIV). MIV preprocesses the source views and then compresses their texture and depth with legacy 2D video codecs. Immersive video coding is more complex than traditional video coding: besides the trade-off between bitrate and quality, it is constrained by the pixel rate, i.e., the number of samples per second the 2D decoders must process. To address this, MIV employs a ‘pruning’ mechanism that lowers the pixel rate by removing inter-view redundancy; the surviving pixels are clustered into patches, packed into mosaics (atlases), and transmitted.
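To make the redundancy test behind pruning concrete, here is a minimal Python sketch, assuming pinhole intrinsics, 4x4 camera-to-world poses, a single basic view, and an illustrative relative depth tolerance; the real MIV pruner additionally orders views in a pruning graph and clusters the surviving pixels into rectangular patches.

```python
import numpy as np

def prunable_mask(depth_add, K_add, pose_add,
                  depth_basic, K_basic, pose_basic, tol=0.05):
    """Mark pixels of an additional view that a basic view already covers.

    A pixel is redundant (prunable) when, back-projected with its depth and
    reprojected into the basic view, it lands on a basic-view pixel whose
    depth agrees within a relative tolerance. Poses are 4x4 camera-to-world
    matrices; all parameter names and the tolerance are illustrative.
    """
    h, w = depth_add.shape
    v, u = np.mgrid[0:h, 0:w]
    # Back-project every additional-view pixel to 3D camera coordinates.
    rays = np.linalg.inv(K_add) @ np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    pts = rays * depth_add.ravel()                          # 3 x N
    # Camera -> world -> basic camera.
    pts_w = pose_add[:3, :3] @ pts + pose_add[:3, 3:4]
    rel = np.linalg.inv(pose_basic)
    pts_b = rel[:3, :3] @ pts_w + rel[:3, 3:4]
    proj = K_basic @ pts_b
    z = proj[2]
    front = z > 0
    x = np.zeros_like(z, dtype=int)
    y = np.zeros_like(z, dtype=int)
    x[front] = np.round(proj[0][front] / z[front]).astype(int)
    y[front] = np.round(proj[1][front] / z[front]).astype(int)

    hb, wb = depth_basic.shape
    valid = front & (x >= 0) & (x < wb) & (y >= 0) & (y < hb)
    mask = np.zeros(h * w, dtype=bool)
    d_hit = depth_basic[y[valid], x[valid]]
    # Redundant if the basic view observes the same surface point.
    mask[valid] = np.abs(d_hit - z[valid]) < tol * d_hit
    return mask.reshape(h, w)                               # True = prunable
```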

An alternative approach has also emerged, known as Decoder-Side Depth Estimation (DSDE), which improves the immersive video system by avoiding the transmission of depth maps altogether and shifting depth estimation to the decoder side. So far, DSDE has been studied for fully transmitted multi-view content (without pruning). Our work extends this line of research by incorporating the DSDE paradigm into content that has undergone MIV pruning, through two proposed methods.
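As a rough illustration of the processing that DSDE moves to the decoder, the sketch below recovers a depth map by a naive plane sweep over a set of candidate depths; actual decoder-side estimators (for instance MPEG's IVDE reference software) are far more elaborate, and the warping helper `warp_to_ref` is an assumed interface, not a real API.

```python
import numpy as np

def dsde_plane_sweep(tex_ref, tex_src, warp_to_ref, depth_candidates):
    """Toy decoder-side depth estimation via winner-takes-all plane sweep.

    `warp_to_ref(tex_src, d)` (assumed helper) warps the source texture into
    the reference view for a fronto-parallel plane at depth d. Each pixel
    keeps the candidate depth with the best photo-consistency.
    """
    h, w = tex_ref.shape[:2]
    best_cost = np.full((h, w), np.inf)
    best_depth = np.zeros((h, w))
    for d in depth_candidates:
        warped = warp_to_ref(tex_src, d)
        # Per-pixel absolute color difference as the matching cost.
        cost = np.abs(warped.astype(float) - tex_ref.astype(float)).sum(axis=-1)
        better = cost < best_cost
        best_cost[better] = cost[better]
        best_depth[better] = d
    return best_depth
```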

The first method excludes a subset of depth maps from transmission and investigates the effect of restoring depth at the patch level on the decoder side. The second method goes further: by assessing, at the encoder, the quality of the depth patches the decoder could estimate on its own, we distinguish depth patches that must be transmitted from those that can be reconstructed at the decoder. This reduces the pixel rate and enhances visual quality, as our experiments demonstrate.
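A minimal sketch of the second method's encoder-side decision, assuming the encoder runs the same depth estimator the decoder would apply; the error metric and threshold here are illustrative placeholders rather than the criterion actually used in the thesis.

```python
import numpy as np

def select_depth_patches(patches, err_thresh=0.02):
    """Classify each depth patch as 'transmit' or 'estimate at decoder'.

    `patches` is an assumed list of (depth_original, depth_estimated) pairs,
    where depth_estimated comes from running the decoder's depth estimator
    at the encoder. If the estimate already matches the original closely,
    the patch can be dropped and re-estimated after decoding.
    """
    decisions = []
    for d_orig, d_est in patches:
        # Mean relative depth error over the patch (placeholder metric).
        rel_err = np.mean(np.abs(d_est - d_orig) / np.maximum(d_orig, 1e-6))
        decisions.append("transmit" if rel_err > err_thresh else "estimate")
    return decisions
```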

Furthermore, we explore neural Image-Based Rendering (IBR) techniques to enhance the quality of synthesized views. These techniques perform well on non-Lambertian objects and complex scenes, and they eliminate the need for depth capture and estimation. However, neural IBR methods are effective only when many source views of a scene are available, which poses a significant challenge for their deployment within existing standards such as MIV. In this context, we address the problem of pruning source pixels for neural IBR methods, achieving a good compromise between pixel rate and synthesis quality.
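One way to picture such pruning is the leave-one-out sketch below, assuming a black-box neural renderer `synthesize` (hypothetical interface) that re-renders a target camera from a set of source views: pixels the renderer can already reproduce from the other views need not be transmitted.

```python
import numpy as np

def prune_for_neural_ibr(views, cams, synthesize, err_thresh=2.0):
    """Leave-one-out pruning of source pixels for a neural IBR renderer.

    For each view, re-render its camera from the remaining views; pixels
    with an error above `err_thresh` (illustrative, on 8-bit values) are
    the only ones that must be kept, packed, and coded.
    """
    keep_masks = []
    for i, view in enumerate(views):
        others = [v for j, v in enumerate(views) if j != i]
        other_cams = [c for j, c in enumerate(cams) if j != i]
        pred = synthesize(others, other_cams, cams[i])  # hypothetical call
        err = np.abs(pred.astype(float) - view.astype(float)).mean(axis=-1)
        keep_masks.append(err > err_thresh)             # True = transmit
    return keep_masks
```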

In summary, this thesis has demonstrated several potential advancements in immersive video coding, with a focus on source content pruning. We have improved the overall design of the immersive video coding system by introducing the DSDE approach at the patch level, yielding an average BD-rate gain of 4.63% for Y-PSNR. Additionally, we have shown for the first time that neural synthesis itself provides the information needed for content pruning, leading to an average 3.6 dB improvement in view synthesis quality. Our work thus encourages broader adoption of the MIV standard and further development of neural IBR in this context.
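For reference, the BD-rate figure quoted above compares two rate-distortion curves; below is a compact sketch of the standard Bjøntegaard delta-rate computation (cubic fit of log-rate against quality, averaged over the shared quality range).

```python
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Bjøntegaard delta-rate (%) between two RD curves (4 points each).

    Negative values mean the test codec needs less bitrate than the
    reference for the same quality.
    """
    lr_ref, lr_test = np.log10(rates_ref), np.log10(rates_test)
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))   # shared quality range
    hi = min(max(psnr_ref), max(psnr_test))
    # Integrate each fitted polynomial over the shared interval.
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100
```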