Towards Robust And Practical Neural Video-Conferencing

Video conferencing systems suffer from poor user experience when network conditions deteriorate because current video codecs cannot operate at extremely low bitrates or under lossy network conditions without frame corruption or video freezes. To tackle the low-bitrate problem, several neural alternatives have been proposed that reconstruct talking-head videos from sparse representations of each frame, such as facial landmarks. However, these approaches produce poor reconstructions in scenarios with major movement or occlusions over the course of a call, and they do not scale to higher resolutions. To cope with packet loss, most systems use retransmissions or Forward Error Correction (FEC). Retransmissions are impractical in real-time settings due to their slow turnaround times, while FEC requires extensive tuning to ensure the right level of redundancy. Instead, this dissertation develops a new paradigm for video conferencing, using a suite of generative techniques based on super-resolution and attention mechanisms to improve the video conferencing experience under both classes of poor network conditions.

First, we present Gemino, a new neural compression system for video conferencing based on a novel high-frequency-conditional super-resolution pipeline. Gemino upsamples a very low-resolution version of each target frame while enhancing high-frequency details (e.g., skin texture, hair) based on information extracted from a single high-resolution reference image. This design overcomes the robustness issues of models that rely only on facial landmarks under extreme motion. Gemino uses a multi-scale architecture that runs different components at different resolutions, allowing it to scale to resolutions comparable to 720p. We also personalize the model to learn the specific details of each person, achieving much better fidelity at low bitrates. We implement Gemino atop aiortc, an open-source Python implementation of WebRTC, and show that it operates on 1024x1024 videos in real time on a Titan X GPU, achieving 2.2-5x lower bitrate than traditional video codecs at the same perceptual quality.

Since Gemino is not designed to leverage high-resolution information from multiple references, we further design Gemino (Attention), a version of Gemino that computes "attention," a weighted correspondence between regions of different reference frames and the target frame. This contrasts with Gemino's optical-flow framework, which is restricted to linear translations from regions of a single reference frame to the target region. The attention-based design can instead combine information across references, using the best parts of each reference frame to improve the fidelity of the reconstruction.

Lastly, we develop Reparo, a loss-resilient generative codec for video conferencing that reduces the duration and impact of video freezes during outages. Reparo's compression does not depend on temporal differences across frames, making it less brittle in the event of packet loss. Reparo automatically generates missing information when a frame or part of a frame is lost, based on the data received so far and the model's knowledge of how people look, dress, and interact in the visual world.

Together, these approaches suggest an alternate future for video conferencing, powered by neural codecs that can operate in extremely low-bandwidth scenarios as well as under lossy network conditions to enable a smoother video conferencing experience.
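As a rough illustration of the reference-conditioned reconstruction idea described above, here is a minimal sketch, assuming a PyTorch-style pipeline, of how high-resolution reference features might be warped toward a target frame with a predicted flow field before being fused with the upsampled low-resolution target. The function name and tensor shapes are hypothetical and are not taken from the thesis.

```python
# Minimal sketch (hypothetical; not code from the thesis). Warps high-resolution
# reference features toward the target frame using a predicted dense flow field,
# so the upsampler can borrow high-frequency detail (skin texture, hair, etc.)
# from the reference image.
import torch
import torch.nn.functional as F

def warp_reference(ref_feats: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """ref_feats: (B, C, H, W) reference features; flow: (B, 2, H, W) offsets in pixels."""
    B, C, H, W = ref_feats.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=ref_feats.device, dtype=ref_feats.dtype),
        torch.arange(W, device=ref_feats.device, dtype=ref_feats.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)   # (1, 2, H, W), (x, y) order
    coords = base + flow                               # absolute sampling positions
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    norm_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    norm_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((norm_x, norm_y), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(ref_feats, grid, align_corners=True)
```

In an actual system, the flow field would come from a learned motion estimator and the warped features would feed a decoder that produces the final high-resolution frame; the sketch covers only the warping step.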

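Similarly, a minimal sketch of the attention-style correspondence that Gemino (Attention) uses to draw on multiple references might look as follows; again, names and shapes are assumptions, and a practical implementation would restrict attention to reduced resolutions or local windows to keep the cost manageable.

```python
# Minimal sketch of attention over multiple reference frames (hypothetical;
# not code from the thesis). Each target location takes a softmax-weighted
# blend of features from every location of every reference frame.
import torch

def attend_to_references(target_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
    """target_feats: (B, C, H, W); ref_feats: (B, N, C, H, W) for N reference frames."""
    B, C, H, W = target_feats.shape
    N = ref_feats.shape[1]
    q = target_feats.flatten(2).transpose(1, 2)                       # (B, H*W, C) queries
    kv = ref_feats.permute(0, 1, 3, 4, 2).reshape(B, N * H * W, C)    # keys/values
    attn = torch.softmax(q @ kv.transpose(1, 2) / C ** 0.5, dim=-1)   # (B, H*W, N*H*W)
    out = attn @ kv                                                   # (B, H*W, C)
    return out.transpose(1, 2).reshape(B, C, H, W)
```

Unlike a flow field tied to a single reference, the softmax weights let different target regions borrow from different reference frames, which is what allows the best parts of each reference to contribute to the reconstruction.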

Bibliographic Details
Main Author: Sivaraman, Vibhaalakshmi
Other Authors: Alizadeh, Mohammad
Format: Thesis
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: Ph.D.
Published: Massachusetts Institute of Technology, 2024
Rights: In Copyright - Educational Use Permitted; copyright retained by author(s) (https://rightsstatements.org/page/InC-EDU/1.0/)
Online Access: https://hdl.handle.net/1721.1/153826
https://orcid.org/0000-0001-8842-4497