Speech Inpainting Based on Multi-Layer Long Short-Term Memory Networks

Audio inpainting plays an important role in addressing incomplete, damaged, or missing audio signals, contributing to improved quality of service and overall user experience in multimedia communications over the Internet and mobile networks. This paper presents an innovative solution for speech inpainting using Long Short-Term Memory (LSTM) networks, i.e., a restoration task in which the missing parts of a speech signal are recovered from the preceding information in the time domain. The lost or corrupted segments of the speech signal are also referred to as gaps. We regard the speech inpainting task as a time-series prediction problem in this research work. To address this problem, we designed multi-layer LSTM networks and trained them on different speech datasets. Our study aims to investigate the inpainting performance of the proposed models on different datasets and with varying numbers of LSTM layers, and to explore the effect of multi-layer LSTM networks on the prediction of speech samples in terms of perceived audio quality. The inpainted speech quality is evaluated through the Mean Opinion Score (MOS) and a frequency analysis of the spectrogram. Our proposed multi-layer LSTM models are able to restore gaps of up to 1 s with high perceptual audio quality using features captured from the time domain only. Specifically, for gap lengths under 500 ms, the MOS can reach 3 to 4, and for gap lengths between 500 ms and 1 s, the MOS can reach 2 to 3. In the time domain, the proposed models can proficiently restore the envelope and trend of lost speech signals. In the frequency domain, the proposed models can restore spectrogram blocks with higher similarity to the original signals at frequencies below 2.0 kHz and comparatively lower similarity at frequencies in the range of 2.0 kHz to 8.0 kHz.
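The abstract describes sample-level, time-domain prediction with stacked (multi-layer) LSTMs, where a gap is filled autoregressively from the preceding speech context. The following is a minimal illustrative sketch, not the authors' implementation: it assumes PyTorch, scalar next-sample prediction, and hypothetical layer sizes, context length, and sampling rate (16 kHz).

```python
# Illustrative sketch only (assumed PyTorch): a stacked LSTM that predicts the next
# time-domain speech sample and fills a gap autoregressively from the prior context.
# Hidden size, number of layers, context length, and 16 kHz rate are assumptions.
import torch
import torch.nn as nn

class SpeechInpaintLSTM(nn.Module):
    def __init__(self, hidden_size=256, num_layers=3):
        super().__init__()
        # Multi-layer (stacked) LSTM over scalar time-domain samples.
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)  # next-sample regression head

    def forward(self, x, state=None):
        # x: (batch, time, 1) past samples scaled to [-1, 1]
        y, state = self.lstm(x, state)
        return self.out(y[:, -1, :]), state  # prediction for the next sample

@torch.no_grad()
def inpaint_gap(model, context, gap_len):
    """Autoregressively predict `gap_len` missing samples from the preceding context."""
    model.eval()
    x = context.view(1, -1, 1)      # (1, time, 1)
    _, state = model(x)             # warm up the LSTM state on the known context
    sample = x[:, -1:, :]           # last known sample starts the roll-out
    filled = []
    for _ in range(gap_len):
        pred, state = model(sample, state)
        sample = pred.view(1, 1, 1)       # feed the prediction back in
        filled.append(pred.squeeze())
    return torch.stack(filled)            # (gap_len,) restored samples

# Example: restore a 100 ms gap at an assumed 16 kHz rate from 1 s of prior speech.
model = SpeechInpaintLSTM()
context = torch.randn(16000).clamp(-1.0, 1.0)  # placeholder for real speech samples
restored = inpaint_gap(model, context, gap_len=1600)
print(restored.shape)  # torch.Size([1600])
```

The number of stacked layers (num_layers) mirrors the "multi-layer" aspect the paper investigates; the training procedure, datasets, and exact architecture are described in the full article and are not reproduced here.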

Bibliographic Details
Main Authors: Haohan Shi, Xiyu Shi, Safak Dogan
Format: Article
Language: English
Published: MDPI AG, 2024-02-01
Series: Future Internet, Vol. 16, Issue 2, Article 63
DOI: 10.3390/fi16020063
ISSN: 1999-5903
Author Affiliations: Institute for Digital Technologies, Loughborough University London, Queen Elizabeth Olympic Park, Here East, London E20 3BS, UK
Subjects: speech signal processing; speech inpainting; audio inpainting; long short-term memory; deep learning
Online Access: https://www.mdpi.com/1999-5903/16/2/63