Read and attend: temporal localisation in sign language videos
The objective of this work is to annotate sign instances across a broad vocabulary in continuous sign language. On a large-scale collection of signing footage with weakly-aligned subtitles, we train a Transformer model to ingest a continuous signing stream and output a sequence of written tokens. We...
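The record itself contains no code; as a rough illustration of the pipeline the abstract describes, the sketch below shows a generic Transformer encoder-decoder that consumes per-frame video features from a signing stream and autoregressively decodes written tokens. Every concrete choice here (feature dimension, vocabulary size, layer counts, the PyTorch framing) is an assumption made for illustration, not the authors' implementation.

```python
# Hedged sketch, not the authors' model: a seq2seq Transformer that maps
# continuous signing features to written tokens. All sizes are assumptions.
import torch
import torch.nn as nn

class SigningStreamTransformer(nn.Module):
    def __init__(self, feat_dim=1024, d_model=512, vocab_size=2000,
                 nhead=8, num_layers=4, max_len=1024):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)    # video features -> model width
        self.pos_emb = nn.Embedding(max_len, d_model)     # learned positional embeddings
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # written-token embeddings
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)         # per-step token logits

    def forward(self, feats, tgt_tokens):
        # feats: (B, T, feat_dim) continuous signing features; tgt_tokens: (B, L)
        T, L = feats.shape[1], tgt_tokens.shape[1]
        src = self.input_proj(feats) + self.pos_emb(torch.arange(T, device=feats.device))
        tgt = self.tok_emb(tgt_tokens) + self.pos_emb(torch.arange(L, device=feats.device))
        # Causal mask: each output token attends only to earlier output tokens.
        mask = self.transformer.generate_square_subsequent_mask(L).to(feats.device)
        h = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(h)  # (B, L, vocab_size)

# Toy usage: 2 clips of 100 feature frames each, decoding 12 tokens.
model = SigningStreamTransformer()
feats = torch.randn(2, 100, 1024)
tokens = torch.randint(0, 2000, (2, 12))
logits = model(feats, tokens)  # -> torch.Size([2, 12, 2000])
```

At inference such a model would be run autoregressively, feeding back each emitted token; the hypothetical dimensions above would in practice come from whatever frame-level feature extractor precedes the Transformer.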
Main Authors: Varol, G; Momeni, L; Albanie, S; Afouras, T; Zisserman, A
Format: Conference item
Language: English
Published: IEEE, 2021
Similar Items
- Aligning subtitles in sign language videos, by: Bull, H, et al. Published: (2022)
- Scaling up sign spotting through sign language dictionaries, by: Varol, G, et al. Published: (2022)
- Watch, read and lookup: learning to spot signs from multiple supervisors, by: Momeni, L, et al. Published: (2021)
- Automatic dense annotation of large-vocabulary sign language videos, by: Momeni, L, et al. Published: (2022)
- Weakly-supervised fingerspelling recognition in British Sign Language videos, by: Prajwal, KR, et al. Published: (2022)