Showing 1 - 12 results of 12 for search '"Youtube"', query time: 0.06s
  1.

    Synchformer: efficient synchronization from sparse cues by Iashin, V, Xie, W, Rahtu, E, Zisserman, A

    Published 2024
    “…Our objective is audio-visual synchronization with a focus on ‘in-the-wild’ videos, such as those on YouTube, where synchronization cues can be sparse. …”
    Conference item
  2.

    Spot the conversation: Speaker diarisation in the wild by Chung, JS, Huh, J, Nagrani, A, Afouras, T, Zisserman, A

    Published 2020
    “…First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. …”
    Conference item
  3.

    QUERYD: a video dataset with high-quality text and audio narrations by Oncescu, A-M, Henriques, JF, Liu, Y, Zisserman, A, Albanie, S

    Published 2021
    “…The dataset is based on YouDescribe [1], a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos. This ever-growing collection of videos contains highly detailed, temporally aligned audio and text annotations. …”
    Conference item
  4.

    End-to-end learning of visual representations from uncurated instructional videos by Miech, A, Alayrac, J-B, Smaira, L, Laptev, I, Sivic, J, Zisserman, A

    Published 2020
    “…We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). …”
    Conference item
  5.

    VoxCeleb: a large-scale speaker identification dataset by Nagrani, A, Chung, J, Zisserman, A

    Published 2017
    “…Our pipeline involves obtaining videos from YouTube; performing active speaker verification using a two-stream synchronization Convolutional Neural Network (CNN), and confirming the identity of the speaker using CNN based facial recognition. …”
    Conference item
  6.

    Condensed movies: story based retrieval with contextual embeddings by Bain, M, Nagrani, A, Brown, A, Zisserman, A

    Published 2021
    “…The dataset is scalable, obtained automatically from YouTube, and is freely available for anybody to download and use. …”
    Conference item
  7.

    Exploiting temporal context for 3D human pose estimation in the wild by Arnab, A, Doersch, C, Zisserman, A

    Published 2020
    “…Building upon our algorithm, we present a new dataset of more than 3 million frames of YouTube videos from Kinetics with automatically generated 3D poses and meshes. …”
    Conference item
  8.

    Deep audio-visual speech recognition by Afouras, T, Chung, J, Senior, A, Vinyals, O, Zisserman, A

    Published 2018
    “…Both models are built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release two new datasets for audio-visual speech recognition: LRS2-BBC, consisting of thousands of natural sentences from British television; and LRS3-TED, consisting of hundreds of hours of TED and TEDx talks obtained from YouTube. The models that we train surpass the performance of all previous work on lip reading benchmark datasets by a significant margin.…”
    Journal article
  9.

    Voxceleb: large-scale speaker verification in the wild by Nagrani, A, Chung, JS, Xie, W, Zisserman, A

    Published 2019
    “…Our pipeline involves obtaining videos from YouTube; performing active speaker verification using a two-stream synchronization Convolutional Neural Network (CNN), and confirming the identity of the speaker using CNN based facial recognition. …”
    Journal article
  10.

    Quo Vadis, action recognition? A new model and the kinetics dataset by Carreira, J, Zisserman, A

    Published 2017
    “…Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. …”
    Conference item
  11.

    Personalizing human video pose estimation by Charles, J, Pfister, T, Magee, D, Hogg, D, Zisserman, A

    Published 2016
    “…Our method outperforms the state of the art (including top ConvNet methods) by a large margin on three standard benchmarks, as well as on a new challenging YouTube video dataset. Furthermore, we show that training from the automatically generated annotations can be used to improve the performance of a generic ConvNet on other benchmarks.…”
    Conference item
  12.

    Self-supervised and cross-modal learning from videos by Koepke, AS

    Published 2019
    “…We curate new datasets of violin and piano playing which consist of video recordings in constrained settings and of in-the-wild videos downloaded from YouTube. For violin playing, the in-the-wild videos exhibit significant variation in viewpoints and body pose of the violinists; for piano playing we only consider top-view recordings.…”
    Thesis