Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Visual voice activity detection (V-VAD) uses visual features to predict whether or not a person is speaking. V-VAD is useful whenever audio VAD (A-VAD) is ineffective, either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, the available datasets used for training and testing V-VAD lack content variability. We introduce a novel methodology to automatically create and annotate very large in-the-wild datasets, based on combining A-VAD and face detection. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with such a dataset.
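The abstract only states that annotation combines A-VAD with face detection and does not give the exact rule. As a minimal sketch of how such a combination could work, the snippet below assumes that a frame is kept only when exactly one face is visible, so that detected speech can be attributed to that face. The function name `auto_label_frames`, its inputs, and the single-face rule are illustrative assumptions, not the authors' published procedure.

```python
from typing import List, Optional


def auto_label_frames(
    audio_speech: List[bool], face_counts: List[int]
) -> List[Optional[int]]:
    """Assign a per-frame V-VAD label from two automatic signals.

    audio_speech[i] is the A-VAD decision for frame i (assumed input).
    face_counts[i] is the number of faces detected in frame i (assumed input).
    Returns 1 (speaking), 0 (silent), or None (frame discarded).
    """
    if len(audio_speech) != len(face_counts):
        raise ValueError("inputs must be aligned frame by frame")

    labels: List[Optional[int]] = []
    for speech, n_faces in zip(audio_speech, face_counts):
        if n_faces != 1:
            # Assumption: with zero or several faces, speech cannot be
            # attributed to a single person, so the frame is discarded.
            labels.append(None)
        else:
            labels.append(1 if speech else 0)
    return labels


labels = auto_label_frames([True, False, True, False], [1, 1, 2, 0])
```

Discarding ambiguous frames rather than guessing a label trades dataset size for label reliability, which is one plausible design choice when no manual annotation is available.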

Domains

Computer Vision and Pattern Recognition [cs.CV]; Signal and Image Processing [eess.SP]; Machine Learning [cs.LG]; Sound [cs.SD]

Main file

GUY_ICPR2020_sub.pdf (6.38 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-02882229

Submitted on: Wednesday, September 23, 2020, 11:39:19

Last modified on: Tuesday, December 19, 2023, 10:57:32

Dates and versions

hal-02882229 , version 1 (26-06-2020)

hal-02882229 , version 2 (23-09-2020)

hal-02882229 , version 3 (16-10-2020)

hal-02882229 , version 4 (16-10-2020)

Identifiers

HAL Id: hal-02882229, version 2

Cite

Sylvain Guy, Stéphane Lathuilière, Pablo Mesejo, Radu Horaud. Learning Visual Voice Activity Detection with an Automatically Annotated Dataset. International Conference on Pattern Recognition, Jan 2021, Milano, Italy. ⟨hal-02882229v2⟩
