Media Scene Learning: A novel framework for automatically extracting meaningful parts from audio and video signals

Abstract

We describe a novel framework called Media Scene Learning (MSL) for automatically extracting key components from media signals, such as the sound of a single instrument from an audio signal or a target object from a video signal. In particular, we introduce two methods: 1) the Composite Auto-Regressive System (CARS), which decomposes audio signals into several sound components on the basis of a generative model of sounds, and 2) Saliency-Based Image Learning (SBIL), which extracts object-like regions from a video signal on the basis of the characteristics of the human visual system.

Publication
NTT Technical Review
