Seeing through Sounds: Predicting Visual Semantic Segmentation Results from Multichannel Audio Signals


Sounds provide us with vast amounts of information about surrounding objects and can even evoke visual images of them. Is it possible to implement this noteworthy human ability on machines? In this paper, we study a new task: predicting image recognition results, in the form of semantic segmentation, from given multichannel audio signals. Our approach uses a convolutional neural network designed to take audio features as input and directly output semantic segmentation results. It incorporates a bilinear feature fusion scheme that efficiently models the underlying higher-order interactions between audio and visual sources. Experimental evaluations on both synthetic and real sound datasets show that our approach can recover the desired segmented images reasonably well.

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)