
Visual styles 7

Learning Visual Styles from Audio-Visual Associations

Audio-driven image stylization. We manipulate the style of an image to match a sound. After training with an unlabeled dataset of egocentric hiking videos, our model learns visual styles for a variety of ambient sounds, such as light and heavy rain, as well as physical interactions, such as footsteps.

From the patter of rain to the crunch of snow, the sounds we hear often reveal the visual textures within a scene. In this paper, we present a method for learning visual styles from paired audio-visual data. Our model learns to manipulate the texture of a scene to match a sound, a problem we term audio-driven image stylization. Given a dataset of paired, unlabeled audio-visual data, the model learns to manipulate input images such that, after manipulation, they are more likely to co-occur with the input sounds. In quantitative and qualitative evaluations, our sound-based model outperforms label-based approaches.
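The core training signal described here is a co-occurrence objective: after restyling, an image should pair naturally with its conditioning sound. The sketch below is one minimal way to instantiate that idea, assuming an InfoNCE-style contrastive loss and simple convolutional encoders; it is not the paper's actual architecture or loss, and the names AudioEncoder, ImageEncoder, Stylizer, and co_occurrence_loss are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Maps a log-mel spectrogram (B, 1, 128, T) to a unit embedding (B, D)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, spec):
        return F.normalize(self.net(spec), dim=-1)

class ImageEncoder(nn.Module):
    """Maps an RGB image (B, 3, H, W) to a unit embedding (B, D)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, img):
        return F.normalize(self.net(img), dim=-1)

class Stylizer(nn.Module):
    """Restyles an image conditioned on an audio embedding (residual output)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, img, a):
        # Broadcast the audio embedding over the spatial grid and predict
        # a residual that retextures the input image.
        b, _, h, w = img.shape
        cond = a[:, :, None, None].expand(b, a.shape[1], h, w)
        return torch.tanh(img + self.net(torch.cat([img, cond], dim=1)))

def co_occurrence_loss(img_emb, aud_emb, temperature=0.07):
    """InfoNCE over the batch: each stylized image should match its own sound
    rather than the other sounds (an assumed stand-in for 'more likely to
    co-occur')."""
    logits = img_emb @ aud_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy forward/backward pass on random tensors to show the wiring.
audio_enc, image_enc, stylizer = AudioEncoder(), ImageEncoder(), Stylizer()
imgs = torch.rand(4, 3, 64, 64)      # batch of input images
specs = torch.rand(4, 1, 128, 64)    # paired audio spectrograms
a = audio_enc(specs)
styled = stylizer(imgs, a)
loss = co_occurrence_loss(image_enc(styled), a)
loss.backward()

Because the objective only asks stylized images to co-occur with sounds from the same clips, no texture labels are ever needed, which is how this setup could outperform label-based pipelines that require annotated style categories.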