The field of audio-visual event localisation and scene understanding explores how systems can jointly analyse auditory and visual cues to accurately identify, segment and classify events within ...