SWAG-V: Explanations for Video Using Superpixels Weighted by Average Gradients

Abstract

CNN architectures that take video as input are often overlooked in the development of explanation techniques, despite their use in critical domains such as surveillance and healthcare. To be successful, explanation techniques developed for these networks must take the additional temporal dimension into account. In this paper we introduce SWAG-V, an extension of SWAG for networks that take video as input. We also show how these explanations can be constructed so that they balance fine and coarse detail. By creating superpixels that span the frames of the input video, we produce explanations that better locate the regions of the input that are important to the network's prediction. We compare SWAG-V against a number of similar techniques using the insertion, deletion, and weak localisation metrics. We compute these on Kinetics-400 with both the C3D and R(2+1)D network architectures and find that SWAG-V outperforms multiple techniques.
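As a rough illustration of the idea named in the title, the sketch below weights each spatio-temporal superpixel by the average gradient magnitude it contains. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `superpixel_average_saliency` is hypothetical, the superpixel label volume and the gradient volume are assumed to be precomputed arrays of shape (frames, height, width), and the abstract does not specify how SWAG-V generates either of them.

```python
import numpy as np

def superpixel_average_saliency(labels, grads):
    """Assign every voxel the mean absolute gradient of its superpixel.

    labels: non-negative integer array, shape (T, H, W), superpixel ids
            (assumed to be precomputed and to span multiple frames).
    grads:  float array, same shape, gradient of the class score with
            respect to the input (assumed already reduced over channels).
    """
    labels = np.asarray(labels)
    grads = np.abs(np.asarray(grads, dtype=float))
    if labels.shape != grads.shape:
        raise ValueError("labels and grads must have the same shape")

    flat = labels.ravel()
    # Sum of gradient magnitudes and voxel count per superpixel id.
    sums = np.bincount(flat, weights=grads.ravel())
    counts = np.bincount(flat)
    # Ids that never occur keep a mean of zero instead of dividing by zero.
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    # Broadcast each superpixel's mean back onto its voxels.
    return means[labels]
```

Because the labels are indexed over the whole (frames, height, width) volume, a superpixel that persists across frames receives a single weight, which is one way to keep an explanation temporally consistent.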

Publication
In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
Kirill Sidorov
Lecturer