Abstract: In this paper, we address the problem of human activity detection in long, temporally untrimmed video sequences, where the goal is to classify and temporally localize each activity instance in the input video. Inspired by the recent success of single-stage object detection methods, we propose an end-to-end trainable framework that learns task-specific spatio-temporal features of a video sequence for direct classification and localization of activities. We further systematically investigate how and where to fuse multi-stream feature representations of a video, and we propose a new fusion strategy for temporal activity detection. Together with the proposed fusion strategy, the novel architecture sets a new state of the art on the highly challenging THUMOS'14 benchmark -- up from 44.2% to 53.9% mAP (an absolute 9.7% improvement).
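The abstract does not specify the fusion mechanism itself, so as a point of orientation only, the following is a minimal sketch of one common form of multi-stream fusion: channel-wise concatenation of two feature streams (e.g., RGB and optical flow) followed by a learned projection. The module name `TwoStreamFusion` and all tensor shapes are hypothetical and not taken from the paper.

```python
# Minimal sketch (not the paper's method): fusing two feature streams by
# channel-wise concatenation, then projecting back with a 1x1 temporal conv.
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):  # hypothetical module name
    def __init__(self, channels: int):
        super().__init__()
        # Project the concatenated streams back to the original width.
        self.proj = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, flow_feat: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, channels, time). Concatenate along channels.
        fused = torch.cat([rgb_feat, flow_feat], dim=1)
        return self.proj(fused)

# Usage: fuse assumed 512-channel features over 128 temporal steps.
fusion = TwoStreamFusion(channels=512)
rgb = torch.randn(2, 512, 128)
flow = torch.randn(2, 512, 128)
out = fusion(rgb, flow)  # shape: (2, 512, 128)
```

Where such fusion is applied (early, mid, or late in the network) is exactly the design question the paper investigates; the sketch above shows only one arbitrary placement.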