Abstract: In this paper, we address the problem of human activity detection in long, temporally untrimmed video sequences, where the goal is to classify and temporally localize each activity instance in the input video. Inspired by the recent success of single-stage object detection methods, we propose an end-to-end trainable framework that learns task-specific spatio-temporal features of a video sequence for direct classification and localization of activities. We further systematically investigate how and where to fuse multi-stream feature representations of a video, and we propose a new fusion strategy for temporal activity detection. Together with the proposed fusion strategy, the novel architecture sets a new state of the art on the highly challenging THUMOS'14 benchmark -- up from 44.2% to 53.9% mAP (an absolute 9.7% improvement).
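The abstract does not specify the fusion mechanism itself, so as a point of orientation only, the following is a minimal sketch of one common form of multi-stream fusion: channel-wise concatenation of two feature streams (e.g., RGB and optical flow) followed by a learned projection. The module name `TwoStreamFusion` and all tensor shapes are hypothetical and not taken from the paper.

```python
# Minimal sketch (not the paper's method): fusing two feature streams by
# channel-wise concatenation, then projecting back with a 1x1 temporal conv.
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):  # hypothetical module name
    def __init__(self, channels: int):
        super().__init__()
        # Project the concatenated streams back to the original width.
        self.proj = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, flow_feat: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, channels, time). Concatenate along channels.
        fused = torch.cat([rgb_feat, flow_feat], dim=1)
        return self.proj(fused)

# Usage: fuse assumed 512-channel features over 128 temporal steps.
fusion = TwoStreamFusion(channels=512)
rgb = torch.randn(2, 512, 128)
flow = torch.randn(2, 512, 128)
out = fusion(rgb, flow)  # shape: (2, 512, 128)
```

Where such fusion is applied (early, mid, or late in the network) is exactly the design question the paper investigates; the sketch above shows only one arbitrary placement.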