
1 Introduction

Nowadays, with the rapid development of digital cameras and the Internet, the number of videos is growing continuously, making automatic video content analysis methods widely required. One major branch of video analysis is action recognition, which aims to classify manually trimmed video clips containing only one action instance. However, videos in real scenarios are usually long, untrimmed and contain multiple action instances along with irrelevant content. This problem calls for another challenging task: temporal action detection, which aims to detect action instances in untrimmed video, including both their temporal boundaries and action classes. It can be applied in many areas such as video recommendation and smart surveillance.

Fig. 1. Overview of our approach. Given an untrimmed video, (1) we evaluate the boundary and actionness probabilities of each temporal location and generate proposals based on boundary probabilities, and (2) we evaluate the confidence scores of proposals with proposal-level features to retrieve proposals.

Similar to object detection in the spatial domain, the temporal action detection task can be divided into two stages: proposal generation and classification. The proposal generation stage aims to generate temporal video regions that may contain action instances, and the classification stage aims to classify candidate proposals into action classes. Although classification methods have reached convincing performance, detection precision is still low on many benchmarks [6, 22]. Thus temporal action proposal generation has recently received much attention [4, 5, 9, 13], aiming to improve detection performance by improving the quality of proposals. High quality proposals should have two key properties: (1) proposals can cover ground truth action regions with both high recall and high temporal overlap, and (2) proposals are ranked reliably so that high recall and high overlap can be achieved using fewer proposals, reducing the computation cost of succeeding steps.

To achieve high proposal quality, a proposal generation method should generate proposals with flexible temporal durations and precise temporal boundaries, then retrieve proposals with reliable confidence scores, which indicate the probability that a proposal contains an action instance. Most recent proposal generation methods [4, 5, 9, 32] generate proposals via sliding temporal windows of multiple durations over the video at regular intervals, then train a model to evaluate the confidence scores of generated proposals for retrieval, while another method [13] additionally performs boundary regression. However, proposals generated with pre-defined durations and intervals have some major drawbacks: (1) they are usually not temporally precise; (2) they are not flexible enough to cover ground truth action instances of variable temporal duration, especially when the range of durations is large.

To address these issues and generate high quality proposals, we propose the Boundary-Sensitive Network (BSN), which adopts a “local to global” fashion to locally combine high probability boundaries into proposals and globally retrieve candidate proposals using proposal-level features, as shown in Fig. 1. In detail, BSN generates proposals in three steps. First, BSN evaluates the probability of each temporal location in the video being inside or outside, at or not at the boundaries of ground truth action instances, to generate starting, ending and actionness probability sequences as local information. Second, BSN generates proposals by directly combining temporal locations with high starting and ending probabilities separately. Using this bottom-up fashion, BSN can generate proposals with flexible durations and precise boundaries. Finally, using features composed of actionness scores within and around each proposal, BSN retrieves proposals by evaluating the confidence that a proposal contains an action. These proposal-level features offer global information for better evaluation.

In summary, the main contributions of our work are threefold:

  1. We introduce a new architecture (BSN) based on a “local to global” fashion to generate high quality temporal action proposals, which locally locates temporal positions with high boundary probabilities to achieve precise proposal boundaries and globally evaluates proposal-level features to obtain reliable confidence scores for retrieval.

  2. Extensive experiments demonstrate that our method achieves significantly better proposal quality than other state-of-the-art proposal generation methods, and can generate proposals of comparable quality for unseen action classes.

  3. Integrating our method with an existing action classifier into a detection framework leads to significantly improved performance on the temporal action detection task.

2 Related Work

Action Recognition. Action recognition is an important branch of video-related research and has been extensively studied. Earlier methods such as improved Dense Trajectories (iDT) [38, 39] mainly adopt hand-crafted features such as HOF, HOG and MBH. In recent years, convolutional networks have been widely adopted in many works [10, 33, 35, 41] and have achieved great performance. Typically, two-stream networks [10, 33, 41] learn appearance and motion features based on RGB frames and optical flow fields separately. The C3D network [35] adopts 3D convolutional layers to directly capture both appearance and motion features from raw frame volumes. Action recognition models can be used for extracting frame-level or snippet-level visual features from long and untrimmed videos.

Object Detection and Proposals. In recent years, the performance of object detection has been significantly improved with deep learning methods. R-CNN [17] and its variants [16, 30] constitute an important branch of object detection methods, which adopt a “detection by classifying proposals” framework. For the proposal generation stage, besides sliding windows [11], earlier works also attempt to generate proposals by exploiting low-level cues such as HOG and Canny edges [37, 50]. Recently some methods [25, 28, 30] adopt deep learning models to generate proposals with faster speed and stronger modelling capacity. In this work, we combine the properties of these methods by evaluating the boundary and actionness probabilities of each location using a neural network and adopting a “local to global” fashion to generate proposals with high recall and accuracy.

Boundary probabilities are also adopted in LocNet [15] for revising the horizontal and vertical boundaries of existing proposals. Our method differs in that (1) BSN aims to generate proposals while LocNet aims to revise them, and (2) boundary probabilities are calculated repeatedly for all boxes in LocNet but only once per video in BSN.

Temporal Action Detection and Proposals. The temporal action detection task aims to detect action instances in untrimmed videos, including their temporal boundaries and action classes, and can be divided into proposal and classification stages. Most detection methods [32, 34, 49] handle these two stages separately, while other methods [3, 26] handle them jointly. For proposal generation, earlier works [23, 29, 40] directly use sliding windows as proposals. Recently some methods [4, 5, 9, 13, 32] generate proposals with pre-defined temporal durations and intervals, and use various techniques to evaluate the confidence scores of proposals, such as dictionary learning [5] and recurrent neural networks [9]. The TAG method [49] adopts a watershed algorithm to generate proposals with flexible boundaries and durations in a local fashion, but without global proposal-level confidence evaluation for retrieval. In our work, BSN generates proposals with flexible boundaries along with reliable confidence scores for retrieval.

Recently, the temporal action detection method [48] detects action instances based on class-wise starting, middle and ending probabilities of each location. Our method is superior to [48] in two aspects: (1) BSN evaluates probability scores using temporal convolution to better capture temporal information, and (2) the “local to global” fashion adopted in BSN brings more precise boundaries and better retrieval quality.

3 Our Approach

3.1 Problem Definition

An untrimmed video sequence can be denoted as \(X=\left\{ x_n \right\} _{n=1}^{l_v}\) with \(l_v\) frames, where \(x_n\) is the n-th frame of X. The annotation of video X is composed of a set of action instances \(\varPsi _g = \left\{ \varphi _n=\left( t_{s,n},t_{e,n} \right) \right\} _{n=1}^{N_g}\), where \(N_g\) is the number of ground truth action instances in video X, and \(t_{s,n}\), \(t_{e,n}\) are the starting and ending times of action instance \(\varphi _n\) respectively. Unlike the detection task, classes of action instances are not considered in temporal action proposal generation. The annotation set \(\varPsi _g\) is used during training. During prediction, the generated proposal set \(\varPsi _p\) should cover \(\varPsi _g\) with high recall and high temporal overlap.

Fig. 2. The framework of our approach. (a) A two-stream network is used for encoding snippet-level visual features. (b) The architecture of the Boundary-Sensitive Network: the temporal evaluation module handles the input feature sequence and evaluates the starting, ending and actionness probabilities of each temporal location; the proposal generation module generates proposals with high starting and ending probabilities and constructs the Boundary-Sensitive Proposal (BSP) feature for each proposal; the proposal evaluation module evaluates the confidence score of each proposal using its BSP feature. (c) Finally, we use the Soft-NMS algorithm to suppress redundant proposals by decaying their scores.

3.2 Video Features Encoding

To generate proposals for an input video, we first need to extract features that encode its visual content. In our framework, we adopt the two-stream network [33] as visual encoder, since this architecture has shown great performance in action recognition [42] and has been widely adopted in temporal action detection and proposal generation tasks [12, 26, 49]. The two-stream network contains two branches: a spatial network operating on single RGB frames to capture appearance features, and a temporal network operating on stacked optical flow fields to capture motion information.

To extract two-stream features, as shown in Fig. 2(a), we first compose a snippet sequence \(S=\left\{ s_n \right\} _{n=1}^{l_s}\) from video X, where \(l_s\) is the length of the snippet sequence. A snippet \(s_n=(x_{t_n}, o_{t_n})\) includes two parts: \(x_{t_n}\) is the \(t_n\)-th RGB frame in X and \(o_{t_n}\) is the stacked optical flow field derived around center frame \(x_{t_n}\). To reduce the computation cost, we extract snippets with a regular frame interval \(\sigma \), therefore \(l_s=l_v/\sigma \). Given a snippet \(s_n\), we concatenate the output scores from the top layers of the spatial and temporal networks to form the encoded feature vector \(f_{t_n}=(f_{S,t_n},f_{T,t_n})\), where \(f_{S,t_n}\), \(f_{T,t_n}\) are the output scores of the spatial and temporal networks respectively. Thus given a snippet sequence S with length \(l_s\), we can extract a feature sequence \(F=\left\{ f_{t_n} \right\} _{n=1}^{l_s}\). These two-stream feature sequences are used as the input of BSN.
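As a concrete illustration, the following is a minimal Python/numpy sketch of how the snippet-level two-stream features could be assembled once per-snippet scores are available; the function name and the 200-dimensional score vectors per stream are hypothetical, not specified in this paper.

```python
import numpy as np

def encode_feature_sequence(spatial_scores, temporal_scores):
    """Concatenate per-snippet top-layer scores of the spatial and temporal
    networks into one feature sequence F of shape (l_s, C_S + C_T).
    Both inputs are assumed to be extracted beforehand at snippet interval sigma."""
    assert spatial_scores.shape[0] == temporal_scores.shape[0]
    return np.concatenate([spatial_scores, temporal_scores], axis=1)

# e.g. 120 snippets with hypothetical 200-dim score vectors from each stream
F = encode_feature_sequence(np.random.rand(120, 200), np.random.rand(120, 200))
```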

3.3 Boundary-Sensitive Network

To achieve high proposal quality with both precise temporal boundaries and reliable confidence scores, we adopt a “local to global” fashion to generate proposals. In BSN, we first generate candidate boundary locations, then combine these locations into proposals and evaluate the confidence score of each proposal with proposal-level features.

Network Architecture. The architecture of BSN is presented in Fig. 2(b) and contains three modules: temporal evaluation, proposal generation and proposal evaluation. The temporal evaluation module is a three-layer temporal convolutional neural network, which takes the two-stream feature sequences as input and evaluates the probability of each temporal location in the video being inside or outside, at or not at the boundaries of ground truth action instances, to generate sequences of starting, ending and actionness probabilities respectively. The proposal generation module first combines temporal locations with separately high starting and ending probabilities into candidate proposals, then constructs a Boundary-Sensitive Proposal (BSP) feature for each candidate proposal based on the actionness probability sequence. Finally, the proposal evaluation module, a multilayer perceptron with one hidden layer, evaluates the confidence score of each candidate proposal based on its BSP feature. The confidence score and boundary probabilities of each proposal are fused into the final confidence score used for retrieval.

Temporal Evaluation Module. The goal of the temporal evaluation module is to evaluate the starting, ending and actionness probabilities of each temporal location, for which three binary classifiers are needed. In this module, we adopt temporal convolutional layers on top of the feature sequence, which have good modelling capacity to capture local semantic information such as boundary and actionness probabilities.

A temporal convolutional layer can be denoted as \(Conv(c_f,c_k,Act)\), where \(c_f\), \(c_k\) and Act are the number of filters, kernel size and activation function of the layer respectively. As shown in Fig. 2(b), the temporal evaluation module can be defined as \(Conv(512,3,Relu)\rightarrow Conv(512,3,Relu)\rightarrow Conv(3,1,Sigmoid)\), where all three layers have stride 1. The three filters with sigmoid activation in the last layer serve as classifiers generating the starting, ending and actionness probabilities respectively. For convenience of computation, we divide the feature sequence into non-overlapping windows as the input of the temporal evaluation module. Given a feature sequence F, the temporal evaluation module generates three probability sequences \(P_S=\left\{ p^s_{t_n} \right\} _{n=1}^{l_s}\), \(P_E=\left\{ p^e_{t_n} \right\} _{n=1}^{l_s}\) and \(P_A=\left\{ p^a_{t_n} \right\} _{n=1}^{l_s}\), where \(p^s_{t_n}\), \(p^e_{t_n}\) and \(p^a_{t_n}\) are respectively the starting, ending and actionness probabilities at time \(t_n\).
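For illustration, here is a minimal sketch of this three-layer temporal convolutional module in tf.keras (BSN is implemented in TensorFlow); the input feature dimension and the 'same' padding are assumptions of this sketch, not details given in the paper.

```python
import tensorflow as tf

def build_temporal_evaluation_module(feat_dim=400, window_len=100):
    """Sketch of the temporal evaluation module:
    Conv(512,3,ReLU) -> Conv(512,3,ReLU) -> Conv(3,1,Sigmoid), all stride 1.
    feat_dim=400 and 'same' padding are assumptions of this sketch."""
    inputs = tf.keras.Input(shape=(window_len, feat_dim))  # feature window F_w
    x = tf.keras.layers.Conv1D(512, 3, strides=1, padding='same',
                               activation='relu')(inputs)
    x = tf.keras.layers.Conv1D(512, 3, strides=1, padding='same',
                               activation='relu')(x)
    # three sigmoid filters: starting, ending and actionness probabilities
    probs = tf.keras.layers.Conv1D(3, 1, strides=1, padding='same',
                                   activation='sigmoid')(x)
    return tf.keras.Model(inputs, probs)

tem = build_temporal_evaluation_module()
# output shape: (batch, 100, 3), i.e. P_S, P_E, P_A per temporal location
```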

Fig. 3. Details of the proposal generation module. (a) Generating proposals. First, to generate candidate boundary locations, we choose temporal locations with high boundary probability or locations that are probability peaks. Then we combine candidate starting and ending locations into proposals when their duration satisfies the duration condition. (b) Constructing the BSP feature. Given a proposal and the actionness probability sequence, we sample the actionness sequence in the starting, center and ending regions of the proposal to construct the BSP feature.

Proposal Generation Module. The goal of the proposal generation module is to generate candidate proposals and construct the corresponding proposal-level features. We achieve this goal in two steps. First we locate temporal locations with high boundary probabilities and combine these locations to form proposals. Then for each proposal we construct a Boundary-Sensitive Proposal (BSP) feature.

As shown in Fig. 3(a), to locate where an action is likely to start, for the starting probability sequence \(P_S\) we record all temporal locations \(t_n\) where \(p^s_{t_n}\) (1) has a high score: \(p^s_{t_n}> 0.9\), or (2) is a probability peak: \(p^s_{t_n}> p^s_{t_{n-1}}\) and \(p^s_{t_n}> p^s_{t_{n+1}}\). These locations are grouped into the candidate starting location set \(B_S=\left\{ t_{s,i} \right\} _{i=1}^{N_S}\), where \(N_S\) is the number of candidate starting locations. Using the same rules, we generate the candidate ending location set \(B_E\) from the ending probability sequence \(P_E\). Then, we generate temporal regions by combining each starting location \(t_{s}\) from \(B_S\) with each ending location \(t_{e}\) from \(B_E\). Any temporal region \(\left[ t_s, t_e \right] \) satisfying \( d=t_e- t_s \in [d_{min}, d_{max}]\) is kept as a candidate proposal \(\varphi \), where \(d _{min}\) and \(d_{max}\) are the minimum and maximum durations of ground truth action instances in the dataset. Thus we obtain the candidate proposal set \(\varPsi _p=\left\{ \varphi _i \right\} _{i=1}^{N_p}\), where \(N_p\) is the number of proposals.
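The boundary grouping and combination rules above can be summarized in a short Python sketch; the function names are hypothetical, and the probabilities are assumed to be given as 1-D arrays indexed by snippet location.

```python
import numpy as np

def candidate_locations(prob, high_thres=0.9):
    """Pick locations that either exceed the high-score threshold (0.9 in the
    paper) or are a local probability peak."""
    idx = []
    for t in range(len(prob)):
        is_high = prob[t] > high_thres
        is_peak = 0 < t < len(prob) - 1 and prob[t] > prob[t - 1] and prob[t] > prob[t + 1]
        if is_high or is_peak:
            idx.append(t)
    return idx

def generate_proposals(p_start, p_end, d_min, d_max):
    """Combine every candidate starting and ending location whose duration
    falls inside [d_min, d_max] (dataset-dependent duration bounds)."""
    starts, ends = candidate_locations(p_start), candidate_locations(p_end)
    return [(ts, te) for ts in starts for te in ends if d_min <= te - ts <= d_max]
```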

To construct the proposal-level feature, as shown in Fig. 3(b), for a candidate proposal \(\varphi \) we denote its center region as \(r_C=[t_s,t_e]\) and its starting and ending regions as \(r_S=[ t_s-d/5,t_s+d/5 ]\) and \(r_E= [ t_e-d/5,t_e+d/5 ]\) respectively. Then we sample the actionness sequence \(P_A\) within \(r_C\) as \(f_{c}^A\) by linear interpolation with 16 points. In the starting and ending regions, we also sample the actionness sequence with 8 linear interpolation points to obtain \(f_{s}^A\) and \(f_{e}^A\) respectively. Concatenating these vectors, we obtain the Boundary-Sensitive Proposal (BSP) feature \(f_{BSP}=(f_s^A,f_c^A,f_e^A)\) of proposal \(\varphi \). The BSP feature is highly compact and contains rich semantic information about the corresponding proposal. We can then represent a proposal as \(\varphi =(t_s,t_e,f_{BSP})\).
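A minimal numpy sketch of the BSP feature construction is given below, assuming the actionness sequence is a 1-D array indexed by snippet location; clamping sample points that fall outside the sequence to its edge values is an implementation choice of this sketch.

```python
import numpy as np

def bsp_feature(p_action, t_s, t_e, n_center=16, n_boundary=8):
    """Sample the actionness sequence by linear interpolation inside the
    center, starting and ending regions of a proposal, then concatenate."""
    d = t_e - t_s
    grid = np.arange(len(p_action))  # snippet-level time axis

    def sample(lo, hi, n):
        # np.interp clamps out-of-range points to the edge values
        return np.interp(np.linspace(lo, hi, n), grid, p_action)

    f_c = sample(t_s, t_e, n_center)                         # center region r_C
    f_s = sample(t_s - d / 5.0, t_s + d / 5.0, n_boundary)   # starting region r_S
    f_e = sample(t_e - d / 5.0, t_e + d / 5.0, n_boundary)   # ending region r_E
    return np.concatenate([f_s, f_c, f_e])                   # 32-dim BSP feature
```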

Proposal Evaluation Module. The goal of the proposal evaluation module is to evaluate, using the BSP feature, the confidence that a proposal contains an action instance within its duration. We adopt a simple multilayer perceptron with one hidden layer, as shown in Fig. 2(b). The hidden layer with 512 units processes the input BSP feature \(f_{BSP}\) with Relu activation. The output layer produces the confidence score \(p_{conf}\) with sigmoid activation, which estimates the overlap between the candidate proposal and ground truth action instances. Thus, a generated proposal can be denoted as \(\varphi =(t_s,t_e,p_{conf},p^s_{t_s},p^e_{t_e})\), where \(p^s_{t_s}\) and \(p^e_{t_e}\) are the starting and ending probabilities at \(t_s\) and \(t_e\) respectively. These scores are fused to generate the final score during prediction.
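A corresponding tf.keras sketch of this multilayer perceptron is shown below; the 32-dimensional BSP input follows the 8+16+8 sampling above, while any layer configuration detail not stated in the paper is an assumption of the sketch.

```python
import tensorflow as tf

def build_proposal_evaluation_module(bsp_dim=32):
    """Sketch of the proposal evaluation module: one hidden layer with 512
    ReLU units and a sigmoid output giving the confidence score p_conf."""
    inputs = tf.keras.Input(shape=(bsp_dim,))
    hidden = tf.keras.layers.Dense(512, activation='relu')(inputs)
    p_conf = tf.keras.layers.Dense(1, activation='sigmoid')(hidden)
    return tf.keras.Model(inputs, p_conf)
```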

3.4 Training of BSN

In BSN, the temporal evaluation module is trained to learn local boundary and actionness probabilities from video features simultaneously. Then, based on the probability sequences generated by the trained temporal evaluation module, we generate proposals and corresponding BSP features and train the proposal evaluation module to learn proposal confidence scores. The training details are introduced in this section.

Temporal Evaluation Module. Given a video X, we compose a snippet sequence S with length \(l_s\) and extract the feature sequence F from it. Then we slide windows of length \(l_w=100\) over the feature sequence without overlap. A window is denoted as \(\omega =\left\{ F_{\omega }, \varPsi _{\omega } \right\} \), where \(F_{\omega }\) and \(\varPsi _{\omega }\) are the feature sequence and annotations within the window respectively. For a ground truth action instance \(\varphi _g=( t_{s},t_{e} )\) in \(\varPsi _{\omega }\), we denote its region as the action region \(r_{g}^a\) and its starting and ending regions as \(r_g^s=[ t_s-d_g/10,t_s+d_g/10 ]\) and \(r_g^e= [ t_e-d_g/10,t_e+d_g/10 ]\) respectively, where \(d_g=t_e-t_s\).

Taking \(F_{\omega }\) as input, the temporal evaluation module generates probability sequences \(P_{S,\omega }\), \(P_{E,\omega }\) and \(P_{A,\omega }\) of the same length \(l_w\). For each temporal location \(t_n\) within \(F_{\omega }\), we denote its region as \(r _{t_n}=[ t_n-d_s/2,t_n+d_s/2 ]\) and obtain the corresponding probability scores \(p^s_{t_n}\), \(p^e_{t_n}\) and \(p^a_{t_n}\) from \(P_{S,\omega }\), \(P_{E,\omega }\) and \(P_{A,\omega }\) respectively, where \(d_s=t_{n}-t_{n-1}\) is the temporal interval between two snippets. Then for each \(r_{t_n}\), we calculate its IoP ratio with \(r_{g}^a\), \(r_g^s\) and \(r_g^e\) of all \(\varphi _g\) in \(\varPsi _{\omega }\) separately, where IoP is defined as the overlap with ground truth divided by the duration of the region itself. Thus we can represent the information of \(t_n\) as \(\phi _{n}=(p^a_{t_n},p^s_{t_n},p^e_{t_n},g^{a}_{t_n},g^{s}_{t_n},g^{e}_{t_n})\), where \(g^{a}_{t_n}\), \(g^{s}_{t_n}\), \(g^{e}_{t_n}\) are the maximum matching IoP overlaps with action, starting and ending regions respectively.
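The label assignment described here can be sketched as follows; the helper names are hypothetical, and ground truth instances are assumed to be given as (start, end) pairs on the same snippet-level time axis as the locations.

```python
import numpy as np

def iop(region, gt_region):
    """Intersection over Proposal: overlap with a ground truth region divided
    by the duration of the (location) region itself."""
    lo = max(region[0], gt_region[0])
    hi = min(region[1], gt_region[1])
    return max(hi - lo, 0.0) / (region[1] - region[0])

def match_location(t_n, d_s, gt_list):
    """Return (g_a, g_s, g_e): maximum IoP of the local region around t_n
    against the action, starting and ending regions of all ground truths."""
    r_tn = (t_n - d_s / 2.0, t_n + d_s / 2.0)
    g_a = g_s = g_e = 0.0
    for (t_s, t_e) in gt_list:
        d_g = t_e - t_s
        g_a = max(g_a, iop(r_tn, (t_s, t_e)))                            # action region
        g_s = max(g_s, iop(r_tn, (t_s - d_g / 10.0, t_s + d_g / 10.0)))  # starting region
        g_e = max(g_e, iop(r_tn, (t_e - d_g / 10.0, t_e + d_g / 10.0)))  # ending region
    return g_a, g_s, g_e
```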

Given the matching information of a window as \(\varPhi _{\omega }=\left\{ \phi _n \right\} _{n=1}^{l_w}\), we define the training objective of this module as a three-task loss function. The overall loss consists of an actionness loss, a starting loss and an ending loss:

$$\begin{aligned} L_{TEM}=\lambda \cdot L_{bl}^{action}+L_{bl}^{start}+L_{bl}^{end}, \end{aligned}$$
(1)

where \(\lambda \) is a weight term set to 2 in BSN. We adopt the class-balanced binary logistic regression loss \(L_{bl}\) for all three tasks, which can be denoted as:

$$\begin{aligned} L_{bl}=-\frac{1}{l_w}\sum _{i=1}^{l_w} \left( \alpha ^{+} \cdot b_i \cdot log(p_i)+\alpha ^{-} \cdot (1-b_i) \cdot log(1-p_i) \right) , \end{aligned}$$
(2)

where \(b_i=sign(g_i-\theta _{IoP})\) is a binarizing function that converts the matching score \(g_i\) to \(\left\{ 0,1 \right\} \) based on the threshold \(\theta _{IoP}\), which is set to 0.5 in BSN. Letting \(l^+=\sum b_i\) and \(l^-=l_w-l^+\), we set \(\alpha ^+=\frac{l_w}{l^+}\) and \(\alpha ^-=\frac{l_w}{l^-}\), which balance the effect of positive and negative samples during training.
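A possible numpy implementation of this class-balanced loss is sketched below; the small epsilon and the guards against empty positive or negative sets are numerical conveniences added here, not part of the formulation above.

```python
import numpy as np

def weighted_logistic_loss(p, g, theta_iop=0.5, eps=1e-8):
    """Class-balanced binary logistic loss of Eq. (2), written as a value to
    minimize. p and g are length-l_w arrays of predicted probabilities and
    matched IoP scores for one task (actionness, starting or ending)."""
    b = (g > theta_iop).astype(np.float32)   # binarized labels b_i
    l_w = float(len(p))
    l_pos = max(b.sum(), 1.0)                # guard against empty classes
    l_neg = max(l_w - b.sum(), 1.0)
    alpha_pos, alpha_neg = l_w / l_pos, l_w / l_neg
    return -np.mean(alpha_pos * b * np.log(p + eps)
                    + alpha_neg * (1 - b) * np.log(1 - p + eps))

# Overall TEM objective of Eq. (1) with lambda = 2:
# L_TEM = 2 * weighted_logistic_loss(p_a, g_a) \
#         + weighted_logistic_loss(p_s, g_s) + weighted_logistic_loss(p_e, g_e)
```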

Proposal Evaluation Module. Using the probability sequences generated by the trained temporal evaluation module, we generate proposals with the proposal generation module: \(\varPsi _p = \left\{ \varphi _n=(t_s,t_e,f_{BSP}) \right\} _{n=1}^{N_p}\). Taking \(f_{BSP}\) as input, the proposal evaluation module generates the confidence score \(p_{conf}\) for each proposal \(\varphi \). We then calculate its Intersection-over-Union (IoU) with all \(\varphi _g\) in \(\varPsi _g\) and denote the maximum overlap as \(g_{iou}\). Thus we can represent the proposal set as \(\varPsi _p = \left\{ \varphi _n =\left\{ t_s,t_e,p_{conf},g_{iou} \right\} \right\} _{n=1}^{N_p}\). We split \(\varPsi _p\) into two parts based on \(g_{iou}\): \(\varPsi _p^{pos}\) for \(g_{iou}>0.7\) and \(\varPsi _p^{neg}\) for \(g_{iou}<0.3\). For data balancing, we take all proposals in \(\varPsi _p^{pos}\) and randomly sample proposals from \(\varPsi _p^{neg}\) to ensure the ratio between the two sets is nearly 1:2.

The training objective of this module is a simple regression loss, used to train precise confidence score predictions based on the IoU overlap. We define it as:

$$\begin{aligned} L_{PEM}=\frac{1}{N_{train}} \sum _{i=1}^{N_{train}}(p_{conf,i}-g_{iou,i})^2{,} \end{aligned}$$
(3)

where \(N_{train}\) is the number of proposals used for training.

3.5 Prediction and Post-processing

During prediction, we use BSN with the same procedures described in training to generate the proposal set \(\varPsi _p = \left\{ \varphi _n=(t_s, t_e, p_{conf}, p^s_{t_s}, p^e_{t_e} ) \right\} _{n=1}^{N_p}\), where \(N_p\) is the number of proposals. To obtain the final proposal set, we fuse the scores into a final confidence score, then suppress redundant proposals based on these scores.

Score Fusion for Retrieval. To achieve better retrieval performance, for each candidate proposal \(\varphi \) we fuse its confidence score with its boundary probabilities by multiplication to get the final confidence score \(p_{f}\):

$$\begin{aligned} p_{f}=p_{conf} \cdot p^s_{t_s} \cdot p^e_{t_e}. \end{aligned}$$
(4)

After score fusion, we obtain the generated proposal set \(\varPsi _p = \left\{ \varphi _n=(t_s,t_e,p_{f} ) \right\} _{n=1}^{N_p}\), where \(p_{f}\) is used for proposal retrieval. In Sect. 4.2, we explore recall performance with and without the confidence score generated by the proposal evaluation module.

Redundant Proposal Suppression. Around a ground truth action instance, we may generate multiple proposals with different temporal overlaps. Thus we need to suppress redundant proposals to obtain higher recall with fewer proposals.

Soft-NMS [2] is a recently proposed non-maximum suppression (NMS) algorithm that suppresses redundant results using a score decaying function. First, all proposals are sorted by their scores. Then the proposal \(\varphi _m\) with the maximum score is used to calculate the IoU overlap with the other proposals, and the scores of highly overlapping proposals are decayed. This step is applied recursively to the remaining proposals to generate the re-scored proposal set. The Gaussian decaying function of Soft-NMS can be denoted as:

$$\begin{aligned} p'_{f,i}=\left\{ \begin{matrix} p_{f,i}, &{} iou(\varphi _m,\varphi _i)<\theta \\ p_{f,i}\cdot e^{-\frac{iou(\varphi _m,\varphi _i)^2}{\varepsilon }}, &{} iou(\varphi _m,\varphi _i)\ge \theta \end{matrix}\right. \end{aligned}$$
(5)

where \(\varepsilon \) is the parameter of the Gaussian function and \(\theta \) is a pre-fixed threshold. After suppression, we obtain the final proposal set \(\varPsi '_p = \left\{ \varphi _n=(t_s,t_e,p'_f ) \right\} _{n=1}^{N_p}\).
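For reference, a minimal Python sketch of Gaussian Soft-NMS following Eq. (5) is given below; proposals are assumed to be (t_s, t_e) pairs with fused scores p_f, and the bookkeeping details are choices of this sketch.

```python
import numpy as np

def soft_nms(proposals, scores, theta=0.8, epsilon=0.75):
    """Gaussian Soft-NMS of Eq. (5): repeatedly pick the top-scored proposal
    and decay the scores of remaining proposals overlapping it above theta.
    theta=0.8 and epsilon=0.75 are the ActivityNet-1.3 settings of the paper."""
    def tiou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    props, scrs, kept = list(proposals), list(scores), []
    while props:
        m = int(np.argmax(scrs))
        best, best_score = props.pop(m), scrs.pop(m)
        kept.append((best, best_score))
        for i, p in enumerate(props):
            ov = tiou(best, p)
            if ov >= theta:                       # scores below theta stay unchanged
                scrs[i] *= np.exp(-(ov ** 2) / epsilon)
    return kept                                   # re-scored proposal set
```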

4 Experiments

4.1 Dataset and Setup

Dataset. ActivityNet-1.3 [6] is a large dataset for general temporal action proposal generation and detection, which contains 19994 videos annotated with 200 action classes and was used in the ActivityNet Challenge 2016 and 2017. ActivityNet-1.3 is divided into training, validation and testing sets with a ratio of 2:1:1. The THUMOS14 [22] dataset contains 200 and 213 temporally annotated untrimmed videos with 20 action classes in the validation and testing sets respectively. In this section, we compare our method with state-of-the-art methods on both ActivityNet-1.3 and THUMOS14.

Evaluation Metrics. In the temporal action proposal generation task, Average Recall (AR) calculated with multiple IoU thresholds is usually used as the evaluation metric. Following conventions, we use the IoU threshold set [0.5 : 0.05 : 0.95] on ActivityNet-1.3 and [0.5 : 0.05 : 1.0] on THUMOS14. To evaluate the relation between recall and the number of proposals, we evaluate AR at different Average Numbers of proposals (AN) on both datasets, denoted as AR@AN. On ActivityNet-1.3, the area under the AR vs. AN curve (AUC) is also used as a metric, where AN varies from 0 to 100.

In the temporal action detection task, mean Average Precision (mAP) is used as the evaluation metric, where Average Precision (AP) is calculated for each action class separately. On ActivityNet-1.3, mAP with IoU thresholds \(\left\{ 0.5,0.75,0.95\right\} \) and average mAP over the IoU threshold set [0.5 : 0.05 : 0.95] are used. On THUMOS14, mAP with IoU thresholds \(\left\{ 0.3,0.4,0.5,0.6,0.7 \right\} \) is used.

Implementation Details. For visual feature encoding, we use the two-stream network [33] with the architecture described in [45], where the BN-Inception network [20] is used as the temporal network and a ResNet [18] is used as the spatial network. The two-stream network is implemented using Caffe [21] and pre-trained on the ActivityNet-1.3 training set. During feature extraction, the snippet interval \(\sigma \) is set to 16 on ActivityNet-1.3 and to 5 on THUMOS14.

On ActivityNet-1.3, since video durations lie within a limited range, we follow [27] and rescale the feature sequence of each video to a new length \(l_w =100\) by linear interpolation, and rescale the corresponding annotations to the range [0, 1]. In BSN, the temporal evaluation module and proposal evaluation module are both implemented using Tensorflow [1]. On both datasets, the temporal evaluation module is trained with batch size 16 and learning rate 0.001 for 10 epochs, then 0.0001 for another 10 epochs; the proposal evaluation module is trained with batch size 256 and the same learning rate schedule. For Soft-NMS, we set the threshold \(\theta \) to 0.8 on ActivityNet-1.3 and 0.65 on THUMOS14 by empirical validation, while \(\varepsilon \) in the Gaussian function is set to 0.75 on both datasets.
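The feature rescaling step used on ActivityNet-1.3 can be sketched as follows with numpy linear interpolation; the per-channel interpolation and normalized time axis are assumptions of this sketch.

```python
import numpy as np

def rescale_feature_sequence(F, new_len=100):
    """Rescale a (l_s, feat_dim) feature sequence to a fixed temporal length
    by linear interpolation along the time axis."""
    l_s, feat_dim = F.shape
    old_t = np.linspace(0.0, 1.0, l_s)    # normalized original time axis
    new_t = np.linspace(0.0, 1.0, new_len)
    return np.stack([np.interp(new_t, old_t, F[:, c]) for c in range(feat_dim)], axis=1)
```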

Fig. 4. Comparison of our proposal generation method with other state-of-the-art methods on the THUMOS14 dataset. (left) BSN achieves significant performance gains with relatively few proposals. (center) The recall with 100 proposals vs. tIoU curve shows that with few proposals, BSN obtains improvements at both low and high tIoU. (right) The recall with 1000 proposals vs. tIoU curve shows that with a large number of proposals, BSN achieves improvements mainly when tIoU \({>}0.8\).

4.2 Temporal Proposal Generation

Taking a video as input, a proposal generation method aims to generate temporal proposals where action instances are likely to occur. In this section, we compare our method with state-of-the-art methods and conduct additional experiments to verify the effectiveness of BSN.

Comparison with State-of-the-Art Methods. As aforementioned, a good proposal generation method should generate and retrieve proposals that cover ground truth action instances with high recall and high temporal overlap using relatively few proposals. We evaluate the compared methods in these two aspects.

First we evaluate the ability of our method to generate and retrieve proposals with high recall, which is measured by average recall at different numbers of proposals (AR@AN) and the area under the AR-AN curve (AUC). We list the comparison results on ActivityNet-1.3 and THUMOS14 in Tables 1 and 2 respectively, and plot the average recall against average number of proposals curves for THUMOS14 in Fig. 4 (left). On THUMOS14, our method outperforms other state-of-the-art proposal methods when the proposal number varies from 10 to 1000. In particular, at an average of 50 proposals, our method significantly improves average recall from \(21.86\%\) to \(37.46\%\), a gain of \(15.60\%\). On ActivityNet-1.3, our method outperforms other state-of-the-art proposal generation methods on both the validation and testing sets.

Second, we evaluate the ability of our method to generate and retrieve proposals with high temporal overlap, which is measured by recall at multiple IoU thresholds. We plot recall against IoU threshold curves with 100 and 1000 proposals in Fig. 4 (center) and (right) respectively. Figure 4 (center) shows that with 100 proposals our method achieves significantly higher recall than other methods as the IoU threshold varies from 0.5 to 1.0, and Fig. 4 (right) shows that with 1000 proposals, our method obtains the largest recall improvements when the IoU threshold is higher than 0.8.

Furthermore, we conduct controlled experiments to confirm the contribution of BSN itself in Table 2. For video feature encoding, besides the two-stream network, the C3D network [35] is also adopted in some works [4, 9, 13, 32]. For the NMS method, most previous works adopt Greedy-NMS [8] for redundant proposal suppression. Thus, for fair comparison, we train BSN with features extracted by the C3D network [35] pre-trained on the UCF-101 dataset, then perform Greedy-NMS and Soft-NMS on C3D-BSN and the original 2Stream-BSN respectively. The results in Table 2 show that (1) C3D-BSN still outperforms other C3D-based methods, especially with small numbers of proposals, and (2) Soft-NMS brings only a small improvement over Greedy-NMS, which also works well with BSN. These results suggest that the architecture of BSN itself, rather than the input feature or the NMS method, is the main source of the performance improvement.

Table 1. Comparison of our method with other state-of-the-art proposal generation methods on the validation set of ActivityNet-1.3 in terms of AR@AN and AUC.

These results suggest the effectiveness of BSN. BSN achieves this salient performance since it can generate proposals with (1) flexible temporal durations, covering ground truth action instances of various durations; (2) precise temporal boundaries, obtained by learning starting and ending probabilities with a temporal convolutional network, which brings high overlap between generated proposals and ground truth action instances; and (3) reliable confidence scores based on the BSP feature, which rank proposals properly so that high recall and high overlap can be achieved using relatively few proposals. Qualitative examples on the THUMOS14 and ActivityNet-1.3 datasets are shown in Fig. 5.

Generalizability of Proposals. Another key property of a proposal generation method is the ability to generate proposals for unseen action classes. To evaluate this property, we choose two semantically different action subsets of ActivityNet-1.3: “Sports, Exercise, and Recreation” and “Socializing, Relaxing, and Leisure” as the seen and unseen subsets respectively. The seen subset contains 87 action classes with 4455 training and 2198 validation videos, and the unseen subset contains 38 action classes with 1903 training and 896 validation videos. To guarantee the effectiveness of this experiment, instead of the two-stream network, here we adopt the C3D network [36] trained on the Sports-1M dataset [24] for video feature encoding. Using C3D features, we train BSN on the seen and seen+unseen training videos separately, then evaluate both models on the seen and unseen validation videos separately. As shown in Table 3, there is only a slight performance drop on unseen classes, which demonstrates that BSN has great generalizability and can learn a generic concept of temporal action proposals even for semantically different unseen actions.

Table 2. Comparison of our method with other state-of-the-art proposal generation methods on THUMOS14 in terms of AR@AN.
Table 3. Generalization evaluation of BSN on ActivityNet-1.3. Seen subset: “Sports, Exercise, and Recreation”; Unseen subset: “Socializing, Relaxing, and Leisure”.

Effectiveness of Modules in BSN. To evaluate the effectiveness of the temporal evaluation module (TEM) and proposal evaluation module (PEM) in BSN, we show experimental results of BSN with and without PEM in Table 4, where TEM is used in both settings. These results show that: (1) using only TEM without PEM, BSN already reaches considerable recall performance over state-of-the-art methods; (2) PEM brings considerable further performance improvement. These observations suggest that TEM and PEM are both effective and indispensable in BSN.

Boundary-Sensitive Proposal Feature. The BSP feature is used in the proposal evaluation module to evaluate the confidence scores of proposals. In Table 4, we also conduct ablation studies of the contribution of each component of the BSP feature. These results suggest that although the part of the BSP feature constructed from the boundary regions contributes less improvement than the center region, the best recall performance is achieved when PEM is trained with the BSP feature constructed from both boundary and center regions.

4.3 Action Detection with Our Proposals

To further evaluate the quality of proposals generated by BSN, we place BSN proposals into the “detection by classifying proposals” temporal action detection framework with state-of-the-art action classifiers, where the temporal boundaries of detection results are provided by our proposals. On ActivityNet-1.3, we use the top-1 video-level class generated by the classification model [44] for all proposals in a video and keep the BSN confidence scores of proposals for retrieval. On THUMOS14, we use the top-2 video-level classes generated by UntrimmedNet [43] for proposals generated by BSN and other methods, where the product of the confidence score and class score is used for retrieving detections. Following previous works, on THUMOS14 we also apply the SCNN classifier to BSN proposals for proposal-level classification and adopt Greedy-NMS as in [32]. We use 100 and 200 proposals per video on the ActivityNet-1.3 and THUMOS14 datasets respectively.

Table 4. Study of the effectiveness of modules in BSN and the contribution of components of the BSP feature on THUMOS14, where PEM is trained with the BSP feature constructed from the boundary regions (\(f_s^A,f_e^A\)) and the center region (\(f_c^A\)) independently and jointly.
Table 5. Action detection results on validation and testing set of ActivityNet-1.3 in terms of mAP@tIoU and average mAP, where our proposals are combined with video-level classification results generated by [44].
Fig. 5. Qualitative examples of proposals generated by BSN on THUMOS14 (top and middle) and ActivityNet-1.3 (bottom), where proposals are retrieved using the post-processed confidence scores.

Table 6. Action detection results on testing set of THUMOS14 in terms of mAP@tIoU, where classification results generated by UntrimmedNet [43] and SCNN-classifier [32] are combined with proposals generated by BSN and other methods.

The comparison results on ActivityNet-1.3 shown in Table 5 suggest that the detection framework based on our proposals outperforms other state-of-the-art methods. The comparison results on THUMOS14 shown in Table 6 suggest that (1) using the same action classifier, our method achieves significantly better performance than other proposal generation methods; (2) compared with the proposal-level classifier [32], the video-level classifier [43] achieves better performance on BSN proposals and worse performance on [4, 13] proposals, which indicates that the confidence scores generated by BSN are more reliable than scores generated by a proposal-level classifier, and are reliable enough for retrieving detection results in the action detection task; (3) the detection framework based on our proposals significantly outperforms state-of-the-art action detection methods, especially at high overlap thresholds. These results confirm that proposals generated by BSN are of high quality and work well in detection frameworks in general.

5 Conclusion

In this paper, we have introduced the Boundary-Sensitive Network (BSN) for temporal action proposal generation. Our method generates proposals with flexible durations and precise boundaries by directly combining locations with high boundary probabilities, and retrieves proposals accurately by evaluating proposal confidence scores with proposal-level features. Thus BSN achieves high recall and high temporal overlap with relatively few proposals. In experiments, we demonstrate that BSN significantly outperforms other state-of-the-art proposal generation methods on both the THUMOS14 and ActivityNet-1.3 datasets, and that it significantly improves detection performance when used as the proposal stage of a full detection framework. Code is available at https://github.com/wzmsltw/BSN-boundary-sensitive-network.