引入

There are two main challenges for visual tracking：
首先，待跟踪目标具有类不可知性和任意性，关于目标的先验信息很少。
其次，仅仅向跟踪器提供一个初始图像，tracker需要适应目标的外观变化和处理外部干扰，如遮挡。

上述挑战可以部分地通过鉴别discriminative correlation filter（DCF）方法来处理。通常，这些方法适应于给定的目标域并且以在线方式执行跟踪。然而，在线跟踪可能会引入不相关的背景，增加drifting risk。

孪生网络通过学习目标与搜索区域之间的相似度映射，将视觉跟踪作为模板匹配任务来处理。 In fact, 孪生追踪器mostly adopt the template pyramid strategy to estimate the target box. 然而，这种模板金字塔策略基本上局限于对目标的外观进行建模(当目标规模变化剧烈时，效果不佳)。
SiamRPN将跟踪任务分解为分类和定位问题，避免了图像金字塔尺度估计的耗时操作。更重要的是，通过使用RPN，可以估计不同纵横比下的目标尺寸。
However，在SiamRPN中，所有的训练样本来自预定义的anchors(proposals)，并且正样本和负样本的随机子集被馈送到Siamese网络以优化平均损失。在训练阶段有两个限制：

The number of negative samples is usually much larger than that of positive ones. Many non-semantic negative samples can be easily classified and contribute to the reduction of the average loss, which may prevent the classifier from discriminating hard negative samples, e.g., semantic distractors
All positive samples, whose overlapping ratios with a set of anchors are above a pre-defined threshold, are treated equally and expected to match the positive label as closely as possible.

为了克服上述局限性，我们提出了一种简单而有效的策略–样本关注度排序（SRA）来对训练样本的重要性进行排序。更具体地说，具有与正样本相似的语义特征的硬负样本被分配较大的权重以训练分类分支。这种简单的重新加权方案有助于增强跟踪器的鉴别能力。同时，根据IoU计算阳性样本的权重。实际上，当正样本充分覆盖目标并且包含较少的背景像素时，it plays a more important role in representing the target model.

在SiamRPN 及其后续跟踪器中，从多个预定义锚中选择具有最高分类得分的proposal作为跟踪目标。然而，如图1所示，最终选择的proposal可能不具有最大的IoU。因此，仅考虑proposal的分类分数而不考虑任何位置信息可能产生次优解决方案。
在这里插入图片描述

related works

Proposal Selection and Sampling Strategies

In the Siamese tracking field, the most widely adopted sampling strategy is random sampling, i.e., random selection from the whole training samples. In general, there are more negative samples than positive ones. So a fixed ratio is set to balance two kinds of samples.
Another popular idea is to focus on hard samples that have large loss during offline
training like DaSiamRPN [17] or online choose hard samples in a special target domain such as ATOM [13]

there are many sampling attention strategies in object detection like DR-loss [36]and AP-loss [37],. However, they are not suitable for class-agnostic tracking tasks.

we propose a novel Sample Ranking Attention(SRA) strategy to re-weigh training samples.
this paper designs a ranking network, which can be integrated into Siamese tracking network in an end-to-end manner, instead of the two-stage process of object detection, so that bounding box and classification information are integrated together to produce more reliable proposal selection.

PROPOSED METHOD

基于SiamRPN

在这里插入图片描述

Sample Ranking Attention(SRA)

a positive sample with a higher IoU with the ground-truth should match its positive label better(closer) than other positive samples with lower IoUs. Hence, we propose IoU-SRA as an importance measurement to re-weigh positive samples.

For a positive sample $i ∈ A_{pos}$ , we calculate its IoU with the ground-truth and denotes the value as $v^{iou}_i$ , and obtain the IoU-SRA weight $w_i$ to characterize the importance of the i -th positive sample as $w_i = 1 +v^{iou}_i− τ_{pos}$ ,
the threshold $τ_{pos}$ is used to separate positive and negative samples,是一个常数。

对于负样本，由（1）计算出来的 $c_i(pos)$ 尽可能为0；
we propose a Score-SRA strategy to measure the importance of negative samples, under which the samples with higher $c_i(pos)$ gain more attention during training（因为硬负样本能被正确分类对提升网络的效果更重要）
For each negative sample $i ∈ A_{neg}$ , we calculate its Score-SRA weight $w_i$ as
$w_i = 1 + c_i(pos)$
在这里插入图片描述

Ranking Network(RANet)

The above SRA can ensure that the important training samples are correctly classified.
RANet的目标是给具有较大IoU值的提案的较高排名得分，反之亦然。
The ranking score map ${r_i}$ is calculated as
${r_i}= corr([φ(z)]_{rank}, [φ(x)]_{rank})$ 只考虑正样本：
在这里插入图片描述
the total training loss is $L_{total} = L_{rpn} + L_{rank}$
（其来源于铰链损失函数hinge loss）

Online Tracking Decision

在这里插入图片描述

在推断阶段，SiamRPN仅利用anchor的分类得分来确定当前目标位置。不同的是，我们考虑了anchor与ground truth之间的overlapping，并更加关注具有高IoU的正anchor。为了获得用于在线跟踪的更可靠的anchor，我们建议通过它们的加权和来融合归一化分类和排名得分图，如下所示:
${s_i}= β{r_i}+ (1 − β){c_i(pos)}$ where i is the anchor index, ${r_i}$ and ${c_i(pos)}$ denote the ranking and foreground classification score maps of all anchors, respectively.
${s_i}$ is the final score map used for anchor selection.

融合得分图的优点如图4所示，它表明SiamLTR通过使用等级得分更能够捕获具有较高IoU的bbox，SiamLTR能够更好地适应大规模变化，实现更精确的定位。

基于SiamRPN++:

在这里插入图片描述
As shown in Figure 5, three pairs of aggregated feature maps are fed into RANet to produce the ranking score maps $R_i,i=1,2,3$ . A weighted-fusion layer utilizes all the output maps
as

The weight $η_i$ is optimized offline together with the network and fixed in the inference phase. The weighted ranking score map Rall is treated as ${r_i}$ in Eq6.
其他相同，推理阶段也相同。