ECCV-2020

Author talk: https://www.techbeat.net/talk-info?id=462
Code: https://github.com/lxtGH/DecoupleSegNets

Contents

1 Background and Motivation
2 Related Work
3 Advantages / Contributions
4 Method
- 4.1 Decoupled segmentation framework
- 4.2 Body generation module
- 4.3 Edge preservation module
- 4.4 Decoupled body and edge supervision
- 4.5 Network architecture
5 Experiments
- 5.1 Datasets
- 5.2 Ablation studies
- 5.3 Visual analysis
- 5.4 Results on other datasets
6 Conclusion (own)

1 Background and Motivation

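Shortcomings of existing semantic segmentation methods:

1）The receptive field (RF) grows slowly, so the network cannot model longer-range relationships between pixels; as a result, the segmentation shows ambiguity and noise inside objects
2）Down-sampling operations lead to blurred predictions

Methods that address shortcoming 1 (improving object inner consistency):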

dilated convolution
pyramid pooling module
non-local operators
graph convolution network
dynamic graph

Methods that address shortcoming 2 (refining object boundaries):

embed low-level features into high-level features
refine the outputs

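Each of the methods above either improves object inner consistency (features of the same object are pulled closer together so that they are segmented as one region) or refines object boundaries; none of them considers the interaction between body and boundary. Starting from the observation that the low-frequency and high-frequency components of an image correspond to the body and the boundary respectively, the authors decouple the feature into a body feature and a boundary (edge) feature, and then jointly optimize them in a unified framework.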

2 Related Work

Semantic segmentation
- structured prediction operators, e.g. CRF
- deep learning, e.g. PSPNet and the DeepLab series
Boundary processing
Multi task learning

3 Advantages / Contributions

Improves semantic segmentation by decoupling body and edge features, each supervised with its own loss
Designs a Body Generation Module dedicated to extracting the body feature
The method is lightweight and easy to plug into existing semantic segmentation methods
Achieves SOTA on 4 driving-scene semantic segmentation datasets

4 Method

1）object inner consistency

improve the object’s inner consistency by modeling the global context

2）object boundaries

refine objects detail along their boundaries by multi-scale feature fusion

4.1 Decoupled segmentation framework

$\begin{aligned} \hat{F} &= F_{body} + \varphi(F_{edge}) \\ &= F_{body} + \varphi(F - F_{body}) \\ &= \phi(F) + \varphi(F- \phi(F)) \end{aligned}$

$\phi$ is the body generation module
$\varphi$ is the edge preservation module
$F$ is the original feature map, with $F = F_{body} + F_{edge}$
$\hat{F}$ is the refined feature map
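
To make the decomposition concrete, here is a minimal 1D sketch. It is only an illustration under loud assumptions: the learned modules $\phi$ and $\varphi$ are replaced by hand-written stand-ins (a moving average for the body, a plain rescaling for the edge), and the names `body`, `edge`, and `decouple` are hypothetical, not taken from the authors' code.

```python
def body(signal, radius=1):
    """Stand-in for the body generation module: a moving average (low-pass)."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def edge(residual, gain=1.0):
    """Stand-in for the edge preservation module: rescales the residual."""
    return [gain * r for r in residual]


def decouple(signal):
    f_body = body(signal)                                   # F_body = phi(F)
    f_edge = [f - b for f, b in zip(signal, f_body)]        # F_edge = F - F_body
    f_hat = [b + e for b, e in zip(f_body, edge(f_edge))]   # F_hat = F_body + varphi(F_edge)
    return f_body, f_edge, f_hat


# A step signal: two flat "object" regions separated by one boundary.
F = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
F_body, F_edge, F_hat = decouple(F)
```

In this toy case the residual `F_edge` is non-zero only next to the step, which is the intuition behind treating the body as the low-frequency part and the edge as the high-frequency part.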

4.2 Body generation module

The goal is to generate more consistent feature representations for pixels inside the same object.

The module learns a flow field $\delta \in \mathbb{R}^{H \times W \times 2}$, generated by the network itself, to warp features towards the inner parts of objects.

1）Flow field generation

The core idea is as follows:

Low spatial frequency parts capture the summation of images, and a lower resolution feature map represents the most salient part where we view it as pseudo-center location or the set of seed points.

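In other words, when the feature map resolution is very low, each location represents the most salient part of its region.

The overall structure borrows from FlowNet ("Flownet: Learning optical flow with convolutional networks") and uses an encoder-decoder design.

Down-sampling produces the pseudo-center points (coarse centers). The low-resolution feature is then up-sampled back to the resolution of the original feature map and concatenated with it to learn the flow field.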
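
The data flow of this step (down-sample, up-sample, concatenate, predict a 2-channel flow) can be sketched in plain Python. Everything below is an assumption made for illustration: average pooling and nearest-neighbour up-sampling stand in for the learned down-sampling and the interpolation, a 1×1 projection with made-up weights stands in for the convolution that predicts the flow, and the function names are hypothetical.

```python
def avg_pool2(ch):
    """2x2 average pooling with stride 2 on one H x W channel (H and W even)."""
    return [[(ch[y][x] + ch[y][x + 1] + ch[y + 1][x] + ch[y + 1][x + 1]) / 4.0
             for x in range(0, len(ch[0]), 2)]
            for y in range(0, len(ch), 2)]


def upsample2(ch):
    """Nearest-neighbour 2x up-sampling back to the original resolution."""
    return [[ch[y // 2][x // 2] for x in range(2 * len(ch[0]))]
            for y in range(2 * len(ch))]


def flow_inputs(feat):
    """feat: [C][H][W]. Returns the [2C][H][W] input of the flow predictor."""
    coarse = [upsample2(avg_pool2(ch)) for ch in feat]
    return feat + coarse                      # channel-wise concatenation


def predict_flow(stacked, weight):
    """Hypothetical 1x1 projection from 2C channels to the 2 flow channels."""
    height, width = len(stacked[0]), len(stacked[0][0])
    return [[tuple(sum(wc * ch[y][x] for wc, ch in zip(w_out, stacked))
                   for w_out in weight)
             for x in range(width)] for y in range(height)]


feat = [[[1.0, 3.0],
         [5.0, 7.0]]]                          # C = 1, H = W = 2
stacked = flow_inputs(feat)                    # 2 channels: original + coarse
flow = predict_flow(stacked, [[1.0, -1.0], [0.5, -0.5]])   # made-up weights
```

The result `flow` has one `(dy, dx)` pair per pixel, matching the shape $H \times W \times 2$ of $\delta$.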

2）Feature warping

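The goal is to pull the features of the same object towards its center:

$F_{body}(p) = \sum_{q \in \mathcal{N}(p + \delta(p))} w_q F(q)$

$p$ is a position on the output grid, and $\mathcal{N}(p + \delta(p))$ is the set of four neighbors of the shifted position $p + \delta(p)$

$w_q$ are the bilinear weights derived from the flow map

$F$ is the original feature

In other words, the flow field acts through a weighted sum over the points in the four-neighborhood of the shifted position.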
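
A minimal single-channel sketch of this warping step is given below. It assumes offsets measured in pixels and clamping at the image border; the actual implementation may normalize the grid differently, and the names `bilinear_sample` and `warp` are hypothetical. With several channels, the same flow would be applied to each channel.

```python
import math


def bilinear_sample(feat, y, x):
    """Sample a 2D map at a fractional (y, x) from its four nearest neighbours."""
    h, w = len(feat), len(feat[0])
    # Clamp to the valid range so that the four neighbours always exist.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[y0][x0] + (1 - dy) * dx * feat[y0][x1]
            + dy * (1 - dx) * feat[y1][x0] + dy * dx * feat[y1][x1])


def warp(feat, flow):
    """flow[y][x] = (dy, dx): offset of the location that output pixel (y, x) reads."""
    h, w = len(feat), len(feat[0])
    return [[bilinear_sample(feat, y + flow[y][x][0], x + flow[y][x][1])
             for x in range(w)] for y in range(h)]


feat = [[0.0, 1.0, 2.0],
        [3.0, 4.0, 5.0],
        [6.0, 7.0, 8.0]]
zero_flow = [[(0.0, 0.0)] * 3 for _ in range(3)]
half_right = [[(0.0, 0.5)] * 3 for _ in range(3)]
identity = warp(feat, zero_flow)    # zero flow leaves the feature unchanged
shifted = warp(feat, half_right)    # each pixel reads half a pixel to its right
```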

4.3 Edge preservation module

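Idea: subtract the body feature from the original feature, then fuse the result with low-level features to supply the missing fine-detail information:

$F_{edge} = \gamma((F - F_{body}) \,||\, F_{fine})$

In the module diagram, the purple part is $F - F_{body}$

$||$ denotes concatenation, and $\gamma$ is a 1×1 conv

$F_{fine}$ is the low-level feature, taken from the shallow layers of the backbone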
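
The fusion can be sketched on toy tensors as follows. This is an illustration under assumptions: $F_{fine}$ is taken to be already resized to the spatial size of $F$, the 1×1 conv weights are made up, and `conv1x1` and `edge_feature` are hypothetical helper names.

```python
def conv1x1(feat, weight):
    """feat: [C_in][H][W], weight: [C_out][C_in]. Mixes channels at each pixel."""
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [[[sum(weight[o][c] * feat[c][y][x] for c in range(c_in))
              for x in range(w)] for y in range(h)]
            for o in range(len(weight))]


def edge_feature(f, f_body, f_fine, weight):
    # Residual F - F_body, channel by channel.
    residual = [[[a - b for a, b in zip(row_f, row_b)]
                 for row_f, row_b in zip(ch_f, ch_b)]
                for ch_f, ch_b in zip(f, f_body)]
    fused = residual + f_fine              # channel-wise concatenation
    return conv1x1(fused, weight)          # gamma


# Toy tensors: 1 high-level channel, 1 low-level channel, spatial size 1 x 2.
F = [[[1.0, 4.0]]]
F_body = [[[1.0, 3.0]]]
F_fine = [[[0.5, 0.5]]]
W = [[1.0, 2.0]]                           # made-up weights: 1 output, 2 input channels
F_edge = edge_feature(F, F_body, F_fine, W)
```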

4.4 Decoupled body and edge supervision

Supervision is applied to $F_{body}$, $F_{edge}$, and $\hat{F}$ (i.e. $F_{final}$).

$b$ is the boundary map predicted from $F_{edge}$
$s_{body}$ is the segmentation predicted from $F_{body}$
$s_{final}$ is the segmentation predicted from $F_{final}$
$\hat{s}$ is the GT semantic label
$\hat{b}$ is the GT binary boundary mask, generated from $\hat{s}$
$L_{final}$ is the cross entropy loss for the segmentation task
$L_{body}$ uses the boundary relaxation loss (from "Improving Semantic Segmentation via Video Propagation and Label Relaxation", CVPR 2019); during training, only part of the pixels inside objects are sampled
$L_{edge}$ is defined in Equation 4 of the paper
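
Since $\hat{b}$ is derived from $\hat{s}$, one simple way to picture it is to mark every pixel that has a 4-neighbour with a different label. This rule is an assumption for illustration only; the notes do not state how the authors generate the mask (for example, how thick the boundary is), and `boundary_mask` is a hypothetical name.

```python
def boundary_mask(labels):
    """Return a 0/1 map marking pixels that have a differently-labelled 4-neighbour."""
    h, w = len(labels), len(labels[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != labels[y][x]:
                    mask[y][x] = 1
                    break
    return mask


s_hat = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
b_hat = boundary_mask(s_hat)   # only the two columns next to the class change are marked
```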

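Most of the hardest pixels to classify lie on the boundary between object classes (boundary pixels are hard examples).

It is not easy to classify the center pixel of a receptive field when potentially half or more of the input context could be a new class!

The authors' solution is to introduce an edge prior and combine it with OHEM (online hard example mining).

Equation 4 of the paper splits $L_{edge}$ into two parts, $L_{bce}$ and $L_{ce}$:

$L_{bce}$ is the binary cross entropy loss between the boundary label and the predicted boundary
$L_{ce}$ is a cross entropy loss, defined in Equation 5 of the paper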
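
Putting the symbol definitions of this section together, the overall objective can be written roughly as below. This is a reconstruction from the definitions in these notes rather than a copy of the paper's equations; the balancing weights $\lambda_1, \dots, \lambda_5$ are notation assumed here, and their values are not given in these notes.

```latex
\begin{aligned}
L &= \lambda_1 L_{body}(s_{body}, \hat{s})
   + \lambda_2 L_{edge}(b, s_{final}, \hat{s}, \hat{b})
   + \lambda_3 L_{final}(s_{final}, \hat{s}) \\
L_{edge}(b, s_{final}, \hat{s}, \hat{b}) &= \lambda_4 L_{bce}(b, \hat{b})
   + \lambda_5 L_{ce}(s_{final}, \hat{s}, b)
\end{aligned}
```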

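$N$ is the total number of pixels in the image
$K$ is set as a fixed fraction of $N$
$\hat{s}_i$ is the GT class of pixel $i$
$s_{i,j}$ is the predicted posterior probability for pixel $i$ and class $j$; the loss uses $s_{i,\hat{s}_i}$, the probability that pixel $i$ is predicted as its GT class $\hat{s}_i$
$\mathbb{I}[x] = 1$ if $x$ is true, otherwise 0
$\sigma$ is the sigmoid function, applied to the edge prediction to decide whether a pixel lies on a boundary
$t_K$ is the OHEM threshold, chosen so that the $K$ pixels with the highest losses are selected
$t_b$ is the threshold for deciding whether a pixel is a boundary pixel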
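
The definitions above suggest the following reading of $L_{ce}$: take the $K$ pixels with the lowest GT-class probability (the highest losses), keep those whose edge score passes $t_b$, and average their cross entropy over $K$. The sketch below follows that reading on a flattened image. It is an assumption-laden illustration, not the authors' implementation: the default `t_b=0.5` is a placeholder value, any per-pixel weighting is ignored, and `edge_ohem_ce` is a hypothetical name.

```python
import math


def edge_ohem_ce(probs, labels, edge_logits, k, t_b=0.5):
    """probs[i][j]: predicted probability of class j at pixel i.
    labels[i]: GT class of pixel i; edge_logits[i]: edge score b_i before the sigmoid."""
    gt_prob = [p[y] for p, y in zip(probs, labels)]            # s_{i, s_hat_i}
    # The k pixels with the lowest GT probability have the k highest losses.
    hard = set(sorted(range(len(gt_prob)), key=lambda i: gt_prob[i])[:k])
    total = 0.0
    for i, b in enumerate(edge_logits):
        on_edge = 1.0 / (1.0 + math.exp(-b)) > t_b             # sigma(b_i) > t_b
        if i in hard and on_edge:
            total += -math.log(gt_prob[i])
    return total / k


probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.5, 0.5]]
labels = [0, 0, 0, 1]
edge_logits = [-3.0, 2.0, 2.0, -3.0]     # pixels 1 and 2 are predicted as boundary
loss = edge_ohem_ce(probs, labels, edge_logits, k=2)
```

Here the two hardest pixels are 2 and 3, but only pixel 2 also lies on the predicted boundary, so it is the only one that contributes to the loss.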