RE2 - Simple and Effective Text Matching with Richer Alignment Features

This paper comes from Alibaba and was published at ACL 2019: "Simple and Effective Text Matching with Richer Alignment Features", https://arxiv.org/abs/1908.00300

Intro

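Many deep networks have only a single alignment layer, so the model has to rely on extra semantic information, hand-crafted features, complex alignment mechanisms, or post-processing.

The main idea of this paper is to use multiple alignment processes.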

R - Residual vectors: previous aligned features

E - Embedding vectors: original point-wise features

E - Encoded vectors: contextual features

Hence the name RE2.

What exactly do these stand for? Let's read on.

Model

[Figure 1: overview of the RE2 model architecture]

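Blank squares denote embedding vectors, squares with diagonal stripes denote augmented residual connections, and black squares denote the context vectors produced by an encoder. As the figure shows, these three vectors are concatenated and fed into the alignment layer; the input and output of the alignment layer are then concatenated and fed into the fusion layer. One block consists of three layers (encoding, alignment, and fusion); the block is repeated N times, and each block has its own independent parameters. The output of the fusion layer passes through a pooling layer to yield the final fixed-length vector. The fixed-length vectors from the two sides are used for prediction, and the loss is cross-entropy.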
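The last two steps of the pipeline (pooling and the loss) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the text above does not say which pooling is used, so max pooling over the sequence dimension is assumed here, and the function names `max_pool` and `cross_entropy` are made up for this example.

```python
import numpy as np

def max_pool(h, length):
    """Pool a (max_len, d) fusion output into a fixed-length (d,) vector.

    Assumption: max pooling over time; only the first `length` rows are
    real tokens, the rest is padding and is ignored.
    """
    return h[:length].max(axis=0)

def cross_entropy(logits, label):
    """Cross-entropy loss for a single example with an integer class label."""
    logits = logits - logits.max()                      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())   # log-softmax
    return -log_probs[label]

h = np.array([[1.0, 5.0], [3.0, 2.0], [9.0, 9.0]])  # last row is padding
print(max_pool(h, length=2))          # [3. 5.]
print(cross_entropy(np.zeros(3), 0))  # log(3), about 1.0986
```

Whatever pooling is chosen, the point is that the output no longer depends on the sequence length, so the two sides can be compared directly.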

Augmented Residual Connections

To give the alignment layer (the attention layer) richer features, RE2 uses augmented residual connections to link consecutive blocks.

The input of the $n$ -th block $x^{(n)}$ ( $n$ ≥ 2), is the concatenation of the input of the first block $x^{(1)}$ and the summation of the output of previous two blocks (denoted by rectangles with diagonal stripes in Figure 1):

$x^{(n)}_i=[x^{(1)}_i;o^{(n-1)}_i+o_i^{(n-2)}]$
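As a rough illustration of the formula above, the following NumPy sketch builds the input of block $n$ from the first block's input and the outputs of the previous blocks. The helper name `augmented_residual_input` is hypothetical, and treating the missing $o^{(0)}$ as zeros for $n=2$ is an assumption of this sketch, not something stated in the text.

```python
import numpy as np

def augmented_residual_input(x1, outputs):
    """Build x^(n) for n = len(outputs) + 1, with n >= 2.

    x1:      (seq_len, d_emb) input of the first block (the embeddings).
    outputs: list [o^(1), ..., o^(n-1)], each of shape (seq_len, d_hid).
    """
    prev = outputs[-1]
    # Assumption: for n = 2 there is no o^(0), so it is treated as zeros.
    prev2 = outputs[-2] if len(outputs) >= 2 else np.zeros_like(prev)
    # Concatenate along the feature axis: [x^(1); o^(n-1) + o^(n-2)]
    return np.concatenate([x1, prev + prev2], axis=-1)

x1 = np.ones((4, 3))        # 4 tokens, embedding size 3
o1 = np.full((4, 2), 2.0)   # output of block 1, hidden size 2
o2 = np.full((4, 2), 5.0)   # output of block 2
x2 = augmented_residual_input(x1, [o1])
x3 = augmented_residual_input(x1, [o1, o2])
print(x2[0])  # [1. 1. 1. 2. 2.]
print(x3[0])  # [1. 1. 1. 7. 7.]
```

Note that the embedding part $x^{(1)}$ is concatenated rather than summed, so the original point-wise features reach every block unchanged.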

Alignment Layer

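Alignment is still computed with a dot product (see the previous post in this series, a close reading of the classic ESIM paper). $F$ is either the identity function or a single-layer feed-forward network; which one to use is a hyperparameter.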

$e_{ij}=F(a_i)^TF(b_j)$

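Once the similarity scores $e$ are computed, the weighted sums are obtained in the same way as before. $a'_i$ is the part of $\{b_j\}^{l_b}_{j=1}$ that is relevant to $a_i$.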

$\begin{aligned}a'_i&=\sum^{l_b}_{j=1}\frac{\exp(e_{ij})}{\sum^{l_b}_{k=1}\exp(e_{ik})} b_j, \quad \forall i\in [1,\dots,l_a]\\b'_j&=\sum^{l_a}_{i=1}\frac{\exp(e_{ij})}{\sum^{l_a}_{k=1}\exp(e_{kj})} a_i, \quad \forall j\in [1,\dots,l_b]\end{aligned}$
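The two softmax-weighted sums can be written compactly with matrix operations. The NumPy sketch below is a minimal illustration for a single, unpadded sentence pair; the function names `softmax` and `align` are chosen for this example, and padding masks (which a real batch would need) are left out.

```python
import numpy as np

def softmax(x, axis):
    """Softmax along the given axis, shifted for numerical stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def align(a, b, F=lambda x: x):
    """a: (l_a, d), b: (l_b, d). Returns a' of shape (l_a, d) and b' of shape (l_b, d).

    F defaults to the identity; a single-layer network could be passed instead.
    """
    e = F(a) @ F(b).T                     # e[i, j] = F(a_i)^T F(b_j)
    a_aligned = softmax(e, axis=1) @ b    # normalise over j, mix the b_j
    b_aligned = softmax(e, axis=0).T @ a  # normalise over i, mix the a_i
    return a_aligned, b_aligned

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 4))   # l_a = 3 tokens
b = rng.normal(size=(5, 4))   # l_b = 5 tokens
a_aligned, b_aligned = align(a, b)
print(a_aligned.shape, b_aligned.shape)  # (3, 4) (5, 4)
```

Each row of `a_aligned` is a convex combination of the rows of `b`, which is why it can be read as "what sequence b says about token a_i".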

Fusion Layer

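For sequence $a$, the following three computations are performed and their results concatenated, which gives the fused output $\bar{a}$. Each $G$ here is a single-layer feed-forward network; their parameters are not shared, so they are distinguished by different subscripts.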

$\begin{aligned}\bar{a}_i^1&=G_1([a_i;a_i'])\\\bar{a}_i^2&=G_2([a_i;a_i-a_i'])\\\bar{a}_i^3&=G_3([a_i;a_i \odot a_i'])\\\bar{a}_i&=G([\bar{a}_i^1;\bar{a}_i^2;\bar{a}_i^3])\end{aligned}$

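The subtraction is mainly there to capture the difference between the two representations, and the multiplication to capture their similarity.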
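The fusion step can be sketched as follows. This is a toy version under several assumptions: the helper names `dense` and `make_fusion` are hypothetical, the weights are random rather than trained, and ReLU is used as the activation only because the text does not name one.

```python
import numpy as np

def dense(rng, d_in, d_out):
    """Hypothetical single-layer feed-forward network (random weights, ReLU assumed)."""
    W = rng.normal(scale=0.1, size=(d_in, d_out))
    bias = np.zeros(d_out)
    return lambda x: np.maximum(x @ W + bias, 0.0)

def make_fusion(rng, d, d_hid):
    """Create G_1, G_2, G_3 and G with separate parameters and return the fusion function."""
    G1, G2, G3 = (dense(rng, 2 * d, d_hid) for _ in range(3))
    G = dense(rng, 3 * d_hid, d_hid)

    def fuse(a, a_aligned):
        f1 = G1(np.concatenate([a, a_aligned], axis=-1))      # plain concatenation
        f2 = G2(np.concatenate([a, a - a_aligned], axis=-1))  # difference view
        f3 = G3(np.concatenate([a, a * a_aligned], axis=-1))  # similarity view
        return G(np.concatenate([f1, f2, f3], axis=-1))

    return fuse

rng = np.random.default_rng(0)
fuse = make_fusion(rng, d=4, d_hid=6)
a = rng.normal(size=(3, 4))          # local representation, 3 tokens
a_aligned = rng.normal(size=(3, 4))  # stand-in for the alignment output a'
print(fuse(a, a_aligned).shape)      # (3, 6)
```

The same function would be applied to $b$ and $b'$ to obtain $\bar{b}$.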

Prediction Layer

Given the two input vectors $v_1,v_2$, the output is:

$\hat{y}=H([v_1;v_2;v_1-v_2;v_1\odot v_2])$

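where $H$ is a multi-layer feed-forward network.

There is also a simplified version; which of the two to use is likewise tuned as a hyperparameter: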

$\hat{y}=H([v_1;v_2])$
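To make the two variants concrete, here is a small NumPy sketch of the feature vector that is passed to $H$. The function name `prediction_features` and the `simplified` flag are made up for this example, and $H$ itself (the multi-layer network) is not shown.

```python
import numpy as np

def prediction_features(v1, v2, simplified=False):
    """Build the input of H from the two pooled vectors v1 and v2."""
    if simplified:
        return np.concatenate([v1, v2])                # [v1; v2]
    return np.concatenate([v1, v2, v1 - v2, v1 * v2])  # [v1; v2; v1 - v2; v1 * v2]

v1 = np.array([1.0, 2.0])
v2 = np.array([3.0, 4.0])
print(prediction_features(v1, v2))                   # [ 1.  2.  3.  4. -2. -2.  3.  8.]
print(prediction_features(v1, v2, simplified=True))  # [1. 2. 3. 4.]
```

The full version quadruples the input size of $H$, while the simplified one only doubles it.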

Difference with ESIM

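Judging from the formulas, RE2 is still quite similar to ESIM; the biggest difference is the use of residual connections to enrich the information. Because every block contains an alignment layer, the single alignment process becomes multiple alignment processes. The method drops the more complex machinery (complicated multi-way alignment mechanisms, heavy distillations of alignment results, external syntactic features, or dense connections to connect stacked blocks when the model is going deep), so it runs as fast as possible while maintaining performance.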

The authors provide GitHub repositories:

TensorFlow 1.x: alibaba-edu/simple-effective-text-matching

PyTorch: alibaba-edu/simple-effective-text-matching-pytorch