Understanding FTRL in Depth

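FTRL is an online learning algorithm that combines the strengths of the FOBOS and RDA algorithms. To follow this article you should already be familiar with logistic regression (LR), stochastic gradient descent (SGD), and L1 regularization.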

The FOBOS algorithm

Forward-Backward Splitting (FOBOS) was proposed by John Duchi and Yoram Singer. In this algorithm each weight update is split into two steps. Here $t$ is the iteration index, $\eta^t$ is the learning rate at iteration $t$, $G^t$ is the gradient of the loss function, and $\Psi(W)$ is the regularization term:

$$W^{t+0.5} = W^t - \eta^t G^t$$

$$W^{t+1} = \arg\min_W \left\{ \frac{1}{2} \Vert W - W^{t+0.5} \Vert^2_2 + \eta^{t+0.5} \Psi(W) \right\}$$

Another form of the weight update:

Taking the (sub)derivative of the argmin objective and setting it to zero gives:

$$W^{t+1} = W^t - \eta^t G^t - \eta^{t+0.5} \partial\Psi(W^{t+1})$$

In this form, the update of $W^{t+1}$ depends not only on $W^t$ but also on $W^{t+1}$ itself, through $\partial\Psi(W^{t+1})$. Some speculate that this is where the name "forward-backward" comes from.

L1-FOBOS uses the L1 norm as the regularization term, with $\lambda > 0$:

$$W^{t+0.5} = W^t - \eta^t G^t$$

$$W^{t+1} = \arg\min_W \left\{ \frac{1}{2} \Vert W - W^{t+0.5} \Vert^2_2 + \eta^{t+0.5} \lambda \Vert W \Vert_1 \right\}$$

Merging the two steps into one:

Set $\eta^{t+0.5} = \eta^t$, expand the squared term, drop the terms that do not depend on $W$, and divide by $\eta^t$ to get:

$$W^{t+1} = \arg\min_W \left\{ G^t W + \frac{1}{2\eta^t} \Vert W - W^t \Vert^2_2 + \lambda \Vert W \Vert_1 \right\}$$

Closed-form solution:

$$w^{t+1}_i = \begin{cases} 0, & \text{if } \vert w^t_i - \eta^t g^t_i \vert \leq \eta^{t+0.5}\lambda \\ (w^t_i - \eta^t g^t_i) - \eta^{t+0.5}\lambda \cdot \operatorname{sgn}(w^t_i - \eta^t g^t_i), & \text{otherwise} \end{cases}$$

The derivation is omitted; it follows the same reasoning as the derivation of the FTRL closed-form solution below.
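The closed form above is a soft-thresholding step, which is easy to see in code. The following is a minimal sketch, assuming $\eta^{t+0.5} = \eta^t$ and dense NumPy vectors; the function name `l1_fobos_step` and the sample numbers are made up for illustration.

```python
import numpy as np

def l1_fobos_step(w, g, eta, lam):
    """One L1-FOBOS update, assuming eta^{t+0.5} = eta^t = eta."""
    w_half = w - eta * g                      # gradient half-step: W^{t+0.5}
    shrunk = np.abs(w_half) - eta * lam       # how far each |w| exceeds the threshold
    # Coordinates at or below the threshold become 0; the rest move toward 0 by eta*lam.
    return np.sign(w_half) * np.maximum(shrunk, 0.0)

w = np.array([0.5, -0.2, 0.05])
g = np.array([1.0, -1.0, 0.2])
w_next = l1_fobos_step(w, g, eta=0.1, lam=0.5)
print(w_next)
```

With these numbers the third coordinate ends up below the threshold $\eta\lambda = 0.05$ after the gradient step, so it is truncated to zero, while the other two are only shrunk.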

Why is $\eta^{t+0.5} = \eta^t$ the usual choice?

Within a single update we want the first half-step and the second half-step to use the same step size (learning rate).

The RDA algorithm

RDA (Regularized Dual Averaging) updates the feature weights in a single step, shown below. Here the accumulated gradient is $G^{(1:t)} = \sum_{s=1}^t G^s$, the average of the accumulated gradient is $g^{(1:t)} = \frac{1}{t}\sum_{s=1}^t G^s = \frac{G^{(1:t)}}{t}$, $\Psi(W)$ is the regularization term, $h(W)$ is a strictly convex function, and $\beta^{(t)}$ is a non-negative, non-decreasing sequence in $t$:

$$W^{t+1} = \arg\min_W \left\{ g^{(1:t)} W + \Psi(W) + \frac{\beta^{(t)}}{t} h(W) \right\}$$

L1-RDA:

Let $\Psi(W) = \lambda \Vert W \Vert_1$, $h(W) = \frac{1}{2} \Vert W \Vert^2_2$, and $\beta^{(t)} = \gamma\sqrt{t}$, with $\lambda > 0$ and $\gamma > 0$. Substituting these into the update, so that $\frac{\beta^{(t)}}{t} h(W) = \frac{\gamma}{2\sqrt{t}} \Vert W \Vert^2_2$, gives:

$$W^{t+1} = \arg\min_W \left\{ g^{(1:t)} W + \lambda \Vert W \Vert_1 + \frac{\gamma}{2\sqrt{t}} \Vert W \Vert^2_2 \right\}$$

Closed-form solution:

$$w^{t+1}_i = \begin{cases} 0, & \text{if } \vert g^{(1:t)}_i \vert < \lambda \\ -\frac{\sqrt{t}}{\gamma} \left( g^{(1:t)}_i - \lambda \operatorname{sgn}(g^{(1:t)}_i) \right), & \text{otherwise} \end{cases}$$

The derivation is omitted; it follows the same reasoning as the derivation of the FTRL closed-form solution below.
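For comparison with L1-FOBOS, here is a minimal sketch of the L1-RDA closed form. It assumes the caller keeps the running gradient sum $G^{(1:t)}$; the function name `l1_rda_weights` and the sample numbers are illustrative only.

```python
import numpy as np

def l1_rda_weights(g_sum, t, lam, gamma):
    """Closed-form L1-RDA weights from the accumulated gradient after t steps."""
    g_avg = g_sum / t                                   # g^{(1:t)}
    shrunk = np.maximum(np.abs(g_avg) - lam, 0.0)       # 0 where |g_avg| <= lam
    # Equivalent to -(sqrt(t)/gamma) * (g_avg - lam * sgn(g_avg)) on the surviving coordinates.
    return -(np.sqrt(t) / gamma) * np.sign(g_avg) * shrunk

g_sum = np.array([4.0, -0.8, 0.4])
w_next = l1_rda_weights(g_sum, t=4, lam=0.5, gamma=2.0)
print(w_next)
```

Note that the weights are recomputed from the accumulated gradient alone; the previous weight vector never appears in the update.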

L1-FOBOS versus L1-RDA

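Consider how truncation happens. In RDA, a weight is truncated to zero as soon as the absolute value of the averaged accumulated gradient falls below the parameter $\lambda$. That threshold is constant, whereas the L1-FOBOS threshold $\eta^{t+0.5}\lambda$ shrinks as the learning rate decays, so RDA produces sparsity more readily. In addition, because the RDA truncation condition looks at the averaged accumulated gradient, it avoids truncating a dimension merely because that dimension has not been trained enough, which is different from TG (Truncated Gradient) and FOBOS. Tuning $\lambda$ trades off accuracy against sparsity.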
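The difference in thresholds can be made concrete with a few numbers. This is only a sketch: it assumes the schedule $\eta^{t+0.5} = \eta^t = 1/(\gamma\sqrt{t})$, and the two thresholds are applied to different quantities (the half-step weight for FOBOS, the averaged gradient for RDA), so the comparison is heuristic.

```python
import math

def truncation_thresholds(t, lam, gamma):
    """Return (L1-FOBOS threshold, L1-RDA threshold) at iteration t.

    Assumes eta^{t+0.5} = eta^t = 1 / (gamma * sqrt(t)), an illustrative schedule.
    """
    eta_t = 1.0 / (gamma * math.sqrt(t))
    return eta_t * lam, lam

for t in (1, 100, 10000):
    fobos, rda = truncation_thresholds(t, lam=0.1, gamma=1.0)
    print(f"t={t}: FOBOS threshold={fobos:.4f}, RDA threshold={rda:.4f}")
```

The FOBOS threshold falls with $1/\sqrt{t}$ while the RDA threshold stays at $\lambda$, which is the sense in which RDA truncates more aggressively late in training.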

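Why must $h(W)$ be a strictly convex function?

The sum of convex functions is convex, so the linear term and the regularization term keep the objective convex, and adding a strictly convex $h(W)$ makes the whole objective strictly convex. The argmin then has at most one minimizer, and with $h(W) = \frac{1}{2}\Vert W \Vert^2_2$ a minimizer is guaranteed to exist. Without this guarantee the argmin may have no well-defined solution, and the weights could not be updated.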

Why is $\beta^{(t)}$ a non-negative, non-decreasing sequence in $t$?

Think of the learning rate as $\eta^t = \frac{1}{\gamma\sqrt{t}}$. Then $\beta^{(t)}$ can be read as the reciprocal of the learning rate. Since the learning rate is a positive number that decreases as the iterations go on, $\beta^{(t)}$ is a non-negative, non-decreasing sequence in $t$.

The FTRL algorithm

FTRL takes into account the strengths and weaknesses of both FOBOS and RDA in how they treat the gradient and the regularization term. With the accumulated gradient $G^{(1:t)} = \sum_{r=1}^t G^r$, $\sigma^s = \frac{1}{\eta^s} - \frac{1}{\eta^{s-1}}$, $\sigma^{(1:t)} = \sum_{s=1}^t \sigma^s = \frac{1}{\eta^t}$ (taking $\frac{1}{\eta^0} = 0$), $\lambda_1 > 0$, and $\lambda_2 > 0$, the feature weight update is:

$$W^{t+1} = \arg\min_W \left\{ G^{(1:t)} W + \lambda_1 \Vert W \Vert_1 + \frac{\lambda_2}{2} \Vert W \Vert^2_2 + \frac{1}{2} \sum_{s=1}^t \sigma^s \Vert W - W^s \Vert^2_2 \right\}$$

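The learning rate for dimension $i$ is set to $\eta^t_i = \frac{\alpha}{\beta + \sqrt{\sum_{s=1}^t (g^s_i)^2}}$, which decreases as the iterations accumulate. The main role of $\beta$ is to keep the denominator from being zero. (This $\beta$ is a hyperparameter and is unrelated to the sequence $\beta^{(t)}$ in RDA.)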
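As a small illustration of the per-coordinate learning rate, the sketch below computes it for every dimension from a gradient history. Storing the full history is only for clarity here (a real implementation would keep a running sum); the function name `per_coordinate_lr` is made up for this example.

```python
import numpy as np

def per_coordinate_lr(grad_history, alpha, beta):
    """eta_i^t for every coordinate i, given the gradients g^1..g^t as rows."""
    n = np.sum(np.square(grad_history), axis=0)   # sum over s of (g_i^s)^2
    return alpha / (beta + np.sqrt(n))

grads = np.array([[3.0, 0.0],     # g^1
                  [4.0, 0.0]])    # g^2
eta = per_coordinate_lr(grads, alpha=0.1, beta=1.0)
print(eta)
```

The first coordinate has seen large gradients and gets a smaller rate; the second has seen none, so its rate is still $\alpha/\beta$.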

Replacing the learning rate with $\sigma$ lets us write L1-FOBOS, L1-RDA, and FTRL in similar forms (the L1-RDA objective has been multiplied through by $t$, which does not change the argmin):

$$W^{t+1}_{(L1\text{-}FOBOS)} = \arg\min_W \left\{ G^t W + \lambda \Vert W \Vert_1 + \frac{1}{2} \sigma^{(1:t)} \Vert W - W^t \Vert^2_2 \right\}$$

$$W^{t+1}_{(L1\text{-}RDA)} = \arg\min_W \left\{ G^{(1:t)} W + t\lambda \Vert W \Vert_1 + \frac{1}{2} \sigma^{(1:t)} \Vert W - 0 \Vert^2_2 \right\}$$

$$W^{t+1}_{(FTRL)} = \arg\min_W \left\{ G^{(1:t)} W + \lambda_1 \Vert W \Vert_1 + \frac{\lambda_2}{2} \Vert W \Vert^2_2 + \frac{1}{2} \sum_{s=1}^t \sigma^s \Vert W - W^s \Vert^2_2 \right\}$$

Term-by-term explanation: TODO

Closed-form solution and its derivation:

Expanding the squared terms and dropping the terms that do not depend on $W$ gives:

$$W^{t+1} = \arg\min_W \left\{ \left( G^{(1:t)} - \sum_{s=1}^t \sigma^s W^s \right) W + \lambda_1 \Vert W \Vert_1 + \frac{1}{2} \left( \lambda_2 + \sum_{s=1}^t \sigma^s \right) \Vert W \Vert^2_2 \right\}$$

Let $Z^t = G^{(1:t)} - \sum_{s=1}^t \sigma^s W^s$, so that $Z^t = Z^{t-1} + G^t - \sigma^t W^t$. Then:

$$W^{t+1} = \arg\min_W \left\{ Z^t W + \lambda_1 \Vert W \Vert_1 + \frac{1}{2} \left( \lambda_2 + \sum_{s=1}^t \sigma^s \right) \Vert W \Vert^2_2 \right\}$$

For a single dimension $i$:

$$w^{t+1}_i = \arg\min_{w_i} \left\{ z^t_i w_i + \lambda_1 \vert w_i \vert + \frac{1}{2} \left( \lambda_2 + \sum_{s=1}^t \sigma^s \right) w^2_i \right\}$$

Suppose $w^*_i$ is the optimal solution. Setting the (sub)derivative of the objective to zero gives the following condition, where $\operatorname{sgn}(0)$ stands for any value in $[-1, 1]$ (the subgradient of $\vert w \vert$ at 0):

$$z^t_i + \lambda_1 \operatorname{sgn}(w^*_i) + \left( \lambda_2 + \sum_{s=1}^t \sigma^s \right) w^*_i = 0$$

We consider three cases.

Case 1: $\vert z^t_i \vert \leq \lambda_1$.

If $w^*_i = 0$: the condition requires $\operatorname{sgn}(0) = -z^t_i / \lambda_1$, which lies in $[-1, 1]$ (the subgradient of $\vert w \vert$ at 0), so the condition holds.

If $w^*_i > 0$: $z^t_i + \lambda_1 \operatorname{sgn}(w^*_i) = z^t_i + \lambda_1 \geq 0$ and $(\lambda_2 + \sum_{s=1}^t \sigma^s) w^*_i > 0$, so the left-hand side is strictly positive and the condition cannot hold.

If $w^*_i < 0$: $z^t_i + \lambda_1 \operatorname{sgn}(w^*_i) = z^t_i - \lambda_1 \leq 0$ and $(\lambda_2 + \sum_{s=1}^t \sigma^s) w^*_i < 0$, so the left-hand side is strictly negative and the condition cannot hold.

Case 2: $z^t_i > \lambda_1$.

If $w^*_i = 0$: the condition would require $\operatorname{sgn}(0) = -z^t_i / \lambda_1 < -1$, which is outside $[-1, 1]$, so the condition cannot hold.

If $w^*_i > 0$: $z^t_i + \lambda_1 \operatorname{sgn}(w^*_i) = z^t_i + \lambda_1 > 0$ and $(\lambda_2 + \sum_{s=1}^t \sigma^s) w^*_i > 0$, so the condition cannot hold.

If $w^*_i < 0$: $z^t_i + \lambda_1 \operatorname{sgn}(w^*_i) = z^t_i - \lambda_1 > 0$ and $(\lambda_2 + \sum_{s=1}^t \sigma^s) w^*_i < 0$, so a solution exists. Using $\sum_{s=1}^t \sigma^s = \frac{1}{\eta^t_i}$:

$$w^*_i = -\left( \frac{\beta + \sqrt{\sum_{s=1}^t (g^s_i)^2}}{\alpha} + \lambda_2 \right)^{-1} (z^t_i - \lambda_1)$$

Case 3: $z^t_i < -\lambda_1$.

If $w^*_i = 0$: the condition would require $\operatorname{sgn}(0) = -z^t_i / \lambda_1 > 1$, which is outside $[-1, 1]$, so the condition cannot hold.

If $w^*_i > 0$: $z^t_i + \lambda_1 \operatorname{sgn}(w^*_i) = z^t_i + \lambda_1 < 0$ and $(\lambda_2 + \sum_{s=1}^t \sigma^s) w^*_i > 0$, so a solution exists. Using $\sum_{s=1}^t \sigma^s = \frac{1}{\eta^t_i}$:

$$w^*_i = -\left( \frac{\beta + \sqrt{\sum_{s=1}^t (g^s_i)^2}}{\alpha} + \lambda_2 \right)^{-1} (z^t_i + \lambda_1)$$

If $w^*_i < 0$: $z^t_i + \lambda_1 \operatorname{sgn}(w^*_i) = z^t_i - \lambda_1 < 0$ and $(\lambda_2 + \sum_{s=1}^t \sigma^s) w^*_i < 0$, so the condition cannot hold.

Putting the three cases together gives the closed-form solution as a piecewise function:

$$w^{t+1}_i = \begin{cases} 0, & \text{if } \vert z^t_i \vert \leq \lambda_1 \\ -\left( \frac{\beta + \sqrt{\sum_{s=1}^t (g^s_i)^2}}{\alpha} + \lambda_2 \right)^{-1} \left( z^t_i - \operatorname{sgn}(z^t_i) \lambda_1 \right), & \text{otherwise} \end{cases}$$
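The piecewise solution translates directly into a few lines of code. The sketch below handles one coordinate; the argument `n` stands for $\sum_{s=1}^t (g^s_i)^2$, and the function name `ftrl_weight` is made up for this example.

```python
import math

def ftrl_weight(z, n, alpha, beta, l1, l2):
    """Closed-form FTRL weight for one coordinate.

    z: z_i^t; n: sum of squared gradients seen so far for this coordinate.
    """
    if abs(z) <= l1:
        return 0.0                                   # truncated by the L1 term
    sign = 1.0 if z > 0 else -1.0
    return -(z - sign * l1) / ((beta + math.sqrt(n)) / alpha + l2)

print(ftrl_weight(z=0.5, n=4.0, alpha=1.0, beta=1.0, l1=1.0, l2=1.0))
print(ftrl_weight(z=2.0, n=4.0, alpha=1.0, beta=1.0, l1=1.0, l2=1.0))
print(ftrl_weight(z=-2.0, n=4.0, alpha=1.0, beta=1.0, l1=1.0, l2=1.0))
```

The weight always has the opposite sign of $z^t_i$, matching the second and third cases of the derivation.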

[Figure: pseudocode from the paper [1]]
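For readers who prefer code, here is a rough Python rendering of per-coordinate FTRL for logistic regression, written from the update rules derived above rather than copied from the paper's figure. The class name, the dict-based sparse input, the score clipping, and the toy data are all assumptions made for this sketch.

```python
import math
from collections import defaultdict


class FTRLProximal:
    """Per-coordinate FTRL for logistic regression on sparse inputs (a sketch)."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)   # z_i
        self.n = defaultdict(float)   # n_i: running sum of squared gradients

    def weight(self, i):
        """Closed-form weight for coordinate i, computed lazily from z_i and n_i."""
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict(self, x):
        """x maps feature id -> value and lists only the non-zero features."""
        score = sum(self.weight(i) * v for i, v in x.items())
        score = max(-35.0, min(35.0, score))      # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, x, y):
        """Predict on x, then update z and n for the active features; y is 0 or 1."""
        w = {i: self.weight(i) for i in x}
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v                       # gradient of the log loss w.r.t. w_i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]         # Z^t = Z^{t-1} + G^t - sigma^t W^t
            self.n[i] += g * g
        return p


model = FTRLProximal(alpha=0.1, beta=1.0, l1=0.1, l2=1.0)
for _ in range(50):
    model.update({"a": 1.0}, 1)   # feature "a" always comes with a click
    model.update({"b": 1.0}, 0)   # feature "b" never does

print(model.predict({"a": 1.0}), model.predict({"b": 1.0}))
```

Only `z` and `n` are stored per feature; the weights are recomputed on demand, which is what makes the sparse, lazily-updated representation possible.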

Including the L2 norm or not is equivalent

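It is easy to notice that the weight update formula in the paper [1] has no L2 regularization term, while the pseudocode does have an L2 coefficient $\lambda_2$. The reason is that $\lambda_2$ enters the closed-form solution only through the sum $\frac{\beta + \sqrt{n_i}}{\alpha} + \lambda_2$, so it can be absorbed into $\beta$: $\frac{\beta}{\alpha} + \lambda_2 = \frac{\beta + \alpha\lambda_2}{\alpha}$. In other words, by tuning the other hyperparameters, including the L2 norm or leaving it out makes no difference. Seen this way, the formulation without the L2 term has one fewer hyperparameter, and anyone who has tuned hyperparameters knows what one fewer hyperparameter means.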
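A quick numeric check of the absorption argument. This sketch only looks at the factor that multiplies the weight in the closed-form solution, which is the one place where $\beta$ and $\lambda_2$ appear there; the function name and the numbers are illustrative.

```python
import math

def quadratic_coefficient(n, alpha, beta, l2):
    """The factor (beta + sqrt(n)) / alpha + lambda_2 from the closed-form solution."""
    return (beta + math.sqrt(n)) / alpha + l2

alpha, beta, l2 = 0.5, 1.0, 2.0
beta_absorbed = beta + alpha * l2          # fold lambda_2 into beta

for n in (0.0, 9.0, 100.0):
    with_l2 = quadratic_coefficient(n, alpha, beta, l2)
    without_l2 = quadratic_coefficient(n, alpha, beta_absorbed, 0.0)
    assert math.isclose(with_l2, without_l2)
```

The two settings give the same factor for every value of the accumulated squared gradient, so they produce the same weights.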

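Why the learning rate looks like this

The idea is similar to Adagrad: each dimension gets its own learning rate, which shrinks as the squared gradients of that dimension accumulate.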

Explanation using the coin-flip experiment: TODO

FTRL without regularization is equivalent to SGD (this can be derived)

In the paper's own words: "Without regularization, this algorithm is identical to standard online gradient descent."
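The derivation is short. As a sketch, assume $\lambda_1 = \lambda_2 = 0$, $W^1 = 0$, $\frac{1}{\eta^0} = 0$, and use $Z^t = G^{(1:t)} - \sum_{s=1}^t \sigma^s W^s$ with $\sum_{s=1}^t \sigma^s = \frac{1}{\eta^t}$ as defined in the FTRL section:

$$
\begin{aligned}
W^{t+1} &= \arg\min_W \left\{ Z^t W + \frac{1}{2\eta^t} \Vert W \Vert^2_2 \right\} = -\eta^t Z^t, \qquad \text{hence } Z^{t-1} = -\frac{W^t}{\eta^{t-1}} \\
Z^t &= Z^{t-1} + G^t - \sigma^t W^t = -\frac{W^t}{\eta^{t-1}} + G^t - \left( \frac{1}{\eta^t} - \frac{1}{\eta^{t-1}} \right) W^t = G^t - \frac{W^t}{\eta^t} \\
W^{t+1} &= -\eta^t \left( G^t - \frac{W^t}{\eta^t} \right) = W^t - \eta^t G^t
\end{aligned}
$$

The last line is exactly the online gradient descent update. For $t = 1$ the same result follows directly from $Z^1 = G^1$ and $W^1 = 0$.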

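How to understand the role of the accumulated gradient intuitively

In the implementation, what is the difference between full training and incremental training?

Engineering tricks in FTRL implementations

Approximating the sum of squared gradients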

If this is not clear, go back and study the LR formulas carefully.
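As far as I understand the trick in [1], the idea is to replace the per-coordinate sum of squared gradients with a quantity computed from click counts. The sketch below assumes a binary feature (value 1 when present), $P$ positive and $N$ negative events for that feature, and a prediction $p$ close to the empirical rate $P/(N+P)$. The LR gradient is $p - 1$ on a positive and $p$ on a negative, so the sum is $P(1-p)^2 + Np^2$, which becomes $\frac{PN}{N+P}$ under that assumption. Function names are made up for illustration.

```python
import math

def grad_sq_sum_exact(p, positives, negatives):
    """Sum of squared LR gradients if every event had the same prediction p."""
    return positives * (1.0 - p) ** 2 + negatives * p ** 2

def grad_sq_sum_from_counts(positives, negatives):
    """The count-based approximation P * N / (N + P)."""
    return positives * negatives / (positives + negatives)

P, N = 20, 980
p = P / (P + N)                      # assumed: prediction equals the empirical rate
exact = grad_sq_sum_exact(p, P, N)
approx = grad_sq_sum_from_counts(P, N)
print(exact, approx)
```

Storing two counters per feature instead of a floating-point sum is what makes this attractive when memory is tight.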

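Dropping low-frequency features

Because of the long tail, most features are sparse and occur very rarely, and in an online setting feature frequencies cannot be counted in a batch pass. The paper proposes two schemes: Poisson Inclusion, which decides with probability $p$ whether a new feature enters the model and gets updated, and Bloom Filter Inclusion. Most implementations I have seen use the Bloom filter.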
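To make the first scheme concrete, here is a minimal sketch of probabilistic inclusion: an unseen feature is admitted with probability `p` each time it occurs, and once admitted it is always updated. The class name, the in-memory set, and the seeding are assumptions for illustration, not the paper's implementation.

```python
import random

class PoissonInclusion:
    """Admit an unseen feature into the model with probability p per occurrence."""

    def __init__(self, p, seed=0):
        self.p = p
        self.rng = random.Random(seed)
        self.admitted = set()

    def should_update(self, feature):
        if feature in self.admitted:
            return True                     # already in the model: update as usual
        if self.rng.random() < self.p:
            self.admitted.add(feature)      # this occurrence wins the coin flip
            return True
        return False                        # skip: the feature stays out of the model

gate = PoissonInclusion(p=0.1)
seen = sum(gate.should_update("frequent_feature") for _ in range(1000))
print(seen, "of 1000 occurrences were used for updates")
```

A feature that occurs often is admitted after roughly $1/p$ occurrences on average, while a feature that appears once or twice will usually never enter the model.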

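Negative subsampling: when updating the weights, divide the gradient of each retained negative example by the negative sampling rate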
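A minimal sketch of this trick, assuming sampling happens per example (a simplification) with rate `r` for negatives: every positive is kept with weight 1, each negative is kept with probability `r` and weight `1/r`, and the weight multiplies the gradient. Function names and the toy data are illustrative.

```python
import random

def subsample(examples, r, rng):
    """Keep every positive; keep each negative with probability r.

    Returns (x, y, weight) triples with weight 1 for positives and 1/r for negatives.
    """
    kept = []
    for x, y in examples:
        if y == 1:
            kept.append((x, y, 1.0))
        elif rng.random() < r:
            kept.append((x, y, 1.0 / r))
    return kept

def weighted_gradient(p, y, x_value, weight):
    """LR gradient for one feature, scaled by the importance weight."""
    return weight * (p - y) * x_value

rng = random.Random(0)
data = [({"f": 1.0}, 1)] * 5 + [({"f": 1.0}, 0)] * 95
sample = subsample(data, r=0.1, rng=rng)
print(len(sample), "examples kept out of", len(data))
```

Scaling the gradient of a retained negative by `1/r` compensates in expectation for the negatives that were dropped, so the predictions stay calibrated.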

Encoding floating-point values with fewer bits
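One way to do this, which I believe matches the 16-bit fixed-point scheme described in [1], is to keep 13 fractional bits and round randomly so that the rounding error does not accumulate in one direction. The sketch below is an illustration under that assumption; the function names and the clamping to the signed 16-bit range are my own choices.

```python
import math
import random

SCALE = 1 << 13          # 13 fractional bits

def encode_q2_13(v, rng):
    """Randomized rounding of v to a multiple of 2^-13, as a signed 16-bit integer."""
    q = math.floor(v * SCALE + rng.random())        # rounds up with probability frac(v * SCALE)
    return max(-(1 << 15), min((1 << 15) - 1, q))   # clamp to the int16 range

def decode_q2_13(q):
    return q / SCALE

rng = random.Random(0)
v = 0.1234
q = encode_q2_13(v, rng)
print(q, decode_q2_13(q), abs(decode_q2_13(q) - v))
```

For values inside the representable range (roughly -4 to 4) the decoded value differs from the original by less than $2^{-13}$, and the rounding is unbiased in expectation.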

Empirical values for the four hyperparameters

How to use FTRL for ad exploration: TODO

[1] McMahan, H. Brendan, et al. "Ad click prediction: a view from the trenches." Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013.

[2] Zhang Rong (张戎). "FOLLOW THE REGULARIZED LEADER (FTRL) 算法总结" (a summary of the FTRL algorithm). https://zhuanlan.zhihu.com/p/32903540