NetVLAD: CNN architecture for weakly supervised place recognition


Background:

Vector of Locally Aggregated Descriptors (VLAD) for image retrieval.
【CC】VLAD is a widely used image feature extractor; this paper builds its improvement on top of it. The details are introduced below.

weakly supervised ranking loss
【CC】The paper's other contribution is the design of a weakly supervised loss, introduced later.

Place recognition as an instance retrieval task: the query image location is estimated using the locations of the most visually similar images obtained by querying a large geotagged database; each image is represented using local invariant features such as SIFT. The locations of the top-ranked images are used as suggestions for the location of the query.
【CC】Place recognition can be viewed as an instance retrieval task: given a query image, its location is estimated from the most similar images in a stored, geotagged database. Storing raw images is impractical, so the database holds features extracted with local invariant descriptors such as SIFT.

The representation is compressed and efficiently indexed. Augmenting the image database with 3D structure enables recovery of an accurate camera pose.
【CC】The extracted features must be compressed and efficiently indexable (ideally supporting fast ranking). If the database is augmented with 3D structure, the camera's 3D pose can be recovered from the stored features.

What is the appropriate representation of a place that is rich enough to distinguish similarly looking places?
【CC】The essence of place recognition is designing an operator or NN that represents a place with enough information to measure similarity.

Approach:

First, what is a good CNN architecture for place recognition?
【CC】Design an NN that extracts features for place recognition.
Inspired by the Vector of Locally Aggregated Descriptors (VLAD) representation, the authors develop a convolutional neural network architecture that aggregates mid-level (conv5) convolutional features extracted from the entire image into a compact single-vector representation; the resulting aggregated representation is compressed with PCA.
【CC】The overall idea is an improvement on VLAD: take a classic CNN (VGG16 and AlexNet adaptations are compared later), crop it at the conv5 block, aggregate those mid-level features, and compress the result with PCA.

Second, how to gather a sufficient amount of annotated data?
【CC】How to obtain enough labeled data.
We know the two panoramas were captured at approximately similar positions based on their (noisy) GPS, but we don't know which parts of the panoramas depict the same parts of the scene.
【CC】The panorama data is noisy: we only know that two images were taken near each other, not which parts of the scene they share.

Third, how can we train the developed architecture tailored for the place recognition task?
【CC】As the ranking-loss design below shows, training only needs weakly labeled tuples of near/far images, not exact correspondences.

A function f acts as the "image representation extractor": given an image Ii, it produces a fixed-size vector f(Ii). Representations are computed for the entire database {Ii}. Visual search finds the nearest database image to the query, exactly or through approximate nearest-neighbour search, by sorting images based on the Euclidean distance d(q, Ii) between f(q) and f(Ii).
【CC】Formalization: f is an extractor that maps an image Ii to a fixed-length vector. Given a query, find the "closest" image in the database, where distance is the Euclidean distance between the query feature f(q) and a candidate feature f(Ii).

The representation is parametrized with a set of parameters θ, and we refer to it as fθ(I). The Euclidean distance is dθ(Ii, Ij) = ||fθ(Ii) − fθ(Ij)||.
【CC】Going further: the NN with parameters θ is written fθ(I); the corresponding Euclidean distance is given above.
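To make the retrieval step concrete, here is a minimal sketch of exact nearest-neighbour search by Euclidean distance over precomputed descriptors (function and variable names are my own; the paper also allows approximate search):

```python
import numpy as np

def retrieve(query_vec: np.ndarray, db_vecs: np.ndarray, top_k: int = 5):
    """Rank database images by Euclidean distance d_theta(q, I_i).

    query_vec: (D,)   descriptor f_theta(q)
    db_vecs:   (M, D) stacked descriptors f_theta(I_i) for the whole database
    """
    dists = np.linalg.norm(db_vecs - query_vec, axis=1)  # d_theta(q, I_i) for all i
    order = np.argsort(dists)                            # ascending: nearest first
    return order[:top_k], dists[order[:top_k]]
```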

Design requirements

Most image retrieval pipelines are based on (i) extracting local descriptors, which are then (ii) pooled in an orderless manner.
【CC】The traditional feature-extraction pipeline: first extract local features, then pool them in an orderless way (similar to ORB + bag-of-words).

Robustness to lighting and viewpoint changes is provided by the descriptors themselves.
【CC】Robustness to lighting/viewpoint changes depends on the performance of the descriptor itself.

(i) We crop the CNN at the last convolutional layer and view it as a dense descriptor extractor.
(ii) We design a new pooling layer that pools the extracted descriptors into a fixed-size image representation, and whose parameters are learnable.
【CC】Crop the CNN at its last convolutional layer to use as the extractor, then use the newly designed pooling layer to output a fixed-length vector.

Traditional VLAD

Formally, given N D-dimensional local image descriptors {xi} as input, and K cluster centres (“visual words”) {ck} as VLAD parameters, the output VLAD image representation V is K×D-dimensional.
【CC】Input: N D-dimensional features; parameters: K cluster centres. VLAD outputs a K×D matrix.

The (j, k) element of V is computed as follows:
$$V(j,k) = \sum_{i=1}^{N} a_k(x_i)\,\bigl(x_i(j) - c_k(j)\bigr)$$
where xi(j) and ck(j) are the j-th dimensions of the i-th descriptor and the k-th cluster centre, respectively. ak(xi) denotes the membership of descriptor xi to the k-th visual word, i.e. it is 1 if cluster ck is the closest cluster to descriptor xi and 0 otherwise.
【CC】The formula for V is above: xi(j) is the j-th dimension of the i-th feature, ck(j) the j-th dimension of the k-th centre; ak(xi) is an indicator function: 1 if feature xi is closest to centre ck, 0 otherwise. V(j,k) acts like a "covariance-style" matrix of the xi about the ck (in fact, a matrix of residuals about the centres).

Intuitively, each D-dimensional column k of V records the sum of residuals (xi − ck) of the descriptors assigned to cluster ck.
【CC】Column k of V is the sum of residuals of the xi assigned to ck.

The matrix V is then L2-normalized column-wise, converted into a vector, and finally L2-normalized in its entirety.
【CC】First L2-normalize each column of V, flatten the matrix into a vector, then L2-normalize the whole vector.
【Summary】All of the above is the VLAD computation, which yields a fixed-length vector. Next, VLAD is reworked for a CNN.
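As a reference point before the NetVLAD modification, a minimal NumPy sketch of the original VLAD computation just described (hard assignment; the function name and the small epsilon guards are my additions):

```python
import numpy as np

def vlad(descriptors: np.ndarray, centres: np.ndarray) -> np.ndarray:
    """Plain VLAD: descriptors (N, D), centres (K, D) -> (K*D,) vector."""
    # hard assignment a_k(x_i): index of the nearest centre for each descriptor
    dists = np.linalg.norm(descriptors[:, None, :] - centres[None, :, :], axis=2)  # (N, K)
    nearest = np.argmin(dists, axis=1)

    K, D = centres.shape
    V = np.zeros((K, D))
    for k in range(K):
        assigned = descriptors[nearest == k]
        if len(assigned):
            V[k] = (assigned - centres[k]).sum(axis=0)   # sum of residuals x_i - c_k

    # per-cluster ("column-wise") L2 normalization, then global L2 normalization
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```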

Formal improvements

To mimic VLAD in a CNN framework, the layer's operation must be differentiable. The source of discontinuities in VLAD is the hard assignment ak(xi); we replace it with a soft assignment:
$$\bar{a}_k(x_i) = \frac{e^{-\alpha \lVert x_i - c_k \rVert^2}}{\sum_{k'} e^{-\alpha \lVert x_i - c_{k'} \rVert^2}}$$
which assigns the weight of descriptor xi to cluster ck proportionally to their proximity. α is a positive constant that controls the decay of the response with the magnitude of the distance; α → +∞ replicates the original VLAD.
【CC】To mimic VLAD with a CNN, every layer must be differentiable. The non-differentiable part of the original VLAD is the indicator function ak(xi); we replace it with a (differentiable) soft-assignment function. The new āk(xi) expresses the weight of xi with respect to centre ck as a function of their distance. α is a decay hyperparameter; setting α → +∞ recovers the original ak(xi).

By expanding the squares, it is easy to see that the term e^{−α||xi||²} cancels between the numerator and the denominator, resulting in a soft assignment of the following form:
$$\bar{a}_k(x_i) = \frac{e^{w_k^T x_i + b_k}}{\sum_{k'} e^{w_{k'}^T x_i + b_{k'}}}$$
where vector wk = 2αck and scalar bk = −α||ck||².
【CC】Expanding the squares, the quadratic term cancels between numerator and denominator, leaving an almost-linear expression. Substituting the new āk(xi) into the computation of V gives:
$$V(j,k) = \sum_{i=1}^{N} \frac{e^{w_k^T x_i + b_k}}{\sum_{k'} e^{w_{k'}^T x_i + b_{k'}}}\,\bigl(x_i(j) - c_k(j)\bigr)$$
where {wk}, {bk} and {ck} are sets of trainable parameters for each cluster k.
Similarly to the original VLAD descriptor, the NetVLAD layer aggregates the first-order statistics of residuals (xi − ck) in different parts of the descriptor space, weighted by the soft assignment āk(xi) of descriptor xi to cluster k.
【CC】{wk}, {bk} and {ck} are all trainable parameters of the network, one set per cluster k. Compared with the original, the new V keeps the same residual term xi − ck, but replaces the hard ak with the trainable soft assignment āk.

There are three independent sets of parameters {wk}, {bk} and {ck}, compared to just {ck} in the original VLAD. This enables greater flexibility than the original VLAD, as the figure below shows:
[Figure]
Benefits of supervised VLAD. Red and green circles are local descriptors from two different images, assigned to the same cluster (Voronoi cell). Under the VLAD encoding, their contribution to the similarity score between the two images is the scalar product (as final VLAD vectors are L2-normalized) between the corresponding residuals, where a residual vector is computed as the difference between the descriptor and the cluster's anchor point. The anchor point ck can be interpreted as the origin of a new coordinate system local to the specific cluster k. In standard VLAD, the anchor is chosen as the cluster centre (×) in order to evenly distribute the residuals across the database. However, in a supervised setting where the two descriptors are known to belong to images that should not match, it is possible to learn a better anchor (★) which causes the scalar product between the new residuals to be small.
【CC】The new formula has three independent parameter sets {wk}, {bk}, {ck}, making it more flexible than the original. In the figure, the inner product of the residuals of the green/red points (features from two images) about ck represents their similarity. With supervision we can learn a better anchor: knowing that the red and green points are not the same place, ck can migrate from the × to the ★.
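A minimal PyTorch sketch of the NetVLAD layer as equation (4) describes it. The initialization wk = 2αck, bk = −α||ck||² follows the derivation above; after initialization, {wk}, {bk} and {ck} train as three independent parameter sets. The class name, parameter names, and the random centre initialization (in practice one would use k-means on conv5 descriptors) are my assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Soft-assign N = H*W conv features to K clusters and aggregate residuals."""

    def __init__(self, num_clusters: int = 64, dim: int = 512, alpha: float = 100.0):
        super().__init__()
        init_centres = 0.1 * torch.randn(num_clusters, dim)   # stand-in for k-means centres
        self.centres = nn.Parameter(init_centres.clone())     # {c_k}
        # soft assignment as a 1x1 convolution followed by softmax over clusters
        self.conv = nn.Conv2d(dim, num_clusters, kernel_size=1)
        with torch.no_grad():
            self.conv.weight.copy_((2.0 * alpha * init_centres).view(num_clusters, dim, 1, 1))  # w_k
            self.conv.bias.copy_(-alpha * init_centres.norm(dim=1) ** 2)                        # b_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, D, H, W = x.shape                                       # conv5 feature map
        a_bar = F.softmax(self.conv(x), dim=1).view(B, -1, H * W)  # (B, K, N): soft assignment
        x_flat = x.view(B, D, H * W)                               # (B, D, N)
        # V[b,k,j] = sum_i a_bar_k(x_i) * (x_i(j) - c_k(j))
        V = torch.einsum('bkn,bdn->bkd', a_bar, x_flat) \
            - a_bar.sum(dim=2).unsqueeze(2) * self.centres.unsqueeze(0)
        V = F.normalize(V, p=2, dim=2)                  # per-cluster (intra) normalization
        return F.normalize(V.view(B, -1), p=2, dim=1)   # flatten + global L2 normalization
```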

Practical challenges

(i) How to gather enough annotated training data?
– Weak supervision from the Time Machine (Google Street View Time Machine).
(ii) What is the appropriate loss for the place recognition task?
【CC】The second point is the real core of this paper. As for Google's data: it cannot be used in mainland China, Baidu Street View does not seem to offer this feature, and I am not yet aware of an open Chinese dataset that could substitute.
[Figure]
Google Street View Time Machine examples. Each column shows perspective images generated from panoramas from nearby locations, taken at different times. A well-designed method can use this source of imagery to learn to be invariant to changes in viewpoint and lighting (a-c) and to moderate occlusions (b). It can also learn to suppress confusing visual information such as clouds (a), vehicles and people (b-c), and to choose either to ignore vegetation or to learn a season-invariant vegetation representation (a-c).
【CC】The heart of it is still having a "ground truth" of sorts: knowing which images are physically close, so the NN can learn local invariance.

Therefore, for a given training query q, the GPS information can only be used as a source of (i) potential positives {pqi}, i.e. images that are geographically close to the query, and (ii) definite negatives {nqj}, i.e. images that are geographically far from the query.
【CC】This is the principle for using GPS information: GPS can only label images as near (potential positives) or far (definite negatives); it cannot by itself decide which image truly depicts the same scene.
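A sketch of how training tuples might be mined from GPS alone under this principle (the radii and function name are illustrative assumptions, not the paper's tuned values):

```python
import numpy as np

def build_tuple(q_idx: int, gps: np.ndarray, r_pos: float = 10.0, r_neg: float = 25.0):
    """One weakly supervised tuple (q, {p_i^q}, {n_j^q}) from geotags only.

    gps: (M, 2) planar (e.g. UTM) coordinates of all database images
    """
    d = np.linalg.norm(gps - gps[q_idx], axis=1)
    potential_positives = np.where((d < r_pos) & (np.arange(len(gps)) != q_idx))[0]
    definite_negatives = np.where(d > r_neg)[0]   # geographically far: safe negatives
    return q_idx, potential_positives, definite_negatives
```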

Loss function design

For a given test query image q, the goal is to rank a database image Ii∗ from a close-by location higher than all other, far-away images Ii in the database. That is, we wish the Euclidean distance dθ(q, Ii∗) between the query q and a close-by image Ii∗ to be smaller than the distance to far-away database images Ii: dθ(q, Ii∗) < dθ(q, Ii) for all images Ii further than a certain distance from the query on the map.
【CC】Given a query image q, search the database for an Ii∗ whose Euclidean distance to q is smaller than that of every far-away image.

We obtain a training dataset of tuples (q, {pqi}, {nqj}), where for each training query image q we have a set of potential positives {pqi} and a set of definite negatives {nqj}.
【CC】Each query q in the dataset comes with a set of positive samples {pqi} and a set of negative samples {nqj}.

The set of potential positives contains at least one positive image that should match the query, but we do not know which one. To address this ambiguity, we identify the best-matching potential positive image pqi∗:
$$p_{i^*}^{q} = \operatorname*{arg\,min}_{p_i^q}\; d_\theta(q, p_i^q)$$
for each training tuple (q, {pqi}, {nqj}). The goal then becomes to learn an image representation fθ so that the distance dθ(q, pqi∗) between the training query q and the best-matching potential positive pqi∗ is smaller than the distance dθ(q, nqj) between the query q and all negative images nqj:

$$d_\theta(q, p_{i^*}^{q}) < d_\theta(q, n_j^{q}), \quad \forall j$$
【CC】Given a best match pqi∗ for each query, what we train is the image representation fθ (the descriptor network, NetVLAD), so that the two formulas above hold: the best match has the smallest distance among the positives, and its distance is smaller than that of every negative.

Based on this intuition we define a weakly supervised ranking loss Lθ for a training tuple (q, {pqi }, {nqj}) as
$$L_\theta = \sum_j l\Bigl( \min_i d_\theta^2(q, p_i^q) + m - d_\theta^2(q, n_j^q) \Bigr)$$
where l is the hinge loss l(x) = max(x, 0), and m is a constant parameter giving the margin.
【CC】m is a hyperparameter: the margin by which every negative distance must exceed the best positive distance. If the margin is respected, the loss for that negative is 0; otherwise the loss grows with the violation. This is similar to a triplet loss.
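A direct transcription of this loss in PyTorch (a sketch: the margin value is illustrative, and the hard-negative mining the paper uses during training is omitted here):

```python
import torch

def weakly_supervised_ranking_loss(d_pos: torch.Tensor,
                                   d_neg: torch.Tensor,
                                   margin: float = 0.1) -> torch.Tensor:
    """L_theta for one tuple (q, {p_i^q}, {n_j^q}).

    d_pos: (P,) distances d_theta(q, p_i^q) to the potential positives
    d_neg: (J,) distances d_theta(q, n_j^q) to the definite negatives
    """
    best_pos_sq = d_pos.pow(2).min()   # min_i d^2: the best-matching potential positive
    # hinge l(x) = max(x, 0), summed over all negatives j
    return torch.clamp(best_pos_sq + margin - d_neg.pow(2), min=0).sum()
```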

Network architecture

[Figure: CNN backbone + NetVLAD layer architecture]
【CC】The input image passes through a CNN backbone (several CNN variants are compared), turning a W×H×D feature map into N D-dimensional descriptors x. The x are fed both into the small (w, b)θ convolution (the soft-assignment branch) and into the VLAD aggregation, which is essentially an implementation of the improved formula (equation 4). The trainable parts are fθ (the image feature layers) and (w, b)θ, trained so as to minimize the loss Lθ above.
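Putting the pieces together, a sketch of the full model under the above description (using the NetVLAD class sketched earlier; the crop point and cluster count follow the paper's VGG16 variant, but the exact assembly here is my own):

```python
import torch.nn as nn
import torchvision.models as models

# crop VGG16 at conv5_3 (its last convolutional layer) and append the NetVLAD layer
backbone = nn.Sequential(*list(models.vgg16(weights=None).features.children())[:-2])
model = nn.Sequential(backbone, NetVLAD(num_clusters=64, dim=512))
# forward: image batch (B, 3, H, W) -> L2-normalized (B, 64*512) NetVLAD descriptors
```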

