Decoupled Novel Object Captioner



  • Abstract
  • Introduction
  • Methods
    • Preliminaries
      • Zero-Shot Novel Object Captioning
    • Sequence Model with the Placeholder
    • Key-Value Object Memory
    • Framework Overview
    • Training
  • Reference

Reference [original]: Joselynzhao.top & 夏木青 | Decoupled Novel Object Captioner


Abstract

In this paper, we introduce the zero-shot novel object captioning task, where the machine generates descriptions without extra training sentences about the novel object. To tackle this challenging problem, we propose a Decoupled Novel Object Captioner (DNOC) framework that can fully decouple the language sequence model from the object descriptions.

**DNOC has two components:**

  1. A Sequence Model with the Placeholder (SM-P) generates a sentence containing placeholders.


  2. A key-value object memory, built upon a freely available detection model, contains the visual information and the corresponding word for each object.

Introduction

The captioning networks need a large amount of image-sentence paired data to train a meaningful model.

These captioning models fail at describing novel objects, which are unseen words in the paired training data.


However, to feed the novel object description into the generated captions, existing approaches either employ a pre-trained language sequence model [3, 34] or require extra unpaired training sentences of the novel object [41].

**In both cases, the novel objects have been used in training and, hence, are not really novel.**

A more precise meaning of novel in existing works is unseen in the paired training sentences.

In this paper, we tackle image captioning for novel objects, where we do not need any training sentences containing the object.

We utilize only a pre-trained object detection model for the novel object. We call this setting zero-shot novel object captioning to distinguish it from the traditional problem setting [3, 34, 41].

In zero-shot novel object captioning, there are zero training sentences about the novel object, i.e., there is no information about the semantic meaning, sense, or context of the object.

To address this problem, we propose a Decoupled Novel Object Captioner (DNOC) framework that is able to generate natural-language descriptions without extra training sentences about the novel object.

As shown in Fig. 1, our method first generates the captioning sentence by emitting a placeholder "<PL>" to represent any novel object. Then it learns to fill in the placeholder with "zebra" based on the visual object detection result.

The main contributions of this work are listed as follows:

  • We introduce the zero-shot novel object captioning task.
  • We design the Sequence Model with the Placeholder (SM-P).
  • A key-value object memory is introduced to incorporate external visual knowledge.

Methods

Preliminaries

Given an input image I, the goal is to generate an associated natural-language sentence s of length nl, denoted as s = (w1, w2, ..., wnl), where wi is the i-th word of the sentence.
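For reference, a generic formulation that captioning models of this kind optimize (a sketch in standard notation, not necessarily the paper's exact equations) factorizes the sentence probability word by word and minimizes the negative log-likelihood:

```latex
% Generic captioning objective (sketch); \theta denotes the model parameters.
p_\theta(s \mid I) = \prod_{t=1}^{n_l} p_\theta(w_t \mid w_{1:t-1}, I),
\qquad
\mathcal{L}(\theta) = - \sum_{t=1}^{n_l} \log p_\theta(w_t \mid w_{1:t-1}, I).
```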

Zero-Shot Novel Object Captioning.

We denote Wunseen as the vocabulary of the novel object words, which are unseen in training.

A notable challenge for this task is to deal with the out-of-vocabulary (OOV) words.

The learned word embedding function ϕw is unable to encode the unseen words, since these words cannot be found in Wpaired.

In previous settings, extra unpaired training sentences about the novel object are available; we denote these extra training sentences as Sunpaired.
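A tiny illustration of the OOV issue (the vocabulary and words here are hypothetical, not from the paper):

```python
# Hypothetical paired-training vocabulary; "zebra" never appears in it.
w_paired = {"<PAD>": 0, "<GO>": 1, "<EOS>": 2, "a": 3, "dog": 4, "grass": 5}

def encode(word, vocab):
    # A learned embedding lookup only has rows for words seen in training,
    # so an unseen (novel) word simply cannot be encoded.
    if word not in vocab:
        raise KeyError(f"'{word}' is out-of-vocabulary for the paired data")
    return vocab[word]

encode("dog", w_paired)      # fine: index 4
# encode("zebra", w_paired)  # would raise KeyError: the zero-shot setting provides no extra sentences to add it
```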

Sequence Model with the Placeholder


To solve this problem, we design a new token, denoted as "<PL>".
"<PL>" is the placeholder that represents any novel word w̃ ∈ Wunseen.
We add the token "<PL>" into the paired vocabulary Wpaired to learn its embedding.

Our model then utilizes the external knowledge from the object detection model to replace the placeholder.

We use an LSTM as the backbone of our SM-P.

Instead, the SM-P outputs the "<PL>" token when it needs to generate a word for an unseen object.

The "<PL>" token will then be replaced by the novel word generated by the key-value object memory.
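A minimal sketch of the SM-P idea at decoding time, assuming a generic LSTM decoder whose vocabulary has been extended with the "<PL>" token (the function signature and tokens are illustrative, not the authors' code):

```python
import numpy as np

# Extended paired vocabulary: "<PL>" is trained like any other token,
# so the decoder can emit it in place of a novel object word.
vocab = ["<GO>", "<EOS>", "a", "is", "standing", "on", "the", "grass", "<PL>"]

def greedy_decode(lstm_step, init_state, max_len=15):
    """lstm_step(prev_token_id, state) -> (logits over vocab, next state)."""
    tokens, state, prev = [], init_state, vocab.index("<GO>")
    for _ in range(max_len):
        logits, state = lstm_step(prev, state)
        prev = int(np.argmax(logits))      # greedy choice over the extended vocabulary
        if vocab[prev] == "<EOS>":
            break
        tokens.append(vocab[prev])         # may contain "<PL>" placeholders
    return tokens                          # e.g. ["a", "<PL>", "is", "standing", "on", "the", "grass"]
```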

Key-Value Object Memory

We exploit a pre-trained object detection model to build the key-value object memory, which stores the detection feature-label pairs {fi, li} of the image.
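A rough numpy sketch of the idea: the detected region features serve as keys, the detected class words as values, and a query vector derived from the decoder retrieves the word whose key matches best (the dot-product similarity, feature dimension, and helper names are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def build_object_memory(det_feats, det_labels):
    """Key-value object memory from a detector's outputs:
    keys are the region features f_i, values are the corresponding class words l_i."""
    keys = np.stack(det_feats)              # (num_objects, feat_dim)
    return keys, list(det_labels)

def query_memory(keys, values, query):
    """Return the word of the best-matching detected object.
    Plain dot-product similarity here; the actual model may learn a projection."""
    scores = keys @ query                   # (num_objects,)
    return values[int(np.argmax(scores))]

# Illustrative usage with fake detections for a zebra image.
feats = [np.random.rand(2048), np.random.rand(2048)]
labels = ["zebra", "grass"]
keys, values = build_object_memory(feats, labels)
word = query_memory(keys, values, query=np.random.rand(2048))   # -> "zebra" or "grass"
```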

Framework Overview


**For an input image with novel objects, we take the following steps to generate the captioning sentence** (see the sketch after the list):

  • (i) We first exploit the SM-P to generate a captioning sentence with some placeholders. Each placeholder represents an unseen word/phrase for a novel object;
  • (ii) We then build a key-value object memory Mobj for each input based on the detection feature-label pairs {fi, li} on the image;
  • (iii) Finally, we replace the placeholders in the sentence with the corresponding object descriptions.
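Putting the three steps together, a hypothetical end-to-end inference pass might look like the following (all function names are placeholders for the components described above, not the released code):

```python
def dnoc_caption(image, sm_p, detector, build_memory, query_memory):
    # (i) Generate a caption that may contain "<PL>" placeholders.
    tokens = sm_p.decode(image)                        # e.g. ["a", "<PL>", "is", "standing", ...]

    # (ii) Build the key-value object memory from the detection feature-label pairs {f_i, l_i}.
    det_feats, det_labels = detector(image)
    keys, values = build_memory(det_feats, det_labels)

    # (iii) Replace each placeholder with the word retrieved from the memory.
    filled = []
    for pos, tok in enumerate(tokens):
        if tok == "<PL>":
            query = sm_p.hidden_state_at(pos)          # assumed accessor for the decoder state
            tok = query_memory(keys, values, query)
        filled.append(tok)
    return " ".join(filled)                            # e.g. "a zebra is standing on the grass"
```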

Training

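As a hedged sketch of one natural training recipe for such a decoupled model (an assumption for illustration, not the paper's exact procedure): ground-truth captions can have the object words that the detector already covers replaced with "<PL>" before computing the usual cross-entropy loss, so the SM-P learns to emit the placeholder instead of the object word itself.

```python
def mask_objects(caption_tokens, detector_classes, placeholder="<PL>"):
    """Replace words the detector can recognize with the placeholder token,
    so the sequence model is trained to predict "<PL>" at those positions
    (assumes standard teacher-forced cross-entropy training)."""
    return [placeholder if w in detector_classes else w for w in caption_tokens]

# Example: "dog" is in the detector's class list, so it becomes "<PL>".
print(mask_objects(["a", "dog", "is", "running"], {"dog", "cat", "zebra"}))
# -> ['a', '<PL>', 'is', 'running']
```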

Source code: https://github.com/Yu-Wu/Decoupled-Novel-Object-Captioner

Reference

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, Vol. 16. 265–283.
[2] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2017. Guided open vocabulary image captioning with constrained beam search. In EMNLP.
[3] Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell, Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, et al. 2016. Deep compositional captioning: Describing novel object categories without paired training data. In CVPR.
[4] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL-W. 65–72.
[5] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS. 1171–1179.
[6] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In CVPR. 2625–2634.
[7] Xuanyi Dong, Linchao Zhu, De Zhang, Yi Yang, and Fei Wu. 2018. Fast Parameter Adaptation for Few-shot Image Captioning and Visual Question Answering. In ACM on Multimedia.
[8] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In ECCV. 15–29.
[9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML. 1126–1135.
[10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[11] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. 2017. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR.
[12] Lu Jiang, Shoou-I Yu, Deyu Meng, Teruko Mitamura, and Alexander G Hauptmann. 2015. Bridging the ultimate semantic gap: A semantic search engine for internet videos. In ICMR. 27–34.
[13] Lu Jiang, Shoou-I Yu, Deyu Meng, Yi Yang, Teruko Mitamura, and Alexander G Hauptmann. 2015. Fast and accurate content-based semantic search in 100m internet videos. In ACM on Multimedia. 49–5
[14] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. DenseCap: Fully convolutional localization networks for dense captioning. In CVPR. 4565–4574.
[15] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR. 3128–3137.
[16] Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
[17] Ryan Kiros, Ruslan Salakhutdinov, and Rich Zemel. 2014. Multimodal neural language models. In ICML. 595–603.
[18] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. 2013. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 12 (2013), 2891–2903.
[19] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. 2014. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 3 (2014), 453–465.
[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV. 740–755.
[21] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2018. Neural Baby Talk. In CVPR. 7219–7228.
[22] Junhua Mao, Xu Wei, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan L Yuille. 2015. Learning like a child: Fast novel visual concept learning from sentence descriptions of images. In ICCV. 2533–2541.
[23] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2015. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). ICLR (2015).
[24] George A Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J Miller. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography 3, 4 (1990), 235–244.
[25] Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. 2012. Midge: Generating Image Descriptions From Computer Vision Detections. In EACL. 747–756.
[26] Vicente Ordonez, Girish Kulkarni, and Tamara L Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In NIPS. 1143–1151.
[27] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In ICLR.
[28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS. 91–99.
[29] Marcus Rohrbach, Michael Stark, and Bernt Schiele. 2011. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR. 1641–1648.
[30] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. One-shot learning with memory-augmented neural networks. NIPS-W (2016).
[31] Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.
[32] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI.
[33] Hamed R Tavakoliy, Rakshith Shetty, Ali Borji, and Jorma Laaksonen. 2017. Paying Attention to Descriptions Generated by Image Captioning Models. In ICCV. 2506–2515.
[34] Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2017. Captioning Images with Diverse Objects. In CVPR.
[35] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence - video to text. In ICCV. 4534–4542.
[36] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In NIPS. 3630–3638.
[37] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR. 3156–3164.
[38] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2017. Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 4 (April 2017), 652–663.
[39] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. 2018. Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018), 1–1. https://doi.org/10.1109/TPAMI.2018.2857768
[40] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML. 2048–2057.
[41] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2017. Incorporating copying mechanism in image captioning for learning novel objects. In CVPR. 5263–5271.
[42] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In CVPR. 4651–4659.
[43] Linchao Zhu, Zhongwen Xu, Yi Yang, and Alexander G. Hauptmann. 2017. Uncovering the Temporal Context for Video Question Answering. International Journal of Computer Vision 124, 3 (01 Sep 2017), 409–421. https://doi.org/10.1007/s11263-017-1033-7


