What is the Embedding layer in deep learning good for?


Update (2019-03-24 15:23:32):
Since images hosted on CSDN often fail to load, the latest version of this post is at: https://fuhailin.github.io/Embedding/
All of my blog posts have moved to https://fuhailin.github.io/ ; feel free to bookmark and follow.



This post is a translation of the fourth article in a deep learning series from abroad. To read the other articles, follow the links below. Human translation is work too; if you found it useful, please consider a tip, and please do the same when reposting:

  1. Setting up AWS & Image Recognition
  2. Convolutional Neural Networks
  3. More on CNNs & Handling Overfitting

You run into the Embedding layer all the time in deep learning experiments, yet explanations online are remarkably vague. The Keras Chinese documentation, for instance, says little more than "the Embedding layer turns positive integers (indices) into fixed-size dense vectors" and leaves it at that. So why do we use an Embedding layer at all? There are two main reasons:

  1. Vectors produced by one-hot encoding are high-dimensional and sparse. Suppose we are doing natural language processing (NLP) with a dictionary of 2000 words. With one-hot encoding, every word is represented by a vector of 2000 integers, 1999 of which are zeros. If the dictionary grows any larger, the computational efficiency of this approach falls off sharply. (See the NumPy sketch right after this list.)

  2. Every embedding vector gets updated while the neural network is trained. Looking at the visualization further down in this post, you can see how similar words end up close to each other in the multi-dimensional space, which lets us explore the relationships between words visually. And not just words: anything that can be turned into a vector by an Embedding layer can be explored the same way.
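To make the dimensionality argument in point 1 concrete, here is a minimal NumPy sketch. The vocabulary size comes from the example above; the embedding width and the matrix values are illustrative placeholders, not anything from the original article.

import numpy as np

vocab_size = 2000     # the 2000-word dictionary from point 1
embedding_dim = 32    # a commonly used embedding width (illustrative)

# One-hot: a 2000-dimensional vector with a single 1 and 1999 zeros.
word_index = 7
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Embedding: the same word is simply row 7 of a (2000, 32) matrix (random placeholder values here).
embedding_matrix = np.random.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))
dense_vector = embedding_matrix[word_index]

print(one_hot.shape, dense_vector.shape)   # (2000,) (32,)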

If that still sounds a bit abstract, let's look at an example and see what an embedding layer does to the sentence below. The idea of Embedding comes from word embeddings; if you want to dig deeper, look up word2vec.

“deep learning is very deep”

The first step in using an embedding layer is to encode the sentence with indices: each distinct word gets its own index, so the sentence above becomes:

1 2 3 4 1
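As a minimal sketch of this indexing step (the dictionary-building helper here is illustrative, not part of the original article):

# Build the word -> index mapping and encode the sentence.
sentence = "deep learning is very deep".split()

word_to_index = {}
encoded = []
for word in sentence:
    if word not in word_to_index:
        word_to_index[word] = len(word_to_index) + 1   # indices start at 1, as in the text
    encoded.append(word_to_index[word])

print(encoded)   # [1, 2, 3, 4, 1]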

Next we build the embedding matrix. We decide how many "latent factors" each index gets, which essentially means how long we want each vector to be; common choices are 32 and 50. To keep this post readable, each index here gets 6 latent factors. The embedding matrix then looks like this:

(Figure: the embedding matrix, originally at https://cdn-images-1.medium.com/max/1000/1*Di85w_0UTc6C3ilk5_LEgg.png)

With the embedding matrix we can keep each vector small instead of hauling around huge one-hot vectors. In short, all the embedding layer does here is represent the word "deep" with the vector [.32, .02, .48, .21, .56, .15]. Note that a word is not literally replaced by its vector; it is replaced by an index that is used to look the vector up in the embedding matrix, which also keeps the computation efficient on large datasets. And since the embedding vectors are updated during training of the deep neural network, we can explore which words end up close to each other in the high-dimensional space and visualize those similarities with a dimensionality-reduction technique such as t-SNE.
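A small NumPy sketch of that lookup, using the 6-factor setup from above. Only row 1 ("deep") reuses the vector quoted in the text; every other value is made up for illustration.

import numpy as np

# Illustrative embedding matrix: one 6-factor row per index.
embedding_matrix = np.array([
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00],   # index 0: unused / padding
    [0.32, 0.02, 0.48, 0.21, 0.56, 0.15],   # index 1: "deep" (vector quoted in the text)
    [0.11, 0.40, 0.07, 0.63, 0.29, 0.52],   # index 2: "learning" (made up)
    [0.75, 0.18, 0.34, 0.09, 0.41, 0.27],   # index 3: "is" (made up)
    [0.22, 0.58, 0.13, 0.46, 0.31, 0.68],   # index 4: "very" (made up)
])

encoded_sentence = [1, 2, 3, 4, 1]
# The "embedding layer" is just a row lookup: one 6-dimensional vector per token.
embedded_sentence = embedding_matrix[encoded_sentence]
print(embedded_sentence.shape)   # (5, 6)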

(Figure: visualization of word-embedding similarities, originally at https://cdn-images-1.medium.com/max/1000/1*m8Ahpl-lpVgm16CC-INGuw.png)


Not Just Word Embeddings

These previous examples showed that word embeddings are very important in the world of Natural Language Processing. They allow us to capture relationships in language that are very difficult to capture otherwise. However, embedding layers can be used to embed many more things than just words. In my current research project I am using embedding layers to embed online user behavior. In this case I am assigning indices to user behavior like ‘page view on page type X on portal Y’ or ‘scrolled X pixels’. These indices are then used for constructing a sequence of user behavior.

In a comparison of ‘traditional’ machine learning models (SVM, Random Forest, Gradient Boosted Trees) with deep learning models (deep neural networks, recurrent neural networks) I found that this embedding approach worked very well for deep neural networks.

The ‘traditional’ machine learning models rely on a tabular input that is feature-engineered. This means that we, as researchers, decide what gets turned into a feature. In these cases features could be: number of homepages visited, number of searches done, total number of pixels scrolled. However, it is very difficult to capture the spatial (time) dimension when doing feature engineering. By using deep learning and embedding layers we can efficiently capture this spatial dimension by supplying a sequence of user behavior (as indices) as input for the model.
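As a rough sketch of what such a sequence model might look like in Keras. The event vocabulary size, sequence length, embedding width and binary target are assumptions for illustration, not the author's actual setup.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

n_event_types = 500    # assumed number of distinct behavior indices ('page view on X', 'scrolled Y px', ...)
max_seq_len = 100      # assumed maximum number of events kept per user

model = Sequential()
# Each behavior index becomes a dense 32-dimensional vector; the recurrent layer then reads the events in order.
model.add(Embedding(input_dim=n_event_types, output_dim=32, input_length=max_seq_len))
model.add(LSTM(16))
model.add(Dense(1, activation='sigmoid'))    # e.g. a binary outcome such as conversion
model.compile(loss='binary_crossentropy', optimizer='rmsprop')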

In my research the Recurrent Neural Network with Gated Recurrent Unit/Long Short-Term Memory performed best. The results were very close. Among the ‘traditional’ feature-engineered models, Gradient Boosted Trees performed best. I will write a blog post about this research in more detail in the future. I think my next blog post will explore Recurrent Neural Networks in more detail.

Other research has explored the use of embedding layers to encode student behavior in MOOCs (Piech et al., 2015) and users’ path through an online fashion store (Tamhane et al., 2017).


Recommender Systems

Embedding layers can even be used to deal with the sparse matrix problem in recommender systems. Since the deep learning course (fast.ai) uses recommender systems to introduce embedding layers I want to explore them here as well.

Recommender systems are being used everywhere and you are probably being influenced by them every day. The most common examples are Amazon’s product recommendation and Netflix’s program recommendation systems. Netflix actually held a $1,000,000 challenge to find the best collaborative filtering algorithm for their recommender system. You can see a visualization of one of these models here.

There are two main types of recommender systems and it is important to distinguish between the two.

  1. Content-based filtering: this type of filtering is based on data about the item/product. For example, we have our users fill out a survey on what movies they like. If they say that they like sci-fi movies, we recommend them sci-fi movies. In this case a lot of meta-information has to be available for all items.
  2. Collaborative filtering: let’s find other people like you, see what they liked, and assume you like the same things. People like you = people who rated movies that you watched in a similar way. On a large dataset this has proven to work a lot better than the meta-data approach. Essentially, asking people about their behavior is less informative than looking at their actual behavior. Discussing this further is something for the psychologists among us.

In order to solve this problem we can create a huge matrix of the ratings of all users against all movies. However, in many cases this will create an extremely sparse matrix. Just think of your Netflix account. What percentage of their total supply of series and movies have you watched? It’s probably a pretty small percentage. Then, through gradient descent we can train a neural network to predict how high each user would rate each movie. Let me know if you would like to know more about the use of deep learning in recommender systems and we can explore it further together. In conclusion, embedding layers are amazing and should not be overlooked.
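A minimal sketch of that idea in the Keras 2 functional API, in the spirit of the fast.ai collaborative-filtering example. All sizes are made up; the point is that the predicted rating is the dot product of a user embedding and a movie embedding, trained by gradient descent on the known ratings.

from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dot

n_users, n_movies, n_factors = 1000, 1700, 50   # made-up sizes

user_in = Input(shape=(1,))
movie_in = Input(shape=(1,))

# One latent-factor vector per user and per movie.
user_vec = Flatten()(Embedding(n_users, n_factors)(user_in))
movie_vec = Flatten()(Embedding(n_movies, n_factors)(movie_in))

# The predicted rating is the dot product of the two embedding vectors.
rating = Dot(axes=1)([user_vec, movie_vec])

model = Model([user_in, movie_in], rating)
model.compile(loss='mse', optimizer='adam')
# model.fit([user_ids, movie_ids], ratings, ...)   # trained with gradient descent on the known ratings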

If you liked this post, be sure to recommend it so others can see it. You can also follow this profile to keep up with my progress in the Fast AI course. See you there!

References

Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L. J., & Sohl-Dickstein, J. (2015). Deep knowledge tracing. In Advances in Neural Information Processing Systems (pp. 505–513).

Tamhane, A., Arora, S., & Warrier, D. (2017, May). Modeling Contextual Changes in User Behaviour in Fashion e-Commerce. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 539–550). Springer, Cham.

Embedding layers in Keras

The embedding layer is used at the start of a network to turn your input into vectors, so before using an Embedding layer, first ask whether your data actually needs to be turned into vectors. If you have categorical data, or data that consists only of integers drawn from a fixed set (like a dictionary), an Embedding layer is worth trying.
If your data has several such inputs, you can share a single embedding layer across them or try a separate embedding layer for each input (see the sketch below).
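A minimal sketch of sharing one embedding layer across two integer inputs with the functional API. The layer sizes and the downstream head are assumptions for illustration.

from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Concatenate, Dense

# Two integer-coded categorical inputs sharing one vocabulary (sizes are assumptions).
input_a = Input(shape=(1,))
input_b = Input(shape=(1,))

shared_embedding = Embedding(input_dim=100, output_dim=8)   # one layer object, reused for both inputs

vec_a = Flatten()(shared_embedding(input_a))
vec_b = Flatten()(shared_embedding(input_b))

merged = Concatenate()([vec_a, vec_b])
output = Dense(1, activation='sigmoid')(merged)

model = Model([input_a, input_b], output)
model.compile(loss='binary_crossentropy', optimizer='rmsprop')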

from keras.layers.embeddings import Embedding

Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None)

  • The first value of the Embedding constructor (input_dim) is the range of values in the input, i.e. the size of the vocabulary: 1 + the maximum integer index occurring in the input data. In the example below it is 2 because we feed a binary vector as input.
  • The second value (output_dim) is the target dimension of each embedding vector.
  • The third (input_length) is the length of the sequences we feed in.

Translated from: https://medium.com/towards-data-science/deep-learning-4-embedding-layers-f9a02d55ac12

How does embedding work? An example demonstrates best what is going on.

Assume you have a sparse vector [0,1,0,1,1,0,0] of dimension seven. You can turn it into a non-sparse 2d vector like so:

import numpy as np
from keras.models import Sequential
from keras.layers.embeddings import Embedding

model = Sequential()
model.add(Embedding(2, 2, input_length=7))   # input_dim=2 (values 0 and 1), output_dim=2, sequences of length 7
model.compile('rmsprop', 'mse')
model.predict(np.array([[0,1,0,1,1,0,0]]))
array([[[ 0.03005414, -0.02224021], [ 0.03396987, -0.00576888],
        [ 0.03005414, -0.02224021], [ 0.03396987, -0.00576888],
        [ 0.03396987, -0.00576888], [ 0.03005414, -0.02224021],
        [ 0.03005414, -0.02224021]]], dtype=float32)

Where do these numbers come from? It’s a simple map from the given range to a 2d space:

model.layers[0].W.get_value()   # Keras 1.x API; in newer Keras versions: model.layers[0].get_weights()[0]
array([[ 0.03005414, -0.02224021],
       [ 0.03396987, -0.00576888]], dtype=float32)

The 0-value is mapped to the first index and the 1-value to the second as can be seen by comparing the two arrays. The first value of the Embedding constructor is the range of values in the input. In the example it’s 2 because we give a binary vector as input. The second value is the target dimension. The third is the length of the vectors we give.
So, there is nothing magical in this, merely a mapping from integers to floats.
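In fact you can reproduce the predict() output by hand with plain NumPy indexing on the weights printed above:

import numpy as np

# The layer's weight matrix, as printed above.
W = np.array([[ 0.03005414, -0.02224021],
              [ 0.03396987, -0.00576888]], dtype=np.float32)

inputs = np.array([0, 1, 0, 1, 1, 0, 0])
print(W[inputs])   # the same rows, in the same order, as model.predict() returned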

Now back to our ‘shining’ detection. The training data looks like sequences of bits:

X
array([[ 0., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [ 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [ 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1.]])

If you want to use the embedding it means that the output of the embedding layer will have dimension (5, 19, 10). This works well with LSTM or GRU (see below) but if you want a binary classifier you need to flatten this to (5, 19*10):

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(3, 10, input_length=X.shape[1]))   # X and y are the 'shining' training data from the tutorial
model.add(Flatten())                                   # (5, 19, 10) -> (5, 190)
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
# nb_epoch and show_accuracy are Keras 1.x arguments (epochs / metrics in newer versions)
model.fit(X, y=y, batch_size=200, nb_epoch=700, verbose=0, validation_split=0.2, show_accuracy=True, shuffle=True)

It detects ‘shining’ flawlessly:

model.predict(X)
array([[  1.00000000e+00],
       [  8.39483363e-08],
       [  9.71878720e-08],
       [  7.35597965e-08],
       [  9.91844118e-01]], dtype=float32)

An LSTM layer has historical memory and so the dimension outputted by the embedding works in this case, no need to flatten things:

from keras.layers import LSTM

model = Sequential()
model.add(Embedding(vocab_size, 10))   # vocab_size is defined earlier in the original tutorial
model.add(LSTM(5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(X, y=y, nb_epoch=500, verbose=0, validation_split=0.2, show_accuracy=True, shuffle=True)

Obviously, it predicts things as well:

model.predict(X)
array([[ 0.96855599],
       [ 0.01917232],
       [ 0.01917362],
       [ 0.01917258],
       [ 0.02341695]], dtype=float32)

Translated from: http://www.orbifold.net/default/2017/01/10/embedding-and-tokenizer-in-keras/


