Word Vectors

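Word vectors are a basic building block of natural language processing and support the analysis of text, sentiment, word meaning, and more. The core idea is to map each word to a dense vector so that similar words end up with nearby vectors.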

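1. Word Vector Representations

Word vectors are usually represented in one of two ways: discrete or distributed. The discrete form is commonly called the one-hot representation. Its drawback is that it cannot express any relationship between words; its advantage is that in such a high-dimensional space many tasks become linearly separable.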
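The drawback of one-hot vectors can be seen in a few lines. This is a minimal sketch with a made-up three-word vocabulary; the helper names `one_hot` and `dot` are illustrative only.

```python
def one_hot(word, vocab):
    """Return the one-hot vector of `word` over the ordered list `vocab`."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


vocab = ["cat", "dog", "car"]  # hypothetical toy vocabulary
cat = one_hot("cat", vocab)
dog = one_hot("dog", vocab)

print(cat)            # [1, 0, 0]
print(dot(cat, dog))  # 0: two different words are always orthogonal
print(dot(cat, cat))  # 1
```

Whatever the two words are, their dot product is 0, so the representation carries no notion of "cat" being closer to "dog" than to "car".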

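The distributed form is commonly called the distributed representation. It maps each word to a continuous, fixed-length dense vector. Its advantage is that distances between vectors can express relationships between words, with each dimension encoding some latent feature of the word.

Another difference between the two: with one-hot features, individual dimensions of the feature vector can be pruned, because each dimension corresponds to exactly one word, whereas the dimensions of a distributed vector cannot be dropped in this way.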
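By contrast, dense vectors allow a graded similarity. The sketch below uses hand-picked three-dimensional vectors purely for illustration (real embeddings are learned and much longer) and compares them with cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, chosen by hand for this example only
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine(embeddings["cat"], embeddings["car"])
print(sim_cat_dog > sim_cat_car)  # True
```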

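2. Training Word Vectors

2.1 Statistical Methods

2.1.1 Co-occurrence Matrix

Count how often words co-occur within a window, and use the counts of the words that co-occur around a given word as that word's vector. This matrix partly relieves the problem that any two one-hot vectors have similarity 0, but it does not solve the sparsity and high dimensionality of the data.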
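As a small illustration of the counting step, the sketch below builds a co-occurrence matrix for a two-sentence toy corpus with a window of one word on each side. The corpus and the function name `cooccurrence` are assumptions made for this example.

```python
from collections import Counter, defaultdict


def cooccurrence(sentences, window=1):
    """Map each word to a Counter of the words seen within `window` positions."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sent[j]] += 1
    return counts


corpus = [["i", "like", "nlp"], ["i", "like", "deep", "learning"]]  # toy data
counts = cooccurrence(corpus, window=1)

vocab = sorted(counts)  # ['deep', 'i', 'learning', 'like', 'nlp']
matrix = [[counts[w][c] for c in vocab] for w in vocab]

print(matrix[vocab.index("like")])  # [1, 2, 0, 0, 1]
```

The row for "like" is its word vector: it co-occurs once with "deep", twice with "i", and once with "nlp". Even in this tiny case most entries are zero, which is the sparsity problem mentioned above.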

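2.1.2 Singular Value Decomposition

To address the problems of the co-occurrence matrix, the original word vectors are reduced in dimension to obtain dense, continuous word vectors. Applying SVD to the co-occurrence matrix yields an orthogonal matrix; keeping its leading columns and normalizing the rows gives the word vectors.

The advantage of this method is that it captures, to some degree, semantically similar words and linear relationships between words. However, because many word pairs never co-occur, the matrix is extremely sparse, and word frequencies need extra processing before the results become good. The matrix is also very large and high-dimensional.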
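The reduction step can be sketched with NumPy's `numpy.linalg.svd`. The 5x5 matrix is toy data, and keeping `k = 2` columns scaled by the singular values is one common choice, assumed here rather than taken from the article.

```python
import numpy as np

# Toy co-occurrence matrix (rows/columns: deep, i, learning, like, nlp)
X = np.array([
    [0, 0, 1, 1, 0],
    [0, 0, 0, 2, 0],
    [1, 0, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# X = U @ diag(S) @ Vt, with singular values S sorted in decreasing order
U, S, Vt = np.linalg.svd(X)

k = 2  # assumed target dimension for this sketch
vectors = U[:, :k] * S[:k]  # keep the k leading directions

# Normalize each row to unit length, guarding against all-zero rows
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
vectors = vectors / np.maximum(norms, 1e-12)

print(vectors.shape)  # (5, 2)
```

Each row of `vectors` is now a dense two-dimensional word vector instead of a sparse five-dimensional count row.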

The code for word vectors based on the co-occurrence matrix is as follows:

# Build a word-word co-occurrence matrix and use its rows as word vectors
import collections

# Raw strings keep the Windows backslashes literal
file_path = r"D:\workspace\project\NLPcase\word2vec\data\data.txt"
model_path = r"D:\workspace\project\NLPcase\word2vec\model\skipgram_word2vec.txt"
min_count = 5  # minimum word frequency
word_dimension = 200
window_size = 5  # context window size


def load_data(file_path=file_path):
    # The second comma-separated field of each line holds space-separated tokens
    dataset = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip().split(',')
            # Keep tokens longer than one character and drop 'nbsp' residue
            dataset.append([word for word in line[1].split(' ')
                            if 'nbsp' not in word and len(word) > 1])
    return dataset


dataset = load_data()


# Count word frequencies and keep the words above min_count
def build_word_dict():
    words = []
    for data in dataset:
        words.extend(data)
    reserved_words = [item for item in collections.Counter(words).most_common()
                      if item[1] > min_count]
    word_dict = {item[0]: item[1] for item in reserved_words}
    return word_dict


# Count co-occurrences inside each context window
def build_word2word_dict():
    word2word_dict = {}
    for data in dataset:
        for index in range(len(data)):
            # max() clips the window at the sentence start; slicing clips it at the end
            left = data[max(0, index - window_size):index]
            right = data[index + 1:index + window_size + 1]
            counts = word2word_dict.setdefault(data[index], {})
            for co_word in left + right:
                if co_word != data[index]:
                    counts[co_word] = counts.get(co_word, 0) + 1
    return word2word_dict


# Build the row-normalized co-occurrence matrix
def build_word2word_matrix():
    word2word_dict = build_word2word_dict()
    word_dict = build_word_dict()
    word_list = list(word_dict)  # list() over a dict yields only its keys (the words)
    word2word_matrix = []
    for word1 in word_list:
        row = word2word_dict[word1]
        sumtf = sum(row.values())
        # weight = count of (word1, word2) divided by word1's total co-occurrences
        word2word_matrix.append([row.get(word2, 0) / sumtf if sumtf else 0.0
                                 for word2 in word_list])
    return word2word_matrix

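2.2 Language-Model-Based Methods

In language-model approaches, word vectors are a by-product of training a neural network, typically a three-layer network with an input layer, a hidden layer, and an output layer. The best-known method is word2vec, which comes in two variants: CBOW and skip-gram.

word2vec has two refinements: one based on hierarchical softmax, the other based on negative sampling.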

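The first optimization word2vec adopted is a Huffman tree that replaces the neurons between the hidden layer and the output layer. The problem with the plain network is that the softmax between the hidden and output layers is expensive, because the probability of every word in the vocabulary has to be computed before the most probable one can be found; the Huffman tree addresses this. The leaves of the Huffman tree play the role of the output neurons. Once the tree is built, the leaves are encoded: leaves with high weight sit close to the root and leaves with low weight sit far from it, so high-weight leaves get short codes and low-weight leaves get long ones. This matches information theory: the more frequent a word, the shorter its code. At every internal node, the choice between the left and the right child is modeled by a sigmoid function, so the task becomes solving for the parameters of hierarchical softmax by computing gradients and updating them.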
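The claim that frequent words get shorter codes can be checked with a small Huffman construction. This is a generic sketch built on the standard-library `heapq`, not word2vec's own implementation; the word frequencies are invented.

```python
import heapq
import itertools


def huffman_codes(freqs):
    """Return {word: binary code string} for a {word: frequency} mapping."""
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; their words go one level deeper
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + code for w, code in left.items()}
        merged.update({w: "1" + code for w, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


freqs = {"the": 50, "of": 30, "cat": 12, "zebra": 3, "quark": 1}  # toy counts
codes = huffman_codes(freqs)

for word in freqs:
    print(word, codes[word])
```

Here "the" receives a 1-bit code while "zebra" and "quark" receive 4-bit codes. In hierarchical softmax the code length is the number of sigmoid decisions on the path from the root to the word, so frequent words are cheap to evaluate and rare words are expensive.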

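The negative-sampling approach to word2vec drops the Huffman tree, because when the center word of a training sample is rare, the path to it through the Huffman tree is long. Consider one training sample whose center word is w and whose surrounding context consists of 2c words, written context(w). Since w really does occur with context(w), this pair is a true positive example. Negative sampling then draws neg center words w_i, i = 1, 2, …, neg, each different from w; context(w) paired with each w_i forms a negative example. Using the one positive example and the neg negative examples, we fit a binary logistic regression, which yields the model parameters theta for each word w_i as well as the word vector of each word.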
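The binary logistic regression described above can be written as a loss over one positive and neg negative examples. The sketch below assumes the context is summarized by a single vector (for CBOW, the mean of the context word vectors) and uses the usual log-loss; the function names and numbers are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(context_vec, theta_pos, thetas_neg):
    """-log sigmoid(x . theta_w) - sum_i log sigmoid(-x . theta_wi)."""
    loss = -math.log(sigmoid(dot(context_vec, theta_pos)))
    for theta in thetas_neg:
        loss -= math.log(sigmoid(-dot(context_vec, theta)))
    return loss


x = [0.5, -0.2]                 # hypothetical context vector
opposite = [-a for a in x]

# Parameters that agree with the labels give a lower loss than ones that disagree
good = negative_sampling_loss(x, theta_pos=x, thetas_neg=[opposite])
bad = negative_sampling_loss(x, theta_pos=opposite, thetas_neg=[x])
print(good < bad)  # True
```

Only the parameters of the 1 + neg sampled words appear in the loss, so a gradient step touches those rows alone rather than the whole vocabulary.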

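A brief summary of negative sampling:

Again suppose the vocabulary has 10,000 words and the word vectors have 300 dimensions. Each layer of the network then has 3 million parameters, and the output layer amounts to a multi-class problem with 10,000 possible classes. The amount of computation is clearly enormous. The idea of sampling is strikingly simple: the network's softmax output is a vector in which only one entry corresponds to the correct word, and all the others are called negative samples. If we keep only 5 negative samples, the output is just a 6-dimensional vector, and the number of parameters to consider drops from 3 million to 1,800. This looks like cutting corners, yet it works well in practice and greatly speeds up computation.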
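The arithmetic behind these figures, spelled out with the numbers from the paragraph above (the variable names are only for this example):

```python
vocab_size = 10_000   # words in the vocabulary
dim = 300             # word-vector dimension
num_negative = 5      # negative samples kept per training example

full_output_params = vocab_size * dim          # every output word has a 300-d weight row
sampled_params = (1 + num_negative) * dim      # 1 positive word + 5 negatives

print(full_output_params)                      # 3000000
print(sampled_params)                          # 1800
print(sampled_params / full_output_params)     # 0.0006
```

Each update therefore touches about 0.06% of the output-layer weights.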

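2.2.1 CBOW (Continuous Bag-of-Words Model)

This model predicts the probability of the current word given its context. The context is selected with a sliding window. The code below trains CBOW word vectors in TensorFlow using negative sampling: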

# Continuous bag-of-words model: predict the current word from its context
# Written against the TensorFlow 1.x graph API; under TensorFlow 2.x use
# "import tensorflow.compat.v1 as tf" and call "tf.disable_v2_behavior()" instead.
import math
import numpy as np
import tensorflow as tf
import collections

# Raw strings keep the Windows backslashes literal
file_path = r"D:\workspace\project\NLPcase\word2vec\data\data.txt"
model_path = r"D:\workspace\project\NLPcase\word2vec\model\skipgram_word2vec.txt"
min_count = 5  # minimum word frequency
batch_size = 200  # examples per training step
embedding_size = 200  # dimension of the word vectors
window_size = 5  # context window size
num_sampled = 100  # number of negative samples
num_steps = 10000  # maximum number of training steps


def load_data(file_path=file_path):
    # The second comma-separated field of each line holds space-separated tokens
    dataset = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip().split(',')
            # Keep tokens longer than one character and drop 'nbsp' residue
            dataset.append([word for word in line[1].split(' ')
                            if 'nbsp' not in word and len(word) > 1])
    return dataset


dataset = load_data()


# Flatten the dataset into one list of words
def read_data(dataset):
    words = []
    for data in dataset:
        words.extend(data)
    return words


# Build the vocabulary and convert the words to ids
def build_dataset(words, min_count):
    count = [['unk', -1]]
    reserved_words = [item for item in collections.Counter(words).most_common()
                      if item[1] > min_count]
    count.extend(reserved_words)
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # id 0 is reserved for unknown words
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary


# Generate training batches
data_index = 0


def generate_batch(batch_size, window_size, data):
    # data is a list of word ids
    global data_index
    span = 2 * window_size + 1
    batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)  # wrap around at the end of data
    for i in range(batch_size):
        col_idx = 0
        for j in range(span):
            if j == span // 2:
                continue  # skip the center word; it is the label
            batch[i, col_idx] = buffer[j]
            col_idx += 1
        labels[i, 0] = buffer[window_size]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels


# Train the model
def train_word2vec(vocabulary_size, batch_size, embedding_size, window_size,
                   num_sampled, num_steps, data):
    graph = tf.Graph()
    with graph.as_default(), tf.device('/cpu:0'):
        train_dataset = tf.placeholder(tf.int32, shape=[batch_size, 2 * window_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        embedding = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        # Unlike skip-gram, the CBOW input is the mean of the context vectors.
        # Looking up train_dataset in one call and averaging over axis 1 is equivalent.
        context_embedding = []
        for i in range(2 * window_size):
            # look up each context column, then average across the columns
            context_embedding.append(tf.nn.embedding_lookup(embedding, train_dataset[:, i]))
        ave_embed = tf.reduce_mean(tf.stack(axis=0, values=context_embedding), 0)
        softmax_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
        # Loss: sampled softmax over num_sampled negative classes
        loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
            weights=softmax_weights, biases=softmax_biases, inputs=ave_embed,
            labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))
        # Adagrad replaces the original Adam here: a learning rate of 1.0 is far too large for Adam
        opt = tf.train.AdagradOptimizer(1.0).minimize(loss)
        # L2 norm of each word vector, used for normalization
        norm = tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keepdims=True))
        normalized_embeddings = embedding / norm
        init = tf.global_variables_initializer()
    with tf.Session(graph=graph) as session:
        init.run()
        average_loss = 0
        for step in range(num_steps):
            batch_data, batch_labels = generate_batch(batch_size, window_size, data)
            feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
            _, l = session.run([opt, loss], feed_dict=feed_dict)
            average_loss += l
            if step % 200 == 0:
                if step > 0:
                    average_loss = average_loss / 200
                print('average loss at step', step, ':', average_loss)
                average_loss = 0
        final_embedding = normalized_embeddings.eval()
    return final_embedding

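2.2.2 skip-gram (Skip-Gram Model)

The principle is largely the same as CBOW, except that the input is the center word and the output is the surrounding words.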
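The difference between the two models comes down to how a window is turned into training examples. The sketch below is a simplified illustration in plain Python; the helper names are made up, and unlike a batch generator that always uses full windows, it simply truncates the window at sentence boundaries.

```python
def window_examples(tokens, window_size):
    """Yield (center, context_words) for every position in `tokens`."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        yield center, left + right


def cbow_examples(tokens, window_size):
    # CBOW: the whole context is the input, the center word is the label
    return [(context, center)
            for center, context in window_examples(tokens, window_size)]


def skipgram_pairs(tokens, window_size):
    # skip-gram: the center word is the input, each context word is one label
    return [(center, word)
            for center, context in window_examples(tokens, window_size)
            for word in context]


tokens = ["the", "cat", "sat", "down"]  # toy sentence
print(cbow_examples(tokens, 1)[1])    # (['the', 'sat'], 'cat')
print(skipgram_pairs(tokens, 1)[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

CBOW produces one example per position, while skip-gram produces one example per (center, context word) pair, which is why its batches hold single word ids rather than whole windows.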

The code below trains skip-gram word vectors in TensorFlow using negative sampling:

# Train word vectors with skip-gram: predict the context from the current word
# Written against the TensorFlow 1.x graph API; under TensorFlow 2.x use
# "import tensorflow.compat.v1 as tf" and call "tf.disable_v2_behavior()" instead.
import collections
import math
import random
import numpy as np
import tensorflow as tf

# Raw strings keep the Windows backslashes literal
file_path = r"D:\workspace\project\NLPcase\word2vec\data\data.txt"
model_path = r"D:\workspace\project\NLPcase\word2vec\model\skipgram_word2vec.txt"
min_count = 5  # minimum word frequency
batch_size = 200  # examples per training step
embedding_size = 200  # dimension of the word vectors
window_size = 5  # context window size
num_sampled = 100  # number of negative samples
num_steps = 10000  # maximum number of training steps


def load_data(file_path=file_path):
    # The second comma-separated field of each line holds space-separated tokens
    dataset = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip().split(',')
            # Keep tokens longer than one character and drop 'nbsp' residue
            dataset.append([word for word in line[1].split(' ')
                            if 'nbsp' not in word and len(word) > 1])
    return dataset


dataset = load_data()


# Flatten the dataset into one list of words
def read_data(dataset):
    words = []
    for data in dataset:
        words.extend(data)
    return words


# Build the vocabulary and convert the words to ids
def build_dataset(words, min_count):
    # Filter out low-frequency words and number the rest by decreasing frequency
    count = [['UNK', -1]]  # counts the words that are filtered out or unseen
    count.extend([item for item in collections.Counter(words).most_common()
                  if item[1] > min_count])
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)  # assign the next id
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))  # id -> word
    return data, dictionary, reverse_dictionary


# Generate training batches
data_index = 0


def generate_batch(batch_size, window_size, data):
    # data is a list of word ids.
    # Every center word yields 2*window_size (center, context) pairs, so batch_size
    # must be a multiple of 2*window_size. For example, with window_size=1 and
    # batch_size=8, four consecutive center words each appear twice in batch,
    # paired with their left and right neighbors in random order.
    global data_index
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)  # center words
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)  # one context word each
    span = 2 * window_size + 1  # full window size
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // (window_size * 2)):
        target = window_size  # position of the center word
        target2avoid = [window_size]  # the center word is excluded first
        for j in range(window_size * 2):
            # draw a context position that has not been used yet
            while target in target2avoid:
                target = random.randint(0, span - 1)
            target2avoid.append(target)
            batch[i * window_size * 2 + j] = buffer[window_size]
            labels[i * window_size * 2 + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels


# Build the network and train it
def train_wordvec(vocabulary_size, batch_size, embedding_size, window_size,
                  num_sampled, num_steps, data):
    graph = tf.Graph()
    with graph.as_default():
        # Input data
        train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        # Train on the CPU
        with tf.device('/cpu:0'):
            # Initialize the embedding matrix
            embedding = tf.Variable(
                tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
            # Look up the embeddings of the input words
            embed = tf.nn.embedding_lookup(embedding, train_inputs)
            # Output-layer parameters
            nce_weights = tf.Variable(
                tf.truncated_normal([vocabulary_size, embedding_size],
                                    stddev=1.0 / math.sqrt(embedding_size)))
            nce_bias = tf.Variable(tf.zeros([vocabulary_size]))
            # NCE loss, a sampled objective closely related to negative sampling
            loss = tf.reduce_mean(tf.nn.nce_loss(
                weights=nce_weights, biases=nce_bias, labels=train_labels,
                inputs=embed, num_sampled=num_sampled, num_classes=vocabulary_size))
        # Optimizer
        opt = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
        # L2 norm of each word vector, used for normalization
        norm = tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keepdims=True))
        normalized = embedding / norm
        # Initializer for the model variables
        init = tf.global_variables_initializer()
    # Train on the graph built above
    with tf.Session(graph=graph) as session:
        init.run()
        average_loss = 0
        for step in range(num_steps):
            batch_inputs, batch_labels = generate_batch(batch_size, window_size, data)
            feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
            # loss_val keeps the loss tensor from being overwritten by its value
            _, loss_val = session.run([opt, loss], feed_dict=feed_dict)
            average_loss += loss_val
            # Print the average loss every 200 steps
            if step % 200 == 0:
                if step > 0:
                    average_loss /= 200
                print('average loss at step', step, ":", average_loss)
                average_loss = 0
        final_embedding = normalized.eval()
    return final_embedding
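Because the returned embedding rows are normalized to unit length, the dot product of two rows is their cosine similarity, which makes nearest-neighbor lookup straightforward. The sketch below is a hypothetical usage example in plain Python: the tiny `embeddings` table and the `nearest_words` helper are invented, and only the `dictionary` / `reverse_dictionary` layout mirrors what `build_dataset` returns.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]


def nearest_words(word, embeddings, dictionary, reverse_dictionary, top_k=2):
    """Return the top_k words whose unit vectors have the largest dot product with `word`."""
    query_id = dictionary[word]
    query = embeddings[query_id]
    scores = []
    for idx, row in enumerate(embeddings):
        if idx != query_id:
            similarity = sum(a * b for a, b in zip(query, row))
            scores.append((similarity, reverse_dictionary[idx]))
    scores.sort(reverse=True)
    return [w for _, w in scores[:top_k]]


# Hypothetical vocabulary and embedding rows, for illustration only
dictionary = {"UNK": 0, "cat": 1, "dog": 2, "car": 3}
reverse_dictionary = {idx: w for w, idx in dictionary.items()}
embeddings = [normalize(v) for v in (
    [-0.5, 0.1, -0.3],  # UNK
    [0.9, 0.8, 0.1],    # cat
    [0.8, 0.9, 0.2],    # dog
    [0.1, 0.2, 0.9],    # car
)]

print(nearest_words("cat", embeddings, dictionary, reverse_dictionary))  # ['dog', 'car']
```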

References:

https://blog.csdn.net/mawenqi0729/article/details/80698350

http://www.cnblogs.com/pinard/p/7160330.html

https://blog.csdn.net/u014595019/article/details/54093161

http://www.cnblogs.com/pinard/p/7249903.html

https://blog.csdn.net/rxt2012kc/article/details/71123052

https://blog.csdn.net/leadai/article/details/80249999

https://github.com/liuhuanyong/Word2Vector