gensim---LDA---perplexity

The following is taken from https://blog.csdn.net/qq_25073545/article/details/79773807
Implementing LDA with gensim and computing perplexity (gensim Perplexity Estimates in LDA Model)
Neither. The values coming out of bound() depend on the number of topics (as well as the number of words), so they are not comparable across different num_topics (or different test corpora).
The opposite: a smaller bound value implies deterioration. For example, a bound of -6000 is "better" than -7000 (bigger is better).
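As a small illustration of "bigger is better", the sketch below converts a total bound into a per-word perplexity as 2^(-bound / N), the same conversion the script later in this post uses. The helper name and the token count of 1000 are assumptions made up for the example, and the comparison only makes sense for one fixed test corpus and one fixed num_topics.

```python
def per_word_perplexity(bound, num_tokens):
    """Convert a total likelihood bound into a per-word perplexity, 2^(-bound / N)."""
    return 2.0 ** (-bound / num_tokens)


num_tokens = 1000  # hypothetical token count of one fixed test corpus

better = per_word_perplexity(-6000.0, num_tokens)
worse = per_word_perplexity(-7000.0, num_tokens)

# The larger (less negative) bound gives the lower perplexity.
print(better, worse)
```

Because the bound is negated in the exponent, a bound closer to zero always maps to a lower per-word perplexity when N is held fixed.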

Evaluating LDA with the log_perplexity method
Code example 1

from gensim.models import LdaModel
from gensim.corpora import Dictionary
import numpy as np

docs = [["a", "a", "b"], ["a", "c", "g"], ["c"], ["a", "c", "g"]]
dct = Dictionary(docs)
corpus = [dct.doc2bow(_) for _ in docs]
c_train, c_test = corpus[:2], corpus[2:]

ldamodel = LdaModel(corpus=c_train, num_topics=2, id2word=dct)
# log_perplexity returns the per-word likelihood bound, not the perplexity itself
per_word_bound = ldamodel.log_perplexity(c_test)
print(per_word_bound)
# gensim reports the perplexity as 2^(-bound)
print(np.exp2(-per_word_bound))

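Corpus: 59,000 documents
Unique tokens: 500,000
I will estimate the final model in R to take advantage of its visualization tools for interpreting my results, but first I need to choose the number of topics for my model. Since I have no intuition about how many topics the latent structure contains, I want to estimate a series of models with k = 20, 25, 30, … topics and compute the perplexity of each one to determine the optimal number of topics, as recommended in Blei (2003). The only packages I know of for estimating LDA (lda and topicmodels) use batch LDA, and whenever I estimate a model with more than 70 topics I run out of memory (this is on a supercomputing cluster with up to 96 GB of RAM per processor). I figured I could use gensim to estimate the series of models with online LDA, which is far less memory-intensive, compute the perplexity on a held-out sample of documents, choose the number of topics based on those results, and then estimate the final model with batch LDA in R.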
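The steps are as follows:
1. Generate the corpus from a series of text files in R, and export the document-term matrix in MM format together with the dictionary.
2. Import the corpus and dictionary in Python.
3. Split the corpus into training and test sets.
4. Estimate the LDA model on the training data.
5. Compute the bound and the per-word perplexity on the test data.
My understanding is that perplexity always decreases as the number of topics increases, so the optimal number of topics should be the point where the marginal change in perplexity becomes small. However, whenever I estimate a series of models, the perplexity actually increases with the number of topics. The perplexity values for k = 20/25/30/35/40 are: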

Perplexity (20 topics):  -44138604.0036
Per-word Perplexity:  542.513884961
Perplexity (25 topics):  -44834368.1148
Per-word Perplexity:  599.120014719
Perplexity (30 topics):  -45627143.4341
Per-word Perplexity:  670.851965367
Perplexity (35 topics):  -46457210.907
Per-word Perplexity:  755.178877447
Perplexity (40 topics):  -47294658.5467
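Per-word Perplexity:  851.001209258

Possible problems I have thought of so far:
1. Was the model not run long enough to converge properly? I set chunksize to 1000, so there should be 40-50 passes, and by the last chunk I see 980+/1000 documents converging within 50 iterations.
2. Am I misunderstanding what the lda.bound function estimates?
3. Do I need to trim the dictionary further? I have already removed all tokens below the median TF-IDF score, so the original dictionary has been cut in half.
Is the problem that I built the dictionary and corpus with R? In a text editor, I compared the dictionary and MM corpus files generated from R with a smaller test dictionary/corpus built with gensim, and I could not see any difference in how the information is encoded. I want to build the corpus in R so that I am sure online LDA uses exactly the same corpus as the final model I will fit in R, and I do not know how to convert a gensim corpus into an R document-term matrix object.
The script I am using is: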
import logging
import random
import time

import numpy
import scipy
import gensim

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
random.seed(11091987)           # set random seed

# load id->word mapping (the dictionary)
id2word = gensim.corpora.Dictionary.load_from_text('../dict.dict')

# load corpus
## add top line to MM file since R does not automatically add this
## and save new version
with open('../dtm.mtx') as f:
    dtm = f.read()
dtm = "%%MatrixMarket matrix coordinate real general\n" + dtm
with open('dtm.mtx', 'w+') as f:
    f.write(dtm)
corpus = gensim.corpora.MmCorpus('dtm.mtx')
print(id2word)
print(corpus)

# shuffle corpus
cp = list(corpus)
random.shuffle(cp)
# split into 80% training and 20% test sets
p = int(len(cp) * .8)
cp_train = cp[0:p]
cp_test = cp[p:]

start_time = time.time()
lda = gensim.models.ldamodel.LdaModel(corpus=cp_train, id2word=id2word, num_topics=25,
                                      update_every=1, chunksize=1000, passes=2)
elapsed = time.time() - start_time
print('Elapsed time:', elapsed)
# current gensim names these arguments num_topics/num_words (formerly topics/topn)
print(lda.show_topics(num_topics=-1, num_words=10, formatted=True))

# despite the label, lda.bound returns the total likelihood bound of the test set
perplex = lda.bound(cp_test)
print('Perplexity:', perplex)
# per-word perplexity = 2^(-bound / total number of tokens in the test set)
print('Per-word Perplexity:',
      numpy.exp2(-perplex / sum(cnt for document in cp_test for _, cnt in document)))
elapsed = time.time() - start_time
print('Elapsed time:', elapsed)
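The shuffle, the 80/20 split and the token-count denominator from the script above can be tried out without gensim on a toy bag-of-words corpus. This is only a sketch: the helper names split_corpus and count_tokens, the seed, and the toy corpus are all made up for illustration.

```python
import random


def split_corpus(corpus, train_fraction=0.8, seed=0):
    """Shuffle a copy of the corpus and split it into train/test parts."""
    docs = list(corpus)
    random.Random(seed).shuffle(docs)  # local RNG, leaves global state alone
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]


def count_tokens(corpus):
    """Total number of word tokens (with repetition) in a bag-of-words corpus."""
    return sum(cnt for doc in corpus for _, cnt in doc)


# Hypothetical corpus: each document is a list of (word_id, count) pairs.
toy_corpus = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 1), (3, 1)],
    [(2, 1)],
    [(0, 1), (2, 1), (3, 1)],
    [(1, 4)],
]

train, test = split_corpus(toy_corpus)
print(len(train), len(test), count_tokens(test))
```

The token count of the held-out part is the N that the bound is divided by, so it should be computed on the test documents only.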

The following is taken from
https://blog.csdn.net/zilong10_24/article/details/79858702
https://blog.csdn.net/jiaqiang_ruan/article/details/77989459?locationNum=2&fps=1

1. Perplexity of LDA topic models

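Perplexity is an information-theoretic measure. The perplexity of b is defined as the exponential of the entropy of b (b can be a probability distribution or a probabilistic model), and it is commonly used to compare probabilistic models.
Wikipedia lists three ways of computing perplexity:
① Perplexity of a probability distribution
Formula:
$$PP(p) = 2^{H(p)} = 2^{-\sum_x p(x)\log_2 p(x)}$$
where H(p) is the entropy of the distribution. When p is the uniform distribution over K outcomes, substituting into the formula gives a perplexity of K.
A special case is the distribution of a fair k-sided die, whose perplexity is exactly k. A random variable with perplexity k carries as much uncertainty as a fair k-sided die, and one is said to be "k-ways perplexed" about its value. (Among the distributions of discrete random variables on a finite sample space, the uniform distribution has the maximum entropy.)
Perplexity is sometimes also used to measure how hard a prediction problem is, but this is not always accurate. For example, in the distribution B(1, P=0.9), the value 1 occurs with probability 0.9 and the value 0 with probability 0.1. Its perplexity is:
$$2^{-0.9\log_2 0.9 - 0.1\log_2 0.1} \approx 1.38$$
The natural strategy for predicting the next sample is to predict the value 1, which is correct with probability 0.9. The inverse of the perplexity, however, is 1/1.38 = 0.72, not 0.9. (For the uniform distribution over a k-sided die, the perplexity is k and its inverse 1/k is exactly the probability of predicting correctly.)
Perplexity is the exponential of the information entropy.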
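The two examples above (the fair die and the B(1, P=0.9) distribution) can be checked numerically. This is a minimal sketch using base 2 throughout; the function name perplexity is just a label chosen for this example.

```python
import math


def perplexity(probs):
    """Perplexity of a discrete distribution: 2 raised to its entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy


die = perplexity([1 / 6] * 6)   # fair six-sided die
coin = perplexity([0.9, 0.1])   # B(1, P=0.9)

print(die)       # close to 6, up to floating-point rounding
print(coin)      # about 1.38
print(1 / coin)  # about 0.72, not the 0.9 accuracy of always guessing 1
```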
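② Perplexity of a probability model
When a probability model q is used to estimate the true distribution p, the perplexity of the model can be defined through the samples in a test set.
Formula:
$$b^{-\frac{1}{N}\sum_{i=1}^{N}\log_b q(x_i)}$$
where the test samples x1, x2, …, xN are observations drawn from the true distribution p, and b is usually 2. The x_i can be sentences or texts, and N is the size of the test set (used for normalization). A low perplexity means that q fits p well: the model is less "perplexed" when it sees the test samples. For an unknown distribution, the smaller the perplexity of q, the better the model.
The exponent can also be computed as a cross-entropy:
$$H(\hat{p}, q) = -\sum_x \hat{p}(x)\log_2 q(x)$$
where p^ denotes our estimate of the probability of x under the true distribution, for example p^(x) = n/N when x appears n times among the N test samples.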
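To see that the test-set definition and the cross-entropy form agree, here is a small sketch. The model q and the four test samples are invented for illustration, and p^(x) is taken to be the empirical frequency n/N.

```python
import math
from collections import Counter


def model_perplexity(q, samples, base=2.0):
    """b ** ( -(1/N) * sum_i log_b q(x_i) ) over the test samples."""
    n = len(samples)
    avg_log = sum(math.log(q[x], base) for x in samples) / n
    return base ** (-avg_log)


def cross_entropy(q, samples, base=2.0):
    """H(p_hat, q) with p_hat(x) = n_x / N estimated from the test samples."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(q[x], base) for x, c in counts.items())


# Hypothetical model q and test samples.
q = {"a": 0.5, "b": 0.25, "c": 0.25}
samples = ["a", "a", "b", "c"]

pp = model_perplexity(q, samples)
h = cross_entropy(q, samples)
print(pp, 2.0 ** h)  # the two values agree up to rounding
```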
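③ Per-word perplexity
Perplexity is often used to evaluate language models; its physical meaning is the encoding size of the words. For example, if a language model has a perplexity of 2^190 on a test sentence, encoding that sentence takes 190 bits.
In natural language processing, perplexity is a way of measuring the quality of a probabilistic language model. A language model can be viewed as a probability distribution over whole sentences or passages. (Translator's note: for example, each token position has a probability distribution giving the probability of each word appearing at that position; or each sentence position has a probability distribution giving the probability of every possible sentence appearing at that position.)

For example, the entropy of the distribution at sentence position i might be 190 bits, that is, the sentences appearing at position i take 190 bits on average to encode, so the perplexity of the distribution at that position is 2^(190). (Translator's note: this is the uncertainty of rolling a die with 2^(190) faces.) Since sentences have different lengths, we usually compute the perplexity per token instead. For example, if a test set contains 1000 words and each word can be encoded with 7.95 bits, we say the model has a perplexity of 2^(7.95) = 247 per word, which is the uncertainty of rolling a 247-sided die at every word position. On a domain-specific corpus, a lower perplexity can often be obtained.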
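The normalization by length can be sketched as follows: add up the bits over all sentences, divide by the total number of words, and exponentiate. The three sentences and their bit counts are hypothetical numbers picked so the arithmetic is easy to follow.

```python
def perplexity_from_bits(sentence_bits, sentence_lengths):
    """Per-word perplexity: 2 ** (total bits / total number of words)."""
    total_bits = sum(sentence_bits)
    total_words = sum(sentence_lengths)
    return 2.0 ** (total_bits / total_words)


# Hypothetical test set: bits needed to encode each sentence, and its length.
bits = [40.0, 80.0, 120.0]
lengths = [5, 10, 15]

pp = perplexity_from_bits(bits, lengths)  # 240 bits / 30 words = 8 bits per word
print(pp)

# The figure from the text: 7.95 bits per word.
print(round(2.0 ** 7.95))
```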

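2. The perplexity formula

$$perplexity = \exp\left(-\frac{\sum_{w}\log p(w)}{N}\right)$$
where p(w) is the probability of each word occurring in the test set. For an LDA model this is p(w) = ∑z p(z|d)*p(w|z), where z and d are the trained topics and the documents of the test set, respectively. N in the denominator is the number of all words occurring in the test set, that is, the total length of the test set (its corpus word count), counted with repetition.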
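The formula can be worked through on a toy example. In this sketch the topic-word probabilities p(w|z), the document-topic probabilities p(z|d) and the two test documents are all invented; in practice they would come from a trained model. Each word's log-probability is weighted by its count so that the numerator matches the with-repetition token count N in the denominator.

```python
import math

# Hypothetical p(w|z) for 2 topics over a 3-word vocabulary.
p_w_given_z = [
    {"apple": 0.7, "bank": 0.2, "loan": 0.1},
    {"apple": 0.1, "bank": 0.4, "loan": 0.5},
]
# Hypothetical p(z|d) for each test document.
p_z_given_d = [[0.9, 0.1], [0.2, 0.8]]
# Test documents as word -> count.
test_docs = [{"apple": 2, "bank": 1}, {"loan": 3, "bank": 1}]


def lda_perplexity(docs, doc_topics, topic_words):
    """exp( -sum_w log p(w) / N ) with p(w) = sum_z p(z|d) * p(w|z)."""
    log_lik = 0.0
    n_tokens = 0
    for doc, theta in zip(docs, doc_topics):
        for word, count in doc.items():
            p_w = sum(p_z * phi[word] for p_z, phi in zip(theta, topic_words))
            log_lik += count * math.log(p_w)  # every occurrence contributes
            n_tokens += count
    return math.exp(-log_lik / n_tokens)


pp = lda_perplexity(test_docs, p_z_given_d, p_w_given_z)
print(pp)
```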

3. Code for computing perplexity

# -*- coding: utf-8 -*-
import math
import os
import logging

from gensim import corpora, models

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s : ', level=logging.INFO)


def perplexity(ldamodel, testset, dictionary, size_dictionary, num_topics):
    """calculate the perplexity of a lda-model"""
    # dictionary : {7822:'deferment', 1841:'circuitry',19202:'fabianism'...}
    print('the info of this ldamodel: \n')
    print('num of testset: %s; size_dictionary: %s; num of topics: %s'
          % (len(testset), size_dictionary, num_topics))
    prob_doc_sum = 0.0
    # store the probability of topic-word: [{'business': 0.0100..., 'family': 0.0088..., ...}, ...]
    topic_word_list = []
    for topic_id in range(num_topics):
        topic_word = ldamodel.show_topic(topic_id, size_dictionary)
        dic = {}
        for word, probability in topic_word:
            dic[word] = probability
        topic_word_list.append(dic)
    # store the doc-topic tuples: [(0, 0.0006211180124223594), (1, 0.0006211180124223594), ...]
    doc_topics_list = []
    for doc in testset:
        doc_topics_list.append(ldamodel.get_document_topics(doc, minimum_probability=0))
    testset_word_num = 0
    for i in range(len(testset)):
        prob_doc = 0.0  # the log-probability of the doc
        doc = testset[i]
        doc_word_num = 0  # the num of words in the doc
        for word_id, num in doc:
            prob_word = 0.0  # the probability of the word
            doc_word_num += num
            word = dictionary[word_id]
            for topic_id in range(num_topics):
                # cal p(w) : p(w) = sumz(p(z)*p(w|z))
                prob_topic = doc_topics_list[i][topic_id][1]
                prob_topic_word = topic_word_list[topic_id][word]
                prob_word += prob_topic * prob_topic_word
            # p(d) = sum(log(p(w))); weighted by the word count so that it matches
            # the with-repetition token count in the denominator (the original added
            # log(p(w)) once per distinct word)
            prob_doc += num * math.log(prob_word)
        prob_doc_sum += prob_doc
        testset_word_num += doc_word_num
    prep = math.exp(-prob_doc_sum / testset_word_num)  # perplexity = exp(-sum(p(d)) / sum(Nd))
    print("the perplexity of this ldamodel is : %s" % prep)
    return prep


if __name__ == '__main__':
    # os.sep automatically uses the path separator of the current platform
    middatafolder = r'E:\work\lda' + os.sep
    dictionary_path = middatafolder + 'dictionary.dictionary'
    corpus_path = middatafolder + 'corpus.mm'
    ldamodel_path = middatafolder + 'lda.model'
    dictionary = corpora.Dictionary.load(dictionary_path)
    corpus = corpora.MmCorpus(corpus_path)
    lda_multi = models.ldamodel.LdaModel.load(ldamodel_path)
    num_topics = 50
    testset = []
    # sample 1/300
    for i in range(corpus.num_docs // 300):
        testset.append(corpus[i * 300])
    prep = perplexity(lda_multi, testset, dictionary, len(dictionary.keys()), num_topics)