一. 原理介绍

在研究生实习时候就做过语言模型的任务，当时让求PPL值，当时只是调包，不求甚解，哈哈哈，当时也没想到现在会开发这个评价指标，那现在我来讲一下我对这个指标的了解，望各位大佬多多指教。

1. 这个困惑度是如何发展来的呢？

在得到不同的语言模型（一元语言模型、二元语言模型....）的时候，我们如何判断一个语言模型是否好还是坏，一般有两种方法：

一种方法将其应用到具体的问题当中，比如机器翻译、speech recognition、spelling corrector等。然后看这个语言模型在这些任务中的表现（extrinsic evaluation，or in-vivo evaluation）。但是，这种方法一方面难以操作，另一方面可能非常耗时，可能跑一个evaluation需要大量时间，费时难操作。
针对第一种方法的缺点，大家想是否可以根据与语言模型自身的一些特性，来设计一种简单易行，而又行之有效的评测指标。于是，人们就发明了perplexity这个指标。

简单说一下语言模型：语言模型（Language Model，LM），给出一句话的前k个词，希望它可以预测第k+1个词是什么，即给出一个第k+1个词可能出现的概率的分布p(xk+1|x1,x2,...,xk)。

2. 困惑度的概念

困惑度的计算是基于单词的，在自然语言处理中，困惑度是用来衡量语言概率模型优劣的一个方法。一个语言概率模型可以看成是在整过句子或者文段上的概率分布。例如每个分词位置上有一个概率分布，这个概率分布表示了每个词在这个位置上出现的概率；或者每个句子位置上有一个概率分布，这个概率分布表示了所有可能句子在这个位置上出现的概率。

困惑度（perplexity）的基本思想是：给测试集的句子赋予较高概率值的语言模型较好,当语言模型训练完之后，测试集中的句子都是正常的句子，那么训练好的模型就是在测试集上的概率越高越好，公式如下：

结论：困惑度越小，句子概率越大，语言模型越好。

下面是一些 ngram 模型经训练文本后在测试集上的困惑度值：

preview

可以看到，之前我们学习的 trigram 模型经训练后，困惑度由955跌减至74，这是十分可观的结果。在Brown corpus (1 million words of American English of varying topics and genres) 上报告的最低的困惑度就是使用一个trigram model（三元语法模型）。在一个特定领域的语料中，常常可以得到更低的困惑度。

二. 使用场景

我们在想衡量语言模型的好坏，然后对语言模型输出的句子进行评价。场景很简单，话不多说，上代码。

三. MindSpore的Perplexity代码实现

原理就是上面说的那些，比较好理解，话不多说直接上代码。使用的是MindSpore框架实现的代码。

MindSpore代码实现

"""Perplexity"""
import math
import numpy as np
from mindspore._checkparam import Validator as validator
from .metric import Metricclass Perplexity(Metric):def __init__(self, ignore_label=None):super(Perplexity, self).__init__()if ignore_label is None:self.ignore_label = ignore_labelelse:self.ignore_label = validator.check_value_type("ignore_label", ignore_label, [int])self.clear()def clear(self):"""清除历史数据"""self._sum_metric = 0.0self._num_inst = 0def update(self, *inputs):# 输入数量的校验if len(inputs) != 2:raise ValueError('Perplexity needs 2 inputs (preds, labels), but got {}.'.format(len(inputs)))# 将输入统一转化为numpy，变为listpreds = [self._convert_data(inputs[0])]labels = [self._convert_data(inputs[1])]if len(preds) != len(labels):raise RuntimeError('preds and labels should have the same length, but the length of preds is{}, ''the length of labels is {}.'.format(len(preds), len(labels)))loss = 0.num = 0# 遍历句子中的每一个单词，再进行维度的变换for label, pred in zip(labels, preds):if label.size != pred.size / pred.shape[-1]:raise RuntimeError("shape mismatch: label shape should be equal to pred shape, but got label shape ""is {}, pred shape is {}.".format(label.shape, pred.shape))# 将label进行重塑label = label.reshape((label.size,))# 将label_expand转为int类型label_expand = label.astype(int)# label_expand扩展一维label_expand = np.expand_dims(label_expand, axis=1)first_indices = np.arange(label_expand.shape[0])[:, None]pred = np.squeeze(pred[first_indices, label_expand])if self.ignore_label is not None:ignore = (label == self.ignore_label).astype(pred.dtype)num -= np.sum(ignore)pred = pred * (1 - ignore) + ignoreloss -= np.sum(np.log(np.maximum(1e-10, pred)))num += pred.sizeself._sum_metric += lossself._num_inst += numdef eval(self):# 分母不能为0if self._num_inst == 0:raise RuntimeError('Perplexity can not be calculated, because the number of samples is 0.')return math.exp(self._sum_metric / self._num_inst)

使用方法如下：

import numpy as np
from mindspore import Tensor
from mindspore.nn.metrics import Perplexityx = Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]))
y = Tensor(np.array([1, 0, 1]))
metric = Perplexity(ignore_label=None)
metric.clear()
metric.update(x, y)
perplexity = metric.eval()
print(perplexity)2.231443166940565

每个batch（比如两组数据）进行计算的时候如下：

import numpy as np
from mindspore import Tensor
from mindspore.nn.metrics import Perplexitymetric = Perplexity(ignore_label=None)
metric.clear()
x = Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]))
y = Tensor(np.array([1, 0, 1]))
metric.update(x, y)x1= Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]))
y1 = Tensor(np.array([1, 0, 1]))
metric.update(x1, y1)
perplexity = metric.eval()
print(perplexity)