Face Recognition: How MTCNN Works

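Face detection means locating the faces in an image. The input is an image that may contain faces, and the output is the rectangular bounding boxes that mark where the faces are.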

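Face alignment. Faces in the original image can differ widely in pose and position, so to process them uniformly afterwards they must first be "straightened". This requires detecting facial keypoints (landmarks), such as the positions of the eyes, nose, and mouth, and points along the facial contour. Given these landmarks, an affine transformation can bring every face into a common pose, which removes as much of the error caused by pose differences as possible.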
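The alignment step can be illustrated with a minimal sketch that fits an affine transformation to landmark pairs by least squares. The function names, the five template coordinates, and the simulated detections are all hypothetical values chosen for illustration; real pipelines typically use a library routine and their own template.

```python
import numpy as np

def estimate_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine matrix M such that dst ~= M @ [x, y, 1]."""
    src = np.asarray(src_pts, dtype=np.float64)
    dst = np.asarray(dst_pts, dtype=np.float64)
    # One homogeneous row [x, y, 1] per landmark.
    A = np.hstack([src, np.ones((src.shape[0], 1))])
    # Solve A @ X = dst for X (3x2); the affine matrix is its transpose.
    X, _, _, _ = np.linalg.lstsq(A, dst, rcond=None)
    return X.T

def apply_affine(M, pts):
    """Apply a 2x3 affine matrix to an array of (x, y) points."""
    pts = np.asarray(pts, dtype=np.float64)
    return pts @ M[:, :2].T + M[:, 2]

# Hypothetical canonical landmark positions (eyes, nose, mouth corners).
template = np.array([[38.0, 52.0], [74.0, 52.0], [56.0, 72.0],
                     [42.0, 92.0], [70.0, 92.0]])

# Simulated detections: the template rotated 15 degrees, scaled 1.5x, shifted.
theta = np.deg2rad(15.0)
R = 1.5 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
detected = template @ R.T + np.array([20.0, -10.0])

M = estimate_affine(detected, template)
aligned = apply_affine(M, detected)  # lands back on the template positions
```

The same matrix `M` would then be used to warp the face image itself, so that every face ends up with its landmarks near the template positions.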

MTCNN Network Structure

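MTCNN consists of three neural networks: P-Net, R-Net, and O-Net. Before these networks are applied, the original image is rescaled to several sizes to form an "image pyramid", and the image at each scale is then passed through the networks. The reason is that faces in the original image appear at different scales: some are large and some are small. Small faces can be detected in an enlarged copy of the image and large faces in a shrunken copy, so that all faces are detected at one uniform scale.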
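As a rough sketch of how the pyramid scales might be generated: start from the scale that maps the smallest face of interest onto the 12×12 network input, then keep shrinking by a constant factor until the image becomes smaller than 12 pixels. The function name and the default values `min_face_size=20` and `factor=0.709` are assumptions for illustration, not values stated in this article.

```python
def pyramid_scales(height, width, min_face_size=20, factor=0.709, net_input=12):
    """Return the list of scale factors for the image pyramid.

    min_face_size and factor are assumed defaults, chosen for illustration.
    """
    scales = []
    # The first scale maps a face of min_face_size pixels onto the 12x12 input.
    scale = net_input / min_face_size
    min_side = min(height, width) * scale
    # Stop once the shorter side of the scaled image is below the input size.
    while min_side >= net_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales

scales = pyramid_scales(100, 100)
```

For a 100×100 image with these assumed defaults, the loop produces five scales, starting at 0.6 and shrinking by 0.709 each step.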

P-Net

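The input to P-Net is a 3-channel RGB image that is 12 pixels wide and 12 pixels high. The network decides whether this 12×12 image contains a face and outputs the positions of the face box and the facial landmarks.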

The output consists of three parts:

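Face classification: whether the image is a face. The output vector has shape 1×1×2 and holds the probabilities that the image is, and is not, a face.
Box regression: the precise position of the box. The 12×12 patch fed to P-Net may not coincide with the ideal face box: the face may not be exactly square, or the patch may sit too far to the left or right. The network therefore outputs the offset of the current box relative to the ideal face box. A box in the image can be described by four numbers: the x-coordinate of its top-left corner, the y-coordinate of its top-left corner, its width, and its height. The box regression output is accordingly the relative offset of the top-left x-coordinate, the relative offset of the top-left y-coordinate, the error in the width, and the error in the height. This is the 1×1×4 output vector shown in the figure.
Landmark localization: the positions of 5 facial landmarks, namely the left eye, the right eye, the nose, the left mouth corner, and the right mouth corner. Each landmark needs an x-coordinate and a y-coordinate, so the output has 10 dimensions in total (1×1×10).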
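To make the box regression output concrete, here is a minimal sketch that applies predicted offsets to candidate boxes. It assumes boxes are stored as (x, y, w, h) and that all four offsets are expressed as fractions of the box width or height; this convention and the function name are assumptions for illustration, and actual implementations differ in how they parameterize the offsets.

```python
import numpy as np

def refine_boxes(boxes, offsets):
    """Apply relative offsets (dx, dy, dw, dh) to boxes given as (x, y, w, h).

    Assumed convention: dx and dw are fractions of the width,
    dy and dh are fractions of the height.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    offsets = np.asarray(offsets, dtype=np.float64)
    x, y, w, h = boxes.T
    dx, dy, dw, dh = offsets.T
    return np.stack([x + dx * w, y + dy * h, w + dw * w, h + dh * h], axis=1)

# One 12x12 candidate box and a hypothetical network output for it.
boxes = np.array([[100.0, 50.0, 12.0, 12.0]])
offsets = np.array([[0.25, -0.5, 0.5, 0.0]])
refined = refine_boxes(boxes, offsets)  # [[103., 44., 18., 12.]]
```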

R-Net

Every region that P-Net flags as a possible face is rescaled to 24×24 and fed into R-Net for a further check.

O-Net

All the regions that remain are then rescaled to 48×48 and fed into the final network, O-Net.

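From P-Net to R-Net and finally to O-Net, the input images get larger, the convolutional layers have more channels, and the networks have more layers, so their face classification accuracy should increase from one stage to the next. At the same time, P-Net runs fastest, R-Net is slower, and O-Net is slowest. Three networks are used because applying O-Net directly to every region of the image from the start would be very slow. Instead, P-Net performs a first round of filtering, R-Net filters P-Net's results again, and only what survives is handed to O-Net, which is the most accurate but also the slowest. Each step reduces the number of candidates the next one has to examine, which cuts the processing time considerably.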
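The filtering idea can be sketched with a toy cascade in which each stage keeps only the candidates whose score reaches that stage's threshold. The function name, the integer "scores", and the thresholds are hypothetical stand-ins; in MTCNN the scores would come from P-Net, R-Net, and O-Net.

```python
def run_cascade(candidates, stages):
    """Filter candidates through (score_fn, threshold) stages in order.

    Returns the survivors and the candidate count before and after each stage.
    """
    counts = [len(candidates)]
    for score_fn, threshold in stages:
        candidates = [c for c in candidates if score_fn(c) >= threshold]
        counts.append(len(candidates))
    return candidates, counts

# Toy candidates: an integer 0..9 stands in for how face-like a region is.
candidates = list(range(10))
stages = [
    (lambda c: c, 3),  # fast, permissive stage (the role of P-Net)
    (lambda c: c, 6),  # stricter stage (the role of R-Net)
    (lambda c: c, 8),  # strictest, slowest stage (the role of O-Net)
]
survivors, counts = run_cascade(candidates, stages)
# counts == [10, 7, 4, 2]: the slowest stage only scores 4 of the 10 candidates.
```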

Center Loss

Reference paper: A Discriminative Feature Learning Approach for Deep Face Recognition (http://ydwen.github.io/papers/WenECCV16.pdf)

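Ideally, the distance between two "vector representations" (the feature vectors extracted from face images) should directly reflect how similar the faces are: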

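For two face images of the same person, the Euclidean distance between the corresponding vectors should be small.
For two face images of different people, the Euclidean distance between the corresponding vectors should be large.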
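A small numeric illustration of the two requirements above, using made-up 3-dimensional vectors (real embeddings have many more dimensions, and the variable names are hypothetical):

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Hypothetical embeddings: two images of person A and one image of person B.
person_a_1 = np.array([1.0, 0.0, 0.0])
person_a_2 = np.array([0.8, 0.6, 0.0])
person_b_1 = np.array([0.0, 0.0, 1.0])

same_person = euclidean_distance(person_a_1, person_a_2)  # about 0.63
diff_person = euclidean_distance(person_a_1, person_b_1)  # about 1.41
```

A verification system built on such embeddings would compare the distance against a threshold to decide whether two images show the same person.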

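The original CNN model is trained with the Softmax loss. Softmax is a loss between classes, and for faces each class is one person. Although the Softmax loss can tell the individual people apart, it places no requirement on the distances between the vector representations within or across classes.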

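Center loss does not optimize distances directly. It keeps the original classification model but assigns a class center to every class (person). The features of images from the same class should lie as close as possible to their own class center, while the centers of different classes should stay as far apart as possible. The center loss term itself only enforces the first property; the separation between classes comes from the classification loss that is kept alongside it.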

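Let the input face image be $x_{i}$ and its class be $y_{i}$. A class center is defined for every class and denoted $c_{y_{i}}$. The feature $f(x_{i})$ of each face image should be as close as possible to its center $c_{y_{i}}$, so the center loss is defined as $L_{i}=\frac{1}{2}\left \| f(x_{i})-c_{y_{i}}\right \|^{2}$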

The center loss of several images is simply the sum of their individual values: $L_{center}=\sum_{i}L_{i}$

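This is a very simple definition, but one question remains open: how is the center $c_{y_{i}}$ of each class determined? In theory, the best center for class $y_{i}$ is the mean of the features of all its images. With that definition, however, $c_{y_{i}}$ would have to be recomputed over all images at every gradient descent step, which is far too expensive. An approximation is used instead: the centers $c_{y_{i}}$ are initialized randomly, and then within each batch the gradient of $L_{i}=\frac{1}{2}\left \| f(x_{i})-c_{y_{i}}\right \|^{2}$ is also computed with respect to the centers $c_{y_{i}}$ that appear in the batch, and this gradient is used to update them. In addition, the classification model cannot be trained with the center loss alone; the Softmax loss has to be included as well. The final loss therefore has two parts, $L=L_{softmax}+\lambda L_{center}$, where $\lambda$ is a hyperparameter.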
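The batch-wise update can be sketched in NumPy as follows. This is an illustrative sketch rather than the training code: the function name, the zero-initialized centers, and the `alfa` update factor are assumptions. The gradient of $L_{i}$ with respect to $c_{y_{i}}$ is $c_{y_{i}}-f(x_{i})$, and each center takes a step of size `1 - alfa` against that gradient.

```python
import numpy as np

def center_loss_step(features, labels, centers, alfa=0.5):
    """Compute the summed center loss of a batch and return updated centers.

    alfa is an assumed update factor: centers move by (1 - alfa) * gradient.
    """
    centers_batch = centers[labels]  # center of each sample's class
    loss = 0.5 * np.sum((features - centers_batch) ** 2)
    # Gradient of the loss with respect to each sample's center, scaled.
    diff = (1.0 - alfa) * (centers_batch - features)
    new_centers = centers.copy()
    # Unbuffered subtraction so repeated labels in a batch accumulate.
    np.subtract.at(new_centers, labels, diff)
    return loss, new_centers

features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 0, 1])
centers = np.zeros((2, 2))
loss, centers = center_loss_step(features, labels, centers)
# loss == 7.0; class 0 center moves to [2, 0], class 1 center to [0, 1]
```

In full training this value would be multiplied by $\lambda$ and added to the Softmax loss.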

As the figure shows, the larger the center loss weight $\lambda$, the more pronounced the "cohesion" of the learned features.

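    import tensorflow as tf  # TensorFlow 1.x API (tf.get_variable, tf.scatter_sub)

    def center_loss(features, label, alfa, nrof_classes):
        """Center loss based on the paper "A Discriminative Feature Learning
        Approach for Deep Face Recognition"
        (http://ydwen.github.io/papers/WenECCV16.pdf)

        :param features: features extracted by the deep convolutional network, [batch_size, feature_dim]
        :param label: class labels, [batch_size, 1]
        :param alfa: update factor; each center moves by (1 - alfa) times its offset from the batch features
        :param nrof_classes: total number of classes, int
        :return: the center loss and the updated class centers
        """
        nrof_features = features.get_shape()[1]
        # One center per class, initialized to zero and updated manually below
        # (trainable=False keeps the optimizer from touching it).
        centers = tf.get_variable('centers', [nrof_classes, nrof_features], dtype=tf.float32,
                                  initializer=tf.constant_initializer(0), trainable=False)
        label = tf.reshape(label, [-1])
        centers_batch = tf.gather(centers, label)  # center of each sample's class
        diff = (1 - alfa) * (centers_batch - features)  # scaled gradient with respect to the centers
        centers = tf.scatter_sub(centers, label, diff)  # update the class centers
        loss = tf.reduce_mean(tf.square(features - centers_batch))
        return loss, centers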

Triplet Loss

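Each time, three face images are drawn from the training data: the first is denoted $x_{i}^{a}$, the second $x_{i}^{p}$, and the third $x_{i}^{n}$. In such a "triplet", $x_{i}^{a}$ and $x_{i}^{p}$ are images of the same person, while $x_{i}^{n}$ is a face image of a different person. The distance $\left \| f(x_{i}^{a})-f(x_{i}^{p}) \right \|_{2}$ should therefore be small and the distance $\left \| f(x_{i}^{a})-f(x_{i}^{n}) \right \|_{2}$ should be large. Strictly speaking, triplet loss requires the following inequality to hold: $\left \| f(x_{i}^{a})-f(x_{i}^{p}) \right \|_{2}^{2}+\alpha < \left \| f(x_{i}^{a})-f(x_{i}^{n}) \right \|_{2}^{2}$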

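That is, the squared distance between images of the same face must be smaller than the squared distance between images of different faces by at least $\alpha$. Based on this, the loss function is designed as $L_{i}=\left [ \left \| f(x_{i}^{a})-f(x_{i}^{p}) \right \|_{2}^{2}+\alpha -\left \| f(x_{i}^{a})-f(x_{i}^{n}) \right \|_{2}^{2} \right ]_{+}$, where $[z]_{+}=\max(z,0)$.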

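With this definition, a triplet whose distances satisfy the requirement produces no loss, so $L_{i}=0$. When the distances do not satisfy the inequality above, the loss has the value $\left \| f(x_{i}^{a})-f(x_{i}^{p}) \right \|_{2}^{2}+\alpha -\left \| f(x_{i}^{a})-f(x_{i}^{n}) \right \|_{2}^{2}$. In addition, training fixes $\left \| f(x) \right \|=1$ so that the features cannot drift "infinitely far" apart.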
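A worked example of the per-triplet loss in NumPy, using made-up unit-length 2-dimensional embeddings and an assumed margin of 0.2. The function name is hypothetical, chosen to avoid clashing with any TensorFlow version of the loss.

```python
import numpy as np

def triplet_loss_np(anchor, positive, negative, alpha):
    """Per-triplet loss: max(||a - p||^2 - ||a - n||^2 + alpha, 0)."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)
    return np.maximum(pos_dist - neg_dist + alpha, 0.0)

# Two triplets; every embedding already has unit length.
anchor = np.array([[1.0, 0.0], [1.0, 0.0]])
positive = np.array([[1.0, 0.0], [0.0, 1.0]])
negative = np.array([[-1.0, 0.0], [1.0, 0.0]])

losses = triplet_loss_np(anchor, positive, negative, alpha=0.2)
# First triplet satisfies the margin (0 + 0.2 < 4), so its loss is 0.
# Second triplet violates it: 2 - 0 + 0.2 = 2.2.
```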

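Triplet loss optimizes distances directly, so it can solve the feature representation problem for faces. During training, however, choosing the triplets takes considerable skill. If triplets are always chosen at random, the model converges properly but does not reach its best performance. If "hard example mining" is added, that is, if the hardest triplets are always chosen for training, the model often fails to converge properly. As a compromise, it has been proposed to train on "semi-hard" triplets each time, which lets the model converge while still achieving good performance. Beyond that, training a face model with triplet loss usually requires a very large face dataset to obtain good results.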
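One way to express the "semi-hard" condition is sketched below: a negative is semi-hard when it is farther from the anchor than the positive but still inside the margin, so it violates the triplet inequality without being among the hardest cases. The function name and the sample vectors are hypothetical.

```python
import numpy as np

def semi_hard_negatives(anchor, positive, candidates, alpha):
    """Indices of candidates with pos_dist < neg_dist < pos_dist + alpha."""
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dists = np.sum((candidates - anchor) ** 2, axis=1)
    mask = (neg_dists > pos_dist) & (neg_dists < pos_dist + alpha)
    return np.where(mask)[0]

anchor = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])    # squared distance 1.0
candidates = np.array([
    [0.5, 0.0],   # squared distance 0.25: closer than the positive (hard)
    [1.1, 0.0],   # squared distance 1.21: inside the margin (semi-hard)
    [2.0, 0.0],   # squared distance 4.0: already satisfies the margin (easy)
])

idx = semi_hard_negatives(anchor, positive, candidates, alpha=0.5)
# Only index 1 qualifies.
```

Dropping the `neg_dists > pos_dist` condition would also admit the hard negatives, which is a looser selection rule.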

    import tensorflow as tf  # TensorFlow 1.x API (tf.variable_scope)

    def triplet_loss(anchor, positive, negative, alpha):
        """Calculate the triplet loss according to the FaceNet paper

        Args:
          anchor: the embeddings for the anchor images.
          positive: the embeddings for the positive images.
          negative: the embeddings for the negative images.
          alpha: the margin between positive and negative squared distances.

        Returns:
          the triplet loss according to the FaceNet paper as a float tensor.
        """
        with tf.variable_scope('triplet_loss'):
            # Squared Euclidean distances anchor-positive and anchor-negative
            pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, positive)), 1)
            neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, negative)), 1)
            basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), alpha)
            # Hinge at zero, then average over the batch
            loss = tf.reduce_mean(tf.maximum(basic_loss, 0.0), 0)
        return loss

    import numpy as np

    def select_triplets(embeddings, nrof_images_per_class, image_paths, people_per_batch, alpha):
        """Select the triplets for training

        :param embeddings: image feature vectors extracted by the deep network, [?, embedding_dim]
        :param nrof_images_per_class: list with the number of images of each person
        :param image_paths: image paths, in the same order as embeddings
        :param people_per_batch: number of classes (people) in each batch
        :param alpha: margin of the triplet loss
        :return: the shuffled triplets, the number of (anchor, positive) pairs examined, and the number of triplets
        """
        trip_idx = 0
        emb_start_idx = 0
        num_trips = 0
        triplets = []
        for i in range(people_per_batch):
            nrof_images = int(nrof_images_per_class[i])
            for j in range(1, nrof_images):
                a_idx = emb_start_idx + j - 1  # anchor index
                # Squared distances from the anchor to every embedding
                neg_dists_sqr = np.sum(np.square(embeddings[a_idx] - embeddings), 1)
                for pair in range(j, nrof_images):
                    p_idx = emb_start_idx + pair  # positive index
                    # Squared distance between the anchor and the positive
                    pos_dist_sqr = np.sum(np.square(embeddings[a_idx] - embeddings[p_idx]))
                    # Mask the distances to images of the anchor's own class with NaN
                    neg_dists_sqr[emb_start_idx:emb_start_idx + nrof_images] = np.nan
                    # Keep the negatives that violate the margin: neg_dist - pos_dist < alpha
                    all_neg = np.where(neg_dists_sqr - pos_dist_sqr < alpha)[0]
                    nrof_random_negs = all_neg.shape[0]
                    if nrof_random_negs > 0:
                        # Pick one of the qualifying negatives at random
                        rnd_idx = np.random.randint(nrof_random_negs)
                        n_idx = all_neg[rnd_idx]
                        triplets.append((image_paths[a_idx], image_paths[p_idx], image_paths[n_idx]))
                        trip_idx += 1
                    num_trips += 1
            emb_start_idx += nrof_images
        np.random.shuffle(triplets)
        return triplets, num_trips, len(triplets)

    import itertools
    import time

    import numpy as np
    import tensorflow as tf  # TensorFlow 1.x API (tf.Summary)

    import facenet  # project module that provides get_learning_rate_from_file
    # sample_people and select_triplets are defined elsewhere in the same script.

    def train(args, sess, dataset, epoch, image_paths_placeholder, labels_placeholder, labels_batch,
              batch_size_placeholder, learning_rate_placeholder, phase_train_placeholder, enqueue_op,
              input_queue, global_step, embeddings, loss, train_op, summary_op, summary_writer,
              learning_rate_schedule_file, embedding_size, anchor, positive, negative, triplet_loss):
        batch_number = 0
        if args.learning_rate > 0.0:
            lr = args.learning_rate
        else:
            lr = facenet.get_learning_rate_from_file(learning_rate_schedule_file, epoch)
        while batch_number < args.epoch_size:
            # Randomly sample people_per_batch * images_per_person images from the
            # dataset, keeping images of the same class together
            image_paths, num_per_class = sample_people(dataset, args.people_per_batch, args.images_per_person)

            print('Running forward pass on sampled images: ', end='')
            start_time = time.time()
            nrof_examples = args.people_per_batch * args.images_per_person
            labels_array = np.reshape(np.arange(nrof_examples), (-1, 3))
            image_paths_array = np.reshape(np.expand_dims(np.array(image_paths), 1), (-1, 3))
            # Enqueue the people_per_batch * images_per_person images
            sess.run(enqueue_op, {image_paths_placeholder: image_paths_array, labels_placeholder: labels_array})
            emb_array = np.zeros((nrof_examples, embedding_size))
            nrof_batches = int(np.ceil(nrof_examples / args.batch_size))
            # Compute the embeddings of the sampled images; each run dequeues one
            # batch, so the queue is empty when the loop finishes
            for i in range(nrof_batches):
                batch_size = min(nrof_examples - i * args.batch_size, args.batch_size)
                emb, lab = sess.run([embeddings, labels_batch],
                                    feed_dict={batch_size_placeholder: batch_size,
                                               learning_rate_placeholder: lr,
                                               phase_train_placeholder: True})
                emb_array[lab, :] = emb
            print('%.3f' % (time.time() - start_time))

            # Select the triplets to train on (negatives that violate the margin)
            print('Selecting suitable triplets for training')
            triplets, nrof_random_negs, nrof_triplets = select_triplets(
                emb_array, num_per_class, image_paths, args.people_per_batch, args.alpha)
            selection_time = time.time() - start_time
            print('(nrof_random_negs, nrof_triplets) = (%d, %d): time=%.3f seconds' %
                  (nrof_random_negs, nrof_triplets, selection_time))

            # Perform training on the selected triplets
            nrof_batches = int(np.ceil(nrof_triplets * 3 / args.batch_size))
            triplet_paths = list(itertools.chain(*triplets))
            labels_array = np.reshape(np.arange(len(triplet_paths)), (-1, 3))
            triplet_paths_array = np.reshape(np.expand_dims(np.array(triplet_paths), 1), (-1, 3))
            # Enqueue the selected triplets
            sess.run(enqueue_op, {image_paths_placeholder: triplet_paths_array, labels_placeholder: labels_array})
            nrof_examples = len(triplet_paths)
            train_time = 0
            i = 0
            emb_array = np.zeros((nrof_examples, embedding_size))
            loss_array = np.zeros((nrof_triplets,))
            # Train batch by batch
            while i < nrof_batches:
                start_time = time.time()
                batch_size = min(nrof_examples - i * args.batch_size, args.batch_size)
                feed_dict = {batch_size_placeholder: batch_size, learning_rate_placeholder: lr,
                             phase_train_placeholder: True}
                err, _, step, emb, lab = sess.run(
                    [loss, train_op, global_step, embeddings, labels_batch], feed_dict=feed_dict)
                emb_array[lab, :] = emb
                loss_array[i] = err
                duration = time.time() - start_time
                print('Epoch: [%d][%d/%d]\tTime %.3f\tLoss %2.3f' %
                      (epoch, batch_number + 1, args.epoch_size, duration, err))
                batch_number += 1
                i += 1
                train_time += duration

            # Add validation loss and accuracy to summary
            summary = tf.Summary()
            #pylint: disable=maybe-no-member
            summary.value.add(tag='time/selection', simple_value=selection_time)
            summary_writer.add_summary(summary, step)
        return step