I. A Brief Introduction to the Combiner

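Today's topic is step 7 of the Shuffle process.

Each map task can produce a large amount of local output. The Combiner merges that output once on the map side, which cuts the amount of data transferred between the map and reduce nodes and improves network I/O performance. It is one of MapReduce's optimization techniques.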

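1. The Combiner is a component of an MR program separate from the Mapper and the Reducer.
2. The parent class of a Combiner is Reducer.
3. The Combiner and the Reducer differ in where they run:
The Combiner runs on the node of each MapTask;
The Reducer receives the output of all Mappers globally.
4. The purpose of the Combiner is to aggregate each MapTask's output locally, reducing the volume of network traffic.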
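
To make point 4 concrete, here is a minimal plain-Java sketch (no Hadoop dependency) that counts how many records two simulated map tasks would hand to the shuffle with and without local aggregation. The class name `ShuffleVolumeDemo`, its methods, and the two input strings are made up for illustration; a real job spills and merges map output in ways this sketch ignores.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleVolumeDemo {

    /** Simulates a WordCount map task: emits one (word, 1) pair per word. */
    static List<Map.Entry<String, Integer>> mapTask(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.split(" ")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    /** Simulates a combiner: sums the values of identical keys within ONE map task's output. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Hypothetical input: one string per map task's input split.
        String[] splits = {"a b a c a", "b b a c c"};
        int withoutCombiner = 0;
        int withCombiner = 0;
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapOutput = mapTask(split);
            withoutCombiner += mapOutput.size();        // every (word, 1) pair is shuffled
            withCombiner += combine(mapOutput).size();  // one pair per distinct word per task
        }
        System.out.println("records shuffled without combiner: " + withoutCombiner); // 10
        System.out.println("records shuffled with combiner:    " + withCombiner);    // 6
    }
}
```

The saving grows with the number of repeated keys inside each map task's output; if every key is unique within a task, the combiner saves nothing.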

II. Comparing Jobs With and Without a Combiner Through Diagrams

1. Network overhead without a combiner
2. Network overhead with a combiner

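The diagrams show clearly that the combiner stage merges the values of identical keys within the same partition, which shrinks the data transferred afterwards and so improves network I/O performance.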

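In MapReduce, however, the combiner is not enabled by default. Why? Because merging data does not suit every business requirement. For counting or summing, a combiner works to its full advantage. For an average, though, averaging on the map side and then averaging again in the reducer distorts the final result, because the mean of partial means is generally not the overall mean. So when a combiner is needed, it has to be enabled manually.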
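
The averaging problem is easy to check with a few numbers. The sketch below is plain Java with made-up values, and the class and method names are hypothetical. It compares a naive "average in the combiner" approach with one common workaround, in which each map task passes along a (sum, count) pair and the division happens only once at the end.

```java
public class AverageCombinerPitfall {

    /** Arithmetic mean of a non-empty array. */
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Unsafe: each map task pre-averages, then the reducer averages those averages. */
    static double averageOfAverages(double[][] mapOutputs) {
        double[] partialMeans = new double[mapOutputs.length];
        for (int i = 0; i < mapOutputs.length; i++) {
            partialMeans[i] = mean(mapOutputs[i]);
        }
        return mean(partialMeans);
    }

    /** Safe: each map task contributes a (sum, count) pair; the reducer divides once. */
    static double averageViaSumAndCount(double[][] mapOutputs) {
        double totalSum = 0;
        long totalCount = 0;
        for (double[] output : mapOutputs) {
            double localSum = 0; // what a combiner would emit as the partial sum
            for (double v : output) {
                localSum += v;
            }
            totalSum += localSum;
            totalCount += output.length; // the partial count travels with the sum
        }
        return totalSum / totalCount;
    }

    public static void main(String[] args) {
        // Hypothetical values for one key, produced by two different map tasks.
        double[][] mapOutputs = {{3, 5, 7}, {2, 6}};
        System.out.println(averageOfAverages(mapOutputs));     // prints "4.5"
        System.out.println(averageViaSumAndCount(mapOutputs)); // prints "4.6"
    }
}
```

The true mean of 3, 5, 7, 2, 6 is 23 / 5 = 4.6, so only the (sum, count) variant survives local aggregation unchanged.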

3. Steps for implementing a custom Combiner
① Define a custom Combiner that extends Reducer and overrides the reduce method

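import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // 1. Sum the counts emitted for this key by the local map task
        int count = 0;
        for (IntWritable v : values) {
            count += v.get();
        }
        // 2. Write out the partial sum
        context.write(key, new IntWritable(count));
    }
}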

② Register it in the Job driver class:
job.setCombinerClass(WordcountCombiner.class);
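
One point worth keeping in mind when writing a Combiner: Hadoop treats it as an optimization and does not promise how many times it runs on a given map task's output (possibly zero, once, or several times). The job's result therefore has to be the same in every case. The plain-Java sketch below uses hypothetical names, with chunks standing in loosely for spills, and checks that property for summation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinerRunCountDemo {

    /** Sums all values for one key: the logic shared by the combiner and the reducer. */
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Applies the combiner to each chunk of values, yielding one partial sum per chunk. */
    static List<Integer> combineChunks(List<Integer> values, int chunkSize) {
        List<Integer> partials = new ArrayList<>();
        for (int start = 0; start < values.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, values.size());
            partials.add(sum(values.subList(start, end)));
        }
        return partials;
    }

    public static void main(String[] args) {
        // Seven (word, 1) records for the same key, as a WordCount mapper would emit them.
        List<Integer> mapValues = Arrays.asList(1, 1, 1, 1, 1, 1, 1);

        int zeroRuns = sum(mapValues);                                     // combiner never ran
        int oneRun = sum(combineChunks(mapValues, 3));                     // ran once per chunk
        int twoRuns = sum(combineChunks(combineChunks(mapValues, 3), 2));  // ran again on partials

        System.out.println(zeroRuns + " " + oneRun + " " + twoRuns); // prints "7 7 7"
    }
}
```

Summation passes this check because it is associative and commutative. The same requirement explains why a Combiner's input and output key/value types both have to match the map output types: its output may be fed back into it.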

III. Code Implementation

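Note: the program used for comparison is the source code from "MapReduce Series (2) | Counting the Total Occurrences of Each Word in a Given Text Document". Readers who want to compare can copy that code and set up the comparison themselves (in fact, the code here has just one more line than the original).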

3.1 Writing the Mapper class

package com.buwenbuhuo.wordcount;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author 卜温不火
 * @create 2020-04-22 21:24
 * Package: com.buwenbuhuo.wordcount; project: mapreduce0422
 */
public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused across map() calls so no new objects are allocated per record
    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1. Read one line
        String line = value.toString();
        // 2. Split it into words
        String[] words = line.split(" ");
        // 3. Emit (word, 1) for each word
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
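
A side note on step 2 of the Mapper: `split(" ")` breaks on every single space, so a line containing consecutive spaces yields empty strings that would be counted as a "word". The sketch below is plain Java, and the `tokenize` helper is hypothetical rather than part of the original program. It shows that behavior and one possible way to avoid it if the input is not cleanly single-spaced.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizeDemo {

    /** Splits on runs of whitespace and drops empty tokens. */
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) { // an empty or blank line still yields one empty token
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "hello  world"; // note the two spaces
        System.out.println(Arrays.toString(line.split(" "))); // prints "[hello, , world]"
        System.out.println(tokenize(line));                   // prints "[hello, world]"
    }
}
```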

3.2 Writing the Reducer class

package com.buwenbuhuo.wordcount;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author 卜温不火
 * @create 2020-04-22 21:24
 * Package: com.buwenbuhuo.wordcount; project: mapreduce0422
 */
public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    int sum;
    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // 1. Accumulate the sum for this key
        sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        // 2. Emit (word, total)
        v.set(sum);
        context.write(key, v);
    }
}

3.3 Writing the Driver class

package com.buwenbuhuo.wordcount;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author 卜温不火
 * @create 2020-04-22 21:24
 * Package: com.buwenbuhuo.wordcount; project: mapreduce0422
 */
public class WcDriver {

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get the configuration and wrap it in a Job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // 2. Set the jar by its driver class
        job.setJarByClass(WcDriver.class);

        // 3. Set the Mapper and Reducer classes
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);

        // 4. Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // The only line added: WcReducer doubles as the combiner,
        // since its summing logic is the same as WordcountCombiner's
        job.setCombinerClass(WcReducer.class);

        // 5. Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7. Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}