MapReduce 编程

MapReduce 是 Hadoop 的分布式计算框架，用于在 HDFS 集群上并行处理海量数据。它通过 Map 和 Reduce 两个阶段将复杂任务分解为可并行执行的小任务，结合数据本地化提高效率。本教程将详细介绍 MapReduce 编程模型，重点包括编程步骤、Java 和 Hadoop Streaming（Python）示例、运行流程和优化技巧，结合 Verilog 背景进行类比以便理解。内容基于 Apache Hadoop 3.x（截至 2025 年 10 月的最新稳定版）。

1. MapReduce 编程模型

MapReduce 程序将数据处理分为三个阶段：

Map：将输入数据分解为键值对（key-value pairs），进行初步处理（如过滤、转换）。
Shuffle：Hadoop 自动完成，按键分组并分发到 Reduce 任务。
Reduce：聚合 Map 输出，生成最终结果（如求和、统计）。

关键特点：

并行性：Map 和 Reduce 任务分布在多个节点上运行。
数据本地化：任务尽量在数据所在节点执行，减少网络传输。
容错性：任务失败自动重试，依赖 HDFS 的副本机制。

类比 Verilog：

Map 像并行组合逻辑，处理输入“信号”。
Reduce 像时序逻辑，汇总中间结果。
Shuffle 像数据总线，路由中间数据。
YARN（资源管理）像时钟分配器，调度任务。

2. MapReduce 编程步骤

MapReduce 程序通常用 Java 编写，也支持 Python 等语言（通过 Hadoop Streaming）。以下以经典的 WordCount（统计文本文件中单词频率）为例，介绍 Java 和 Python 实现。

2.1 Java 编程

Java 是 MapReduce 的主要编程语言，依赖 Hadoop 提供的类（如 Mapper 和 Reducer）。

2.1.1 Mapper

功能：将输入文本分解为单词，输出键值对（单词，1）。
代码：

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import java.io.IOException;

  public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
          String[] words = value.toString().split("\\s+");
          for (String w : words) {
              word.set(w);
              context.write(word, one);
          }
      }
  }

说明：
输入：LongWritable（文件偏移量），Text（一行文本）。
输出：Text（单词），IntWritable（计数 1）。
类比 Verilog：Mapper 像并行逻辑单元，将输入数据分解为独立处理的“信号”。

2.1.2 Reducer

功能：接收 Map 输出，按单词汇总计数。
代码：

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;
  import java.io.IOException;

  public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
              sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
      }
  }

说明：
输入：Text（单词），Iterable<IntWritable>（计数列表）。
输出：Text（单词），IntWritable（总计数）。
类比 Verilog：Reducer 像状态机，汇总输入信号并输出最终状态。

2.1.3 Driver（主程序）

功能：配置 Job，指定 Mapper、Reducer 和输入输出路径。
代码：

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = Job.getInstance(conf, "WordCount");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(WordCountMapper.class);
          job.setReducerClass(WordCountReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }

说明：
配置 Job 参数，连接 HDFS 输入/输出路径。
类比 Verilog：Driver 像顶层模块，实例化并连接子模块（Mapper/Reducer）。

2.1.4 编译与打包

编译：

  javac -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-3.3.6.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar *.java

打包：

  jar cf wordcount.jar *.class

2.2 Hadoop Streaming（Python 编程）

Hadoop Streaming 允许使用 Python 等非 Java 语言编写 MapReduce，适合快速开发。

2.2.1 Mapper（mapper.py）

功能：读取标准输入，输出键值对（单词，1）。
代码：

  #!/usr/bin/env python
  import sys
  for line in sys.stdin:
      words = line.strip().split()
      for word in words:
          print(f"{word}\t1")

说明：每行分解为单词，输出格式为“单词\t计数”。

2.2.2 Reducer（reducer.py）

功能：按单词汇总计数。
代码：

  #!/usr/bin/env python
  import sys
  current_word = None
  current_count = 0
  for line in sys.stdin:
      word, count = line.strip().split('\t', 1)
      count = int(count)
      if current_word == word:
          current_count += count
      else:
          if current_word:
              print(f"{current_word}\t{current_count}")
          current_word = word
          current_count = count
  if current_word:
      print(f"{current_word}\t{current_count}")

说明：按键（单词）累加计数，输出最终结果。

2.2.3 赋予执行权限

chmod +x mapper.py reducer.py

类比 Verilog：Hadoop Streaming 像用脚本生成 Verilog 测试向量，简化硬件逻辑描述。

3. 运行 MapReduce 程序

假设 HDFS 集群已配置（参考前述 HDFS 集群教程），以下是运行步骤。

3.1 准备输入数据

创建测试文件：

  echo -e "hello hadoop\nhello world" > input.txt

上传到 HDFS：

  hdfs dfs -mkdir /user/input
  hdfs dfs -put input.txt /user/input/

3.2 运行 Java 程序

执行：

  hadoop jar wordcount.jar WordCount /user/input /user/output

注意：输出路径（/user/output）不能预先存在。

3.3 运行 Python 程序（Hadoop Streaming）

执行：

  hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
      -file mapper.py -mapper mapper.py \
      -file reducer.py -reducer reducer.py \
      -input /user/input -output /user/output

3.4 查看结果

检查输出：

  hdfs dfs -cat /user/output/part-r-00000

预期输出：

  hadoop  1
  hello   2
  world   1

类比 Verilog：运行 MapReduce 像在 FPGA 上加载比特流，输入是“激励信号”，输出是“逻辑结果”。

4. 高级 MapReduce 编程

以下是扩展 MapReduce 功能的常用技巧。

4.1 Combiner

功能：在 Map 阶段本地聚合，减少 Shuffle 数据量。
示例：在 WordCount 中复用 Reducer：

  job.setCombinerClass(WordCountReducer.class);

类比 Verilog：像在数据通路中插入中间寄存器，减少传输延迟。

4.2 自定义 Partitioner

功能：控制 Map 输出如何分配到 Reducer。
示例：

  public class CustomPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
          return (key.toString().hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
  }

设置：

  job.setPartitionerClass(CustomPartitioner.class);
  job.setNumReduceTasks(2);

类比 Verilog：像数据路由模块，分配“信号”到不同处理单元。

4.3 多输出

功能：将不同结果写入多个文件。
示例：使用 MultipleOutputs：

  import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
  public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private MultipleOutputs<Text, IntWritable> mos;
      @Override
      protected void setup(Context context) {
          mos = new MultipleOutputs<>(context);
      }
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
              sum += val.get();
          }
          mos.write("output1", key, new IntWritable(sum));
      }
      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
          mos.close();
      }
  }

配置：

  MultipleOutputs.addNamedOutput(job, "output1", TextOutputFormat.class, Text.class, IntWritable.class);

4.4 自定义输入格式

功能：处理非标准输入（如 JSON、CSV）。
示例：继承 FileInputFormat 自定义解析逻辑。

类比 Verilog：这些高级功能像在硬件设计中添加自定义模块或状态机。

5. 优化与调优

Mapper/Reducer 数量：
Mapper 数由输入块数决定，可通过 mapreduce.job.maps 设置。
Reducer 数：job.setNumReduceTasks(n)。
内存分配：
编辑 mapred-site.xml：
xml <property> <name>mapreduce.map.memory.mb</name> <value>2048</value> </property> <property> <name>mapreduce.reduce.memory.mb</name> <value>2048</value> </property>
数据压缩：
压缩 Map 输出：
xml <property> <name>mapreduce.map.output.compress</name> <value>true</value> </property>
任务并行性：
增加 YARN 资源：编辑 yarn-site.xml：
xml <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>8192</value> </property>

类比 Verilog：优化像调整硬件设计的时钟频率或并行单元数量。

6. 常见问题与解决

输出路径已存在：
删除：hdfs dfs -rm -r /user/output。
任务失败：
检查日志：$HADOOP_HOME/logs/userlogs。
确保 YARN 资源充足（检查 yarn-site.xml）。
性能慢：
使用 Combiner 或增加 Reducer 数量。
检查 HDFS 块分布：hdfs fsck /user/input -files -blocks.
类比 Verilog：这些问题像硬件中的时序违例（资源不足）或信号冲突（路径错误）。

7. 类比 Verilog 总结

Mapper：并行组合逻辑，处理输入数据。
Reducer：时序逻辑，汇总结果。
Shuffle：数据总线，路由中间数据。
Driver：顶层模块，连接子模块和 I/O。
YARN：时钟分配器，调度任务。

8. 进一步资源

官方文档：https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html （MapReduce 编程指南）。
教程：
DataFlair 的 MapReduce 编程：https://data-flair.training/blogs/hadoop-mapreduce-tutorial （详细示例）。
Edureka 的 WordCount 实现：https://www.edureka.co/blog/mapreduce-tutorial （视频+代码）。
实践：在本地集群运行 WordCount，或在 AWS EMR 测试复杂任务。
书籍：《Hadoop: The Definitive Guide》by Tom White。

如果你需要更复杂的 MapReduce 示例（例如处理 JSON 数据、自定义分区）或结合 Verilog 思维深入探讨（如并行任务与硬件流水线的类比），请提供具体需求，我可以进一步定制解答！

1. MapReduce 编程模型

2. MapReduce 编程步骤

2.1 Java 编程

2.1.1 Mapper

2.1.2 Reducer

2.1.3 Driver（主程序）

2.1.4 编译与打包

2.2 Hadoop Streaming（Python 编程）

2.2.1 Mapper（mapper.py）

2.2.2 Reducer（reducer.py）

2.2.3 赋予执行权限

3. 运行 MapReduce 程序

3.1 准备输入数据

3.2 运行 Java 程序

3.3 运行 Python 程序（Hadoop Streaming）

3.4 查看结果

4. 高级 MapReduce 编程

4.1 Combiner

4.2 自定义 Partitioner

4.3 多输出

4.4 自定义输入格式

5. 优化与调优

6. 常见问题与解决

7. 类比 Verilog 总结

8. 进一步资源

likuolei

发表回复取消回复

归档

分类

2025 年 11 月
一	二	三	四	五	六	日
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30

1. MapReduce 编程模型

2. MapReduce 编程步骤

2.1 Java 编程

2.1.1 Mapper

2.1.2 Reducer

2.1.3 Driver（主程序）

2.1.4 编译与打包

2.2 Hadoop Streaming（Python 编程）

2.2.1 Mapper（mapper.py）

2.2.2 Reducer（reducer.py）

2.2.3 赋予执行权限

3. 运行 MapReduce 程序

3.1 准备输入数据

3.2 运行 Java 程序

3.3 运行 Python 程序（Hadoop Streaming）

3.4 查看结果

4. 高级 MapReduce 编程

4.1 Combiner

4.2 自定义 Partitioner

4.3 多输出

4.4 自定义输入格式

5. 优化与调优

6. 常见问题与解决

7. 类比 Verilog 总结

8. 进一步资源

likuolei

发表回复 取消回复

相关文章

发表回复取消回复