Submitting MR Jobs from Eclipse on Windows to a Hadoop Cluster
Author: Syn良子. Source: http://www.cnblogs.com/cssdongl. Reposting is welcome; please credit the source.
In the past, MapReduce projects written in Eclipse were usually packaged into a jar, uploaded to the Hadoop test cluster and run there directly; when something went wrong I would dig through the logs and fix the code. I finally found the time to set up Eclipse on Windows to submit MR jobs to the cluster remotely, which makes debugging much more convenient. Below are the problems I ran into and how I solved them.
Environment: Windows 7 64-bit, Eclipse Mars, Maven 3.3.9, Hadoop 2.6.0-CDH5.4.0.
I. Configuring the MapReduce Maven project
Create a new Maven project and copy the cluster's configuration files (mainly core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml) from the CDH cluster into src/main/java so that they end up on the classpath. Since the target is a CDH cluster, the main contents of pom.xml are as follows:
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>
<dependencies>
  <dependency><groupId>junit</groupId><artifactId>junit</artifactId><version>3.8.1</version><scope>test</scope></dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId><artifactId>hadoop-common</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type>
    <exclusions><exclusion><artifactId>kfs</artifactId><groupId>net.sf.kosmosfs</groupId></exclusion></exclusions>
    <scope>provided</scope>
  </dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-hdfs</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-hdfs-nfs</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-yarn-api</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-yarn-applications-distributedshell</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-yarn-server-resourcemanager</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-yarn-applications-unmanaged-am-launcher</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hbase</groupId><artifactId>hbase-hadoop2-compat</artifactId><version>1.0.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-mapreduce-client-common</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-mapreduce-client-jobclient</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-mapreduce-client-app</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-mapreduce-client-hs</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-mapreduce-client-hs-plugins</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hbase</groupId><artifactId>hbase-client</artifactId><version>1.0.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hbase</groupId><artifactId>hbase-common</artifactId><version>1.0.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hbase</groupId><artifactId>hbase-server</artifactId><version>1.0.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hbase</groupId><artifactId>hbase-protocol</artifactId><version>1.0.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hbase</groupId><artifactId>hbase-prefix-tree</artifactId><version>1.0.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hbase</groupId><artifactId>hbase-hadoop-compat</artifactId><version>1.0.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>jdk.tools</groupId><artifactId>jdk.tools</artifactId><version>1.7</version><scope>system</scope><systemPath>${JAVA_HOME}/lib/tools.jar</systemPath></dependency>
  <dependency><groupId>org.apache.maven.surefire</groupId><artifactId>surefire-booter</artifactId><version>2.12.4</version></dependency>
</dependencies>
If you are on a different CDH release, consult the official CDH Maven Artifacts page and adjust the dependencies accordingly (mainly the version numbers). If you use vanilla Apache Hadoop, remove the Cloudera repository above and configure the corresponding Apache dependencies instead.
Save the pom and wait for the jars to finish downloading.
II. Configuring Eclipse to submit MR jobs to the cluster
The simplest possible example is WordCount, so here is the code first.
package org.ldong.test;

import java.io.File;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class WordCount1 extends Configured implements Tool {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        System.setProperty("hadoop.home.dir", "C:\\hadoop-2.6.0");
        ToolRunner.run(new WordCount1(), args);
    }

    public int run(String[] args) throws Exception {
        // Adjust these to your own cluster (HDFS nameservice and directories).
        String input = "hdfs://littleNameservice/test/input";
        String output = "hdfs://littleNameservice/test/output";

        Configuration conf = new YarnConfiguration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount1.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(input));
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(new Path(output))) {
            fs.delete(new Path(output), true);
        }
        FileOutputFormat.setOutputPath(job, new Path(output));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
OK. The main thing here is the cluster-specific configuration in the code (the hdfs:// input and output paths); make it consistent with the cluster you connect to, then run the class directly with Run As > Java Application. If nothing unexpected happens, it will fail with:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
This error is very common when running MR on Windows: some of the native binaries Hadoop needs cannot be found on the Windows machine. Being a bit of a completionist, I could not just ignore it. The fix:
1. Download hadoop-2.6.0.tar.gz from the Apache site, extract it locally on Windows, and set an environment variable (HADOOP_HOME) pointing at the extracted Hadoop root directory so that the bin folder underneath it can be found, e.g. HADOOP_HOME=C:\hadoop-2.6.0.
2. Download the archive linked at the end of this post and extract all of its files into the bin directory of the Hadoop installation from step 1.
3. If the change does not take effect right away, add a temporary line of code such as System.setProperty("hadoop.home.dir", "C:\\hadoop-2.6.0") as in the code above; once the environment variable takes effect after a restart, the line can be removed.
Run the job again. That error disappears and another one shows up: the well-known "no job control" failure.
This exception is in fact a serious bug that affected MR job submission from a Windows client to a Hadoop cluster before Hadoop 2.4; it is related to how the classpath is built on different operating systems and was fixed in Hadoop 2.4. See:
http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/releasenotes.html
Among them, MAPREDUCE-4052: "Windows eclipse cannot submit job from Windows client to Linux/Unix Hadoop cluster."
If you are running a Hadoop version older than 2.4, see the following links for workarounds (MAPREDUCE-4052 and MAPREDUCE-5655 are duplicates of each other):
https://issues.apache.org/jira/browse/MAPREDUCE-4052
https://issues.apache.org/jira/browse/MAPREDUCE-5655
https://issues.apache.org/jira/browse/YARN-1298
http://www.aboutyun.com/thread-8498-1-1.html
http://blog.csdn.net/fansy1990/article/details/22896249
My environment, however, is Hadoop 2.6.0-CDH5.4.0, so the fix is far less involved: simply edit the mapred-site.xml in the project and set the following property (add it if it is missing):
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
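If you would rather not touch the copied mapred-site.xml, the same switch can be flipped on the client-side Configuration in code. This is only a sketch of that alternative (the helper class name is made up, not from the original post):

import org.apache.hadoop.conf.Configuration;

// Hypothetical helper, equivalent to the mapred-site.xml property above.
public class ClientConfig {
    public static Configuration enableCrossPlatformSubmission(Configuration conf) {
        // Tell the framework that the job is submitted from a platform (Windows)
        // different from the one the cluster runs on.
        conf.set("mapreduce.app-submission.cross-platform", "true");
        return conf;
    }
}

In WordCount1.run() one would call ClientConfig.enableCrossPlatformSubmission(conf) right after creating the YarnConfiguration and before Job.getInstance(conf, "word count").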
Keep going. The "no job control" error above is now gone, only to be replaced by yet another one.

It felt never-ending. It had to be another configuration issue, and after more digging the answer was once again to edit the project's configuration files directly.
In mapred-site.xml, add or modify the following property (keep it consistent with your own cluster):
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH</value>
</property>
In yarn-site.xml, add or modify the following property (again, consistent with your own cluster):
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
</property>
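The same programmatic route works for these two classpath properties if editing the XML files is inconvenient. The sketch below is not from the original post; it assumes the CDH-style variables shown above, which are expanded on the cluster nodes (not on the Windows client), so they must match the cluster's installation layout:

import org.apache.hadoop.conf.Configuration;

// Hypothetical helper mirroring the two XML properties above.
public class ClusterClasspaths {
    public static Configuration apply(Configuration conf) {
        // Classpath for the MR ApplicationMaster and tasks.
        conf.set("mapreduce.application.classpath",
                "$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH");
        // Classpath used when YARN containers are launched.
        conf.set("yarn.application.classpath",
                "$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,"
                + "$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,"
                + "$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,"
                + "$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*");
        return conf;
    }
}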
Rerun the MR job. The previous error disappears and a new one takes its place: the MR classes cannot be found. At this point there are two ways out. The first is the one my driver code uses (the part highlighted in yellow in the original post): automatically package the compiled MR classes into a jar and ship it to the cluster for distributed execution, starting from
String classDirToPackage = "D:\\workspace\\performance-statistics-mvn\\target\\classes";
Change this path to the classes folder of your own project (for a Maven project, temporarily delete the META-INF folder under target/classes, otherwise the packaging below fails). The subsequent code uses the EJob helper class to build the jar that is finally submitted to the cluster; a sketch of how it plugs into WordCount1.run() follows the class. The EJob class:
package org.ldong.test;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class EJob {

    private static List<URL> classPath = new ArrayList<URL>();

    // Packages everything under the given root directory into a temporary jar
    // that is deleted when the JVM exits.
    public static File createTempJar(String root) throws IOException {
        if (!new File(root).exists()) {
            return null;
        }
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().putValue("Manifest-Version", "1.0");
        final File jarFile = File.createTempFile("EJob-", ".jar",
                new File(System.getProperty("java.io.tmpdir")));
        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                jarFile.delete();
            }
        });
        JarOutputStream out = new JarOutputStream(new FileOutputStream(jarFile), manifest);
        createTempJarInner(out, new File(root), "");
        out.flush();
        out.close();
        return jarFile;
    }

    // Recursively adds files under f to the jar, using base as the entry prefix.
    private static void createTempJarInner(JarOutputStream out, File f, String base)
            throws IOException {
        if (f.isDirectory()) {
            File[] fl = f.listFiles();
            if (base.length() > 0) {
                base = base + "/";
            }
            for (int i = 0; i < fl.length; i++) {
                createTempJarInner(out, fl[i], base + fl[i].getName());
            }
        } else {
            out.putNextEntry(new JarEntry(base));
            FileInputStream in = new FileInputStream(f);
            byte[] buffer = new byte[1024];
            int n = in.read(buffer);
            while (n != -1) {
                out.write(buffer, 0, n);
                n = in.read(buffer);
            }
            in.close();
        }
    }

    public static ClassLoader getClassLoader() {
        ClassLoader parent = Thread.currentThread().getContextClassLoader();
        if (parent == null) {
            parent = EJob.class.getClassLoader();
        }
        if (parent == null) {
            parent = ClassLoader.getSystemClassLoader();
        }
        return new URLClassLoader(classPath.toArray(new URL[0]), parent);
    }

    public static void addClasspath(String component) {
        if ((component != null) && (component.length() > 0)) {
            try {
                File f = new File(component);
                if (f.exists()) {
                    URL key = f.getCanonicalFile().toURL();
                    if (!classPath.contains(key)) {
                        classPath.add(key);
                    }
                }
            } catch (IOException e) {
                // Ignore paths that cannot be resolved.
            }
        }
    }
}
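The colour highlighting of the original post is lost here, so for completeness this is my reconstruction (a sketch, not the author's verbatim code) of how EJob is wired into WordCount1.run(), placed after Job.getInstance(conf, "word count"):

// Inside WordCount1.run(), after the Job has been created:
String classDirToPackage = "D:\\workspace\\performance-statistics-mvn\\target\\classes";
java.io.File jarFile = EJob.createTempJar(classDirToPackage);
// Job.getInstance() wraps the Configuration in a JobConf, so the cast is safe;
// setJar() tells the framework which jar to upload and distribute to the cluster nodes.
((org.apache.hadoop.mapred.JobConf) job.getConfiguration()).setJar(jarFile.toString());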
The other way is to package the MR project into a jar by hand and place it on a classpath that every node of the cluster can load; running the program from Eclipse will then also trigger the distributed MR run.
Alright, run it again. This time it completes successfully.

Checking the JobHistory server and the HDFS output directory on the cluster confirms the result is correct.
Other errors you may run into include:
org.apache.hadoop.security.AccessControlException: Permission denied….
which simply means the client has no permission on HDFS. There are several ways to fix it:
1. Add an environment variable HADOOP_USER_NAME with the value hadoop on the Windows machine, or set it from Java at runtime with System.setProperty("HADOOP_USER_NAME", "hadoop"), where hadoop is a user with read/write access to HDFS.
2. Or open up the permissions on the directories you need, e.g. hadoop fs -chmod 777 /test.
3. Or edit the cluster's hdfs-site.xml and set dfs.permissions to false.
4. Or rename the local user on the Eclipse machine to hadoop, i.e. the same user that runs Hadoop on the servers and has read/write access to HDFS.
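For completeness, a minimal sketch of option 1 above (the class name is made up for illustration; it assumes the cluster's *-site.xml files are on the classpath so the Configuration points at the right HDFS):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAsHadoopUser {
    public static void main(String[] args) throws Exception {
        // Must be set before the first FileSystem/Job call in this JVM,
        // because Hadoop caches the login user on first use.
        System.setProperty("HADOOP_USER_NAME", "hadoop");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/test")));
    }
}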
III. Configuring MR jobs to run locally
If all you need is to debug and test the MR logic locally while reading from and writing to HDFS, without submitting the job to the cluster, the setup is much simpler than the above.
Again create a new Maven project. This time there is no need to copy the *-site.xml files; just configure the pom, whose main contents are:
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>
<dependencies>
  <dependency><groupId>junit</groupId><artifactId>junit</artifactId><version>3.8.1</version><scope>test</scope></dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId><artifactId>hadoop-common</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type>
    <exclusions><exclusion><artifactId>kfs</artifactId><groupId>net.sf.kosmosfs</groupId></exclusion></exclusions>
    <scope>provided</scope>
  </dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-hdfs</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-core</artifactId><version>2.6.0-mr1-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-yarn-api</artifactId><version>2.6.0-cdh5.4.0</version><type>jar</type><scope>provided</scope></dependency>
  <dependency><groupId>jdk.tools</groupId><artifactId>jdk.tools</artifactId><version>1.7</version><scope>system</scope><systemPath>${JAVA_HOME}/lib/tools.jar</systemPath></dependency>
  <dependency><groupId>org.apache.maven.surefire</groupId><artifactId>surefire-booter</artifactId><version>2.12.4</version></dependency>
</dependencies>
Write the WordCount as follows:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class WordCount extends Configured implements Tool {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        System.setProperty("hadoop.home.dir", "C:\\hadoop-2.6.0");
        ToolRunner.run(new WordCount(), args);
    }

    public int run(String[] args) throws Exception {
        // Adjust these to your own NameNode address and directories.
        String input = "hdfs://master01.jj.wl:8020/test/input";
        String output = "hdfs://master01.jj.wl:8020/test/output";

        Configuration conf = new YarnConfiguration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        FileInputFormat.addInputPath(job, new Path(input));
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(new Path(output))) {
            fs.delete(new Path(output), true);
        }
        FileOutputFormat.setOutputPath(job, new Path(output));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
Configure the connection to the cluster (the hdfs:// input and output paths in the code), run it, and it fails with:
Could not locate executable null\bin\winutils.exe in the Hadoop binaries
The fix is the same as before, so no need to repeat it. Apply it, run again, and it succeeds.

Sharp-eyed readers will notice that this time the MR job ran through LocalJobRunner. A look at the source shows that this is simply a single local JVM executing the MR program; it is never submitted to the cluster, but it does read and write the cluster's HDFS. That is why the output can still be seen on HDFS, while the cluster's Web UI task list shows no trace of the job that just ran.
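This behaviour falls directly out of the configuration: with no mapred-site.xml on the classpath, mapreduce.framework.name keeps its default value "local", so the job runs in-process, while fully qualified hdfs:// paths (as in the code above) still resolve against the remote NameNode. A small probe, not from the original post, that shows the relevant default:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LocalRunnerProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "probe");
        // Prints "local" when no cluster mapred-site.xml is on the classpath,
        // which is exactly why LocalJobRunner executes the job in this JVM.
        System.out.println(job.getConfiguration().get("mapreduce.framework.name"));
    }
}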
Download link for the winutils-related files: http://files.cnblogs.com/files/cssdongl/hadoop2.6%28x64%29.zip
Reposted from: https://www.cnblogs.com/cssdongl/p/6027116.html
