Spark Example: WordCount

Table of Contents

IntelliJ IDEA

I. Writing the WordCount Program

1. Create a Maven project named WordCount and add the dependencies

2. Write the code

3. Packaging plugin

4. Create the test data, package the project, and upload the jar

5. Cluster test (run from the directory containing the jar)

HDFS input:

Local-file input:

6. View the results

II. Calling Spark Remotely

1. Run start-all.sh under the Spark installation

Check the processes with jps:

2. Add the dependencies

3. Write the code

4. Package the project

5. Add the jar path where the SparkConf is created

Original code

After the change, with the path to the packaged jar added

6. Run and view the output


IntelliJ IDEA

I. Writing the WordCount Program

1. Create a Maven project named WordCount and add the dependencies


<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>0607</artifactId>
<version>1.0-SNAPSHOT</version>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.1</version>
    </dependency>
</dependencies>

<build>
    <finalName>WordCount</finalName>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

2. Write the code

import org.apache.spark.{SparkConf, SparkContext}

/**
 * @program: IntelliJ IDEA
 * @description: Spark version of WordCount
 * @create: 2022-06-08 11:27
 */
object WordCount {
  def main(args: Array[String]): Unit = {
    // 1. Build the configuration
    val sparkConf = new SparkConf().setAppName("WordCount")
    // 2. Get the SparkContext
    val sc = new SparkContext(sparkConf)
    // 3. Run the word count
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _, 1)
      .sortBy(_._2, false)
      .saveAsTextFile(args(1))
    // 4. Close the connection
    sc.stop()
  }
}
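This program takes its input and output paths from the command line and leaves the master to be chosen by spark-submit, so it cannot be run directly from the IDE. A minimal sketch for a quick local test, assuming a local[*] master and a hypothetical local input file (neither appears in the original program):

import org.apache.spark.{SparkConf, SparkContext}

// Local-only test variant: the master "local[*]" and the file path below
// are assumptions for running inside the IDE, not part of the packaged job.
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountLocal").setMaster("local[*]")
    val sc = new SparkContext(conf)

    sc.textFile("data/wc.txt")            // hypothetical local input file
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .collect()                          // small demo data, safe to collect
      .foreach(println)

    sc.stop()
  }
}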

3. Packaging plugin


<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>3.0.0</version>
    <configuration>
        <finalName>WordCount</finalName>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
    <executions>
        <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>

4. Create the test data, package the project, and upload the jar

[root@hadoop ~]# cd /usr/input
[root@hadoop input]# ls
WordCount.jar
[root@hadoop input]# vim wc.txt
java hadoop java hadoop
php hadoop scala scala
python java hive java

5. Cluster test (run from the directory containing the jar)

HDFS input (upload wc.txt to HDFS first, e.g. with hdfs dfs -put wc.txt /):

spark-submit --class WordCount --master yarn WordCount.jar hdfs://192.168.17.151:9000/wc.txt hdfs://192.168.17.151:9000/out

Local-file input:

spark-submit --class WordCount --master yarn WordCount.jar file:///usr/input/wc.txt hdfs://192.168.17.151:9000/0608

6. View the results

[root@hadoop input]# hdfs dfs -cat /out/part-00000
(java,4)
(hadoop,3)
(scala,2)
(hive,1)
(php,1)
(python,1)

II. Calling Spark Remotely

1. Run start-all.sh under the Spark installation (sbin/start-all.sh)

Check the processes with jps; after a successful start, jps should list a Master and at least one Worker process in addition to the HDFS daemons.

2. Add the dependencies


<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>0607</artifactId>
<version>1.0-SNAPSHOT</version>

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <scala.version>2.11.11</scala.version>
    <spark.version>2.1.1</spark.version>
    <hadoop.version>2.7.3</hadoop.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive-thriftserver_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>com.hankcs</groupId>
        <artifactId>hanlp</artifactId>
        <version>portable-1.7.7</version>
    </dependency>
</dependencies>

<build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.0</version>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                    <configuration>
                        <args>
                            <arg>-dependencyfile</arg>
                            <arg>${project.build.directory}/.scala_dependencies</arg>
                        </args>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>2.18.1</version>
            <configuration>
                <useFile>false</useFile>
                <disableXmlReport>true</disableXmlReport>
                <includes>
                    <include>**/*Test.*</include>
                    <include>**/*Suite.*</include>
                </includes>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

3. Write the code

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @program: IntelliJ IDEA
 * @description: ming
 * @create: 2022-06-08 19:54
 */
object sparkTest {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf, pointing at the standalone master
    val sparkConf = new SparkConf()
      .setMaster("spark://192.168.17.151:7077")
      .setAppName("WordCount")
    // 2. Create the SparkContext
    val sc = new SparkContext(sparkConf)
    // 3. Read the data from HDFS
    val rdd0: RDD[String] = sc.textFile("hdfs://192.168.17.151:9000/word.txt")
    // 4. Split each line into words
    val rdd1: RDD[String] = rdd0.flatMap(_.split(" "))
    // 5. Map each word to a (word, 1) pair
    val rdd2: RDD[(String, Int)] = rdd1.map((_, 1))
    // 6. Sum the counts per word and sort by count, descending
    val rdd3: RDD[(String, Int)] = rdd2.reduceByKey(_ + _).sortBy(_._2, false)
    // 7. Collect the result to an array on the driver
    val result: Array[(String, Int)] = rdd3.collect()
    // 8. Print the results
    result.foreach(println(_))
    // 9. Close the connection
    sc.stop()
  }
}
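Note that collect() pulls every (word, count) pair back to the driver, which is fine for this small demo but does not scale to large inputs. A minimal sketch of the same job writing its result back to HDFS instead; the output path /out-remote is an assumption, and running it from the IDE still needs the setJars addition described in step 5 below:

import org.apache.spark.{SparkConf, SparkContext}

// Variant that saves to HDFS instead of collecting to the driver.
// The output path "/out-remote" is hypothetical; Spark refuses to
// overwrite an existing directory, so pick a path that does not exist yet.
object sparkTestSave {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setMaster("spark://192.168.17.151:7077")
      .setAppName("WordCountSave")
    val sc = new SparkContext(sparkConf)

    sc.textFile("hdfs://192.168.17.151:9000/word.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .saveAsTextFile("hdfs://192.168.17.151:9000/out-remote")

    sc.stop()
  }
}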

4. Package the project

 

5. Add the jar path where the SparkConf is created

Original code

val sparkConf = new SparkConf().setMaster("spark://192.168.17.151:7077").setAppName("WordCount")

After the change, with the path to the packaged jar added

val sparkConf = new SparkConf().setMaster("spark://192.168.17.151:7077").setAppName("WordCount").setJars(Seq("D:\\worksoft\\demo\\0607\\target\\0607-1.0-SNAPSHOT.jar"))
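A minimal sketch of the modified configuration laid out across lines. setJars ships the packaged project jar to the executors; the extra spark.driver.host setting is an assumption (the IP of the machine running IDEA) and is only needed when the executors cannot reach the driver on its default network interface:

// Driver-side configuration for running against the remote standalone master
// from the IDE. The spark.driver.host value 192.168.17.1 is hypothetical:
// replace it with the development machine's IP, or omit the setting entirely
// if the default interface is already reachable from the cluster.
val sparkConf = new SparkConf()
  .setMaster("spark://192.168.17.151:7077")
  .setAppName("WordCount")
  .setJars(Seq("D:\\worksoft\\demo\\0607\\target\\0607-1.0-SNAPSHOT.jar"))
  .set("spark.driver.host", "192.168.17.1")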

6. Run the program and view the output

