Spark data types
RDD
Creating an RDD
- Read from a file: sc.textFile
- Parallelize a collection: sc.parallelize
- Other methods
RDD operations
- Transformation
- union
- intersection
- distinct
- groupByKey
- reduceByKey
- sortByKey
- join leftOuterJoin rightOuterJoin
- aggregate (note: `aggregate` is an action in the RDD API; the transformation counterpart is `aggregateByKey`)
- Action
- reduce
- count
- first
- take
- takeSample
- takeOrdered
- saveAsTextFile
- countByKey
- foreach
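The lists above are terse, so here is a sketch of what a few of these operations compute. Spark is not assumed to be on the classpath, so the semantics are mimicked with plain Scala collections (e.g. `reduceByKey` via `groupBy` plus a per-group reduce); in Spark these would be `rdd1.union(rdd2)`, `rdd.distinct()`, `pairs.reduceByKey(_ + _)`, and so on.

```scala
// Plain-Scala analogues of a few RDD transformations and actions (no Spark required).
val a = List(1, 2, 3)
val b = List(3, 4)

// union: concatenates without deduplicating (like rdd1.union(rdd2))
val union = a ++ b

// distinct: removes duplicates (like rdd.distinct())
val dist = union.distinct

// reduceByKey on (K, V) pairs: group by key, then reduce each group's values
val pairs = List(("a", 1), ("b", 2), ("a", 3))
val reduced = pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).sum) }

// sortByKey: sort pairs by key
val sorted = pairs.sortBy(_._1)

// Actions: reduce, count, first, take
val total = a.reduce(_ + _)
val n     = a.size      // count
val head  = a.head      // first
val taken = a.take(2)
```

Note the key distinction the outline draws: transformations (`union`, `distinct`, `reduceByKey`, ...) describe a new dataset lazily, while actions (`reduce`, `count`, `take`, ...) force computation and return a value to the driver.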
DataFrame
DataSet
Differences among RDD, DataFrame, and DataSet
| Difference | RDD | DataFrame | DataSet |
|---|---|---|---|
| Difference 1 | Does not support Spark SQL | Supported | Supported |
| Difference 2 | - | A DataFrame is an alias for DataSet[Row] (rows are untyped) | Strongly typed (DataSet[T] for a case class T) |
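The second difference is the key one: a DataFrame is DataSet[Row], so fields are fetched by name or position with runtime casts, while a DataSet of a case class is checked at compile time. A rough plain-Scala analogy (with `Map[String, Any]` standing in for Spark's `Row`; no Spark on the classpath):

```scala
// Rough analogy: DataFrame-style untyped rows vs DataSet-style typed records.
case class Person(name: String, age: Int)

// "DataFrame" style: field access needs a cast; a wrong type surfaces only at runtime
val row: Map[String, Any] = Map("name" -> "Ann", "age" -> 30)
val ageUntyped = row("age").asInstanceOf[Int]

// "DataSet" style: p.age is checked by the compiler; no cast needed
val p = Person("Ann", 30)
val ageTyped = p.age
```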
Converting between them
| from \ to | RDD | DataFrame | DataSet |
|---|---|---|---|
| RDD | - | `case class Person(name: String, age: String); val rdd = sc.textFile(""); val a = rdd.map(_.split(",")).map{ line => Person(line(0), line(1)) }.toDF` | `case class Person(name: String, age: String); val rdd = sc.textFile(""); val a = rdd.map(_.split(",")).map{ line => Person(line(0), line(1)) }.toDS` |
| DataFrame | `val rdd1 = testDF.rdd` | - | `val testDS = testDF.as[Coltest]` (Coltest is a case class matching the DataFrame's schema) |
| DataSet | `val rdd2 = testDS.rdd` | `val testDF = testDS.toDF` | - |

Note: `toDF`, `toDS`, and `as[...]` require `import spark.implicits._` from an active SparkSession.
Data types (MLlib)
- local vector (dense, sparse)
- labeled point
- LabeledPoint to Libsvm
- local matrix
- distributed matrix
- Row matrix
- IndexedRowMatrix
- CoordinateMatrix
- BlockMatrix
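A sparse local vector stores only a size plus the non-zero (index, value) pairs, and the LIBSVM text format writes exactly those pairs as `label index:value ...` with 1-based indices. A small sketch of both ideas without MLlib on the classpath (in MLlib these would be `Vectors.dense`, `Vectors.sparse(size, indices, values)`, and `LabeledPoint`):

```scala
// Dense vs sparse representation of the same vector, plus a LIBSVM-format line.
val dense = Array(1.0, 0.0, 0.0, 3.0)

// Sparse form: size + the non-zero (index, value) pairs
val size    = 4
val indices = Array(0, 3)
val values  = Array(1.0, 3.0)

// Reconstruct the dense vector from the sparse form
val rebuilt = Array.fill(size)(0.0)
for (i <- indices.indices) rebuilt(indices(i)) = values(i)

// LIBSVM line for a labeled point: "label index:value ..." with 1-based indices
val label = 1.0
val libsvmLine =
  label.toString + " " + indices.zip(values).map { case (i, v) => s"${i + 1}:$v" }.mkString(" ")
```

This is only an illustration of the encoding; in MLlib the conversion to LIBSVM files is done for you (e.g. via `MLUtils.saveAsLibSVMFile`).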
Supported input formats
json, parquet, jdbc, orc, libsvm, csv, text
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
Reference
- Official docs (recommended): https://spark.apache.org/docs/2.1.2/programming-guide.html#working-with-key-value-pairs
- https://blog.csdn.net/gongpulin/article/details/77622107
- https://www.cnblogs.com/maxigang/p/10030834.html
