Writing Spark output into a Hive table
Problems encountered:
1. Writing to HDFS
Override generateActualKey to return NullWritable, so the key is not written to the file.
Override generateActualValue to return the value itself.
Set the output file name:
generateFileNameForKeyValue
Allow the output directory to already exist (skip the "already exists" check):
checkOutputSpecs
(The full RDDMultipleTextOutputFormat implementation is under "Code:" below.)
2. When the DataFrame is converted to an RDD and each Row is turned into a string, the value is wrapped in [ ], so the brackets need to be stripped.
The fields inside are joined with the default separator ",". A short illustration follows.
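A minimal sketch of the bracket issue (the values here are made up, not from the job):

    Row row = RowFactory.create("dingdang", "love");        // org.apache.spark.sql.RowFactory
    String raw = row.toString();                             // "[dingdang,love]"
    String clean = raw.replace("[", "").replace("]", "");    // "dingdang,love"
    // Row.mkString(",") joins the fields without the surrounding brackets
    String joined = row.mkString(",");                       // "dingdang,love"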
3. The Hive table must be created with explicit field and line delimiters. Hive's default field delimiter is '\001' (Ctrl-A), so comma-separated lines are not split and the second column comes back as NULL:
dingdang,love    NULL
xuejiao,love1312    NULL
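To check which delimiters an existing table actually uses, the stored DDL can be inspected from the same SparkSession (a sketch; src is the table from this post):

    // Show the stored DDL, including ROW FORMAT and the field delimiter
    session.sql("SHOW CREATE TABLE src").show(false);
    // DESCRIBE FORMATTED carries the same info in its storage section
    session.sql("DESCRIBE FORMATTED src").show(100, false);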
Solution:
CREATE TABLE IF NOT EXISTS src(
c1 string,
c2 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
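Run through the same SparkSession as the job below, the DDL plus a quick check that c2 is no longer NULL might look like this sketch (not part of the original post):

    session.sql("CREATE TABLE IF NOT EXISTS src (c1 string, c2 string) "
            + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'");
    // After data is written with ',' between fields, both columns should be populated
    session.sql("SELECT c1, c2 FROM src LIMIT 10").show();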
4. Partitioned table
Register the partition directory (run this once the partitioned table below exists):
alter table src add partition(date_key='20200828')
Create the partitioned table:
CREATE TABLE IF NOT EXISTS src(
c1 string,
c2 string
)
PARTITIONED BY (date_key string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
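Putting the two statements together with the directory that saveAsHadoopFile writes to in the code below, the partition can be registered and queried roughly like this (a sketch; the path and date_key value are the ones used in this post):

    // Create the partitioned table, point the partition at the job's output
    // directory, then query only that partition
    session.sql("CREATE TABLE IF NOT EXISTS src (c1 string, c2 string) "
            + "PARTITIONED BY (date_key string) "
            + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'");
    session.sql("ALTER TABLE src ADD IF NOT EXISTS PARTITION (date_key='20200828') "
            + "LOCATION '/user/hive/warehouse/src/date_key=20200828'");
    session.sql("SELECT c1, c2 FROM src WHERE date_key='20200828' LIMIT 10").show();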
Code:
// MultipleTextOutputFormat subclass: drop the key, keep the value, name output
// files after the key, and skip the "output directory already exists" check so
// Spark can write into an existing Hive partition directory.
public class RDDMultipleTextOutputFormat<K, V> extends MultipleTextOutputFormat<K, V> {

    @Override
    protected K generateActualKey(K key, V value) {
        // Write NullWritable instead of the key so only the value appears in the file
        return (K) NullWritable.get();
    }

    @Override
    protected V generateActualValue(K key, V value) {
        return (V) value.toString();
    }

    @Override
    protected String generateFileNameForKeyValue(K key, V value, String name) {
        // Replace the default "part" prefix with the key, e.g. S003_WA_SOURCE_0005-00000
        name = name.replace("part", key.toString());
        return super.generateFileNameForKeyValue(key, value, name);
    }

    @Override
    public void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException {
        Path outDir = getOutputPath(job);
        if (outDir == null && job.getNumReduceTasks() != 0) {
            throw new InvalidJobConfException("Output directory not set in JobConf.");
        }
        if (outDir != null) {
            FileSystem fs = outDir.getFileSystem(job);
            // normalize the output directory
            outDir = fs.makeQualified(outDir);
            setOutputPath(job, outDir);
            // get delegation token for the outDir's file system
            TokenCache.obtainTokensForNamenodes(job.getCredentials(), new Path[] { outDir }, job);
            // The existence check is intentionally disabled so Spark can write into an
            // output directory (here, a Hive partition directory) that already exists.
            /*
            if (fs.exists(outDir)) {
                throw new FileAlreadyExistsException("Output directory " + outDir + " already exists");
            }
            */
        }
    }

    public static class OutputFormatUtil {
        public static String prefixOutputName = "";
    }
}
@Override
public int run(String[] args) throws Exception {
    SparkConf conf = getSparkconf();
    logger.warn("===========HiveMain run start===========");
    try {
        // Get the JavaSparkContext
        JavaSparkContext jsc = getJavaSparkContext();
        // SparkSession for SQL
        SparkSession session = getSparkSession();
        Configuration hadoopconf = jsc.hadoopConfiguration();
        //conf.set("spark.hadoop.fs.defaultFS", "hdfs://192.168.13.124:8020");
        hadoopconf.set("mapreduce.output.fileoutputformat.compress", "false");
        //conf.set("spark.sql.warehouse.dir", "hdfs://192.168.13.124:9000/user/hive/warehouse");
        FileSystem fs = FileSystem.get(hadoopconf);
        session.sql("use default");

        String strPath = args[0];
        JavaRDD<String> rowrdd = jsc.textFile(strPath);
        StructType schema = createSchema(new String[]{"c1", "c2"},
                new DataType[]{DataTypes.StringType, DataTypes.StringType});
        JavaRDD<Row> rowJavaRDD = parserdata2Row(rowrdd);
        Dataset<Row> df = session.createDataFrame(rowJavaRDD, schema);
        df.createOrReplaceTempView("src");
        df.show();

        JavaPairRDD<Text, Text> pairRDD = df.javaRDD().mapToPair(f -> {
            // Row.toString() looks like "[v1,v2]": strip the brackets and blank out nulls
            String replace = f.toString().replace("[", "").replace("]", "").replace("null", " ");
            String key = "S003_WA_SOURCE_0005";
            String[] value = replace.split(",");
            System.out.println("f : " + replace);
            return new Tuple2<>(new Text(key), new Text(value[0] + "," + value[1]));
        });
        // Write directly into the Hive partition directory using the custom output format
        pairRDD.saveAsHadoopFile("/user/hive/warehouse/src/date_key=20200828",
                Text.class, Text.class, RDDMultipleTextOutputFormat.class);
    } catch (Exception e) {
        // TODO: handle exception
    }
    return 0;
}

// Build a StructType from parallel arrays of field names and data types
private static StructType createSchema(String[] strFields, DataType[] dts) {
    ArrayList<StructField> fields = new ArrayList<>();
    if (strFields.length != dts.length) {
        System.out.println("Schema is error");
        return null;
    }
    for (int i = 0; i < strFields.length; i++) {
        StructField field = DataTypes.createStructField(strFields[i], dts[i], true);
        fields.add(field);
    }
    return DataTypes.createStructType(fields);
}

// Parse each tab-separated input line into a Row with two string columns
public JavaRDD<Row> parserdata2Row(JavaRDD<String> sourceDataRDD) {
    return sourceDataRDD.map(f -> {
        String[] strLine = f.split("\t");
        // Take the first two query fields of the TXT-type input view
        String[] res = new String[2];
        for (int i = 0; i < 2; i++) {
            res[i] = strLine[i];
        }
        return RowFactory.create(res);
    });
}
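The helpers getSparkconf(), getJavaSparkContext() and getSparkSession() are referenced above but not shown in the post; a minimal sketch of what they could look like (assumed, not the author's code; enableHiveSupport is needed for session.sql to reach the Hive metastore):

    private SparkConf getSparkconf() {
        // App name is arbitrary; the master is normally supplied by spark-submit
        return new SparkConf().setAppName("HiveMain");
    }

    private SparkSession getSparkSession() {
        // Hive support lets session.sql() create and query Hive tables
        return SparkSession.builder()
                .config(getSparkconf())
                .enableHiveSupport()
                .getOrCreate();
    }

    private JavaSparkContext getJavaSparkContext() {
        // Reuse the SparkContext that backs the SparkSession
        return JavaSparkContext.fromSparkContext(getSparkSession().sparkContext());
    }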
