Hive的order by、sort by、distribute by和cluster by[附实例]

2023-11-24 03:59:20

Hive的order by、sort by、distribute by和cluster by

文章目录

Hive的order by、sort by、distribute by和cluster by
- 作用
- - order by
  - sort by
  - distribute by
  - cluster by
- 示例
- - 准备测试数据
  - order by
  - sort by
  - distribute by
  - cluster by

作用

order by

只产生一个reduce，在reduce中全局排序。

数据量大的时候集中在一个reduce就会运行的很慢。

如果指定了严格模式hive.mapred.mode=strict;(默认是nonstrict)，这时就必须指定limit来限制输出条数。

sort by

sort by是分组排序，在有多个reduce输出的情况下，它保证每个reduce的输出结果有序。

只有一个reduce时，和order by作用一样。

设置reduce数量为3：set mapreduce.job.reduces=3;

数据量大的时候使用sort by将数据分散到多个reduce中排序后，再进行order by全局排序会提高计算速度。

distribute by

distribute by会根据指定列的字段的hash值，在map端将数据分散到不同的reduce中，相同hash值的在一个组中，常与sort by 一起使用。

需要设置reduce数量。因为sort by保证每个reduce输出结果有序，默认一个reduce。

distribute by + sort by的结果是分组内有序而全局无序的。

distribute by和group by区别？

有人可能觉得distribute by和group by很像，distribute by是根据hash值分组，把数据分散给不同的reduce，与sort by一起使用；group by是根据实际的值，要与sum等聚合函数一起使用。

Distribute by和sort by的使用场景

1.Map输出的文件大小不均。
2.Reduce输出文件大小不均。
3.小文件过多。
4.文件超大。

distribute by rand() 可以防止数据倾斜

当 distribute by 后边跟的列是：rand() 随机数时，即保证每个分区的数据量基本一致。rand()生成随机数的函数。

按照随机数分组可以将数据随机分配到不同的reduce中，同时保证每个分区的数据量基本一致。需要设置reduce数量。

cluster by

把有相同值的数据分散到一组，并排序，效果等同于distribute by + sort by 。

即 cluster by col 等同于distribute by col sort by col

但是排序只能是默认升序排序，不能指定排序规则为asc 或者desc。

示例

准备测试数据

创建一张order表，字段分别为name，orderdate，orderprice

create table ordtable
(
name string,
orderdate date,
orderprice double
)  
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE ;

插入测试数据到表中

insert overwrite table ordtable 
select 'Jack','2021-01-01',10
union all 
select 'ROSE','2021-02-01',15
union all 
select 'ROSE','2021-03-01',20
union all 
select 'Bob','2021-04-01',25
union all 
select 'Bob','2021-03-11',30
union all 
select 'Bob','2021-03-21',20	
union all 
select 'Adam','2021-04-11',15
union all 
select 'Adam','2021-03-11',25
union all 
select 'Adam','2021-04-21',10
union all 
select 'Adam','2021-04-25',5;

测试数据如下

+----------------+---------------------+----------------------+--+
| ordtable.name  | ordtable.orderdate  | ordtable.orderprice  |
+----------------+---------------------+----------------------+--+
| Jack           | 2021-01-01          | 10.0                 |
| ROSE           | 2021-02-01          | 15.0                 |
| ROSE           | 2021-03-01          | 20.0                 |
| Bob            | 2021-04-01          | 25.0                 |
| Bob            | 2021-03-11          | 30.0                 |
| Bob            | 2021-03-21          | 20.0                 |
| Adam           | 2021-04-11          | 15.0                 |
| Adam           | 2021-03-11          | 25.0                 |
| Adam           | 2021-04-21          | 10.0                 |
| Adam           | 2021-04-25          | 5.0                  |
+----------------+---------------------+----------------------+--+

order by

一个reduce，全局排序

select * from ordtable order by orderprice;

+----------------+---------------------+----------------------+--+
| ordtable.name  | ordtable.orderdate  | ordtable.orderprice  |
+----------------+---------------------+----------------------+--+
| Adam           | 2021-04-25          | 5.0                  |
| Adam           | 2021-04-21          | 10.0                 |
| Jack           | 2021-01-01          | 10.0                 |
| Adam           | 2021-04-11          | 15.0                 |
| ROSE           | 2021-02-01          | 15.0                 |
| Bob            | 2021-03-21          | 20.0                 |
| ROSE           | 2021-03-01          | 20.0                 |
| Adam           | 2021-03-11          | 25.0                 |
| Bob            | 2021-04-01          | 25.0                 |
| Bob            | 2021-03-11          | 30.0                 |
+----------------+---------------------+----------------------+--+

sort by

先设置reduce数为2个，默认1个看不出效果。2个reduce，每个reduce组内排序。

set mapreduce.job.reduces=2;

select * from ordtable sort by orderprice;

+----------------+---------------------+----------------------+--+
| ordtable.name  | ordtable.orderdate  | ordtable.orderprice  |
+----------------+---------------------+----------------------+--+
| Adam           | 2021-04-25          | 5.0                  |
| Jack           | 2021-01-01          | 10.0                 |
| ROSE           | 2021-03-01          | 20.0                 |
| Adam           | 2021-03-11          | 25.0                 |
| Bob            | 2021-04-01          | 25.0                 |
| Bob            | 2021-03-11          | 30.0                 |
| Adam           | 2021-04-21          | 10.0                 |
| Adam           | 2021-04-11          | 15.0                 |
| ROSE           | 2021-02-01          | 15.0                 |
| Bob            | 2021-03-21          | 20.0                 |
+----------------+---------------------+----------------------+--+

distribute by

求按照月份分组，求出了每个月的订单价格升序排序?

按照月份在map端分组，共有4个月份，设置了4个reduce任务，每个reduce组内按照价格升序序排序。默认一个reduce。

set mapreduce.job.reduces=4;

select * from ordtable distribute by month(orderdate) sort by orderprice asc;

0: jdbc:hive2://hadoop91:10000> select * from ordtable distribute by month(orderdate) sort by orderprice asc;
+----------------+---------------------+----------------------+--+
| ordtable.name  | ordtable.orderdate  | ordtable.orderprice  |
+----------------+---------------------+----------------------+--+
| Adam           | 2021-04-25          | 5.0                  |
| Adam           | 2021-04-21          | 10.0                 |
| Adam           | 2021-04-11          | 15.0                 |
| Bob            | 2021-04-01          | 25.0                 |
| Jack           | 2021-01-01          | 10.0                 |
| ROSE           | 2021-02-01          | 15.0                 |
| Bob            | 2021-03-21          | 20.0                 |
| ROSE           | 2021-03-01          | 20.0                 |
| Adam           | 2021-03-11          | 25.0                 |
| Bob            | 2021-03-11          | 30.0                 |
+----------------+---------------------+----------------------+--+

reduce为默认1个时，就相当于全局排序了，就只会按照价格排序，结果如下

set mapreduce.job.reduces=1;

select * from ordtable distribute by month(orderdate) sort by orderprice asc;

0: jdbc:hive2://hadoop91:10000> select * from ordtable distribute by month(orderdate) sort by orderprice asc;
+----------------+---------------------+----------------------+--+
| ordtable.name  | ordtable.orderdate  | ordtable.orderprice  |
+----------------+---------------------+----------------------+--+
| Adam           | 2021-04-25          | 5.0                  |
| Adam           | 2021-04-21          | 10.0                 |
| Jack           | 2021-01-01          | 10.0                 |
| Adam           | 2021-04-11          | 15.0                 |
| ROSE           | 2021-02-01          | 15.0                 |
| Bob            | 2021-03-21          | 20.0                 |
| ROSE           | 2021-03-01          | 20.0                 |
| Adam           | 2021-03-11          | 25.0                 |
| Bob            | 2021-04-01          | 25.0                 |
| Bob            | 2021-03-11          | 30.0                 |
+----------------+---------------------+----------------------+--+

cluster by

select * from ordtable distribute by orderprice sort by orderprice asc;

等同于

select * from ordtable cluster by orderprice ;

本文来自互联网用户投稿，文章观点仅代表作者本人，不代表本站立场，不承担相关法律责任。如若转载，请注明出处。 如若内容造成侵权/违法违规/事实不符，请点击【内容举报】进行投诉反馈！

标签：技术

上一篇 > distribute mysql
下一篇 > 基于distribute命令的路由过滤配置详解

Duilib中list控件支持ctrl和shif多行选中的实现

[ICML2015]Batch Normalization:Accelerating Deep Network Training by Reducing Internal Covariate Shif

win10系统微软输入法于eclipse ctrl+shif+f冲突间接处理办法

Codeforces Round #259 (Div. 2) B. Little Pony and Sort by Shif

读LDD3，内存映射与DMA--PAGE_SHIF…

VMware虚拟机安装XP【要先分区，再设置BOOT 启动CD，shif+上移】

更换iBus五笔的左与右Shif

sublime ctrl+shif+f 没用解决办法

idea 对 ctrl + z 的撤销是 ctrl + shif + z

计算机最早的设计师应用于,计算机应用基础选择题doc.doc

win10自带截图神器：Win+Shift+S

Python基础之文件目录操作

python简述目录_Python基础之文件目录操作(示例代码)

tp5 如何做数据采集

任务2-7(服务器字体+阿里巴巴矢量库)

html标签（1)：h1~h6,p,br,pre,hr

TI 电量计介绍与芯片选型指南

几款TI电源芯片简介

TI DSP芯片C2000系列读取FLASH数据

德州仪器(Ti)平台嵌入式开发基础

TI三相电机智能栅极驱动芯片特点分类

省选模拟（12.08） T3 圈圈圈圈圈圈圈圈

Hadoop生态圈技术栈（上）

大数据开发基础入门与项目实战（三）Hadoop核心及生态圈技术栈之6.Impala交互式查询

小猿圈之Linux下Mysql 操作命令

大数据Hadoop生态圈常用面试题

大数据开发基础入门与项目实战（三）Hadoop核心及生态圈技术栈之4.Hive DDL、DQL和数据操作

备战Noip2018模拟赛11（B组）T3 Monogatari 物语

【智能优化算法-圆圈搜索算法】基于圆圈搜索算法Circle Search Algorithm求解单目标优化问题附matlab代码

NYOJ 78 圈水池

递归问题跑道汽车绕圈问题 Python实现

Hadoop生态圈（三）：MapReduce

Hive的order by、sort by、distribute by和cluster by[附实例]