大数据开发基础入门与项目实战(三)Hadoop核心及生态圈技术栈之6.Impala交互式查询

文章目录

  • 前言
  • 1.Impala-Shell
  • 2.Impala SQL语法
  • 3.导入数据以及JDBC方式查询Impala
  • 总结

前言

Impala的核心开发语言是SQL语句,Impala有shell命令行窗口、JDBC等方式来接收SQL语句执行,对于复杂类型分析可以使用C++或者Java来编写UDF函数。

Impala的SQL语法高度集成了Apache Hive的HQL语法,Impala支持Hive支持的数据类型以及部分Hive的内置函数。

需要注意:

  1. Impala与Hive类似它们的重点都是在与查询,所以像Update、Delete等具有更新性质的操作最好不要使用这种工具,对于删除数据的操作可以通过Drop Table、Alter Table Drop Partition来实现,更新可以尝试使用Insert overwrite方式;

  2. 通常使用Impala的⽅方式是数据文件存储在Hdfs文件系统,借助于Impala的表定义来查询和管理Hdfs上的数据文件;

  3. Impala的使用大多数与Hive相同,比如Impala同样支持内外部表、分区等,可以借鉴参考Hive的使用。

1.Impala-Shell

Impala-Shell命令分为外部命令和内部命令。

impala-shell外部命令是指不需要进入到impala-shell交互命令行当中即可执行的命令参数。impala-shell执行的时候可以带很多选项,可以在启动 impala-shell 时设置,用于修改命令执行环境。
impala-shell –h可以查看帮助手册。

几个常见的选项如下:

选项含义
impala-shell –r刷新impala元数据,与建立连接后执行REFRESH语句效果相同(元数据发生变化的时候)
impala-shell –f 文件路径执行指定的SQL查询文件
impala-shell –i指定连接运行impalad守护进程的主机;默认端口是21000,可以连接到集群中运行impalad的任意主机
impala-shell –o保存执行结果到文件中

使用impala-shell –r命令,如下:

[root@node03 ~]$ impala-shell -r
Starting Impala Shell without Kerberos authentication
Connected to node03:21000
Server version: impalad version 2.5.0-cdh5.7.6 RELEASE (build ecbba4f4e6d5eec6c33c1e02412621b8b9c71b6a)
Invalidating Metadata
***********************************************************************************
Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.
(Impala Shell v2.5.0-cdh5.7.6 (ecbba4f) built on Tue Feb 21 14:54:50 PST 2017)The '-B' command line flag turns off pretty-printing for query results. Use this
flag to remove formatting from results you want to save for later, or to benchmark
Impala.
***********************************************************************************
Query: invalidate metadataFetched 0 row(s) in 6.11s
[node03:21000] > exit;
Goodbye root

可以看到,在执行impala-shell –r命令后,刷新元数据的同时进入Impala命令行。

再使用impala-shell –f,先创建SQL文件vim impala_data/test.sql,内容为:

show databases;

然后执行如下:

[root@node03 ~]$ impala-shell -f impala_data/test.sql 
Starting Impala Shell without Kerberos authentication
Connected to node03:21000
Server version: impalad version 2.5.0-cdh5.7.6 RELEASE (build ecbba4f4e6d5eec6c33c1e02412621b8b9c71b6a)
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
| homework         |                                              |
| mydb             |                                              |
| mydb2            |                                              |
| sale             |                                              |
| tuning           |                                              |
+------------------+----------------------------------------------+
Fetched 7 row(s) in 0.02s

可以看到,查询到了所有数据库;

其中,_impala_builtins是Impala自带的系统数据库。

可以查看如下:

[node03:21000] > show functions;
Query: show functionsFetched 0 row(s) in 0.02s
[node03:21000] > use _impala_builtins;
Query: use _impala_builtins
[node03:21000] > show functions;
Query: show functions
+--------------+-------------------------------------------------+-------------+---------------+
| return type  | signature                                       | binary type | is persistent |
+--------------+-------------------------------------------------+-------------+---------------+
| BIGINT       | abs(BIGINT)                                     | BUILTIN     | true          |
| DECIMAL(*,*) | abs(DECIMAL(*,*))                               | BUILTIN     | true          |
| DOUBLE       | abs(DOUBLE)                                     | BUILTIN     | true          |
...
| INT          | zeroifnull(INT)                                 | BUILTIN     | true          |
| SMALLINT     | zeroifnull(SMALLINT)                            | BUILTIN     | true          |
| TINYINT      | zeroifnull(TINYINT)                             | BUILTIN     | true          |
+--------------+-------------------------------------------------+-------------+---------------+
Fetched 592 row(s) in 0.16s

可以看到,在系统数据库_impala_builtins中能查看到Impala自带的所有函数,在其他数据库中查看不到。

再使用impala-shell –i,如下:

[root@node03 ~]$ impala-shell -i node02
Starting Impala Shell without Kerberos authentication
Connected to node02:21000
Server version: impalad version 2.5.0-cdh5.7.6 RELEASE (build ecbba4f4e6d5eec6c33c1e02412621b8b9c71b6a)
***********************************************************************************
Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.
(Impala Shell v2.5.0-cdh5.7.6 (ecbba4f) built on Tue Feb 21 14:54:50 PST 2017)You can change the Impala daemon that you're connected to by using the CONNECT
command.To see how Impala will plan to run your query without actually executing
it, use the EXPLAIN command. You can change the level of detail in the EXPLAIN
output by setting the EXPLAIN_LEVEL query option.
***********************************************************************************
[node02:21000] > exit;
Goodbye root

可以看到,连接到了其他节点的Impala-Server;

如果不指定该选项,则默认使用本机的Impala-Server。

内部命令是指进入impala-shell命令行之后可以执行的语法。

常见的内部命令如下:

命令含义
help查看常用命令及帮助文档
connect hostname连接到指定的机器impalad上去执行
refresh dbname.tablename增量刷新,刷新某一张表的元数据,主要用于刷新Hive中数据表里面的数据改变的情况
invalidate metadata全量刷新,性能消耗较大,主要用于Hive中新建数据库或者数据库表的时候来进行刷新
quit/exit从Impala-Shell中退出
explain用于查看SQL语句的执行计划;
explain可以设置explan_level,用于设置输出内容的级别,包括0、1、2、3等几个值,其中3级别是最高的,可以打印出最全的信息
profile执行SQL语句之后执行,可以打印出更加详细的执行步骤,主要用于查询结果的查看、集群的调优等

使用如下:

[node03:21000] > help;Documented commands (type help <topic>):
========================================
compute   exit     history  quit    shell  unset   version
connect   explain  insert   select  show   use     with   
describe  help     profile  set     tip    valuesUndocumented commands:
======================
alter  create  desc  drop  load  summary[node03:21000] > help compute;
Executes a COMPUTE STATS query.Impala shell cannot get child query handle so it cannotquery live progress for COMPUTE STATS query. Disable liveprogress/summary callback for COMPUTE STATS query.
[node03:21000] > help alter;
*** No help on alter
[node03:21000] > refresh mydb.t1;
Query: refresh mydb.t1Fetched 0 row(s) in 0.98s
[node03:21000] > connect node01;
Connected to node01:21000
Server version: impalad version 2.5.0-cdh5.7.6 RELEASE (build ecbba4f4e6d5eec6c33c1e02412621b8b9c71b6a)
[node01:21000] > select * from t1;
Query: select * from t1
+--------------------+------+-----+--------+
| id                 | name | age | gender |
+--------------------+------+-----+--------+
| 392456197008193000 | 张三 | 20  | 0      |
| 267456198006210000 | 李四 | 25  | 1      |
| 892456199007203000 | 王五 | 24  | 1      |
| 492456198712198000 | 赵六 | 26  | 2      |
| 392456197008193000 | 张三 | 20  | 0      |
| 392456197008193000 | 张三 | 20  | 0      |
+--------------------+------+-----+--------+
Fetched 6 row(s) in 6.99s
[node01:21000] > explain select * from t1;
Query: explain select * from t1
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=32.00MB VCores=1                           |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| default.t1                                                                         |
|                                                                                    |
| 01:EXCHANGE [UNPARTITIONED]                                                        |
| |                                                                                  |
| 00:SCAN HDFS [default.t1]                                                          |
|    partitions=1/1 files=1 size=186B                                                |
+------------------------------------------------------------------------------------+
Fetched 8 row(s) in 0.06s
[node01:21000] > set explain_level = 3;
EXPLAIN_LEVEL set to 3
[node01:21000] > explain select * from t1;
Query: explain select * from t1
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=32.00MB VCores=1                           |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| default.t1                                                                         |
|                                                                                    |
| F01:PLAN FRAGMENT [UNPARTITIONED]                                                  |
|   01:EXCHANGE [UNPARTITIONED]                                                      |
|      hosts=1 per-host-mem=unavailable                                              |
|      tuple-ids=0 row-size=38B cardinality=unavailable                              |
|                                                                                    |
| F00:PLAN FRAGMENT [RANDOM]                                                         |
|   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED]                       |
|   00:SCAN HDFS [default.t1, RANDOM]                                                |
|      partitions=1/1 files=1 size=186B                                              |
|      table stats: unavailable                                                      |
|      column stats: unavailable                                                     |
|      hosts=1 per-host-mem=32.00MB                                                  |
|      tuple-ids=0 row-size=38B cardinality=unavailable                              |
+------------------------------------------------------------------------------------+
Fetched 17 row(s) in 0.03s
[node01:21000] > select * from t1;
Query: select * from t1
+--------------------+------+-----+--------+
| id                 | name | age | gender |
+--------------------+------+-----+--------+
| 392456197008193000 | 张三 | 20  | 0      |
| 267456198006210000 | 李四 | 25  | 1      |
| 892456199007203000 | 王五 | 24  | 1      |
| 492456198712198000 | 赵六 | 26  | 2      |
| 392456197008193000 | 张三 | 20  | 0      |
| 392456197008193000 | 张三 | 20  | 0      |
+--------------------+------+-----+--------+
Fetched 6 row(s) in 0.43s
[node01:21000] > profile;
Query Runtime Profile:
Query (id=944cd4a45137b96a:b8ab935b8d60adab):Summary:Session ID: e44a9c110e12d2cb:24ff32547ca3fc99Session Type: BEESWAXStart Time: 2021-10-11 11:20:45.844713000End Time: 2021-10-11 11:20:46.276056000Query Type: QUERYQuery State: FINISHEDQuery Status: OKImpala Version: impalad version 2.5.0-cdh5.7.6 RELEASE (build ecbba4f4e6d5eec6c33c1e02412621b8b9c71b6a)User: rootConnected User: rootDelegated User: Network Address: ::ffff:192.168.31.157:59392Default Db: defaultSql Statement: select * from t1Coordinator: node01:22000Query Options (non default): EXPLAIN_LEVEL=3Plan: 
----------------
Estimated Per-Host Requirements: Memory=32.00MB VCores=1
WARNING: The following tables are missing relevant table and/or column statistics.
default.t101:EXCHANGE [UNPARTITIONED]
|  hosts=1 per-host-mem=unavailable
|  tuple-ids=0 row-size=38B cardinality=unavailable
|
00:SCAN HDFS [default.t1, RANDOM]partitions=1/1 files=1 size=186Btable stats: unavailablecolumn stats: unavailablehosts=1 per-host-mem=32.00MBtuple-ids=0 row-size=38B cardinality=unavailable
----------------Estimated Per-Host Mem: 33554432Estimated Per-Host VCores: 1Tables Missing Stats: default.t1Request Pool: default-poolExecSummary: 
Operator       #Hosts   Avg Time   Max Time  #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail        
-------------------------------------------------------------------------------------------------------
01:EXCHANGE         1    8.275us    8.275us      6          -1         0        -1.00 B  UNPARTITIONED 
00:SCAN HDFS        1  158.588ms  158.588ms      6          -1  61.00 KB       32.00 MB  default.t1    Planner Timeline: 6.764ms- Analysis finished: 2.300ms (2.300ms)- Equivalence classes computed: 2.466ms (166.581us)- Single node plan created: 3.901ms (1.434ms)- Runtime filters computed: 3.917ms (16.218us)- Distributed plan created: 4.972ms (1.054ms)- Planning finished: 6.764ms (1.792ms)Query Timeline: 432.815ms- Start execution: 44.969us (44.969us)- Planning finished: 14.847ms (14.803ms)- Ready to start 1 remote fragments: 15.241ms (393.172us)- All 1 remote fragments started: 20.207ms (4.966ms)- Child queries finished: 20.823ms (615.552us)- Rows available: 344.381ms (323.558ms)- First row fetched: 428.666ms (84.284ms)- Unregister query: 431.338ms (2.672ms)ImpalaServer:- ClientFetchWaitTimer: 86.786ms- RowMaterializationTimer: 40.424usExecution Profile 944cd4a45137b96a:b8ab935b8d60adab:(Total: 329.230ms, non-child: 0.000ns, % non-child: 0.00%)Number of filters: 0Filter routing table: ID  Src. Node  Tgt. Node  Targets  Type   Partition filter
-----------------------------------------------------------Fragment start latencies: Count: 1, 25th %-ile: 3ms, 50th %-ile: 3ms, 75th %-ile: 3ms, 90th %-ile: 3ms, 95th %-ile: 3ms, 99.9th %-ile: 3msPer Node Peak Memory Usage: node03:22000(71.66 KB) node01:22000(0) - FiltersReceived: 0 (0)- FinalizationTimer: 0.000nsCoordinator Fragment F01:(Total: 323.661ms, non-child: 107.505us, % non-child: 0.03%)MemoryUsage(500.000ms): 8.00 KB- AverageThreadTokens: 0.00 - BloomFilterBytes: 0- PeakMemoryUsage: 12.05 KB (12336)- PerHostPeakMemUsage: 0- PrepareTime: 29.640us- RowsProduced: 6 (6)- TotalCpuTime: 90.045ms- TotalNetworkReceiveTime: 323.548ms- TotalNetworkSendTime: 0.000ns- TotalStorageWaitTime: 0.000nsBlockMgr:- BlockWritesOutstanding: 0 (0)- BlocksCreated: 0 (0)- BlocksRecycled: 0 (0)- BufferedPins: 0 (0)- BytesWritten: 0- MaxBlockSize: 8.00 MB (8388608)- MemoryLimit: 1.14 GB (1221100928)- PeakMemoryUsage: 0- TotalBufferWaitTime: 0.000ns- TotalEncryptionTime: 0.000ns- TotalIntegrityCheckTime: 0.000ns- TotalReadBlockTime: 0.000nsEXCHANGE_NODE (id=1):(Total: 323.554ms, non-child: 8.275us, % non-child: 0.00%)BytesReceived(500.000ms): 0- BytesReceived: 264.00 B (264)- ConvertRowBatchTime: 1.697us- DeserializeRowBatchTimer: 7.709us- FirstBatchArrivalWaitTime: 323.545ms- PeakMemoryUsage: 0- RowsReturned: 6 (6)- RowsReturnedRate: 18.00 /sec- SendersBlockedTimer: 0.000ns- SendersBlockedTotalTimer(*): 0.000nsAveraged Fragment F00:(Total: 161.640ms, non-child: 0.000ns, % non-child: 0.00%)split sizes:  min: 186.00 B, max: 186.00 B, avg: 186.00 B, stddev: 0completion times: min:326.012ms  max:326.012ms  mean: 326.012ms  stddev:0.000nsexecution rates: min:570.00 B/sec  max:570.00 B/sec  mean:570.00 B/sec  stddev:0.53 B/secnum instances: 1- AverageThreadTokens: 1.00 - BloomFilterBytes: 0- PeakMemoryUsage: 71.66 KB (73384)- PerHostPeakMemUsage: 71.66 KB (73384)- PrepareTime: 132.318ms- RowsProduced: 6 (6)- TotalCpuTime: 325.383ms- TotalNetworkReceiveTime: 0.000ns- TotalNetworkSendTime: 684.107us- TotalStorageWaitTime: 24.116msBlockMgr:- BlockWritesOutstanding: 0 (0)- BlocksCreated: 0 (0)- BlocksRecycled: 0 (0)- BufferedPins: 0 (0)- BytesWritten: 0- MaxBlockSize: 8.00 MB (8388608)- MemoryLimit: 1.14 GB (1221100928)- PeakMemoryUsage: 0- TotalBufferWaitTime: 0.000ns- TotalEncryptionTime: 0.000ns- TotalIntegrityCheckTime: 0.000ns- TotalReadBlockTime: 0.000nsCodeGen:(Total: 290.903ms, non-child: 290.903ms, % non-child: 100.00%)- CodegenTime: 4.578ms- CompileTime: 13.251ms- LoadTime: 0.000ns- ModuleBitcodeSize: 1.87 MB (1959112)- OptimizationTime: 149.882ms- PrepareTime: 123.054msDataStreamSender (dst_id=1):(Total: 165.873us, non-child: 165.873us, % non-child: 100.00%)- BytesSent: 264.00 B (264)- NetworkThroughput(*): 3.65 MB/sec- OverallThroughput: 1.52 MB/sec- PeakMemoryUsage: 2.66 KB (2728)- RowsReturned: 6 (6)- SerializeBatchTime: 40.210us- TransmitDataRPCTime: 68.991us- UncompressedRowBatchSize: 460.00 B (460)HDFS_SCAN_NODE (id=0):(Total: 158.588ms, non-child: 158.588ms, % non-child: 100.00%)- AverageHdfsReadThreadConcurrency: 0.00 - AverageScannerThreadConcurrency: 0.00 - BytesRead: 186.00 B (186)- BytesReadDataNodeCache: 0- BytesReadLocal: 186.00 B (186)- BytesReadRemoteUnexpected: 0- BytesReadShortCircuit: 186.00 B (186)- DecompressionTime: 0.000ns- MaxCompressedTextFileLength: 0- NumDisksAccessed: 1 (1)- NumScannerThreadsStarted: 1 (1)- PeakMemoryUsage: 61.00 KB (62464)- PerReadThreadRawHdfsThroughput: 3.53 MB/sec- RemoteScanRanges: 0 (0)- RowsRead: 6 (6)- RowsReturned: 6 (6)- RowsReturnedRate: 37.00 /sec- ScanRangesComplete: 1 (1)- ScannerThreadsInvoluntaryContextSwitches: 0 (0)- ScannerThreadsTotalWallClockTime: 24.753ms- DelimiterParseTime: 51.707us- MaterializeTupleTime(*): 903.000ns- ScannerThreadsSysTime: 351.000us- ScannerThreadsUserTime: 0.000ns- ScannerThreadsVoluntaryContextSwitches: 3 (3)- TotalRawHdfsReadTime(*): 50.213us- TotalReadThroughput: 0.00 /secFragment F00:Instance 944cd4a45137b96a:b8ab935b8d60adad (host=node03:22000):(Total: 161.640ms, non-child: 0.000ns, % non-child: 0.00%)Hdfs split stats (<volume id>:<# splits>/): 0:1/186.00 B MemoryUsage(500.000ms): 0ThreadUsage(500.000ms): 1- AverageThreadTokens: 1.00 - BloomFilterBytes: 0- PeakMemoryUsage: 71.66 KB (73384)- PerHostPeakMemUsage: 71.66 KB (73384)- PrepareTime: 132.318ms- RowsProduced: 6 (6)- TotalCpuTime: 325.383ms- TotalNetworkReceiveTime: 0.000ns- TotalNetworkSendTime: 684.107us- TotalStorageWaitTime: 24.116msBlockMgr:- BlockWritesOutstanding: 0 (0)- BlocksCreated: 0 (0)- BlocksRecycled: 0 (0)- BufferedPins: 0 (0)- BytesWritten: 0- MaxBlockSize: 8.00 MB (8388608)- MemoryLimit: 1.14 GB (1221100928)- PeakMemoryUsage: 0- TotalBufferWaitTime: 0.000ns- TotalEncryptionTime: 0.000ns- TotalIntegrityCheckTime: 0.000ns- TotalReadBlockTime: 0.000nsCodeGen:(Total: 290.903ms, non-child: 290.903ms, % non-child: 100.00%)- CodegenTime: 4.578ms- CompileTime: 13.251ms- LoadTime: 0.000ns- ModuleBitcodeSize: 1.87 MB (1959112)- OptimizationTime: 149.882ms- PrepareTime: 123.054msDataStreamSender (dst_id=1):(Total: 165.873us, non-child: 165.873us, % non-child: 100.00%)- BytesSent: 264.00 B (264)- NetworkThroughput(*): 3.65 MB/sec- OverallThroughput: 1.52 MB/sec- PeakMemoryUsage: 2.66 KB (2728)- RowsReturned: 6 (6)- SerializeBatchTime: 40.210us- TransmitDataRPCTime: 68.991us- UncompressedRowBatchSize: 460.00 B (460)HDFS_SCAN_NODE (id=0):(Total: 158.588ms, non-child: 158.588ms, % non-child: 100.00%)ExecOption: Expr Evaluation Codegen Disabled, Codegen enabled: 1 out of 1Hdfs split stats (<volume id>:<# splits>/): 0:1/186.00 B Hdfs Read Thread Concurrency Bucket: 0:0% 1:0% 2:0% 3:0% 4:0% 5:0% File Formats: TEXT/NONE:1 BytesRead(500.000ms): 0- AverageHdfsReadThreadConcurrency: 0.00 - AverageScannerThreadConcurrency: 0.00 - BytesRead: 186.00 B (186)- BytesReadDataNodeCache: 0- BytesReadLocal: 186.00 B (186)- BytesReadRemoteUnexpected: 0- BytesReadShortCircuit: 186.00 B (186)- DecompressionTime: 0.000ns- MaxCompressedTextFileLength: 0- NumDisksAccessed: 1 (1)- NumScannerThreadsStarted: 1 (1)- PeakMemoryUsage: 61.00 KB (62464)- PerReadThreadRawHdfsThroughput: 3.53 MB/sec- RemoteScanRanges: 0 (0)- RowsRead: 6 (6)- RowsReturned: 6 (6)- RowsReturnedRate: 37.00 /sec- ScanRangesComplete: 1 (1)- ScannerThreadsInvoluntaryContextSwitches: 0 (0)- ScannerThreadsTotalWallClockTime: 24.753ms- DelimiterParseTime: 51.707us- MaterializeTupleTime(*): 903.000ns- ScannerThreadsSysTime: 351.000us- ScannerThreadsUserTime: 0.000ns- ScannerThreadsVoluntaryContextSwitches: 3 (3)- TotalRawHdfsReadTime(*): 50.213us- TotalReadThroughput: 0.00 /sec

其中,help命令输出的常用命令中,包含了文档化命令(可以使用help查看具体用法)和非文档化命令(不能使用help查看具体用法);

refresh命令和invalidate metadata命令作用相同,都可以用于更新元数据,但是前者可以指定某张表进行更新(增量更新),而后者只能全部更新(全量更新),性能表现也不同;

使用explain查看查询计划时,SCAN HDFS用于扫表,partitions用于指定分区,同时指定了文件数量和大小,可以进行验证:

[root@node03 ~]$ hdfs dfs -ls /user/impala/t1
Found 1 items
-rw-r--r--   3 root supergroup        186 2021-10-10 21:44 /user/impala/t1/user.csv

可以看到,结果是相同的。

同时,设置explain的输出级别为3后,信息更多。

explain和profile的比较:

expalin:可以不真正执行任务,只是展示任务的执行计划;
profile:需要任务执行完成后调用,可以从更底层、更详细的层面来观察运行Impala的任务,进行调优。

2.Impala SQL语法

(1)数据库操作

CREATE DATABASE语句用于在Impala中创建新数据库:
CREATE DATABASE IF NOT EXISTS database_name;中,IF NOT EXISTS是一个可选的子句,如果使用此子句,则只有在没有同名的现有数据库时,才会创建具有给定名称的数据库。

如下:

[node03:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
| homework         |                                              |
| mydb             |                                              |
| mydb2            |                                              |
| sale             |                                              |
| tuning           |                                              |
+------------------+----------------------------------------------+
Fetched 7 row(s) in 0.01s
[node03:21000] > create database if not exists cltest;
Query: create database if not exists cltestFetched 0 row(s) in 0.79s
[node03:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| cltest           |                                              |
| default          | Default Hive database                        |
| homework         |                                              |
| mydb             |                                              |
| mydb2            |                                              |
| sale             |                                              |
| tuning           |                                              |
+------------------+----------------------------------------------+
Fetched 8 row(s) in 0.00s

可以看到,创建了数据库。

创建数据库之后,默认就会在HDFS中Hive的数仓路径下创建新的数据库名文件夹,如下:

[root@node03 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 7 items
drwxrwxrwx   - impala    supergroup          0 2021-10-11 13:47 /user/hive/warehouse/cltest.db
drwxrwxrwx   - root      supergroup          0 2021-10-05 20:16 /user/hive/warehouse/homework.db
drwxrwxrwx   - root      supergroup          0 2021-10-10 22:13 /user/hive/warehouse/mydb.db
drwxrwxrwx   - anonymous supergroup          0 2021-09-28 20:30 /user/hive/warehouse/mydb2.db
drwxrwxrwx   - root      supergroup          0 2021-10-03 13:06 /user/hive/warehouse/sale.db
drwxrwxrwx   - root      supergroup          0 2021-09-28 21:00 /user/hive/warehouse/t2
drwxrwxrwx   - root      supergroup          0 2021-10-08 23:44 /user/hive/warehouse/tuning.db

Impala的DROP DATABASE语句用于从Impala中删除数据库,在删除数据库之前,建议从中删除所有表;
如果使用级联删除cascade,可以直接删除包含了表和数据的数据库。

如下:

[node03:21000] > drop database cltest cascade;
Query: drop database cltest cascade
[node03:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
| homework         |                                              |
| mydb             |                                              |
| mydb2            |                                              |
| sale             |                                              |
| tuning           |                                              |
+------------------+----------------------------------------------+
Fetched 7 row(s) in 0.01s

(2)表操作

CREATE TABLE语句用于在Impala中的所需数据库中创建新表,需要指定表名字并定义其列和每列的数据类型。
Impala支持的数据类型和Hive类似;

Impala对复杂数据类型的支持:
对于Text存储格式中的复杂类型不支持,复杂类型要使用parquet格式。

语法如下:

create table IF NOT EXISTS database_name.table_name(column1 data_type,column2 data_type,column3 data_type,...,columnN data_type
);

举例如下:

[node03:21000] > create table if not exists mydb.student(>   name string,>   age int,>   concat int> );
Query: create table if not exists mydb.student(name string,age int,concat int
)Fetched 0 row(s) in 0.31s
[node03:21000] > use mydb;
Query: use mydb
[node03:21000] > show tables;
Query: show tables
+-----------------+
| name            |
+-----------------+
| course          |
| emp             |
| goodtbl         |
| line2row        |
| row2line        |
| stu             |
| student         |
| studscore       |
| t1              |
| t2              |
| t3              |
| tab1            |
| taba            |
| tabb            |
| tabc            |
| tabd            |
| tabe            |
| temp2           |
| u1              |
| u2              |
| uaction_orc     |
| uaction_parquet |
| uaction_text    |
| ulogin          |
| userpv          |
| zxz_data        |
+-----------------+
Fetched 26 row(s) in 0.01s

默认建表的数据存储路径与Hive一致,也可以在建表的时候通过location指定具体路径。

Impala的INSERT语句有两个子句,into和overwrite:
into用于插入新记录数据;
overwrite用于覆盖已有的记录。

insert into语句的语法如下:

-- 形式1
insert into table_name(column1, column2, column3,...columnN) values(value1,value2, value3,...valueN);
-- 形式2
insert into table_name values(value1, value2, value2);

这里,column1、column2、…、columnN是要插入数据的表中的列的名称;
还可以添加值而不指定列名,但是,需要确保值的顺序与表中的列的顺序相同。

举例如下:

[node03:21000] > create table employee(>   Id INT,>   name STRING,>   age INT,>   address STRING,>   salary BIGINT> );
Query: create table employee(Id INT,name STRING,age INT,address STRING,salary BIGINT
)Fetched 0 row(s) in 0.40s
[node03:21000] > insert into employee VALUES (1, 'Ramesh', 32, 'Ahmedabad', 20000);
Query: insert into employee VALUES (1, 'Ramesh', 32, 'Ahmedabad', 20000)
Inserted 1 row(s) in 2.46s
[node03:21000] > insert into employee values (2, 'Khilan', 25, 'Delhi', 15000);
Query: insert into employee values (2, 'Khilan', 25, 'Delhi', 15000 )
Inserted 1 row(s) in 0.32s
[node03:21000] > Insert into employee values (3, 'kaushik', 23, 'Kota', 30000);
Query: insert into employee values (3, 'kaushik', 23, 'Kota', 30000 )
Inserted 1 row(s) in 0.32s
[node03:21000] > Insert into employee values (4, 'Chaitali', 25, 'Mumbai', 35000);
Query: insert into employee values (4, 'Chaitali', 25, 'Mumbai', 35000 )
Inserted 1 row(s) in 0.21s
[node03:21000] > Insert into employee values (5, 'Hardik', 27, 'Bhopal', 40000);
Query: insert into employee values (5, 'Hardik', 27, 'Bhopal', 40000 )
Inserted 1 row(s) in 0.46s
[node03:21000] > Insert into employee values (6, 'Komal', 22, 'MP', 32000);
Query: insert into employee values (6, 'Komal', 22, 'MP', 32000)
Inserted 1 row(s) in 0.33s

overwrite覆盖子句覆盖表当中全部记录,被覆盖的记录将从表中永久删除。

举例如下:

insert overwrite employee values (1, 'Ram', 26, 'Vishakhapatnam', 37000);

Impala SELECT语句用于从数据库查询数据,并以表的形式返回数据。

如下:

[node03:21000] > select * from t3;
Query: select * from t3
+--------------------+------+-----+--------+
| id                 | name | age | gender |
+--------------------+------+-----+--------+
| 392456197008193000 | 张三 | 20  | 0      |
| 392456197008193000 | 张三 | 20  | 0      |
| 392456197008193000 | 张三 | 20  | 0      |
+--------------------+------+-----+--------+
Fetched 3 row(s) in 2.85s

Impala中的describe/desc语句用于提供表的描述,结果包含有关表的信息,例如列名称及其数据类型。s

使用如下:

[node03:21000] > desc employee;
Query: describe employee
+---------+--------+---------+
| name    | type   | comment |
+---------+--------+---------+
| id      | int    |         |
| name    | string |         |
| age     | int    |         |
| address | string |         |
| salary  | bigint |         |
+---------+--------+---------+
Fetched 5 row(s) in 0.01s
[node03:21000] > desc formatted employee;
Query: describe formatted employee
+------------------------------+------------------------------------------------------------+----------------------+
| name                         | type                                                       | comment              |
+------------------------------+------------------------------------------------------------+----------------------+
| # col_name                   | data_type                                                  | comment              |
|                              | NULL                                                       | NULL                 |
| id                           | int                                                        | NULL                 |
| name                         | string                                                     | NULL                 |
| age                          | int                                                        | NULL                 |
| address                      | string                                                     | NULL                 |
| salary                       | bigint                                                     | NULL                 |
|                              | NULL                                                       | NULL                 |
| # Detailed Table Information | NULL                                                       | NULL                 |
| Database:                    | mydb                                                       | NULL                 |
| Owner:                       | root                                                       | NULL                 |
| CreateTime:                  | Mon Oct 11 13:58:02 CST 2021                               | NULL                 |
| LastAccessTime:              | UNKNOWN                                                    | NULL                 |
| Protect Mode:                | None                                                       | NULL                 |
| Retention:                   | 0                                                          | NULL                 |
| Location:                    | hdfs://node01:9000/user/hive/warehouse/mydb.db/employee    | NULL                 |
| Table Type:                  | MANAGED_TABLE                                              | NULL                 |
| Table Parameters:            | NULL                                                       | NULL                 |
|                              | transient_lastDdlTime                                      | 1633931882           |
|                              | NULL                                                       | NULL                 |
| # Storage Information        | NULL                                                       | NULL                 |
| SerDe Library:               | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe         | NULL                 |
| InputFormat:                 | org.apache.hadoop.mapred.TextInputFormat                   | NULL                 |
| OutputFormat:                | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL                 |
| Compressed:                  | No                                                         | NULL                 |
| Num Buckets:                 | 0                                                          | NULL                 |
| Bucket Columns:              | []                                                         | NULL                 |
| Sort Columns:                | []                                                         | NULL                 |
+------------------------------+------------------------------------------------------------+----------------------+
Fetched 28 row(s) in 0.03s

Impala中的Alter table语句用于对给定表执行更改,可以添加、删除或修改现有表中的列,也可以重命名它们。

需要注意,Impala中可以修改表,但是不建议在Impala中进行,而是将其视为专门的查询工具,更新操作可以在Hive中进行。

Impala drop table语句用于删除Impala中的现有表,会从数据库中删除完整的表结构,同时还会删除内部表的底层HDFS文件。

语法如下:

drop table database_name.table_name;

注意:
使用此命令时必须小心,因为删除表后,表中可用的所有信息也将永远丢失。

Impala的Truncate Table语句用于从现有表中删除所有记录,保留表结构。

语法格式如下:

truncate table_name;

drop tabletruncate的使用举例如下:

[node03:21000] > drop table temp2;
Query: drop table temp2
[node03:21000] > show tables;
Query: show tables
+-----------------+
| name            |
+-----------------+
| course          |
| emp             |
| employee        |
| goodtbl         |
| line2row        |
| row2line        |
| stu             |
| student         |
| studscore       |
| t1              |
| t2              |
| t3              |
| tab1            |
| taba            |
| tabb            |
| tabc            |
| tabd            |
| tabe            |
| u1              |
| u2              |
| uaction_orc     |
| uaction_parquet |
| uaction_text    |
| ulogin          |
| userpv          |
| zxz_data        |
+-----------------+
Fetched 26 row(s) in 0.01s
[node03:21000] > select * from t3;
Query: select * from t3
+--------------------+------+-----+--------+
| id                 | name | age | gender |
+--------------------+------+-----+--------+
| 392456197008193000 | 张三 | 20  | 0      |
| 392456197008193000 | 张三 | 20  | 0      |
| 392456197008193000 | 张三 | 20  | 0      |
+--------------------+------+-----+--------+
Fetched 3 row(s) in 2.85s
[node03:21000] > truncate t3;
Query: truncate t3Fetched 0 row(s) in 0.22s
[node03:21000] > select * from t3;
Query: select * from t3Fetched 0 row(s) in 0.15s

视图是存储在数据库中具有关联名称的Impala查询语言的语句,是以预定义的SQL查询形式的表的组合。
视图可以包含表的所有行或选定的行。

创建、修改和删除视图的语法如下:

-- 创建视图
create view if not exists view_name as select statement;
-- 修改视图
alter view database_name.view_name as select statement;
-- 删除视图
drop view database_name.view_name;

使用如下:

[node03:21000] > create view if not exists emp_view as select * from employee where salary > 20000;
Query: create view if not exists emp_view as select * from employee where salary > 20000Fetched 0 row(s) in 0.14s
[node03:21000] > show tables;
Query: show tables
+-----------------+
| name            |
+-----------------+
| course          |
| emp             |
| emp_view        |
| employee        |
| goodtbl         |
| line2row        |
| row2line        |
| stu             |
| student         |
| studscore       |
| t1              |
| t2              |
| t3              |
| tab1            |
| taba            |
| tabb            |
| tabc            |
| tabd            |
| tabe            |
| u1              |
| u2              |
| uaction_orc     |
| uaction_parquet |
| uaction_text    |
| ulogin          |
| userpv          |
| zxz_data        |
+-----------------+
Fetched 27 row(s) in 0.02s
[node03:21000] > select * from emp_view;
Query: select * from emp_view
+----+----------+-----+---------+--------+
| id | name     | age | address | salary |
+----+----------+-----+---------+--------+
| 3  | kaushik  | 23  | Kota    | 30000  |
| 4  | Chaitali | 25  | Mumbai  | 35000  |
| 5  | Hardik   | 27  | Bhopal  | 40000  |
| 6  | Komal    | 22  | MP      | 32000  |
+----+----------+-----+---------+--------+
Fetched 4 row(s) in 4.14s
[node03:21000] > drop view emp_view;
Query: drop view emp_view
[node03:21000] > show tables;
Query: show tables
+-----------------+
| name            |
+-----------------+
| course          |
| emp             |
| employee        |
| goodtbl         |
| line2row        |
| row2line        |
| stu             |
| student         |
| studscore       |
| t1              |
| t2              |
| t3              |
| tab1            |
| taba            |
| tabb            |
| tabc            |
| tabd            |
| tabe            |
| u1              |
| u2              |
| uaction_orc     |
| uaction_parquet |
| uaction_text    |
| ulogin          |
| userpv          |
| zxz_data        |
+-----------------+
Fetched 26 row(s) in 0.01s

Impala ORDER BY子句用于根据一个或多个列以升序或降序对数据进行排序。

语法如下:

select * from table_name ORDER BY col_name [ASC|DESC] [NULLS FIRST|NULLS LAST];

默认情况下,一些数据库按升序对查询结果进行排序,可以使用关键字ASC(默认)或DESC分别按升序或降序排列表中的数据。
如果使用NULLS FIRST,表中的所有空值都排列在顶行;
如果我们使用NULLS LAST,包含空值的行将最后排列。

Impala GROUP BY子句与SELECT语句协作使用,以将相同的数据排列到组中。

having子句用于对分组等查询过后得到的数据进行筛选;
容易与where过滤混淆,其区别如下:
where滤的数据是原始数据,表中本来就存在的数据;
having过滤的是查询结果数据。

Impala中的limit子句用于将结果集的行数限制为所需的数,即查询的结果集不包含超过指定限制的记录。
一般来说,select查询的resultset中的行从0开始,使用offset子句,可以决定从什么位置开始输出。

order by、group by和having的使用如下:

[node03:21000] > select * from employee order by salary desc;
Query: select * from employee order by salary desc
+----+----------+-----+-----------+--------+
| id | name     | age | address   | salary |
+----+----------+-----+-----------+--------+
| 5  | Hardik   | 27  | Bhopal    | 40000  |
| 4  | Chaitali | 25  | Mumbai    | 35000  |
| 6  | Komal    | 22  | MP        | 32000  |
| 3  | kaushik  | 23  | Kota      | 30000  |
| 1  | Ramesh   | 32  | Ahmedabad | 20000  |
| 2  | Khilan   | 25  | Delhi     | 15000  |
+----+----------+-----+-----------+--------+
Fetched 6 row(s) in 0.87s
[node03:21000] > select name, sum(salary) from employee group by name;
Query: select name, sum(salary) from employee group by name
+----------+-------------+
| name     | sum(salary) |
+----------+-------------+
| Ramesh   | 20000       |
| Hardik   | 40000       |
| Komal    | 32000       |
| Chaitali | 35000       |
| kaushik  | 30000       |
| Khilan   | 15000       |
+----------+-------------+
Fetched 6 row(s) in 1.28s
[node03:21000] > select age, max(salary) from employee group by age having max(salary) > 20000;
Query: select age, max(salary) from employee group by age having max(salary) > 20000
+-----+-------------+
| age | max(salary) |
+-----+-------------+
| 23  | 30000       |
| 25  | 35000       |
| 27  | 40000       |
| 22  | 32000       |
+-----+-------------+
Fetched 4 row(s) in 0.67s
[node03:21000] > select * from employee order by salarylimit 2 offset 0;
Query: select * from employee order by salarylimit 2 offset 0
ERROR: AnalysisException: Syntax error in line 1:
select * from employee order by salarylimit 2 offset 0^
Encountered: INTEGER LITERAL
Expected: AND, AS, ASC, BETWEEN, CROSS, DESC, DIV, ELSE, END, FOLLOWING, FROM, FULL, GROUP, HAVING, ILIKE, IN, INNER, IREGEXP, IS, JOIN, LEFT, LIKE, LIMIT, LOCATION, NOT, NULLS, OFFSET, ON, OR, ORDER, PRECEDING, RANGE, REGEXP, RIGHT, RLIKE, ROWS, STRAIGHT_JOIN, THEN, UNION, USING, WHEN, WHERE, COMMA, IDENTIFIERCAUSED BY: Exception: Syntax error[node03:21000] > select * from employee order by salary limit 2 offset 0;
Query: select * from employee order by salary limit 2 offset 0
+----+--------+-----+-----------+--------+
| id | name   | age | address   | salary |
+----+--------+-----+-----------+--------+
| 2  | Khilan | 25  | Delhi     | 15000  |
| 1  | Ramesh | 32  | Ahmedabad | 20000  |
+----+--------+-----+-----------+--------+
Fetched 2 row(s) in 6.16s
[node03:21000] > select * from employee order by salary limit 2 offset 2;
Query: select * from employee order by salary limit 2 offset 2
+----+---------+-----+---------+--------+
| id | name    | age | address | salary |
+----+---------+-----+---------+--------+
| 3  | kaushik | 23  | Kota    | 30000  |
| 6  | Komal   | 22  | MP      | 32000  |
+----+---------+-----+---------+--------+
Fetched 2 row(s) in 2.53s
[node03:21000] > select * from employee order by salary limit 2 offset 4;
Query: select * from employee order by salary limit 2 offset 4
+----+----------+-----+---------+--------+
| id | name     | age | address | salary |
+----+----------+-----+---------+--------+
| 4  | Chaitali | 25  | Mumbai  | 35000  |
| 5  | Hardik   | 27  | Bhopal  | 40000  |
+----+----------+-----+---------+--------+
Fetched 2 row(s) in 2.45s
[node03:21000] > select * from employee order by salary limit 2 offset 6;
Query: select * from employee order by salary limit 2 offset 6Fetched 0 row(s) in 1.47s

可以看到,limit... offset...语句实现了分页的效果;

需要注意,使⽤用offset关键字要求结果数据必须经过排序。

3.导入数据以及JDBC方式查询Impala

和Hive类似,导入数据也有多种方式:

方式含义
insert into values类似于RDBMS的数据插入方式,插入数据时手动指定数据
insert into select插入一张表的数据来自于后面的select查询语句返回的结果
create table as select建表的字段个数、类型、数据来自于后续的select查询语句
load data从HDFS文件中导入数据,该方式不建议在Impala中使用,可以先用load data方式把数据加载到Hive表中,再使用上面的方式插入Impala表中

在实际工作当中,因为Impala的查询比较快,所以可能会使用Impala来做数据库查询的情况,可以通过Java代码来进行Impala查询。

创建一个Maven项目ImpalaJDBC,依赖如下:


<project xmlns="http://maven.apache.org/POM/4.0.0"xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"><modelVersion>4.0.0modelVersion><groupId>com.bigdata.impalagroupId><artifactId>ImpalaJDBCartifactId><version>1.0-SNAPSHOTversion><properties><maven.compiler.source>8maven.compiler.source><maven.compiler.target>8maven.compiler.target>properties><dependencies><dependency><groupId>org.apache.hadoopgroupId><artifactId>hadoop-commonartifactId><version>2.9.2version>dependency><dependency><groupId>org.apache.hivegroupId><artifactId>hive-commonartifactId><version>2.3.7version>dependency><dependency><groupId>org.apache.hivegroupId><artifactId>hive-metastoreartifactId><version>2.3.7version>dependency><dependency><groupId>org.apache.hivegroupId><artifactId>hive-serviceartifactId><version>2.3.7version>dependency><dependency><groupId>org.apache.hivegroupId><artifactId>hive-jdbcartifactId><version>2.3.7version>dependency><dependency><groupId>org.apache.hivegroupId><artifactId>hive-execartifactId><version>2.3.7version>dependency>dependencies>
project>

等待下载并加载所有依赖。

创建包com.bigdata.impala,impala包下创建类ImpalaTest,如下:

package com.bigdata.impala;import java.sql.*;/*** @author Corley* @date 2021/10/11 15:41* @description ImpalaJDBC-com.bigdata.impala*/
public class ImpalaTest {public static void main(String[] args) throws ClassNotFoundException, SQLException {// 定义连接Impala的驱动和连接urlString driver = "org.apache.hive.jdbc.HiveDriver";String driverUrl = "jdbc:hive2://node02:21050/mydb;auth=noSasl";// 查询的sql语句String querySql = "select * from employee;";// 加载驱动Class.forName(driver);// 通过DriverManager获取连接final Connection connection = DriverManager.getConnection(driverUrl);final PreparedStatement ps = connection.prepareStatement(querySql);// 执行查询final ResultSet resultSet = ps.executeQuery();// 解析返回结果// 获取到每条数据的列数final int columnCount = resultSet.getMetaData().getColumnCount();// 遍历结果集while (resultSet.next()) {for (int i = 1; i <= columnCount; i++) {final String string = resultSet.getString(i);System.out.print(string + "\t");}System.out.println();}//关闭资源ps.close();connection.close();}
}

输出:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/Users/LENOVO/.m2/repository/org/slf4j/slf4j-log4j12/1.7.25/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/Users/LENOVO/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.6.2/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
1  Ramesh  32  Ahmedabad  20000  
3  kaushik  23  Kota  30000  
4  Chaitali  25  Mumbai  35000  
5  Hardik  27  Bhopal  40000  
6  Komal  22  MP  32000  
2  Khilan  25  Delhi  15000  

可以看到,获取到了查询结果。

总结

Impala也是一个交互式查询引擎,可以实现高效查询,在于Hive之间选择时,可以根据实际需求进行选择。


本文来自互联网用户投稿,文章观点仅代表作者本人,不代表本站立场,不承担相关法律责任。如若转载,请注明出处。 如若内容造成侵权/违法违规/事实不符,请点击【内容举报】进行投诉反馈!

相关文章

立即
投稿

微信公众账号

微信扫一扫加关注

返回
顶部