Component architecture:
HiveServer2 (with the beeline client), Hive itself, and the metastore database (metadb).

Execution Engine – The component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages on the appropriate system components.
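The stage DAG for a query can be inspected with `EXPLAIN` (a minimal sketch; `test` is a hypothetical table):

```sql
-- Prints the plan the execution engine will run: a list of stages
-- (e.g. a map-reduce stage and a fetch stage) and their dependencies.
EXPLAIN
select count(*) from test;
```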
Connecting to HiveServer2:
Via a GUI, the CLI, or JDBC (beeline).

Data sources:
Data is brought in with Kafka, Sqoop, and similar tools and landed in HDFS; it comes in all kinds of structures:
relational database tables, MongoDB or JSON data, or logs.

Running HQL:
Behind the scenes, MapReduce or Tez jobs are run (much as Pig executes Pig Latin scripts). For example:

```sql
insert into test values("wangyuq","123");
```
You can follow the job via its tracking URL.

Staging:
Before your data is moved to its destination, it will sit in a staging area for a while; the staging files are eventually discarded.

Comparison:
Pig is a good ETL tool for processing unstructured data.
Hive is not a relational database; it only maintains metadata for data stored in HDFS, which makes working with big data feel like running SQL against tables (though HQL differs slightly from SQL). It lets us express MapReduce jobs in SQL and query data that lives on HDFS.
Hive stores table definitions in the metastore. The default metastore database is Derby, but it can be swapped for another database.

Drawbacks:
Hive makes no promises about optimization; it favors simplicity, so it cannot support real-time workloads and its performance is comparatively poor.
Indexes and views are limited (partitions and buckets make up for this).
Its data types are not exactly the same as SQL's.

Relationship with HDFS:
Hive sits on top of HDFS: the data is stored on HDFS, while the schema is kept in the metastore.
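One way to see this split (a sketch; `test` is a hypothetical table name):

```sql
-- Prints the metastore-side schema together with HDFS-side details,
-- including a Location: line pointing at the data's directory in HDFS.
DESCRIBE FORMATTED test;
```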
The LOAD statement: it moves data from an HDFS path into Hive, so the data no longer exists at its original HDFS location; the actual files are simply moved into Hive's warehouse directory.
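A minimal sketch of such a load (the path and table name here are hypothetical):

```sql
-- Moves (not copies) the files at the given HDFS path into the table's
-- warehouse directory; the source location is left empty afterwards.
LOAD DATA INPATH '/tmp/events' INTO TABLE events;
-- With LOAD DATA LOCAL INPATH, the file is copied from the
-- local filesystem instead of being moved.
```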
- Making Multiple Passes over the Same Data
Hive has a special syntax for producing multiple aggregations from a single pass through a source of data, rather than rescanning it for each aggregation. This change can save considerable processing time for large input data sets.
So the following style is more efficient, and parallel execution can be turned on:
```sql
FROM pv_users
```

```sql
set hive.exec.parallel=true; -- enable parallel task execution
```
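The multi-insert syntax this refers to looks like the following (a sketch based on the usual `pv_users` example in the Hive documentation; the column names and target tables are assumptions):

```sql
-- One scan over pv_users feeds two aggregations; with
-- hive.exec.parallel=true the independent stages may run concurrently.
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT gender, count(DISTINCT userid)
  GROUP BY gender
INSERT OVERWRITE TABLE pv_age_sum
  SELECT age, count(DISTINCT userid)
  GROUP BY age;
```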
Date handling

Get the date from N days ago:

```sql
select from_unixtime(unix_timestamp('20111102','yyyyMMdd') - N*86400,'yyyyMMdd') from t_lxw_test1 limit 1;
```
Get the number of days/seconds/minutes between two dates:

```sql
select ( unix_timestamp('2011-11-02','yyyy-MM-dd')-unix_timestamp('2011-11-01','yyyy-MM-dd') ) / 86400 from t_lxw_test limit 1;
```
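For whole days, the built-in `datediff` is simpler (minutes or seconds still need the `unix_timestamp` subtraction above):

```sql
-- datediff(enddate, startdate) returns the difference in days.
select datediff('2011-11-02', '2011-11-01');  -- 1
```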
left outer join

```sql
-- query 1
select count(id) from
(select id from a left outer join b
 on a.id=b.id and b.date='2017-10-27'
 where to_date(a.adate) >= '2017-10-27' and a.date='2017-07-24'
) a1
```
```sql
-- query 2
select count(id) from
(select id from a left outer join b
 on a.id=b.id and b.date='2017-10-27' and a.date='2017-07-24'
 where to_date(a.adate) >= '2017-10-27'
) a
```

What is the difference? The WHERE clause holds filter conditions. In query 1, `a.date='2017-07-24'` lets the Partition Pruner prune partitions before the table scan, so only the data under '2017-07-24' is joined with b.
Query 2, by contrast, reads the data under all partitions of a and then joins with b. Given the join condition, rows only actually match when `a.date='2017-07-24'`; in every other case, since this is a left outer join, the right side is padded with NULL.

Regular expressions
Java-style regex matching works:

```sql
name rlike '^[\\u4e00-\\u9fa5]+$'                      -- Chinese characters only
select mobile from phone where mobile rlike '^\\d+$';  -- digits only
```
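Besides `rlike`, Hive's `regexp_extract` and `regexp_replace` built-ins cover extraction and substitution (these examples follow the Hive function documentation):

```sql
select regexp_extract('100-200', '(\\d+)-(\\d+)', 1);  -- returns '100'
select regexp_replace('foobar', 'oo|ar', '');          -- returns 'fb'
```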
- External table

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS push_log(
  plmn STRING)  -- hostid/dayid come from the partition spec, so they must not be listed as regular columns
COMMENT 'log table'
PARTITIONED BY (hostid STRING, dayid STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE
LOCATION '/user/data/push';
```
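Partitions can also be registered in bulk with `MSCK REPAIR TABLE`; note this only discovers directories laid out in `key=value` form (`hostid=.../dayid=...`), not a plain `/$hostid/$dayid` layout, in which case an explicit `ALTER TABLE ... ADD PARTITION` is needed:

```sql
-- Scans the table's HDFS location and adds any partition directories
-- named in key=value form (hostid=.../dayid=...) to the metastore.
MSCK REPAIR TABLE push_log;
```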
```sql
alter table push_log add partition(hostid='$hostid', dayid='$dayid') location '/user/data/push/$hostid/$dayid';
```

testtext data:

```
wer	46
weree	78
wer	89
rr	89
```
```sql
create table d_part(name string)
partitioned by (value string)
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile;

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;  -- note: "nonstrict", not "nonstrick"

insert overwrite table d_part partition(value) select name, addr as value from testtext;

select * from d_part;
show partitions d_part;
```
```sql
hive> create table d_part2(
```
- Escaping special characters in Hive
- schema tool
https://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_ig_hive_schema_tool.html
Permalink: https://satyrswang.github.io/2021/03/03/hive/
Copyright: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please credit the source when reposting!