Heima Forum Log Analysis Project (Hive, Sqoop, Flume, MySQL)

I. Preparation

1. Project Description

By analyzing the Apache common-format access logs of the Heima (黑马) technical forum, we compute the forum's key metrics to support the operators' decisions.

2. The Data

Each record consists of 5 parts: client IP, access time, requested resource (the quoted request line, which contains the URL), response status, and bytes transferred.

A sample of the data:

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/shy.gif HTTP/1.1" 200 2663

8.35.201.163 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/nv_a.png HTTP/1.1" 200 2076

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/titter.gif HTTP/1.1" 200 1398

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/sweat.gif HTTP/1.1" 200 1879

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/mad.gif HTTP/1.1" 200 2423

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/hug.gif HTTP/1.1" 200 1054

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/lol.gif HTTP/1.1" 200 1443

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/victory.gif HTTP/1.1" 200 1275

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/time.gif HTTP/1.1" 200 687

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/kiss.gif HTTP/1.1" 200 987

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/handshake.gif HTTP/1.1" 200 1322

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/loveliness.gif HTTP/1.1" 200 1579

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/call.gif HTTP/1.1" 200 603

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/funk.gif HTTP/1.1" 200 2928

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/curse.gif HTTP/1.1" 200 1543

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/dizzy.gif HTTP/1.1" 200 1859

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/shutup.gif HTTP/1.1" 200 2500

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/sleepy.gif HTTP/1.1" 200 2375

8.35.201.164 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/pn.png HTTP/1.1" 200 592

8.35.201.165 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/avatar.php?uid=56212&size=middle HTTP/1.1" 301 -

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/uploadbutton_small.png HTTP/1.1" 200 690

8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/fastreply.gif HTTP/1.1" 200 608

8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/avatar.php?uid=21212&size=middle HTTP/1.1" 301 -

8.35.201.144 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/avatar.php?uid=28823&size=middle HTTP/1.1" 301 -

8.35.201.161 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/taobao.gif HTTP/1.1" 200 1021

8.35.201.165 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/02/93/31_avatar_middle.jpg HTTP/1.1" 200 6519

8.35.201.163 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/security.png HTTP/1.1" 200 2203

8.35.201.165 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/avatar.php?uid=36174&size=middle HTTP/1.1" 301 -

8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/pn_post.png HTTP/1.1" 200 3309

8.35.201.164 - - [30/May/2013:17:38:22 +0800] "GET /uc_server/data/avatar/000/05/72/32_avatar_middle.jpg HTTP/1.1" 200 5333

8.35.201.144 - - [30/May/2013:17:38:22 +0800] "GET /static/image/common/icon_quote_e.gif HTTP/1.1" 200 287

8.35.201.161 - - [30/May/2013:17:38:22 +0800] "GET /uc_server/avatar.php?uid=27067&size=small HTTP/1.1" 301 -

8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/05/36/35_avatar_middle.jpg HTTP/1.1" 200 10087

8.35.201.165 - - [30/May/2013:17:38:22 +0800] "GET /data/attachment/common/c5/common_13_usergroup_icon.jpg HTTP/1.1" 200 3462

8.35.201.160 - - [30/May/2013:17:38:22 +0800] "GET /static/image/magic/bump.small.gif HTTP/1.1" 200 1052

8.35.201.165 - - [30/May/2013:17:38:22 +0800] "GET /static/image/common/arw.gif HTTP/1.1" 200 940

220.181.89.156 - - [30/May/2013:17:38:20 +0800] "GET /thread-24727-1-1.html HTTP/1.1" 200 79499

8.35.201.164 - - [30/May/2013:17:38:22 +0800] "GET /uc_server/data/avatar/000/05/62/12_avatar_middle.jpg HTTP/1.1" 200 6415

211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /data/cache/style_1_forum_index.css?y7a HTTP/1.1" 200 2331

211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1" 200 1292

211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /source/plugin/wsh_wx/img/wsh_zk.css HTTP/1.1" 200 1482

211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /static/js/logging.js?y7a HTTP/1.1" 200 603

211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /static/js/forum.js?y7a HTTP/1.1" 200 15256

8.35.201.165 - - [30/May/2013:17:38:22 +0800] "GET /uc_server/data/avatar/000/02/88/23_avatar_middle.jpg HTTP/1.1" 200 6733

211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /static/js/md5.js?y7a HTTP/1.1" 200 5734

8.35.201.160 - - [30/May/2013:17:38:22 +0800] "GET /uc_server/data/avatar/000/02/12/12_avatar_middle.jpg HTTP/1.1" 200 6606

211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /source/plugin/study_nge/css/nge.css HTTP/1.1" 200 2521

211.97.15.179 - - [30/May/2013:17:38:21 +0800] "GET /forum.php HTTP/1.1" 200 71064

211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /source/plugin/study_nge/js/HoverLi.js HTTP/1.1" 200 324
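Each of the lines above can be pulled apart into the five fields with standard text tools. Here is a minimal shell sketch; it assumes exactly the layout shown (a single bracketed timestamp and a single quoted request line per record), mirroring what the Java LogParser in the appendix does:

```shell
line='27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/shy.gif HTTP/1.1" 200 2663'

# ip: the first whitespace-separated field
ip=$(echo "$line" | awk '{print $1}')
# timestamp: the text between [ and ]
ts=$(echo "$line" | sed 's/.*\[\(.*\)\].*/\1/')
# request line: the text between the double quotes
req=$(echo "$line" | sed 's/[^"]*"\([^"]*\)".*/\1/')
# status and traffic: the two fields after the closing quote
status=$(echo "$line" | awk -F'" ' '{print $2}' | awk '{print $1}')
traffic=$(echo "$line" | awk -F'" ' '{print $2}' | awk '{print $2}')

echo "$ip | $ts | $req | $status | $traffic"
```

This prints `27.19.74.143 | 30/May/2013:17:38:21 +0800 | GET /static/image/smiley/default/shy.gif HTTP/1.1 | 200 | 2663`.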

3. Key Metrics

⊙ Page Views (PV)

Definition: Page Views (PV) is the total number of pages viewed by all users; every page an individual user opens is counted once.

Analysis: total page views gauge user interest in the site, much as ratings do for a TV series. For the site operator, though, the views per board matter even more.

Formula: count of records.

⊙ Registered Users

Formula: count of requests whose URL contains member.php?mod=register.
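A quick way to sanity-check this formula on a cleaned URL list (the file path and its contents below are made up for illustration):

```shell
# hypothetical cleaned URL list, one URL per record
printf '%s\n' '/member.php?mod=register' '/forum.php' '/member.php?mod=register' > /tmp/urls.txt

# count records whose URL contains the register path
# (in grep's basic regex, ? is literal and . matches any character, which is fine here)
registered=$(grep -c 'member.php?mod=register' /tmp/urls.txt)
echo "registered=$registered"
```

This prints `registered=2`.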

⊙ Independent IPs

Definition: the number of distinct IP addresses that visit the site within one day. No matter how many pages a given IP opens, it counts as one independent IP.

Analysis: this is the most familiar metric. Regardless of how many machines or users sit behind a single IP, the independent-IP count is, to some extent, the most direct measure of how well a promotion campaign performs.

Formula: count of distinct IPs.

⊙ Bounce Rate

Definition: the percentage of visits that viewed only a single page before leaving the site, i.e. single-page visits / total visits.

Analysis: bounce rate is a key visitor-stickiness indicator showing how interested visitors are in the site: the lower the bounce rate, the better the traffic quality, and the more likely those visitors are to become effective, loyal users.

The metric also measures marketing effectiveness: it shows how many visitors were drawn to a promoted page or the site and then lost again. For example, if the site advertises in some channel, the bounce rate of visitors arriving from that source reflects whether the channel was well chosen, whether the ad copy works, and whether the landing page offers a good user experience.

Formula: (1) count the IPs that appear in only one record within the day, called bounces; (2) bounces / PV.
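As a sanity check of the formulas above, here is a throwaway shell sketch over a hypothetical four-record sample (the file path and IPs are made up):

```shell
# hypothetical sample: one client IP per request record
printf '%s\n' 27.19.74.143 27.19.74.143 8.35.201.163 211.97.15.179 > /tmp/ips.txt

pv=$(awk 'END{print NR}' /tmp/ips.txt)                                   # PV: one per record
uv=$(sort -u /tmp/ips.txt | awk 'END{print NR}')                         # distinct IPs
bounce=$(sort /tmp/ips.txt | uniq -c | awk '$1==1{n++} END{print n+0}')  # IPs with a single record
rate=$(awk -v b="$bounce" -v p="$pv" 'BEGIN{printf "%.2f", b/p}')

echo "PV=$pv UV=$uv bounces=$bounce bounce_rate=$rate"
```

This prints `PV=4 UV=3 bounces=2 bounce_rate=0.50`: two of the three IPs appear exactly once, so per the definition above they count as bounces.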

4. Development Steps

1. Collect the logs into HDFS with Flume

2. Clean the data with MapReduce

3. Run multidimensional analysis with Hive

4. Export the Hive results to MySQL with Sqoop

5. Provide reporting views for end users

That covers the basics of the project; the development process follows in detail:

II. Development Process

1. Importing the logs into HDFS with Flume

Configuration in a4.conf:

# agent name and its source/channel/sink names

a4.sources = r1

a4.channels = c1

a4.sinks = k1

# source: watch a spool directory for new log files

a4.sources.r1.type = spooldir

a4.sources.r1.spoolDir = /home/hadoop/logs

# channel: in-memory

a4.channels.c1.type = memory

a4.channels.c1.capacity = 10000

a4.channels.c1.transactionCapacity = 100

# interceptor: stamp each event with a timestamp (used by %Y%m%d in the sink path)

a4.sources.r1.interceptors = i1

a4.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

# sink: write to HDFS, one directory per day

a4.sinks.k1.type = hdfs

a4.sinks.k1.hdfs.path = hdfs://ns1/flume/%Y%m%d

a4.sinks.k1.hdfs.filePrefix = events-

a4.sinks.k1.hdfs.fileType = DataStream

# do not roll files by event count

a4.sinks.k1.hdfs.rollCount = 0

# roll a file once it reaches 128 MB

a4.sinks.k1.hdfs.rollSize = 134217728

# or after 60 seconds

a4.sinks.k1.hdfs.rollInterval = 60

# wire the source and sink to the channel

a4.sources.r1.channels = c1

a4.sinks.k1.channel = c1

Run it:

bin/flume-ng agent -n a4 -c conf -f conf/a4.conf -Dflume.root.logger=INFO,console

2. Initialization

1. Create an external table in Hive

create external table hmbbs (ip string, logtime string, url string) partitioned by (logdate string) row format delimited fields terminated by '\t' location '/flume';

2. Create a shell script

touch daily.sh

# make it executable

chmod +x daily.sh

3. Writing the shell script (test it step by step; comments inline)

# assign today's date to a variable (note the backticks; %Y%m%d matches the Flume directory name)

CURRENT=`date +%Y%m%d`

# clean and filter the logs with the MapReduce job in cleaner.jar

/hadoop/bin/hadoop jar /root/cleaner.jar /flume/$CURRENT /cleaned/$CURRENT

# the table is partitioned; data placed in HDFS directly must be registered as a partition

/hive/bin/hive -e "alter table hmbbs add partition (logdate='$CURRENT') location '/cleaned/$CURRENT'";

# count the day's records and store the count in a new table, pv_<date>

/hive/bin/hive -e "create table pv_$CURRENT row format delimited fields terminated by '\t' as select count(*) from hmbbs where logdate='$CURRENT'";

# group by ip, keep IPs with more than 20 hits, sort descending, keep the top 20, store in a new table, vip_<date>

/hive/bin/hive -e "create table vip_$CURRENT row format delimited fields terminated by '\t' as select '$CURRENT', ip, count(*) as hits from hmbbs where logdate='$CURRENT' group by ip having hits > 20 order by hits desc limit 20"

# count distinct IPs and store in a new table, uv_<date>

/hive/bin/hive -e "create table uv_$CURRENT row format delimited fields terminated by '\t' as select '$CURRENT', count(distinct ip) from hmbbs where logdate='$CURRENT'";

# instr is a Hive function that tests whether url contains the given substring (counts register requests)

/hive/bin/hive -e "select count(*) from hmbbs where logdate='$CURRENT' and instr(url,'member.php?mod=register')>0";

# export the result to MySQL

/sqoop-1.4.3/bin/sqoop export --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --export-dir "/user/hive/warehouse/pv_$CURRENT" --table pv --fields-terminated-by '\t'

Put the script in crontab so it runs on a schedule.
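A crontab entry for that could look like the following (the 01:00 schedule and log path are placeholders, not from the original project):

```
# run daily.sh at 01:00 every day, appending output to a log
0 1 * * * /root/daily.sh >> /var/log/daily.log 2>&1
```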

Appendix: the Java data-cleaning code

package cn.itcast.hadoop;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Cleaner extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        final String inputPath = args[0];
        final String outPath = args[1];
        final Configuration conf = new Configuration();
        final Job job = new Job(conf, Cleaner.class.getSimpleName());
        job.setJarByClass(Cleaner.class);

        FileInputFormat.setInputPaths(job, inputPath);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(outPath));

        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Cleaner(), args);
    }

    static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        LogParser parser = new LogParser();
        Text v2 = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            final String line = value.toString();
            final String[] parsed = parser.parse(line);
            final String ip = parsed[0];
            final String logtime = parsed[1];
            String url = parsed[2];

            // drop all requests for static resources
            if (url.startsWith("GET /static") || url.startsWith("GET /uc_server")) {
                return;
            }
            // strip the method prefix and the " HTTP/1.1" suffix
            if (url.startsWith("GET")) {
                url = url.substring("GET ".length(), url.length() - " HTTP/1.1".length());
            }
            if (url.startsWith("POST")) {
                url = url.substring("POST ".length(), url.length() - " HTTP/1.1".length());
            }

            v2.set(ip + "\t" + logtime + "\t" + url);
            context.write(key, v2);
        }
    }

    static class MyReducer extends Reducer<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void reduce(LongWritable k2, Iterable<Text> v2s, Context context)
                throws java.io.IOException, InterruptedException {
            for (Text v2 : v2s) {
                context.write(v2, NullWritable.get());
            }
        }
    }
}

class LogParser {
    public static final SimpleDateFormat FORMAT =
            new SimpleDateFormat("d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    public static final SimpleDateFormat dateformat1 =
            new SimpleDateFormat("yyyyMMddHHmmss");

    // public static void main(String[] args) throws ParseException {
    //     final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
    //     LogParser parser = new LogParser();
    //     final String[] array = parser.parse(S1);
    //     System.out.println("Sample line: " + S1);
    //     System.out.format("Parsed: ip=%s, time=%s, url=%s, status=%s, traffic=%s",
    //             array[0], array[1], array[2], array[3], array[4]);
    // }

    /**
     * Parses one log line.
     * @param line the raw record
     * @return an array of 5 elements: ip, time, url, status, traffic
     */
    public String[] parse(String line) {
        String ip = parseIP(line);
        String time;
        try {
            time = parseTime(line);
        } catch (Exception e1) {
            time = "null";
        }
        String url;
        try {
            url = parseURL(line);
        } catch (Exception e) {
            url = "null";
        }
        String status = parseStatus(line);
        String traffic = parseTraffic(line);
        return new String[] { ip, time, url, status, traffic };
    }

    private String parseTraffic(String line) {
        final String trim = line.substring(line.lastIndexOf("\"") + 1).trim();
        return trim.split(" ")[1];
    }

    private String parseStatus(String line) {
        String trim;
        try {
            trim = line.substring(line.lastIndexOf("\"") + 1).trim();
        } catch (Exception e) {
            trim = "null";
        }
        return trim.split(" ")[0];
    }

    private String parseURL(String line) {
        final int first = line.indexOf("\"");
        final int last = line.lastIndexOf("\"");
        return line.substring(first + 1, last);
    }

    private String parseTime(String line) {
        final int first = line.indexOf("[");
        final int last = line.indexOf("+0800]");
        String time = line.substring(first + 1, last).trim();
        try {
            return dateformat1.format(FORMAT.parse(time));
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return "";
    }

    private String parseIP(String line) {
        return line.split("- -")[0].trim();
    }
}

