A Simple Python Script for Scraping Listings from the 5i5j (我爱我家) Website

2023-09-23 00:59:11

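Housing prices in Hangzhou have been climbing fast lately. Whether to follow the crowd and buy is a real headache, but a bit of preparation never hurts. A while ago I talked to an agent from 5i5j (我爱我家) to get a feel for the market, and he mentioned the 5i5j official website, saying its listings are timely and accurate and worth checking regularly. True to a programmer's instincts, anything a script can do for me is something I should not do by hand, so I wrote a script that monitors the 5i5j website for changes and texts me whenever a new listing appears.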

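First, feasibility. Scraping the site and getting the HTML is of course not difficult; what remains is to extract the useful information from it and send it to myself by SMS.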

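I searched for SMS services and found several, but many sell only bundles, and even the cheapest costs a few hundred yuan, which is a bit much for a casual user like me. After comparing options I settled on Alidayu (阿里大于) as the SMS platform: it bills per message, at 4.5 fen (0.045 yuan) each, and you log in with a Taobao account and pay with Alipay, which is convenient.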

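Talk is cheap, show me the code. Here it is, using the Zhijiang Yihao (之江一号) community as the example. To search for a different community, just change the last segment of url.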

The wiwj.py file

# -*- coding: utf-8 -*-
# Python 2 script: scan the 5i5j listing page and text any new listings.

import urllib2
import re
import sms
import logging
logging.basicConfig(level=logging.DEBUG,
                    filename='log.log'
                    )


def del_html_mark(content):
    # Strip HTML tags.
    inner_pattern = re.compile('<.*?>')
    return re.sub(inner_pattern, '', content)


def del_additional_mark(content):
    # Strip spaces, then semicolons, tabs and newlines.
    inner_pattern = re.compile(' ')
    content = re.sub(inner_pattern, '', content)
    inner_pattern = re.compile('[;\t\n ]')
    content = re.sub(inner_pattern, '', content)
    return content


def handle_content(content):
    content = del_html_mark(content)
    content = del_additional_mark(content)
    return content


url = 'http://hz.5i5j.com/exchange/_%E4%B9%8B%E6%B1%9F%E4%B8%80%E5%8F%B7'
# Pretend to be a browser so the site serves the normal page.
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
html_content = ''
logging.info('Starting scanning...')
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    html_content = response.read()
    # print html_content
except urllib2.URLError as e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason

content = html_content.decode('utf-8')
# NOTE: each pattern below needs the page's HTML tags around (.*?);
# they are not shown in this listing.
pattern = re.compile('(.*?)', re.S)
items = re.findall(pattern, content)
for item in items:
    title_pattern = re.compile('(.*?)')
    title_list = re.findall(title_pattern, item)
    title = title_list[0]
    title = handle_content(title)
    # print title
    community_pattern = re.compile('(.*?)')
    community_list = re.findall(community_pattern, item)
    community = community_list[0]
    community = handle_content(community)
    # print community
    detail_pattern = re.compile('(.*?)')
    detail_list = re.findall(detail_pattern, item)
    detail = detail_list[0]
    detail = handle_content(detail)
    # print detail
    price_pattern = re.compile('(.*?)')
    price_list = re.findall(price_pattern, item)
    price = price_list[0]
    price = handle_content(price)
    # print price
    size_pattern = re.compile('(.*?)/.*?')
    size_list = re.findall(size_pattern, item)
    size = size_list[0]
    size = handle_content(size)
    total_msg = {
        'title': title.encode('utf-8'),
        'community': community.encode('utf-8'),
        'detail': detail.encode('utf-8'),
        'price': price.encode('utf-8'),
        'size': size.encode('utf-8')
    }
    # Titles already reported are stored one per line in record.log.
    try:
        with open('record.log', 'r') as record_file:
            old_info_list = record_file.readlines()
    except IOError:
        old_info_list = []  # first run: no record file yet
    if (total_msg['title'] + '\n') in old_info_list:
        continue
    print total_msg
    sms.send_msg(total_msg, 'my phone number')
    with open('record.log', 'a') as record_file:
        record_file.write(total_msg['title'] + '\n')
    # Log each new listing as it is reported.
    logging.info(total_msg['title'])
    logging.info(total_msg['detail'])
    logging.info(total_msg['community'])
    logging.info(total_msg['size'])
    logging.info(total_msg['price'])
logging.info('Finished scanning.')
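
The percent-encoded tail of url in the listing above is the UTF-8 encoding of the community name, so switching communities could be scripted rather than edited by hand. A minimal sketch, written for Python 3 (unlike the Python 2 listing); the helper name build_search_url is hypothetical, and the URL layout is inferred only from the single example above:

```python
from urllib.parse import quote

BASE_URL = 'http://hz.5i5j.com/exchange/_'  # prefix copied from url in wiwj.py


def build_search_url(community):
    """Return the search URL for a community name (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of the name.
    return BASE_URL + quote(community)


print(build_search_url('之江一号'))
```

For 之江一号 this reproduces the url used in wiwj.py.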

The sms.py file

# -*- coding: utf-8 -*-

import top.api


def send_msg(msg, tel):
    # msg is a dict with 'community', 'size' and 'price' keys;
    # tel is the recipient's phone number.
    app_key = "******"
    secret_key = "******"

    req = top.api.AlibabaAliqinFcSmsNumSendRequest()
    req.set_app_info(top.appinfo(app_key, secret_key))

    req.extend = ""
    req.sms_type = "normal"
    req.sms_free_sign_name = "******"
    req.sms_param = "{'position':'" + msg['community'] + "','temperature':'" + msg['size'] + \
                    "','detail':'" + msg['price'] + "'}"
    req.rec_num = tel
    req.sms_template_code = "******"
    try:
        resp = req.getResponse()
        print (resp)
    except Exception as e:
        print (e)


if __name__ == "__main__":
    # send_msg indexes msg by key, so the test message must be a dict.
    test_msg = {'community': 'test community', 'size': 'test size', 'price': 'test price'}
    send_msg(test_msg, 'test phone number')

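wiwj.py is the main file. The basic idea is to disguise the Python script as a browser, fetch a specific page, and then use regular expressions to pull the needed fields out of the returned HTML text, including the total price, area, unit price, and description.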
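
The two steps, sending a browser-like User-Agent and extracting fields with regular expressions, can be sketched as follows. This is Python 3 rather than the Python 2 of the original script, and the HTML fragment and the patterns are hypothetical: the real page markup is not shown in this post, so they only illustrate the technique.

```python
import re
import urllib.request

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'  # same string as wiwj.py

# Hypothetical markup, made up for this example; not the real 5i5j page.
SAMPLE_HTML = '''
<li class="item"><h2><a href="#">South-facing 3 rooms</a></h2>
<p class="price"><b>350</b> wan</p></li>
<li class="item"><h2><a href="#">Bright 2 rooms</a></h2>
<p class="price"><b>280</b> wan</p></li>
'''

ITEM_RE = re.compile(r'<li class="item">(.*?)</li>', re.S)
TITLE_RE = re.compile(r'<h2>(.*?)</h2>', re.S)
PRICE_RE = re.compile(r'<p class="price">(.*?)</p>', re.S)
TAG_RE = re.compile(r'<.*?>')


def build_request(url):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def strip_tags(fragment):
    """Remove HTML tags and surrounding whitespace from a fragment."""
    return TAG_RE.sub('', fragment).strip()


def parse_listings(html):
    """Return one dict per listing block found in html."""
    listings = []
    for item in ITEM_RE.findall(html):
        title = TITLE_RE.search(item)
        price = PRICE_RE.search(item)
        if title is None or price is None:
            continue  # skip blocks that do not match the expected shape
        listings.append({'title': strip_tags(title.group(1)),
                         'price': strip_tags(price.group(1))})
    return listings


for listing in parse_listings(SAMPLE_HTML):
    print(listing)
```

Passing build_request(url) to urllib.request.urlopen would fetch the live page. Checking each match for None before reading it keeps one malformed listing from aborting the whole scan.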

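sms.py uses top.api, the Alidayu SDK. The following two lines create the request object; the remaining parameters are then set according to the template you applied for.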

req = top.api.AlibabaAliqinFcSmsNumSendRequest()
req.set_app_info(top.appinfo(app_key, secret_key))

In sms.py I have replaced the details of my own template with ******. Readers who need one can apply for their own template.

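Listings on the 5i5j website change frequently. Even an old listing may suddenly be refreshed and shown at the top, and the site gives no detailed location for a property, so there is no direct way to tell whether a freshly updated listing already existed. My approach to deciding whether a listing is old is to maintain a log file and write every scraped listing into it. Each newly scraped listing is compared with the entries in the log file: if the total price, unit price, and description are all the same, I treat it as the same property and send no SMS. If they are not all the same, it is a new property and an SMS alert goes out. Note that wiwj.py as listed writes only the listing title to record.log and compares on that.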
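
A minimal sketch of this check in Python 3, assuming each listing is a dict with title, price and size keys. The key format and the function names are hypothetical and are not taken from wiwj.py.

```python
import os
import tempfile


def load_seen(path):
    """Return the set of keys already recorded, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding='utf-8') as record_file:
        return {line.rstrip('\n') for line in record_file}


def listing_key(listing):
    """Join the compared fields into one line (hypothetical key format)."""
    return '|'.join([listing['price'], listing['size'], listing['title']])


def filter_new(listings, path):
    """Return listings not seen before and append their keys to the record file."""
    seen = load_seen(path)
    fresh = []
    with open(path, 'a', encoding='utf-8') as record_file:
        for listing in listings:
            key = listing_key(listing)
            if key in seen:
                continue
            seen.add(key)
            record_file.write(key + '\n')
            fresh.append(listing)
    return fresh


record_path = os.path.join(tempfile.mkdtemp(), 'record.log')
batch = [{'title': 'Bright 2 rooms', 'price': '280', 'size': '89'}]
print(len(filter_new(batch, record_path)))  # first scan
print(len(filter_new(batch, record_path)))  # same listing again
```

Loading the record file into a set once per scan avoids re-reading it for every listing, and a missing file on the first run is treated as an empty record.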

One more note: some readers may find this statement odd:

req.sms_param = "{'position':'" + msg['community'] + "','temperature':'" + msg['size'] + \
                "','detail':'" + msg['price'] + "'}"

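This is because Alidayu SMS templates must pass review. I submitted several templates containing real-estate wording and all of them were rejected, so in the end I had to disguise the message as a weather-forecast template, which finally passed. One nice thing about Alidayu is that it states the reason for each rejection, so you can make targeted changes and resubmit.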
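
Building that parameter string by concatenation breaks as soon as a field contains a quote character. A hedged alternative in Python 3 is to build it with json.dumps; this assumes the service accepts standard JSON for sms_param, which this post does not confirm, and build_sms_param is a hypothetical helper name.

```python
import json


def build_sms_param(msg):
    """Map listing fields onto the weather-style template variables."""
    return json.dumps({
        'position': msg['community'],
        'temperature': msg['size'],
        'detail': msg['price'],
    }, ensure_ascii=False)  # keep Chinese text readable instead of \u escapes


param = build_sms_param({'community': 'Zhijiang Yihao', 'size': '89', 'price': '350'})
print(param)
```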

Finally, I put these scripts on my personal VPS (a Raspberry Pi would work too) and set up a crontab job:

*/10 * * * * python ~/Documents/Fang/wiwj.py

It runs every 10 minutes, which is timely enough for my purposes.

And that's it, all done XD

Reposted from: https://www.cnblogs.com/istream/p/5940429.html
