Scraping stock data: northbound fund holdings (via Selenium)

Many websites now have anti-scraping defences, so traditional crawling methods don't always work. A workaround is to simulate real browser activity and scrape what the browser renders. It is slower, but it gets the data without putting extra pressure on the site's servers, so both sides come out fine.

First install the selenium module in your Python environment, then put the driver for your browser in the working directory. I use Firefox here, so the driver to download is geckodriver.
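The setup described above amounts to the following commands (the geckodriver version shown is only an example; pick the current release from Mozilla's GitHub releases page):

```shell
# Install the Python packages the script imports
pip install selenium tushare zhconv

# Download geckodriver from https://github.com/mozilla/geckodriver/releases,
# unpack it, and put the executable in the working directory (or on PATH)
tar -xzf geckodriver-v0.34.0-linux64.tar.gz
mv geckodriver .   # Selenium finds it here when launching Firefox
```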

The full code is as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
import tushare as ts
import time
import re
import traceback
import zhconv        # converts Traditional Chinese stock names to Simplified

# Batch-fetch northbound holdings for a list of dates.
# This function only updates holding volumes; another function refreshes share prices.
def getNorthFundDateData(brow, url, dates):
    brow.get(url)
    time.sleep(1)
    errorDate = []
    for date in dates:
        try:
            inputDate = brow.find_element(By.ID, 'txtShareholdingDate')
            # Set the date input's value directly via JavaScript
            brow.execute_script("arguments[0].value = '" + date.replace('-', '/') + "';", inputDate)
            brow.find_element(By.ID, 'btnSearch').click()
            brow.implicitly_wait(10)
            time.sleep(1)
            text = brow.page_source                  # grab the rendered page source
            text = text.replace('\r\n', '')
            text = text.replace('\n', '')
            text = text.replace(' ', '')
            # NOTE: the HTML tag literals in the regexes and splits below were
            # stripped when this article was published; they are reconstructed
            # from the page structure and may need adjusting if HKEX changes
            # its markup.
            lines = re.findall(r'<tr>.*?</tr>', text)
            date = re.findall(r'txtShareholdingDate"type="text"value=.*?id="txtShareholdingDate', text)[0].split('value="')[1].split('"id=')[0]
            date = date.replace('/', '-')
            result = []
            for line in lines:                       # parse each table row
                if '股份代號:' in line:
                    datas = re.findall(r'<td.*?</td>', line)
                    code = datas[0].split('</div>')[1].split('</td>')[0]
                    # Map HKEX Stock Connect codes to A-share codes
                    if code[0:1] == '9':
                        code = '60' + code[1:]
                    elif code[0:2] == '77':
                        code = '300' + code[2:]
                    elif code[0:1] == '7':
                        code = '00' + code[1:]
                    name = datas[1].split('</div>')[1].split('</td>')[0]
                    try:
                        name = zhconv.convert(name, 'zh-hans')   # Traditional -> Simplified
                    except:
                        pass
                    volume = datas[2].split('</div>')[1].split('</td>')[0].replace(',', '')
                    percent = datas[3].split('</div>')[1].split('</td>')[0]
                    close = 0        # prices are filled in elsewhere
                    high = 0
                    low = 0
                    result.append([date + '/' + code, date, code, name, volume, percent, str(close), str(high), str(low)])
            '''
            sql = """INSERT ignore INTO HKbuyA (id,date,code,name,volume,percent,close,high,low)
                     values (%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
            db, cursor = initSql()
            n = cursor.executemany(sql, result)      # insert new records, ignoring duplicates
            db.commit()
            closeDb(db, cursor)
            '''
            print(time.strftime('%Y-%m-%d %H:%M:%S'), 'success', date, url)
        except:
            time.sleep(1)
            traceback.print_exc()
            print(time.strftime('%Y-%m-%d %H:%M:%S'), 'failed', date, url)
            errorDate.append(date)
            return False         # bail out on the first failed date
    return True

# Download Shanghai/Shenzhen Connect data for the given date range.
def startToGetNorthFundData(fromDate, toDate):
    brow = webdriver.Firefox()                                 # launch the browser
    df = ts.get_hist_data("sh", start=fromDate, end=toDate)    # trading days in the range
    dates = list(df.index)
    urls = ['https://www.hkexnews.hk/sdw/search/mutualmarket_c.aspx?t=sz',
            'https://www.hkexnews.hk/sdw/search/mutualmarket_c.aspx?t=sh']   # pages to scrape
    brow.get(urls[0])
    time.sleep(2)
    for url in urls:
        timer = 0
        while timer < 5:                     # retry each page up to five times
            r = getNorthFundDateData(brow, url, dates)
            if r:
                timer = 5
            else:
                timer += 1
    brow.get('https://www.baidu.com')
    time.sleep(1)
    brow.quit()
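The code-prefix conversion buried inside getNorthFundDateData is easy to test in isolation. Here is that mapping pulled out into a small standalone function; the function name and board labels in the docstring are my own, but the prefix rules are exactly those in the scraper:

```python
def hkex_to_ashare(code: str) -> str:
    """Map a 5-digit HKEX Stock Connect code to its A-share code.

    9xxxx  -> 60xxxx  (Shanghai main board, 600xxx)
    77xxx  -> 300xxx  (ChiNext)
    7xxxx  -> 00xxxx  (Shenzhen main board)
    """
    if code.startswith('9'):
        return '60' + code[1:]
    if code.startswith('77'):       # checked before the plain '7' rule
        return '300' + code[2:]
    if code.startswith('7'):
        return '00' + code[1:]
    return code                     # leave unknown prefixes untouched

print(hkex_to_ashare('90001'))   # 600001
print(hkex_to_ashare('77001'))   # 300001
print(hkex_to_ashare('70001'))   # 000001
```

Note that the '77' rule must be checked before the generic '7' rule, matching the elif order in the original code.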
