Python 爬取外文期刊论文信息(机械 仪表工业)
NSTL国家科技图书文献中心 2017 机械 仪表工业 所有期刊论文信息
代码比较随意,不要介意
第一步,爬取所有期刊链接
#coding=utf-8
"""Step 1: collect every journal link from NSTL's TH (machinery & instruments)
category search-result pages and store them in MongoDB."""
import time

from lxml import etree
from pymongo import MongoClient
from selenium import webdriver

client = MongoClient("IP", 27017)  # NOTE(review): "IP" is a placeholder host
db = client["nstl"]
collection = db["journal_urls"]
db.authenticate("", "")  # NOTE(review): credentials left blank in the original

driver = webdriver.Chrome(executable_path=r"D:\chromedriver_win32\chromedriver.exe")
driver.get('https://www.nstl.gov.cn/facade/search/clcSearch.do?&lan=eng&clc=TH')

html = driver.page_source
tree = etree.HTML(html)
# Total number of result pages (47 at the time the original was written).
count = int(tree.xpath("//span[@id='totalPages1']/text()")[0])

for i in range(count):
    html = driver.page_source
    tree = etree.HTML(html)
    # Store every journal link found on the current result page.
    for url in tree.xpath("//div[@class='s2listtd2']/span/a/@href"):
        # insert_one() replaces the long-deprecated Collection.insert().
        collection.insert_one({'url': url})
    # Last page reached: nothing further to click.
    if i == count - 1:
        break
    # Pager links are labelled by page number, so page i+1 is the "i+2" link.
    driver.find_element_by_xpath(
        '//div[@id="page"]/div//a[text()="%s"]' % str(i + 2)).click()
    # Poll until the page source actually changes, i.e. navigation finished.
    while True:
        time.sleep(1)
        if driver.page_source != html:
            break

driver.close()
第二步,爬取每个期刊中所有2017年论文链接
#coding=utf-8
"""Step 2: for each stored journal link, collect the URLs of all 2017 articles."""
import time

from lxml import etree
from pymongo import MongoClient
from selenium import webdriver

client = MongoClient("IP", 27017)
db = client["nstl"]
collection1 = db["journal_urls"]
collection2 = db["journalArticle2017_urls"]
db.authenticate("", "")

driver = webdriver.Chrome(executable_path=r"D:\chromedriver_win32\chromedriver.exe")

# Iterate over every journal link collected in step 1.
for item in collection1.find({}, {"url": 1, "_id": 0}):
    # Slice presumably strips a wrapper prefix/suffix around the real URL
    # -- TODO confirm against the stored data.
    driver.get(item['url'][29:-4])
    html = driver.page_source
    tree = etree.HTML(html)
    # If 2018 issues are listed, the 2017 tree is collapsed: expand it first.
    if tree.xpath("//div[@id='year_2018']"):
        driver.find_element_by_xpath("//div[@id='year_2017']").click()
        time.sleep(1)
        driver.find_element_by_xpath(
            "//div[@id='volumeUl_2017']/div[@class='ltreebom2']").click()
    # Number of 2017 issues for this journal.
    issues = tree.xpath("//div[@id='volumeUl_2017']//div[@class='ltreebom3']/a")
    for i in range(1, len(issues) + 1):
        issue_html = driver.page_source
        issue_tree = etree.HTML(issue_html)
        # BUG FIX: the original queried the stale first-page `tree` here, so
        # every iteration re-stored the first issue's article links.
        for url in issue_tree.xpath("//div[@class='s2listtd2']/a/@href"):
            collection2.insert_one({'url': url})
        # All issues processed for this journal.
        if i == len(issues):
            break
        # Click the next issue node; bail out if it cannot be located.
        try:
            driver.find_element_by_xpath(
                "//div[@id='volumeUl_2017']//div[@class='ltreebom3'][%s]"
                % str(i + 1)).click()
        except Exception:
            break
        # Poll until the DOM actually changed after the click.
        while True:
            time.sleep(1)
            if driver.page_source != issue_html:
                break

driver.close()
第三步,爬取论文信息详情页源码
#coding=utf-8
"""Step 3: fetch and store the raw HTML of every article detail page."""
import time

from pymongo import MongoClient
from selenium import webdriver

client = MongoClient("IP", 27017)
db = client["nstl"]
collection = db["journalArticle2017_urls"]
collection1 = db["journalArticle2017_codes"]
db.authenticate("", "")

driver = webdriver.Chrome(executable_path=r"D:\chromedriver_win32\chromedriver.exe")

# Build the full-view URL for every stored article link and fetch it.
for item in collection.find({}, {"url": 1, "_id": 0}):
    # Slice offsets presumably match the stored link format -- TODO confirm.
    url = ("https://www.nstl.gov.cn/facade/search/toFullView.do?checkedSEQNO="
           + item['url'][23:-11] + "&subDocType=" + item['url'][-8:-3])
    # BUG FIX: the original compared against `html`, which was only assigned
    # in commented-out code, raising NameError on the first iteration.
    # Snapshot the current page source *before* navigating so the wait loop
    # has a valid baseline to compare against.
    previous_html = driver.page_source
    driver.get(url)
    # Wait (at most ~100 s) for the page source to differ from the old page.
    for _ in range(100):
        time.sleep(1)
        if driver.page_source != previous_html:
            break
    collection1.insert_one({'html': driver.page_source})

driver.close()
第四步,解析源码
#coding=utf-8
"""Step 4: parse the stored detail-page HTML into structured article records."""
from lxml import etree
from pymongo import MongoClient

client = MongoClient("IP", 27017)
db = client["nstl"]
collection1 = db["journalArticle2017_codes"]
collection2 = db["journalArticle2017_data"]
db.authenticate("", "")

# Chinese field labels exactly as they appear on the NSTL detail page,
# keyed by the output field name they map to.
LABELS = {
    'organization': u'【作者单位】:',
    'journal_name': u'【刊名】:',
    'issn': u'【ISSN】:',
    'publication_year': u'【出版年】:',
    'volume': u'【卷】:',
    'issue': u'【期】:',
    'page_start': u'【起页】:',
    'page_end': u'【止页】:',
    'page_count': u'【总页数】:',
    'clc': u'【分类号】:',
    'language': u'【语种】:',
    'summary': u'【文摘】:',
}
KEYWORDS_LABEL = u'【关键词】:'


def _field(tree, label):
    """Return the text nodes of the element following the given label div."""
    return tree.xpath("//div[text()='%s']/following-sibling::*/text()" % label)


for item in collection1.find({}, {"html": 1, "_id": 0}):
    tree = etree.HTML(item["html"])
    dc = {}
    # BUG FIX: guard the title lookup -- the original indexed title[0]
    # unconditionally and crashed on pages with no title span.
    title = tree.xpath("//span[@name='title']/text()")
    if title:
        dc['title'] = title[0]
    # Authors are anchor texts; the full list is kept, matching the original.
    author = tree.xpath("//a[starts-with(@href,'javascript:searchByAuthor')]/text()")
    if author:
        dc['author'] = author
    # All plain-text labelled fields share one xpath pattern.
    for key, label in LABELS.items():
        value = _field(tree, label)
        if value:
            dc[key] = value[0]
    # Keywords are nested one level deeper than the plain-text fields.
    keywords = tree.xpath(
        "//div[text()='%s']/following-sibling::*/span/a/text()" % KEYWORDS_LABEL)
    if keywords:
        # BUG FIX: store the full keyword list; the original kept only
        # keywords[0] and silently dropped the rest.
        dc['keywords'] = keywords
    collection2.insert_one(dc)
转载于:https://www.cnblogs.com/zhangtianyuan/p/9199324.html
本文来自互联网用户投稿,文章观点仅代表作者本人,不代表本站立场,不承担相关法律责任。如若转载,请注明出处。 如若内容造成侵权/违法违规/事实不符,请点击【内容举报】进行投诉反馈!
