Scraping the Gushiwen Classical Poetry Site (Using Regular Expressions)

 I. Using Regular Expressions

Common regular expression matching rules:

Matching a specific string:
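As a minimal illustration of matching a specific string, the sketch below uses Python's standard `re` module. The sample text and patterns are invented for this example and do not come from the original post.

```python
import re

text = "hello world"

# re.match only succeeds if the pattern matches at the start of the string.
m = re.match(r"hello", text)
matched = m.group() if m else None

# re.search scans the whole string and returns the first match anywhere.
s = re.search(r"world", text)
found = s.group() if s else None

# A few common rules: \d is a digit and + means "one or more".
digits = re.findall(r"\d+", "tel: 010-1234")

print(matched, found, digits)
```

Note that `re.match(r"world", text)` would return `None` here, because `world` is not at the start of the string; `re.search` is the call to use when the position is unknown.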

__author__ = '田明博'
__date__ = '2019/10/11 12:56'

import re

import requests


def get_page(link):
    """Download one listing page and print the poems found on it."""
    url = link
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
    }
    resp = requests.get(url, headers=headers)
    text = resp.text

    # The HTML tags that anchor the four patterns below are missing from the
    # text as given, so the patterns are kept exactly as they appear.
    # re.DOTALL makes "." match every character, including newlines.
    titles = re.findall(r'.*?(.*?)', text, re.DOTALL)  # poem titles
    source = re.findall(r'.*?(.*?)', text, re.DOTALL)  # dynasties
    authors = re.findall(r'.*?(.*?)', text, re.DOTALL)  # authors
    poems_all = re.findall(r'(.*?)', text, re.DOTALL)  # raw poem bodies

    contents = []
    for poems in poems_all:
        poems = re.sub(r'<.*?>|\n', "", poems)  # strip HTML tags and newlines
        contents.append(poems.strip())
    # print(titles, source, authors, contents)

    '''
    all_poems = []
    for x in range(len(titles)):
        one_poem = {}
        one_poem['title'] = titles[x]
        one_poem['dynasty'] = source[x]
        one_poem['author'] = authors[x]
        one_poem['content'] = contents[x]
        all_poems.append(one_poem)
    print(all_poems)
    '''

    # Same result as the index-based version above, using zip.
    all_poems = []
    for value in zip(titles, source, authors, contents):
        # Tuple unpacking; "dynasty" avoids rebinding the "source" list.
        title, dynasty, author, content = value
        one_poem = {}
        one_poem['title'] = title
        one_poem['dynasty'] = dynasty
        one_poem['author'] = author
        one_poem['content'] = content
        all_poems.append(one_poem)
    print(all_poems)


def main():
    num = int(input('Number of pages to scrape: '))
    for i in range(num):
        # Fetch page i + 1 on each iteration, assuming pages are numbered from 1.
        link = 'https://www.gushiwen.org/default_{}.aspx'.format(i + 1)
        get_page(link)


if __name__ == '__main__':
    main()
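The scraper leans on three regex techniques: non-greedy groups (`.*?`), the `re.DOTALL` flag, and `re.sub` for stripping tags. The sketch below shows them on a small HTML snippet. The markup, class names, and poem text are hypothetical, invented so the example runs offline; the real page structure may differ.

```python
import re

# Hypothetical markup, invented for this example only.
html = """
<div class="poem">
  <b>Quiet Night Thought</b>
  <p class="body">Line one<br />
Line two</p>
</div>
<div class="poem">
  <b>Spring Dawn</b>
  <p class="body">Line three<br />
Line four</p>
</div>
"""

title_pattern = r'<div class="poem">.*?<b>(.*?)</b>'

# Without re.DOTALL, "." stops at newlines, so the pattern cannot span lines.
without_dotall = re.findall(title_pattern, html)

# With re.DOTALL, "." also matches newlines; ".*?" stays non-greedy, so each
# match ends at the first closing tag instead of running on to the last one.
titles = re.findall(title_pattern, html, re.DOTALL)
raw_bodies = re.findall(r'<p class="body">(.*?)</p>', html, re.DOTALL)

# Remove any remaining tags and newlines, as the scraper does.
bodies = [re.sub(r"<.*?>|\n", "", body).strip() for body in raw_bodies]

# zip pairs the parallel lists item by item.
poems = [{"title": t, "content": c} for t, c in zip(titles, bodies)]
print(poems)
```

Because `zip` stops at the shortest input, a pattern that misses one field on the page would silently drop poems rather than raise an error, so it can be worth comparing the list lengths before pairing them.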

Screenshot of the run:

