Scraping the NetEase Cloud Music hot-songs chart (parsing the JSON with JsonPath + an object-oriented version)
One thing to note: JsonPath counts array elements from 0, while XPath counts from 1.
The usual (procedural) version:
import requests
from requests.exceptions import RequestException
import re
import json
import jsonpath
import csv

headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
}

fp = open('D:/网易云音乐Top200.csv', 'wt', newline='', encoding='utf8')
writer = csv.writer(fp)
writer.writerow(('歌名', '歌手', '图片链接', '上次排名'))  # title, artist, picture link, previous rank

def get_html(url):
    try:
        content = requests.get(url, headers=headers)
        if content.status_code == requests.codes.OK:
            return content.text
        else:
            print('debug1')
            return None
    except RequestException:
        print('debug2')
        return None

def get_json_data(html):
    # The original pattern was lost when this post was extracted; the chart JSON sits in a
    # hidden <textarea id="song-list-pre-data"> on the page, so a pattern like this recovers it.
    json_content = re.findall(
        r'<textarea id="song-list-pre-data" style="display:none;">(.*?)</textarea>',
        html, re.S)
    # print(json_content)  # json_content[0] is genuine JSON text at this point
    result = json.loads(json_content[0])
    # Note: result can no longer be pasted into www.json.cn, because it has been
    # converted into a Python object (here a list of dicts).
    # JSON is built only from arrays and objects; for an array remember to index with [x].
    '''Probing the structure:
    print(result[0])                        # the first object in the JavaScript array
    print(result[0]["name"])                # title
    print(result[0]["artists"][0]["name"])  # artist
    print(result[0]["album"]["picUrl"])
    print(result[0]["lastRank"])            # previous rank
    '''
    # The probes work, so switch the extraction over to jsonpath
    for section in result:
        title = jsonpath.jsonpath(section, expr='$.name')[0]
        artist = jsonpath.jsonpath(section, expr='$.artists..name')[0]
        picture_link = jsonpath.jsonpath(section, expr='$.album.picUrl')[0]
        lastRank = jsonpath.jsonpath(section, expr='$.lastRank')
        if not lastRank:
            lastRank = '等于当前排名'  # "same as the current rank"
        else:
            lastRank = lastRank[0]
        writer.writerow((title, artist, picture_link, lastRank))

if __name__ == '__main__':
    '''Note that the url here must drop the "/#" from the address shown in the browser,
    'https://music.163.com/#/discover/toplist?id=3778678':
    the browser URL and the URL actually fetched differ only in that spot, so try it without.
    '''
    url = 'https://music.163.com/discover/toplist?id=3778678'
    html = get_html(url)
    get_json_data(html)
    fp.close()  # flush the CSV to disk
I also tried an object-oriented version, wrapping the scraper up as a little spider class:
import requests
from requests.exceptions import RequestException
import re
import json
import jsonpath
import csv

class CloudMusicSpider:
    def __init__(self):
        self.headers = {
            "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
        }
        self.url = 'https://music.163.com/discover/toplist?id=3778678'

    def parse_url(self, url):
        try:
            response = requests.get(url, headers=self.headers)
            if response.status_code == requests.codes.OK:
                return response.text
            else:
                return None
        except RequestException:
            return None

    def get_json_data(self, html):
        fp = open('D:/网易云音乐Top200.csv', 'wt', newline='', encoding='utf8')
        writer = csv.writer(fp)
        writer.writerow(('歌名', '歌手', '图片链接', '上次排名'))
        # As in the procedural version, the original pattern was lost during extraction;
        # the chart JSON lives in a hidden <textarea id="song-list-pre-data"> element.
        json_content = re.findall(
            r'<textarea id="song-list-pre-data" style="display:none;">(.*?)</textarea>',
            html, re.S)
        result = json.loads(json_content[0])
        for section in result:
            title = jsonpath.jsonpath(section, expr='$.name')[0]
            artist = jsonpath.jsonpath(section, expr='$.artists..name')[0]
            picture_link = jsonpath.jsonpath(section, expr='$.album.picUrl')[0]
            lastRank = jsonpath.jsonpath(section, expr='$.lastRank')
            if not lastRank:
                lastRank = '等于当前排名'  # "same as the current rank"
            else:
                lastRank = lastRank[0]
            writer.writerow((title, artist, picture_link, lastRank))
        fp.close()  # flush the CSV to disk

    def runspider(self):
        html = self.parse_url(self.url)
        self.get_json_data(html)

if __name__ == '__main__':
    cloud_music_spider = CloudMusicSpider()
    cloud_music_spider.runspider()
The freshly scraped file may appear garbled (mojibake) when opened directly in Excel.
One workaround: open the file in Notepad and use "Save As" to write it out as a separate .csv file;
opening that copy shows the data correctly, because the re-save stores the text in an encoding Excel can detect.
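The underlying cause is that Excel only treats a CSV as UTF-8 when the file starts with a byte-order mark (BOM). Writing with `encoding='utf-8-sig'` prepends that BOM and skips the Notepad round-trip entirely; a minimal sketch (the file name is just an example):

```python
import csv

# 'utf-8-sig' prepends the UTF-8 BOM (EF BB BF), which Excel uses to detect the encoding
with open('demo.csv', 'wt', newline='', encoding='utf-8-sig') as fp:
    writer = csv.writer(fp)
    writer.writerow(('歌名', '歌手'))

# The first three bytes of the file are now the BOM
with open('demo.csv', 'rb') as fp:
    print(fp.read(3))  # b'\xef\xbb\xbf'
```

Swapping `encoding='utf8'` for `encoding='utf-8-sig'` in either script above produces a file Excel opens cleanly on the first try.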
