The pyspider Crawler Framework: Scraping 宝宝树 (Babytree)

1 Requirements and Analysis

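I have recently been working on a requirement to scrape product information from the Babytree site. I expected it to be simple, but the anti-scraping measures turned out to be fairly strict. After two days of digging, I found that several parameters are encrypted in JavaScript. Fetching data from the site requires a request parameter called constId, and constId can only be obtained through a chain of three network requests. The last request is the one that actually returns constId, but it depends on the first two. The key to all three requests is the "Param" field in the request headers, as the screenshots show:
[Screenshots: request headers containing the Param field]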
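Opening the JS source listed in the Initiator column for these requests in the browser's Network panel shows that the encryption happens inside those JS files.
[Screenshot: the Initiator column in the Network panel]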

2 Cracking the Encryption
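  • My own skills being limited, I asked two colleagues on my team, Szpilman and 煎饼 (Jianbing), for help, and they worked out the encryption process by debugging the JS. The JS encryption code follows (it lives in const-id.js in the site's source):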
// Pick 31 random characters, then prepend "1" to form the lid.
function s() {
    for (var i = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ", c = 62, u = [], s = 31, d = 0; d < s; d++)
        u[d] = i["substr"](Math['floor'](Math["random"]() * c), 1);
    return u["join"]("")
}
s()
// Then take "appKey": "7a0d42b97002353426c47d18f1cc0fbe" from the HTML document. Together with the lid it forms the input to the first encryption.
// The lid is a 32-character string, so the leading "1" has to be prepended.
//'{"lid": "1hNwgj22HZaf75p8rF97IicQBCRCx9Gz","appKey": "7a0d42b97002353426c47d18f1cc0fbe"}'
var i = '', c, o, u, s, d, f, l, p = 0;
var S = "S0DOZN9bBJyPV-qczRa3oYvhGlUMrdjW7m2CkE5_FuKiTQXnwe6pg8fs4HAtIL1x="
for (a = '{"lid": "1hNwgj22HZaf75p8rF97IicQBCRCx9Gz","appKey": "7a0d42b97002353426c47d18f1cc0fbe"}'; p < a['length']; )
    c = a["charCodeAt"](p++),  // charCodeAt() returns the Unicode code unit at the given position, an integer between 0 and 65535.
    o = a["charCodeAt"](p++),
    u = a["charCodeAt"](p++),
    s = c >> 2,
    d = (c & 3) << 4 | o >> 4,
    f = (o & 15) << 2 | u >> 6,
    l = u & parseInt('77', 8),
    isNaN(o) ? f = l = parseInt('100', 8) : isNaN(u) && (l = parseInt('100', 8)),
    i = i + S["charAt"](s) + S["charAt"](d) + S["charAt"](f) + S["charAt"](l)
//'{"_v": "1.42.0.435","ua": "470b4b3af8a1eea1eafd570cb672de33","language": "en-US","cd": 24,"pr": 1,"hc": 4,"res": "1680;1050","ar": "1680;1026","to": -480,"ss": 1,"ls": 1,"ind": 1,"od": 1,"cc": "unknown","np": "Linux x86_64","dnt": "unknown","rp": "9597ec5d235f00b31ac537ef03b028cf","can": "f19bbe07be0ce9deb3b7c6d067f2ba53","web": "fac25db4cf995e91f7b62096e793f568","adb": false,"hll": false,"hlr": false,"hlo": false,"hlb": false,"ts": "0;false;false","jf": "745caf07297ffff67e829c8e9f977188","inet": "10.15.100.114","appKey": "7a0d42b97002353426c47d18f1cc0fbe","lid": "1JpFx0vb3baqOZep3haHLKpXREuhff7V"}'
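Read closely, the loop above has the same shape as ordinary Base64: three input bytes become four 6-bit indexes into S, and index 64 ("=", i.e. parseInt('100', 8)) serves as padding. If that reading is right, and assuming the payload is pure ASCII, the same output can be produced with Python's standard base64 module followed by a character-for-character alphabet swap. The sketch below rests on that assumption; the names encode_param and decode_param are made up for illustration and do not come from the site's code.

```python
import base64

# Alphabet copied from the JS above: 64 symbols plus "=" used as padding.
S = "S0DOZN9bBJyPV-qczRa3oYvhGlUMrdjW7m2CkE5_FuKiTQXnwe6pg8fs4HAtIL1x="
# The standard Base64 alphabet, in index order, with its padding character.
STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="

ENCODE_TABLE = str.maketrans(STANDARD, S)
DECODE_TABLE = str.maketrans(S, STANDARD)


def encode_param(payload: str) -> str:
    """Base64-encode an ASCII payload, then swap in the site's alphabet."""
    standard = base64.b64encode(payload.encode("ascii")).decode("ascii")
    return standard.translate(ENCODE_TABLE)


def decode_param(param: str) -> str:
    """Reverse of encode_param; handy for reading a captured Param header."""
    return base64.b64decode(param.translate(DECODE_TABLE)).decode("ascii")


payload = '{"lid": "1hNwgj22HZaf75p8rF97IicQBCRCx9Gz","appKey": "7a0d42b97002353426c47d18f1cc0fbe"}'
param = encode_param(payload)
print(param)
print(decode_param(param) == payload)
```

The decoding direction is the more useful one while debugging: pasting a Param value copied from the browser into decode_param should reveal the JSON that the page actually sent, provided the alphabet has not changed.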
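  • Because the crawler has to be written in Python, the encryption process has to be ported to Python as well. Below is my lightly modified and tested version. The test mainly checks whether the Param value it produces, when sent with the third request, returns a constId that can actually be used to fetch data. The code: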
import random

def s():
    i = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    n = 31
    c = 62   # len(i)
    b = []
    for each in range(n):
        b.append(i[random.randint(0,c-1)])
    return '1' + ''.join(b)

lib = s()
print(lib)
# lib = "1hNwgj22HZaf75p8rF97IicQBCRCx9Gz"
print(len(lib))
a = '{"lid": "1hNwgj22HZaf75p8rF97IicQBCRCx9Gz","appKey": "7a0d42b97002353426c47d18f1cc0fbe"}'
print(len(a)) # 88
S = "S0DOZN9bBJyPV-qczRa3oYvhGlUMrdjW7m2CkE5_FuKiTQXnwe6pg8fs4HAtIL1x="
print(len(S))

def get_params(aa):
    S = "S0DOZN9bBJyPV-qczRa3oYvhGlUMrdjW7m2CkE5_FuKiTQXnwe6pg8fs4HAtIL1x="
    n = len(aa)
    param = ''
    for i in range(0,n,3):
        c = ord(aa[i])
        if i+1 < n:
            o = ord(aa[i+
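The listing above breaks off partway through get_params. What follows is a reconstruction from the JS loop, not the author's remaining code: a minimal sketch that keeps the JS variable names (c, o, u, s, d, f, l) so the two versions can be compared line by line. It assumes the payload contains only ASCII characters; a character code above 255 would push the first index past the end of S.

```python
S = "S0DOZN9bBJyPV-qczRa3oYvhGlUMrdjW7m2CkE5_FuKiTQXnwe6pg8fs4HAtIL1x="


def get_params(aa: str) -> str:
    """Port of the JS loop: every 3 input characters yield 4 symbols from S."""
    n = len(aa)
    out = []
    for i in range(0, n, 3):
        c = ord(aa[i])
        # Characters past the end are None here (they are NaN in the JS).
        o = ord(aa[i + 1]) if i + 1 < n else None
        u = ord(aa[i + 2]) if i + 2 < n else None

        s = c >> 2
        # In JS, NaN >> 4 evaluates to 0, hence the "or 0" fallback.
        d = (c & 3) << 4 | (o or 0) >> 4
        if o is None:
            f = l = 64          # parseInt('100', 8): index of the "=" padding
        else:
            f = (o & 15) << 2 | (u or 0) >> 6
            l = 64 if u is None else u & 0o77   # parseInt('77', 8) == 63
        out.append(S[s] + S[d] + S[f] + S[l])
    return "".join(out)


payload = '{"lid": "1hNwgj22HZaf75p8rF97IicQBCRCx9Gz","appKey": "7a0d42b97002353426c47d18f1cc0fbe"}'
param = get_params(payload)
print(len(param), param)
```

Whether this string is accepted by the site as the Param header can only be confirmed by sending it, which is the check the author describes for the third request.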

