Setting a Proxy for a Python Crawler

2023-08-10 02:28:15

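Routing a Python crawler through a proxy hides your real IP address from the target site, which reduces the risk of being banned or rate-limited. The following example shows how to set a proxy: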

import requests

proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}
response = requests.get('http://www.example.com', proxies=proxies)

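The keys http and https in the proxies dictionary select the proxy used for HTTP and HTTPS requests respectively, and each value is the proxy server's address and port. Pass the dictionary to requests through the proxies parameter when making a request. Here the proxy server address is 127.0.0.1 and the port is 8888; replace them with the address and port of your own proxy server.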
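As a small illustration of the dictionary layout described above, the sketch below builds the mapping from a host and a port. The helper name build_proxies is hypothetical and is not part of requests; only the shape of the resulting dictionary comes from the example above.

```python
def build_proxies(host, port, scheme="http"):
    """Return a requests-style proxies mapping for a single proxy server.

    build_proxies is a hypothetical helper used only for illustration.
    """
    proxy_url = f"{scheme}://{host}:{port}"
    # Route both plain HTTP and HTTPS traffic through the same proxy
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("127.0.0.1", 8888)
print(proxies)
```

The result can be passed directly, for example requests.get(url, proxies=proxies, timeout=10), where timeout keeps a dead proxy from blocking the request indefinitely.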

Writing the Crawler

Requirements

Build a general-purpose crawler that fetches every page of results for a GitHub search keyword.
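In URL terms, "every page of results for a keyword" means varying the p (page) and q (query) parameters that the crawler in this article sends to the GitHub search page. The sketch below builds such URLs with the standard library so that keywords containing spaces or special characters are encoded safely. The names SEARCH_URL and build_search_url are hypothetical.

```python
from urllib.parse import urlencode

# Search endpoint used by the crawler in this article
SEARCH_URL = "https://github.com/search"


def build_search_url(keyword, page):
    """Build the search URL for one result page, encoding the keyword."""
    return SEARCH_URL + "?" + urlencode({"p": page, "q": keyword})


print(build_search_url("hexo", 1))
print(build_search_url("hexo theme", 2))
```

Building the query string with urlencode avoids mixing % formatting and str.format on the same template, which breaks when the keyword itself contains a % or brace character.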

Code

First, start your proxy client and set its HTTP port in the client's settings.


In the crawler, change the port in proxies to match the system proxy you configured:

import requests
from lxml import html
import time

etree = html.etree


def githubSpider(keyword, pageNumberInit):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.62',
    }
    # Page number to start from
    pageNum = pageNumberInit
    # Generic URL template: %d is the page number, {} is the search keyword
    url = 'https://github.com/search?p=%d&q={}'.format(keyword)
    # Change the port to match your proxy configuration
    proxies = {'http': 'http://127.0.0.1:1087', 'https': 'http://127.0.0.1:1087'}
    while pageNum:
        # URL of the current page
        new_url = url % pageNum
        # Fetch the whole page behind this URL
        response = requests.get(url=new_url, proxies=proxies, headers=headers)
        status_code = response.status_code  # HTTP status code
        if status_code == 404:  # past the last page
            print("===================================================")
            print("Done")
            return
        if status_code == 429:  # too many requests: retry the same page
            print("Retrying page " + str(pageNum) + " ...")
            time.sleep(5)  # pause before retrying; 5 seconds is an assumed value
        if status_code == 200:  # page fetched successfully
            print("===================================================")
            print("Page " + str(pageNum) + ": " + new_url)
            print("Status code: " + str(status_code))
            print("===================================================")
            page_text = response.text
            tree = etree.HTML(page_text)
            # This XPath depends on GitHub's page markup and may need updating
            li_list = tree.xpath('//*[@id="js-pjax-container"]/div/div[3]/div/ul/li')
            for li in li_list:
                name = li.xpath('.//a[@class="v-align-middle"]/@href')[0].split('/', 1)[1]
                link = 'https://github.com' + li.xpath('.//a[@class="v-align-middle"]/@href')[0]
                # Repositories without stars have no star link, so fall back to 0
                try:
                    stars = li.xpath('.//a[@class="Link--muted"]/text()')[1].replace('\n', '').replace(' ', '')
                except IndexError:
                    print("Name: " + name + "\tLink: " + link + "\tstars:" + str(0))
                else:
                    print("Name: " + name + "\tLink: " + link + "\tstars:" + stars)
            pageNum = pageNum + 1


if __name__ == '__main__':
    githubSpider("hexo", 1)  # search keyword and starting page number
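The crawler above requests the same page again whenever the server answers 429 (too many requests). One common way to space out those retries is exponential backoff, sketched below. This is an illustrative pattern rather than part of the original crawler: fetch_with_backoff, max_retries, and base_delay are hypothetical names, and the delay values are assumptions.

```python
import time


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or retries run out.

    fetch is any zero-argument callable returning an object with a
    status_code attribute (for example a requests.Response).
    """
    response = fetch()
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        # Wait 1x, 2x, 4x, ... base_delay before trying again
        sleep(base_delay * 2 ** attempt)
        response = fetch()
    return response
```

Inside the crawl loop this could wrap the request, for example fetch_with_backoff(lambda: requests.get(new_url, proxies=proxies, headers=headers)). The sleep parameter exists so the waiting can be replaced in tests.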

For each search result, the crawler prints the repository name, its link, and its star count.


Postscript

A simple test that fetches a public site through the proxy and prints the status code:

import requests

# Configure the proxy
# Proxy API extraction link: http://jshk.com.cn/mb/reg.asp?kefu=xjy
proxies = {'http': 'http://127.0.0.1:1087', 'https': 'http://127.0.0.1:1087'}
response = requests.get('https://www.baidu.com/', proxies=proxies)
print(response.status_code)
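If the proxy is not running, the test above fails with an exception instead of printing a status code. The sketch below shows one way to turn that into a simple check. It deliberately uses the standard library's urllib instead of requests so that it has no third-party dependency; check_proxy is a hypothetical helper name and the 5-second timeout is an assumed default.

```python
import urllib.error
import urllib.request


def check_proxy(url, proxy_url, timeout=5):
    """Return the HTTP status code fetched through proxy_url, or None on failure."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 429
        return err.code
    except OSError:
        # URLError, timeouts, and refused connections are all OSError subclasses
        return None
```

A call such as check_proxy("https://www.baidu.com/", "http://127.0.0.1:1087") returns a status code when the proxy is reachable and None when it is not.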

