Hands-On Image Scraping with Python

2023-08-09 12:33:35

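The scraped result looks like this (don't ask why the image is split into two and carries a watermark that can't be removed; how else would the site make money):

The page being scraped is on 千图网 (58pic). As an aside, some of its images are split into two halves: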

https://www.58pic.com/piccate/10-0-0-p1.html

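First open the browser developer tools with F12 to see where the content you need sits in the page and what its markup looks like. On the list page, the value needed from each image entry is href="//www.58pic.com/sucai/19381028.html":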
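As a small illustration of that lookup, the sketch below collects the href values from anchors with class thumb-box and target="_blank". It uses only the standard library's html.parser instead of BeautifulSoup, so it can be tried without installing anything. The sample_html fragment and the ThumbLinkParser class are hypothetical, modelled on the attributes quoted above rather than copied from the live page.

```python
from html.parser import HTMLParser


class ThumbLinkParser(HTMLParser):
    """Collect href values from <a class="thumb-box" target="_blank"> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "thumb-box" in classes and attr.get("target") == "_blank" and attr.get("href"):
            self.links.append(attr["href"])


# Hypothetical list-page fragment; only the first href comes from the article.
sample_html = '''
<a class="thumb-box" target="_blank" href="//www.58pic.com/sucai/19381028.html">thumb</a>
<a class="thumb-box" href="//www.58pic.com/ignored.html">no target attribute</a>
<a class="other" target="_blank" href="//www.58pic.com/other.html">different class</a>
'''

parser = ThumbLinkParser()
parser.feed(sample_html)
thumb_links = parser.links
print(thumb_links)
```

Only the first anchor satisfies both conditions, so the other two are skipped.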



On the detail page, the image link needed is the src of the element with class="show-area-pic", for example src="//preview.qiantucdn.com/58picmark/back_origin_pic/19/38/10/28M58PICdRdmHaG8694atMaRk.JPG!w1024_small":
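Both links quoted so far start with // and carry no scheme, so they have to be turned into full URLs before they can be requested. One possible way to do that, sketched here with the standard library's urljoin, is to resolve each link against the page it was found on. The helper name absolute_url is hypothetical, and whether the image host actually serves https is an assumption that would need checking against the site.

```python
from urllib.parse import urljoin

LIST_PAGE = "https://www.58pic.com/piccate/10-0-0-p1.html"


def absolute_url(base, link):
    """Resolve a scheme-less (//host/path) or relative link against base."""
    return urljoin(base, link)


# The detail-page link inherits the scheme of the list page.
detail_url = absolute_url(LIST_PAGE, "//www.58pic.com/sucai/19381028.html")

# The image link inherits the scheme of the detail page.
image_url = absolute_url(
    detail_url,
    "//preview.qiantucdn.com/58picmark/back_origin_pic/19/38/10/"
    "28M58PICdRdmHaG8694atMaRk.JPG!w1024_small",
)

print(detail_url)
print(image_url)
```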

Remember to create a folder named picture to hold the downloaded images.

The scraping code is as follows:
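If you would rather not create the folder by hand, it can also be created from the script. A minimal sketch with pathlib follows; the helper name ensure_dir is hypothetical, and the folder is created relative to the directory the script is run from.

```python
from pathlib import Path


def ensure_dir(path):
    """Create the folder if it is missing and return it as a Path."""
    folder = Path(path)
    # exist_ok=True makes repeated runs harmless.
    folder.mkdir(parents=True, exist_ok=True)
    return folder


picture_dir = ensure_dir("picture")
print(picture_dir.resolve())
```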

import os
import re
import requests
from bs4 import BeautifulSoup  # the 'lxml' parser used below needs the lxml package installed


def test_pachong():
    url = "https://www.58pic.com/piccate/10-0-0-p1.html"
    headers = {'user-agent': 'my-test/0.0.1',
               'Referer': 'https://www.58pic.com/piccate/10-0-0-p1.html'}
    # Pass headers by keyword: the second positional argument of requests.get is params.
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'lxml')
    infoData = soup.find_all(name='a', attrs={'class': 'thumb-box'})
    for tag in infoData:
        picinfo = str(tag)
        if 'target="_blank"' in picinfo:
            pic_url = re.findall('href="(.+?)"', picinfo)[0]
            try:
                get_picInfo(pic_url)
            except Exception as e:
                print('Failed URL: {} ({})'.format(pic_url, e))


def get_picInfo(pic_url):
    headers = {'user-agent': 'my-test/0.0.1',
               'Referer': 'https://www.58pic.com/piccate/10-0-0-p1.html'}
    url = "https:" + pic_url
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'lxml')
    infoData = soup.find_all(name='img', attrs={'class': 'show-area-pic'})
    num = len(infoData)
    for j in range(num):
        info = str(infoData[j])
        title = re.findall('title="(.+?)"', info)[0]
        src = re.findall('src="(.+?)"', info)[0]
        img_url = "http:" + src
        # More than one match means the site split the image into parts.
        # The parts are not stitched together yet; parts after the first
        # get their index appended to the file name.
        if num > 1 and j > 0:
            img_title = title + str(j) + '.jpg'
        else:
            img_title = title + '.jpg'
        print(img_title, '--url--', img_url)
        save_img(os.path.join('picture', img_title),
                 requests.get(img_url, headers=headers).content)


# save_img (shown below) must be defined before this call runs.
test_pachong()

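The last step is saving the image bytes fetched from the URL, that is, requests.get(img_url, headers=headers).content. In a single script, this function has to be defined before test_pachong() is called. The code is as follows: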

def save_img(file_name, img):
    '''Write the image bytes to file_name.'''
    with open(file_name, 'wb') as f:
        f.write(img)
    print('Downloaded {}'.format(file_name))
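One thing to watch for: the file name is built from the image's title attribute, and a title may contain characters that are not allowed in file names on some systems (Windows rejects \ / : * ? " < > |). The sketch below is one possible way to build a safe name before calling save_img. The helper safe_filename is hypothetical and not part of the original script; it follows the script's naming scheme, where parts after the first get their index appended.

```python
import re

# Characters that Windows does not allow in file names.
_UNSAFE = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, part=0, ext=".jpg"):
    """Build a file name from a title, replacing unsafe characters with '_'.

    part=0 is the first (or only) image; later parts get their index appended.
    """
    cleaned = _UNSAFE.sub("_", title).strip() or "untitled"
    suffix = str(part) if part else ""
    return f"{cleaned}{suffix}{ext}"


print(safe_filename("a/b:c?"))
print(safe_filename("poster", part=1))
```

The result can then be combined with the folder name, for example os.path.join('picture', safe_filename(title, j)).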
