Python Web Scraping in Practice: Downloading All Images from a Website (Part 1)

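Task

Read a URL interactively and download every PNG image on that page.
To download other image types, only the regular expression needs to change.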
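
As a rough illustration of changing only the regular expression, the sketch below accepts several extensions at once. It assumes the images still appear in the same src="..." alt layout; the extension list, the sample HTML, and the name IMG_RE are hypothetical. It also uses [^"]+? instead of .+? so that a match cannot run past the closing quote.

```python
import re

# Hypothetical variant of the article's pattern: several extensions,
# and [^"]+? so a match stays inside one src="..." attribute.
IMG_RE = re.compile(r'src="([^"]+?\.(?:png|jpg|jpeg|gif))" alt')

# Made-up sample markup, only for demonstration.
sample = (
    '<img src="http://example.com/a.jpg" alt="a">'
    '<img src="http://example.com/b.gif" alt="b">'
    '<img src="http://example.com/c.svg" alt="c">'
)

found = IMG_RE.findall(sample)
print(found)
```

The .svg entry is skipped because its extension is not in the list.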

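Approach

Read the HTML source of the page.
Write a regular expression based on that source and use it to match the image URLs.
Save the images to local disk.
Read the URL interactively; if nothing is entered, fall back to the default URL (http://findicons.com/pack/2787/beautiful_flat_icons).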
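
The fallback in the last step can be sketched as a small function, which makes it easy to check without typing anything. The name choose_url is hypothetical, and stripping whitespace from the input is an added assumption, not something the original script does.

```python
DEFAULT_URL = 'http://findicons.com/pack/2787/beautiful_flat_icons'

def choose_url(user_input, default=DEFAULT_URL):
    """Return the stripped user input, or the default when it is blank."""
    cleaned = user_input.strip()
    return cleaned or default

print(choose_url(''))                              # blank input: default is used
print(choose_url('  http://example.com/icons  '))  # surrounding spaces removed
```

In the interactive script this would be called as choose_url(input()).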

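Key points

Write the regular expression from the page source: reg=r'src="(.+?\.png)" alt'
To speed up repeated matching, compile the pattern first and then match with the compiled object: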
imgre=re.compile(reg)
imglist=imgre.findall(html)
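
The pattern used in the source code escapes the dot before png (\.png). A small sketch of why that matters, using made-up HTML: an unescaped dot matches any character, so a value such as b-png is accepted as well.

```python
import re

# Made-up markup: the second src does not actually end in ".png".
html = '<img src="/icons/a.png" alt="a"><img src="/icons/b-png" alt="b">'

loose = re.compile(r'src="(.+?.png)" alt')    # "." matches any character
strict = re.compile(r'src="(.+?\.png)" alt')  # "\." matches a literal dot

loose_hits = loose.findall(html)
strict_hits = strict.findall(html)
print(loose_hits)
print(strict_hits)
```

Only the escaped pattern limits the result to /icons/a.png.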

Source code

import os
import re
import urllib.request

# Open the page and read its source
def getHtml(url):
    page=urllib.request.urlopen(url)
    html=page.read()
    return html.decode('UTF-8')

# Use a regular expression to locate every image, then download each one
def getImag(html):
    reg=r'src="(.+?\.png)" alt'
    imgre=re.compile(reg)  # compile the pattern before matching
    imglist=imgre.findall(html)  # collect every matching image URL
    # Save into the target directory, creating it if needed
    x=0
    path="e:\\test"
    if not os.path.isdir(path):
        os.makedirs(path)
    for imgurl in imglist:
        # The second argument is the local file name: 0.png, 1.png, ...
        urllib.request.urlretrieve(imgurl,os.path.join(path,'{}.png'.format(x)))
        x=x+1
    return imglist

# Main script
print('---------Image scraper----------')
print("Please enter a URL")
url=input()
if not url:
    print('---------No URL entered, using the default URL----------')
    url='http://findicons.com/pack/2787/beautiful_flat_icons'
print('----------Fetching the page---------')
html_code=getHtml(url)
print('----------Downloading images---------')
print(getImag(html_code))
print('----------Download complete---------')
input('Press Enter to exit')
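
One thing to watch for: urlretrieve needs an absolute URL, while src attributes on many pages are relative. A minimal sketch of resolving them against the page URL with urljoin from the standard library; the helper name resolve_image_urls and the sample src values are hypothetical, not taken from the site.

```python
from urllib.parse import urljoin

def resolve_image_urls(page_url, srcs):
    """Turn each src value into an absolute URL relative to page_url."""
    return [urljoin(page_url, src) for src in srcs]

page = 'http://findicons.com/pack/2787/beautiful_flat_icons'
# Hypothetical src values: root-relative, already absolute, and relative.
srcs = ['/files/icons/2787/a.png', 'http://example.com/b.png', 'c.png']

resolved = resolve_image_urls(page, srcs)
for u in resolved:
    print(u)
```

Absolute URLs pass through unchanged, so such a helper could be applied to the list returned by findall before downloading.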

Output

https://blog.csdn.net/Amy8020/article/details/88844309
