Python 3 Web Scraping: The urllib Module

Posted 2019-05-02 | Updated 2020-07-14 | Category: Python backend

Differences between urllib in Python 2 and Python 3

Python 2 had two libraries for sending requests, urllib and urllib2. Python 3 folded urllib2 into the urllib package, so before using urllib in Python 3 you should know its four modules:

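request: the most basic HTTP request module, used to simulate sending requests.
error: the exception handling module; catching its exceptions keeps the program from terminating unexpectedly.
parse: a utility module with URL handling functions such as splitting, parsing, and encoding.
robotparser: parses a site's robots.txt file to determine which pages may be crawled; rarely needed.
This post covers only the first three.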
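The parse module only appears in passing later on, so here is a minimal sketch of the helpers it offers. The URL and the sample values are chosen purely for illustration; nothing is sent over the network.

```python
from urllib import parse

# Split a URL into its components.
parts = parse.urlparse('http://httpbin.org/get?word=hello#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Turn a dict into a query string; urlencode() writes spaces as '+'.
query = parse.urlencode({'word': 'hello world'})
print(query)

# Percent-encode a single value; quote() writes spaces as '%20'.
quoted = parse.quote('hello world')
print(quoted)

# unquote() reverses the percent-encoding.
print(parse.unquote(quoted))
```

Note the two different encodings of a space: urlencode() targets form data and query strings, while quote() is the general percent-encoding helper.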

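Sending requests

1. The urlopen() function
The urllib.request module provides urlopen(), the most basic way to build an HTTP request. It simulates a browser sending a request, and the module can also handle authentication, redirects, cookies, and so on.
Using Baidu as an example, let's fetch a page: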

from urllib import request

response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf8'))

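As you can see, these three lines already make a simple crawler. It does no processing of what it fetches, so we simply see the source of Baidu's home page.
Next, check the type of the returned object.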
print(type(response))
Output: <class 'http.client.HTTPResponse'>
It is an HTTPResponse object. Next, list its methods and attributes with dir(response):
Output:

['__abstractmethods__', '__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_abc_impl', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_check_close', '_close_conn', '_get_chunk_left', '_method', '_peek_chunked', '_read1_chunked', '_read_and_discard_trailer', '_read_next_chunk_size', '_read_status', '_readall_chunked', '_readinto_chunked', '_safe_read', '_safe_readinto', 'begin', 'chunk_left', 'chunked', 'close', 'closed', 'code', 'debuglevel', 'detach', 'fileno', 'flush', 'fp', 'getcode', 'getheader', 'getheaders', 'geturl', 'headers', 'info', 'isatty', 'isclosed', 'length', 'msg', 'peek', 'read', 'read1', 'readable', 'readinto', 'readinto1', 'readline', 'readlines', 'reason', 'seek', 'seekable', 'status', 'tell', 'truncate', 'url', 'version', 'will_close', 'writable', 'write', 'writelines']

The most important methods are read(), readinto(), getheader(name), getheaders(), and fileno().
Calling read() returns the page content as bytes, and the status attribute holds the response status code.
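To see these members in action without depending on an external site, the sketch below starts a throwaway local server and fetches from it. The handler class, the page body, and the 5-second timeout are assumptions made for the demo; against a real site you would simply pass its URL to urlopen().

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a small UTF-8 page."""

    def do_GET(self):
        body = '<html><body>hello</body></html>'.encode('utf8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the OS for any free port.
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

with request.urlopen(url, timeout=5) as response:
    status = response.status                              # status code
    content_type = response.getheader('Content-Type')     # one header
    header_names = [name for name, _ in response.getheaders()]
    html = response.read().decode('utf8')                 # body as text

server.shutdown()
server.server_close()

print(status, content_type)
print(header_names)
print(html)
```

Using the response as a context manager closes the connection once the block ends, which is worth doing in any crawler that makes many requests.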
At its most basic, urlopen() fetches a simple page with a GET request. But what if we need to pass it additional parameters? The next sections explain how.

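The data parameter
First, note that data is optional. If you pass it, it must be a byte stream, that is, of type bytes, so the content has to be converted first. The request method is then POST instead of GET: the data travels in the request body rather than being exposed in the URL, as GET parameters would be.
Example 1: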

from urllib import request, parse

data = bytes(parse.urlencode({'word': 'hello world'}), encoding='utf8')
response = request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf8'))

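Here we pass one parameter, word, with the value hello world. Because data must be of type bytes, we use urlencode() from the parse utility module mentioned above to turn the dictionary into a query string, then convert that string to bytes. After running the code, the server returns the following: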

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "word": "hello world"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "16",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7"
  },
  "json": null,
  "origin": "115.60.58.52",
  "url": "http://httpbin.org/post"
}

The response contains a form field whose content is exactly the data we passed in. We have thus simulated a form submission, transmitting the data with a POST request.
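To make the conversion and the GET/POST switch concrete, here is a small offline sketch. Building a Request object does not send anything, so get_method() can be used to check which method urllib would pick. The variable names are illustrative only.

```python
from urllib import parse, request

form = {'word': 'hello world'}

# urlencode() produces a str; encode() turns it into the bytes that data needs.
data = parse.urlencode(form).encode('utf8')
print(data)
print(len(data))  # the Content-Length the server reports for this body

# With data the method defaults to POST, without it to GET.
with_data = request.Request('http://httpbin.org/post', data=data)
without_data = request.Request('http://httpbin.org/get')
print(with_data.get_method())
print(without_data.get_method())

# To send the same parameters with GET, append them to the URL instead.
get_url = 'http://httpbin.org/get?' + parse.urlencode(form)
print(get_url)
```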

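The timeout parameter
As the name suggests, this parameter sets a timeout, in seconds: if no response has arrived once that time is exceeded, an exception is raised. If it is not specified, the global default timeout is used.
Example: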

from urllib import request

response = request.urlopen('http://httpbin.org/get', timeout=0.1)
print(response.read().decode('utf8'))

Output:

Traceback (most recent call last):
  File "test_urllib_001.py", line 14, in <module>
    response = request .urlopen('http://httpbin.org/get', timeout=0.1)
  File "/home/rain/.pyenv/versions/3.7.0/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/home/rain/.pyenv/versions/3.7.0/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/home/rain/.pyenv/versions/3.7.0/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/home/rain/.pyenv/versions/3.7.0/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/home/rain/.pyenv/versions/3.7.0/lib/python3.7/urllib/request.py", line 1345, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/home/rain/.pyenv/versions/3.7.0/lib/python3.7/urllib/request.py", line 1319, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error timed out>

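Here the timeout is set to 0.1 seconds. After 0.1 seconds the server still had not responded, so a URLError was raised. The message <urlopen error timed out> shows that the cause was a network timeout. So how do we catch this case?
Example: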

from urllib import request, error
import socket

try:
    response = request.urlopen('http://httpbin.org/get', timeout=0.1)
except error.URLError as e:
    # e.reason holds the underlying exception; check whether it was a timeout
    if isinstance(e.reason, socket.timeout):
        print('Time Out')

We catch the exception with try: … except: …, and import the socket module so that e.reason can be checked against socket.timeout.
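One caveat worth sketching: a timeout that happens while reading the response, after the connection is established, can surface as socket.timeout directly instead of being wrapped in URLError. The helper names is_timeout() and fetch() below are hypothetical, and the checks at the bottom use hand-built exceptions so the block runs offline.

```python
import socket
from urllib import error, request


def is_timeout(exc):
    """Return True if exc represents a timeout from urlopen() or read()."""
    if isinstance(exc, error.URLError):
        # urlopen() wraps connection-phase errors; the cause is in .reason
        return isinstance(exc.reason, socket.timeout)
    return isinstance(exc, socket.timeout)


def fetch(url, timeout=5.0):
    """Hypothetical helper: return the body as text, or None on timeout."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf8')
    except (error.URLError, socket.timeout) as exc:
        if is_timeout(exc):
            return None
        raise  # not a timeout: let the caller see the real error


# Offline checks with hand-built exceptions:
wrapped = is_timeout(error.URLError(socket.timeout('timed out')))
bare = is_timeout(socket.timeout('timed out'))
other = is_timeout(error.URLError('connection refused'))
print(wrapped, bare, other)
```

Re-raising anything that is not a timeout keeps unrelated failures, such as a DNS error, from being silently swallowed.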

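2. Request
We now know how to send a simple request with urlopen(). Its few parameters are not always enough, though. For example, we may need to add headers to the request. For that we need something more powerful: the Request class covered in this section.
An example shows it best.
Example: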

from urllib import request

req = request.Request('https://baidu.com')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

We still call urlopen() as before, but the argument is now a Request object rather than a URL. A Request lets us configure whatever parameters we need. Here is its constructor:
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
The parameters:

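url: the URL to request; required.
data: a byte stream holding the data to send with the request.
headers: the request headers, as a dictionary; after the instance is created, more can be added with add_header().
origin_req_host: the host name or IP address of the requester.
unverifiable: whether the request is unverifiable; defaults to False. An unverifiable request is one whose URL the user had no option to approve.
method: the HTTP method to use, such as GET or POST.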
The following example builds a request from several parameters:

from urllib import request, parse

url = 'http://httpbin.org/post'

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'
}
# Named form rather than dict so the built-in dict type is not shadowed
form = {
    'name': "spider_road"
}

data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf8'))

We built this request from four arguments. The output is:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "spider_road"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "16",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36"
  },
  "json": null,
  "origin": "115.60.58.52",
  "url": "http://httpbin.org/post"
}

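Handling exceptions

The previous section touched briefly on handling exceptions. Other errors can still occur while a program runs, so proper exception handling is well worth the effort.
1. URLError
The URLError class comes from urllib's error module and inherits from OSError. It is the base class for the exceptions the request module raises when a request fails, so catching it covers them.
Example: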

from urllib import request, error

try:
    response = request.urlopen('https://spider-road.com')
except error.URLError as e:
    print(e.reason)

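The program above tries to open a URL that does not exist, and try: … except: … catches the exception.
Output: [Errno -2] Name or service not known. The exception is handled, so the program does not terminate abnormally.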
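A short sketch of where URLError sits in the exception hierarchy, and of the fact that its reason attribute may be either a plain string or another exception instance. The two exceptions here are built by hand for illustration, so the block runs offline.

```python
from urllib import error

# URLError sits between OSError and HTTPError in the hierarchy.
print(issubclass(error.URLError, OSError))
print(issubclass(error.HTTPError, error.URLError))

samples = (
    error.URLError('unknown url type'),
    error.URLError(ConnectionRefusedError(111, 'Connection refused')),
)

reason_types = []
for exc in samples:
    try:
        raise exc
    except OSError as e:  # a broader except clause also catches URLError
        reason_types.append(type(e.reason).__name__)
        print(type(e).__name__, '|', type(e.reason).__name__, '|', e.reason)
```

Because reason is not always a string, code that inspects it (as the timeout example does with isinstance) is safer than code that compares it to fixed text.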

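2. HTTPError
HTTPError is a subclass of URLError that handles HTTP request errors, such as a failed authentication request. It has the following three attributes:

code: the HTTP status code
reason: the cause of the error
headers: the response headers

Example: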

from urllib import request, error

try:
    response = request.urlopen('https://spider-road.com')
except error.HTTPError as e:
    print(f'code:{e.code}\nreason:{e.reason}\nheaders:{e.headers}')
except error.URLError as e:
    print(e.reason)

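Because the site does not exist, the request never receives an HTTP response, so the URLError branch runs and prints: [Errno 110] Connection timed out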
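To actually reach the HTTPError branch, the server has to answer with an error status. The sketch below assumes a throwaway local server that replies 404 to everything; the handler class and the check() helper are hypothetical names used only for this demo.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def check(url):
    """Return a short description of how the request to url ended."""
    try:
        with request.urlopen(url, timeout=5) as response:
            return 'ok %d' % response.status
    except error.HTTPError as e:   # the subclass must come first
        return 'http %d %s' % (e.code, e.reason)
    except error.URLError as e:    # then the general case
        return 'url %s' % e.reason


# Port 0 asks the OS for any free port.
server = HTTPServer(('127.0.0.1', 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
result = check('http://127.0.0.1:%d/missing' % server.server_address[1])
server.shutdown()
server.server_close()
print(result)
```

The order of the except clauses matters: since HTTPError is a subclass of URLError, putting URLError first would catch every HTTPError too and the more specific branch would never run.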
That covers the basics of sending requests and handling errors with urllib; more advanced usage will follow in later posts. Subscribe to 爬虫之道 to follow along.