.. _urllib-howto:

***********************************************************
  HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************

:Author: `Michael Foord <https://agileabstractions.com/>`_


Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web resources
    with Python:

    * `Basic Authentication <https://web.archive.org/web/20201215133350/http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

        A tutorial on *Basic Authentication*, with examples in Python.

**urllib.request** is a Python module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string
before the ``":"`` in the URL - for example ``"ftp"`` is the URL scheme of
``"ftp://python.org/"``) using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib.request is as follows::

    import urllib.request
    with urllib.request.urlopen('http://python.org/') as response:
        html = response.read()

If you wish to retrieve a resource via URL and store it in a temporary
location, you can do so via the :func:`shutil.copyfileobj` and
:func:`tempfile.NamedTemporaryFile` functions::

    import shutil
    import tempfile
    import urllib.request

    with urllib.request.urlopen('http://python.org/') as response:
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            shutil.copyfileobj(response, tmp_file)

    with open(tmp_file.name) as html:
        pass
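
Since the example passes ``delete=False``, the temporary file is not removed
automatically when it is closed. A minimal cleanup sketch, using
:func:`os.remove` once you have finished with the file::

    import os

    # Remove the temporary file created above once we are done with it.
    os.remove(tmp_file.name)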

Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.).  However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::

    import urllib.request

    req = urllib.request.Request('http://python.org/')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that urllib.request makes use of the same Request interface to handle all URL
schemes.  For example, you can make an FTP request like so::

    req = urllib.request.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server.  Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers".  Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the :mod:`urllib.parse`
library. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')  # data should be bytes
    req = urllib.request.Request(url, data)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<https://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).

If you do not pass the ``data`` argument, urllib uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door).  Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.parse.urlencode(data)
    >>> print(url_values)  # The order may differ from below.  #doctest: +SKIP
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}
    headers = {'User-Agent': user_agent}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')
    req = urllib.request.Request(url, data, headers)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.


Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case of
HTTP URLs.

The exception classes are exported from the :mod:`urllib.error` module.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist.  In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> req = urllib.request.Request('http://www.pretend_server.org')
    >>> try: urllib.request.urlopen(req)
    ... except urllib.error.URLError as e:
    ...     print(e.reason)      #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100--299 range indicate success, you will usually only see error
codes in the 400--599 range.

:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }

When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response
object for the page returned. This means that as well as the code attribute, it
also has the read, geturl, and info methods provided by the
``urllib.response`` module::

    >>> req = urllib.request.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.HTTPError as e:
    ...     print(e.code)
    ...     print(e.read())  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
      ...
      <title>Page Not Found</title>\n
      ...

Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there are two
basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        # everything is fine
        the_page = response.read()


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.

Number 2
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
    else:
        # everything is fine
        the_page = response.read()


info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods, :meth:`info` and :meth:`geturl`, and is defined in the
:mod:`urllib.response` module.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <https://jkorpela.fi/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
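
As a minimal sketch of both methods (fetching python.org here is an arbitrary
choice of example URL)::

    import urllib.request

    with urllib.request.urlopen('http://www.python.org/') as response:
        # geturl() gives the final URL, after any redirects were followed
        print(response.geturl())
        # info() returns an http.client.HTTPMessage, which supports
        # dictionary-style access to the response headers
        headers = response.info()
        print(headers['Content-Type'])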


Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly named :class:`urllib.request.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call.  ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.
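
For example, here is a minimal sketch of a cookie-handling opener built with
``build_opener`` and the standard ``HTTPCookieProcessor`` handler (the URL is
just a placeholder)::

    import http.cookiejar
    import urllib.request

    # HTTPCookieProcessor stores cookies from responses in the jar and
    # sends them back to the server on later requests through this opener.
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))

    with opener.open('http://www.example.com/') as response:
        html = response.read()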

Other sorts of handlers you might want can handle proxies, authentication,
and other common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
URLs in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.
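
A short sketch of the two styles, using an opener built with only the default
handlers (``example.com`` is a placeholder)::

    import urllib.request

    opener = urllib.request.build_opener()

    # Either call the opener's open method directly...
    with opener.open('http://www.example.com/') as response:
        html = response.read()

    # ...or install it globally, so that plain urlopen() uses it from now on.
    urllib.request.install_opener(opener)
    with urllib.request.urlopen('http://www.example.com/') as response:
        html = response.read()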


Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
including an explanation of how Basic Authentication works -- see the `Basic
Authentication Tutorial
<https://web.archive.org/web/20201215133350/http://www.voidspace.org.uk/python/articles/authentication.shtml>`__.

When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication.  This specifies the authentication scheme
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
realm="REALM"``.

e.g.

.. code-block:: none

    WWW-Authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use an
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL. This will be supplied
in the absence of you providing an alternative combination for a specific
realm. We indicate this by providing ``None`` as the realm argument to the
``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to ``add_password()`` will also match. ::

    import urllib.request

    # create a password manager
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of None.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib.request.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib.request.urlopen use our opener.
    urllib.request.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    -- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
    environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``DataHandler``, ``HTTPErrorProcessor``.

``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. ``"http://example.com/"`` *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. ``"example.com"`` or ``"example.com:8080"``
(the latter example includes a port number).  The authority, if present, must
NOT contain the "userinfo" component - for example ``"joe:[email protected]"`` is
not correct.
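
As a brief sketch of the two accepted forms (reusing the ``password_mgr``
object from above; the credentials are placeholders)::

    # Full URL form: matches this URL and URLs "deeper" than it.
    password_mgr.add_password(None, "http://example.com/foo/", "user", "secret")

    # Authority form: hostname, optionally with a port, and no userinfo.
    password_mgr.add_password(None, "example.com:8080", "user", "secret")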


Proxies
=======

**urllib** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected.  Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to disable automatic proxy handling is
to set up our own ``ProxyHandler``, with no proxies defined. This is done using
similar steps to setting up a `Basic Authentication`_ handler: ::

    >>> proxy_support = urllib.request.ProxyHandler({})
    >>> opener = urllib.request.build_opener(proxy_support)
    >>> urllib.request.install_opener(opener)

.. note::

    Older versions of ``urllib.request`` (and the Python 2 ``urllib2`` module
    it descends from) did not support fetching ``https`` locations through a
    proxy; there, it could be enabled by extending urllib.request as shown in
    the recipe [#]_.

.. note::

    ``HTTP_PROXY`` will be ignored if a variable ``REQUEST_METHOD`` is set; see
    the documentation on :func:`~urllib.request.getproxies`.


Sockets and Layers
==================

The Python support for fetching resources from the web is layered.  urllib uses
the :mod:`http.client` library, which in turn uses the socket library.

You can specify how long a socket should wait for a response before timing
out. This can be useful in applications which have to fetch web pages. By
default the socket module has *no timeout* and can hang. ``urlopen`` accepts an
optional *timeout* argument to set a timeout for an individual request.
Alternatively, you can set the default timeout globally for all sockets using ::

    import socket
    import urllib.request

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)
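
As a short sketch of the per-request alternative (the URL and the ten second
value are arbitrary)::

    import urllib.request

    # The timeout argument applies only to this request; if it expires, a
    # timeout error (possibly wrapped in URLError) is raised.
    with urllib.request.urlopen('http://www.example.com/', timeout=10.0) as response:
        html = response.read()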


-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] Google for example.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib from using
       the proxy.
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <https://code.activestate.com/recipes/456195/>`_.
