.. _urllib-howto:

***********************************************************
  HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************

:Author: `Michael Foord <https://agileabstractions.com/>`_


Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web resources
    with Python:

    * `Basic Authentication <https://web.archive.org/web/20201215133350/http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

      A tutorial on *Basic Authentication*, with examples in Python.

**urllib.request** is a Python module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the
string before the ``":"`` in the URL - for example ``"ftp"`` is the URL scheme of
``"ftp://python.org/"``) using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib.request is as follows::

    import urllib.request
    with urllib.request.urlopen('http://python.org/') as response:
        html = response.read()
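Note that ``read()`` returns :class:`bytes`, not text. If what you want is a
string, you must decode it yourself. Here is a minimal sketch that trusts the
charset the server declares in its ``Content-Type`` header; the UTF-8 fallback
for servers that declare no charset is an assumption, not part of the
protocol::

    import urllib.request

    with urllib.request.urlopen('http://python.org/') as response:
        raw = response.read()
        # get_content_charset() returns the charset parameter of the
        # Content-Type header, or None if the server did not send one
        encoding = response.headers.get_content_charset() or 'utf-8'
        html = raw.decode(encoding)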
If you wish to retrieve a resource via URL and store it in a temporary
location, you can do so via the :func:`shutil.copyfileobj` and
:func:`tempfile.NamedTemporaryFile` functions::

    import shutil
    import tempfile
    import urllib.request

    with urllib.request.urlopen('http://python.org/') as response:
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            shutil.copyfileobj(response, tmp_file)

    with open(tmp_file.name) as html:
        pass

Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which
represents the HTTP request you are making. In its simplest form you create a
Request object that specifies the URL you want to fetch. Calling ``urlopen``
with this Request object returns a response object for the URL requested. This
response is a file-like object, which means you can, for example, call
``.read()`` on the response::

    import urllib.request

    req = urllib.request.Request('http://python.org/')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that urllib.request makes use of the same Request interface to handle all
URL schemes. For example, you can make an FTP request like so::

    req = urllib.request.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server. Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers". Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the web.
Not all POSTs have to come from forms: you can use a POST to transmit arbitrary
data to your own application. In the common case of HTML forms, the data needs
to be encoded in a standard way, and then passed to the Request object as the
``data`` argument. The encoding is done using a function from the
:mod:`urllib.parse` library. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')  # data should be bytes
    req = urllib.request.Request(url, data)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<https://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).

If you do not pass the ``data`` argument, urllib uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door). Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.parse.urlencode(data)
    >>> print(url_values)  # The order may differ from below.  #doctest: +SKIP
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.
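:func:`urllib.parse.urlencode` also accepts values that are sequences (as a
multi-select form control would submit). By default the whole sequence is
quoted as one string; pass ``doseq=True`` to emit one key/value pair per
element instead::

    >>> import urllib.parse
    >>> urllib.parse.urlencode({'language': ['Python', 'C']})
    'language=%5B%27Python%27%2C+%27C%27%5D'
    >>> urllib.parse.urlencode({'language': ['Python', 'C']}, doseq=True)
    'language=Python&language=C'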
Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}
    headers = {'User-Agent': user_agent}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')
    req = urllib.request.Request(url, data, headers)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()
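You can also set headers on an existing ``Request`` with its
:meth:`~urllib.request.Request.add_header` method. A sketch of a GET variant
of the request above (same placeholder URL)::

    import urllib.request

    req = urllib.request.Request('http://www.someserver.com/cgi-bin/register.cgi')
    # add_header replaces any previously set header with the same name
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()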
The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.


Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case
of HTTP URLs.

The exception classes are exported from the :mod:`urllib.error` module.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist. In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> req = urllib.request.Request('http://www.pretend_server.org')
    >>> try: urllib.request.urlopen(req)
    ... except urllib.error.URLError as e:
    ...     print(e.reason)      #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100--299 range indicate success, you will usually only see error
codes in the 400--599 range.

:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }
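If you just need the short message for a status code at run time, you don't
have to copy this table: the stdlib exposes it as the
:data:`http.client.responses` mapping and, in more structured form, as the
:class:`http.HTTPStatus` enum::

    >>> import http.client
    >>> http.client.responses[404]
    'Not Found'
    >>> from http import HTTPStatus
    >>> HTTPStatus(404).phrase
    'Not Found'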
When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response on
the page returned. This means that as well as the code attribute, it also has
read, geturl, and info methods as returned by the ``urllib.response`` module::

    >>> req = urllib.request.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.HTTPError as e:
    ...     print(e.code)
    ...     print(e.read())  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
    ...
    <title>Page Not Found</title>\n
    ...

Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there
are two basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print("The server couldn't fulfill the request.")
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        # everything is fine
        the_page = response.read()

.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.

Number 2
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'code'):
            print("The server couldn't fulfill the request.")
            print('Error code: ', e.code)
        elif hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
    else:
        # everything is fine
        the_page = response.read()

.. note::

    Check for ``code`` first: an :exc:`HTTPError` carries *both* attributes
    (its ``reason`` is just the HTTP reason phrase), so testing ``reason``
    first would misreport HTTP errors as connection failures.


info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods, :meth:`info` and :meth:`geturl`, and is defined in the module
:mod:`urllib.response`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <https://jkorpela.fi/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
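For example -- a sketch only, since the exact values depend on how the server
is configured at the time you run it::

    import urllib.request

    with urllib.request.urlopen('http://www.python.org/') as response:
        print(response.geturl())   # e.g. 'https://www.python.org/' after a redirect
        print(response.info()['Content-Type'])   # e.g. 'text/html; charset=utf-8'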
Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly named :class:`urllib.request.OpenerDirector`). Normally we have been
using the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call. ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.

Other handlers you might want can deal with proxies, authentication, and other
common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
urls in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.
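For instance, here is a minimal sketch of an opener with cookie support, built
from :class:`http.cookiejar.CookieJar` and
:class:`urllib.request.HTTPCookieProcessor` (the URL is a placeholder)::

    import http.cookiejar
    import urllib.request

    # the CookieJar stores cookies set by the server and sends them back
    # on later requests made through this opener
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))

    with opener.open('http://www.example.com/') as response:
        the_page = response.read()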
Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
including an explanation of how Basic Authentication works - see the `Basic
Authentication Tutorial
<https://web.archive.org/web/20201215133350/http://www.voidspace.org.uk/python/articles/authentication.shtml>`__.

When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication. This specifies the authentication scheme
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
realm="REALM"``.

e.g.

.. code-block:: none

    WWW-Authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use
an ``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL. This will be supplied
in the absence of you providing an alternative combination for a specific
realm. We indicate this by providing ``None`` as the realm argument to the
``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to ``.add_password()`` will also match. ::

    # create a password manager
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of None.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib.request.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib.request.urlopen use our opener.
    urllib.request.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    -- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
    environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``DataHandler``, ``HTTPErrorProcessor``.

``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. ``"http://example.com/"`` *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. ``"example.com"`` or ``"example.com:8080"``
(the latter example includes a port number). The authority, if present, must
NOT contain the "userinfo" component - for example ``"joe:[email protected]"`` is
not correct.


Proxies
=======

**urllib** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected. Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to avoid the proxy is to set up our
own ``ProxyHandler``, with no proxies defined. This is done using similar steps
to setting up a `Basic Authentication`_ handler::

    >>> proxy_support = urllib.request.ProxyHandler({})
    >>> opener = urllib.request.build_opener(proxy_support)
    >>> urllib.request.install_opener(opener)

.. note::

    ``urllib.request`` supports fetching ``https`` locations through a proxy:
    the proxied connection is tunnelled using the HTTP ``CONNECT`` method, and
    the proxy itself is picked up from the :envvar:`https_proxy` environment
    variable or an ``https`` entry passed to ``ProxyHandler``.

.. note::

    ``HTTP_PROXY`` will be ignored if a variable ``REQUEST_METHOD`` is set; see
    the documentation on :func:`~urllib.request.getproxies`.


Sockets and Layers
==================

The Python support for fetching resources from the web is layered. urllib uses
the :mod:`http.client` library, which in turn uses the socket library.

You can specify how long a socket should wait for a response before timing
out. This can be useful in applications which have to fetch web pages. By
default the socket module has *no timeout* and can hang. The simplest approach
is to pass a *timeout* value (in seconds) to :func:`urllib.request.urlopen`
itself; alternatively, you can set the default timeout globally for all sockets
using ::

    import socket
    import urllib.request

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)
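If you would rather not change process-wide socket state, pass the timeout to
the individual call instead (a sketch; the URL is the same example host)::

    import urllib.request

    # if no response arrives within 10 seconds, the call fails with
    # urllib.error.URLError wrapping the underlying timeout error
    response = urllib.request.urlopen('http://www.voidspace.org.uk',
                                      timeout=10)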
-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] Google for example.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib from using
       the proxy.