4.3. Extracting data from HTML documents

To extract data from HTML documents, subclass the SGMLParser class and define methods for each tag or entity you want to capture.

The first step in extracting data from an HTML document is getting some HTML. If you have some HTML lying around on your hard drive, you can use file functions to read it, but the real fun begins when you get HTML from live web pages.

Example 4.5. Introducing `urllib`

>>> import urllib                                       
>>> sock = urllib.urlopen("http://diveintopython.org/") 
>>> htmlSource = sock.read()                            
>>> sock.close()                                        
>>> print htmlSource                                    
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head>
      <meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'>
   <title>Dive Into Python</title>
<link rel='stylesheet' href='diveintopython.css' type='text/css'>
<link rev='made' href='mailto:f8dy@diveintopython.org'>
<meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'>
<meta name='description' content='a free Python tutorial for experienced programmers'>
</head>
<body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'>
<table cellpadding='0' cellspacing='0' border='0' width='100%'>
<tr><td class='header' width='1%' valign='top'>diveintopython.org</td>
<td width='99%' align='right'><hr size='1' noshade></td></tr>
<tr><td class='tagline' colspan='2'>Python&nbsp;for&nbsp;experienced&nbsp;programmers</td></tr>

[...snip...]

	The `urllib` module is part of the standard Python library. It contains functions for getting information about Internet-based URLs (mainly web pages) and for retrieving data from them.
	The simplest use of `urllib` is to retrieve the entire text of a web page using the `urlopen` function. Opening a URL is similar to opening a file. The return value of `urlopen` is a file-like object, which has some of the same methods as a file object.
	The simplest thing to do with the file-like object returned by `urlopen` is `read`, which reads the entire HTML of the web page into a single string. The object also supports `readlines`, which reads the text line by line into a list.
	When you’re done with the object, make sure to `close` it, just like a normal file object.
	We now have the complete HTML of the home page of `http://diveintopython.org/` in a string, and we’re ready to parse it.
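If you are running Python 3, the same file-like behavior can be sketched with `urllib.request.urlopen`, which is where `urlopen` lives in that version. This is only an illustration under a few assumptions: the sample markup and the variable names are made up, and a `data:` URL stands in for a live web page so the example runs without a network connection.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Made-up sample page; a data: URL lets us "open" it without touching the network.
html = "<html><body>\n<a href='toc.html'>Contents</a>\n</body></html>\n"
url = "data:text/html," + quote(html)

# read() returns the whole document at once (as bytes in Python 3).
with urlopen(url) as sock:
    html_source = sock.read()

# readlines() returns the same content as a list, one entry per line.
with urlopen(url) as sock:
    lines = sock.readlines()

print(html_source.decode("ascii"))
print(len(lines))
```

The `with` statement closes the object for us, which does the same job as the explicit `close` call in the example above.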

Example 4.6. Introducing `urllister.py`


from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):                              
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):                     
        href = [v for k, v in attrs if k=='href']  
        if href:
            self.urls.extend(href)

	`reset` is called by the `__init__` method of `SGMLParser`, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization, do it in `reset`, not in `__init__`, so that the parser is re-initialized properly when someone re-uses a parser instance.
	`start_a` is called by `SGMLParser` whenever it finds an `<a>` tag. The tag may contain an `href` attribute, other attributes such as `name` or `title`, or both. The `attrs` parameter is a list of tuples, `[(attribute, value), (attribute, value), ...]`. The tag may also be a bare `<a>`, which is valid (if useless) HTML, in which case `attrs` is an empty list.
	We can find out whether this `<a>` tag has an `href` attribute with a simple multi-variable list comprehension.
	String comparisons like `k=='href'` are always case-sensitive, but that’s safe in this case, because `SGMLParser` converts attribute names to lowercase while building `attrs`.
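The `sgmllib` module was removed in Python 3, so the listing above only runs on Python 2. As a rough sketch of the same idea on Python 3, we can swap in `html.parser.HTMLParser`, a different standard library class that calls a single `handle_starttag` method for every tag instead of a per-tag `start_a`. The `attr_lists` attribute and the sample markup are made up for illustration; they are not part of `urllister.py`.

```python
from html.parser import HTMLParser

class URLLister(HTMLParser):
    def reset(self):
        # HTMLParser.__init__ calls reset, just as SGMLParser's does.
        HTMLParser.reset(self)
        self.urls = []
        self.attr_lists = []  # hypothetical extra, kept only to show what attrs looks like

    def handle_starttag(self, tag, attrs):
        # Called for every start tag, so we filter for <a> ourselves.
        if tag != "a":
            return
        self.attr_lists.append(attrs)
        href = [v for k, v in attrs if k == "href"]
        if href:
            self.urls.extend(href)

parser = URLLister()
parser.feed('<a HREF="toc.html" title="Contents">TOC</a> <a name="top"></a> <a>bare</a>')
parser.close()

print(parser.urls)        # ['toc.html']
print(parser.attr_lists)  # one list of (attribute, value) tuples per <a> tag
```

Note that the uppercase `HREF` in the sample still matches `k == "href"`, because `HTMLParser` also converts attribute names to lowercase, and that the bare `<a>` produces an empty list.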

Example 4.7. Using `urllister.py`

>>> import urllib, urllister
>>> usock = urllib.urlopen("http://diveintopython.org/")
>>> parser = urllister.URLLister()
>>> parser.feed(usock.read())         
>>> usock.close()                     
>>> parser.close()                    
>>> for url in parser.urls: print url 
toc.html
#download
toc.html
history.html
download/dip_pdf.zip
download/dip_pdf.tgz
download/dip_pdf.hqx
download/diveintopython.pdf
download/diveintopython.zip
download/diveintopython.tgz
download/diveintopython.hqx

[...snip...]

	Call the `feed` method, defined in `SGMLParser`, to get HTML into the parser.^[7] It takes a string, which is what `usock.read()` returns.
	As with files, you should `close` your URL objects as soon as you’re done with them.
	You should `close` your parser object, too, but for a different reason. The `feed` method isn’t guaranteed to process all the HTML you give it; it may buffer it, waiting for more. Once there isn’t any more, call `close` to flush the buffer and force everything to be fully parsed.
	Once the parser is `close`d, the parsing is complete, and `parser.urls` contains a list of all the linked URLs in the HTML document.
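To see the buffering behavior for ourselves, here is a small sketch that feeds a document in two pieces, splitting it in the middle of a tag. It assumes Python 3 and uses `html.parser.HTMLParser` as a stand-in for `SGMLParser` (which no longer ships with Python 3); the `LinkCollector` class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical Python 3 counterpart of URLLister, for illustration only."""
    def reset(self):
        HTMLParser.reset(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href")

parser = LinkCollector()

# The first chunk ends in the middle of the <a> tag, so the parser
# holds on to it and reports nothing yet.
parser.feed('<a href="toc.ht')
urls_after_first_chunk = list(parser.urls)

# The second chunk completes the tag; close() then flushes anything left over.
parser.feed('ml">Contents</a>')
parser.close()
urls_after_close = list(parser.urls)

print(urls_after_first_chunk)  # []
print(urls_after_close)        # ['toc.html']
```

This is why reading `parser.urls` before calling `close` can give an incomplete list when the HTML arrives in pieces.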

Footnotes

^[7] The technical term for a parser like `SGMLParser` is a consumer: it consumes HTML and breaks it down. Presumably, the name `feed` was chosen to fit into the whole “consumer” motif. Personally, it makes me think of an exhibit in the zoo where there’s just a dark cage with no trees or plants or evidence of life of any kind, but if you stand perfectly still and look really closely you can make out two beady eyes staring back at you from the far left corner, but you convince yourself that that’s just your mind playing tricks on you, and the only way you can tell that the whole thing isn’t just an empty cage is a small innocuous sign on the railing that reads, “Do not feed the parser.” But maybe that’s just me. In any event, it’s an interesting mental image.

Dive Into Python