Python for experienced programmers
HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by sgmllib.py, a part of the standard Python library.
The key to understanding this chapter is to realize that HTML is not just text; it is structured text. The structure is derived from the more-or-less-hierarchical sequence of start tags and end tags. Usually you don’t work with HTML this way; you work with it textually in a text editor, or visually in a web browser or web authoring tool. sgmllib.py presents HTML structurally.
sgmllib.py contains one important class: SGMLParser. SGMLParser parses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, it calls a method on itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these methods. This is what I meant when I said that it presents HTML structurally: the structure of the HTML determines the sequence of method calls and the arguments passed to each method.
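The subclass-and-override pattern described above can be sketched as follows. Note the assumptions: sgmllib was removed from the standard library in Python 3, so this sketch swaps in html.parser.HTMLParser, which follows the same pattern but names its tag callbacks handle_starttag and handle_endtag. EventRecorder and record_events are hypothetical names chosen for illustration, not part of either module.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: log each callback in the order the parser fires it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def record_events(html):
    """Feed a complete HTML string to a fresh parser and return its event log."""
    parser = EventRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


events = record_events(
    "<p>This book lives at <a href='http://diveintopython.org/'>here</a></p>"
)
for event in events:
    print(event)
```

The nesting of the tags in the input is what determines the order of the recorded events: the a element's start, data and end all arrive between the start and end of the enclosing p.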
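SGMLParser parses HTML into eight kinds of data and calls a separate method for each of them: start tags (start_tagname or do_tagname if you define one, otherwise unknown_starttag), end tags (end_tagname if you define one, otherwise unknown_endtag), character references such as &#160; (handle_charref), entity references such as &copy; (handle_entityref), comments (handle_comment), processing instructions (handle_pi), declarations such as a DOCTYPE (handle_decl), and text data (handle_data).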
Note: Python 2.0 had a bug where SGMLParser would not recognize declarations at all (handle_decl would never be called), which meant that DOCTYPEs were silently ignored. This is fixed in Python 2.1.
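One way to watch all eight kinds of data arrive is a parser that records which callback fired. This is a sketch under assumptions, not sgmllib itself: it uses Python 3's html.parser.HTMLParser as a stand-in, with convert_charrefs=False so that character and entity references are reported through handle_charref and handle_entityref instead of being folded into the text. KindRecorder and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class KindRecorder(HTMLParser):
    """Hypothetical helper: note which kind of data each callback reports."""

    def __init__(self):
        # convert_charrefs=False keeps &copy; and &#160; as separate events
        super().__init__(convert_charrefs=False)
        self.seen = []

    def handle_decl(self, decl):
        self.seen.append(("declaration", decl))

    def handle_pi(self, data):
        self.seen.append(("processing instruction", data))

    def handle_comment(self, data):
        self.seen.append(("comment", data))

    def handle_starttag(self, tag, attrs):
        self.seen.append(("start tag", tag))

    def handle_endtag(self, tag):
        self.seen.append(("end tag", tag))

    def handle_entityref(self, name):
        self.seen.append(("entity reference", name))

    def handle_charref(self, name):
        self.seen.append(("character reference", name))

    def handle_data(self, data):
        self.seen.append(("text data", data))


# Made-up input that contains one piece of each kind
sample = "<!DOCTYPE html><?pi x><!-- note --><p>&copy;&#160;ok</p>"

recorder = KindRecorder()
recorder.feed(sample)
recorder.close()

kinds = [kind for kind, _ in recorder.seen]
for kind, payload in recorder.seen:
    print(kind, repr(payload))
```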
sgmllib.py comes with a test suite to illustrate this. You can run sgmllib.py, passing the name of an HTML file on the command line, and it will print out the tags and other elements as it parses them. It does this by subclassing the SGMLParser class and defining unknown_starttag, unknown_endtag, handle_data and other methods which simply print their arguments.
Note: In the Python IDE on Windows, you can specify command line arguments in the “Run script” dialog. Separate multiple arguments with spaces.
Here is a snippet from the table of contents of the HTML version of this book, toc.html.
    <h1>
    <a name='c40a'></a>
    Dive Into Python
    </h1>
    <p class='pubdate'>
    28 Feb 2001
    </p>
    <p class='copyright'>
    Copyright &copy; 2000, 2001 by
    <a href='mailto:firstname.lastname@example.org' title='send e-mail to the author'>
    Mark Pilgrim
    </a>
    </p>
    <p>
    <a name='c40ab2b4'></a>
    <b></b>
    </p>
    <p>
    This book lives at
    <a href='http://diveintopython.org/'>
    http://diveintopython.org/
    </a>
    .
    If you’re reading it somewhere else, you may not have the latest version.
    </p>
Running this through the test suite of sgmllib.py yields this output:
    start tag: <h1>
    start tag: <a name="c40a" >
    end tag: </a>
    data: 'Dive Into Python'
    end tag: </h1>
    start tag: <p class="pubdate" >
    data: '28 Feb 2001'
    end tag: </p>
    start tag: <p class="copyright" >
    data: 'Copyright '
    *** unknown entity ref: &copy;
    data: ' 2000, 2001 by '
    start tag: <a href="mailto:email@example.com" title="send e-mail to the author" >
    data: 'Mark Pilgrim'
    end tag: </a>
    end tag: </p>
    start tag: <p>
    start tag: <a name="c40ab2b4" >
    end tag: </a>
    start tag: <b>
    end tag: </b>
    end tag: </p>
    start tag: <p>
    data: 'This book lives at '
    start tag: <a href="http://diveintopython.org/" >
    data: 'http://diveintopython.org/'
    end tag: </a>
    data: ".\012If you’re reading it somewhere else, you may not have the lates"
    data: 't version.\012'
    end tag: </p>
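Output in roughly this shape can be approximated with a small printing subclass. This is only a sketch, assuming Python 3's html.parser.HTMLParser in place of sgmllib (which Python 3 no longer ships); it is not the test code from sgmllib.py, and the TagPrinter name and its line format are my own imitation of the listing above.

```python
from html.parser import HTMLParser


class TagPrinter(HTMLParser):
    """Hypothetical helper: describe each tag and text run on its own line."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if attrs:
            # Render each (name, value) pair as name="value"
            pairs = "".join(' %s="%s"' % (name, value) for name, value in attrs)
            self.lines.append("start tag: <%s%s >" % (tag, pairs))
        else:
            self.lines.append("start tag: <%s>" % tag)

    def handle_endtag(self, tag):
        self.lines.append("end tag: </%s>" % tag)

    def handle_data(self, data):
        # repr() shows quotes and escapes, as the listing above does
        self.lines.append("data: %r" % data)


printer = TagPrinter()
printer.feed("<h1><a name='c40a'></a>Dive Into Python</h1>")
printer.close()

lines = printer.lines
print("\n".join(lines))
```

Fed the opening h1 of the snippet with the whitespace between tags removed, this yields the same first five lines as the listing above.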
Here’s the roadmap for the rest of the chapter: