4.2. Introducing sgmllib.py

4.2. Introducing `sgmllib.py`

HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by sgmllib.py, a part of the standard Python library.

The key to understanding this chapter is to realize that HTML is not just text, it is structured text. The structure is derived from the more-or-less-hierarchical sequence of start tags and end tags. Usually you don’t work with HTML this way; you work with it textually in a text editor, or visually in a web browser or web authoring tool. sgmllib.py presents HTML structurally.

sgmllib.py contains one important class: SGMLParser. SGMLParser parses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, it calls a method on itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these methods. This is what I meant when I said that it presents HTML structurally: the structure of the HTML determines the sequence of method calls and the arguments passed to each method.

SGMLParser parses HTML into 8 kinds of data, and calls a separate method for each of them:

Start tag: An HTML tag that starts a block, like <html>, <head>, <body>, or <pre>, or a standalone tag like <br> or <img>. When it finds a start tag tagname, SGMLParser will look for a method called start_tagname or do_tagname. For instance, when it finds a <pre> tag, it will look for a start_pre or do_pre method. If found, SGMLParser calls this method with a list of the tag’s attributes; otherwise, it calls unknown_starttag with the tag name and list of attributes.
End tag: An HTML tag that ends a block, like </html>, </head>, </body>, or </pre>. When it finds an end tag, SGMLParser will look for a method called end_tagname. If found, SGMLParser calls this method, otherwise it calls unknown_endtag with the tag name.
Character reference: An escaped character referenced by its decimal or hexadecimal equivalent, like  . When found, SGMLParser calls handle_charref with the text of the decimal or hexidecimal character equivalent.
Entity reference: An HTML entity, like ©. When found, SGMLParser calls handle_entityref with the name of the HTML entity.
Comment: An HTML comment, enclosed in . When found, SGMLParser calls handle_comment with the body of the comment.
Processing instruction: An HTML processing instruction, enclosed in <? ... >. When found, SGMLParser calls handle_pi with the body of the processing instruction.
Declaration: An HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When found, SGMLParser calls handle_decl with the body of the declaration.
Text data: A block of text. Anything that doesn’t fit into the other 7 categories. When found, SGMLParser calls handle_data with the text.


	Python 2.0 had a bug where `SGMLParser` would not recognize declarations at all (`handle_decl` would never be called), which meant that `DOCTYPE`s were silently ignored. This is fixed in Python 2.1.

sgmllib.py comes with a test suite to illustrate this. You can run sgmllib.py, passing the name of an HTML file on the command line, and it will print out the tags and other elements as it parses them. It does this by subclassing the SGMLParser class and defining unknown_starttag, unknown_endtag, handle_data and other methods which simply print their arguments.


	In the Python IDE on Windows, you can specify command line arguments in the “Run script” dialog. Separate multiple arguments with spaces.

Example 4.4. Sample test of `sgmllib.py`

Here is a snippet from the table of contents of the HTML version of this book, toc.html.

<h1>
  <a name='c40a'></a>
  Dive Into Python
</h1>
<p class='pubdate'>
  28 Feb 2001
</p>
<p class='copyright'>
  Copyright copy 2000, 2001 by 
  <a href='mailto:f8dy@diveintopython.org' title='send e-mail to the author'>
    Mark Pilgrim
  </a>
</p>
<p>
  <a name='c40ab2b4'></a>
  <b></b>
</p>
<p>
  This book lives at 
  <a href='http://diveintopython.org/'>
    http://diveintopython.org/
  </a>
  .
  If you’re reading it somewhere else, you may not have the latest version.
</p>

Running this through the test suite of sgmllib.py yields this output:

start tag: <h1>
start tag: <a name="c40a" >
end tag: </a>
data: 'Dive Into Python'
end tag: </h1>
start tag: <p class="pubdate" >
data: '28 Feb 2001'
end tag: </p>
start tag: <p class="copyright" >
data: 'Copyright '
*** unknown entity ref: &copy;
data: ' 2000, 2001 by '
start tag: <a href="mailto:f8dy@diveintopython.org" title="send e-mail to the author" >
data: 'Mark Pilgrim'
end tag: </a>
end tag: </p>
start tag: <p>
start tag: <a name="c40ab2b4" >
end tag: </a>
start tag: <b>
end tag: </b>
end tag: </p>
start tag: <p>
data: 'This book lives at '
start tag: <a href="http://diveintopython.org/" >
data: 'http://diveintopython.org/'
end tag: </a>
data: ".\012If you’re reading it somewhere else, you may not have the lates"
data: 't version.\012'
end tag: </p>

Here’s the roadmap for the rest of the chapter:

Subclass SGMLParser to create classes that extract interesting data out of HTML documents.
Subclass SGMLParser to create BaseHTMLProcessor, which overrides all 8 handler methods and uses them to reconstruct the original HTML from the pieces.
Subclass BaseHTMLProcessor to create Dialectizer, which adds some methods to process specific HTML tags specially, and overrides the handle_data method to provide a framework for processing the text blocks between the HTML tags.
Subclass Dialectizer to create classes that define text processing rules used by Dialectizer.handle_data.
Write a test suite that grabs a real web page from http://diveintopython.org/ and processes it.

Dive Into Python